Error during processing of HEIC input files #2930

Open
robskrob opened this issue Mar 22, 2020 · 25 comments

Comments

@robskrob

robskrob commented Mar 22, 2020

Environment

  • Tesseract Version: 4.1.1
  • Platform: macOS Catalina 10.15

Current Behavior:

Whenever I execute $ tesseract images/IMG_3958.HEIC output/grocery_bill I get this error:

$ tesseract images/IMG_3958.HEIC output/grocery_bill
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.

Expected Behavior:

I would expect tesseract to output the text from the grocery bill into the output file output/grocery_bill.

Is there something wrong with processing HEIC images? Also, is there a location where I can tail the logs to see if I can get a richer description of the error?

Here's more information about the tesseract program that I installed with Homebrew:

$ tesseract -v
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Also attached please find the image I had tesseract process.
IMG_3958.HEIC.zip

@stweil
Contributor

stweil commented Mar 22, 2020

You closed that, so what was the solution?

@robskrob
Author

@stweil I apologize for closing without an explanation.

The problem is that the image is not a JPG. It is a HEIC. When I take a photo with my iPhone, upload it to Google Drive, and then download it onto my machine, the image is saved as a HEIC file, which to be honest is a new file extension to me.

So the "solution" for me was to convert the file on my machine from HEIC to JPG. Once I did this tesseract had no problem processing the image.

I suppose then my original issue is still a bug -- except there's nothing wrong with the JPG file. It's the HEIC file:

$ tesseract images/IMG_3958.HEIC output/grocery_bill
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Error during processing.

Attached please find the HEIC file

IMG_3958.HEIC.zip

robskrob reopened this Mar 22, 2020
robskrob changed the title from "Error during processing of JPG -- macOS Catalina 10.15" to "Error during processing of HEIC -- macOS Catalina 10.15" Mar 22, 2020
@stweil
Contributor

stweil commented Mar 22, 2020

Thanks. So the error message should be improved and report that the input file could not be read. That is not specific for macOS, therefore I changed the title.

stweil changed the title from "Error during processing of HEIC -- macOS Catalina 10.15" to "Error during processing of HEIC input files" Mar 22, 2020
@robskrob
Author

robskrob commented Mar 22, 2020

Yeah I wanted to tail the logs of my tesseract process so I could potentially learn more about what was going on. I definitely think the error message could be improved. And yes, there's something about the input file that appears to not be readable. And yes, changing the title of the issue makes sense.

@stweil
Contributor

stweil commented Mar 22, 2020

See the Wikipedia article on HEIC or HEIF. Maybe Leptonica can be extended to support that new format (so Tesseract would support it, too). There exists a C library libheif which might be used. @DanBloomberg, would that be interesting?

@stweil
Contributor

stweil commented Mar 22, 2020

It looks like the format is covered by patents. Debian provides libheif nevertheless, and GIMP supports it. I don't know whether its use in Leptonica and Tesseract would be a problem.

2021-12-04: libheif uses the GNU Lesser General Public License which should be compatible with Leptonica. See https://github.com/strukturag/libheif/.

@DanBloomberg

I've asked an expert on coding about this format, and will report here when I hear back.

For leptonica to support it is a high bar, because support is a serious commitment and invariably a lot of work. That work will include passing very intensive fuzzing, both internally at Google and externally by oss-fuzz on github.

@DanBloomberg

Each new format supported demands a big cost in development and support effort.

As mentioned on leptonica#546, jpeg-xl seems more interesting: both easier to support and having compatibility with jpeg.

@amitdo
Collaborator

amitdo commented Dec 28, 2020

Dan, I understand your view.

The thing is, these two formats (HEIC and AVIF) are becoming quite popular. HEIC is the default file format for photos in iOS. Recently, Firefox followed Chrome's lead, and it now supports AVIF by default (previously users had to enable it manually).

So, since you do not want to support these formats in your software, maybe we, Tesseract devs, should discuss whether we want to support them ourselves.

If we decide to support them, we will convert them to Leptonica's pix.

To be honest, this was mainly directed toward @stweil, hoping that he would want this enough to implement it... :-)

@DanBloomberg

Supporting AVIF and HEIC within tesseract is certainly an option. As I said, my experience with the older I/O libraries is that it's a lot of work. webp has been easier, because there are not a lot of options and the implementation is more "modern", with the basic encoder and decoder going between memory and not file streams (or, worse, with tiff, unix file descriptors). With the fuzzers that have been made for leptonica, both internally in Google and now externally with the oss-fuzz project, maintenance work has increased considerably.

But overall, maintenance on the I/O libraries has been a significant time-sink. I'm glad to have been able to relieve Ray Smith and tesseract of that burden.

I can't promise anything about jpeg-xl, but it does seem to be something I should look into.

@amitdo
Collaborator

amitdo commented Dec 28, 2020

> I'm glad to have been able to relieve Ray Smith and tesseract of that burden.

Last time we saw Ray here was 2 years ago. I don't know if he plans to return to contribute to this project.

@amitdo
Collaborator

amitdo commented Jun 13, 2022

> So, since you do not want to support these formats in your software, maybe we, Tesseract devs, should discuss whether we want to support them ourselves.

Answering myself: We should not do it.

@amitdo
Collaborator

amitdo commented Jun 13, 2022

Still, a proper error message for unsupported image formats is desired.

@DanBloomberg

DanBloomberg commented Jun 13, 2022

pixReadStream() emits the message:
Unknown format: no pix returned
if the format is not supported for reading.

I believe that tesseract should not rely on leptonica error messages -- you might not even emit them by default.
Instead, tesseract should have its own error message if the image file can't be read to a pix.
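
One way a caller could build such a message is sketched below in C++, without using Leptonica: peek at the first bytes of the file and name the container if it is an ISO base media file such as HEIC or AVIF. The function names and message wording are hypothetical, and the brand lists are an assumption covering common HEIF and AVIF brands, not an exhaustive set. Inside Tesseract the natural trigger for such a message would be a null result when reading the image into a pix.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns the 4-character major brand of an ISO base media file
// (bytes 8..11, after the 4-byte box size and the "ftyp" tag),
// or an empty string if the header does not look like one.
std::string SniffMajorBrand(const std::string& head) {
  if (head.size() < 12 || head.compare(4, 4, "ftyp") != 0) {
    return "";
  }
  std::string brand = head.substr(8, 4);
  assert(brand.size() == 4);
  return brand;
}

// Hypothetical helper: builds a message for input that could not be decoded.
std::string DescribeUnreadableInput(const std::string& filename,
                                    const std::string& head) {
  const std::string brand = SniffMajorBrand(head);
  // Assumed brand lists; real files may carry other brands.
  if (brand == "heic" || brand == "heix" || brand == "mif1" || brand == "msf1") {
    return "Error: " + filename +
           " looks like a HEIF/HEIC file, which is not a supported input format.";
  }
  if (brand == "avif" || brand == "avis") {
    return "Error: " + filename +
           " looks like an AVIF file, which is not a supported input format.";
  }
  return "Error: " + filename + " could not be read as an image.";
}

// Reads up to 12 bytes from the start of a file; returns fewer (possibly
// none) if the file is shorter or cannot be opened.
std::string ReadHead(const std::string& filename) {
  std::ifstream in(filename, std::ios::binary);
  std::string head(12, '\0');
  in.read(&head[0], static_cast<std::streamsize>(head.size()));
  head.resize(static_cast<std::string::size_type>(in.gcount()));
  return head;
}
```

A caller would use it as `DescribeUnreadableInput(name, ReadHead(name))` after the normal image read has failed, so the sniffing never replaces the real decoder.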

@zdenop
Contributor

zdenop commented Jun 13, 2022

The funny part is that the OCR process for this image fails here:

if (lines.empty()) {
  return false;
}

Because when the file format is unknown (IFF_UNKNOWN), the tesseract API(???) assumes the input is a file list:

tesseract/src/api/baseapi.cpp

Lines 1201 to 1210 in b5878c2

// Maybe we have a filelist
if (r != 0 || format == IFF_UNKNOWN) {
  std::string s;
  if (data != nullptr) {
    s = buf.c_str();
  } else {
    std::ifstream t(filename);
    std::string u((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());
    s = u.c_str();
  }

I think we should remove this "guess what the input is" logic from the OCR API...
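
A small C++ sketch of why the file-list fallback can end in `lines.empty()` for HEIC input. It assumes the file starts with a 4-byte big-endian box size, as ISO base media files do, so the first byte is NUL and a copy made through `c_str()`, as in the excerpt, comes out empty. The header bytes, the helper names, and the stricter `LooksLikeFileList` check are illustrative assumptions, not Tesseract code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Mimics the excerpt's copy: assigning from c_str() stops at the first NUL.
std::string CopyAsCString(const std::string& raw) {
  std::string s;
  s = raw.c_str();
  assert(s.size() <= raw.size());
  return s;
}

// Splits text into non-empty lines, as a file-list reader might.
std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  std::istringstream in(text);
  for (std::string line; std::getline(in, line);) {
    if (!line.empty()) {
      lines.push_back(line);
    }
  }
  return lines;
}

// Hypothetical stricter check: treat input as a file list only if it
// contains no NUL bytes and has at least one non-empty line.
bool LooksLikeFileList(const std::string& raw) {
  if (raw.find('\0') != std::string::npos) {
    return false;
  }
  return !SplitLines(raw).empty();
}
```

With a check like `LooksLikeFileList`, binary input of an unknown format could be reported as unsupported instead of being tried as a list and failing silently.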

@vid-bin

vid-bin commented Oct 24, 2023

Is there any update on this? My entire photo library is HEIC (including screenshots). I’ve been trying to get OCR working with HEIC on fscrawler and Nextcloud with no success.

OCR works on my Mac because Apple has support for it in their system, but I want to get this working on my NAS.

The only solution would be to convert my library to JPEG, which I do not want to do.

@amitdo
Collaborator

amitdo commented Oct 24, 2023

We don't plan to support the HEIC format.

@stweil
Contributor

stweil commented Oct 24, 2023

No, there isn't any update, neither for Leptonica nor for Tesseract. Currently the only solution is to convert from HEIC to JPEG, so Tesseract can process the JPEG file.

@vid-bin

vid-bin commented Oct 24, 2023

Does tesseract support AVIF or jpeg-xl then? I don’t want to convert to jpeg because I’ll lose HDR on my photos. The converted images also take significantly more storage space.

Ideally the applications indexing the files would convert to jpeg on the fly for tesseract and then delete the temporary file when done but the ones I’m trying to use do not do this.
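
The convert-then-clean-up workflow described here could be sketched as a small C++ wrapper. It assumes a POSIX shell and an external converter on the PATH; `heif-convert` (one of libheif's example tools) is used only as an example and may be named differently or missing on a given system. The `ScopedTempFile`, `ShellQuote`, and `OcrHeic` names are hypothetical. The original HEIC file is left untouched, and the temporary PNG exists only while OCR runs.

```cpp
#include <cassert>
#include <cstdlib>
#include <filesystem>
#include <string>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Deletes the file at `path` when the object goes out of scope.
class ScopedTempFile {
 public:
  explicit ScopedTempFile(fs::path path) : path_(std::move(path)) {}
  ~ScopedTempFile() {
    std::error_code ec;
    fs::remove(path_, ec);  // Errors ignored: the file may never have been created.
  }
  ScopedTempFile(const ScopedTempFile&) = delete;
  ScopedTempFile& operator=(const ScopedTempFile&) = delete;
  const fs::path& path() const { return path_; }

 private:
  fs::path path_;
};

// Wraps a string in single quotes for a POSIX shell.
std::string ShellQuote(const std::string& arg) {
  std::string out = "'";
  for (char c : arg) {
    if (c == '\'') {
      out += "'\\''";  // Close the quote, emit an escaped quote, reopen.
    } else {
      out += c;
    }
  }
  out += "'";
  return out;
}

// Converts `heic` to a temporary PNG, runs tesseract on it, and removes the
// temporary file. Returns true if both commands exit with status 0.
// Sketch only: the temp name is derived from the file stem, so concurrent
// runs on files with the same stem would collide.
bool OcrHeic(const fs::path& heic, const fs::path& output_base) {
  assert(!heic.empty());
  ScopedTempFile tmp(fs::temp_directory_path() /
                     (heic.stem().string() + ".ocr-tmp.png"));
  const std::string convert = "heif-convert " + ShellQuote(heic.string()) +
                              " " + ShellQuote(tmp.path().string());
  if (std::system(convert.c_str()) != 0) {
    return false;
  }
  const std::string ocr = "tesseract " + ShellQuote(tmp.path().string()) +
                          " " + ShellQuote(output_base.string());
  return std::system(ocr.c_str()) == 0;
}
```

A call such as `OcrHeic("images/IMG_3958.HEIC", "output/grocery_bill")` would mirror the command from the original report, with the conversion step inserted in front.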

@DanBloomberg

We've looked a couple of times at jpeg-xl. There is a very high bar to cross before deciding to support a new format with leptonica, and the reasons for this have been described several times. AVIF doesn't meet it. Initially, jpeg-xl was an interesting possibility because it was supported by Google, it is a significantly more efficient encoder than jpeg, and it is designed to have very good compatibility with the jpeg library. All these things meant that there was some likelihood of widespread adoption within a few years, essentially as a replacement for jpeg, and the work of supporting it in leptonica would be less than that for a completely new format.

Just a reminder: part of the work to support any new format is to build and run fuzzers for months, and to harden it not to crash or hang for any possible input.

As of 2022, however, it was evident that Google was not supporting it.
For details why it is no longer of interest, see:
DanBloomberg/leptonica#692 (comment)

@amitdo
Collaborator

amitdo commented Oct 25, 2023

@DanBloomberg

DanBloomberg commented Oct 25, 2023

For the tesseract input formats page, you might clarify that reading animated webp (a-webp) is not supported -- only writing.

I lost interest in fully supporting a-webp when I found that Google had so little interest in supporting their own format (which is far superior in compression to animated gif) that even with several billion gmail accounts, they didn't let you view an a-webp attachment! Only the first frame is displayed. Nevertheless, a-webp is supported in browsers; see the Google webp faq: https://developers.google.com/speed/webp/faq

@amitdo
Collaborator

amitdo commented Oct 25, 2023

Hi Dan,

Thanks for the info. Is the description on that page in the 'Animated GIF' section also correct for animated WebP?

@DanBloomberg

No, leptonica cannot read an animated webp file. It does not return the first image.

Here are the error messages:

Error in pixReadMemWebP: WebP decode failed
Error in pixReadStream: webp: no pix returned
