Error during processing of HEIC input files #2930
Comments
You closed that, so what was the solution? |
@stweil I apologize for closing without an explanation. The problem is that the image is not a JPG. It is a HEIC. When I take a photo with my iPhone, upload it to Google Drive and then download it onto my machine, the image is saved as a HEIC file, which to be honest is a new file extension to me. So the "solution" for me was to convert the file on my machine from HEIC to JPG. Having done this, I suppose my original issue is still a bug, except that there's nothing wrong with the JPG file. It's the HEIC file:
Attached please find the HEIC file |
Thanks. So the error message should be improved and report that the input file could not be read. That is not specific for macOS, therefore I changed the title. |
Yeah I wanted to tail the logs of my |
See the Wikipedia article on HEIC or HEIF. Maybe Leptonica can be extended to support that new format (so Tesseract would support it, too). There exists a C library libheif which might be used. @DanBloomberg, would that be interesting? |
It looks like the format is covered by patents. Debian provides 2021-12-04: |
I've asked an expert on coding about this format, and will report here when I hear back. For leptonica to support it is a high bar, because support is a serious commitment and invariably a lot of work. That work will include passing very intensive fuzzing, both internally at Google and externally by oss-fuzz on github. |
Maybe Leptonica/Tesseract can support gdk-pixbuf, which will bring indirect support for heic and avif.
|
Each new format supported demands a big cost in development and support effort. As mentioned on leptonica#546, jpeg-xl seems more interesting: both easier to support and having compatibility with jpeg. |
Dan, I understand your view. The thing is, these two formats (HEIC and AVIF) are becoming quite popular. HEIC is the default file format for photos in iOS. Recently, Firefox followed Chrome's lead, and it now supports AVIF by default (previously users had to enable it manually). So, since you do not want to support these formats in your software, maybe we, the Tesseract devs, should discuss whether we want to support them ourselves. If we decide to support them, we will convert them to Leptonica's pix. To be honest, this was mainly directed toward @stweil, hoping that he would want this enough to implement it... :-) |
Supporting AVIF and HEIC within tesseract is certainly an option. As I said, my experience with the older I/O libraries is that it's a lot of work. webp has been easier, because there are not a lot of options and the implementation is more "modern", with the basic encoder and decoder going between memory and not file streams (or, worse, with tiff, unix file descriptors). With the fuzzers that have been made for leptonica, both internally in Google and now externally with the oss-fuzz project, maintenance work has increased considerably. But overall, maintenance on the I/O libraries has been a significant time-sink. I'm glad to have been able to relieve Ray Smith and tesseract of that burden. I can't promise anything about jpeg-xl, but it does seem to be something I should look into. |
Last time we saw Ray here was 2 years ago. I don't know if he plans to return to contribute to this project. |
Answering myself: We should not do it. |
Still, a proper error message for unsupported image formats is desired. |
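One way a caller could produce such a message without relying on Leptonica's output is to look at the file header before handing the file to Tesseract. The sketch below is illustrative only and is not part of Tesseract or Leptonica. It assumes the usual ISO BMFF layout, in which bytes 4 to 7 are `ftyp` and bytes 8 to 11 hold the major brand. The brand sets and the function names `sniff_unsupported_format` and `check_input` are hypothetical, and the brand lists are not exhaustive.

```python
from typing import Optional

# Major brands commonly seen in HEIF/HEIC and AVIF files.
# Assumption: this list is illustrative, not exhaustive.
HEIF_BRANDS = {b"heic", b"heix", b"hevc", b"hevx", b"mif1", b"msf1"}
AVIF_BRANDS = {b"avif", b"avis"}


def sniff_unsupported_format(header: bytes) -> Optional[str]:
    """Return a format name if the header looks like HEIF or AVIF, else None."""
    # An ISO BMFF file starts with a box: 4-byte size, the type "ftyp",
    # then the 4-byte major brand.
    if len(header) < 12 or header[4:8] != b"ftyp":
        return None
    brand = header[8:12]
    if brand in HEIF_BRANDS:
        return "HEIF/HEIC"
    if brand in AVIF_BRANDS:
        return "AVIF"
    return None


def check_input(path: str) -> None:
    """Raise a descriptive error for a known-unsupported input file."""
    with open(path, "rb") as f:
        kind = sniff_unsupported_format(f.read(12))
    if kind is not None:
        raise ValueError(
            f"{path}: {kind} input is not supported; "
            "convert it to JPEG, PNG or TIFF first"
        )
```

Calling `check_input("images/IMG_3958.HEIC")` before running OCR would then fail early with a message that names the format, instead of a generic read error.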
pixReadStream() emits the message: I believe that tesseract should not rely on leptonica error messages -- you might not even emit them by default. |
The funny part is that OCR process of this image fails here: Lines 974 to 976 in b5878c2
Because when the file format is unknown ( Lines 1201 to 1210 in b5878c2
I think we should move this "guess what the input is" logic out of the OCR API... |
Is there any update on this? My entire photo library is HEIC (including screenshots). I’ve been trying to get OCR working with HEIC on fscrawler and Nextcloud with no success. OCR works on my Mac because Apple has support for it in their system, but I want to get this working on my NAS. The only solution would be to convert my library to JPEG, which I do not want to do. |
We don't plan to support the HEIC format. |
No, there isn't any update, neither for Leptonica nor for Tesseract. The only solution is currently to convert from HEIC to JPEG, so Tesseract can process the JPEG file. |
Does tesseract support AVIF or jpeg-xl then? I don’t want to convert to jpeg because I’ll lose HDR on my photos. The storage space is also significantly higher when converting my images. Ideally the applications indexing the files would convert to jpeg on the fly for tesseract and then delete the temporary file when done but the ones I’m trying to use do not do this. |
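The convert-on-the-fly idea described above can be sketched as a small wrapper that keeps the original HEIC untouched. This is a minimal sketch, not something Tesseract, fscrawler or Nextcloud provide. It assumes the `tesseract` CLI and libheif's example converter `heif-convert` are on the PATH (the converter's name may differ between libheif versions). The function names are hypothetical, and the converter and OCR step are passed in as parameters so either can be swapped.

```python
import os
import subprocess
import tempfile
from typing import Callable


def convert_with_heif_convert(src: str, dst: str) -> None:
    # Assumption: libheif's example tool `heif-convert` is installed.
    subprocess.run(["heif-convert", src, dst], check=True)


def ocr_with_tesseract(image: str) -> str:
    # Using `stdout` as the output base makes tesseract print the text.
    result = subprocess.run(
        ["tesseract", image, "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_via_temp_file(
    src: str,
    convert: Callable[[str, str], None] = convert_with_heif_convert,
    ocr: Callable[[str], str] = ocr_with_tesseract,
    suffix: str = ".jpg",
) -> str:
    """Convert src to a temporary image, OCR it, then delete the temp file."""
    fd, tmp = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # the converter writes to the path itself
    try:
        convert(src, tmp)
        return ocr(tmp)
    finally:
        # Remove the temporary file even if conversion or OCR failed.
        if os.path.exists(tmp):
            os.remove(tmp)
```

A call such as `ocr_via_temp_file("images/IMG_3958.HEIC")` returns the recognized text and leaves no converted copy behind, so the library stays in HEIC and no extra storage is used permanently.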
We've looked a couple of times at jpeg-xl. There is a very high bar to cross before deciding to support a new format with leptonica, and the reasons for this have been described several times. AVIF doesn't meet it. Initially, jpeg-xl was an interesting possibility because it was supported by Google, it is a significantly more efficient encoder than jpeg, and it is designed to have very good compatibility with the jpeg library. All these things meant that there was some likelihood of widespread adoption within a few years, essentially as a replacement for jpeg, and the work of supporting it in leptonica would be less than that for a completely new format. Just a reminder: part of the work to support any new format is to build and run fuzzers for months, and to harden it not to crash or hang for any possible input. As of 2022, however, it was evident that Google was not supporting it. |
For the tesseract input formats page, you might clarify that reading animated webp (a-webp) is not supported -- only writing. I lost interest in fully supporting a-webp when I found that Google had so little interest in supporting their own format (which is far superior in compression to animated gif) that even with several billion gmail accounts, they didn't let you view an a-webp attachment! Only the first frame is displayed. Nevertheless, a-webp is supported in browsers; see the Google webp faq: https://developers.google.com/speed/webp/faq |
Hi Dan, Thanks for the info. Is the description on that page in the 'Animated GIF' section also correct for animated WebP? |
No, leptonica can not read a webp anim file. It does not return the first image. Here are the error messages:
|
Environment
Current Behavior:
Whenever I execute
$ tesseract images/IMG_3958.HEIC output/grocery_bill
I get this error:
Expected Behavior:
I would expect tesseract to output the text from the grocery bill into the output file output/grocery_bill.
Is there something wrong with processing HEIC images? Also, is there a location where I can tail the logs to see if I can get a richer description of the error?
Here's more information about the tesseract program that I installed with Homebrew:
Also attached please find the image I had tesseract process.
IMG_3958.HEIC.zip