Escimages: layout a bit off when combining .pbm #36

Open
KlustoR opened this issue Jun 29, 2017 · 6 comments

Comments

@KlustoR
Contributor

KlustoR commented Jun 29, 2017

Hi,

I have an example where the output differs a bit from the input.
The receipt comes out a little too short, as if a blank line is missing in between.
I have attached the .bin and combined .pbm files, and you should be able to see that under "soda" the dashes ---- sit a little too high, so tesseract has trouble with that line.
Is there any way to avoid this?
Tesseract output
.bin and .pbm files

@mike42
Contributor

mike42 commented Jul 3, 2017

I think that escimages is behaving correctly here, as the spacing is not part of the images. I ran the input through esc2html; the picture shows me selecting the whitespace in the browser, and this should be closer to the actual print:

image

There are a few approaches you could use here:

  • OCR images separately
  • add space when the images are stitched together (not sure how to do this with ImageMagick, but that would be the right place to do it)
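As a rough illustration of the second approach, here is a sketch in plain Python (not ImageMagick) that stacks raw (P4) PBM images vertically and inserts blank rows between them. The function names and the `gap_px` parameter are made up for this example and are not part of escimages. It assumes PBM headers without comment lines and zeroed padding bits.

```python
import re

def read_pbm(data: bytes):
    """Parse a raw (P4) PBM into (width, height, rows of packed bytes)."""
    m = re.match(rb"P4\s+(\d+)\s+(\d+)\s", data)
    if m is None:
        raise ValueError("expected a raw PBM (P4) without header comments")
    width, height = int(m.group(1)), int(m.group(2))
    stride = (width + 7) // 8  # each row is padded to a whole byte
    body = data[m.end():]
    rows = [body[i * stride:(i + 1) * stride] for i in range(height)]
    return width, height, rows

def stack_with_gap(images, gap_px):
    """Stack P4 images top to bottom with gap_px white rows in between."""
    parsed = [read_pbm(d) for d in images]
    width = max(w for w, _, _ in parsed)
    stride = (width + 7) // 8
    blank = bytes(stride)  # in PBM a 0 bit is white
    out = []
    for idx, (_, _, rows) in enumerate(parsed):
        if idx > 0:
            out.extend([blank] * gap_px)
        # Narrower images are padded with white on the right.
        out.extend(r.ljust(stride, b"\x00") for r in rows)
    return b"P4\n%d %d\n" % (width, len(out)) + b"".join(out)
```

The gap height is a guess the caller has to make; the thread later mentions 24px per text line, which might be a reasonable starting value.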

Long term, we need a reliable way to convert receipt files to raster images. HTML has the most accurate output we get right now, so wkhtmltoimage is a real option.

@A-Rotsaert
Contributor

I think I've found a solution. After your reply I googled a bit and found this topic, https://superuser.com/a/290679, which suggests using +append instead of -append with ImageMagick's convert. I will try that soon and let you know.

@KlustoR
Contributor Author

KlustoR commented Jul 3, 2017

I've tried changing -append to +append, but the results were not as expected.
I've also tried using wkhtmltoimage, but it doesn't run headless (which is a must for my needs).
Then I tried using xvfb, but that resulted in a low-quality picture.

Also, if I OCR the images separately, it becomes too slow for my purpose.
Any ideas? :)

@KlustoR
Contributor Author

KlustoR commented Jul 4, 2017

I went through the source code a bit and found a possible solution:
check $cmd -> isAvailableAs('LineBreak'), then increase the height of the corresponding image and add the required 8 bytes of null as needed.
Is this something that could work?
I don't know how to edit the height of a .pbm image, though.
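For what it's worth, changing the height of a raw (P4) .pbm mostly comes down to rewriting the header and appending zero bytes, since a 0 bit is white in that format. The sketch below shows one way to do it in Python. `pad_pbm_height` is a hypothetical helper, not something from the escimages source, and it assumes a header without comment lines.

```python
import re

def pad_pbm_height(data: bytes, extra_rows: int) -> bytes:
    """Return a copy of a raw (P4) PBM with extra white rows at the bottom."""
    m = re.match(rb"P4\s+(\d+)\s+(\d+)\s", data)
    if m is None:
        raise ValueError("expected a raw PBM (P4) without header comments")
    width, height = int(m.group(1)), int(m.group(2))
    stride = (width + 7) // 8  # bytes per row, rows are byte-aligned
    pixels = data[m.end():m.end() + stride * height]
    header = b"P4\n%d %d\n" % (width, height + extra_rows)
    # All-zero bytes are all-white pixels.
    return header + pixels + bytes(stride * extra_rows)
```

The number of bytes to add depends on the image width, so it is not a fixed count: each extra row costs `(width + 7) // 8` bytes.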

@mike42
Contributor

mike42 commented Jul 9, 2017

This command is currently only going to extract individual images, so I think you should initially add the whitespace back with ImageMagick. This documentation should hopefully get you started.

I think we could solve this properly in one of several ways:

  • Implement a flag in escimages (e.g. --include-blank-lines) to additionally output a 24px tall, 1px wide image for each line of text encountered. This is similar to what you are suggesting (low difficulty).
  • Implement a flag in escimages (e.g. --preserve-formatting) to assemble the output into a single image for OCR users (medium difficulty).
  • Implement a totally separate command, e.g. esc2image, which renders entire receipts to images, just like esc2html (high difficulty).
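To make the first option concrete, this is roughly what such a spacer image would contain, sketched in Python. `blank_pbm` is a hypothetical name, and the flag itself does not exist yet. The defaults follow the 24px tall, 1px wide size suggested above.

```python
def blank_pbm(width: int = 1, height: int = 24) -> bytes:
    """Build an all-white raw (P4) PBM of the given size."""
    stride = (width + 7) // 8  # each row is padded to a whole byte
    # A 0 bit is white in PBM, so the pixel data is just zero bytes.
    return b"P4\n%d %d\n" % (width, height) + bytes(stride * height)
```

Writing `blank_pbm()` to a file between two extracted images would give the stitching step a placeholder for each text line, assuming that step pads narrower images with white.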

Thoughts?

@KlustoR
Contributor Author

KlustoR commented Jul 9, 2017

I reckon any of those options would do the trick. The first one seems the easiest, so I would go for that.
