Escimages: layout a bit off when combining .pbm #36

KlustoR · 2017-06-29T22:44:27Z

Hi,

I have an example where the output differs a bit from the input.
It cuts the receipt a little short, as if a blank line is missing in between.
I have attached the .bin and combined .pbm files; you should be able to see that under "soda" the dashes ---- sit a little too high, so Tesseract has trouble with that line.
Is there any way to avoid this?
Tesseract output
.bin and .pbm files

mike42 · 2017-07-03T07:32:47Z

I think that escimages is behaving correctly here, as the spacing is not part of the images. I ran the input through esc2html; the picture shows me selecting the whitespace in the browser, and this should be closer to the actual print.

There are a few approaches you could use here:

- OCR the images separately.
- Add space when the images are stitched together (not sure how to do this with ImageMagick, but that would be the right place to do it).
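
For the second approach, one way to add the space without ImageMagick is to stitch the files directly. The following is a minimal sketch, assuming the extracted images are binary (P4) PBM files; the function names and the 24-row default gap are illustrative assumptions, not part of escimages.

```python
import re

# Assumption: inputs are binary (P4) PBM files, where 1 = black and 0 = white.
_HEADER = re.compile(rb"P4\s+(?:#[^\n]*\n\s*)*(\d+)\s+(?:#[^\n]*\n\s*)*(\d+)\s")


def read_pbm_p4(data: bytes):
    """Return (width, height, rows) where rows is a list of packed row bytes."""
    match = _HEADER.match(data)
    if match is None:
        raise ValueError("not a binary (P4) PBM file")
    width, height = int(match.group(1)), int(match.group(2))
    stride = (width + 7) // 8  # bytes per row
    body = data[match.end():]
    rows = [body[i * stride:(i + 1) * stride] for i in range(height)]
    return width, height, rows


def clean_row(row: bytes, width: int) -> bytes:
    """Zero the unused padding bits in the last byte of a row."""
    spare = (-width) % 8
    if spare == 0 or not row:
        return row
    mask = (0xFF << spare) & 0xFF
    return row[:-1] + bytes([row[-1] & mask])


def stack_with_gap(images, gap_rows: int = 24) -> bytes:
    """Stack P4 images top to bottom with gap_rows white rows between them."""
    parsed = [read_pbm_p4(data) for data in images]
    out_width = max(width for width, _, _ in parsed)
    out_stride = (out_width + 7) // 8
    blank = bytes(out_stride)  # one all-white row
    out_rows = []
    for index, (width, _, rows) in enumerate(parsed):
        if index > 0:
            out_rows.extend([blank] * gap_rows)
        for row in rows:
            # Narrower images are padded with white on the right.
            out_rows.append(clean_row(row, width).ljust(out_stride, b"\x00"))
    header = b"P4\n%d %d\n" % (out_width, len(out_rows))
    return header + b"".join(out_rows)
```

The gap height would need tuning against the real receipt; a fixed gap between every pair of images is only right if each image was separated by the same amount of space on the print.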

Long term, we need a reliable way to convert receipt files to raster images. HTML gives the most accurate output we have right now, so wkhtmltoimage is a real option.

A-Rotsaert · 2017-07-03T12:32:39Z

I've found the solution. After your reply I googled a bit and found this topic, https://superuser.com/a/290679, which suggests using +append instead of -append with ImageMagick's convert. I will try that soon and let you know.

KlustoR · 2017-07-03T19:28:41Z

I've tried changing -append to +append, but the results were not as expected.
I've also tried wkhtmltoimage, but it doesn't run headless (which is a must for my needs).
Then I tried using xvfb, but that resulted in a low-quality picture.

Also, if I OCR the images separately it becomes too slow for my purpose.
Any ideas? :)

KlustoR · 2017-07-04T18:39:13Z

I went through the source code a bit and found a possible solution:
implement a check on $cmd -> isAvailableAs('LineBreak'), edit the height of the corresponding image, and add the required 8 bytes of null as needed.
Is this something that could work?
I don't know how to edit the height of .pbm images, though.
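
On editing the height of a .pbm: assuming the files are binary (P4) PBM without header comments, each row occupies ceil(width / 8) bytes and a zero bit is a white pixel, so extra blank rows are just zero bytes appended to the data plus an updated height in the header. A minimal sketch (the function name `pad_pbm_height` is hypothetical):

```python
import re

# Assumption: comment-free binary (P4) PBM header, e.g. b"P4\n384 24\n".
_HEADER = re.compile(rb"P4\s+(\d+)\s+(\d+)\s")


def pad_pbm_height(data: bytes, extra_rows: int) -> bytes:
    """Append extra_rows white rows to the bottom of a P4 PBM image."""
    match = _HEADER.match(data)
    if match is None:
        raise ValueError("not a comment-free binary (P4) PBM file")
    width, height = int(match.group(1)), int(match.group(2))
    stride = (width + 7) // 8  # bytes per row
    pixels = data[match.end():match.end() + stride * height]
    if len(pixels) != stride * height:
        raise ValueError("pixel data is shorter than the header claims")
    padding = bytes(stride * extra_rows)  # zero bits are white pixels
    return b"P4\n%d %d\n" % (width, height + extra_rows) + pixels + padding
```

The number of null bytes needed per added row therefore depends on the image width rather than being a fixed count.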

mike42 · 2017-07-09T11:14:02Z

This command is currently only going to extract individual images, so I think you should initially add the whitespace back with ImageMagick. This documentation should hopefully get you started.

I think we could solve this properly in one of several ways:

- Implement a flag in escimages (e.g. --include-blank-lines) to additionally output a 24px tall, 1px wide image for each line of text encountered. This is similar to what you are suggesting (low difficulty).
- Implement a flag in escimages (e.g. --preserve-formatting) to assemble the output into a single image for OCR users (medium difficulty).
- Implement a totally separate command, e.g. esc2image, which renders entire receipts to images, just like esc2html (high difficulty).
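
As an illustration of the first option only, a blank spacer image of the size mentioned could be produced as below. This is a sketch, not the escimages implementation; it assumes binary (P4) PBM output, and `blank_pbm` is a hypothetical helper name.

```python
def blank_pbm(width: int = 1, height: int = 24) -> bytes:
    """Build an all-white binary (P4) PBM image of the given size."""
    stride = (width + 7) // 8  # bytes per row, rows are padded to whole bytes
    header = b"P4\n%d %d\n" % (width, height)
    return header + bytes(stride * height)  # zero bits are white pixels


if __name__ == "__main__":
    # Write one spacer file that can be stacked between the extracted images.
    with open("blank-line.pbm", "wb") as handle:
        handle.write(blank_pbm())
```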

Thoughts?

KlustoR · 2017-07-09T14:24:42Z

I reckon any of those options would do the trick. The first one seems the easiest, so I would go for that.

mike42 added the enhancement label Jul 9, 2017

KlustoR mentioned this issue Jul 24, 2017

Added include-blank-lines to preserve formatting #37

Open
