Identifying and removing asterisks #736

Open
Elias-Est opened this issue Feb 21, 2024 · 8 comments
@Elias-Est

Hi,

I'm currently trying to remove asterisks from a scanned document because they disturb some OCR operations of the software I'm working on (see the images below).

The approach I tried is to convert the image to binary and find the asterisks via pixHMT so I can remove the area around them. However, I found out that the asterisks vary too much to easily define a reliable pattern: The Sels that I created are either too strict or too generic, leading to a lot of false positives and/or negatives.

Now I'm wondering if it even makes sense to continue with this approach, or if there are better ways to identify (and remove) the asterisks. Or are there maybe some tips you can give me that make it easier to find a suitable pattern?

Thank you

Elias


This is, for example, a (small) part of a document with unwanted asterisks:
[image: asterisks]

And the binary version:
[image: asterisks - binary]

@DanBloomberg
Owner

Those asterisks seem regular enough to identify with a good HM sel (structuring element).
Can you show me the HM sel that you used?
For an example of how to generate these HM sels, be sure to see prog/livre_hmt.c
and run it with the parameters that were used for the figures:

   livre_hmt 1 8
   livre_hmt 2 4

@DanBloomberg
Owner

Also, look at prog/findpattern_reg.c

@DanBloomberg
Owner

Also, if the asterisks always come in strings of at least 3, I would use a string of at least 3 of them for the HM sel. That will make it more robust against false positives.

Then after the HM transform, dilate by the image that was used to make the HM sel.
Then you have 2 choices to remove the asterisks from the input image:
(1) do a small dilation on the result and then subtract it from the input image
(2) use the result as a seed, fill with clipping to the input image, and then subtract that from the input image.
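
The match, dilate, and subtract sequence of option (1) can be sketched on a toy bitmap. This is a minimal illustration in plain C that does not call Leptonica: the 9x5 image size, the 3x3 plus-shaped sel standing in for an asterisk, and all function names are assumptions made for the example, and the extra small dilation before subtracting is left out.

```c
#include <assert.h>
#include <string.h>

#define W 9   /* toy image width (assumption) */
#define H 5   /* toy image height (assumption) */

/* 3x3 hit-miss sel with its origin at the center:
 * 'x' = hit (must be foreground), 'o' = miss (must be background),
 * ' ' = don't care. The plus shape stands in for an asterisk. */
static const char SEL[3][4] = { "oxo", "xxx", "oxo" };

/* Read a pixel, treating everything outside the image as background. */
static int get(unsigned char img[H][W], int y, int x) {
    return (y >= 0 && y < H && x >= 0 && x < W) ? img[y][x] : 0;
}

/* Hit-miss transform: dst is 1 only where every hit and miss fits. */
static void hmt(unsigned char src[H][W], unsigned char dst[H][W]) {
    assert(src != dst);  /* not in-place: neighbors are read after writes */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int ok = 1;
            for (int j = 0; j < 3 && ok; j++)
                for (int i = 0; i < 3 && ok; i++) {
                    int p = get(src, y + j - 1, x + i - 1);
                    if (SEL[j][i] == 'x' && !p) ok = 0;
                    if (SEL[j][i] == 'o' && p) ok = 0;
                }
            dst[y][x] = (unsigned char)ok;
        }
}

/* Dilate the match points by the template's hits, regrowing the shape. */
static void dilate_by_template(unsigned char seed[H][W],
                               unsigned char dst[H][W]) {
    memset(dst, 0, H * W);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            if (!seed[y][x]) continue;
            for (int j = 0; j < 3; j++)
                for (int i = 0; i < 3; i++) {
                    int yy = y + j - 1, xx = x + i - 1;
                    if (SEL[j][i] == 'x' &&
                        yy >= 0 && yy < H && xx >= 0 && xx < W)
                        dst[yy][xx] = 1;
                }
        }
}

/* Match, regrow, then subtract the regrown mask from the input. */
static void remove_asterisks(unsigned char src[H][W],
                             unsigned char dst[H][W]) {
    unsigned char hits[H][W], mask[H][W];
    hmt(src, hits);
    dilate_by_template(hits, mask);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[y][x] = (unsigned char)(src[y][x] && !mask[y][x]);
}
```

Calling `remove_asterisks(img, out)` on an image that contains one plus shape and one separate horizontal bar should clear the plus and leave the bar, because the bar fails the hits above and below the sel origin.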

@Elias-Est
Author

Hi @DanBloomberg,

thanks for your response. I looked at the classes you mentioned and how they use the "pixGenerateSelBoundary" method. Using that information, I was able to create a Sel with which I can detect the asterisks quite reliably now. Maybe I'll do some fine-tuning for even better results, but for now, they are sufficient.

Originally, I tried to find the common parts of the asterisks by hand and created Sels with "selCreateFromString", which probably was a bit too optimistic.

Unfortunately, some documents contain single asterisks so using strings of them is not an option.

For removing the found asterisks from the document image, I tried your first suggestion (I'm using pixDilateBrick to dilate), and it works fine, too.

Thank you very much for your help and have a nice weekend.

@DanBloomberg
Owner

Glad it worked out! I wrote those functions for creating HM sels from a bitmap because it's much easier than making them by hand, one hit or miss at a time.

Were you able to make a HM sel for a single asterisk that didn't produce many false positives? The asterisk has a distinctive shape, so my guess is that it is possible. Note that a horizontal line under an asterisk can cause trouble if your sel has a 'miss' too far below it, so when you generate the sel with `pixGenerateSelBoundary()`, use `botflag = 0` so that you don't add extra background below the template.

@Elias-Est
Author

To improve the detection rate while keeping the false positives to a minimum, I'm actually using two sels now, both created with pixGenerateSelBoundary using strict values. This works better than using a single looser one.

Regarding the horizontal line: Removing the asterisks above it works the way you describe. However, I've noticed that the line causes an even larger problem: It makes the numbers and the "EUR" above it unrecognizable for our OCR engine. I tried to remove it using the procedure described in http://www.leptonica.org/line-removal.html, but unfortunately, the lower parts of the characters are fused with the horizontal line so removing the line only leaves unrecognizable leftovers. Do you have any idea what I could do or is this a problem that can't be solved via image processing?

Without dilation:
[image: above-line]

With dilation (2, 2):
[image: above-line-dilate]

@DanBloomberg
Owner

DanBloomberg commented Feb 26, 2024

Not easy.

However, there is something you can try for this particular problem.
(1) Find the bounding boxes of connected components, using pixConnCompBB()
(2) Extend each box down about 6 pixels to include the line below each, using boxaAdjustBoxSizes()
(3) Use the boxa of expanded boxes to extract from the original image. One way to do this:

      pix1 = pixCreateTemplate(pixs);      /* pixs is the original image */
      pixMaskBoxa(pix1, pix1, boxa, L_SET_PIXELS);
      pixAnd(pix1, pix1, pixs);

or, alternatively,

     pixa1 = pixClipRectangles(pixs, boxa);
     pixGetDimensions(pixs, &w, &h, NULL);
     pix1 = pixaDisplay(pixa1, w, h);
     pixaDestroy(&pixa1);

(Because this operation (3) isn't obvious, and it should be easy, I'll add a function that does it)

This unfortunately puts the line under the "R", changing it to a "B". It might also make some numbers unrecognizable, like "4" and "7".
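
The geometry of steps (2) and (3) can be illustrated independently of Leptonica. The following is a rough sketch in plain C under stated assumptions: the `Box` struct, the function names, and the one-byte-per-pixel row-major image layout are hypothetical stand-ins for the example, not the library's types or API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a bounding box (not Leptonica's BOX). */
typedef struct { int x, y, w, h; } Box;

/* Step (2): grow a box downward by `extra` rows, clamped to the image. */
static Box extend_down(Box b, int extra, int img_h) {
    assert(extra >= 0 && b.y >= 0 && b.y + b.h <= img_h);
    int bottom = b.y + b.h + extra;
    if (bottom > img_h) bottom = img_h;
    b.h = bottom - b.y;
    return b;
}

/* Step (3): copy only the pixels of src that fall inside some box.
 * src and dst are row-major w*h arrays holding 0 or 1 per pixel. */
static void keep_inside_boxes(const unsigned char *src, unsigned char *dst,
                              int w, int h, const Box *boxes, int n) {
    memset(dst, 0, (size_t)w * (size_t)h);
    for (int k = 0; k < n; k++) {
        int x0 = boxes[k].x < 0 ? 0 : boxes[k].x;
        int y0 = boxes[k].y < 0 ? 0 : boxes[k].y;
        for (int y = y0; y < boxes[k].y + boxes[k].h && y < h; y++)
            for (int x = x0; x < boxes[k].x + boxes[k].w && x < w; x++)
                dst[y * w + x] = src[y * w + x];
    }
}
```

Extending each character box by a few rows and then calling `keep_inside_boxes` keeps the segment of the line directly under each character and discards the rest of the line, which is also why a character such as "R" can pick up a bottom stroke.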

@Elias-Est
Author

Sorry for the late reply; I was finally able to try out what you suggested.

As you already mentioned, this approach improves the readability of some characters but decreases it for others. Unfortunately, as I noticed, the latter seems to be more often the case than the former.

However, in most of the documents, the line is either clearly separated from the text above or the overlap is small enough to still have recognizable characters after removing the line. Therefore, I decided to just remove the line and put up with the occasional failures if too much of the characters above is cut off.

Still, it was worth a try; thank you again for your help and have a nice weekend.
