Identifying and removing asterisks #736

Elias-Est · 2024-02-21T15:58:46Z

Hi,

I'm currently trying to remove asterisks from a scanned document because they disturb some OCR operations of the software I'm working on (see the images below).

The approach I tried is to convert the image to binary and find the asterisks via pixHMT so I can remove the area around them. However, I found out that the asterisks vary too much to easily define a reliable pattern: The Sels that I created are either too strict or too generic, leading to a lot of false positives and/or negatives.

Now I'm wondering if it even makes sense to continue with this approach, or if there are better ways to identify (and remove) the asterisks. Or are there maybe some tips you can give me that would make it easier to find a suitable pattern?

Thank you

Elias

For example, this is a (small) part of a document with unwanted asterisks:

And the binary version:

DanBloomberg · 2024-02-21T18:00:38Z

Those asterisks seem regular enough to identify with a good HM sel (structuring element).
Can you show me the HM sel that you used?
For an example of how to generate these HM sels, be sure to see prog/livre_hmt.c
and run it with the parameters that were used for the figures:

   livre_hmt 1 8
   livre_hmt 2 4

DanBloomberg · 2024-02-21T18:02:30Z

Also, look at prog/findpattern_reg.c

DanBloomberg · 2024-02-21T21:09:32Z

Also, if the asterisks always come in strings of at least 3, I would use a string of at least 3 of them for the HM sel. That will make it more robust against false positives.

Then after the HM transform, dilate by the image that was used to make the HM sel.
Then you have 2 choices to remove the asterisks from the input image:
(1) do a small dilation on the result and then subtract it from the input image
(2) use the result as a seed, fill with clipping to the input image, and then subtract that from the input image.
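
As an illustration of choice (1), here is a minimal sketch in plain C on a byte-per-pixel bitmap. It deliberately does not use Leptonica; the names `remove_pattern`, `matches_at` and the `HIT`/`MISS` cell codes are made up for this example. It also assumes the template is only tested where it fits entirely inside the image, and it omits the extra small dilation before the subtraction.

```c
#include <assert.h>
#include <stdlib.h>

/* Template cell codes (hypothetical, for this sketch only). */
enum { DONT_CARE = 0, HIT = 1, MISS = 2 };

/* Returns 1 if the template, placed with its upper-left corner at (x, y),
 * matches: every HIT cell lies on foreground, every MISS cell on background. */
static int matches_at(const unsigned char *img, int w, int x, int y,
                      const unsigned char *tpl, int tw, int th)
{
    for (int j = 0; j < th; j++) {
        for (int i = 0; i < tw; i++) {
            unsigned char pixel = img[(y + j) * w + (x + i)];
            unsigned char cell = tpl[j * tw + i];
            if (cell == HIT && !pixel) return 0;
            if (cell == MISS && pixel) return 0;
        }
    }
    return 1;
}

/* Removes every occurrence of the template from img (in place).
 * Returns the number of matches, or -1 if memory allocation fails. */
int remove_pattern(unsigned char *img, int w, int h,
                   const unsigned char *tpl, int tw, int th)
{
    assert(img && tpl && w > 0 && h > 0 && tw > 0 && th > 0);
    unsigned char *mask = calloc((size_t)w * (size_t)h, 1);
    if (!mask) return -1;

    int count = 0;
    for (int y = 0; y + th <= h; y++) {
        for (int x = 0; x + tw <= w; x++) {
            /* Step 1: hit-miss test against the unmodified input. */
            if (!matches_at(img, w, x, y, tpl, tw, th)) continue;
            count++;
            /* Step 2: "dilate" the match by stamping the template's hits. */
            for (int j = 0; j < th; j++)
                for (int i = 0; i < tw; i++)
                    if (tpl[j * tw + i] == HIT)
                        mask[(y + j) * w + (x + i)] = 1;
        }
    }
    /* Step 3: subtract the stamped mask from the input image. */
    for (int k = 0; k < w * h; k++)
        if (mask[k]) img[k] = 0;

    free(mask);
    return count;
}
```

Matching is done against the untouched input and the removal happens only at the end, so one match cannot hide a neighbouring one. Choice (2) would replace step 2 with a seed fill clipped to the input image.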

Elias-Est · 2024-02-23T11:12:52Z

Hi @DanBloomberg,

thanks for your response. I looked at the classes you mentioned and how they use the "pixGenerateSelBoundary" method. Using that information, I was able to create a Sel with which I can detect the asterisks quite reliably now. Maybe I'll do some fine-tuning for even better results, but for now, they are sufficient.

Originally, I tried to find the common parts of the asterisks by hand and created Sels with "selCreateFromString", which probably was a bit too optimistic.

Unfortunately, some documents contain single asterisks so using strings of them is not an option.

For removing the found asterisks from the document image, I tried your first suggestion (I'm using pixDilateBrick to dilate), and it works fine, too.

Thank you very much for your help and have a nice weekend.

DanBloomberg · 2024-02-23T18:37:59Z

Glad it worked out! I wrote those functions for creating HM sels from a bitmap because it's much easier than making them by hand, one hit or miss at a time.

Were you able to make an HM sel for a single asterisk that didn't produce many false positives? The asterisk has a distinctive shape, so my guess is that it is possible. Note that a horizontal line under an asterisk can cause trouble if your sel has a 'miss' too far below it, so when you generate the sel with `pixGenerateSelBoundary()`, use `botflag = 0` so that you don't add extra background below the template.
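
To make the point about the line concrete, here is a small plain-C sketch (not Leptonica code; `template_matches` and the `HIT`/`MISS` codes are invented for the illustration). It checks one template placement on a byte-per-pixel bitmap, so you can compare a template that carries an extra row of misses below the glyph with one that does not.

```c
#include <assert.h>

/* Template cell codes (hypothetical, for this sketch only). */
enum { DONT_CARE = 0, HIT = 1, MISS = 2 };

/* Returns 1 if the template matches with its upper-left corner at (x, y).
 * A placement that does not fit inside the image counts as no match. */
int template_matches(const unsigned char *img, int w, int h, int x, int y,
                     const unsigned char *tpl, int tw, int th)
{
    assert(img && tpl && tw > 0 && th > 0);
    if (x < 0 || y < 0 || x + tw > w || y + th > h) return 0;
    for (int j = 0; j < th; j++) {
        for (int i = 0; i < tw; i++) {
            unsigned char pixel = img[(y + j) * w + (x + i)];
            unsigned char cell = tpl[j * tw + i];
            if (cell == HIT && !pixel) return 0;   /* expected foreground */
            if (cell == MISS && pixel) return 0;   /* expected background */
        }
    }
    return 1;
}
```

With a glyph on one row, a blank row beneath it and a ruled line two rows below, a template whose misses reach down to the ruled line is rejected, while the same template without that bottom row of misses still matches. That is the situation `botflag = 0` is meant to avoid.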

Elias-Est · 2024-02-26T14:07:39Z

To improve the detection rate while keeping the false positives to a minimum, I'm actually using two sels now which I created by using pixGenerateSelBoundary with strict values. This works better than using a single more loose one.

Regarding the horizontal line: Removing the asterisks above it works the way you describe. However, I've noticed that the line causes an even larger problem: It makes the numbers and the "EUR" above it unrecognizable for our OCR engine. I tried to remove it using the procedure described in http://www.leptonica.org/line-removal.html, but unfortunately, the lower parts of the characters are fused with the horizontal line so removing the line only leaves unrecognizable leftovers. Do you have any idea what I could do or is this a problem that can't be solved via image processing?

Without dilation

With dilation (2, 2)

DanBloomberg · 2024-02-26T19:51:54Z

Not easy.

However, there is something you can try for this particular problem.
(1) Find the bounding boxes of connected components, using pixConnCompBB()
(2) Extend each box down about 6 pixels to include the line below each, using boxaAdjustBoxSizes()
(3) Use the boxa of expanded boxes to extract from the original image. One way to do this:

      pix1 = pixCreateTemplate(pixs);      /* pixs is the original image */
      pixMaskBoxa(pix1, pix1, boxa, L_SET_PIXELS);
      pixAnd(pix1, pix1, pixs);

or, alternatively,

     pixa1 = pixClipRectangles(pixs, boxa);
     pixGetDimensions(pixs, &w, &h, NULL);
     pix1 = pixaDisplay(pixa1, w, h);
     pixaDestroy(&pixa1);

(Because this operation (3) isn't obvious, and it should be easy, I'll add a function that does it)

This unfortunately puts the line under the "R", changing it to a "B". It might also make some numbers unrecognizable, like "4" and "7".
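
For readers who want to see the box logic of steps (2) and (3) without the library, here is a rough plain-C sketch. It is not Leptonica code: the `Box` struct and the functions `extend_down` and `keep_inside_boxes` are hypothetical names for this example, the bitmap is one byte per pixel, and box origins are assumed to be non-negative.

```c
#include <assert.h>

/* Axis-aligned box: upper-left corner (x, y), width w, height h. */
typedef struct { int x, y, w, h; } Box;

/* Step (2): grow a box downward by `extra` pixels, clipped to the image. */
Box extend_down(Box b, int extra, int img_h)
{
    assert(extra >= 0 && img_h > 0);
    b.h += extra;
    if (b.y + b.h > img_h) b.h = img_h - b.y;
    return b;
}

/* Step (3): copy into dst only those pixels of src that lie inside at
 * least one box; everything else in dst is cleared to background. */
void keep_inside_boxes(const unsigned char *src, unsigned char *dst,
                       int w, int h, const Box *boxes, int n)
{
    assert(src && dst && boxes && w > 0 && h > 0);
    for (int k = 0; k < w * h; k++) dst[k] = 0;
    for (int b = 0; b < n; b++) {
        int y_end = boxes[b].y + boxes[b].h;
        int x_end = boxes[b].x + boxes[b].w;
        for (int y = boxes[b].y; y < y_end && y < h; y++)
            for (int x = boxes[b].x; x < x_end && x < w; x++)
                dst[y * w + x] = src[y * w + x];
    }
}
```

Because each box is extended over its full width, the piece of line under every character comes back along with the character itself, which is where the "R" to "B" side effect comes from.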

Elias-Est · 2024-03-08T14:20:37Z

Sorry for the late reply; I was finally able to try out what you suggested.

As you already mentioned, this approach improves the readability of some characters but decreases it for others. Unfortunately, as I noticed, the latter seems to be more often the case than the former.

However, in most of the documents, the line is either clearly separated from the text above or the overlap is small enough to still have recognizable characters after removing the line. Therefore, I decided to just remove the line and put up with the occasional failures if too much of the characters above is cut off.

Still, it was worth a try; thank you again for your help and have a nice weekend.
