Identifying and removing asterisks #736
Comments
Those asterisks seem regular enough to identify with a good HM sel (structuring element).
Also, look at |
Also, if the asterisks always come in strings of at least 3, I would use a string of at least 3 of them for the HM sel. That will make it more robust against false positives. Then after the HM transform, dilate by the image that was used to make the HM sel.
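The sequence suggested above (hit-miss transform, then dilate the matches by the image used to make the sel, then take the result out of the page) can be sketched in plain C. This is a minimal illustration on byte arrays, not Leptonica code: `hit_miss`, `remove_matches`, the `HIT`/`MISS` encoding, and the choice of the sel's top-left corner as its origin are all assumptions made for the example. In Leptonica itself the matching step would go through `pixHMT`.

```c
#include <assert.h>

enum { DONT_CARE = 0, HIT = 1, MISS = 2 };

/* Read a pixel; everything outside the image counts as background. */
static int pixel_at(const unsigned char *img, int w, int h, int x, int y)
{
    return (x >= 0 && x < w && y >= 0 && y < h) ? img[y * w + x] : 0;
}

/* Hit-miss transform of a w x h binary image (1 = foreground).
 * out[y * w + x] is 1 where the sw x sh sel, placed with its top-left
 * corner at (x, y), has every HIT on foreground and every MISS on
 * background. */
void hit_miss(const unsigned char *img, int w, int h,
              const unsigned char *sel, int sw, int sh,
              unsigned char *out)
{
    assert(w > 0 && h > 0 && sw > 0 && sh > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int match = 1;
            for (int j = 0; j < sh && match; j++) {
                for (int i = 0; i < sw && match; i++) {
                    int p = pixel_at(img, w, h, x + i, y + j);
                    unsigned char s = sel[j * sw + i];
                    if ((s == HIT && !p) || (s == MISS && p))
                        match = 0;
                }
            }
            out[y * w + x] = (unsigned char)match;
        }
    }
}

/* Dilate the match locations by the pw x ph pattern bitmap and clear
 * the covered pixels from img. seeds must not alias img. */
void remove_matches(unsigned char *img, int w, int h,
                    const unsigned char *seeds,
                    const unsigned char *pat, int pw, int ph)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (!seeds[y * w + x])
                continue;
            for (int j = 0; j < ph; j++)
                for (int i = 0; i < pw; i++)
                    if (pat[j * pw + i] && x + i < w && y + j < h)
                        img[(y + j) * w + (x + i)] = 0;
        }
}
```

The misses are what keep a solid blob from matching a sel built for a thinner shape: a region that contains all the hits still fails if it also covers a miss.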
Hi @DanBloomberg, thanks for your response. I looked at the classes you mentioned and how they use the "pixGenerateSelBoundary" method. Using that information, I was able to create a Sel that now detects the asterisks quite reliably. I may do some fine-tuning for even better results, but for now the results are sufficient. Originally, I tried to find the common parts of the asterisks by hand and created Sels with "selCreateFromString", which was probably a bit too optimistic. Unfortunately, some documents contain single asterisks, so using strings of them is not an option. For removing the found asterisks from the document image, I tried your first suggestion (I'm using pixDilateBrick to dilate), and it works fine, too. Thank you very much for your help and have a nice weekend.
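To illustrate the idea of deriving a hit-miss sel from a bitmap rather than writing it by hand, here is a small plain-C sketch. It is only loosely modeled on what a boundary-based generator does and is not the Leptonica implementation: `sel_from_bitmap`, the numeric `HIT`/`MISS` encoding, and the fixed rule (hits on the foreground boundary, misses on background pixels touching the foreground, no subsampling) are assumptions chosen for brevity.

```c
#include <assert.h>

enum { DONT_CARE = 0, HIT = 1, MISS = 2 };

/* Foreground test; everything outside the bitmap counts as background. */
static int fg(const unsigned char *pat, int w, int h, int x, int y)
{
    return x >= 0 && x < w && y >= 0 && y < h && pat[y * w + x];
}

/* Build a (w + 2) x (h + 2) hit-miss sel from a w x h bitmap.
 * The sel is the bitmap plus a 1-pixel border, so sel position (i, j)
 * corresponds to bitmap position (i - 1, j - 1). */
void sel_from_bitmap(const unsigned char *pat, int w, int h,
                     unsigned char *sel)
{
    assert(w > 0 && h > 0);
    int sw = w + 2;
    for (int j = 0; j < h + 2; j++) {
        for (int i = 0; i < sw; i++) {
            int x = i - 1, y = j - 1;
            unsigned char v = DONT_CARE;
            if (fg(pat, w, h, x, y)) {
                /* Hit only on boundary pixels: foreground with at
                 * least one background 4-neighbor. */
                if (!fg(pat, w, h, x - 1, y) || !fg(pat, w, h, x + 1, y) ||
                    !fg(pat, w, h, x, y - 1) || !fg(pat, w, h, x, y + 1))
                    v = HIT;
            } else {
                /* Miss on background pixels that touch the foreground
                 * through any of their 8 neighbors. */
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        if (fg(pat, w, h, x + dx, y + dy))
                            v = MISS;
            }
            sel[j * sw + i] = v;
        }
    }
}
```

Interior foreground and distant background stay as don't-cares, so the sel constrains only the outline of the shape. A stricter or looser sel would move the hits further inside and the misses further outside, which is the kind of tuning discussed in this thread.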
Glad it worked out! I wrote those functions for creating HM sels from a bitmap because it's much easier than making them by hand, one hit or miss at a time. Were you able to make a HM sel for a single asterisk that didn't produce many false positives? The asterisk has a distinctive shape, so my guess is that it is possible. Note that a horizontal line under an asterisk can cause trouble if your sel has a 'miss' too far below it, so when you generate the sel with `pixGenerateSelBoundary()`
To improve the detection rate while keeping the false positives to a minimum, I'm actually using two sels now, which I created by using pixGenerateSelBoundary with strict values. This works better than using a single looser one. Regarding the horizontal line: Removing the asterisks above it works the way you describe. However, I've noticed that the line causes an even larger problem: it makes the numbers and the "EUR" above it unrecognizable for our OCR engine. I tried to remove it using the procedure described in http://www.leptonica.org/line-removal.html, but unfortunately the lower parts of the characters are fused with the horizontal line, so removing the line leaves only unrecognizable leftovers. Do you have any idea what I could do, or is this a problem that can't be solved via image processing?
Not easy. However, there is something you can try for this particular problem.
or, alternatively,
(Because this operation (3) isn't obvious, and it should be easy, I'll add a function that does it.) This unfortunately puts the line under the "R", changing it to a "B". It might also make some numbers unrecognizable, like "4" and "7".
Sorry for the late reply; I was finally able to try out what you suggested. As you already mentioned, this approach improves the readability of some characters but decreases it for others. Unfortunately, as I noticed, the latter seems to be the case more often than the former. However, in most of the documents, the line is either clearly separated from the text above or the overlap is small enough to still have recognizable characters after removing the line. Therefore, I decided to just remove the line and put up with the occasional failures when too much of the characters above is cut off. Still, it was worth a try; thank you again for your help and have a nice weekend.
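For reference, plain line removal on a binary image, the compromise settled on above, amounts to subtracting a horizontal opening: every horizontal run of foreground at least as long as some threshold is cleared. The sketch below is a simplified stand-in written in plain C, not the procedure from the linked line-removal page and not Leptonica code; `remove_long_hruns` and the `min_len` threshold are hypothetical names for the example.

```c
#include <assert.h>

/* Clear every horizontal run of foreground pixels (1 = foreground)
 * that is at least min_len long. This is the same as subtracting the
 * opening of the image by a min_len x 1 horizontal brick.
 * Returns the number of pixels cleared. */
int remove_long_hruns(unsigned char *img, int w, int h, int min_len)
{
    assert(w > 0 && h > 0 && min_len > 0);
    int cleared = 0;
    for (int y = 0; y < h; y++) {
        int x = 0;
        while (x < w) {
            if (!img[y * w + x]) {
                x++;
                continue;
            }
            /* Measure the run starting at x. */
            int start = x;
            while (x < w && img[y * w + x])
                x++;
            if (x - start >= min_len) {
                for (int i = start; i < x; i++)
                    img[y * w + i] = 0;
                cleared += x - start;
            }
        }
    }
    return cleared;
}
```

Any character pixels that lie on the same rows as the line are part of the long run and are cleared with it, which is why characters fused with the line lose their lower parts.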
Hi,
I'm currently trying to remove asterisks from a scanned document because they disturb some OCR operations of the software I'm working on (see the images below).
The approach I tried is to convert the image to binary and find the asterisks via pixHMT so I can remove the area around them. However, I found out that the asterisks vary too much to easily define a reliable pattern: The Sels that I created are either too strict or too generic, leading to a lot of false positives and/or negatives.
Now I'm wondering whether it even makes sense to continue with this approach, or whether there are better ways to identify (and remove) the asterisks. Or are there maybe some tips you can give me that would make it easier to find a suitable pattern?
Thank you
Elias
This is, for example, a (small) part of a document with unwanted asterisks:
And the binary version: