Confidence value calculation (CC - WC - PC) - annotation extension #23
Comments
[email protected] said Regarding the Character Confidence (CC): |
I am highly concerned with the attempt to standardize the relationship between A word is much more than the sum of its component glyphs, and its confidence can be affected by many factors other than the confidence of these component glyphs. Trying to standardize this would force ALTO output to ignore these factors, or else to force all OCR software to use the identical algorithm, regardless of what experiments indicate. In my opinion, ALTO should make an effort to be descriptive rather than prescriptive. |
This came up at the 2019-01-25 Board meeting. One idea was to use a sort of registry of algorithms if it was not possible to reconcile the metrics. On the other hand, there was also a feeling that the scale should be consistent. |
Hi Uriel, I would be glad to have you on board for the upcoming technical call that we will set up for this. |
@Jo-CCS - please e-mail me with details regarding the technical call, and I'll see if I can attend. I've added my e-mail address to my GitHub profile. I remain convinced that standardizing the method for calculating What I mean by best is: So far I've experimented with different ways of aggregating the glyph confidence scores (e.g. arithmetic vs. geometric mean), different ways of including the glyph splitting scores (as opposed to the glyph recognition scores), different ways of integrating whether or not a word is in the dictionary, and how often the word appears in a reference corpus. Because I deal with languages without standardized spelling, another factor to take into account is the word distance from dictionary entries. I would also like in the future to incorporate corpus-based n-gram models. There are really far too many possibilities to experiment with, and attempting to standardize would necessarily limit |
As agreed at Friday's Board Meeting, we will move ahead with a single topic meeting on this issue, open to anyone who is interested. @urieli: I will include your email in the poll, it is hoped we can have the meeting in April. |
I include the Doodle Poll link here for completeness. |
Information from ABBYY Support: |
Good discussion input. At CCS we compare multiple OCR results with each other, and these unified confidence values (the results of the different engines converted to the same value ranges) are what identify which result to take as the best. |
As per our 2019-07-08 Board Meeting and originating in the special single topic ALTO Meeting described above, we invite interested parties to explore and comment on the summary document on OCR Confidence put together by Board Member Ashok Popat of Google. |
The schema does not define how the page, word and character confidence values are calculated.
To make the confidence values comparable, the idea is to share the calculation methods and to agree on a common rule.
Below are the calculation methods CCS has used so far in docWorks.
Precondition detail:
ABBYY FineReader up to version 7.1: the character confidence range is 28 (good) to 55 (bad)
ABBYY FineReader from version 8.0: the character confidence range is 0 (good) to 100 (bad)
These ranges have to be transformed into the range defined by ALTO (0 to 9; see below), which loses precision.
For that reason CCS calculates WC from the more precise ABBYY values (range 28 - 55 / 0 - 100). As a result, the WC values in ALTO can show rounding differences compared with a WC derived from the ALTO CC values.
CC:
The character confidence is defined in ALTO on a scale of "0" to "9": "0" is best, "9" is worst.
Character Confidence is determined according to the ABBYY character confidence.
The results from the FineReader engines are normalized to the ALTO scale of 0 to 9 per character.
E.g. the word FAX, detected 100% OK by the OCR engine, will have a CC of 000: one digit for every character.
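The normalization step can be illustrated with a short Python sketch. The exact mapping used by docWorks is not stated in this issue, so the linear scaling and half-up rounding below are assumptions, and the function names `engine_to_alto_cc` and `cc_string` are hypothetical.

```python
import math


def engine_to_alto_cc(value: float, best: float, worst: float) -> int:
    """Map an engine confidence onto ALTO's CC scale: 0 (best) to 9 (worst).

    Assumption: a linear mapping with half-up rounding. `best` and `worst`
    are the engine's range ends, e.g. 28 and 55 for FineReader up to 7.1,
    or 0 and 100 from FineReader 8.0 on.
    """
    low, high = min(best, worst), max(best, worst)
    clamped = max(low, min(high, value))          # guard against out-of-range input
    fraction = (clamped - best) / (worst - best)  # 0.0 = best, 1.0 = worst
    return math.floor(fraction * 9 + 0.5)


def cc_string(values, best: float, worst: float) -> str:
    """Build the CC attribute: one digit per character of the word."""
    return "".join(str(engine_to_alto_cc(v, best, worst)) for v in values)


if __name__ == "__main__":
    # A three-letter word such as FAX, recognized with the best score throughout.
    print(cc_string([0, 0, 0], best=0, worst=100))
    print(cc_string([28, 41.5, 55], best=28, worst=55))
```

Because ten output digits cover a much wider input range, different engine values collapse onto the same CC digit, which is the loss of precision mentioned in the precondition.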
WC:
Word Confidence is determined based on character level confidence.
The better the character confidence the better the word confidence.
In addition, the word confidence is influenced by the dictionary verification.
If a word is found in the dictionary, its word confidence value increases.
The longer the word, the higher the confidence value.
(Explanation: If a long word (e.g. with 15 characters) is found in the dictionary, it is very likely correct, because a wrongly detected character is unlikely to produce another dictionary word by mistake. Short words like 'fun' / 'fan' will both be found in the dictionary, so the dictionary check gives no additional guarantee that the right word was detected.)
For that reason, words with 2 or fewer characters are not checked against the dictionary.
The word confidence is normalized to an interval of "0.00" to "1.00": "1.00" best, "0.00" worst.
Calculation:
double( (sum CC)/numChar )/1000.0 - normalization to (0,1)
Example:
When a word is in the dictionary, the confidence is 1.0; otherwise it is computed, mainly as the average of all “reversed” CC values. For “212” this means ((10-2) + (10-1) + (10-2)) / 3 = 25/3 = 8.33, i.e. a WC of 0.83.
Short words (fewer than 3 characters) carry a higher risk of incorrect characters, so they are calculated differently. (still pending)
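The worked example above can be reproduced with a few lines of Python. This is only a sketch of the not-in-dictionary branch; the function name `word_confidence_from_cc` is hypothetical, and the pending special handling of short words is not modelled.

```python
def word_confidence_from_cc(cc_digits: str) -> float:
    """Average of the "reversed" CC digits, scaled to the 0..1 WC interval.

    Each CC digit d (0 = best, 9 = worst) is reversed as 10 - d, as in the
    example above; the mean is then divided by 10.
    """
    if not cc_digits:
        raise ValueError("a word needs at least one character")
    reversed_scores = [10 - int(digit) for digit in cc_digits]
    mean = sum(reversed_scores) / len(reversed_scores)
    return mean / 10


if __name__ == "__main__":
    # ((10-2) + (10-1) + (10-2)) / 3 = 25/3 = 8.33, i.e. a WC of 0.83
    print(f"{word_confidence_from_cc('212'):.2f}")
```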
Details:
FR9 (also FR8.1, FR10): the ABBYY character confidence range is 0-100.
The character confidence is normalized to (0,9). The word confidence is the sum of the character confidences divided by the number of characters, i.e. their average.
Before the WC attribute is written to ALTO, the word is checked against the ABBYY dictionary. If the word is found in the dictionary, the confidence increases:
1000 - ((1000 - charConfLevel) / (chars.GetSize()*3));
Otherwise, if the word is not found in the ABBYY dictionary, the initially determined word confidence level is used and normalized to (0,1).
Note:
charConfLevel: word confidence (average confidence on character basis)
chars.GetSize: number of characters in the word
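To see how the dictionary formula above behaves, here is a Python transcription. It assumes that charConfLevel lies on a 0 to 1000 scale with 1000 as the best value (inferred from the constants in the formula and the division by 1000.0) and that the division is floating-point; neither is confirmed in this issue. The function name is hypothetical.

```python
def dictionary_boosted_confidence(char_conf_level: float, num_chars: int) -> float:
    """Transcription of: 1000 - ((1000 - charConfLevel) / (chars.GetSize()*3)).

    Assumptions: char_conf_level is on a 0..1000 scale (1000 = best) and the
    division is not truncated to an integer.
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    return 1000 - (1000 - char_conf_level) / (num_chars * 3)


if __name__ == "__main__":
    # The same starting confidence for a short and a long dictionary word,
    # normalized to the 0..1 interval used for WC.
    for length in (3, 15):
        boosted = dictionary_boosted_confidence(833, length)
        print(length, f"{boosted / 1000.0:.3f}")
```

The remaining distance to 1000 is divided by three times the word length, so a longer dictionary word ends up closer to full confidence, which matches the reasoning given for long words.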
PC:
The Page Confidence is the average dictionary confidence of all alpha-numeric characters.
?
The page confidence is normalized to an interval of "0.00" to "1.00": "1.00" best, "0.00" worst.
Details:
The confidence is calculated by adding all the confidences of the XMLTexts (sum of character confidence)
set confidenceSum [expr $confidenceSum + $noOfAlphaNumChars * $confidence ]
and in the end the total page confidence is calculated after this formula:
return [ expr $confidenceSum/$pgNoOfAlphaNumChars ]
Note:
confidence: XMLText dictionary confidence
The sum of the character confidences divided by the number of characters on the page (normalized in the end to (0,1)) determines the Page Confidence.
If there are zones but no OCR, the returned confidence value is 999, indicating a bad confidence level.
For blank pages the returned confidence value is 100, indicating full confidence on blank pages.
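The character-weighted average in the two Tcl fragments above can be sketched in Python as follows. The function name and the input shape (a list of pairs of alpha-numeric character count and dictionary confidence, one per XMLText) are assumptions made for illustration. The special return values 999 and 100 are left out, because how docWorks detects those cases is not described here.

```python
from typing import Iterable, Tuple


def page_confidence(text_blocks: Iterable[Tuple[int, float]]) -> float:
    """Character-weighted average of the per-XMLText dictionary confidences.

    Each item is (number of alpha-numeric characters, confidence), mirroring
    $noOfAlphaNumChars and $confidence in the Tcl fragments.
    """
    confidence_sum = 0.0
    total_chars = 0
    for num_chars, confidence in text_blocks:
        confidence_sum += num_chars * confidence
        total_chars += num_chars
    if total_chars == 0:
        # Pages without recognized characters are handled separately in docWorks.
        raise ValueError("no alpha-numeric characters on the page")
    return confidence_sum / total_chars


if __name__ == "__main__":
    # Two text blocks: 10 characters at 0.9 and 30 characters at 0.5.
    print(page_confidence([(10, 0.9), (30, 0.5)]))
```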