OCR correction attributes: CS, ILLS, DBTS #21

Open
jpmoreux opened this issue Jun 17, 2014 · 2 comments

@jpmoreux

Use cases:

These String-related attributes describe human decisions or actions taken during the OCR text correction process:
ILLS (boolean, optional): specifies that a word is illegible in the source document (and consequently cannot be corrected). This status can be used:
- during the production workflow: the quality control process needs to know whether a specific word falls within the guaranteed text quality perimeter; the status also indicates that the provider handled the word manually
- by the viewing software: end users should be informed that some words are illegible in the source document itself (it is not an OCR error)

DBTS (boolean, optional): specifies that a word has been corrected but a doubt remains. Same use cases.
Together with CS (Correction Status), already defined by the schema, these two attributes form the "production family" of attributes.
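As a rough illustration of the viewing-software use case, the following minimal sketch reads the proposed attributes from String elements and maps each word to a display status. It assumes the attributes are adopted as proposed; the sample fragment, the element IDs, and the helper names `is_true`, `word_status`, and `classify` are hypothetical, and the ALTO namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment (namespace omitted for brevity).
SAMPLE = """<TextLine>
  <String ID="ST1" WC="0.34" ILLS="true" CONTENT="AnfûràoII"/>
  <String ID="ST2" WC="0.34" DBTS="true" CONTENT="droits"/>
  <String ID="ST3" WC="0.98" CONTENT="de"/>
</TextLine>"""


def is_true(value):
    # xsd:boolean allows "true" and "1" as true values; a missing
    # attribute (None) is treated as "not set".
    return value in ("true", "1")


def word_status(string_el):
    # Illegibility takes precedence here; that ordering is an assumption,
    # the proposal does not say how the two attributes combine.
    if is_true(string_el.get("ILLS")):
        return "illegible"
    if is_true(string_el.get("DBTS")):
        return "doubtful"
    return "normal"


def classify(xml_text):
    root = ET.fromstring(xml_text)
    return {s.get("ID"): word_status(s) for s in root.iter("String")}


print(classify(SAMPLE))
```

A viewer could use such a status to render illegible or doubtful words differently, and a quality control step could use it to exclude illegible words from the guaranteed perimeter.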

Remarks: ILLS could also be useful on the TextBlock/TextLine types:

  • areas of the page with physical defects: stains, blur, etc.
  • areas of the page with scanning defects: curvature near the binding, missing blocks near the margins, etc.

These attributes must be defined with a recommendation: always set the attribute at the highest possible level (i.e., do not repeat it on every sub-element).
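One possible way to check this recommendation in a production workflow is sketched below: it lists the TextLine elements whose String children all carry ILLS="true", which are candidates for moving the attribute up to the line. This assumes ILLS is eventually allowed on TextLine, which is only a suggestion in this issue; the sample fragment and the function name `hoistable_lines` are hypothetical, and the namespace is again omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO-like fragment (namespace omitted for brevity).
SAMPLE = """<TextBlock ID="TB1">
  <TextLine ID="TL1">
    <String ID="ST1" ILLS="true" CONTENT="x"/>
    <SP/>
    <String ID="ST2" ILLS="true" CONTENT="y"/>
  </TextLine>
  <TextLine ID="TL2">
    <String ID="ST3" ILLS="true" CONTENT="z"/>
    <String ID="ST4" CONTENT="droits"/>
  </TextLine>
</TextBlock>"""


def hoistable_lines(xml_text):
    """Return IDs of TextLines where every String is marked ILLS="true"."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iter("TextLine"):
        strings = line.findall("String")
        # Skip lines without words; otherwise require ILLS on every word.
        if strings and all(s.get("ILLS") in ("true", "1") for s in strings):
            result.append(line.get("ID"))
    return result


print(hoistable_lines(SAMPLE))
```

In this sample only TL1 qualifies, since TL2 contains a legible word.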

Examples:

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" ILLS="true" CONTENT="AnfûràoII"/>

<String ID="PAG_00000001_ST000029" STYLEREFS="TXT_1" HPOS="3413" VPOS="296" HEIGHT="448" WIDTH="992" WC="0.34" DBTS="true" CONTENT="droits"/> 

Schema change:

<xsd:attribute name="ILLS" type="xsd:boolean" use="optional">
 <xsd:annotation>
  <xsd:documentation>The word is illegible in the source document and can't be manually corrected. If the content owner thinks the word is legible, the attribute must be dropped (ILLS="false" is not recommended).</xsd:documentation>
 </xsd:annotation>
</xsd:attribute>
<xsd:attribute name="DBTS" type="xsd:boolean" use="optional">
 <xsd:annotation>
  <xsd:documentation>The word has been manually corrected but a doubt remains. If the content owner thinks the doubt is not legitimate, the attribute must be dropped (DBTS="false" is not recommended).</xsd:documentation>
 </xsd:annotation>
</xsd:attribute>
@jpmoreux jpmoreux changed the title "Production family" attributes: CS, ILLS, DBTS "OCR correction" attributes: CS, ILLS, DBTS Jun 17, 2014
@jpmoreux jpmoreux changed the title "OCR correction" attributes: CS, ILLS, DBTS OCR correction attributes: CS, ILLS, DBTS Jun 18, 2014
@cowboyMontana

Changed label from 'submitted' to 'discussion'.

@cowboyMontana

Assigned Jean Philippe Moreux as change request champion
