Currently there is no way of distinguishing hard and soft HYP elements.
Example of a hard hyphen:
I separated the words by a non-
breaking space.
Example of a soft hyphen:
I separated the words by a non-break-
ing space.
However, the OCR system can often distinguish the two (e.g. by checking a lexicon of known words). It should be able to pass this information to downstream systems in the ALTO file, because the distinction can affect OCR-to-text and OCR-layer indexing strategies.
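As a rough illustration of the lexicon check mentioned above, here is a minimal Python sketch. The function name `classify_hyphen`, its return convention, and the tiny lexicon are all hypothetical and not taken from any particular OCR engine; a real system would likely combine this with other evidence.

```python
from typing import AbstractSet, Optional


def classify_hyphen(left: str, right: str, lexicon: AbstractSet[str]) -> Optional[bool]:
    """Guess whether an end-of-line hyphen is hard (True) or soft (False).

    `left` is the text before the hyphen and `right` is the text that
    starts the next line. Returns None when the lexicon cannot decide.
    This is an illustrative heuristic only.
    """
    joined = (left + right).lower()            # word if the hyphen is soft
    hyphenated = (left + "-" + right).lower()  # word if the hyphen is hard
    in_joined = joined in lexicon
    in_hyphenated = hyphenated in lexicon
    if in_hyphenated and not in_joined:
        return True
    if in_joined and not in_hyphenated:
        return False
    return None  # both or neither form is known


# Hypothetical lexicon that knows only the hyphenated spelling.
LEXICON = {"non-breaking", "space", "words"}

print(classify_hyphen("non", "breaking", LEXICON))   # hard hyphen
print(classify_hyphen("non-break", "ing", LEXICON))  # soft hyphen
```

Returning `None` for the undecidable case matches the idea that the attribute would be optional: the OCR system only needs to emit it when it has an answer.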
I suggest changing the HYP element to include a new HARD_HYPHEN attribute, as follows:
<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>A hyphenation char. Can appear only at the end of a line.</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"/>
    <xsd:attribute name="HARD_HYPHEN" type="xsd:boolean" use="optional">
      <xsd:annotation>
        <xsd:documentation>True if this is a hard-hyphen (would appear in the word regardless of print location), false if this is a soft hyphen (only appears in the word if it is split at the end of a line).</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
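To show how a downstream OCR-to-text step might use the proposed attribute, here is a minimal Python sketch. It assumes a simplified, namespace-free ALTO fragment, and the function name `block_text` and the sample data are hypothetical. Treating a missing `HARD_HYPHEN` as soft is also just an assumption made for this example, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Simplified ALTO-like fragments (no namespace, no coordinates).
HARD_SAMPLE = """<TextBlock>
  <TextLine><String CONTENT="non"/><HYP CONTENT="-" HARD_HYPHEN="true"/></TextLine>
  <TextLine><String CONTENT="breaking"/><SP/><String CONTENT="space"/></TextLine>
</TextBlock>"""

SOFT_SAMPLE = """<TextBlock>
  <TextLine><String CONTENT="non-break"/><HYP CONTENT="-" HARD_HYPHEN="false"/></TextLine>
  <TextLine><String CONTENT="ing"/><SP/><String CONTENT="space"/></TextLine>
</TextBlock>"""


def block_text(xml_source: str) -> str:
    """Flatten a text block to plain text, rejoining hyphenated words."""
    block = ET.fromstring(xml_source)
    parts = []
    for line in block.iter("TextLine"):
        ends_with_hyp = False
        for el in line:
            if el.tag == "String":
                parts.append(el.get("CONTENT", ""))
            elif el.tag == "SP":
                parts.append(" ")
            elif el.tag == "HYP":
                ends_with_hyp = True
                # xsd:boolean accepts "true"/"1" and "false"/"0".
                # Assumption: an absent attribute is treated as soft.
                if el.get("HARD_HYPHEN") in ("true", "1"):
                    parts.append(el.get("CONTENT", "-"))
        if not ends_with_hyp:
            parts.append(" ")  # ordinary line break becomes a space
    return "".join(parts).strip()


print(block_text(HARD_SAMPLE))
print(block_text(SOFT_SAMPLE))
```

Both samples flatten to the same text, with the hyphen kept only where it is marked as hard, which is the information an indexer cannot currently recover from the ALTO file alone.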