Make it possible to distinguish hard and soft hyphens #86

Open
urieli opened this issue Mar 23, 2024 · 2 comments

urieli commented Mar 23, 2024

Currently there is no way to distinguish hard and soft hyphens in HYP elements.

Example of a hard hyphen:

I separated the words by a non-
breaking space.

Example of a soft hyphen:

I separated the words by a non-break-
ing space.

However, the OCR system can often distinguish the two (e.g. by checking a lexicon of known words). It should be able to pass this information to downstream systems in the ALTO file, because the distinction can affect OCR-to-text and OCR layer indexing strategies.
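
As a rough illustration of the lexicon check mentioned above, here is a minimal Python sketch. The function name `classify_line_end_hyphen` and the tiny `LEXICON` set are hypothetical and exist only for this example; a real OCR engine would use its own dictionary and likely more signals than a plain lookup.

```python
def classify_line_end_hyphen(before, after, lexicon):
    """Guess whether a line-end hyphen is hard or soft.

    `before` is the word fragment preceding the hyphen, `after` is the
    fragment that starts the next line. Returns True for a hard hyphen,
    False for a soft hyphen, and None when the lexicon cannot decide.
    """
    joined = (before + after).lower()            # reading if the hyphen is soft
    hyphenated = (before + "-" + after).lower()  # reading if the hyphen is hard
    in_joined = joined in lexicon
    in_hyphenated = hyphenated in lexicon
    if in_hyphenated and not in_joined:
        return True
    if in_joined and not in_hyphenated:
        return False
    return None  # both or neither reading is known: leave undecided


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"non-breaking", "space", "words"}

print(classify_line_end_hyphen("non", "breaking", LEXICON))   # True
print(classify_line_end_hyphen("non-break", "ing", LEXICON))  # False
```

The undecided case (`None`) matters here: it is the situation in which a writer would probably leave any hard/soft flag unset rather than guess.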

I suggest changing the HYP element to include a new HARD_HYPHEN attribute, as follows:

<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>A hyphenation char. Can appear only at the end of a line.</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"/>
    <xsd:attribute name="HARD_HYPHEN" type="xsd:boolean" use="optional">
      <xsd:annotation>
        <xsd:documentation>True if this is a hard hyphen (would appear in the word regardless of print location), false if this is a soft hyphen (only appears in the word if it is split at the end of a line).</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
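
To show how a downstream OCR-to-text step might use the proposed attribute, here is a minimal Python sketch. It is an assumption-laden illustration, not part of ALTO: the fragment is simplified (no namespace, no coordinates), the `alto_to_text` helper is hypothetical, and treating a missing HARD_HYPHEN as a soft hyphen is a choice made only for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO-like fragment using the proposed attribute.
HARD_EXAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="a"/><SP/><String CONTENT="non"/>
    <HYP CONTENT="-" HARD_HYPHEN="true"/>
  </TextLine>
  <TextLine>
    <String CONTENT="breaking"/><SP/><String CONTENT="space"/>
  </TextLine>
</TextBlock>
"""


def alto_to_text(xml_string):
    """Rebuild running text, keeping hard hyphens and dropping soft ones."""
    block = ET.fromstring(xml_string)
    parts = []
    for line in block.iter("TextLine"):
        ends_with_hyp = False
        for child in line:
            if child.tag == "String":
                parts.append(child.get("CONTENT", ""))
                ends_with_hyp = False
            elif child.tag == "SP":
                parts.append(" ")
                ends_with_hyp = False
            elif child.tag == "HYP":
                # xsd:boolean allows "true"/"1"; absent is treated as soft
                # here, which is an assumption of this sketch.
                if child.get("HARD_HYPHEN") in ("true", "1"):
                    parts.append(child.get("CONTENT", "-"))
                ends_with_hyp = True
        if not ends_with_hyp:
            parts.append(" ")  # ordinary line break becomes a space
    return "".join(parts).strip()


print(alto_to_text(HARD_EXAMPLE))  # a non-breaking space
```

With the soft-hyphen variant of the same sentence (`non-break` followed by a HYP with HARD_HYPHEN="false", then `ing`), this sketch yields the same text, which is the behaviour the attribute is meant to make possible.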

cipriandinu (Member) commented

Thank you for your proposal. We will discuss it and take it into account for the next release (5.0).

cipriandinu (Member) commented

Maybe this should be discussed in a larger context; see #43.
