Length of main glyph and variants #44
This was discussed in the technical sessions, and I think it is also explained by Jean-Philip's statement that the main glyph should be a single sign and should be limited to length 1. The aim is to prevent misuse or misinterpretation: binding multiple characters to one glyph would open up all kinds of possible combinations for the alternatives.
In an effort to keep ahead of schema issues, those without a direct schema implication will be closed if they are deemed no longer active or if the discussion has gone full circle. They can be reopened on request.
The change proposed by @Jo-CCS and adopted into 4.0-4.2 includes a restriction on "character" length that seems overly restrictive to me, not just with respect to OCR results but on principle: in some languages and scripts, not all relevant characters can be represented by a single Unicode codepoint (not to be confused with Line 1039 in 682bed5
Scripts like Arabic, Hebrew, Devanagari and Bengali heavily rely on combining mark sequences, and even for European languages (esp. in historic texts) there's not always a precomposed codepoint available. For example, German umlauts Please re-open.
I agree, this should be re-opened. Some glyphs we have in historic prints, like aͤ (LATIN SMALL LETTER A + COMBINING LATIN SMALL LETTER E), cannot be represented by a single Unicode code point, and the cited XML Schema restriction does not allow us to save them in a valid ALTO document.
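The point about combining sequences can be checked with a minimal Python sketch. It assumes that "length" means the number of code points after NFC normalization; that reading is an assumption made here for illustration, and the helper name `codepoint_length` is hypothetical.

```python
import unicodedata


def codepoint_length(text: str) -> int:
    """Count Unicode code points after NFC normalization (assumed measure of length)."""
    return len(unicodedata.normalize("NFC", text))


# a + COMBINING LATIN SMALL LETTER E (U+0364): no precomposed form exists
a_with_small_e = "a\u0364"
# a + COMBINING DIAERESIS (U+0308): NFC composes this to U+00E4
a_with_diaeresis = "a\u0308"

print(codepoint_length(a_with_small_e))    # 2
print(codepoint_length(a_with_diaeresis))  # 1
```

Under that assumption, normalization helps for the modern umlaut but not for aͤ, which stays at two code points and so would fail a length-1 restriction.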
Thanks for the comments, this issue is reopened.
Separated from #26 (comment)
Glyph variants: The main glyphs are restricted to length 1 but variants to length 3. This could be a bit inconvenient when dealing with OCR results. Say FineReader returns 5 options, some with length 1 and some longer. What happens if the first one is not of length 1, does the ALTO exporter tool then check if there is one with length 1 among the other options and change the order? And why three? For Latin that would probably cover most cases, but for other scripts there might be longer ones.
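One conceivable answer to the ordering question is sketched below in Python. It is purely hypothetical and does not describe what any existing ALTO exporter does: the function name `split_main_and_variants`, its parameters, and the policy of promoting the first length-1 candidate and dropping over-long ones are all assumptions made for illustration. Length is counted in code points.

```python
from typing import List, Optional, Tuple


def split_main_and_variants(
    candidates: List[str], main_max: int = 1, variant_max: int = 3
) -> Tuple[Optional[str], List[str]]:
    """Hypothetical policy: promote the first candidate short enough to be the
    main glyph; keep the others, in their original order, as variants if they
    fit the variant limit; silently drop anything longer."""
    main_index = next(
        (i for i, c in enumerate(candidates) if 1 <= len(c) <= main_max), None
    )
    main = candidates[main_index] if main_index is not None else None
    variants = [
        c
        for i, c in enumerate(candidates)
        if i != main_index and 1 <= len(c) <= variant_max
    ]
    return main, variants


# Five made-up OCR candidates, best first; the top one has length 2.
main, variants = split_main_and_variants(["rn", "m", "nn", "iii", "rrrrr"])
print(main)      # m
print(variants)  # ['rn', 'nn', 'iii']
```

The sketch makes the cost of the restriction visible: the engine's top-ranked candidate is demoted to a variant, and a candidate longer than three code points is lost entirely.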