Length of main glyph and variants #44

splet · 2017-03-02T15:25:29Z

Separated from #26 (comment)
Glyph variants: The main glyph is restricted to length 1, but variants may be up to length 3. This could be inconvenient when dealing with OCR results. Say FineReader returns 5 options, some of length 1 and some longer. What happens if the first one is not of length 1? Does the ALTO exporter tool then check whether one of the other options has length 1 and change the order? And why three? For Latin that would probably cover most cases, but other scripts might need longer ones.
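To make the reordering question concrete, here is a minimal sketch of what such a step could look like. It is purely illustrative: the function name `split_main_and_variants`, its parameters, and the sample candidates are hypothetical, and nothing here describes how FineReader or any actual ALTO exporter behaves.

```python
from typing import List, Tuple


def split_main_and_variants(
    candidates: List[str], max_variant_len: int = 3
) -> Tuple[str, List[str]]:
    """Hypothetical helper: promote the first length-1 candidate to main glyph.

    `candidates` is assumed to be ordered by OCR preference (best first).
    Lengths are counted in Unicode code points, which is what Python's
    len() reports for str.
    """
    main_index = next(
        (i for i, text in enumerate(candidates) if len(text) == 1), None
    )
    if main_index is None:
        # Nothing satisfies the length-1 restriction for the main glyph.
        raise ValueError("no candidate of length 1 available for the main glyph")

    main = candidates[main_index]
    # Keep the remaining candidates in their original order, dropping any
    # that exceed the assumed variant limit.
    variants = [
        text
        for i, text in enumerate(candidates)
        if i != main_index and len(text) <= max_variant_len
    ]
    return main, variants


main, variants = split_main_and_variants(["rn", "m", "nn", "in", "ni"])
print(main, variants)
```

Note the cost of this approach: the engine's preferred reading ("rn" in the made-up sample) is demoted to a variant, so the original ranking is no longer recoverable from the output.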

Jo-CCS · 2017-03-03T10:07:07Z

This was discussed in the technical sessions, and I think it is also explained by Jean-Philip's statement that the main glyph should be a single sign and limited to length 1. The aim is to prevent misuse or misinterpretation: otherwise multiple characters could be bound to one glyph, with all kinds of possible combinations among the alternatives.
See #26 (comment)

artunit · 2019-09-28T18:11:04Z

In an effort to keep ahead of schema issues, ones without a direct schema implication will be closed if deemed to be no longer active or if the discussion has gone full circle. They can be reopened if requested.

bertsky · 2021-02-15T18:53:54Z

The change proposed by @Jo-CCS and adopted into 4.0-4.2 includes this detail of restricting "character" length, which seems overly restrictive to me, not just with respect to OCR results but on grounds of principle: in some languages and scripts, not all relevant characters can be represented by a single Unicode codepoint (not to be confused with a glyph or grapheme cluster), yet that is what the schema enforces:

schema/v4/alto-4-2.xsd, line 1039 in 682bed5:

<xsd:length fixed="true" value="1"/>

Scripts like Arabic, Hebrew, Devanagari and Bengali rely heavily on combining mark sequences, and even for European languages (especially in historic texts) a precomposed codepoint is not always available. For example, the German umlauts äöü can be decomposed not only as äöü (with combining trema) but also as aͤoͤuͤ (with combining e). The same holds for other rare diacritics. One could argue the same for fractions, where only a few like ¾ ⅔ are available precomposed; the others have to be written as a sequence with a fraction slash, in the manner of 3⁄4 2⁄3.
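The codepoint counts behind these examples can be checked with a short sketch using only Python's standard `unicodedata` module. The helper name `codepoint_length` and the sample labels are made up for illustration. It also assumes that the schema's length restriction counts Unicode code points, as the comment above describes.

```python
import unicodedata


def codepoint_length(text: str) -> int:
    """Count Unicode code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


samples = {
    "a + combining diaeresis (U+0308)": "a\u0308",
    "a + combining small e (U+0364)": "a\u0364",
    "3 + fraction slash (U+2044) + 4": "3\u20444",
    "precomposed three quarters (U+00BE)": "\u00be",
}

for label, text in samples.items():
    print(f"{label}: {codepoint_length(text)} code point(s)")
```

NFC composes the first sample into the single codepoint U+00E4, so it would pass a length-1 restriction once normalized. The combining-e and fraction-slash sequences have no precomposed equivalent, so they stay longer than 1 under NFC.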

Please re-open.

mikegerber · 2021-02-16T11:36:57Z

I agree, this should be re-opened. Some glyphs we have in historic prints, like aͤ (LATIN SMALL LETTER A + COMBINING LATIN SMALL LETTER E), cannot be represented by a single Unicode code point, and the cited XML Schema restriction does not allow us to save them in a valid ALTO document.

artunit · 2021-02-16T14:55:40Z

Thanks for the comments, this issue is reopened.

splet self-assigned this Mar 2, 2017

cneud added the 2 discussion label Apr 24, 2018

artunit closed this as completed Sep 28, 2019

bertsky mentioned this issue Feb 15, 2021

ocrd-tesserocr-recognize produces glyph segmentation that ocrd-fileformat-transform can't convert to ALTO OCR-D/ocrd_tesserocr#171

Closed

artunit reopened this Feb 16, 2021

cipriandinu mentioned this issue Aug 2, 2024

Glyphs should allow CONTENT with length above 1 for cases where no precombined character exists #85

Open
