Inline Representation: Sections by function, not class #71
Different space widths should be indicated using HTML and `&ensp;`, `&emsp`,
`&thinsp;`, `&zwnj;`, `&zwj;`.

### Hyphenation
Hyphenation {#hyphenation}
-----------

Soft hyphens must be represented using the HTML `&shy;` entity.

The HTML <a href="https://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.2.5">`&lrm;` and
`&rlm;` entities</a> (indicating writing direction) must not be used; all
writing direction changes must be indicated with tags.
should be under 'Writing Direction' header
Right, doesn't fit in hyphenation. Maybe move this to
https://kba.github.io/hocr-spec/1.2/#font-lang, replace "with tags" with "dir=
attribute" and reference https://kba.github.io/hocr-spec/1.2/#valdef-ocr-capabilities-ocrp_dir?
+1
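The rule discussed in this thread (no direction-mark entities, direction changes expressed in markup) could be checked mechanically. The following is only a sketch: `find_direction_marks` and the sample strings are hypothetical and not part of any hOCR tooling, and the use of a `dir=` attribute on a `span` follows the suggestion above rather than the current spec wording.

```python
import re

# Hypothetical checker: flags the direction marks either as named entities
# or as the literal characters U+200E (LRM) and U+200F (RLM).
FORBIDDEN = re.compile(r"&(?:lrm|rlm);|[\u200e\u200f]")


def find_direction_marks(markup: str) -> list:
    """Return (offset, matched text) for every forbidden direction mark."""
    return [(m.start(), m.group(0)) for m in FORBIDDEN.finditer(markup)]


# Sample inputs (made up for illustration).
bad = '<span class="ocr_line">abc&rlm;def</span>'
good = '<span class="ocr_line">abc <span dir="rtl">def</span></span>'

print(find_direction_marks(bad))
print(find_direction_marks(good))
```

A check like this only finds the forbidden marks; deciding which markup should replace them still needs the surrounding text.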
<li><a href="#sub-sup"><span class="secno">6.4</span> <span class="content">Superscript and Subscript</span></a>
<li><a href="#whitespace"><span class="secno">6.5</span> <span class="content">Whitespace</span></a>
<li><a href="#hyphenation"><span class="secno">6.6</span> <span class="content">Hyphenation</span></a>
<li><a href="#ruby"><span class="secno">6.7</span> <span class="content">Ruby characters</span></a>
You could also combine some of these sections under one new section, i.e.
HTML entities and Unicode
- Non-breaking spaces must be represented using the HTML `&nbsp;` entity.
- Different space widths should be indicated using HTML and `&ensp;`, `&emsp;`, `&thinsp;`, `&zwnj;`, `&zwj;`.
- Soft hyphens must be represented using the HTML `&shy;` entity.
- The HTML `&lrm;` and `&rlm;` entities (indicating writing direction) must not be used; all writing direction changes must be indicated with tags.
- Furigana and similar constructs must be represented using their correct Unicode encoding.
I think these aspects are worth their own section. e.g. for Whitespace: explain whether repeated whitespace is meaningful, if non-tabular aligned text should use tabs. For hyphenation, whether that's the only encoding (e.g. altoxml/schema#41).
Hyphens are also mentioned in http://kba.github.io/hocr-spec/1.2/#hardbreak
Besides ruby, other special entities are also mentioned in the article:
> For example, HTML and CSS provide support for representing fonts, styles, hyphenation, flexible spacing, justification, kashida (flexible Arabic characters), Urdu ligatures, Japanese ruby, mixed horizontal/vertical layout, inline changes in writing direction, and many others.
However, I am also fine with more subsections.
I've moved the paragraph there before and probably will again once I get to the fonts/language section :)
Non-breaking spaces must be represented using the HTML `&nbsp;` entity.

### Non-default spaces

Different space widths should be indicated using HTML and `&ensp;`, `&emsp`,
Semicolon missing in `&emsp`.
Fixed.
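To illustrate why the missing semicolon matters, here is a small sketch using Python's standard `html` module. It shows only how that one parser treats the references; how an hOCR consumer handles them is an assumption. The `ENTITIES` list and `describe` helper are introduced here for illustration.

```python
import html
from html.entities import name2codepoint

# Entity names taken from the spec text quoted in this thread.
ENTITIES = ["nbsp", "ensp", "emsp", "thinsp", "zwnj", "zwj", "shy"]


def describe(name: str) -> str:
    """Format an entity name with the code point Python's table assigns it."""
    return f"&{name}; = U+{name2codepoint[name]:04X}"


for name in ENTITIES:
    print(describe(name))

# With the semicolon the reference is decoded. Without it, html.unescape
# leaves the text unchanged, because "emsp" is not one of the legacy names
# that HTML5 accepts without a terminating semicolon.
with_semicolon = html.unescape("&emsp;")
without_semicolon = html.unescape("&emsp")
print(repr(with_semicolon), repr(without_semicolon))
```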
Superscripts and subscripts, when not in <{ocr_math}> or <{ocr_chem}> formulas,
must be represented using the HTML `<sup>` and `<sub>` tags, even if special
must be represented using the HTML <{sup}> and <{sub}> tags, even if special
These links are not working and I am not sure there is anything we can link to...
Properly link to dir= and lang= attributes
I just find the HTML5 standard way better. I know we have
in there, but we should rather change that than link to an old spec with bad examples:
H<sub>2</sub>O
E = mc<sup>2</sup>
<SPAN lang="fr">M<sup>lle</sup> Dupont</SPAN>
The first two should not use sub/sup at all. None of the tags should be upper-case.
#51