Wish to understand hyphenation support a bit better, appears partially present. #1451

Open
Sunspark-007 opened this issue Dec 4, 2024 · 5 comments
Labels
question Further information is requested

Comments

@Sunspark-007

Question:
Most of my books don't have hyphenation support but I found an epub where hyphenation is working.

It has "English" defined as the language and forces justification in the stylesheet. I don't see anything in the stylesheet that says "hyphen". Yet, toggling foliate's hyphenation switch works. I can see the reflow and the hyphens appearing and disappearing.

This is with a flatpak install. What is special about this epub that the slider works, but not with other books?

Is it something in the stylesheet or is it something else?

@Sunspark-007 Sunspark-007 added the question Further information is requested label Dec 4, 2024
@johnfactotum
Owner

Foliate sets the hyphens property for p, li, blockquote, dd to auto or none depending on the setting. It appends this style at the end of the <head> element but before the user stylesheet.
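To make the mechanism concrete, here is a minimal sketch in Python (Foliate itself is written in JavaScript, so this is an illustration and not its actual code). The function names `hyphenation_css` and `inject_style` are hypothetical; only the selector and the `auto`/`none` values come from the description above.

```python
import xml.etree.ElementTree as ET

# Selector as described in this comment.
SELECTOR = "p, li, blockquote, dd"


def hyphenation_css(enabled: bool) -> str:
    """Build the rule that turns hyphenation on or off (hypothetical helper)."""
    value = "auto" if enabled else "none"
    return f"{SELECTOR} {{ hyphens: {value}; }}"


def inject_style(head: ET.Element, css: str) -> ET.Element:
    """Append a <style> element as the last child of <head>.

    Being last means it comes after the book's own stylesheets, so it wins
    over earlier rules of equal specificity. Rules with higher specificity
    and inline styles still take precedence.
    """
    style = ET.SubElement(head, "style")
    style.text = css
    return style
```

For example, `hyphenation_css(False)` yields `p, li, blockquote, dd { hyphens: none; }`.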

So for this to work:

  • The content must be in a language for which you have a hyphenation dictionary installed. I'm not sure how those are supposed to be installed for Flatpak. The base GNOME runtime does include some hyphenation dictionaries, but I don't know whether it only supports English or depends on the locale, or whether it's possible to install additional dictionaries.
  • The book must correctly specify this language in the document. Foliate sets the document language from the book's metadata if it's not specified in the document.
  • The elements must match the selector p, li, blockquote, dd.
  • The book must not set the hyphens property inline or with higher specificity than p, li, blockquote, dd.
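A rough way to check a content document against the last three conditions is sketched below in Python. It is a diagnostic under stated assumptions, not part of Foliate: the `diagnose` function and its result keys are hypothetical, it only looks at the root element's language, and it cannot tell whether a dictionary is installed or whether a stylesheet rule overrides `hyphens` with higher specificity.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TARGETS = {"p", "li", "blockquote", "dd"}  # the selector listed above
# Matches a hyphens declaration (optionally vendor-prefixed) in a style attribute.
INLINE_HYPHENS = re.compile(r"(?:^|;)\s*(?:-\w+-)?hyphens\s*:")


def diagnose(xhtml: str) -> dict:
    """Report facts about one XHTML document that affect hyphenation."""
    root = ET.fromstring(xhtml)
    result = {
        "root_lang": root.get("lang") or root.get(XML_LANG),
        "matching": 0,          # elements the selector would reach
        "divs": 0,              # elements it would miss if they hold the text
        "inline_overrides": 0,  # matching elements with an inline hyphens value
    }
    for el in root.iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace, if any
        if name == "div":
            result["divs"] += 1
        elif name in TARGETS:
            result["matching"] += 1
            if INLINE_HYPHENS.search(el.get("style", "")):
                result["inline_overrides"] += 1
    return result
```

A document with no root language, or with many `div` elements and few matching ones, is a likely candidate for the toggle having no visible effect.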

@Sunspark-007
Author

Another book I am looking at, not from a commercial publisher, says "und" for language. The whole book is in English, so for that one it missed identifying it. But that wouldn't be the sole hyphenation blocker, because other ones listed as "English" also don't have it. Hmm, maybe they are not using the selector. I will have to look again at the stylesheet.

What are the conditions where Foliate can't identify the language from the metadata?

@johnfactotum
Owner

johnfactotum commented Dec 5, 2024

Foliate simply reads the language code from the metadata and sets it as the value for the lang attribute on the root element of content documents in the book if such an attribute isn't already present.
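In other words, the fallback only fills a gap and never overrides what the document says. A minimal Python sketch of that rule follows (Foliate's real code is JavaScript; `apply_fallback_lang` is a hypothetical name, and treating an existing `xml:lang` as "already present" is an assumption of this sketch):

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def apply_fallback_lang(root: ET.Element, metadata_lang):
    """Set lang on the root element only when the document has none.

    Assumption: an existing xml:lang also counts as "already present".
    """
    has_lang = root.get("lang") is not None or root.get(XML_LANG) is not None
    if metadata_lang and not has_lang:
        root.set("lang", metadata_lang)
    return root
```

So a content document that already declares a wrong language keeps it, whatever the metadata says.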

So even if the book's language is correct, the lang attribute in the content documents can still be incorrect. This is less common, though. More often the reason is that the book uses div rather than p. We could probably change it so that hyphens is set on the root, with headings as the exception. I think that's how Readium CSS does it.

@Sunspark-007
Author

I have attached a link to the book, no worries on copyright as it's from 1920.

Took a look at the thing. So for undefined, the metadata.opf file literally says UND in the language. However, toc.ncx does say xml:lang="en", so I'm wondering if a condition should be added where, if the opf is und, it looks at the toc to see if there is a code there?
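For anyone who wants to repeat this inspection on their own books, the two values can be read with a short Python sketch like the one below. The function names are hypothetical, and the sketch assumes the OPF and NCX contents have already been extracted from the EPUB archive as strings.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def opf_languages(opf_xml: str) -> list:
    """Return every non-empty dc:language value in the package document."""
    root = ET.fromstring(opf_xml)
    return [
        el.text.strip()
        for el in root.iter(DC + "language")
        if el.text and el.text.strip()
    ]


def ncx_language(ncx_xml: str):
    """Return the xml:lang attribute on the NCX root, or None if absent."""
    return ET.fromstring(ncx_xml).get(XML_LANG)
```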

Looking at the css, it is using divs but the html is using p. I didn't see hyphens named.

A Voyage to Arcturus - David Lindsay

@johnfactotum
Owner

johnfactotum commented Dec 6, 2024

Even if we assume, for the sake of the argument, that it is correct or at least somewhat beneficial to set the language of the book from publication resources when dc:language is missing or is und, this would be highly problematic because there's no reason to favor the navigation document (the TOC) over the content documents. There is for example a book from the EPUB 3 Samples: https://github.com/IDPF/epub3-samples/releases/download/20230704/regime-anticancer-arabic.epub. The navigation document is in French but the rest of the book is all in Arabic. And there's no reason to favor one content document over another, either. But that means the reader would have to load and parse every single resource in the book, just to see if there's any language info anywhere. You might even need to produce a statistic to determine the ordering, in the case of multilingual books such as the example just mentioned.
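To illustrate what such a statistic would involve, here is a hypothetical Python sketch. It is not something Foliate does; `language_histogram` is an invented name, and it only looks at the root element of each document, which is already a simplification.

```python
from collections import Counter
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_histogram(documents) -> Counter:
    """Count the root-level language of each XHTML document.

    Every document has to be parsed in full just to read one attribute,
    which is the cost described above.
    """
    counts = Counter()
    for xhtml in documents:
        root = ET.fromstring(xhtml)
        lang = root.get("lang") or root.get(XML_LANG) or ""
        counts[lang.lower() or "(none)"] += 1
    return counts
```

Even with the counts in hand, the result is only a guess: a majority language says nothing certain about a document that declares none.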

And even then, although it seems reasonable that the package document should describe the languages of the publication resources and so one could fix the absence of language information on this level by looking at its sub-resources, it does not follow that the publication resources themselves can be used to determine the languages of one another, as they all exist independently on the same level.

And even then, I don't think it would be helpful very often, because the majority of books do have a dc:language, as it's required by the EPUB spec, whereas a lot of people neglect to use lang in HTML because it isn't required.

And as to whether it is actually beneficial, the general and usual opinion is, I think, that it is better to err on the safe side, i.e. having no hyphenations at all is preferable to having incorrect hyphenations.


Actually even the current practice of making the content documents inherit the language from the metadata is a bit iffy. The EPUB spec has a note clarifying this:

Publication resources do not inherit their language from the dc:language element(s). EPUB creators must set the language of a resource using the intrinsic methods of the format.

There's an even stronger, normative rule in the Reading Systems spec:

In the absence of this information in a publication resource, reading systems MUST NOT assume either the language or the base direction of that resource from information expressed in the package document (i.e., in xml:lang and dir attributes, in hreflang attributes on link elements, or from dc:language elements [epub-33]). Refer to a resource's formal specification for more information about how to handle the absence of explicit language or direction information.

However I think these might be a bit misleading because the HTML spec says that you must get the language information "from a higher-level protocol (such as HTTP), if any" to be used as the "final fallback language". My interpretation is that it's okay (or even required) to use the language in the metadata as a fallback if you consider the metadata to be such a "higher-level protocol" (though this would be somewhat debatable).

(There is one thing that Foliate does which is not quite in line with the HTML spec, though, which is that when there are multiple languages set in the metadata, it picks the first one rather than leaving the lang unset or set to the empty string.)
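The difference between the two behaviours can be sketched in Python as follows. The function `fallback_lang` and its `pick_first` flag are hypothetical and only contrast the current choice with the alternative mentioned above; Foliate's actual code is JavaScript.

```python
def fallback_lang(languages, pick_first=True):
    """Choose the fallback language from the metadata's language list.

    pick_first=True  : use the first language (the behaviour described above).
    pick_first=False : with several languages, return "" so lang stays unknown.
    Returns None when the metadata lists no language at all.
    """
    langs = [l.strip() for l in languages if l and l.strip()]
    if not langs:
        return None
    if len(langs) == 1 or pick_first:
        return langs[0]
    return ""
```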


2 participants