Hyphenation minimum left/right constraints should be language-specific #2017

Open
Omikhleia opened this issue May 6, 2024 · 4 comments
Labels: bug, enhancement

Comments

@Omikhleia
Member

Issue

On initialization, the Knuth-Liang hyphenator always defaults to (2, 2) for the left and right hyphenation minima:

SILE._hyphenators[lang] = { minWord = 5, leftmin = 2, rightmin = 2, trie = {}, exceptions = {} }

These are quite sane defaults for the algorithm, but most languages would beg to differ and use other values.

For instance, English would likely prefer (2, 3), as Babel (LaTeX) implements it:

https://github.com/latex3/babel/blob/d4d55826cd264220b7a8d92b453748564affea54/locale/en/babel-en-GB.ini#L152-L153

Besides segmentation rules and patterns, SILE should likely implement such per-language default preferences.

In Babel, some languages are at (2, 2) (e.g. Finnish), most at (2, 3), some at (1, 1), etc.

  • Appropriate values could possibly be derived from Babel's ini files for all supported languages.
  • Those values are sometimes dubious, though: for Georgian, the patterns were generated for (1, 2) according to their comments, but Babel uses (2, 2) regardless, while Typst (see below) seems to use (1, 2).
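
To make the idea concrete, here is a minimal sketch of per-language defaults with a fallback to the current (2, 2). The table `hyphenMinima` and the helper `newHyphenator` are hypothetical names, not existing SILE API, and only the values quoted in this issue are listed; a real table would need to be checked against Babel's ini files.

```lua
-- Hypothetical sketch: hyphenMinima and newHyphenator are not SILE's actual API.
-- Per-language hyphenation minima; only values quoted in this issue are listed.
local hyphenMinima = {
  en = { leftmin = 2, rightmin = 3 }, -- as in Babel's en-GB ini file
  fi = { leftmin = 2, rightmin = 2 }, -- as in Babel for Finnish
}

-- The current hard-coded defaults, used for any language not listed above.
local fallback = { leftmin = 2, rightmin = 2 }

-- Build a fresh hyphenator state for a language code.
local function newHyphenator (lang)
  local minima = hyphenMinima[lang] or fallback
  return {
    minWord = 5,
    leftmin = minima.leftmin,
    rightmin = minima.rightmin,
    trie = {},
    exceptions = {},
  }
end
```

With something along these lines, the initialization shown at the top of this issue would become `SILE._hyphenators[lang] = newHyphenator(lang)`, and unknown languages would keep today's behavior.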

Workaround

(Not a general solution)

\lua{
-- Run this after switching to English, i.e. once the "en" hyphenator has been instantiated
SILE._hyphenators['en'].rightmin = 3
}

Further thought

  • This was probably overlooked (due to other issues), but (language-specific / custom) left/right hyphen minima were actually mentioned in an existing issue (now closed): Justification for Indic scripts (Malayalam) #308, with rather extreme values in the LaTeX example (3, 5).

  • AFAIK, Typst (hypher) seems to implement these left/right minima per language (in one big file):
    https://github.com/typst/hypher/blob/6b40344866f2d7b2e156db93e91cf105cb75f7a2/src/lang.rs#L201-L205C1.

  • While at it, the current Knuth-Plass line breaker uses a single hyphenPenalty (probably as TeX does), but we could use variable penalties depending on the lengths of the initial and final segments. In other words, rather than merely trailing LaTeX (and/or TeX, which we do here), we have a way to improve on them.
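
As a rough illustration of the last point, a variable penalty could add a surcharge when either fragment left by a break is short. This is only a sketch under my own assumptions: the function `hyphenPenaltyFor` and its `extra` and `comfortable` parameters are hypothetical, and neither SILE nor TeX computes penalties this way.

```lua
-- Hypothetical sketch: not how SILE or TeX currently compute hyphen penalties.
-- leftLen, rightLen: number of characters before and after the break point
-- base: the single hyphenPenalty used today
-- extra: surcharge per character missing from the shorter fragment
-- comfortable: fragment length from which no surcharge applies
local function hyphenPenaltyFor (leftLen, rightLen, base, extra, comfortable)
  local shortest = math.min(leftLen, rightLen)
  local deficit = math.max(0, comfortable - shortest)
  return base + extra * deficit
end
```

With arbitrary example values, `hyphenPenaltyFor(2, 5, 50, 25, 4)` gives 100 while `hyphenPenaltyFor(4, 6, 50, 25, 4)` stays at the base 50, so breaks leaving a two-letter fragment would still be allowed but chosen less readily.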

@Omikhleia added the bug and enhancement labels on May 6, 2024
@Omikhleia
Member Author

Linking to #1994 and #1631 -- I do think this should be part of the same "language refactoring".
Perhaps we should have these in a dedicated "project"?

@Omikhleia
Member Author

So we might need additional per-language typography tuning files too?
E.g. for French:

{
   lefthyphenmin = 2,
   righthyphenmin = 3,
   identfirst = false,
}

(For the last one, see #1991 (comment))

The user should of course remain able to override these values.
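
A minimal sketch of how such overriding could work, assuming the tuning file is loaded as a plain Lua table: the helper `resolveTuning` is a hypothetical name, and the French values are the ones from the example above.

```lua
-- Hypothetical sketch: resolveTuning is not an existing SILE function.
local frenchTuning = {
  lefthyphenmin = 2,
  righthyphenmin = 3,
}

-- Return a new table: language values first, then any user-supplied overrides.
-- The language table itself is left untouched.
local function resolveTuning (languageTuning, userOverrides)
  local resolved = {}
  for key, value in pairs(languageTuning) do
    resolved[key] = value
  end
  for key, value in pairs(userOverrides or {}) do
    resolved[key] = value
  end
  return resolved
end
```

For example, `resolveTuning(frenchTuning, { righthyphenmin = 2 })` would keep the language's left minimum and apply the user's right minimum, without mutating the shared defaults.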

@alerque
Member

alerque commented May 28, 2024

Yes, this setting should be tunable per language.

And yes, the issues related to the language code are so intertwined that they are hard to track and work on. It's hard to sit down and get my head around the problem or know when an individual issue is actionable. Grouping them all in a "project" sounds like a good idea.

@Omikhleia
Member Author

Linking to #2001 (comment): it seems that, due to another bug, we are by default not at (2, 2) but likely at (2, 3).
