Hyphenation problems in Portuguese #2001
Comments
Let's split PT/RU into different issues, because language-specific problems don't always get resolved at the same time or via the same PR. Let's make this issue the PT one, please. For hyphenation issues, the first thing to check is whether we even have break points to work with. Evidently not:
$ ./sile
SILE v0.14.17.r373-g72965ad (LuaJIT 2.1.1700206165) [Rust]
> SILE.showHyphenationPoints("quando", "pt")
quando
> SILE.showHyphenationPoints("apaziguam", "pt")
apa-zi-guam
So at least for "quando", for some reason the patterns are not allowing any hyphenation there. According to PT language rules, where should the points be? The screenshots are kind of hard to work with for this, because I can't tell whether other metrics (like not having any stretch available) might be contributing to poor break choices. Also, I can't even be sure I'm typing the same text as you are entering in many cases. Can you post the actual XML/SIL input files you're testing too? |
Unless I misunderstood the screenshot, it doesn't look like a hyphenation issue, but rather a justification issue (overfull lines). These examples have fairly short columns: have you tried loosening the justification constraints? As in TeX, by default, overfull lines are preferred over underfull lines when the constraints (on space stretching/shrinking, etc.) cannot be respected. You can try tweaking, in order:
There are other settings (pretolerance, and even the space stretchability) that might be changed too, but they are more difficult (IMHO) to tweak "correctly". If this is indeed the issue at stake, then it pops up quite regularly, e.g. see #620 (comment). I know the documentation mentions that we use the TeX paragraph shaping and also briefly explains the settings... Note that making these settings dynamically adaptable (e.g. depending on font size and target line width) could be an interesting exercise for an experimental package, as a possible helper to minimize the occurrence of these situations. We can easily modify the typesetter to account for such dynamic approaches, which was harder in old TeX (i.e. at least before LuaTeX added hooks in many places, though I don't know how much "hackability" it would now have here). |
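To make the overfull-versus-stretch trade-off above concrete, here is a minimal Python sketch of the classic TeX badness estimate (roughly 100 times the cube of the stretch ratio, capped at 10000). It is not SILE's code: the function names, the tolerance of 500 and the point values are assumptions chosen purely for illustration, and only the stretching case is modelled.

```python
INFINITELY_BAD = 10000  # TeX's cap for a line that cannot be set acceptably


def badness(shortfall, stretch):
    """Approximate TeX badness for a line that must stretch by `shortfall`."""
    if shortfall <= 0:
        return 0  # nothing to stretch (shrinking is not modelled here)
    if stretch <= 0:
        return INFINITELY_BAD  # no stretchability available at all
    return min(INFINITELY_BAD, round(100 * (shortfall / stretch) ** 3))


def is_feasible(shortfall, stretch, tolerance, emergency_stretch=0.0):
    """A break is acceptable when the line's badness stays within tolerance.

    `emergency_stretch` is modelled as extra stretchability added to the line.
    """
    return badness(shortfall, stretch + emergency_stretch) <= tolerance


# Hypothetical narrow-column line: 12pt too short, with 3pt of glue stretch.
print(badness(12, 3))                                   # 6400
print(is_feasible(12, 3, tolerance=500))                # False
print(is_feasible(12, 3, tolerance=500, emergency_stretch=9))  # True
```

With these made-up numbers, 9pt of extra stretch brings the badness from 6400 down to 100, so a break that was rejected becomes acceptable. In narrow columns few lines have enough natural stretch, which is why loosening these constraints tends to matter more than the hyphenation points themselves.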
(BTW, regarding quando, Typst doesn't hyphenate it either at this point (see https://typst.app/tools/hyphenate/). That's quite logical, as it uses the same TeX hyphenation patterns as SILE, but at least it shows that the behavior comes from these original patterns and is not a SILE-specific issue.) |
I ran
|
I've tested this and can confirm that it sometimes solves the problem, thanks. |
But what should they be? SILE and Typst both use the TeX patterns, and both programs show the same hyphenation points here, don't they? |
I forgot to say: they should be:
|
SILE is using (a Lua port of) https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pt.tex So this is likely an issue for https://github.com/hyphenation/tex-hyphen (though it would be easier then if SILE was able to use TeX patterns directly rather than having its own error-prone re-implementation as a Lua table, or to ship with a conversion script). |
(This being said, one can also register exceptions manually, with |
Unless there's something clear to do here, I am going to suggest closing/rejecting this issue, which has been inactive for 2+ months.
|
Just throwing this out there: we are in no way limited to using the hyphenation rules from tex-hyphen as is. We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed, and even use different hyphenation code for different languages. Particularly with the Rust wrapper, there are several libraries we could surface. If something is still wrong here (@jodros, any references to official grammar guides and/or other discussion on implementations anywhere that help confirm this is a bug?), I'd like to actually look into what it is. There may always be exceptions not covered by a codifiable rule, but even in that case we can add exceptions by default if they are well known and agreed on. |
I'm glad to read this. Well, I've just taken a look at
Which gave me: The only rule I found missing in the file is Now, regarding the remaining syllables as |
Yes, and I guess I quite understand it now.
Anyway, since we are using the default hard-coded (2, 2), why don't we get "eco-no-mi-co"? Ho ho, weird indeed... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based? See:
I think the issue is here: sile/core/hyphenator-liang.lua, lines 95 to 101 in 91cf578
Before applying the constraints, we have
After applying the leftmin
And after applying the rightmin
So we think we are using (2, 2), but we actually behave as (2, 3)... Which might be why #2017 failed to be noticed (English also being recommended at (2, 3) for standard typography...): A bug was hiding another. I think the code should be:
But then I no longer understand the root problem that led me to open #2017; I'll re-investigate it... There might be more than meets the eye here... Any thoughts and insights?
|
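The (2, 2) versus (2, 3) symptom described above can be reproduced with a small Python sketch. This is not the code from hyphenator-liang.lua: the function names, the break-point representation (number of characters to the left of each break) and the unaccented test word are assumptions made for illustration only.

```python
def apply_minima(word, breaks, leftmin, rightmin):
    """Keep a break only if both fragments reach the minimum lengths.

    Each entry in `breaks` is the number of characters left of the break.
    """
    return [i for i in breaks if i >= leftmin and len(word) - i >= rightmin]


def apply_minima_off_by_one(word, breaks, leftmin, rightmin):
    """Same filter with a strict comparison on the right-hand side.

    This behaves as if rightmin were one larger than requested.
    """
    return [i for i in breaks if i >= leftmin and len(word) - i > rightmin]


def render(word, breaks):
    """Insert hyphens at the given break positions."""
    points = set(breaks)
    return "".join(("-" if i in points else "") + ch for i, ch in enumerate(word))


word = "economico"
candidates = [1, 3, 5, 7]  # e-co-no-mi-co before any minima are applied

print(render(word, apply_minima(word, candidates, 2, 2)))             # eco-no-mi-co
print(render(word, apply_minima_off_by_one(word, candidates, 2, 2)))  # eco-no-mico
```

The only difference between the two filters is `>=` versus `>`, which is the sort of slip that 1-based indexing makes easy. It also shows why English, usually set at (2, 3), would hide the problem.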
By the way, it's not missing, unless I am mistaken: it's just that our current hyphenation patterns (coming from TeX) were likely crafted based on Portuguese from Portugal, and all dictionaries seem to have "económico"... But according to some online resources, "econômico" is from Brazil (grafia no Brasil). It could be interesting to confirm. And if so, I still think it would be a good question for https://github.com/hyphenation/tex-hyphen ... Because even if "we are in no way limited to using the hyphenation rules from tex-hyphen as is", the general solution here would be to support BCP47 and possibly have different hyphenation patterns for different language variants. Admittedly, here it is quite possible that the introduction of this "1nô" in standard Portuguese wouldn't harm it much (I don't know!), but the general picture is that some specificities might need different patterns.
|
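For readers unfamiliar with how a single pattern such as "1nô" changes the outcome, here is a minimal Python sketch of Liang-style pattern matching. It is a simplified illustration, not SILE's implementation: the two base patterns are toy stand-ins (assumed, not taken from hyph-pt.tex), and the leftmin/rightmin filtering is deliberately left out.

```python
import re


def parse_pattern(pattern):
    """Split a pattern like '1nô' into its letters and inter-letter weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            weights[position] = int(ch)  # weight for the gap before the next letter
        else:
            position += 1
    return letters, weights


def break_points(word, patterns):
    """Return indices k such that a break is allowed before word[k]."""
    padded = "." + word + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                # The highest weight seen at each gap wins.
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    # Odd scores allow a break, even scores forbid it.
    return [k for k in range(1, len(word)) if scores[k + 1] % 2 == 1]


def hyphenate(word, patterns):
    points = set(break_points(word, patterns))
    return "".join(("-" if i in points else "") + ch for i, ch in enumerate(word))


toy_patterns = ["1co", "1mi"]  # hypothetical stand-ins for the real pattern set
print(hyphenate("econômico", toy_patterns))            # e-conô-mi-co
print(hyphenate("econômico", toy_patterns + ["1nô"]))  # e-co-nô-mi-co
```

Without a pattern that matches the accented letter, no break can appear before "nô". Whether adding such a pattern has side effects on European Portuguese words would need checking against the full pattern set.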
So let's recap, as the issue has grown long and covers several things:
|
Interesting note.
Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...
Since most of the issues I had were solved by changing `linebreak.emergencyStretch`, this is the only remaining point to take care of now. |
Noted: hyphenation/tex-hyphen#61 |
Likely: I came across "antónimo" vs. "antônimo" in a translation file. |
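If variant-specific pattern sets were ever introduced for such pt-PT/pt-BR differences, the lookup could fall back from the most specific BCP47 tag to the bare language. The Python sketch below only illustrates that idea: `resolve_patterns` and the set of available tags are hypothetical, nothing here reflects how SILE currently selects patterns, and tags are assumed to be already normalized in case.

```python
def resolve_patterns(lang_tag, available):
    """Return the most specific available tag for `lang_tag`, or None.

    Subtags are dropped from the right: 'pt-BR' falls back to 'pt'.
    """
    parts = lang_tag.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in available:
            return candidate
        parts.pop()  # drop the last subtag and try the broader tag
    return None


available = {"pt", "pt-BR"}  # hypothetical set of shipped pattern files
print(resolve_patterns("pt-BR", available))  # pt-BR
print(resolve_patterns("pt-PT", available))  # pt
print(resolve_patterns("ru", available))     # None
```

A fallback of this kind would let a document tagged `pt-BR` pick up Brazilian patterns when they exist while leaving plain `pt` documents on the current patterns.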
I'm gonna make a list with all major differences soon... |
As I understand it, everything this issue needs to track is taken care of, except perhaps documentation on all the things that can be done to cope with narrow text width as gracefully as possible. Let's open an issue specific to that. |
One word I noticed that also has some trouble being hyphenated is `quando`. Yes, I know the first example isn't the best in terms of readability, but it's what I have right now since I'm trying the `parallel` package; I could give more examples for Russian soon...