Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve compatability of unicode grapheme width #4223

Open
pascalkuthe opened this issue Aug 30, 2023 · 9 comments
Open

Improve compatability of unicode grapheme width #4223

pascalkuthe opened this issue Aug 30, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@pascalkuthe
Copy link

pascalkuthe commented Aug 30, 2023

I mentioned this on matrix. I personally consider this a bug but it doesn't really fit the the template and depending on your point of view it could also be viewed as a feature request.

While trying to fix the way helix render multi codepoint graphemes (currently truncating afte rthe first char) I noticed that my fix breaks rendering on wezterm. The reason is that wezterm changes the traditional cell width that would be calculated by wcchar (which would be 5 for for a mean_heart_woman emoji for example) by capping the width of a single grapheme to two.

This breaks existing applications (like kakune) and puts me as an application author in a rough spot where I can either use traditional width calculation and get the correct result on most emulators (kitty, alacrity, xterm, gnome terminal, ...) or use the width calculation used by wezterm and get the correct result on wezterm.

Using the wrong definition of width results in horrible visual artifacts (random highlights, disappearing and ghost characters,...).

Traditionally terminal emulators have simply computed the width of unicode cosepoints individually (merging zero width codepoints) without any grapheme segmentation. I personally feel that this is correct because the unicode standard only defines width on a codepoint level (with the exception of emoji representation which affects two codepoints at once buy that still kind of works because the variant selectors is zero width so its atleast a single cell).

The unicode standard does not define any notion of width for graphemes as far as I am aware (basically saying its font dependent). That makes a lot if sense because some graphemes can get really crazy (able to cover multiple lines, get super wide etc.). The width limit of 2 works well for emojis but not for other graphemes (to be fair other graphemes are kind of nieche anyway).

The kitty emulator handles this by keeping the cell width definition unchanged.The unicode segmentation is then simply handled at the font level so multi grapheme emojis still get merged but they simply endup with some trailing empty space.

Would you perhaps consider adoption a similar mechanism for compatibility? If you prefer the current behaviour perphase some kind of escape code based negotiation could be possible like there is for unicode version support?

Right now I am in a bit of an awkward spot where I as an application developer habe to choose which emulator I want to support properly.

@pascalkuthe pascalkuthe added the enhancement New feature or request label Aug 30, 2023
@pascalkuthe pascalkuthe changed the title Improve comparability for unicode grapheme width Improve comparability of unicode grapheme width Aug 30, 2023
@pascalkuthe pascalkuthe changed the title Improve comparability of unicode grapheme width Improve compatability of unicode grapheme width Aug 30, 2023
@wez
Copy link
Owner

wez commented Aug 30, 2023

Can you share the codepoints from the problematic sequences? That will make it easier for me to look into this.
In theory, wezterm should be compatible with wcswidth. We use https://github.com/ridiculousfish/widecharwidth which provides a wcwidth implementation, but not a wcswidth implementation.

@pascalkuthe
Copy link
Author

pascalkuthe commented Aug 30, 2023

So one example is 👩‍❤️‍👨

Its madeuopof 3 seperate emojis (woman, heart, man) joined by a zero width joiners (plus a variatuon selector for the heart).

In uncuode 9 mode I expect this emoji to take up 5 cells.
In unicode 14 mode I expect it to takeup 6 cells.
In wezterm it takesup 2 cells.

For reference: Traditionally terminal emulators don't support unicode swgmentatuin/graphemes so they render each emoji speratly (that's where the width of 5/6 comes from si.ply the sum of all codepoints widths) For example in alacrity:.

264171016-ca9aeba7-7b98-4340-8b7f-67d8fa8edca7
(Ignore the rest around it the just is that the emojis get render aeperartly).

Kitty keeps the traditional wcchar based cellwidth (and just segments at the font level):

264175970-d755633e-cfb0-4659-94e1-2338b3532cd1(1)

Your wcchar based implementation is quite correct and wcchar dies actually compute the right thing. The problem is that you don't return the result of the sum of all wcchar callsdirectly but instead do width.min(2).

AFAIK wcswidth just calls wcwidth for every Unicode codepoint in the sequence and returns the sum (with the special case of emoji presentation, my crate handles that with a fairly simple one char lookahead without any grapheme segmentation)

I have my own width implementation (unfinished ignore the rest of the repo please :D) which heavily references yours and behaves identical to termwiz except for this upper bound I descrived. See the relevant testcase here:

https://github.com/pascalkuthe/grapheme-width-rs/blob/aef038e78e41f3ec33da04d985aeac1476e2fe02/src/test.rs#L52

@wez
Copy link
Owner

wez commented Aug 30, 2023

FWIW iterm2 has similar behavior to wezterm:
image

have you tried helix and kakoune on that terminal?

I'm open to removing the .min(2) from here, but am unsure of the full ramifications at the moment because it has been some time since I was last deep in this and will need to refresh my memory!

width.min(2)

@kovidgoyal
Copy link

Note that kitty's current behavior with ZWJ is wrong. It's on my TODO list to implement, see kovidgoyal/kitty#3810 at which point it will render in two cells as well. IMO, rendering in two cells for extended ZWJ sequences is the correct behavior. I have not implemented this yet, partially because it's a fair bit of work in kitty internals and partially because as you point out, most terminal emulators don't implement it and almost no terminal programs implement it, so implementing it would break many things.

@wez
Copy link
Owner

wez commented Aug 30, 2023

I'm leaning towards expanding the unicode version escape sequences (https://wezfurlong.org/wezterm/config/lua/config/unicode_version.html) to allow applications to select the behavior they desire; details need designing.

There's a cell structure used by wezterm that builds on a tiny string type that uses a single bit to represent the column width; that will take a bit of effort to adjust for this concept.

@kovidgoyal
Copy link

I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.

When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()

@pascalkuthe
Copy link
Author

I would recommend against doing that. It means, for example, one can no longer cat text into a terminal and have predictable results. Imagine a table in a markdown file, its columns will no longer be aligned depending on which terminal you cat it into and even details of the terminals internal state.

When it comes to width considerations, really the only workable longterm solution is for all terminals to converge to a common standard for wcswidth()

Yeah this is really the crux of the issue. I agree that for ZJW emojis specifically width 2 makes sense (not so much for other graphemes tough IMO) but really as an app developer I can't take the specifics of various emulators into account. It would not be ideal to essentially test for know TERM_PROGRAMM values at startup to figure out what width to use (even if I did implemennt that, such a solution is pretty brittle over ssh). It's already a bit of a pain that kitty defaults to respecting emoji presentation/unicode 14, my code supports that but I have to figure out of if the emulator does and ultimately expose a user setting for that.

I think the escape sequences approach is pretty nice because atleast it's much more robust regarding ssh and similar. If all terminals that supported more extensive unicode handling (emoji presentation, zwj emojis,...) allowed me to opt out by sending a escape sequences on startup that would atleast prevent my tui from breaking depending on the emulator it's running on.

Ideally there would be some mechanism to go the other way too (the tui can figure out what the emulator supports). Otherwise I would just keep my tui at the lowest common denominator (for example alacritty has no plans to support zwj emojis or even emoji presentation AFAIK).

Terminfo would seem like a good candidate but that also has the ssh problem so that is a bit brittle but atleast an improvement over querying TERM_PROGRAM.

@kovidgoyal
Copy link

kovidgoyal commented Aug 30, 2023 via email

@wez
Copy link
Owner

wez commented Sep 21, 2023

Looking into this more, it's a PITA to undo the grapheme clustering work, so I've decided not to do it.
Separately over in:

wezterm now reports "permanently enabled" to a DECRQM query from an application that wants to reason about the grapheme clustering behavior. wezterm, contour and (soon: there's a PR open) foot implement this query. Some terminals support setting this parameter at runtime. You can read more about this sequence over in https://github.com/contour-terminal/terminal-unicode-core

wez added a commit that referenced this issue Sep 28, 2023
This isn't currently possible, but
#4223 will introduce that
possibility.  This commit causes width > 2 to "overflow" into
the heap-stored variant.

Sequences that are long enough to produce a column width > 2 are
likely to already be long enough in bytes to overlow into the
heap-stored variant anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants