Unicode and zero width characters #26
Comments
Hi Gerald! Thanks for the question. My top priority is to prevent the terminal display from becoming garbled. When I added Unicode support, since I was not sure how to deal with these combining characters, I decided to turn them into a question mark, as explained in README.md#Text display rules. So this is not supported at the moment, but I may be able to figure something out.
Combining characters overlay the previous character, hence the zero width. Where in the code would I find this? What about the space between the glyphs that should be there? Is this because of the column style layout?
Hi Gerald,
I don't know what may be causing this. There is nothing special I do to add space between glyphs. This may be caused by the terminal application itself.
Okay. So I created a data struct representing the contents of a displayed cell, called The screen is represented by a grid of The current limitation is The functions in the So, in order to support combining characters, the following has to change:
Cheers.
http://www.unicode.org/reports/tr29/ section "3 Grapheme Cluster Boundaries"
It definitely is grapheme based. The combining character is part of the grapheme. It sounds like the current implementation is codepoint based, which is almost always the wrong way to do it, but so, so easy. It would be nice if it was as simple as changing a TCellChar into a string. If utf8proc http://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html could be brought in to do all the unicode handling behind the scenes, it would probably simplify things.
This turned out to be a lot easier than I anticipated. Zero-width characters simply get combined into the previous cell in TText::eat, or are not shown at all. The zero width joiner is always discarded so that it won't merge characters together, changing the width of a whole string. In order to be able to combine several characters in the same screen cell, the size of TCellChar has been raised to 8 bytes. This allows a minimum of one combining character, which should be fine for most non-degenerate use cases. But since we use UTF-8, several codepoints may fit in there. It also preserves the assumption that TCellChar can be cast to a primitive type, so there's not much code that has to be changed elsewhere. The only breaking change is TText::eat, where the 'width' and 'bytes' parameters are now indexes so that we can look back at the previous cell when we find a zero-width character. But this actually makes things simpler for the invokers of TText::eat. So it's a win-win change. See #26.
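To make the merging rule above concrete, here is a minimal sketch of it. It is not the actual TText::eat code: `Cell`, `putChar` and their parameters are hypothetical names invented for illustration, the caller is assumed to have already measured each character's width (for example with `wcwidth()`), and only widths 0 and 1 are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a cell's text storage: up to 8 bytes of UTF-8,
// zero-padded. Not the real TCellChar.
struct Cell {
    char bytes[8] = {};

    // Number of bytes currently in use.
    std::size_t length() const {
        std::size_t n = 0;
        while (n < sizeof(bytes) && bytes[n] != '\0') ++n;
        return n;
    }

    // Appends one UTF-8 sequence; returns false when it does not fit.
    bool append(const std::string& utf8) {
        std::size_t used = length();
        if (utf8.empty() || used + utf8.size() > sizeof(bytes)) return false;
        std::memcpy(bytes + used, utf8.data(), utf8.size());
        return true;
    }
};

// Places one character (given as its UTF-8 bytes) on a row of cells.
// 'col' is the index of the next free cell and is advanced by 'width'.
void putChar(std::vector<Cell>& row, std::size_t& col,
             const std::string& utf8, int width) {
    assert(width == 0 || width == 1);  // double-width text is out of scope here
    static const std::string zwj = "\xE2\x80\x8D";  // U+200D ZERO WIDTH JOINER
    if (utf8 == zwj) return;  // always discarded, so it cannot merge glyphs
    if (width == 0) {
        // Zero-width: merge into the previous cell, or drop it when there is
        // no previous cell or no room left in it.
        if (col > 0) row[col - 1].append(utf8);
        return;
    }
    if (col < row.size()) {
        row[col] = Cell{};
        row[col].append(utf8);
        ++col;
    }
}
```

Because `col` never advances for a zero-width character, the width of a string remains the sum of the widths of its characters.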
After some testing I have found Wikipedia articles where 8 bytes were not enough to fit all the diacritics in one cell. So I raised it to 12 bytes, where at least 2 combining characters can fit in the worst case. This should be enough for most real-world, natural language use cases. I don't care about zalgo. 12 bytes and alignas(4) is still a sweet spot for performance where most operations (including comparison) can be carried out in registers. It also preserves sizeof(TScreenCell) == 16, although that struct is likely to become larger in the future if true color support is added. See #26.
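The size arithmetic in this commit message can be checked with a small sketch. `CellText`, `ScreenCell`, `setText` and `sameText` are hypothetical stand-ins, not the real TCellChar or TScreenCell definitions, and the 4-byte attribute field is an assumption made only so that the total comes to 16 bytes. Since a UTF-8 code point takes at most 4 bytes, 12 bytes hold a base character plus two combining characters in the worst case.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical 12-byte text storage, aligned to 4 bytes.
struct alignas(4) CellText {
    char bytes[12];
};

// Hypothetical screen cell: text plus an assumed 4-byte attribute.
struct ScreenCell {
    CellText text;
    std::uint32_t attr;
};

static_assert(sizeof(CellText) == 12, "text storage is 12 bytes");
static_assert(alignof(CellText) == 4, "text storage is 4-byte aligned");
static_assert(sizeof(ScreenCell) == 16, "whole cell stays at 16 bytes");

// Stores a UTF-8 sequence, zero-padding the remainder so that two cells with
// the same text are byte-for-byte identical.
inline void setText(CellText& cell, const std::string& utf8) {
    assert(utf8.size() <= sizeof(cell.bytes));
    std::memset(cell.bytes, 0, sizeof(cell.bytes));
    std::memcpy(cell.bytes, utf8.data(), utf8.size());
}

// Fixed-size comparison of 12 bytes; compilers can usually lower this to a
// few word comparisons, though that depends on the compiler and flags.
inline bool sameText(const CellText& a, const CellText& b) {
    return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
}
```

Zero-padding in `setText` is what makes the bytewise comparison valid; without it, stale bytes after a shorter string would make equal texts compare unequal.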
Thank you everyone for your suggestions. I tried replacing You could argue that I'm coupling the system with an implementation detail, or doing premature optimization. But the truth is that representing each cell with an individual string is not a good solution to this problem. I'm pretty sure not even GUI applications store text this way. Does Turbo Vision need to delegate Unicode processing to an external library? Actually, it doesn't. Turbo Vision is not a text editing component. What it needs to know is how text is displayed on the terminal, and this is platform-dependent, while the Unicode standard is not. So it doesn't help me at all to know that "👨👩👧👦" is a grapheme cluster if the terminal will display it differently: Even if it's true that an arbitrary number of codepoints can fit in a single cell, I realized that:
So what I did was:
This preserves the already present assumptions, the most important of which is that the width of a string is the sum of the width of its characters. The performance impact of this feature is also minimal, because No changes are required in the source code of Turbo Vision applications, except those using Terminals which do not respect the result of
Should Turbo Vision use an external Unicode library to determine that these characters have a width of 1? Tilde is another application with good Unicode support. It treats these characters as one column wide instead of zero. Guess what, it suffers from screen garbling on Xterm and Alacritty. So you can see how difficult it is to get this right. I suggest you upgrade to the latest commit and try again. The Turbo text editor has also been updated. At this point, the most improvable thing is string iteration with Cheers.
Users don't care about graphemes and code points. Users do care about their experience. They just want to have all letters/signs required by their language working :) Perhaps limiting the number of code points per screen cell may play a role in the future if real-world problems arise that may be solved by many-many code points per cell. But history shows that looking too far into the future is not always the best option. Microsoft has decided to look into the future by choosing UTF-16 as the standard for their Winapi, and now they live with the most awkward Unicode representation of all.
The solution to this is to return a boolean from TText::eat that indicates whether it should be invoked again or not. This makes it simpler to use in some cases and more complex in others. Related: #26.
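The calling pattern this describes might look roughly like the sketch below. It does not reproduce the real TText::eat signature: `eatOne`, `drawText`, the use of `std::string` for cells, and the simplification that every character is one column wide are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Byte length of a UTF-8 sequence judged from its lead byte (1 if invalid).
inline std::size_t utf8Length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Consumes one character of 'text' into 'cells', advancing both indexes.
// Returns false when nothing was consumed (no text left or no free cell),
// which tells the caller to stop invoking it.
bool eatOne(std::vector<std::string>& cells, std::size_t& col,
            const std::string& text, std::size_t& pos) {
    assert(col <= cells.size() && pos <= text.size());
    if (col == cells.size() || pos == text.size()) return false;
    std::size_t len = utf8Length(static_cast<unsigned char>(text[pos]));
    if (len > text.size() - pos) len = text.size() - pos;  // truncated input
    cells[col].assign(text, pos, len);
    ++col;
    pos += len;
    return true;
}

// Typical caller: keep eating until told to stop; returns columns filled.
std::size_t drawText(std::vector<std::string>& cells, const std::string& text) {
    std::size_t col = 0, pos = 0;
    while (eatOne(cells, col, text, pos)) {}
    return col;
}
```

With the boolean result, the caller's loop needs no separate bounds checks on `col` and `pos`; the trade-off is that callers who want to eat exactly one character must remember to inspect the return value.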
This looks good in my quick tests. Thanks for the work!
With zero width characters being drawn as a question mark, I'm wondering how to display something like the images attached. In this case, the zero width character should place a dot over the last symbol (image 1), but instead displays a single column wide question mark (image 2).
There is also extra spacing in image 2 that shouldn't be there.
Is there a way to get the string displayed properly?
This is the string
"\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa4\x82\x0a"
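For reference, the bytes in this string can be decoded into code points with a small sketch like the one below. `decodeUtf8` is a hypothetical helper written for this example, not part of Turbo Vision, and it does not reject overlong encodings or surrogates. The string decodes to U+0915, U+093E, U+091A, U+0902 and U+000A. U+0902 (DEVANAGARI SIGN ANUSVARA) is the nonspacing mark that should appear as a dot over the preceding letter, while U+093E (DEVANAGARI VOWEL SIGN AA) is a spacing combining mark; how a particular terminal measures either of them is a separate question that this sketch does not settle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into code points. Malformed input yields U+FFFD.
// Simplified: overlong forms and surrogates are not rejected.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead < 0x80) { len = 1; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        assert(len >= 1 && len <= 4);
        if (len > s.size() - i) { out.push_back(0xFFFD); break; }  // truncated
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Usage: `decodeUtf8("\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa4\x82\x0a")` returns the five code points listed above, which makes it easy to see that the question mark in the report stands in for the final combining sign.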