Unicode name comparison (keys, tables). #941

cyanskies · 2022-11-25T04:35:40Z

cyanskies
Nov 25, 2022

String comparison for keys and tables probably needs to be fined more explicitly. When comparing a key for duplication, we can implement string comparison in three ways.

Strings can be compared byte by byte, which can easily produce incorrect results regardless of encoding (same visible glyphs can compare unequal).
Strings can be compared using canonical normalization, strings that generate the same visible glyphs are equal.
Strings can be compared using compatibility normalization, as above but also similar glyphs are equal: (ℌ == H), (i⁹ == i9).

The normalization algorithm is pretty heady-duty and will require implementations for some languages to include external Unicode libraries to handle.

I'm curious what others think about this. How do the current implementations handle this?

abelbraaksma · 2022-11-25T19:59:08Z

abelbraaksma
Nov 25, 2022

Yes, this is a good subject to bring up. I think that, by default, ordinal comparison is the way to go. Implementations may offer other normalisation forms, or localised comparisons.

Alternatively, the most commonly used normalisation form, like the one used in internet specifications, is NFC. From here, quote:

The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Character Model for the Word Wide Web: String Matching and Searching [CharMatch] for more background.

2 replies

eksortso Nov 26, 2022

If I were concerned primarily with the shortest unique keys (including names of tables), then I would back NFC without batting an eye. But in practice, would that be smart to do? My main concern is that a developer will write a name for a key in an un-normalized way, in both the TOML template and the consuming code, and their code won't work because it doesn't use the NFC form of the key and thus doesn't match what the parser put in the table.

I'm leaning towards us recommending against normalizing keys and for using binary comparisons, for this reason. If every TOML document were written by a non-technical user based on a configuration spec or template, I'd recommend that the spec or the template ought to be written so users can either safely write the keys in an unambiguous way, or copy and paste the key names into the code or config in a consistent way so that binary comparisons will succeed.

But I don't know how prevalent un-normalized text is in UTF-8 for different languages and normalizations. Would this approach be a lot more difficult to use for certain languages that typically aren't written in NFC in practice?

abelbraaksma Nov 26, 2022

@eksortso the main issue is precisely that: you write the key in your prog language, and you read it from a file edited by someone else. It highly depends on editor, localization, OS version etc etc how characters are stored. I.e., I can write é (single char) here, but also é (two chars, as e and U+301). How it is saved, I don't know, but you won't know the difference from looking.

(to bring the point home, just copy/paste these two letters here, and you'll see they're different: https://onlinetextcompare.com/).

So, if we were to just do binary comparison, this gives problems, as the code will have a different key than the file. However, if instead we'd do NFC normalization on key comparisons (which means: ignore how it is saved, but ensure normalized comparison), then the problem goes away and both keys are equal.

In terms of prevalent: highly prevalent. Most Unicode text is not normalized, unless you specifically ask to do that. But, if you only have binary comparison, you cannot know what to normalize it into, as you don't know how the prog language itself normalizes. In other words: without normalization, all bets are off (I know, I suggested ordinal comparison above, but I think that was the wrong advice).

ChristianSi · 2022-11-27T09:13:05Z

ChristianSi
Nov 27, 2022

If I see it correctly, NFC and NFD produce the same results regarding string equality? That this, any two strings that are identical under one of them are also identical under the other?

If that's the case, it would be sufficient if implementations use one of these normalization forms to detect duplicate keys/table names. Which of them is used doesn't make a difference here.

On the other hand, NFKC and NFKD normalize more agressively, e.g. converting '…' to '...'. That's clearly over the top and so should not be used.

Another question is whether and how to normalize strings before handing them over to the user/the calling application. The case can certainly be made that they should, since e.g. a user looking for the key (or value) "café" (4 code points) won't find it if the file has "café" (5 code points) and no normalization is performed (following @abelbraaksma's example).

I think a reasonable recommendation would be that TOML parsers should apply NFC to detect duplicate keys and before handing any parsed string over to the calling app, and that TOML writers should apply NFC before writing a TOML bytestream to disk (or sending it over the wire). However, I wouldn't make it a strict requirement ("must" instead of "should") since it may be too high a burden on some implementations.

4 replies

arp242 Nov 27, 2022

I think a reasonable recommendation would be that TOML parsers should apply NFC to detect duplicate keys and before handing any parsed string over to the calling app, and that TOML writers should apply NFC before writing a TOML bytestream to disk (or sending it over the wire). However, I wouldn't make it a strict requirement ("must" instead of "should") since it may be too high a burden on some implementations.

It would require Unicode tables and non-stdlib dependencies in some cases. Right now you can do a "header-only" TOML implementation in C if you wanted to, with required normalisation this would be much harder. In Go I'd have to import x/text which is 38M (it doesn't need to be this large, but like most Unicode libraries it implements more than just normalisation, so this kind of size is about on-par).

We could forbid the Nonspacing_Mark, Enclosing_Mark, and Spacing_Mark categories to exclude combining marks, which should solve most issues. There's still some potential for confusion here (example), but without combining marks it's not a huge practical problem in reality.

ChristianSi Nov 28, 2022

Note that, as others have already mentioned, this affects quoted keys (which any contain arbitrary Unicode characters, except for a few control characters) exactly in the same ways as unquoted keys. So re-restricting the set of characters allowed in unquoted keys won't solve it at all.

eksortso Dec 3, 2022

We would only ever need to normalize keys, right? I wouldn't touch string values otherwise, even if NFC reduced their memory footprint. I can't say anything about Latin languages or English, but I know nothing about, say, how Hangul is typically normalized in practice.

abelbraaksma Dec 6, 2022

@eksortso yes, only keys (or anything we add to the spec that involves string comparison).

marzer · 2022-11-27T13:13:57Z

marzer
Nov 27, 2022

Probably worth pointing out that TOML is approaching 10 years old and has been UTF-8 for that entire time, without anyone ever reporting an issue that would have been solved by key string normalization (that I could find, anyway). I would guess that ordinal comparison (which is what most will be doing I suspect) is fine. Personally I'd prefer if this was deferred to the applications on a per-case basis, or remained a recommendation at most.

5 replies

abelbraaksma Nov 27, 2022

Good point. But with TOML now allowing bare keys with any letter-like character, chances are that it becomes more apparent that certain keys that look the same, aren’t:

café = “open”
café = “closed”

Is this allowed? With ordinal comparison these are distinct and the file is valid. With normalised comparison, they are equal, and a “duplicate key” error is raised.

abelbraaksma Nov 27, 2022

Funny: I tried to copy/paste the characters on iPad. This erases the difference (that is, iPad normalises). But doing the same on Windows, maintains the difference.

Which just goes to show how subtle this can be. You could open a file in some text editor, save it, and keys you didn’t touch may have changed (just because that editor is ‘smart’ and applies NFC or other). The (compiled) application expecting certain keys will now not find the keys anymore.

marzer Nov 27, 2022

But with TOML now allowing bare keys with any letter-like character [...]

Lol, yeah. Guess I only have myself to blame for that!

marzer Nov 27, 2022

(Jokes aside, my gut inclination is still to have it be a recommendation at most, and only elevate it to a specified requirement if there is real-world evidence of it causing issues that aren't just "grown in a lab".)

abelbraaksma Nov 27, 2022

I agree that it should be a recommendation. Besides, prior to that change, quoted keys had the same issue. Also, specific implementations or applications may have different requirements, like localisation on top of normalisation. Like “oe” and “ö” in German being treated as equal.

We cannot possibly cover all of that, unless we would add certain metadata instructions. My vote would be for a default recommendation of NFC, and optionally others, while still allowing fallback to ordinal comparison.

Note that both Windows and Linux come shipped with ICU, so in practice, implementations don’t have to ship this as part of the library, they can just rely on it being provided by the OS. And where absent, fallback to ordinal.

eksortso · 2022-12-12T00:43:32Z

eksortso
Dec 12, 2022

Let's step back a moment and codify what we have discussed so far. I think we ought to recommend specifically NFC as the normalization format to use for keys in tables, but let us acknowledge that ordinal comparisons may be used, and in either case, using two different forms of a key's name is bad practice.

The following text can be inserted into toml.md in the Keys section, after the paragraph and example stating that defining a key multiple times is invalid, which is very apropos:

Because some keys look the same with different Unicode code point sequences, parsers should compare the NFC forms of keys, instead of just their code points. Likewise, encoders should normalize the keys they write using NFC.

# DO NOT DO THIS
# prénom = "Françoise" #but assigned with two different forms
"pr\u00e9nom" = "Françoise"        ## NFC form
"pr\u0065\u0301nom" = "Françoise"  ## NFD form, looks the same as NFC form

The example I used contains two characters, "é" and "ç", which each have different forms under NFC and NFD. The name isn't significant, in case you're wondering about that, but I did write it using different forms of the cedilla. Also, since I used quoted keys, this can be tested in TOML v1.0.0 as it stands.

What do you think? What would you change?

12 replies

marzer Dec 13, 2022

Which may actually be the reason why this whole thing didn't come up before. [...] I can't speak for other platforms as I haven't been too active in these recently.

toml++ is just doing plain ol' ordinal comparsion and I've never had a bug report about it. I also had a look at the backlog for the other popular C and C++ parsers (which are all also using ordinal comparsion, because neither language has a built-in portable way of doing unicode normalization), and couldn't find any issues about it in any of those, either.

Do keep the implementers perspective in mind; if we accidentally make this a requirement or a very strong recommendation (that will implicitly become a requirement if it makes its way into the toml-test suite), we are forcing implementations to either:
a) take on a dependency (making the libraries basically DOA in the C and C++ world)
b) rely on platform-specific OS features (making the libraries non-portable and giving inconsistent results between platforms)
c) implement unicode normalization from scratch internally (decidedly non-trivial)

Also worth nothing: this only impacts people who are doing stupid stuff. If people want to use keys that are visually indistinct, then why protect them? The offending keys would be near each other in the config.

I guess there's also the in-code vs in-config representation of the key, and talks of how the text editor normalizes the code that actually reads the keys, but that's going to normalize according to whatever algorithm the programmer's editor chose and might be different regardless of what we specify/recommend, so it seems a useless exercise on that front, too. The exception here is escape sequences as in @eksortso's example, but doing normalized comparsion where the programmer or user has explicitly, deliberatly chosen a specific codepoint is a counter-intuitive deceit IMO.

I've talked myself into it: I hate this idea - it's nonportable and overly burdensome for implementers. I implore you: please do not alter the spec document to impose or recommend any form of normalization.

eksortso Dec 14, 2022

There's a lot to cover here, so if I missed any points, let me know.

@abelbraaksma Upon re-examination, I think that we may be able to actually avoid mentioning a specific normalization method, and only insist that it be used consistently if at all. I linked to the definitions of the normalization forms above. A lot of us preferred NFC, but NFD could work just as well to handle the representations of glyphs properly. But NFKD or NFKC, I think, are as tightly defined as NFD and NFC are, so serious implementers could pick NFKD and prevent many common problems with similar-looking characters.

Long story short, key comparisons boil down to non-normalized versus normalized.

And it's obvious now that in most environments, the choices are already made by default. C/C++, and Python (as I have discovered), just compare code points without normalization. And the newer Java and .NET implementations normalize automatically when comparing strings.

@ChristianSi I completely agree with you that we ought to clarify that table headers and keys live in the same namespace. That falls a little outside of current discussion, but a new PR which describes the problem and gives a solution would be recommended and well-accepted. I'm short on time as it is though, so is this something that you could take the lead on?

@marzer I get where you're coming from. I happened to struggle with the wording while writing the proposal, for reasons that turned you off this effort. But using MAY instead of SHOULD makes this proposal more palatable, I hope. Which I'll follow up on shortly.

toml.md is written for TOML users, not for implementers. Let's keep it that way.

Unfortunately, that isn't true. The website toml.io is the friendliest resource for users, and it points to toml.md, which contains all the validity requirements that the formal grammar toml.abnf doesn't have. It is true that the one-document spec that toml.md is based upon was originally written as much to be like a manifesto for users as it was the technical document that it became. There's no tighter document in the spec than that. So a future RFC.md could fulfill that need, not to mention move us towards actual RFC standardization. Let's maybe take this up over on #870.

So here now is a much more permissive proposal. It keeps the link to the same Wikipedia article but jumps directly to section on normalization. Let me know what you think.

Parsers may normalize keys before comparing them, instead of just compare keys' code points, in order to prevent similar or identical-looking keys in tables. For instance, the following will produce an error if keys are normalized:

# DO NOT DO THIS

# prénom = "Françoise", using NFC
"pr\u00e9nom" = "Françoise"

# prénom = "Françoise", using NFD
"pr\u0065\u0301nom" = "Françoise"

ChristianSi Dec 14, 2022

Some points here:

The lack of bug reports: I'm not surprised by that and it doesn't show we don't have a problem. In real life, most TOML keys are ASCII and in such cases, of course, normalization issues will never come up. But since ancient times TOML has allowed arbitrary Unicode characters at least in quoted keys and in such cases normalization issues will show up sooner or later. That such keys aren't used frequently explains why there are no bug reports but it surely can't be an excuse to say "heck, what do I care, let's just not support them properly anyway."

I'm skeptical of the new idea to say "may" instead of "should" since that would split TOML implementations into two camps: those that threat "café" and "café" as equal (so that only one of them can be used in any given table) and those that treat them as different. Such a fragmentation is never good, and considering that for most implementations the burden of doing the right thing and normalizing them to the same form will be low, I think it's important for the TOML spec to be biased, strongly suggesting this way as preferred, while still allowing the other way as fallback – but ONLY for those cases where the cost of the preferred solution would be really, really too high. Which is more or just just what "should" or "is recommended" is supposed to do.

Or would it be too high for many? I don't think so. For example, in Python all you have to do is:

import unicodedata
key = unicodedata.normalize('NFC', key)

unicodedata is part of the standard library, so you essentially get this for free. The runtime cost for ASCII-only keys should also be really low, since the key will be returned unchanged in such cases. I suppose for most other scripting and high-level languages the situation will be the same – either they do the normalization thing automatically, or otherwise they'll have at least some standard function that makes it trivial to accomplish.

For lower-level languages such as C and C++, I imagine it will be necessary to import a third-party library instead of being able to rely on the standard lib, but I wouldn't consider that a big burden. And once done, normalization will once more be only one function call away.

So frankly, the only good reason for skipping such a normalization recommendation I can see is if the TOML implementation is intended for embedded devices where computing power or memory are just too limited to make that extra external dependency acceptable. So in that very special circumstance it would be perfectly justified to say "We ignore the recommendation and don't normalize keys here" – it's a recommendation after all ("should" or "is recommended") rather than a requirement, so in such cases (where a good reason exists) there would be nothing wrong with going down that route. But let's limit it too such extreme cases, which will doubtless be rare.

As for which normalization forms can be used: NFC and NFD should indeed both be fine since they will lead to exactly identical results: any two strings that become identical under NFC will also become identical under NFD, so it really shouldn't matter which of them is used.

But the K forms are more aggressive and must NOT be used. For example, they would make ... and … the same key, and I don't think anyone would want or expect that.

arp242 Dec 14, 2022

As for which normalization forms can be used: NFC and NFD should indeed both be fine since they will lead to exactly identical results: any two strings that become identical under NFC will also become identical under NFD, so it really shouldn't matter which of them is used.

Indeed; simply "canonical form" or something to that effect should be good enough. YAML has "YAML supports the need for scalar equality by requiring that every scalar tag must specify a mechanism for producing the canonical form of any formatted content. This form is a Unicode character string which also presents the same content and can be used for equality testing."

Or would it be too high for many? I don't think so. For example, in Python all you have to do is:
import unicodedata
key = unicodedata.normalize('NFC', key)
unicodedata is part of the standard library, so you essentially get this for free

The thing is not all software is written in Python; C, C++, and Go for example don't have anything in stdlib and require importing fairly large dependencies.

This is kind of an old discussion and it's come back for many different environments/formats. My own feeling is that we should simple make a note about it, without any specific recommendation. It's really up to applications rather than TOML implementations in how they want to treat "identity", similar to how an application may choose to treat hello and HELLO as identical, or even hello_world and helloworld (like zsh's setopt).

abelbraaksma Dec 15, 2022

For lower-level languages such as C and C++, I imagine it will be necessary to import a third-party library instead of being able to rely on the standard lib

The common thing to do is to use ICU, I guess. You can, of course, choose to statically link, rely on it to be present (true for most Linux and Windows distros afaik) or call system functions, like RtlCompareUnicode on Windows. Relying on it to be present just leaves it to the user, if it’s not there you fallback to ordinal.

It’s been a while that I did C++, so if others say it is quite involved, I believe that. But having ‘should’ in the text does not mean ‘must’.

About NFC vs NFKC, these don’t behave equal, the latter will consider more strings the same (like U+017F (ſ) is eq to s, or German ß (Eszett) equal to ss), which we should discourage, unless at user option.

It's really up to applications rather than TOML applications in how they want to treat "identity"

People cam, and should, expect a certain predefined behaviour when using a standard (perceived or not), and expect portability. Since TOML explicitly supports Unicode throughout, it implicitly supports the Unicode standard. Since Unicode says “these are the ways to compare stuff”, we should follow suit, as we do compare stuff.

My personal preference would be to have this explicitly in TOML, with a line that says that at a minimum, ordinal should be offered, but at user option, NFC/NFK can be chosen. That way, users of parsers can choose themselves. And if only one option is valid and they try something else, they’d receive an error.

This idea is probably not very popular, but it allows freedom to implementers, and clarity for the user. When nothing is selected, it defaults to Ordinal as it is now (well, depending on implementation).

erbsland-dev · 2023-06-08T09:08:35Z

erbsland-dev
Jun 8, 2023

Over the recent years, I have actively worked to remove dependencies on the ICU library from many software projects. In these server-based projects, this highly complex Unicode library was used, even when it was not really necessary.

So, I believe, pushing implementers towards a complex Unicode normalisation, which requires an equally complex external library of approximately 16MB, needs to be thoughtfully considered. This would be only for what I see as a few special cases where developers deem it critical to have unique keys with Unicode characters and rely on their correct comparability.

This issue becomes especially significant in embedded development, where maintaining large tables in memory just for proper Unicode handling isn't practical. By enforcing something seemingly simple like requiring normalisation for key comparison, we could hinder the development of an efficient, small embedded library that fully supports TOML. I personally think that arguments like "ICU is available on most operating systems" ignore the fact that not everyone wants or can manage a dependency on ICU.

From my experience, for most uses of the TOML format, a binary comparison of keys should work without any problems. If TOML is used for purposes beyond configurations, such as translation files or other specific use cases, the user of the TOML implementation might need to put in a little extra effort to normalise the keys.

Instead of requiring a specific comparison model in the TOML specification, I would rather suggest that a TOML implementation should define the comparison mode used for keys, or provide an option to select the comparison mode. This approach is quite common in many databases.

So in conclusion, while I agree that having a defined way to compare keys is important, I believe forcing implementers towards one single, very complex model that depends on large external libraries may not be the best solution.

1 reply

eksortso Jun 8, 2023

Instead of requiring a specific comparison model in the TOML specification, I would rather suggest that a TOML implementation should define the comparison mode used for keys, or provide an option to select the comparison mode. This approach is quite common in many databases.

This is essentially my own view. And I wrote some minimal proposal language to permit implementations to compare keys as they see fit. But it's not an open-ended license; no practical parser would be arbitrarily loose. Binary comparisons will remain prevalent, and I doubt that allowing other comparison types would divide the community, as some have expressed concern about.

So in conclusion, while I agree that having a defined way to compare keys is important, I believe forcing implementers towards one single, very complex model that depends on large external libraries may not be the best solution.

Rest assured, at this point in the debate, nobody's advocating for forced key normalization any longer. (Correct me if I'm wrong though.)

dbuenzli · 2023-06-30T02:15:34Z

dbuenzli
Jun 30, 2023

I just skimmed these discussion so sorry if that comes a bit off. But it seems to me that two questions tend to be conflated or at least not well delineated.

Given a key t.hé that has different representations in NFC and NFD:

What is the data model if two different representations are used in
the same document ? For example:
```
 t.hé = "nfd"    # with é in NFD
 t.hé = "nfc"    # with é in NFC
```
How do I query the data of t.hé (which code point sequence do I use) and what is the result ?

If TOML doesn't want to bring the whole Unicode normalization machinery in (which looks like a good idea to me) then the hash table defined by t should have two entries in the case of 1.

Given the XML precedent I would argue that this is a reasonable behaviour. Input methods and file saves in editors tend to mandate a single normalization form in a given document. This is why you likely never ran into a puzzling tag mismatch error on a <hé></hé><-- é in NFC--> sequence which can perfectly happen in theory. This means that you are very unlikely to actually fall on 1. in the wild (except perhaps in a malicious setting).

Now if you want to extract the tag data of <hé> in XML (point 2.) then either the schema needs to specify that hé should be in a particular form (i.e. mandate a code point sequence) or request to normalize the tag for looking it up. But then if the tag is supposed to be unique you have to say something if you find twice the same tag which normalizes to the same value. I guess that at that point one can argue that this is a problem of "schema interpretation" which is out of scope of the TOML specification itself.

So I would say that the simplest would be to

(Which hash table is being defined ?) Say that keys are binary compared for the purpose of the data model.
(How do I lookup the hash table ?) If a key has decomposable characters it SHOULD be looked ~~with NFC~~ against the data model with an NFC key for best-effort (because input methods that insert decomposable characters seem to input them in NFC) but that a lookup protocol has to be defined at the schema level which is out of scope of the specification.

0 replies

Unicode name comparison (keys, tables). #941

Replies: 6 comments · 24 replies

Replies: 6 comments 24 replies