Replies: 6 comments 24 replies
-
Yes, this is a good subject to bring up. I think that, by default, ordinal comparison is the way to go. Implementations may offer other normalisation forms, or localised comparisons. Alternatively, if a single normalisation form is preferred, the most commonly used one, and the one used in internet specifications, is NFC. From here, quote:
-
If I see it correctly, NFC and NFD produce the same results regarding string equality? That is, any two strings that are identical under one of them are also identical under the other? If that's the case, it would be sufficient if implementations use one of these normalization forms to detect duplicate keys/table names; which of them is used doesn't make a difference here. On the other hand, NFKC and NFKD normalize more aggressively, e.g. converting the ligature "ﬁ" to "fi".

Another question is whether and how to normalize strings before handing them over to the user/the calling application. The case can certainly be made that they should, since e.g. a user looking for the key (or value) "café" (4 code points) won't find it if the file has "café" (5 code points) and no normalization is performed (following @abelbraaksma's example).

I think a reasonable recommendation would be that TOML parsers should apply NFC to detect duplicate keys and before handing any parsed string over to the calling app, and that TOML writers should apply NFC before writing a TOML bytestream to disk (or sending it over the wire). However, I wouldn't make it a strict requirement ("must" instead of "should"), since it may be too high a burden on some implementations.
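Both points can be checked with Python's standard `unicodedata` module; this is just a minimal illustration, not tied to any TOML implementation:

```python
import unicodedata

composed = "caf\u00e9"     # "café" as 4 code points (precomposed é, U+00E9)
decomposed = "cafe\u0301"  # "café" as 5 code points (e + combining acute, U+0301)

print(composed == decomposed)  # False: ordinal comparison sees different sequences
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True after NFC
print(unicodedata.normalize("NFD", composed)
      == unicodedata.normalize("NFD", decomposed))  # True after NFD, same verdict
```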
-
Probably worth pointing out that TOML is approaching 10 years old and has been UTF-8 for that entire time, without anyone ever reporting an issue that would have been solved by key string normalization (that I could find, anyway). I would guess that ordinal comparison (which is what most implementations are doing, I suspect) is fine. Personally, I'd prefer if this were deferred to the applications on a per-case basis, or remained a recommendation at most.
-
Let's step back a moment and codify what we have discussed so far. I think we ought to recommend specifically NFC as the normalization form to use for keys in tables, but let us acknowledge that ordinal comparisons may be used, and in either case, using two different forms of a key's name is bad practice. The following text could be inserted into the spec:

> Because some keys look the same with different Unicode code point sequences, parsers should compare the NFC forms of keys, instead of just their code points. Likewise, encoders should normalize the keys they write using NFC.

```toml
# DO NOT DO THIS
# prénom = "Françoise"  # but assigned with two different forms
"pr\u00e9nom" = "Françoise"        # NFC form
"pr\u0065\u0301nom" = "Françoise"  # NFD form, looks the same as the NFC form
```

The example I used contains two characters, "é" and "ç", which each have different forms under NFC and NFD. The name isn't significant, in case you're wondering about that, but I did write it using different forms of the cedilla. Also, since I used quoted keys, this can be tested in TOML v1.0.0 as it stands.

What do you think? What would you change?
-
Over the recent years, I have actively worked to remove dependencies on the ICU library from many software projects. In these server-based projects, this highly complex Unicode library was used even when it was not really necessary. So I believe that pushing implementers towards a complex Unicode normalisation, which requires an equally complex external library of approximately 16 MB, needs to be thoughtfully considered. This would be only for what I see as a few special cases where developers deem it critical to have unique keys with Unicode characters and rely on their correct comparability.

This issue becomes especially significant in embedded development, where maintaining large tables in memory just for proper Unicode handling isn't practical. By enforcing something seemingly simple like requiring normalisation for key comparison, we could hinder the development of an efficient, small embedded library that fully supports TOML. I personally think that arguments like "ICU is available on most operating systems" ignore the fact that not everyone wants, or can manage, a dependency on ICU.

From my experience, for most uses of the TOML format, a binary comparison of keys should work without any problems. If TOML is used for purposes beyond configuration, such as translation files or other specific use cases, the user of the TOML implementation might need to put in a little extra effort to normalise the keys.

Instead of requiring a specific comparison model in the TOML specification, I would rather suggest that a TOML implementation should define the comparison mode used for keys, or provide an option to select the comparison mode, as sketched below. This approach is quite common in many databases.

So in conclusion, while I agree that having a defined way to compare keys is important, I believe forcing implementers towards one single, very complex model that depends on large external libraries may not be the best solution.
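A selectable comparison mode could look like the following Python sketch; `KeyComparison` and `canonical_key` are made-up names for illustration, not an existing API:

```python
import unicodedata
from enum import Enum

class KeyComparison(Enum):
    ORDINAL = "ordinal"  # raw code point comparison; needs no Unicode tables
    NFC = "nfc"          # normalize before comparing; needs normalization data

def canonical_key(key: str, mode: KeyComparison) -> str:
    """Map a key to its canonical form under the selected comparison mode."""
    if mode is KeyComparison.NFC:
        return unicodedata.normalize("NFC", key)
    return key  # ORDINAL: the code point sequence itself is canonical

# An embedded build could ship only ORDINAL and drop the Unicode tables entirely.
assert canonical_key("caf\u00e9", KeyComparison.ORDINAL) \
    != canonical_key("cafe\u0301", KeyComparison.ORDINAL)
assert canonical_key("caf\u00e9", KeyComparison.NFC) \
    == canonical_key("cafe\u0301", KeyComparison.NFC)
```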
-
I just skimmed this discussion, so sorry if this comes off a bit rough. But it seems to me that two questions tend to be conflated, or at least not well delineated. Given a key, there is (1) the question of when it counts as equal to another key, i.e. when a parser must report a duplicate, and (2) the question of which code point sequence gets handed to the calling application.

If TOML doesn't want to bring the whole Unicode normalization machinery in (which looks like a good idea to me), then keys compare by their raw code points, and the hash table defined by a document may contain two entries that render identically on screen.

Given the XML precedent, I would argue that this is a reasonable behaviour. Input methods and file saves in editors tend to mandate a single normalization form in a given document. This is why you likely never ran into a puzzling tag mismatch error on an accented XML tag. Now if you want to extract the data of such a tag, you have to query it with the same code point sequence the document uses.

So I would say that the simplest would be to compare keys by code points and leave any normalization to the applications or the tools that produce the documents.
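As a concrete illustration of that hash table point, Python dicts compare string keys by code points, just like the ordinal comparison discussed here:

```python
# With ordinal (code point) comparison, visually identical keys are distinct.
table = {}
table["caf\u00e9"] = "NFC entry"   # "café", precomposed
table["cafe\u0301"] = "NFD entry"  # "café", decomposed; a second, separate entry

print(len(table))          # 2: the document defines two keys
print(table["caf\u00e9"])  # lookups must use the exact code point sequence
```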
-
String comparison for keys and tables probably needs to be defined more explicitly. When comparing a key for duplication, we can implement string comparison in three ways: ordinal (raw code point) comparison, comparison after Unicode normalization, or locale-aware comparison, as sketched below.
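For instance, in Python (just an illustration of the three options, using only standard library modules):

```python
import locale
import unicodedata

a, b = "caf\u00e9", "cafe\u0301"  # same rendered text, different code points

# 1. Ordinal: compare the raw code point sequences.
print(a == b)  # False

# 2. Normalized: compare after applying a normalization form such as NFC.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True

# 3. Locale-aware: use the collation rules of the current locale.
locale.setlocale(locale.LC_COLLATE, "")
print(locale.strcoll(a, b) == 0)  # result is platform- and locale-dependent
```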
The normalization algorithm is pretty heavy-duty and will require implementations in some languages to include external Unicode libraries to handle it.
I'm curious what others think about this. How do the current implementations handle this?