En- and decoding of characters outside the Basic Multilingual Plane #12

RFC4627 prescribes that everything has to be in some Unicode encoding (which we comply with by emitting ASCII, a subset of UTF-8, and escaping everything else) and that any character may be escaped. When escaping, however, we need to take care to use the single six-character escape only for characters in the Basic Multilingual Plane (BMP), which is U+0000 to U+FFFF:

> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits that
> encode the character's code point. [...]
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> - RFC4627, p. 3

This commit implements en- and decoding of UTF-16 surrogate pairs, together with the error handling required by the ordering rule (a high surrogate must be followed by a low surrogate) and by the fact that a lone surrogate code unit/point may be neither decoded nor encoded.

Test cases partially taken from sharplispers#3.

BREAKING CHANGES: Note that the previous behavior can be considered a bug throughout, insofar as it violates the JSON spec.

* A Unicode code point outside the BMP will now always be encoded as a UTF-16 surrogate pair.
* A valid UTF-16 surrogate pair will now always be decoded to a single Unicode code point.
* When *use-strict-json-rules* is non-NIL, encoding a surrogate code point or decoding a lone surrogate code unit will result in an error. If *use-strict-json-rules* is NIL, it'll behave as before.

Co-Authored-By: Chaitanya Gupta <[email protected]>
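For readers who want the arithmetic behind the surrogate pairs, here is a minimal sketch in Python. It is not the code from this commit (which is Common Lisp); the function names `to_surrogate_pair`, `from_surrogate_pair` and `escape` are hypothetical and only illustrate the UTF-16 rules the RFC refers to, with lone surrogates always rejected as in the strict mode described above.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("expected a high surrogate first")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("expected a low surrogate second")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def escape(code_point):
    """Return the JSON \\u escape for a code point (illustrative only)."""
    if 0xD800 <= code_point <= 0xDFFF:
        # A surrogate on its own is not a character and cannot be encoded.
        raise ValueError("lone surrogate code point")
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)
```

With these definitions, `escape(0x1D11E)` yields the twelve-character sequence `\uD834\uDD1E` from the RFC's G clef example, and `from_surrogate_pair(0xD834, 0xDD1E)` returns `0x1D11E` again.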

CCL's Unicode implementation doesn't allow lone surrogate code points in a string, so we can never create a string that would trigger the behavior tested here. CCL follows the Unicode standard strictly in this respect, whereas other CL implementations are more lenient.

Commits on Jun 18, 2022

Commits on Aug 30, 2022
