Encode and decode non-BMP chars into unicode surrogate pairs #3

rpgoldman · 2021-02-22T02:41:32Z

No description provided.

RFC4627 prescribes that everything has to be in some Unicode encoding (which we comply with by emitting ASCII, a subset of UTF-8, and escaping everything else) and that any character may be escaped. When escaping, however, only characters in the Basic Multilingual Plane (BMP), which is U+0000 to U+FFFF, can be written as a single \uXXXX escape:

> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits that
> encode the character's code point. [...]
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair. So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> - RFC4627, p. 3

This commit implements encoding and decoding of UTF-16 surrogate pairs, together with the error handling required by the ordering requirements (high surrogate first, then low surrogate) and by the fact that a lone surrogate code unit or code point may be neither decoded nor encoded.

Test cases partially taken from sharplispers#3.

BREAKING CHANGES: Note that the previous behavior can be considered a bug in every case, insofar as it violates the JSON spec.

* A Unicode code point outside the BMP will now always be encoded as a UTF-16 surrogate pair.
* A valid UTF-16 surrogate pair will now always be decoded to a single Unicode code point.
* When *use-strict-json-rules* is non-NIL, encoding a surrogate code point or decoding a lone surrogate code unit will result in an error. If *use-strict-json-rules* is NIL, it'll behave as before.

Co-Authored-By: Chaitanya Gupta <[email protected]>
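For readers unfamiliar with the arithmetic behind surrogate pairs, here is a minimal sketch of the conversion the commit message describes. It is written in Python purely for illustration and is not the code from this pull request; the function names (`encode_surrogate_pair`, `decode_surrogate_pair`, `escape_code_point`) are hypothetical, and the sketch always rejects lone surrogates, which corresponds only to the strict behavior described above.

```python
from typing import Tuple

# Illustrative sketch only: names are hypothetical, not part of any library API.

def encode_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("expected a high surrogate first")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("expected a low surrogate second")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def escape_code_point(code_point: int) -> str:
    """Return the JSON \\u escape for a code point (strict: no lone surrogates)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("a lone surrogate cannot be encoded")
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point      # six-character sequence
    high, low = encode_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)  # twelve-character sequence

print(escape_code_point(0x1D11E))  # prints \uD834\uDD1E (the RFC's G clef example)
```

The range checks in `decode_surrogate_pair` are what enforce the ordering requirement: a low surrogate in first position, or a high surrogate with no low surrogate after it, is rejected rather than silently passed through.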

chaitanyagupta added 2 commits November 17, 2018 19:04

Encode and decode non-BMP chars into unicode surrogate pairs

987a8c7

Add tests for encoding and decoding non-ASCII and non-BMP chars

8e8d1b4

sternenseemann mentioned this pull request Jun 18, 2022

En- and decoding of characters outside the Basic Multilingual Plane #12

Open
