Encode and decode non-BMP chars into unicode surrogate pairs #3

Open
wants to merge 2 commits into
base: master

rpgoldman
Collaborator

No description provided.

sternenseemann pushed a commit to sternenseemann/cl-json that referenced this pull request Jun 18, 2022
RFC4627 prescribes that everything has to be in some Unicode
encoding (which we comply with by emitting only ASCII, which is valid
UTF-8, and escaping everything else) and that any character may be
escaped. When escaping, however, we need to take care that only
characters in the Basic Multilingual Plane (BMP), which is U+0000 to
U+FFFF, are escaped as a single \uXXXX sequence:

> Any character may be escaped.  If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits that
> encode the character's code point. [...]
>
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.  So, for example, a string
> containing only the G clef character (U+1D11E) may be represented as
> "\uD834\uDD1E".
>
> - RFC4627, p. 3
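
The arithmetic behind the quoted rule can be sketched as follows. This is
an illustration in Python, not the cl-json code; the function name
`escape_code_point` is made up for this example, and it deliberately does
not deal with code points that are themselves surrogates.

```python
import json


def escape_code_point(cp: int) -> str:
    """Return the JSON \\u escape(s) for one Unicode code point.

    Hypothetical helper for illustration only; not part of cl-json.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if cp <= 0xFFFF:
        # BMP: one six-character sequence.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return f"\\u{high:04X}\\u{low:04X}"


# The RFC's own example: G clef, U+1D11E.
g_clef = escape_code_point(0x1D11E)
print(g_clef)  # \uD834\uDD1E
# A conforming JSON parser turns the pair back into one character.
print(json.loads('"' + g_clef + '"') == "\U0001D11E")  # True
```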

This commit implements encoding and decoding of UTF-16 surrogate pairs,
along with the error handling needed because a pair must appear in
high-then-low order and a lone surrogate code unit/point may never be
decoded or encoded.
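
A minimal sketch of the decoding side, assuming the \uXXXX escapes have
already been parsed into a list of 16-bit code units. It is written in
Python for illustration and is not the cl-json implementation; the name
`combine_code_units` and the use of `ValueError` are assumptions of this
example.

```python
def combine_code_units(units: list[int]) -> str:
    """Turn UTF-16 code units into a string, rejecting lone surrogates.

    Hypothetical helper for illustration only; not part of cl-json.
    """
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                chars.append(chr(cp))
                i += 2
                continue
            raise ValueError(f"lone high surrogate {unit:#06x}")
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed by a preceding high one.
            raise ValueError(f"lone low surrogate {unit:#06x}")
        chars.append(chr(unit))
        i += 1
    return "".join(chars)


print(combine_code_units([0xD834, 0xDD1E]) == "\U0001D11E")  # True
```

Reversing the pair, truncating it, or separating its halves with another
character all end in the error branches, which is the ordering requirement
mentioned above.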

Test cases partially taken from sharplispers#3.

BREAKING CHANGES:

Note that the previous behavior changed here can in every case be
considered a bug insofar as it violates the JSON spec.

* A Unicode code point outside the BMP will now always be encoded as a
  UTF-16 surrogate pair.
* A valid UTF-16 surrogate pair will now always be decoded to a single
  Unicode code point.
* When *use-strict-json-rules* is non-NIL, encoding a surrogate code
  point or decoding a lone surrogate code unit will result in an error.
  If *use-strict-json-rules* is NIL, it will behave as before.
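
The strict/lenient split in the last item can be pictured as a small
guard, sketched here in Python. The names `surrogate_policy` and `strict`
are hypothetical stand-ins, and the lenient branch only reports that the
legacy path applies: the commit message does not say what that legacy
behavior is, so the sketch does not guess at it.

```python
def surrogate_policy(cp: int, strict: bool) -> str:
    """Decide how a code point or lone code unit should be handled.

    Hypothetical helper for illustration only; not part of cl-json.
    Returns "normal" for non-surrogates and "legacy" for surrogates in
    lenient mode; raises ValueError for surrogates in strict mode.
    """
    if not 0xD800 <= cp <= 0xDFFF:
        return "normal"
    if strict:
        raise ValueError(f"lone surrogate {cp:#06x} not allowed in strict mode")
    # ASSUMPTION: the lenient path simply defers to the old behavior,
    # which is not described in the commit message.
    return "legacy"


print(surrogate_policy(0x1D11E, strict=True))   # normal
print(surrogate_policy(0xD834, strict=False))   # legacy
```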

Co-Authored-By: Chaitanya Gupta <[email protected]>
sternenseemann added a commit to sternenseemann/cl-json that referenced this pull request Jun 18, 2022