En- and decoding of characters outside the Basic Multilingual Plane #12

Open
wants to merge 3 commits into
base: master

Commits on Jun 18, 2022

  1. b2c91fa
  2. En- and decoding of characters outside the Basic Multilingual Plane

    RFC4627 prescribes that everything has to be in some Unicode
    encoding (which we comply with by emitting ASCII, a subset of UTF-8,
    and escaping everything else) and that any character may be escaped.
    When escaping, however, we need to take care that only characters in
    the Basic Multilingual Plane (BMP), U+0000 through U+FFFF, are written
    as a single \uXXXX escape:
    
    > Any character may be escaped.  If the character is in the Basic
    > Multilingual Plane (U+0000 through U+FFFF), then it may be
    > represented as a six-character sequence: a reverse solidus, followed
    > by the lowercase letter u, followed by four hexadecimal digits that
    > encode the character's code point. [...]
    >
    > To escape an extended character that is not in the Basic Multilingual
    > Plane, the character is represented as a twelve-character sequence,
    > encoding the UTF-16 surrogate pair.  So, for example, a string
    > containing only the G clef character (U+1D11E) may be represented as
    > "\uD834\uDD1E".
    >
    > - RFC4627, p. 3
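
The surrogate pair arithmetic the RFC refers to can be sketched as follows. This is a minimal Python illustration, not the code from this pull request; the function names `to_surrogate_pair` and `escape_code_point` are made up for the example, and surrogate code points themselves (U+D800 through U+DFFF) are deliberately not handled here.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"U+{code_point:04X} is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def escape_code_point(code_point: int) -> str:
    """Return the JSON \\u escape for one code point (hypothetical helper)."""
    if code_point <= 0xFFFF:
        # BMP: a single six-character escape
        return f"\\u{code_point:04X}"
    # Outside the BMP: twelve characters, one escape per surrogate
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


# The G clef example from the RFC
print(escape_code_point(0x1D11E))  # \uD834\uDD1E
```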
    
    This commit implements en- and decoding of UTF-16 surrogate pairs,
    along with the error handling required by the ordering requirements
    (a high surrogate must be followed by a low surrogate) and by the fact
    that a lone surrogate code unit/point may be neither decoded nor
    encoded.
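
As a rough illustration of the decoding side, the sketch below combines a sequence of 16-bit code units (as they would be read from consecutive \uXXXX escapes) into code points and rejects lone or misordered surrogates. It is written in Python for brevity and does not mirror the Lisp implementation in this commit; `combine_surrogates` is a hypothetical name.

```python
def combine_surrogates(units: list[int]) -> list[int]:
    """Turn UTF-16 code units into code points.

    Raises ValueError on a lone or misordered surrogate.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(
                    f"high surrogate {unit:04X} is not followed by a low surrogate"
                )
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate
            raise ValueError(f"unexpected low surrogate {unit:04X}")
        else:
            # Ordinary BMP code unit: it is its own code point
            code_points.append(unit)
            i += 1
    return code_points
```

For example, `combine_surrogates([0xD834, 0xDD1E])` yields the single code point U+1D11E, while `[0xDD1E, 0xD834]` (wrong order) and `[0xD834]` (lone high surrogate) both raise an error.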
    
    Test cases partially taken from sharplispers#3.
    
    BREAKING CHANGES:
    
    Note that all of the behavior changed here can be considered a bug,
    insofar as it violated the JSON spec.
    
    * A Unicode code point outside the BMP will now always be encoded as a
      UTF-16 surrogate pair.
    * A valid UTF-16 surrogate pair will now always be decoded to a single
      Unicode code point.
    * When *use-strict-json-rules* is true, encoding a surrogate code point
      or decoding a lone surrogate code unit will signal an error.
      If *use-strict-json-rules* is NIL, the previous behavior is kept.
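
The encoding rules in the list above can be illustrated with a small Python sketch. The `strict` parameter is a stand-in for *use-strict-json-rules*, and `escape_non_ascii` is a hypothetical name. What the lenient branch does here (emitting the lone surrogate as a single \uXXXX escape) is an assumption made for the example; the commit message only says that the non-strict behavior is unchanged. Escaping of quotes, backslashes and control characters is left out.

```python
def escape_non_ascii(text: str, strict: bool = True) -> str:
    """Escape every non-ASCII character of text as JSON \\u escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # A surrogate code point is not a valid character on its own
            if strict:
                raise ValueError(f"cannot encode surrogate code point U+{cp:04X}")
            out.append(f"\\u{cp:04X}")  # lenient pass-through (assumed behavior)
        elif cp < 0x80:
            out.append(ch)              # ASCII is emitted unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")  # BMP: single escape
        else:
            # Outside the BMP: always a surrogate pair
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)
```

With `strict=True`, a string containing a lone surrogate raises an error instead of producing output that a conforming decoder would have to reject.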
    
    Co-Authored-By: Chaitanya Gupta <[email protected]>
    sternenseemann and chaitanyagupta committed Jun 18, 2022
    4796850

Commits on Aug 30, 2022

  1. Fix test suite execution in CCL

    CCL's Unicode implementation doesn't allow a lone surrogate code point
    in a string, preventing us from ever creating a string that would
    trigger the behavior tested here. Other CL implementations are more
    lenient, whereas CCL follows the Unicode standard strictly.
    sternenseemann committed Aug 30, 2022
    c059bec