It's possible for valid JavaScript strings to be invalid Unicode strings - this arises from the fact that JS strings are specced as arbitrary sequences of 16-bit code units. This means that ill-formed UTF-16 sequences, for example \udc11 (a lone trailing surrogate), can show up in our string literals.
The BinAST encoding needs to handle this - we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. The problem only arises when 16-bit code units fall into the surrogate range (0xD800-0xDFFF).
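To illustrate the lossiness: a standard UTF-8 encoder cannot round-trip such a string. This small sketch assumes a runtime that exposes the WHATWG `TextEncoder` (browsers, Node.js, Deno); per the Encoding Standard, a lone surrogate gets replaced with U+FFFD, so the original string is unrecoverable.

```typescript
// "a\udc11b" is a valid JS string but not a valid Unicode string.
const lossy = new TextEncoder().encode("a\udc11b");
// The lone surrogate becomes U+FFFD (EF BF BD), losing the original code unit:
console.log(lossy); // Uint8Array [ 0x61, 0xef, 0xbf, 0xbd, 0x62 ]
```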
My suggestion is the following: we translate the 16-bit code unit sequence as if it were a UTF-16 string. This means that when we see well-formed surrogate pairs, we combine them into Unicode codepoints and re-encode those as UTF-8 sequences.
When we see surrogate values that occur outside a well-formed pair, we encode them directly as if they were codepoints. Surrogates are not valid Unicode scalar values, so no well-formed UTF-8 sequence corresponds to them. Those byte sequences are thus "free" for us to use to encode lone surrogate code units.
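For concreteness, here is a minimal sketch of the scheme described above (which is essentially the WTF-8 encoding). The helper names `encodeJsString` and `encodeCodePoint` are hypothetical, not part of any BinAST API.

```typescript
// Hypothetical helper: emit the UTF-8 byte layout for one code point.
// Lone surrogates (0xD800-0xDFFF) fall into the three-byte branch, producing
// sequences that well-formed UTF-8 never contains, so they are unambiguous.
function encodeCodePoint(cp: number, out: number[]): void {
  if (cp < 0x80) {
    out.push(cp);
  } else if (cp < 0x800) {
    out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
  } else if (cp < 0x10000) {
    out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
  } else {
    out.push(
      0xf0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3f),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    );
  }
}

// Hypothetical helper: encode a JS string (possibly ill-formed UTF-16)
// losslessly into bytes.
function encodeJsString(s: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed surrogate pair: combine into a supplementary code point.
        encodeCodePoint(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00), out);
        i++;
        continue;
      }
    }
    // BMP code unit, or a lone surrogate encoded "as if" it were a code point.
    encodeCodePoint(unit, out);
  }
  return Uint8Array.from(out);
}

// "\udc11" round-trips as the three bytes ED B0 91 instead of being replaced.
console.log(encodeJsString("\udc11")); // Uint8Array [ 0xed, 0xb0, 0x91 ]
```

A decoder would simply reverse this: decode UTF-8 as usual, but additionally accept three-byte sequences in the surrogate range and map them back to single 16-bit code units.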
I'm not 100% sure this needs to be addressed in the spec, but @Yoric suggested I file the issue here because it may need to be.