It's possible for valid JavaScript strings to be invalid Unicode strings - this arises from the fact that JS strings are specced as arbitrary sequences of 16-bit code units. This means that ill-formed UTF-16 sequences, for example \udc11 (a lone trailing surrogate), can show up in our string literals.
The BinAST encoding needs to handle this - we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. The problem only arises when 16-bit code units fall into the surrogate range (0xD800-0xDFFF).
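To illustrate the lossiness: a standard UTF-8 encoder cannot round-trip such a string. This small sketch assumes a runtime that exposes the WHATWG `TextEncoder` (browsers, Node.js, Deno); per the Encoding Standard, a lone surrogate gets replaced with U+FFFD, so the original string is unrecoverable.

```typescript
// "a\udc11b" is a valid JS string but not a valid Unicode string.
const lossy = new TextEncoder().encode("a\udc11b");
// The lone surrogate becomes U+FFFD (EF BF BD), losing the original code unit:
console.log(lossy); // Uint8Array [ 0x61, 0xef, 0xbf, 0xbd, 0x62 ]
```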
My suggestion is the following: we translate the 16-bit code unit sequence as if it were a UTF-16 string. This means that when we see well-formed surrogate pairs, we combine them into Unicode codepoints and re-encode those as UTF-8 sequences.
When we see surrogate values that occur outside a well-formed pair, we encode them directly as if they were codepoints. Surrogates are not valid Unicode scalar values, so no well-formed UTF-8 sequence corresponds to them. Those byte sequences are thus "free" for us to use to encode lone surrogate code units.
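For concreteness, here is a minimal sketch of the scheme described above (which is essentially the WTF-8 encoding). The helper names `encodeJsString` and `encodeCodePoint` are hypothetical, not part of any BinAST API.

```typescript
// Hypothetical helper: emit the UTF-8 byte layout for one code point.
// Lone surrogates (0xD800-0xDFFF) fall into the three-byte branch, producing
// sequences that well-formed UTF-8 never contains, so they are unambiguous.
function encodeCodePoint(cp: number, out: number[]): void {
  if (cp < 0x80) {
    out.push(cp);
  } else if (cp < 0x800) {
    out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
  } else if (cp < 0x10000) {
    out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
  } else {
    out.push(
      0xf0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3f),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    );
  }
}

// Hypothetical helper: encode a JS string (possibly ill-formed UTF-16)
// losslessly into bytes.
function encodeJsString(s: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed surrogate pair: combine into a supplementary code point.
        encodeCodePoint(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00), out);
        i++;
        continue;
      }
    }
    // BMP code unit, or a lone surrogate encoded "as if" it were a code point.
    encodeCodePoint(unit, out);
  }
  return Uint8Array.from(out);
}

// "\udc11" round-trips as the three bytes ED B0 91 instead of being replaced.
console.log(encodeJsString("\udc11")); // Uint8Array [ 0xed, 0xb0, 0x91 ]
```

A decoder would simply reverse this: decode UTF-8 as usual, but additionally accept three-byte sequences in the surrogate range and map them back to single 16-bit code units.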
I'm not 100% sure this needs to be addressed in the spec, but @Yoric suggested I file the issue here because it may need to be.