Add error-tolerant mode #19

Open
darrachequesne wants to merge 1 commit into mathiasbynens:master from darrachequesne:patch-1


darrachequesne
Contributor

darrachequesne commented Oct 16, 2016

Closes #2 and #5

@coveralls

Coverage Status

Coverage increased (+0.4%) to 92.958% when pulling c373d19 on darrachequesne:patch-1 into 2fa80fa on mathiasbynens:master.

@darrachequesne
Contributor Author

@mathiasbynens does that implementation comply with what you had in mind? Could you please review when you have time?

@mathiasbynens
Owner

Of course! It might take a while until I get around to it, though.

@darrachequesne
Contributor Author

No problem! Please tell me if I can help in any way.

@darrachequesne
Contributor Author

Hi @mathiasbynens! Do you know when you'll be able to review this PR?

@coveralls

coveralls commented Dec 18, 2016

Coverage Status

Coverage increased (+0.4%) to 92.958% when pulling 41c4eef on darrachequesne:patch-1 into 5566334 on mathiasbynens:master.

@chharvey

@darrachequesne Does this handle the case of missing or extra continuation bytes?

The encoding 1110xxxx 10xxxxxx 10xxxxxx 0xxxxxxx (a 3-byte sequence followed by a 1-byte sequence) is well-formed and decodes to two code points. But if one of the continuation bytes were lost in transmission, the resulting 1110xxxx 10xxxxxx 0xxxxxxx would throw an error. With {strict: false}, we would want the first character to resolve to U+FFFD instead of throwing, and the second character to decode as normal. Example:

utf8.decode(
	'\xE2\xAC\xE2\x82\xAC', // 11100010 10101100 11100010 10000010 10101100
	{strict: false},
) === '\uFFFD\u20AC';
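The check behind the example above can be sketched as follows. This is only an illustration of how a decoder might notice a truncated sequence from the bit patterns; the helper names (`expectedLength`, `isContinuation`, `missingContinuations`) are hypothetical and are not part of utf8.js or of this PR.

```javascript
// Hypothetical helpers for illustration; none of these names come from utf8.js.

// Number of bytes a sequence should have, judged from its first byte.
// Returns 0 for a byte that cannot start a sequence.
function expectedLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;                             // 10xxxxxx or 11111xxx
}

function isContinuation(byte) {
  return (byte & 0xC0) === 0x80; // 10xxxxxx
}

// Counts how many continuation bytes are missing from the sequence that
// starts at `index` in a byte string (one character per byte).
// Returns 0 when the byte at `index` cannot start a sequence.
function missingContinuations(byteString, index) {
  const expected = expectedLength(byteString.charCodeAt(index));
  let found = 1;
  while (
    found < expected &&
    index + found < byteString.length &&
    isContinuation(byteString.charCodeAt(index + found))
  ) {
    found++;
  }
  return Math.max(expected - found, 0);
}

const bytes = '\xE2\xAC\xE2\x82\xAC';
console.log(missingContinuations(bytes, 0)); // 1: the first sequence is truncated
console.log(missingContinuations(bytes, 2)); // 0: the second sequence is complete
```

A tolerant decoder could emit U+FFFD whenever this count is non-zero and resume at the byte that interrupted the sequence.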

Likewise, 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx is not well-formed. With strict turned off, the first character (the 3-byte sequence) should decode as normal, and a single U+FFFD should then be returned for the run of stray continuation bytes that follows, up to the next “header byte” (that is, a byte starting with 00, 01, or 11). Example:

utf8.decode(
	'\xE2\x82\xAC\x82\xAC\xE2\x82\xAC', // 11100010 10000010 10101100 10000010 10101100 11100010 10000010 10101100
	{strict: false},
) === '\u20AC\uFFFD\u20AC';
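Both behaviours requested above can be sketched in one small decoder. This is a minimal illustration under stated assumptions, not the implementation in this PR: `decodeTolerant` is a hypothetical name, the input is assumed to be a byte string (one character per byte, as in the examples), and a run of stray continuation bytes is collapsed into a single U+FFFD to match the expected output shown above.

```javascript
// Hypothetical sketch; `decodeTolerant` is not the utf8.js API.
const REPLACEMENT = '\uFFFD';

function isContinuation(byte) {
  return (byte & 0xC0) === 0x80; // 10xxxxxx
}

function decodeTolerant(byteString) {
  if (/[^\x00-\xFF]/.test(byteString)) {
    throw new RangeError('Input must be a byte string');
  }
  let out = '';
  let i = 0;
  while (i < byteString.length) {
    const lead = byteString.charCodeAt(i);

    // Stray continuation bytes: collapse the whole run into one U+FFFD.
    if (isContinuation(lead)) {
      while (i < byteString.length && isContinuation(byteString.charCodeAt(i))) i++;
      out += REPLACEMENT;
      continue;
    }

    // Work out how many continuation bytes the lead byte announces.
    let needed;
    let codePoint;
    if (lead < 0x80) { needed = 0; codePoint = lead; }
    else if ((lead & 0xE0) === 0xC0) { needed = 1; codePoint = lead & 0x1F; }
    else if ((lead & 0xF0) === 0xE0) { needed = 2; codePoint = lead & 0x0F; }
    else if ((lead & 0xF8) === 0xF0) { needed = 3; codePoint = lead & 0x07; }
    else { out += REPLACEMENT; i++; continue; } // 0xF8..0xFF are never valid
    i++;

    // Consume continuation bytes, stopping early at the first non-continuation.
    let read = 0;
    while (read < needed && i < byteString.length &&
           isContinuation(byteString.charCodeAt(i))) {
      codePoint = (codePoint << 6) | (byteString.charCodeAt(i) & 0x3F);
      i++;
      read++;
    }

    // Truncated sequence: emit U+FFFD and resume at the interrupting byte.
    if (read < needed) { out += REPLACEMENT; continue; }

    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    const minimum = [0, 0x80, 0x800, 0x10000][needed];
    if (codePoint < minimum || codePoint > 0x10FFFF ||
        (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
      out += REPLACEMENT;
      continue;
    }
    out += String.fromCodePoint(codePoint);
  }
  return out;
}

console.log(decodeTolerant('\xE2\xAC\xE2\x82\xAC') === '\uFFFD\u20AC');
console.log(decodeTolerant('\xE2\x82\xAC\x82\xAC\xE2\x82\xAC') === '\u20AC\uFFFD\u20AC');
```

For comparison, the WHATWG Encoding Standard decoder (exposed as `TextDecoder`) emits one U+FFFD per stray continuation byte, so it would return '\u20AC\uFFFD\uFFFD\u20AC' for the second input. Whether to collapse the run is a design choice the PR would need to settle.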

Development

Successfully merging this pull request may close these issues.

Add error-tolerant mode
4 participants