Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712
Comments
assigned @cosmo0920
I realized I was referring to an obsoleted JSON RFC; however, nothing changed in the latest one regarding Unicode escaping. I also did a little code search and found the likely place of the problem:
else if (str[0] == 'u') {
    while (i < size && hex_digit(str[i]) && dno < 4) {
        digs[dno++] = str[i++];
    }
    if (dno > 0) {
        ch = strtol(digs, NULL, 16);
    }
}
The referenced code above simply converts one \u#### escape into a code point on its own, without looking at whether it is half of a surrogate pair. Interestingly enough, just below the referenced code there is further parsing logic.
The following code could serve as fix inspiration. As I'm far from being fluent-bit-fluent, I haven't properly tackled the several error cases and just returned -1.
static bool is_high_surrogate(uint32_t ch) {
    return ch >= 0xD800 && ch <= 0xDBFF;
}
static bool is_low_surrogate(uint32_t ch) {
    return ch >= 0xDC00 && ch <= 0xDFFF;
}
static uint32_t combine_surrogates(uint32_t high, uint32_t low) {
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));
}
static int u8_read_escape_sequence(const char *str, int size, uint32_t *dest)
{
    // ....
    if (str[0] == 'u') {
        // Parse the first 4 hex digits
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno == 4) {
            ch = strtol(digs, NULL, 16);
            if (is_low_surrogate(ch)) {
                // Invalid: low surrogate without preceding high surrogate
                return -1;
            } else if (is_high_surrogate(ch)) {
                // Handle surrogate pair: the next 6 bytes must be "\uXXXX"
                // (<= rather than <, so a pair ending exactly at the buffer end is accepted)
                if (i + 6 <= size && str[i] == '\\' && str[i + 1] == 'u') {
                    dno = 0;
                    i += 2; // Skip "\u"
                    while (i < size && hex_digit(str[i]) && dno < 4) {
                        digs[dno++] = str[i++];
                    }
                    if (dno == 4) {
                        uint32_t low = strtol(digs, NULL, 16);
                        if (is_low_surrogate(low)) {
                            ch = combine_surrogates(ch, low);
                        } else {
                            // Invalid: high surrogate not followed by low surrogate
                            return -1;
                        }
                    } else {
                        // Incomplete low surrogate
                        return -1;
                    }
                } else {
                    // Invalid: high surrogate not followed by \u
                    return -1;
                }
            }
        } else {
            // Incomplete \u escape sequence
            return -1;
        }
    // ...
In case of those errors one could set
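For anyone who wants to try the idea outside the fluent-bit tree, here is a minimal standalone sketch of the same decoding logic. It is not fluent-bit code: `hex_value`, `read_hex4` and `decode_json_u_escape` are hypothetical names made up for this example, and the function assumes the input pointer sits on the backslash of the escape. Errors are simply reported as -1, as in the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Read exactly four hex digits from s. Returns 0 on success, -1 otherwise. */
static int read_hex4(const char *s, size_t len, uint32_t *out)
{
    uint32_t v = 0;
    size_t k;

    if (len < 4) return -1;
    for (k = 0; k < 4; k++) {
        int d = hex_value(s[k]);
        if (d < 0) return -1;
        v = (v << 4) | (uint32_t) d;
    }
    *out = v;
    return 0;
}

/*
 * Decode one JSON \uXXXX escape, or a \uXXXX\uXXXX surrogate pair,
 * starting at s[0] == '\\'. Hypothetical helper, not a fluent-bit API.
 * Returns the number of input bytes consumed (6 or 12), or -1 on error.
 */
static int decode_json_u_escape(const char *s, size_t len, uint32_t *cp)
{
    uint32_t hi, lo;

    if (len < 6 || s[0] != '\\' || s[1] != 'u' ||
        read_hex4(s + 2, len - 2, &hi) != 0) {
        return -1;                      /* incomplete or malformed escape */
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF) {
        return -1;                      /* low surrogate without a high one */
    }
    if (hi < 0xD800 || hi > 0xDBFF) {
        *cp = hi;                       /* plain BMP code point */
        return 6;
    }
    /* High surrogate: a \uXXXX low surrogate must follow immediately. */
    if (len < 12 || s[6] != '\\' || s[7] != 'u' ||
        read_hex4(s + 8, len - 8, &lo) != 0) {
        return -1;
    }
    if (lo < 0xDC00 || lo > 0xDFFF) {
        return -1;                      /* high surrogate not followed by low */
    }
    *cp = 0x10000u + (((hi - 0xD800u) << 10) | (lo - 0xDC00u));
    return 12;
}
```

Called on the 12-byte input `\ud83e\udd17`, this consumes all 12 bytes and yields U+1F917. A lone `\ud83e` or `\udd17` is rejected. Whether rejecting is the right policy for fluent-bit, as opposed to substituting some replacement value, is a separate decision that this sketch does not make.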
Bug Report
Describe the bug
All characters in a JSON string are, per its specification, Unicode, and all of them can be escaped using the \u#### notation. A single escape covers only code points in the Basic Multilingual Plane (U+0000 - U+FFFF); code points in higher Unicode planes, such as emojis, are specified to be encoded as a UTF-16 surrogate pair, e.g. \ud83e\udd17, the UTF-16 surrogate pair for the "hugging face" emoji 🤗 U+1F917.
This escaped surrogate pair needs to be parsed as a single character, while fluent-bit parses it as two standalone Unicode code points (i.e. U+D83E and U+DD17), which are in fact forbidden to appear in a correct Unicode string.
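To make the arithmetic concrete, here is a small sketch of how the pair from the example maps to one code point and then to UTF-8 bytes. The helper names `combine_pair` and `encode_utf8` are made up for illustration and are not part of fluent-bit. `combine_pair` assumes its arguments are already a valid high/low surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine a valid UTF-16 surrogate pair into one code point (no validation). */
static uint32_t combine_pair(uint32_t high, uint32_t low)
{
    return 0x10000u + (((high - 0xD800u) << 10) | (low - 0xDC00u));
}

/*
 * Encode a code point as UTF-8 into out (room for at least 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and
 * values above U+10FFFF, which have no valid UTF-8 encoding.
 */
static size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;   /* a lone surrogate is not a valid Unicode scalar value */
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For the pair above, `combine_pair(0xD83E, 0xDD17)` gives 0x1F917, and encoding that yields the four bytes F0 9F A4 97. Passing U+D83E alone to this encoder is refused, which is the point of the bug report: each half on its own is not a code point that can be written out as valid UTF-8.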
To Reproduce
Set up a simple stdin-to-stdout pipeline and pass
{"text": "\ud83e\udd17"}
to stdin. The output comes out mangled.
Expected behavior
The output message should be:
Your Environment
-i stdin -o stdout
Additional context
This is mangling data dumped by the Python json module (its default is to use the escapes, so the string is actually ASCII) in our destination log database. There's a workaround: make the dumper emit Unicode strings, which don't trigger the problem in fluent-bit.