Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

Open
vit-zikmund opened this issue Dec 11, 2024 · 3 comments · May be fixed by #9799
Open

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

vit-zikmund opened this issue Dec 11, 2024 · 3 comments · May be fixed by #9799

Comments

@vit-zikmund
Copy link

Bug Report

Describe the bug
All characters in a JSON string are by its specification Unicode and all can be escaped using the \u#### notation. This works only for codepoints in the Basic Multilingual Plane (U+0000 - U+FFFF), higher unicode planes, like the emojis are specified to be encoded as a utf-16 surrogate pair, e.g. \ud83e\udd17, the utf-16 surrogate pair for the "hugging face" emoji 🤗 U+1F917.

This escaped surrogate pair needs to be parsed as a single character, while fluent-bit parses them as two standalone unicode codepoints (i.e. U+D83E and U+DD17), which are in fact forbidden to appear in a correct Unicode string.

To Reproduce
Setup a simple stdin to stdout pipeline, pass {"text": "\ud83e\udd17"} to stdin.
Out comes a mangled

[{"date":1733946314.083699,"text":"������"}]

Expected behavior
The output message should be:

[{"date":1733946229.895005,"text":"🤗"}]

Your Environment

  • Version used: 3.2.2
  • Configuration: -i stdin -o stdout
  • Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
  • Server type and version: n/a
  • Operating System and version: Fedora Linux 40
  • Filters and plugins: none

Additional context
This is mangling python json module dumped data (its default is to use the escapes, so the string is actually ASCII) in our destination log database.

There's a workaround to make the dumper use Unicode strings, which don't trigger the problem in fluent-bit.

@vit-zikmund
Copy link
Author

FYI, the reason there are 6 replacement characters (�) in place of the 🤗 emoji is because its standalone high surrogate U+d83e is utf-8 encoded as ed a0 be (3 bytes, that don't map to a valid character) and the low surrogate U+dd17 is encoded as ed b4 97 (the other 3 invalid bytes).

@edsiper
Copy link
Member

edsiper commented Dec 17, 2024

assigned @cosmo0920

@vit-zikmund
Copy link
Author

I realized I was referring to an obsoleted JSON RFC, however, nothing changed in the latest regarding Unicode escaping.

I also did a little code search and found the likely place of the problem - u8_read_escape_sequence:

    else if (str[0] == 'u') {
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno > 0) {
            ch = strtol(digs, NULL, 16);
        }
    }

The referenced code above simply converts one \uXXXX escape string to a unicode character (ch) without checking if it's a surrogate pair member. This check needs to be added.

Interestingly enough, just below the referenced code there's a logic parsing \Uxxxxxxxx, which is an escape sequence used in many programming languages and data exchange formats like C/C++, Python or YAML, but that's kinda it. JSON is a part of a different bunch (with Java and C#) that are using those awful surrogate pairs to achieve the same goal. This code can stay there, because it doesn't really hurt anything ;)

The following code could serve as a fix inspiration. As I'm far from being fluent-bit-fluent, I haven't properly tackled the several error cases, just returned -1 for show (wasn't sure if flb_error would be a good fit there):

static bool is_high_surrogate(uint32_t ch) {
    return ch >= 0xD800 && ch <= 0xDBFF;
}

static bool is_low_surrogate(uint32_t ch) {
    return ch >= 0xDC00 && ch <= 0xDFFF;
}

static uint32_t combine_surrogates(uint32_t high, uint32_t low) {
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));

static int u8_read_escape_sequence(const char *str, int size, uint32_t *dest)
// ....
    if (str[0] == 'u') {
        // Parse the first 4 hex digits
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno == 4) {
            ch = strtol(digs, NULL, 16);
            if (is_low_surrogate(ch)) {
                // Invalid: low surrogate without preceding high surrogate
                return -1;
            } else if (is_high_surrogate(ch)) {
                // Handle surrogate pair
                if (i + 6 < size && str[i] == '\\' && str[i + 1] == 'u') {
                    dno = 0;
                    i += 2; // Skip "\u"
                    while (i < size && hex_digit(str[i]) && dno < 4) {
                        digs[dno++] = str[i++];
                    }
                    if (dno == 4) {
                        uint32_t low = strtol(digs, NULL, 16);
                        if (is_low_surrogate(low)) {
                            ch = combine_surrogates(ch, low);
                        } else {
                            // Invalid: high surrogate not followed by low surrogate
                            return -1;
                        }
                    } else {
                        // Incomplete low surrogate
                        return -1;
                    }
                } else {
                    // Invalid: high surrogate not followed by \u
                    return -1;
                }
            }
        } else {
            // Incomplete \u escape sequence
            return -1;
        }
// ...

In case of those errors one could set ch to the Unicode Replacement Character to not die entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants