Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

vit-zikmund · 2024-12-11T20:50:24Z

Bug Report

Describe the bug
All characters in a JSON string are by its specification Unicode and all can be escaped using the \u#### notation. This works only for codepoints in the Basic Multilingual Plane (U+0000 - U+FFFF), higher unicode planes, like the emojis are specified to be encoded as a utf-16 surrogate pair, e.g. \ud83e\udd17, the utf-16 surrogate pair for the "hugging face" emoji 🤗 U+1F917.

This escaped surrogate pair needs to be parsed as a single character, while fluent-bit parses them as two standalone unicode codepoints (i.e. U+D83E and U+DD17), which are in fact forbidden to appear in a correct Unicode string.

To Reproduce
Setup a simple stdin to stdout pipeline, pass {"text": "\ud83e\udd17"} to stdin.
Out comes a mangled

[{"date":1733946314.083699,"text":"������"}]

Expected behavior
The output message should be:

[{"date":1733946229.895005,"text":"🤗"}]

Your Environment

Version used: 3.2.2
Configuration: -i stdin -o stdout
Environment name and version (e.g. Kubernetes? What version?): docker image, 3.2.2
Server type and version: n/a
Operating System and version: Fedora Linux 40
Filters and plugins: none

Additional context
This is mangling python json module dumped data (its default is to use the escapes, so the string is actually ASCII) in our destination log database.

There's a workaround to make the dumper use Unicode strings, which don't trigger the problem in fluent-bit.

The text was updated successfully, but these errors were encountered:

vit-zikmund · 2024-12-12T08:33:08Z

FYI, the reason there are 6 replacement characters (�) in place of the 🤗 emoji is because its standalone high surrogate U+d83e is utf-8 encoded as ed a0 be (3 bytes, that don't map to a valid character) and the low surrogate U+dd17 is encoded as ed b4 97 (the other 3 invalid bytes).

edsiper · 2024-12-17T16:12:34Z

assigned @cosmo0920

vit-zikmund · 2025-01-03T16:12:19Z

I realized I was referring to an obsoleted JSON RFC, however, nothing changed in the latest regarding Unicode escaping.

I also did a little code search and found the likely place of the problem - u8_read_escape_sequence:

    else if (str[0] == 'u') {
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno > 0) {
            ch = strtol(digs, NULL, 16);
        }
    }

The referenced code above simply converts one \uXXXX escape string to a unicode character (ch) without checking if it's a surrogate pair member. This check needs to be added.

Interestingly enough, just below the referenced code there's a logic parsing \Uxxxxxxxx, which is an escape sequence used in many programming languages and data exchange formats like C/C++, Python or YAML, but that's kinda it. JSON is a part of a different bunch (with Java and C#) that are using those awful surrogate pairs to achieve the same goal. This code can stay there, because it doesn't really hurt anything ;)

The following code could serve as a fix inspiration. As I'm far from being fluent-bit-fluent, I haven't properly tackled the several error cases, just returned -1 for show (wasn't sure if flb_error would be a good fit there):

static bool is_high_surrogate(uint32_t ch) {
    return ch >= 0xD800 && ch <= 0xDBFF;
}

static bool is_low_surrogate(uint32_t ch) {
    return ch >= 0xDC00 && ch <= 0xDFFF;
}

static uint32_t combine_surrogates(uint32_t high, uint32_t low) {
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));

static int u8_read_escape_sequence(const char *str, int size, uint32_t *dest)
// ....
    if (str[0] == 'u') {
        // Parse the first 4 hex digits
        while (i < size && hex_digit(str[i]) && dno < 4) {
            digs[dno++] = str[i++];
        }
        if (dno == 4) {
            ch = strtol(digs, NULL, 16);
            if (is_low_surrogate(ch)) {
                // Invalid: low surrogate without preceding high surrogate
                return -1;
            } else if (is_high_surrogate(ch)) {
                // Handle surrogate pair
                if (i + 6 < size && str[i] == '\\' && str[i + 1] == 'u') {
                    dno = 0;
                    i += 2; // Skip "\u"
                    while (i < size && hex_digit(str[i]) && dno < 4) {
                        digs[dno++] = str[i++];
                    }
                    if (dno == 4) {
                        uint32_t low = strtol(digs, NULL, 16);
                        if (is_low_surrogate(low)) {
                            ch = combine_surrogates(ch, low);
                        } else {
                            // Invalid: high surrogate not followed by low surrogate
                            return -1;
                        }
                    } else {
                        // Incomplete low surrogate
                        return -1;
                    }
                } else {
                    // Invalid: high surrogate not followed by \u
                    return -1;
                }
            }
        } else {
            // Incomplete \u escape sequence
            return -1;
        }
// ...

In case of those errors one could set ch to the Unicode Replacement Character to not die entirely.

vit-zikmund added the status: waiting-for-triage label Dec 11, 2024

edsiper assigned cosmo0920 Dec 17, 2024

cosmo0920 linked a pull request Jan 6, 2025 that will close this issue

unescape: Decode \u escaped characters for surrogate pairs correctly #9799

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

vit-zikmund commented Dec 11, 2024

vit-zikmund commented Dec 12, 2024

edsiper commented Dec 17, 2024

vit-zikmund commented Jan 3, 2025

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

Incorrect parsing of escaped characters from higher unicode planes in a JSON string #9712

Comments

vit-zikmund commented Dec 11, 2024

Bug Report

vit-zikmund commented Dec 12, 2024

edsiper commented Dec 17, 2024

vit-zikmund commented Jan 3, 2025