
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string #16968

Conversation

@andsel andsel (Contributor) commented Jan 28, 2025

Release notes

[rn:skip]

What does this PR do?

This is a second take at fixing the processing of tokens from the tokenizer after a buffer full error. The first try (#16482) was rolled back due to the encoding error #16694.
The first attempt failed to return the tokens in the same encoding as the input.
This PR does a couple of things:

  • accumulates the tokens, so that after a buffer full condition it can resume with the tokens that follow the offending one.
  • respects the encoding of the input string. It uses the concat method instead of addAll, which avoids converting RubyString to String and back to RubyString. When returning the head StringBuilder, it enforces the encoding with the input charset (see the sketch below).
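
To illustrate the second point, here is a minimal sketch (not the PR's actual code) of why appending the RubyString elements directly preserves their encoding, whereas round-tripping through java.lang.String does not. It assumes a JRuby runtime on the classpath; exact method signatures may differ slightly across JRuby versions.

import org.jruby.Ruby;
import org.jruby.RubyArray;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;

public class ConcatEncodingSketch {
    public static void main(String[] args) {
        Ruby ruby = Ruby.newInstance();
        ThreadContext context = ruby.getCurrentContext();

        // 0xA3 is '£' in ISO-8859-1; tag the string with that encoding
        RubyString latin1 = RubyString.newString(ruby, new byte[]{(byte) 0xA3});
        latin1.force_encoding(context, ruby.newString("ISO-8859-1"));

        RubyArray input = ruby.newArray(latin1);
        RubyArray accumulator = ruby.newArray();

        // concat appends the RubyString objects themselves, so both bytes and
        // encoding survive; addAll would route through java.util.List semantics,
        // converting to java.lang.String and back, losing the original encoding.
        accumulator.concat(context, input);

        RubyString out = (RubyString) accumulator.eltInternal(0);
        System.out.println(out.getEncoding()); // prints ISO-8859-1
    }
}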

Why is it important/What is the impact to the user?

It permits the tokenizer to be used effectively also in contexts where a line is bigger than the configured limit.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works


How to test this PR locally

The test plan has two sides:

How to test the encoding is respected

Start up a REPL with Logstash and exercise the tokenizer:

$> bin/logstash -i irb
> buftok = FileWatch::BufferedTokenizer.new
> buftok.extract("\xA3".force_encoding("ISO8859-1")); buftok.flush.bytes
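
If the fix works, the expected output is [163]: the single 0xA3 byte of the ISO-8859-1 £, preserved unchanged rather than transcoded.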

or use the following script

require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

with Logstash running as

bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"

In the output the £ has to be present, and not the double-encoded mojibake Â£.

Related issues

@andsel andsel self-assigned this Jan 28, 2025
@andsel andsel force-pushed the fix/buffered_tokenizer_clean_state_in_case_of_line_too_big_respecting_character_encoding branch from 69bd4f4 to a84656e on January 28, 2025 16:02
Contributor

It looks like this PR modifies one or more .asciidoc files. These files are being migrated to Markdown, and any changes merged now will be lost. See the migration guide for details.

Contributor

📃 DOCS PREVIEW: https://logstash_bk_16968.docs-preview.app.elstc.co/diff

@andsel andsel force-pushed the fix/buffered_tokenizer_clean_state_in_case_of_line_too_big_respecting_character_encoding branch from b682c20 to b42ca05 on January 31, 2025 12:40
@andsel andsel changed the title Fix/buffered tokenizer clean state in case of line too big respecting character encoding Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string Jan 31, 2025
@andsel andsel added the bug label Jan 31, 2025
@andsel andsel marked this pull request as ready for review January 31, 2025 15:57
@andsel andsel (Contributor, Author) commented Feb 3, 2025

Uncovered use cases

This is a bugfix on the original code to make it respect sizeLimit when the first token is fragmented across different input buffers.
However, like the original implementation, this doesn't cover the case where the oversized token is not the first of the data fragment but is in the middle.
Consider a sizeLimit of 100: if the second token is wider than 100 chars, no error is raised and the token is parsed anyway.

Check with the pipeline:

input {
  tcp {
    port => 1234

    codec => json_lines {
      decode_size_limit_bytes => 100
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

and a loading script such as:

require 'socket' 
require 'json'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

data = {"a" => "a"*10}.to_json + "\n" + {"b" => "b" * 105}.to_json; socket.write(data)
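# NB: the second JSON document above is ~113 bytes, well beyond the
# decode_size_limit_bytes => 100 limit, yet no error is raised (see the output below)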

socket.close

it produces an output like:

{
      "@version" => "1",
             "a" => "aaaaaaaaaa",
    "@timestamp" => 2025-02-03T10:40:06.093178Z
}
{
      "@version" => "1",
             "b" => "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
    "@timestamp" => 2025-02-03T10:40:06.094601Z
}

Ideal solution

To solve this problem, the BufferedTokenizer's extract method should return an iterator rather than an array (or list). The iterator should apply the boundary check on each next invocation.
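
A minimal sketch of that idea, under the assumption of a plain String-based tokenizer (the class name and error message are illustrative, not the actual BufferedTokenizerExt implementation):

import java.util.Iterator;

// Wraps a token stream and applies the size-limit check lazily, on every
// next() call, so an oversized token in the middle of a fragment also fails.
final class BoundedTokenIterator implements Iterator<String> {
    private final Iterator<String> tokens;
    private final int sizeLimit;

    BoundedTokenIterator(Iterator<String> tokens, int sizeLimit) {
        this.tokens = tokens;
        this.sizeLimit = sizeLimit;
    }

    @Override
    public boolean hasNext() {
        return tokens.hasNext();
    }

    @Override
    public String next() {
        String token = tokens.next();
        if (token.length() > sizeLimit) {
            throw new IllegalStateException("input buffer full: token of "
                + token.length() + " chars exceeds sizeLimit of " + sizeLimit);
        }
        return token;
    }
}

With this shape, the oversized second document in the example above would raise on its own next() call instead of being silently emitted.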

@donoghuc donoghuc (Member) left a comment


I think that this solves the bug!

While reviewing it I am a bit concerned about the shared state with encodingName. I see the logic you added in flush that determines whether extract has been run, based on that field having been set. However, I couldn't really come up with anything materially better.

For your consideration, I added a small diff with what I started, as a suggestion:

diff --git a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
index 63aa3c230..ea015d128 100644
--- a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
+++ b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
@@ -82,8 +82,11 @@ public class BufferedTokenizerExt extends RubyObject {
     @JRubyMethod
     @SuppressWarnings("rawtypes")
     public RubyArray extract(final ThreadContext context, IRubyObject data) {
-        RubyEncoding encoding = (RubyEncoding) data.convertToString().encoding(context);
-        encodingName = encoding.getEncoding().getCharsetName();
+        // Cache the encodingName. This state has implications for the `flush` method.
+        if (encodingName == null) {
+            RubyEncoding encoding = (RubyEncoding) data.convertToString().encoding(context);
+            encodingName = encoding.getEncoding().getCharsetName();
+        }
         final RubyArray entities = data.convertToString().split(delimiter, -1);
         if (!bufferFullErrorNotified) {
             input.clear();
@@ -137,8 +140,7 @@ public class BufferedTokenizerExt extends RubyObject {
             // in the accumulator, and clean the pending token part.
             headToken.append(input.shift(context)); // append buffer to first element and
             // create new RubyString with the data specified encoding
-            RubyString encodedHeadToken = RubyUtil.RUBY.newString(new ByteList(headToken.toString().getBytes(Charset.forName(encodingName))));
-            encodedHeadToken.force_encoding(context, RubyUtil.RUBY.newString(encodingName));
+            RubyString encodedHeadToken = toEncodedRubyString(context, headToken.toString());
             input.unshift(encodedHeadToken); // reinsert it into the array
             headToken = new StringBuilder();
         }
@@ -163,8 +165,7 @@ public class BufferedTokenizerExt extends RubyObject {
         // create new RubyString with the last data specified encoding, if exists
         RubyString encodedHeadToken;
         if (encodingName != null) {
-            encodedHeadToken = RubyUtil.RUBY.newString(new ByteList(buffer.toString().getBytes(Charset.forName(encodingName))));
-            encodedHeadToken.force_encoding(context, RubyUtil.RUBY.newString(encodingName));
+            encodedHeadToken = toEncodedRubyString(context, buffer.toString());
         } else {
             // When used with TCP input it could be that on socket connection the flush method
             // is invoked while no invocation of extract, leaving the encoding name unassigned.
@@ -183,4 +184,10 @@ public class BufferedTokenizerExt extends RubyObject {
         return RubyUtil.RUBY.newBoolean(headToken.toString().isEmpty() && (inputSize == 0));
     }
 
+    private RubyString toEncodedRubyString(ThreadContext context, String input) {
+        // Depends on the encodingName being set by the extract method, could potentially raise if not set. 
+        RubyString result = RubyUtil.RUBY.newString(new ByteList(input.getBytes(Charset.forName(encodingName))));
+        result.force_encoding(context, RubyUtil.RUBY.newString(encodingName));
+        return result;
+    }
 }

Mainly, the change makes the caching behavior of encodingName a bit clearer and deduplicates some somewhat complex code.

I don't feel super strongly about this suggestion though. Take it or leave it (if you do want to incorporate it, bounce it back to me and I can do a quick review, as I've spent quite a bit of time on this today and have all the context in my head).

@robbavey robbavey (Member) left a comment


A couple of suggestions to add more context to exceptions, to help us if we encounter this issue in the wild.

// is invoked while no invocation of extract, leaving the encoding name unassigned.
// In such case also the headToken must be empty
if (!buffer.toString().isEmpty()) {
    throw new IllegalStateException("invoked flush with unassigned encoding but not empty head token, this shouldn't happen");
Member

Is there additional context we can add to the exception? Maybe the buffer contents? I would also remove the "this shouldn't happen" text from the exception message body.

Is this code path likely?

Contributor Author (@andsel)

This code path should never be executed. If we reach it, it means that the following conditions are both true:

  1. the encoding field is not assigned
  2. the head token has a value

Regarding 1, every RubyString has an encoding assigned (in the worst case it defaults to UTF-8), so if we pass through the line at https://github.com/elastic/logstash/pull/16968/files#diff-5c7f8990e98f54782395d29b4b1b5b68cf6f782b34af5eb8f1b5a77331e0172eR86 we have an encoding.
headToken can be empty, but it can't be that no encoding is assigned.

// create new RubyString with the data specified encoding
RubyString encodedHeadToken = RubyUtil.RUBY.newString(new ByteList(headToken.toString().getBytes(Charset.forName(encodingName))));
encodedHeadToken.force_encoding(context, RubyUtil.RUBY.newString(encodingName));
input.unshift(encodedHeadToken); // reinsert it into the array
Contributor

Ah, shift... AFAIK it is an O(N) operation; I wonder if there is an improvement we can make...

@andsel andsel (Contributor, Author) commented Feb 4, 2025

If you are referring to unshift: in the existing code we always invoke

entities.unshift(input.join(context));

so we shouldn't have introduced any bottleneck that wasn't present in the previous version.

- extracted common code used in string encoding
- avoid full package import
- better exception message with details on the limit exceeded

@elasticmachine (Collaborator) commented

💚 Build Succeeded

History

cc @andsel

@donoghuc donoghuc (Member) left a comment


Awesome test coverage BTW ❤️

@mashhurs mashhurs (Contributor) left a comment


LGTM!
Thank you @andsel, this is great!

@andsel andsel merged commit 1c8cf54 into elastic:main Feb 5, 2025
7 checks passed
@andsel andsel (Contributor, Author) commented Feb 5, 2025

@logstashmachine backport 9.0

github-actions bot pushed a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968)

Permits the tokenizer to be used effectively also in contexts where a line is bigger than a limit.
Fixes an issue related to the token size limit error: when the offending token was bigger than the input fragment, the tokenizer was unable to recover the token stream from the first delimiter after the offending token, and lost part of the tokens.

## How the problem is solved
This is a second take at fixing the processing of tokens from the tokenizer after a buffer full error. The first try (#16482) was rolled back due to the encoding error #16694.
The first attempt failed to return the tokens in the same encoding as the input.
This PR does a couple of things:
- accumulates the tokens, so that after a buffer full condition it can resume with the tokens that follow the offending one.
- respects the encoding of the input string. It uses the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString. When returning the head `StringBuilder` it enforces the encoding with the input charset.

(cherry picked from commit 1c8cf54)
@github-actions github-actions bot added the v9.0.0 label Feb 5, 2025
@andsel andsel (Contributor, Author) commented Feb 5, 2025

@logstashmachine backport 8.x

github-actions bot pushed a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968)

(cherry picked from commit 1c8cf54; commit message identical to the one above)
@andsel andsel (Contributor, Author) commented Feb 5, 2025

@logstashmachine backport 8.18

@andsel andsel (Contributor, Author) commented Feb 5, 2025

@logstashmachine backport 8.17

@andsel andsel (Contributor, Author) commented Feb 5, 2025

@logstashmachine backport 8.16

github-actions bot pushed a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968)

(cherry picked from commit 1c8cf54; commit message identical to the one above)
github-actions bot pushed a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968)

(cherry picked from commit 1c8cf54; commit message identical to the one above)
github-actions bot pushed a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968)

(cherry picked from commit 1c8cf54; commit message identical to the one above)
andsel added a commit that referenced this pull request Feb 5, 2025
Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string (#16968) (#17018)

Backport PR #16968 to 9.0 branch, original message:

----

(commit message identical to the one above)

(cherry picked from commit 1c8cf54)

Co-authored-by: Andrea Selva <[email protected]>
Successfully merging this pull request may close these issues.

Character encoding issues with refactored BufferedTokenizerExt