
Character encoding issues with refactored BufferedTokenizerExt #16694

Closed
donoghuc opened this issue Nov 19, 2024 · 1 comment · Fixed by #16968

Comments

@donoghuc
Member

With the changes introduced in https://github.com/elastic/logstash/pull/16482/commits, character encodings can be improperly handled, leading to corrupted data.

Logstash information:
The affected (released) versions are:

  • 8.15.4

Reproduction

The issue can be demonstrated by applying the following changes and running a small reproduction case in a REPL:

diff --git a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
index 2c36370af..7bd9e2e03 100644
--- a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
+++ b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
@@ -79,9 +79,25 @@ public class BufferedTokenizerExt extends RubyObject {
     @SuppressWarnings("rawtypes")
     public RubyArray extract(final ThreadContext context, IRubyObject data) {
         final RubyArray entities = data.convertToString().split(delimiter, -1);
+        // Debug before addAll
+        System.out.println("\n=== Before addAll ===");
+        for (int i = 0; i < entities.size(); i++) {
+            RubyString entity = (RubyString)entities.eltInternal(i);
+            System.out.println("Entity " + i + ":");
+            System.out.println("  Bytes: " + java.util.Arrays.toString(entity.getBytes()));
+            System.out.println("  Encoding: " + entity.getEncoding());
+        }
         if (!bufferFullErrorNotified) {
             input.clear();
             input.addAll(entities);
+            // Debug after addAll
+            System.out.println("\n=== After addAll ===");
+            for (int i = 0; i < input.size(); i++) {
+                RubyString stored = (RubyString)input.eltInternal(i);
+                System.out.println("Stored " + i + ":");
+                System.out.println("  Bytes: " + java.util.Arrays.toString(stored.getBytes()));
+                System.out.println("  Encoding: " + stored.getEncoding());
+            }
         } else {
             // after a full buffer signal
             if (input.isEmpty()) {
irb(main):001:0> line = LogStash::Plugin.lookup("codec", "line").new
=> <LogStash::Codecs::Line id=>"line_7fe29211-65b2-4931-985b-3ff04b227a90", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">
irb(main):002:0> buftok = FileWatch::BufferedTokenizer.new
=> #<FileWatch::BufferedTokenizer:0x350ce9db>
irb(main):003:0> buftok.extract("\xA3".force_encoding("ISO8859-1"))
irb(main):004:0> buftok.flush.bytes

=== Before addAll ===
Entity 0:
  Bytes: [-93]
  Encoding: ISO-8859-1

=== After addAll ===
Stored 0:
  Bytes: [-62, -93]
  Encoding: UTF-8
=> [194, 163]

We expect the single byte [163] (£ in ISO-8859-1), but instead we observe the double-encoded bytes [194, 163] (the UTF-8 representation of £).
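The two byte sequences can be checked directly in plain Java (a standalone sketch, independent of Logstash; the `\u00A3` literal and the standard charsets are the only inputs):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PoundEncoding {
    public static void main(String[] args) {
        // '£' (U+00A3) is a single byte 0xA3 in ISO-8859-1...
        byte[] latin1 = "\u00A3".getBytes(StandardCharsets.ISO_8859_1);
        // ...but two bytes 0xC2 0xA3 in UTF-8
        byte[] utf8 = "\u00A3".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [-93]
        System.out.println(Arrays.toString(utf8));   // [-62, -93]
    }
}
```

The signed-byte values printed by `Arrays.toString` match the debug output above: -93 is 0xA3 and -62 is 0xC2.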

Source of the bug
RubyArray.add (invoked by addAll) calls the conversion JavaUtil.convertJavaToUsableRubyObject(metaClass.runtime, element), which invokes a StringConverter that creates a new Unicode string; this appears to be the source of the extra encoding step.
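The effect of that conversion can be modeled in plain Java (an illustrative sketch of the RubyString → java.lang.String → RubyString round trip, not the actual JRuby code path):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xA3}; // '£' as tagged ISO-8859-1 in the RubyString
        // StringConverter-style step: decode the bytes into a Java (UTF-16) String
        String unicode = new String(raw, StandardCharsets.ISO_8859_1); // "£"
        // Re-encoding that String as UTF-8 produces the extra 0xC2 lead byte
        byte[] reEncoded = unicode.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(reEncoded)); // [-62, -93]
    }
}
```

The decode itself is lossless, but the original encoding tag is gone, so the bytes come back out in UTF-8 rather than ISO-8859-1.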

Additional information

package org.logstash.common;

import org.jruby.RubyArray;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
import org.jruby.runtime.builtin.IRubyObject;
import org.junit.Before;
import org.junit.Test;
import org.logstash.RubyUtil;

import static org.junit.Assert.assertEquals;
import static org.logstash.RubyUtil.RUBY;

@SuppressWarnings("rawtypes")
public class BoomTest {

    private IRubyObject rubyInput;

    private static void assertEqualsBytes(byte[] expected, byte[] actual) {
        assertEquals(expected.length, actual.length);
        for (int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }

    private ThreadContext context;

    private static final RubyString NEW_LINE = (RubyString) RubyUtil.RUBY.newString("\n").
            freeze(RubyUtil.RUBY.getCurrentContext());

    @Before
    public void setUp() {
        context = RUBY.getCurrentContext();
        RubyString rubyString = RubyString.newString(RUBY, new byte[]{(byte) 0xA3});
        rubyInput = rubyString.force_encoding(context, RUBY.newString("ISO8859-1"));
    }

    @Test
    public void testEncodingIsPreservedOutside() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // shift the first directly from entities, doesn't apply any charset conversion
        RubyString head = (RubyString) entities.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }

    @Test
    public void testEncodingIsPreservedOutsideAfterAdding() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // adding all entities and shifting the first from this secondary accumulator does some charset conversion
        RubyArray input = RubyUtil.RUBY.newArray();
        input.addAll(entities);
        RubyString head = (RubyString) input.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }
}
@robbavey robbavey assigned andsel and unassigned andsel Jan 10, 2025
@andsel
Copy link
Contributor

andsel commented Jan 22, 2025

Opened a JRuby issue for this (jruby/jruby#8583); perhaps they can shed some light on it.

andsel added a commit that referenced this issue Feb 5, 2025
…n respecting the encoding of the input string (#16968)

Permits effective use of the tokenizer even in contexts where a line is bigger than the size limit.
Fixes an issue related to the token-size-limit error: when the offending token was bigger than the input fragment, the tokenizer was unable to recover the token stream from the first delimiter after the offending token and lost part of the tokens.

## How the problem is solved
This is a second attempt to fix the processing of tokens from the tokenizer after a buffer-full error. The first attempt, #16482, was rolled back due to the encoding error reported in #16694: it failed to return the tokens in the same encoding as the input.
This PR does a couple of things:
- accumulates the tokens, so that after a buffer-full condition it can resume with the next tokens after the offending one.
- respects the encoding of the input string. It uses the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString. When returning the head `StringBuilder`, it enforces the input charset encoding.
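The idea behind the fix, avoiding any intermediate java.lang.String, can be sketched with a plain byte accumulator (illustrative only; the actual fix operates on RubyString and a StringBuilder inside BufferedTokenizerExt):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ByteAccumulator {
    public static void main(String[] args) {
        byte[] fragment = {(byte) 0xA3}; // '£' in ISO-8859-1
        // Accumulate raw bytes: no decode/encode round trip, hence no transcoding
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(fragment, 0, fragment.length);
        byte[] token = buffer.toByteArray();
        System.out.println(Arrays.toString(token)); // [-93] -- bytes preserved verbatim
    }
}
```

As long as the accumulator only ever concatenates same-encoded data and reapplies the input charset when emitting a token, no spurious re-encoding can occur.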
andsel added a commit that referenced this issue Feb 5, 2025
…fter a buffer full condition respecting the encoding of the input string (#16968) (#17018)

Backport of PR #16968 to the 9.0 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…fter a buffer full condition respecting the encoding of the input string (#16968) (#17019)

Backport of PR #16968 to the 8.x branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17020)

Backport of PR #16968 to the 8.18 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17022)

Backport of PR #16968 to the 8.17 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17021)

Backport of PR #16968 to the 8.16 branch.