
Character encoding issues with refactored BufferedTokenizerExt #16694

Closed
donoghuc opened this issue Nov 19, 2024 · 1 comment · Fixed by #16968

Comments

@donoghuc
Member

With the changes introduced in https://github.com/elastic/logstash/pull/16482/commits, character encodings can be improperly handled, leading to corrupted data.

Logstash information:
The affected (released) versions are:

  • 8.15.4

Reproduction

The issue can be demonstrated by applying the following changes and running a small reproduction case in a REPL:

diff --git a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
index 2c36370af..7bd9e2e03 100644
--- a/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
+++ b/logstash-core/src/main/java/org/logstash/common/BufferedTokenizerExt.java
@@ -79,9 +79,25 @@ public class BufferedTokenizerExt extends RubyObject {
     @SuppressWarnings("rawtypes")
     public RubyArray extract(final ThreadContext context, IRubyObject data) {
         final RubyArray entities = data.convertToString().split(delimiter, -1);
+        // Debug before addAll
+        System.out.println("\n=== Before addAll ===");
+        for (int i = 0; i < entities.size(); i++) {
+            RubyString entity = (RubyString)entities.eltInternal(i);
+            System.out.println("Entity " + i + ":");
+            System.out.println("  Bytes: " + java.util.Arrays.toString(entity.getBytes()));
+            System.out.println("  Encoding: " + entity.getEncoding());
+        }
         if (!bufferFullErrorNotified) {
             input.clear();
             input.addAll(entities);
+            // Debug after addAll
+            System.out.println("\n=== After addAll ===");
+            for (int i = 0; i < input.size(); i++) {
+                RubyString stored = (RubyString)input.eltInternal(i);
+                System.out.println("Stored " + i + ":");
+                System.out.println("  Bytes: " + java.util.Arrays.toString(stored.getBytes()));
+                System.out.println("  Encoding: " + stored.getEncoding());
+            }
         } else {
             // after a full buffer signal
             if (input.isEmpty()) {
irb(main):001:0> line = LogStash::Plugin.lookup("codec", "line").new
=> <LogStash::Codecs::Line id=>"line_7fe29211-65b2-4931-985b-3ff04b227a90", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">
irb(main):002:0> buftok = FileWatch::BufferedTokenizer.new
=> #<FileWatch::BufferedTokenizer:0x350ce9db>
irb(main):003:0> buftok.extract("\xA3".force_encoding("ISO8859-1"))
irb(main):004:0> buftok.flush.bytes

=== Before addAll ===
Entity 0:
  Bytes: [-93]
  Encoding: ISO-8859-1

=== After addAll ===
Stored 0:
  Bytes: [-62, -93]
  Encoding: UTF-8
=> [194, 163]

We expect the single byte [163] (£ in ISO-8859-1), but instead we observe the double-encoded bytes [194, 163] (the UTF-8 representation of £).
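The two byte sequences can be checked directly in plain Java (a standalone sketch, independent of Logstash; the `\u00A3` literal and the standard charsets are the only inputs):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PoundEncoding {
    public static void main(String[] args) {
        // '£' (U+00A3) is a single byte 0xA3 in ISO-8859-1...
        byte[] latin1 = "\u00A3".getBytes(StandardCharsets.ISO_8859_1);
        // ...but two bytes 0xC2 0xA3 in UTF-8
        byte[] utf8 = "\u00A3".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [-93]
        System.out.println(Arrays.toString(utf8));   // [-62, -93]
    }
}
```

The signed-byte values printed by `Arrays.toString` match the debug output above: -93 is 0xA3 and -62 is 0xC2.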

Source of the bug
RubyArray.add (invoked by addAll) calls the conversion JavaUtil.convertJavaToUsableRubyObject(metaClass.runtime, element), which invokes a StringConverter that creates a new Unicode string; this appears to be the source of the extra encoding step.
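The effect of that conversion can be modeled in plain Java (an illustrative sketch of the RubyString → java.lang.String → RubyString round trip, not the actual JRuby code path):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xA3}; // '£' as tagged ISO-8859-1 in the RubyString
        // StringConverter-style step: decode the bytes into a Java (UTF-16) String
        String unicode = new String(raw, StandardCharsets.ISO_8859_1); // "£"
        // Re-encoding that String as UTF-8 produces the extra 0xC2 lead byte
        byte[] reEncoded = unicode.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(reEncoded)); // [-62, -93]
    }
}
```

The decode itself is lossless, but the original encoding tag is gone, so the bytes come back out in UTF-8 rather than ISO-8859-1.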

Additional information

package org.logstash.common;

import org.jruby.RubyArray;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
import org.jruby.runtime.builtin.IRubyObject;
import org.junit.Before;
import org.junit.Test;
import org.logstash.RubyUtil;

import static org.junit.Assert.assertEquals;
import static org.logstash.RubyUtil.RUBY;

@SuppressWarnings("rawtypes")
public class BoomTest {

    private IRubyObject rubyInput;

    private static void assertEqualsBytes(byte[] expected, byte[] actual) {
        assertEquals(expected.length, actual.length);
        for (int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }

    private ThreadContext context;

    private static final RubyString NEW_LINE = (RubyString) RubyUtil.RUBY.newString("\n").
            freeze(RubyUtil.RUBY.getCurrentContext());

    @Before
    public void setUp() {
        context = RUBY.getCurrentContext();
        RubyString rubyString = RubyString.newString(RUBY, new byte[]{(byte) 0xA3});
        rubyInput = rubyString.force_encoding(context, RUBY.newString("ISO8859-1"));
    }

    @Test
    public void testEncodingIsPreservedOutside() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // shift the first directly from entities, doesn't apply any charset conversion
        RubyString head = (RubyString) entities.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }

    @Test
    public void testEncodingIsPreservedOutsideAfterAdding() {
        final RubyArray entities = rubyInput.convertToString().split(NEW_LINE, -1);

        // adding all entities and shifting the first from this secondary accumulator does some charset conversion
        RubyArray input = RubyUtil.RUBY.newArray();
        input.addAll(entities);
        RubyString head = (RubyString) input.shift(context);

        assertEqualsBytes(new byte[]{(byte) 0xA3}, head.getBytes());
    }
}
@robbavey robbavey assigned andsel and unassigned andsel Jan 10, 2025
@andsel
Copy link
Contributor

andsel commented Jan 22, 2025

Opened a JRuby issue for this (jruby/jruby#8583); perhaps they can shed some light on it.

andsel added a commit that referenced this issue Feb 5, 2025
…n respecting the encoding of the input string (#16968)

Permits effective use of the tokenizer even in contexts where a line is bigger than the size limit.
Fixes an issue related to the token-size-limit error: when the offending token was bigger than the input fragment, the tokenizer was unable to recover the token stream from the first delimiter after the offending token and lost part of the tokens.

## How the problem is solved
This is a second attempt to fix the processing of tokens from the tokenizer after a buffer-full error. The first attempt, #16482, was rolled back due to the encoding error reported in #16694: it failed to return the tokens in the same encoding as the input.
This PR does a couple of things:
- accumulates the tokens, so that after a buffer-full condition it can resume with the next tokens after the offending one.
- respects the encoding of the input string. It uses the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString. When returning the head `StringBuilder`, it enforces the input charset encoding.
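The idea behind the fix, avoiding any intermediate java.lang.String, can be sketched with a plain byte accumulator (illustrative only; the actual fix operates on RubyString and a StringBuilder inside BufferedTokenizerExt):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ByteAccumulator {
    public static void main(String[] args) {
        byte[] fragment = {(byte) 0xA3}; // '£' in ISO-8859-1
        // Accumulate raw bytes: no decode/encode round trip, hence no transcoding
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(fragment, 0, fragment.length);
        byte[] token = buffer.toByteArray();
        System.out.println(Arrays.toString(token)); // [-93] -- bytes preserved verbatim
    }
}
```

As long as the accumulator only ever concatenates same-encoded data and reapplies the input charset when emitting a token, no spurious re-encoding can occur.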
andsel added a commit that referenced this issue Feb 5, 2025
…fter a buffer full condition respecting the encoding of the input string (#16968) (#17018)

Backport of PR #16968 to the 9.0 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…fter a buffer full condition respecting the encoding of the input string (#16968) (#17019)

Backport of PR #16968 to the 8.x branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17020)

Backport of PR #16968 to the 8.18 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17022)

Backport of PR #16968 to the 8.17 branch.
andsel added a commit that referenced this issue Feb 5, 2025
…after a buffer full condition respecting the encoding of the input string (#16968) (#17021)

Backport of PR #16968 to the 8.16 branch.