Character encoding issues with refactored BufferedTokenizerExt
#16694
I have opened a JRuby issue for this (jruby/jruby#8583); maybe they can shed some light on it.
andsel added a commit that referenced this issue on Feb 5, 2025

…n respecting the encoding of the input string (#16968)

Permits effective use of the tokenizer even in contexts where a line is bigger than the limit. Fixes an issue with the token-size-limit error: when the offending token was bigger than the input fragment, the tokenizer was unable to recover the token stream from the first delimiter after the offending token and lost part of the tokens.

## How the problem is solved

This is a second attempt to fix the processing of tokens from the tokenizer after a buffer-full error. The first attempt (#16482) was rolled back because of the encoding error (#16694): it failed to return the tokens in the same encoding as the input. This PR does a couple of things:
- It accumulates the tokens, so that after a full condition it can resume with the next tokens after the offending one.
- It respects the encoding of the input string, using the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString. When returning the head `StringBuilder`, it enforces the encoding of the input charset.
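The intent of the fix can be sketched in plain Ruby (the names below are illustrative, not the actual Logstash/JRuby code): accumulate the pending token as raw bytes and re-apply the input's encoding when emitting it, rather than letting an intermediate Java-to-Ruby string conversion pick a default charset.

```ruby
# Hedged sketch: accumulate bytes, then enforce the input's encoding on the
# emitted token. "head" stands in for the accumulator the PR describes.
input = "£100\n".encode(Encoding::ISO_8859_1)

head = String.new(encoding: Encoding::BINARY)
head << input.b                       # byte-wise accumulation, no re-encoding

token = head.force_encoding(input.encoding)
p token.bytes.first                   # 163 — the single ISO-8859-1 byte survives
```

Because no transcoding happens along the way, the pound sign stays a single 0xA3 byte instead of being widened to its two-byte UTF-8 form.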
github-actions bot pushed commits that referenced this issue on Feb 5, 2025, cherry-picking commit 1c8cf54 (#16968) to the release branches.
andsel added a commit that referenced this issue on Feb 5, 2025

…fter a buffer full condition respecting the encoding of the input string (#16968) (#17018)

Backport of PR #16968 to the 9.0 branch (cherry picked from commit 1c8cf54). Co-authored-by: Andrea Selva <[email protected]>
andsel added a commit that referenced this issue on Feb 5, 2025

…fter a buffer full condition respecting the encoding of the input string (#16968) (#17019)

Backport of PR #16968 to the 8.x branch (cherry picked from commit 1c8cf54). Co-authored-by: Andrea Selva <[email protected]>
andsel added a commit that referenced this issue on Feb 5, 2025

…after a buffer full condition respecting the encoding of the input string (#16968) (#17020)

Backport of PR #16968 to the 8.18 branch (cherry picked from commit 1c8cf54). Co-authored-by: Andrea Selva <[email protected]>
andsel added a commit that referenced this issue on Feb 5, 2025

…after a buffer full condition respecting the encoding of the input string (#16968) (#17022)

Backport of PR #16968 to the 8.17 branch (cherry picked from commit 1c8cf54). Co-authored-by: Andrea Selva <[email protected]>
andsel added a commit that referenced this issue on Feb 5, 2025

…after a buffer full condition respecting the encoding of the input string (#16968) (#17021)

Backport of PR #16968 to the 8.16 branch (cherry picked from commit 1c8cf54). Co-authored-by: Andrea Selva <[email protected]>
With the changes added in https://github.com/elastic/logstash/pull/16482/commits, character encodings can be handled improperly, leading to corrupted data.
Logstash information:
The affected (released) versions are:
Reproduction
The issue can be demonstrated by making the following changes and running the small reproduction case in a REPL:
We expect a single byte [163] (£ in ISO-8859-1), but instead we observe the double-encoded bytes [194, 163] (the UTF-8 representation of £).
Source of the bug
`RubyArray.add` (invoked by `addAll`) performs a conversion,
`JavaUtil.convertJavaToUsableRubyObject(metaClass.runtime, element)`,
which invokes a `StringConverter` that creates a new Unicode string; this appears to be the source of the extra encoding.
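A hedged one-line illustration of that conversion path: once the byte 0xA3 has been interpreted as the codepoint U+00A3 (as a Java UTF-16 String holds it), building a fresh UTF-8 string from that codepoint produces two bytes, matching the "extra encoding" described above.

```ruby
# Codepoint U+00A3 rendered into a new UTF-8 string becomes two bytes.
p 0xA3.chr(Encoding::UTF_8).bytes   # [194, 163]
```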