Detection and Tiny sequences #391

Closed
dpatryas-rtbhouse opened this issue Dec 8, 2023 · 1 comment

dpatryas-rtbhouse commented Dec 8, 2023

Describe the bug
The issue concerns charset_normalizer.detect(), which misdetects the encoding of a short byte sequence. Specifically, it identifies the sequence as Big5, so decoding it yields a Chinese character. The behavior can be reproduced with the code snippet below:

To Reproduce
Run the following snippet; charset_normalizer.detect() misidentifies the encoding of the sequence.

import charset_normalizer
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"
text.decode(charset_normalizer.detect(text)["encoding"])

Expected behavior
The expected behavior can be demonstrated with the chardet library, which recognizes the encoding as ISO-8859-1. Decoding then returns the intended degree-like character (º):

import chardet
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"
text.decode(chardet.detect(text)["encoding"])

Logs

charset-normalizer:
What Actions Will Keep Us at 1.5-2慢?

chardet:
What Actions Will Keep Us at 1.5-2ºC?

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}  # chardet
{'encoding': 'Big5', 'language': 'Chinese', 'confidence': 1.0}  # charset-normalizer
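
The difference between the two outputs above can be reproduced without any detector. A minimal sketch, using only Python's built-in codecs: under a single-byte encoding each byte becomes one character, while under Big5 the byte \xba acts as a lead byte and is combined with the following "C" into one two-byte character. The variable names are illustrative.

```python
# The byte sequence from the report.
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"

# Single-byte encoding: one character per byte, so \xba stays on its own.
as_latin1 = text.decode("latin_1")

# Multi-byte encoding: \xba is a lead byte and consumes the next byte ("C").
as_big5 = text.decode("big5")

print(len(text), len(as_latin1), len(as_big5))
print(as_latin1)
print(as_big5)
```

Both decodes succeed without error, which is why the bytes alone cannot tell the two readings apart.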

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.10
  • Package version 3.3.2 / 2.1.1
dpatryas-rtbhouse added the bug and help wanted labels Dec 8, 2023

Ousret (Member) commented Dec 9, 2023

I understand you've encountered a frustrating case.
There could be a minor misunderstanding about how a charset detector works.

Let's analyze the input you've shared, and thanks for that:
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"

We can only agree that the debate will focus on the \xba byte.
So charset-normalizer or chardet must decide "what did the original writer/author mean?"

OK, so far?

chardet knows about 35-ish encodings, and charset-normalizer around 100.

\xba can be decoded using most of the available encodings (those that extend ASCII, of course).
Here is a tiny extract of what it could translate to:

  • º
  • ş
  • ÷
  • Ί
  • ŗ
  • ¬
  • [

and so on...
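
The ambiguity can be checked directly. The following is a small sketch that decodes the single byte under a handful of codecs shipped with Python; the codec selection is an illustrative assumption, not the list either detector actually tries, and the helper name is hypothetical.

```python
# The ambiguous byte from the reported sequence.
BYTE = b"\xba"

# Python stdlib codec names; an illustrative selection, not an exhaustive one.
CODECS = ["latin_1", "iso8859_2", "iso8859_7", "iso8859_8", "cp1257", "cp500", "cp037"]


def decode_candidates(data, codecs):
    """Return {codec: decoded text} for every codec that accepts the bytes."""
    results = {}
    for name in codecs:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # This codec has no mapping for the byte; skip it.
            continue
    return results


candidates = decode_candidates(BYTE, CODECS)
for name, char in candidates.items():
    print(f"{name:10} -> {char!r} (U+{ord(char):04X})")
```

Every codec that succeeds gives a plausible character, so the byte itself carries no hint about which one the author meant.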

As a human, you've concluded that º (read as a degree sign) was the author's original intention because of the sentence "What Actions Will Keep Us at 1.5-2".

How do you teach that to a machine without taking 10 seconds to answer today?

Know that we did our best to answer as accurately as possible.
A way to improve this must exist; it's just out of reach for me right now.

chardet only answered correctly by pure luck.

Take this for example: "Charset Detection, for Everyone 👋", which encodes to Charset Detection, for Everyone \xf0\x9f\x91\x8b

>>> chardet.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'Windows-1254', 'confidence': 0.4957960183590231, 'language': 'Turkish'}

>>> charset_normalizer.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

Or "Je suis pas d'accord avec Ahméd", which encodes to Je suis pas d'accord avec Ahm\xc3\xa9d.

>>> chardet.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'ISO-8859-9', 'confidence': 0.5648588804140238, 'language': 'Turkish'}

>>> charset_normalizer.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

Now, luck is out of the equation thanks to our ability to detect Unicode.
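
To illustrate why Unicode is the easy case, here is a rough sketch of a strict UTF-8 validity check. It is not charset-normalizer's actual implementation, and is_valid_utf8 is a hypothetical helper; it only shows that multi-byte UTF-8 sequences follow a structure that arbitrary bytes rarely satisfy by accident.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The two examples above are well-formed UTF-8.
emoji_sample = "Charset Detection, for Everyone 👋".encode("utf-8")
french_sample = "Je suis pas d'accord avec Ahméd".encode("utf-8")

# The reported sequence: \xba is a continuation byte with no lead byte before it.
reported = b"What Actions Will Keep Us at 1.5-2\xbaC?"

for sample in (emoji_sample, french_sample, reported):
    print(sample, is_valid_utf8(sample))
```

The reported sequence fails this check, so a detector has to fall back to guessing among legacy encodings, which is where the ambiguity comes from.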

Hope that clarifies,

Ousret closed this as not planned Dec 9, 2023
Ousret removed the bug and help wanted labels Dec 9, 2023
Ousret changed the title from "[BUG] Incorrect Encoding Detection for Expression Sequences in charset_normalizer.detect()" to "Detection and Tiny sequences" Dec 9, 2023
Ousret pinned this issue Dec 9, 2023