Detection and Tiny sequences #391
I understand you've encountered a frustrating case. Let's analyze the input you've shared, and thanks for that: We can only agree that the debate will focus on the OK, so far? chardet knows about 35-ish encodings, and charset-normalizer around 100.
and so on... As a human, you've concluded that How do you teach that to a machine without taking 10 seconds to answer today? Know that we did our best to answer as accurately as possible. chardet only answered correctly by pure luck. Take this for example:
Or
Now, luck is out of the equation thanks to our ability to detect Unicode. Hope that clarifies.
Describe the bug
The issue pertains to the charset_normalizer.detect() method, which fails to detect the correct encoding for short sequences. Specifically, the method identifies the sequence as Big5, so the degree character and the letter that follows it are decoded as a single Chinese character. This behavior can be observed with the provided code snippet:
To Reproduce
Execute the code snippet, where the charset_normalizer.detect() method misidentifies the encoding of the sequence.
Expected behavior
The expected behavior can be demonstrated using the chardet library, which accurately recognizes the encoding as ISO 8859-1. The correct degree character is then returned from the sequence:
Logs
charset-normalizer:
What Actions Will Keep Us at 1.5-2慢?
chardet:
What Actions Will Keep Us at 1.5-2ºC?
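The two log lines above can be reproduced with the standard library alone, which shows why the input is ambiguous. This is a minimal sketch, not the original snippet from the report: it assumes the input bytes were the ISO 8859-1 encoding of the chardet output line, and the variable names are illustrative. Under that assumption the sequence is pure ASCII except for the byte 0xBA followed by "C" (0x43), and that pair is also a valid two-byte Big5 character.

```python
# Assumption: the reported input is the chardet log line encoded as ISO 8859-1.
# "\u00ba" is the character shown before "C" in that log line.
raw = "What Actions Will Keep Us at 1.5-2\u00baC?".encode("iso-8859-1")

# Both decodings succeed without error, so validity alone cannot pick a winner.
as_latin1 = raw.decode("iso-8859-1")
as_big5 = raw.decode("big5")

print(raw[-3:])    # the only non-ASCII byte, the "C" after it, and "?"
print(as_latin1)   # one character per byte
print(as_big5)     # 0xBA 0x43 collapses into one CJK character
```

Every other byte decodes identically in both encodings, so a detector has only that one byte pair to decide on.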
Desktop (please complete the following information):