
Release 3.1 (#270)
Ousret authored Mar 6, 2023
1 parent 86617ac commit db9af43
Showing 7 changed files with 99 additions and 69 deletions.
7 changes: 5 additions & 2 deletions CHANGELOG.md
@@ -2,14 +2,17 @@
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [3.1.0-dev0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...master) (unreleased)
## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)

### Added
- Argument `should_rename_legacy` for the legacy `detect` function; any new arguments are disregarded without errors (PR #261)
- Argument `should_rename_legacy` for the legacy `detect` function; any new arguments are disregarded without errors (PR #262)

### Removed
- Support for Python 3.6 (PR #260)

### Changed
- Optional speedup provided by mypy/c 1.0.1

## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)

### Fixed
50 changes: 25 additions & 25 deletions README.md
@@ -23,18 +23,18 @@
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40
| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:------------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
@@ -50,15 +50,15 @@ Did you get there because of the logs? See [https://charset-normalizer.readthedo

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |
| Package | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |

Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.
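A minimal harness to reproduce numbers like the ones above on your own corpus (the sample payloads are placeholders of my own; only `from_bytes` is the library's real API):

```python
# Rough timing sketch: mean detection time per payload.
# Point `payloads` at your own corpus to get meaningful numbers.
import time

from charset_normalizer import from_bytes

payloads = [
    "Hello, world".encode("utf-8"),
    "Déjà vu, garçon".encode("cp1252"),
]

start = time.perf_counter()
for data in payloads:
    best = from_bytes(data).best()  # best (lowest-mess) match, or None
elapsed_ms = (time.perf_counter() - start) * 1000 / len(payloads)
print(f"mean per payload: {elapsed_ms:.2f} ms")
```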

@@ -185,15 +185,15 @@ Don't confuse package **ftfy** with charset-normalizer or chardet. ftfy's goal is
## 🍰 How

- Discard all charset encoding tables that could not fit the binary content.
- Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
- Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
- Extract matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.

**Wait a minute**, what is chaos/mess and coherence according to **YOU?**
**Wait a minute**, what is noise/mess and coherence according to **YOU?**

*Chaos:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
*Noise:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
I know that my interpretation of what is noise is probably incomplete; feel free to contribute in order to
improve or rewrite it.
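The scoring idea described above can be sketched as a toy loop (this is an illustration of the concept only, not the library's actual implementation; the real mess detector lives in `charset_normalizer.md` and uses far richer heuristics):

```python
# Toy illustration of "measure the mess" across candidate encodings:
# decode the payload with each candidate, score how suspicious the
# decoded text looks, and keep the least-messy candidate.
SUSPICIOUS = {"\ufffd", "\x00"}  # replacement char, NUL: strong mess signals


def mess_ratio(text: str) -> float:
    """Fraction of characters that look like decoding damage."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch in SUSPICIOUS or (not ch.isprintable() and ch not in "\r\n\t")
    )
    return bad / len(text)


def toy_detect(payload: bytes, candidates=("utf_8", "cp1252", "latin_1")) -> str:
    scores = {}
    for enc in candidates:
        try:
            scores[enc] = mess_ratio(payload.decode(enc))
        except (UnicodeDecodeError, LookupError):
            continue  # candidate table cannot fit the binary content
    return min(scores, key=scores.get)


print(toy_detect("Très bien".encode("utf_8")))
```

The real detector additionally chunks the input and layers the coherence (language) probe on top of this mess score.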

*Coherence:* For each language on earth, we have computed ranked letter-appearance occurrences (as best we can). So I thought
@@ -226,7 +226,7 @@ This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/L

Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)

## For Enterprise
## 💼 For Enterprise

Professional support for charset-normalizer is available as part of the [Tidelift
Subscription][1]. Tidelift gives software development teams a single source for
2 changes: 1 addition & 1 deletion bin/run_autofix.sh
@@ -7,5 +7,5 @@ fi

set -x

${PREFIX}black --target-version=py36 charset_normalizer
${PREFIX}black --target-version=py37 charset_normalizer
${PREFIX}isort charset_normalizer
2 changes: 1 addition & 1 deletion bin/run_checks.sh
@@ -8,7 +8,7 @@ fi
set -x

${PREFIX}pytest
${PREFIX}black --check --diff --target-version=py36 charset_normalizer
${PREFIX}black --check --diff --target-version=py37 charset_normalizer
${PREFIX}flake8 charset_normalizer
${PREFIX}mypy charset_normalizer
${PREFIX}isort --check --diff charset_normalizer
2 changes: 1 addition & 1 deletion charset_normalizer/version.py
@@ -2,5 +2,5 @@
Expose version
"""

__version__ = "3.1.0-dev0"
__version__ = "3.1.0"
VERSION = __version__.split(".")
19 changes: 18 additions & 1 deletion docs/community/faq.rst
@@ -40,7 +40,7 @@ If you use the legacy `detect` function,
then this change is mostly backward-compatible, with the exception of one thing:

- This new library supports way more code pages (x3) than its counterpart Chardet.
- Based on the 30-ish charsets that Chardet supports, expect roughly 85% BC results https://github.com/Ousret/charset_normalizer/pull/77/checks?check_run_id=3244585065
- Based on the 30-ish charsets that Chardet supports, expect roughly 80% BC results

We do not guarantee this exact BC percentage over time. It may vary, but not by much.

@@ -56,3 +56,20 @@ detection.

Any code page supported by your CPython is supported by charset-normalizer! It is that simple, no need to update the
library. It is as generic as we could make it.

I can't build a standalone executable
-------------------------------------

If you are using ``pyinstaller``, ``py2exe``, or the like, you may encounter this error or one close to it:

ModuleNotFoundError: No module named 'charset_normalizer.md__mypyc'

Why?

- Your package manager picked up an optimized (for speed) wheel that matches your architecture and operating system.
- The module ``charset_normalizer.md__mypyc`` is imported via binaries and can't be seen by your tool.

How to remedy?

If your bundler program supports it, set up a hook that implicitly imports the hidden module.
Otherwise, follow the guide on how to install the vanilla version of this package. (Section: *Optional speedup extension*)
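For PyInstaller specifically, such a hook can be a one-line hook file (a sketch; it assumes PyInstaller picks the file up from your configured hooks path, and the equivalent one-off fix is the ``--hidden-import charset_normalizer.md__mypyc`` command-line flag):

```python
# hook-charset_normalizer.py — a sketch of a PyInstaller hook file.
# Declaring the mypyc-compiled module as a hidden import lets the
# bundler include it even though it is only imported from binaries.
hiddenimports = ["charset_normalizer.md__mypyc", "charset_normalizer.md"]
```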
86 changes: 48 additions & 38 deletions docs/user/support.rst
@@ -124,41 +124,51 @@ Supported Languages
Those languages can be detected inside your content. All of them are specified in ./charset_normalizer/assets/__init__.py .


English,
German,
French,
Dutch,
Italian,
Polish,
Spanish,
Russian,
Japanese,
Portuguese,
Swedish,
Chinese,
Ukrainian,
Norwegian,
Finnish,
Vietnamese,
Czech,
Hungarian,
Korean,
Indonesian,
Turkish,
Romanian,
Farsi,
Arabic,
Danish,
Serbian,
Lithuanian,
Slovene,
Slovak,
Malay,
Hebrew,
Bulgarian,
Croatian,
Hindi,
Estonian,
Thai,
Greek,
Tamil.
| English,
| German,
| French,
| Dutch,
| Italian,
| Polish,
| Spanish,
| Russian,
| Japanese,
| Portuguese,
| Swedish,
| Chinese,
| Ukrainian,
| Norwegian,
| Finnish,
| Vietnamese,
| Czech,
| Hungarian,
| Korean,
| Indonesian,
| Turkish,
| Romanian,
| Farsi,
| Arabic,
| Danish,
| Serbian,
| Lithuanian,
| Slovene,
| Slovak,
| Malay,
| Hebrew,
| Bulgarian,
| Croatian,
| Hindi,
| Estonian,
| Thai,
| Greek,
| Tamil.
----------------------------
Incomplete Sequence / Stream
----------------------------

It is not (yet) officially supported. If you feed an incomplete byte sequence (e.g. a truncated multi-byte sequence), the detector will
most likely fail to return a proper result.
If you are purposely feeding only part of your payload for performance reasons, you can stop doing so, as this package is fairly well optimized.

We are working on a dedicated way to handle streams.
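A short illustration of why truncated multi-byte input is dangerous (the payload is a made-up example; only `from_bytes` is the library's real API):

```python
# A multi-byte UTF-8 character cut in half: the trailing byte 0xC3
# opens a 2-byte sequence that never completes.
from charset_normalizer import from_bytes

full = "café".encode("utf-8")  # b'caf\xc3\xa9'
truncated = full[:-1]          # b'caf\xc3' — sequence cut mid-character

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("plain decode fails:", exc.reason)

# charset-normalizer will not raise, but with the sequence broken it
# may settle on a single-byte charset instead of UTF-8 — hence the
# advice: feed the full payload.
best = from_bytes(truncated).best()
print(best.encoding if best else None)
```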
