feat!: Change default character error handling to "surrogateescape"
This addresses the long-standing ergonomic issue #43 that arises when dealing with files of varying or unknown character encoding.

Previously, the library assumed that both input and output files were UTF-8 and failed if this was incorrect, forcing the user to provide the appropriate character encoding.

After this commit, UTF-8 is still the default input/output encoding, but the default error handling has changed from "strict" to "surrogateescape", i.e. bytes that are not valid UTF-8 are read as lone Unicode surrogate code points (U+DC80 to U+DCFF), which are turned back into the original bytes on output. To get the previous behaviour, use `SSAFile.load(..., errors=None)` and `SSAFile.save(..., errors=None)`.

For text processing, you should still specify the encoding explicitly; otherwise you will get surrogate code points instead of non-ASCII characters when inspecting the SSAFile. Note that multi-byte encodings may still break the parser; parsing with surrogate escapes works best with ASCII-like encodings.
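The round trip described above can be illustrated with Python's built-in codecs alone. This is a minimal sketch that does not call the library itself; the sample byte strings and variable names are made up for illustration, and the Shift-JIS case is only one plausible way a multi-byte encoding could confuse a parser, not a statement about what the parser actually does.

```python
# Bytes of "café" encoded as Latin-1; a lone 0xE9 is not valid UTF-8.
raw = b"caf\xe9"

# "strict" error handling (the previous default) rejects this input.
try:
    raw.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# For text processing, decoding with the real encoding gives readable text.
readable = raw.decode("latin-1")

# Multi-byte caveat: in Shift-JIS the second byte of a character can be an
# ASCII byte. Here the trail byte is 0x5C, which decodes as a backslash.
sjis = "表".encode("shift_jis")
escaped = sjis.decode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(text), round_tripped == raw, readable, ascii(escaped))
```

The surrogate code points are placeholders rather than real characters, which is why inspecting `text` shows `\udce9` instead of `é` until the correct encoding is given.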
Showing 5 changed files with 253 additions and 16 deletions.