fix!: Only use "surrogateescape" error handling in CLI, revert to default handling in API

Changing the default error handling in the Python API of the library is a breaking change, so let's make it opt-in for now. The benefit of "surrogateescape" seems clearer in the CLI: since there is currently no feature that works with subtitle text, it only removes (some of) the annoying errors about character encoding, without breaking anything.
tkarabela · May 18, 2024 · 445f1c3
1 parent 5c98e6c

Showing 4 changed files with 65 additions and 40 deletions.
diff --git a/docs/cli.rst b/docs/cli.rst
@@ -34,18 +34,17 @@ CLI parameters
 
 ::
 
-    usage: pysubs2 [-h] [-v] [-f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [-t {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [--input-enc ENCODING] [--output-enc ENCODING] [--fps FPS] [-o DIR] [--clean] [--verbose]
+    usage: pysubs2 [-h] [-v] [-f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [-t {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [--input-enc ENCODING] [--output-enc ENCODING] [--enc-error-handling {strict,surrogateescape}] [--fps FPS] [-o DIR] [--clean] [--verbose]
                    [--shift TIME | --shift-back TIME | --transform-framerate FPS1 FPS2] [--srt-keep-unknown-html-tags] [--srt-keep-html-tags] [--srt-keep-ssa-tags] [--sub-no-write-fps-declaration]
-                   [FILE [FILE ...]]
+                   [FILE ...]
 
     The pysubs2 CLI for processing subtitle files.
     https://github.com/tkarabela/pysubs2
 
     positional arguments:
-      FILE                  Input subtitle files. Can be in SubStation Alpha (*.ass, *.ssa), SubRip (*.srt), MicroDVD (*.sub) or other supported format. When no files are specified, pysubs2 will work as a pipe, reading from
-                            standard input and writing to standard output.
+      FILE                  Input subtitle files. Can be in SubStation Alpha (*.ass, *.ssa), SubRip (*.srt), MicroDVD (*.sub) or other supported format. When no files are specified, pysubs2 will work as a pipe, reading from standard input and writing to standard output.
 
-    optional arguments:
+    options:
       -h, --help            show this help message and exit
       -v, --version         show program's version number and exit
       -f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}, --from {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}
@@ -54,12 +53,13 @@ CLI parameters
                             Convert subtitle files to given format. By default, each file is saved in its original format.
       --input-enc ENCODING  Character encoding for input files. By default, UTF-8 is used for both input and output.
       --output-enc ENCODING
-                            Character encoding for output files. By default, it is the same as input encoding. If you wish to convert between encodings, make sure --input-enc is set correctly! Otherwise, your output files will
-                            probably be corrupted. It's a good idea to back up your files or use the -o option.
+                            Character encoding for output files. By default, it is the same as input encoding. If you wish to convert between encodings, make sure --input-enc is set correctly! Otherwise, your output files will probably be corrupted. It's a good idea to back up your files or use the -o
+                            option.
+      --enc-error-handling {strict,surrogateescape}
+                            Character encoding error handling for input and output. Defaults to 'surrogateescape' which passes through unrecognized characters to output unchanged. Use 'strict' if you want the command to fail when encountering a character incompatible with selected input/output encoding.
       --fps FPS             This argument specifies framerate for MicroDVD files. By default, framerate is detected from the file. Use this when framerate specification is missing or to force different framerate.
       -o DIR, --output-dir DIR
-                            Use this to save all files to given directory. By default, every file is saved to its parent directory, ie. unless it's being saved in different subtitle format (and thus with different file
-                            extension), it overwrites the original file.
+                            Use this to save all files to given directory. By default, every file is saved to its parent directory, ie. unless it's being saved in different subtitle format (and thus with different file extension), it overwrites the original file.
       --clean               Attempt to remove non-essential subtitles (eg. karaoke, SSA drawing tags), strip styling information when saving to non-SSA formats
       --verbose             Print misc logging
       --shift TIME          Delay all subtitles by given time amount. Time is specified like this: '1m30s', '0.5s', ...

diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -30,20 +30,6 @@ Now that we have a real file on the harddrive, let's import pysubs2 and load it.
     >>> subs
     <SSAFile with 2 events and 1 styles, last timestamp 0:02:00>
 
-.. note::
-   By default, pysubs2 uses UTF-8 encoding when reading and writing files, with surrogate pair escape error handling.
-   This works best if your file is either:
-
-      * in UTF-8 encoding or
-      * in a similar ASCII-like encoding (line ``latin-1``) and you don't need to work with the text (only convert subtitle format, shift time, etc.).
-
-   Use the ``encoding`` and ``errors`` keyword arguments in the :meth:`pysubs2.SSAFile.load()` and :meth:`pysubs2.SSAFile.save()` methods in case you need something else,
-   or you can do the processing yourself and work only with ``str`` using :meth:`pysubs2.SSAFile.from_string()` and :meth:`pysubs2.SSAFile.to_string()`.
-
-   If you use the default settings, you can get the input ``bytes`` for a particular subtitle using:
-
-   >>> subs[0].text.encode("utf-8", "surrogateescape")
-
 Now we have a subtitle file, the :class:`pysubs2.SSAFile` object. It has two "events", ie. subtitles. You can treat ``subs`` as a list:
 
     >>> subs[0].text
@@ -65,6 +51,43 @@ Individual subtitles are :class:`pysubs2.SSAEvent` objects and have the attribut
     there was a SubRip file
     with two subtitles.
 
+A point about character encoding
+################################
+
+By default, pysubs2 uses `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_ character encoding when reading and writing files,
+which enjoys wide software support, can represent any character from `Unicode <https://en.wikipedia.org/wiki/Unicode>`_,
+and is efficient in terms of disk space. It's arguably "the" character encoding to use for text storage today, but it
+hasn't always been like this, and it's possible that the subtitle files you will be dealing with use some other
+encoding.
+
+UTF-8 is a superset of `ASCII <https://en.wikipedia.org/wiki/ASCII>`_ and it's defined in such a way that files using
+other encodings are very unlikely to form a valid UTF-8 file. In other words, if your non-UTF-8 file contains characters
+such as accented Latin letters, East Asian scripts, etc., instead of question marks or wrong characters in the output,
+you will get an error like this:
+
+    >>> import pysubs2
+    >>> subs = pysubs2.load("subtitles.srt")
+    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 110: invalid start byte
+
+When this happens, you have two options:
+
+    1. **If you need to work with subtitle text (eg. for translation)**, you must specify the correct encoding using the ``encoding``
+       parameter for :meth:`pysubs2.load()`,
+       eg. ``pysubs2.load("subtitles.srt", encoding="latin-1")``. If you don't know which encoding
+       to use, you can try autodetecting it using a library like `charset-normalizer <https://pypi.org/project/charset-normalizer/>`_
+       or `chardet <https://pypi.org/project/chardet/>`_.
+    2. **If you don't need to read/modify subtitle text (eg. for retiming or format conversion)**, you can try using
+       ``errors="surrogateescape"`` to map undecodable bytes to lone `Unicode surrogate code points <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_
+       and effectively pass them through to output, eg. ``pysubs2.load("subtitles.srt", errors="surrogateescape")``.
+       This will only work if the actual character encoding is sufficiently "ASCII-like"
+       that pysubs2 recognizes the file structure, which may fail with multi-byte encodings. The CLI tool uses this
+       by default for better user experience.
+
+Lastly, there have been reports about rare subtitle files with mixed character encodings. If you have the misfortune
+to stumble upon such a file, use ``errors="surrogateescape"`` which will allow you to get the input ``bytes`` of a particular
+subtitle by using: ``subs[0].text.encode("utf-8", "surrogateescape")``. You can then set the :attr:`pysubs2.SSAEvent.text`
+to whatever is the correct decoded text.
+
 Working with timing
 -------------------
 

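The pass-through described in the tutorial text above can be illustrated with plain Python, independent of pysubs2. This is a minimal sketch, assuming a Windows-1250 input that is mistakenly decoded as UTF-8; the sample word and variable names are illustrative only and do not come from the commit.

```python
# Bytes of a Czech word in Windows-1250; 0xf8 is not a valid UTF-8 start byte.
raw = "Příliš".encode("cp1250")

# Strict decoding (what errors=None means) fails on the first undecodable byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# while plain ASCII bytes decode as usual.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same error handler turns the surrogates back into the
# original bytes, which is what makes the round trip lossless.
restored = text.encode("utf-8", "surrogateescape")

print(strict_failed, restored == raw)
```

The decoded string is only safe to pass through: the escaped characters are not the real text, so any feature that reads or modifies subtitle text still needs the correct ``encoding``.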
diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
@@ -50,7 +50,7 @@ def __init__(self) -> None:
 
     @classmethod
     def load(cls, path: str, encoding: str = "utf-8", format_: Optional[str] = None, fps: Optional[float] = None,
-             errors: Optional[str] = "surrogateescape", **kwargs: Any) -> "SSAFile":
+             errors: Optional[str] = None, **kwargs: Any) -> "SSAFile":
         """
         Load subtitle file from given path.
 
@@ -66,10 +66,12 @@ def load(cls, path: str, encoding: str = "utf-8", format_: Optional[str] = None,
             encoding (str): Character encoding of input file.
                 Defaults to UTF-8, you may need to change this.
             errors (Optional[str]): Error handling for character encoding
-                of input file. Defaults to ``"surrogateescape"``. See documentation
-                of builtin ``open()`` function for more.
+                of input file. Defaults to ``None``; use the value ``"surrogateescape"``
+                for pass-through of bytes not supported by selected encoding via
+                lone `Unicode surrogate code points <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_.
+                See documentation of builtin ``open()`` function for more.
 
-                .. versionchanged:: 2.0.0
+                .. versionchanged:: 1.7.0
                     The ``errors`` parameter was introduced to facilitate
                     pass-through of subtitle files with unknown text encoding.
                     Previous versions of the library behaved as if ``errors=None``.
@@ -190,7 +192,7 @@ def from_file(cls, fp: TextIO, format_: Optional[str] = None, fps: Optional[floa
         return subs
 
     def save(self, path: str, encoding: str = "utf-8", format_: Optional[str] = None, fps: Optional[float] = None,
-             errors: Optional[str] = "surrogateescape", **kwargs: Any) -> None:
+             errors: Optional[str] = None, **kwargs: Any) -> None:
         """
         Save subtitle file to given path.
 
@@ -217,11 +219,13 @@ def save(self, path: str, encoding: str = "utf-8", format_: Optional[str] = None
                 different framerate, use this argument. See also
                 :meth:`SSAFile.transform_framerate()` for fixing bad
                 frame-based to time-based conversions.
-            errors (Optional[str]): Error handling for character encoding,
-                defaults to ``"surrogateescape"``. See documentation
-                of builtin ``open()`` function for more.
+            errors (Optional[str]): Error handling for character encoding
+                of output file. Defaults to ``None``; use the value ``"surrogateescape"``
+                for pass-through of bytes not supported by selected encoding via
+                lone `Unicode surrogate code points <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_.
+                See documentation of builtin ``open()`` function for more.
 
-                .. versionchanged:: 2.0.0
+                .. versionchanged:: 1.7.0
                     The ``errors`` parameter was introduced to facilitate
                     pass-through of subtitle files with unknown text encoding.
                     Previous versions of the library behaved as if ``errors=None``.

diff --git a/tests/formats/test_subrip.py b/tests/formats/test_subrip.py
@@ -326,15 +326,14 @@ def test_win1250_passthrough_with_surrogateescape() -> None:
             fp.write(input_bytes_win1250)
 
         with pytest.raises(UnicodeDecodeError):
-            # legacy behaviour
-            SSAFile.load(input_path, errors=None)
+            SSAFile.load(input_path)
 
-        subs = SSAFile.load(input_path)
+        subs = SSAFile.load(input_path, errors="surrogateescape")
 
         assert subs[0].text == "The quick brown fox jumps over the lazy dog"
         assert subs[1].text.startswith("P") and subs[1].text.endswith("dy")
 
-        subs.save(output_path)
+        subs.save(output_path, errors="surrogateescape")
 
         with open(output_path, "rb") as fp:
             output_bytes = fp.read()
@@ -362,15 +361,14 @@ def test_multiencoding_passthrough_with_surrogateescape() -> None:
             fp.write(input_bytes)
 
         with pytest.raises(UnicodeDecodeError):
-            # legacy behaviour
-            SSAFile.load(input_path, errors=None)
+            SSAFile.load(input_path)
 
-        subs = SSAFile.load(input_path)
+        subs = SSAFile.load(input_path, errors="surrogateescape")
 
         assert subs[0].text.startswith("The quick brown fox jumps over the lazy dog")
         assert "Felix bzw. Jody" in subs[0].text
 
-        subs.save(output_path)
+        subs.save(output_path, errors="surrogateescape")
 
         with open(output_path, "rb") as fp:
             output_bytes = fp.read()
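The updated tests rely on the ``errors`` argument behaving as it does for the builtin ``open()``. The sketch below reproduces the same strict-versus-pass-through contrast using only the standard library; ``passthrough_copy`` is a hypothetical helper written for this illustration, not part of pysubs2, and the sample subtitle bytes are made up.

```python
import os
import tempfile


def passthrough_copy(src_path, dst_path, errors=None):
    """Hypothetical helper: read a file as UTF-8 text and write it back out."""
    # newline="" disables newline translation so the bytes survive unchanged.
    with open(src_path, encoding="utf-8", errors=errors, newline="") as fp:
        text = fp.read()
    with open(dst_path, "w", encoding="utf-8", errors=errors, newline="") as fp:
        fp.write(text)
    return text


# A tiny SubRip-like file encoded as Windows-1250, not UTF-8.
input_bytes = "1\n00:00:00,000 --> 00:01:00,000\nPříliš žluťoučký kůň\n".encode("cp1250")

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.srt")
    dst = os.path.join(tmp, "out.srt")
    with open(src, "wb") as fp:
        fp.write(input_bytes)

    # errors=None means strict handling, so decoding fails.
    try:
        passthrough_copy(src, dst)
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # Opting in to "surrogateescape" lets the unknown bytes pass through.
    passthrough_copy(src, dst, errors="surrogateescape")
    with open(dst, "rb") as fp:
        output_bytes = fp.read()

print(strict_raised, output_bytes == input_bytes)
```

The error handler has to be given on both the read and the write side, which is why the tests now pass ``errors="surrogateescape"`` to ``load()`` and ``save()`` alike.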