Commit

fix!: Only use "surrogateescape" error handling in CLI, revert to default handling in API

Changing the default error handling in the library's Python API is a breaking change,
so let's make it opt-in for now.

The benefit of "surrogateescape" seems clearer in the CLI: since there is currently
no feature that works with subtitle text, it only removes (some of) the annoying errors
about character encoding, without breaking anything.
tkarabela committed May 18, 2024
1 parent 5c98e6c commit 445f1c3
Showing 4 changed files with 65 additions and 40 deletions.
18 changes: 9 additions & 9 deletions docs/cli.rst
@@ -34,18 +34,17 @@ CLI parameters
::

usage: pysubs2 [-h] [-v] [-f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [-t {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [--input-enc ENCODING] [--output-enc ENCODING] [--fps FPS] [-o DIR] [--clean] [--verbose]
usage: pysubs2 [-h] [-v] [-f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [-t {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}] [--input-enc ENCODING] [--output-enc ENCODING] [--enc-error-handling {strict,surrogateescape}] [--fps FPS] [-o DIR] [--clean] [--verbose]
[--shift TIME | --shift-back TIME | --transform-framerate FPS1 FPS2] [--srt-keep-unknown-html-tags] [--srt-keep-html-tags] [--srt-keep-ssa-tags] [--sub-no-write-fps-declaration]
[FILE [FILE ...]]
[FILE ...]

The pysubs2 CLI for processing subtitle files.
https://github.com/tkarabela/pysubs2

positional arguments:
FILE Input subtitle files. Can be in SubStation Alpha (*.ass, *.ssa), SubRip (*.srt), MicroDVD (*.sub) or other supported format. When no files are specified, pysubs2 will work as a pipe, reading from
standard input and writing to standard output.
FILE Input subtitle files. Can be in SubStation Alpha (*.ass, *.ssa), SubRip (*.srt), MicroDVD (*.sub) or other supported format. When no files are specified, pysubs2 will work as a pipe, reading from standard input and writing to standard output.

optional arguments:
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-f {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}, --from {srt,ass,ssa,microdvd,json,mpl2,tmp,vtt}
@@ -54,12 +53,13 @@ CLI parameters
Convert subtitle files to given format. By default, each file is saved in its original format.
--input-enc ENCODING Character encoding for input files. By default, UTF-8 is used for both input and output.
--output-enc ENCODING
Character encoding for output files. By default, it is the same as input encoding. If you wish to convert between encodings, make sure --input-enc is set correctly! Otherwise, your output files will
probably be corrupted. It's a good idea to back up your files or use the -o option.
Character encoding for output files. By default, it is the same as input encoding. If you wish to convert between encodings, make sure --input-enc is set correctly! Otherwise, your output files will probably be corrupted. It's a good idea to back up your files or use the -o
option.
--enc-error-handling {strict,surrogateescape}
Character encoding error handling for input and output. Defaults to 'surrogateescape' which passes through unrecognized characters to output unchanged. Use 'strict' if you want the command to fail when encountering a character incompatible with selected input/output encoding.
--fps FPS This argument specifies framerate for MicroDVD files. By default, framerate is detected from the file. Use this when framerate specification is missing or to force different framerate.
-o DIR, --output-dir DIR
Use this to save all files to given directory. By default, every file is saved to its parent directory, ie. unless it's being saved in different subtitle format (and thus with different file
extension), it overwrites the original file.
Use this to save all files to given directory. By default, every file is saved to its parent directory, ie. unless it's being saved in different subtitle format (and thus with different file extension), it overwrites the original file.
--clean Attempt to remove non-essential subtitles (eg. karaoke, SSA drawing tags), strip styling information when saving to non-SSA formats
--verbose Print misc logging
--shift TIME Delay all subtitles by given time amount. Time is specified like this: '1m30s', '0.5s', ...
51 changes: 37 additions & 14 deletions docs/tutorial.rst
@@ -30,20 +30,6 @@ Now that we have a real file on the harddrive, let's import pysubs2 and load it.
>>> subs
<SSAFile with 2 events and 1 styles, last timestamp 0:02:00>

.. note::
By default, pysubs2 uses UTF-8 encoding when reading and writing files, with surrogate pair escape error handling.
This works best if your file is either:

* in UTF-8 encoding or
* in a similar ASCII-like encoding (like ``latin-1``) and you don't need to work with the text (only convert subtitle format, shift time, etc.).

Use the ``encoding`` and ``errors`` keyword arguments in the :meth:`pysubs2.SSAFile.load()` and :meth:`pysubs2.SSAFile.save()` methods in case you need something else,
or you can do the processing yourself and work only with ``str`` using :meth:`pysubs2.SSAFile.from_string()` and :meth:`pysubs2.SSAFile.to_string()`.

If you use the default settings, you can get the input ``bytes`` for a particular subtitle using:

>>> subs[0].text.encode("utf-8", "surrogateescape")

Now we have a subtitle file, the :class:`pysubs2.SSAFile` object. It has two "events", ie. subtitles. You can treat ``subs`` as a list:

>>> subs[0].text
@@ -65,6 +51,43 @@ Individual subtitles are :class:`pysubs2.SSAEvent` objects and have the attribut
there was a SubRip file
with two subtitles.

A point about character encoding
################################

By default, pysubs2 uses `UTF-8 <https://en.wikipedia.org/wiki/UTF-8>`_ character encoding when reading and writing files,
which enjoys wide software support, can represent any character from `Unicode <https://en.wikipedia.org/wiki/Unicode>`_,
and is efficient in terms of disk space. It's arguably "the" character encoding to use for text storage today, but it
hasn't always been like this, and it's possible that the subtitle files you will be dealing with use some other
encoding.

UTF-8 is a superset of `ASCII <https://en.wikipedia.org/wiki/ASCII>`_ and it's defined in such a way that files using
other encodings are very unlikely to form a valid UTF-8 file. In other words, if your non-UTF-8 file contains characters
such as accented Latin letters, East Asian scripts, etc., then instead of question marks or wrong characters in the output,
you will get an error like this:

>>> import pysubs2
>>> subs = pysubs2.load("subtitles.srt")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 110: invalid start byte

When this happens, you have two options:

1. **If you need to work with subtitle text (eg. for translation)**, you must specify the correct encoding using the ``encoding``
parameter for :meth:`pysubs2.load()`,
eg. ``pysubs2.load("subtitles.srt", encoding="latin-1")``. If you don't know which encoding
to use, you can try autodetecting it using a library like `charset-normalizer <https://pypi.org/project/charset-normalizer/>`_
or `chardet <https://pypi.org/project/chardet/>`_.
2. **If you don't need to read/modify subtitle text (eg. for retiming or format conversion)**, you can try using
``errors="surrogateescape"`` to map bytes that are not valid UTF-8 to lone `Unicode surrogates <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_
and effectively pass them through to output, eg. ``pysubs2.load("subtitles.srt", errors="surrogateescape")``
(pass the same ``errors`` value when saving).
This will only work if the actual character encoding is sufficiently "ASCII-like"
that pysubs2 recognizes the file structure, which may fail with multi-byte encodings. The CLI tool uses this
by default for better user experience.
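
The difference between the two options can be seen without pysubs2, since the ``encoding`` and ``errors`` parameters follow the semantics of Python's own codecs. The following is a minimal sketch using only the standard library; the sample text and the choice of Windows-1250 as the legacy encoding are assumptions made for illustration.

```python
# Sample subtitle text stored in Windows-1250 (an assumed legacy encoding).
raw = "Příliš žluťoučký kůň".encode("cp1250")

# Default strict UTF-8 decoding fails on the accented letters.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Option 1: decoding with the correct encoding yields readable text.
text = raw.decode("cp1250")

# Option 2: surrogateescape keeps each undecodable byte as a lone surrogate
# (U+DC80..U+DCFF). The text is not readable, but no byte is lost, so
# encoding with the same error handler restores the input exactly.
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

print(strict_ok)          # False
print(round_trip == raw)  # True
```

Note that encoding ``escaped`` without ``"surrogateescape"`` would raise ``UnicodeEncodeError``, which is why the same error handling is needed on output.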

Lastly, there have been reports about rare subtitle files with mixed character encodings. If you have the misfortune
to stumble upon such a file, use ``errors="surrogateescape"``, which will allow you to get the input ``bytes`` of a particular
subtitle by using: ``subs[0].text.encode("utf-8", "surrogateescape")``. You can then set :attr:`pysubs2.SSAEvent.text`
to the correctly decoded text.
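
The repair step for a single subtitle could look roughly like the sketch below. The helper name ``redecode`` and the sample bytes are hypothetical and not part of pysubs2; the string ``loaded_text`` stands in for ``subs[0].text`` from a file loaded with ``errors="surrogateescape"``.

```python
def redecode(text: str, actual_encoding: str) -> str:
    """Recover the original bytes of a surrogate-escaped string and decode
    them with the encoding the subtitle was really written in.

    Hypothetical helper for illustration; not part of pysubs2.
    """
    original_bytes = text.encode("utf-8", "surrogateescape")
    return original_bytes.decode(actual_encoding)


# Stand-in for subs[0].text: Windows-1250 bytes that were read as UTF-8
# with errors="surrogateescape".
loaded_text = "Žluťoučký kůň".encode("cp1250").decode("utf-8", "surrogateescape")

fixed_text = redecode(loaded_text, "cp1250")
print(fixed_text == "Žluťoučký kůň")  # True
```

With a real file, the result would be assigned back to the event, eg. ``subs[0].text = redecode(subs[0].text, "cp1250")``, assuming that particular subtitle really is in Windows-1250.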

Working with timing
-------------------

22 changes: 13 additions & 9 deletions pysubs2/ssafile.py
@@ -50,7 +50,7 @@ def __init__(self) -> None:

@classmethod
def load(cls, path: str, encoding: str = "utf-8", format_: Optional[str] = None, fps: Optional[float] = None,
errors: Optional[str] = "surrogateescape", **kwargs: Any) -> "SSAFile":
errors: Optional[str] = None, **kwargs: Any) -> "SSAFile":
"""
Load subtitle file from given path.
@@ -66,10 +66,12 @@ def load(cls, path: str, encoding: str = "utf-8", format_: Optional[str] = None,
encoding (str): Character encoding of input file.
Defaults to UTF-8, you may need to change this.
errors (Optional[str]): Error handling for character encoding
of input file. Defaults to ``"surrogateescape"``. See documentation
of builtin ``open()`` function for more.
of input file. Defaults to ``None``; use the value ``"surrogateescape"``
for pass-through of bytes not supported by selected encoding via
lone `Unicode surrogates <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_.
See documentation of builtin ``open()`` function for more.
.. versionchanged:: 2.0.0
.. versionchanged:: 1.7.0
The ``errors`` parameter was introduced to facilitate
pass-through of subtitle files with unknown text encoding.
Previous versions of the library behaved as if ``errors=None``.
@@ -190,7 +192,7 @@ def from_file(cls, fp: TextIO, format_: Optional[str] = None, fps: Optional[floa
return subs

def save(self, path: str, encoding: str = "utf-8", format_: Optional[str] = None, fps: Optional[float] = None,
errors: Optional[str] = "surrogateescape", **kwargs: Any) -> None:
errors: Optional[str] = None, **kwargs: Any) -> None:
"""
Save subtitle file to given path.
@@ -217,11 +219,13 @@ def save(self, path: str, encoding: str = "utf-8", format_: Optional[str] = None
different framerate, use this argument. See also
:meth:`SSAFile.transform_framerate()` for fixing bad
frame-based to time-based conversions.
errors (Optional[str]): Error handling for character encoding,
defaults to ``"surrogateescape"``. See documentation
of builtin ``open()`` function for more.
errors (Optional[str]): Error handling for character encoding
of output file. Defaults to ``None``; use the value ``"surrogateescape"``
for pass-through of bytes not supported by selected encoding via
lone `Unicode surrogates <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates>`_.
See documentation of builtin ``open()`` function for more.
.. versionchanged:: 2.0.0
.. versionchanged:: 1.7.0
The ``errors`` parameter was introduced to facilitate
pass-through of subtitle files with unknown text encoding.
Previous versions of the library behaved as if ``errors=None``.
14 changes: 6 additions & 8 deletions tests/formats/test_subrip.py
@@ -326,15 +326,14 @@ def test_win1250_passthrough_with_surrogateescape() -> None:
fp.write(input_bytes_win1250)

with pytest.raises(UnicodeDecodeError):
# legacy behaviour
SSAFile.load(input_path, errors=None)
SSAFile.load(input_path)

subs = SSAFile.load(input_path)
subs = SSAFile.load(input_path, errors="surrogateescape")

assert subs[0].text == "The quick brown fox jumps over the lazy dog"
assert subs[1].text.startswith("P") and subs[1].text.endswith("dy")

subs.save(output_path)
subs.save(output_path, errors="surrogateescape")

with open(output_path, "rb") as fp:
output_bytes = fp.read()
@@ -362,15 +361,14 @@ def test_multiencoding_passthrough_with_surrogateescape() -> None:
fp.write(input_bytes)

with pytest.raises(UnicodeDecodeError):
# legacy behaviour
SSAFile.load(input_path, errors=None)
SSAFile.load(input_path)

subs = SSAFile.load(input_path)
subs = SSAFile.load(input_path, errors="surrogateescape")

assert subs[0].text.startswith("The quick brown fox jumps over the lazy dog")
assert "Felix bzw. Jody" in subs[0].text

subs.save(output_path)
subs.save(output_path, errors="surrogateescape")

with open(output_path, "rb") as fp:
output_bytes = fp.read()
