From bf347b184ad832b44bff544af91564740da6a379 Mon Sep 17 00:00:00 2001 From: Marcel Martin Date: Fri, 12 Jan 2024 15:00:46 +0100 Subject: [PATCH 1/4] Clean up README slightly and document the xopen function parameters --- README.rst | 131 ++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 100 insertions(+), 31 deletions(-) diff --git a/README.rst b/README.rst index 8611aab..c2ff337 100644 --- a/README.rst +++ b/README.rst @@ -17,43 +17,22 @@ xopen ===== -This Python module provides an ``xopen`` function that works like the +This Python module provides an ``xopen`` function that works like Python’s built-in ``open`` function but also transparently deals with compressed files. -Supported compression formats are currently gzip, bzip2, xz and optionally Zstandard. - ``xopen`` selects the most efficient method for reading or writing a compressed file. -For gzip files this means falling back on the threaded methods of the -``python-isal`` library if supported. Alternatively a pipe can be opened to -an external tool, such as `pigz `_, which is a parallel -version of ``gzip``. - -If ``threads=0`` is passed to ``xopen()``, no external process is used. -For gzip files, this will then use `python-isal -`_ (which binds isa-l) if -it is installed (since ``python-isal`` is a dependency of ``xopen``, -this should always be the case). -``python-isal`` does not support compression levels -greater than 3, so if no external tool is available or ``threads`` has been set to 0, -Python’s built-in ``gzip.open`` is used. - -For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression. -For bz2 files, `pbzip2 (parallel bzip2) `_ is used. +Supported compression formats are: -``xopen`` falls back to Python’s built-in functions -(``gzip.open``, ``lzma.open``, ``bz2.open``) -if none of the other methods can be used. 
- -The file format to use is determined from the file name if the extension is recognized -(``.gz``, ``.bz2``, ``.xz`` or ``.zst``). -When reading a file without a recognized file extension, xopen attempts to detect the format -by reading the first couple of bytes from the file. +- gzip (``.gz``) +- bzip2 (``.bz2``) +- xz (``.xz``) +- Zstandard (``.zst``) (optional) ``xopen`` is compatible with Python versions 3.8 and later. -Usage ------ +Example usage +------------- Open a file for reading:: @@ -72,6 +51,96 @@ and avoid using an external process:: f.write(b"Hello") +The ``xopen`` function +---------------------- + +The ``xopen`` module offers a single function named ``xopen`` with the following +signature:: + + xopen( + filename: str | bytes | os.PathLike, + mode: Literal["r", "w", "a", "rt", "rb", "wt", "wb", "at", "ab"] = "r", + compresslevel: Optional[int] = None, + threads: Optional[int] = None, + *, + encoding: str = "utf-8", + errors: Optional[str] = None, + newline: Optional[str] = None, + format: Optional[str] = None, + ) -> IO + +The function opens the file using a function suitable for the detected +file format and returns an open file-like object. + +When writing, the file format is chosen based on the file name extension: +``.gz``, ``.bz2``, ``.xz``, ``.zst``. This can be overridden with ``format``. +If the extension is not recognized, no compression is used. + +When reading and a file name extension is available, the format is detected +from the extension. +When reading and no file name extension is available, +the format is detected from the contents. + +Parameters +~~~~~~~~~~ + +**filename** (str, bytes, or `os.PathLike `_): +Name of the file to open. + +If set to ``"-"``, standard output (in mode ``"w"``) or +standard input (in mode ``"r"``) is returned. 
+ +**mode**, **encoding**, **errors**, **newline**: +These parameters have the same meaning as in Python’s built-in +`open function `_ +except that the default encoding is always UTF-8 instead of the +preferred locale encoding. +``encoding``, ``errors`` and ``newline`` are only used when opening a file in text mode. + +**compresslevel**: +The compression level for writing to gzip, xz and Zstandard files. +If set to None, a default depending on the format is used: +gzip: 6, xz: 6, Zstandard: 3. + +This parameter is ignored for other compression formats. + +**format**: +Override the autodetection of the input or output format. +Possible values are: ``"gz"``, ``"xz"``, ``"bz2"``, ``"zst"``. + +**threads**: +If multi-threaded compression or decompression is available, +this parameter can be used to override the number of threads +used. It is ignored otherwise. + +Set threads to 0 to force opening the file without using a subprocess. +For some compression levels, +compressed files are by default read or written +using a pipe to a subprocess running an external tool such as +``pbzip2`` or ``xz``. +With *threads* set to 0, a normal function call is used instead. + + +Backends +-------- + +Opening of gzip files is delegated to one of these programs or libraries: + +* `python-isal `_. + Supports multiple threads and compression levels up to 3. +* zlib-ng +* `pigz `_ (a parallel version of ``gzip``) + +For xz files, a pipe to the ``xz`` program is used because it has +built-in support for multithreaded compression. + +For bz2 files, `pbzip2 (parallel bzip2) `_ is used. + +``xopen`` falls back to Python’s built-in functions +(``gzip.open``, ``lzma.open``, ``bz2.open``) +if none of the other methods can be used. + + Reproducibility --------------- @@ -270,7 +339,7 @@ Credits ------- The name ``xopen`` was taken from the C function of the same name in the -`utils.h file which is part of +`utils.h file that is part of BWA `_. Some ideas were taken from the `canopener project `_. 
@@ -284,7 +353,7 @@ Maintainers * Marcel Martin * Ruben Vorderman -* For a list of contributors, see +* See also the `full list of contributors `_. Links From acf474f87c77daca8ea2512440d78b77b25725b1 Mon Sep 17 00:00:00 2001 From: Marcel Martin Date: Mon, 15 Jan 2024 21:21:08 +0100 Subject: [PATCH 2/4] Document change of default gzip compression level --- README.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.rst b/README.rst index c2ff337..4f745ad 100644 --- a/README.rst +++ b/README.rst @@ -100,7 +100,7 @@ preferred locale encoding. **compresslevel**: The compression level for writing to gzip, xz and Zstandard files. If set to None, a default depending on the format is used: -gzip: 6, xz: 6, Zstandard: 3. +gzip: 1, xz: 6, Zstandard: 3. This parameter is ignored for other compression formats. From fb569e3857daa8f9bb799d0560136b31b3200be1 Mon Sep 17 00:00:00 2001 From: Marcel Martin Date: Mon, 15 Jan 2024 21:24:15 +0100 Subject: [PATCH 3/4] Fix docstring --- src/xopen/__init__.py | 3 --- 1 file changed, 3 deletions(-) diff --git a/src/xopen/__init__.py b/src/xopen/__init__.py index 287176e..8857e55 100644 --- a/src/xopen/__init__.py +++ b/src/xopen/__init__.py @@ -543,9 +543,6 @@ def __init__( """ mode -- one of 'w', 'wt', 'wb', 'a', 'at', 'ab' compresslevel -- compression level - threads (int) -- number of pigz threads. If this is set to None, a reasonable default is - used. At the moment, this means that the number of available CPU cores is used, capped - at four to avoid creating too many threads. Use 0 to let pigz use all available cores. 
""" if compresslevel is not None and compresslevel not in range(1, 10): raise ValueError("compresslevel must be between 1 and 9") From 7778069439a325078948c838573ee9642f499e1a Mon Sep 17 00:00:00 2001 From: Marcel Martin Date: Tue, 16 Jan 2024 23:52:15 +0100 Subject: [PATCH 4/4] Apply review comments --- README.rst | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/README.rst b/README.rst index 4f745ad..13a7aaa 100644 --- a/README.rst +++ b/README.rst @@ -79,7 +79,8 @@ If the extension is not recognized, no compression is used. When reading and a file name extension is available, the format is detected from the extension. When reading and no file name extension is available, -the format is detected from the contents. +the format is detected from the +`file signature `. Parameters ~~~~~~~~~~ @@ -109,16 +110,18 @@ Override the autodetection of the input or output format. Possible values are: ``"gz"``, ``"xz"``, ``"bz2"``, ``"zst"``. **threads**: -If multi-threaded compression or decompression is available, -this parameter can be used to override the number of threads -used. It is ignored otherwise. +Set the number of additional threads spawned for compression or decompression. +May be ignored if the backend does not support threads. -Set threads to 0 to force opening the file without using a subprocess. -For some compression levels, -compressed files are by default read or written -using a pipe to a subprocess running an external tool such as -``pbzip2`` or ``xz``. -With *threads* set to 0, a normal function call is used instead. +If *threads* is None (the default), as many threads as available CPU cores are +used, but not more than four. + +xopen tries to offload the (de)compression to other threads +to free up the main Python thread for the application. +This can either be done by using a subprocess to an external application or +using a library that supports threads. 
+ +Set threads to 0 to force xopen to use only the main Python thread. Backends @@ -128,8 +131,9 @@ Opening of gzip files is delegated to one of these programs or libraries: * `python-isal `_. Supports multiple threads and compression levels up to 3. -* zlib-ng +* `python-zlib-ng `_ * `pigz `_ (a parallel version of ``gzip``) +* `gzip `_ For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression.
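
Note on the format detection described in this series: the README text says that, when reading, a recognized file name extension decides the format and the file signature is consulted otherwise. The following is a minimal sketch of that rule using only the standard library. It is not xopen's actual implementation; ``detect_format`` and ``SIGNATURES`` are hypothetical names introduced here for illustration, and the sketch assumes that "no extension available" means "no recognized extension".

```python
import os
from typing import Optional

# Leading "magic" bytes of each container format named in the README.
# The dictionary and its name are illustrative, not part of xopen.
SIGNATURES = {
    "gz": b"\x1f\x8b",
    "bz2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
    "zst": b"\x28\xb5\x2f\xfd",
}


def detect_format(path: str) -> Optional[str]:
    """Hypothetical helper: return "gz", "bz2", "xz", "zst" or None.

    A recognized extension wins; otherwise the first bytes of the
    file are compared against the known signatures.
    """
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext in SIGNATURES:
        return ext
    # Six bytes are enough to cover the longest signature (xz).
    with open(path, "rb") as f:
        head = f.read(6)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None
```

Under these assumptions, a gzip-compressed file saved without an extension is still reported as ``"gz"``, while an uncompressed file with an unknown extension yields ``None``, which corresponds to the "no compression is used" case in the README.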