wip
marcelm committed Jan 12, 2024
1 parent ee5b270 commit 66ca452
Showing 1 changed file with 109 additions and 30 deletions.
139 changes: 109 additions & 30 deletions README.rst
@@ -17,43 +17,21 @@
xopen
=====

This Python module provides an ``xopen`` function that works like the
This Python module provides an ``xopen`` function that works like Python’s
built-in ``open`` function but also transparently deals with compressed files.
Supported compression formats are currently gzip, bzip2, xz and optionally Zstandard.

``xopen`` selects the most efficient method for reading or writing a compressed file.
For gzip files this means falling back on the threaded methods of the
``python-isal`` library if supported. Alternatively a pipe can be opened to
an external tool, such as `pigz <https://zlib.net/pigz/>`_, which is a parallel
version of ``gzip``.

If ``threads=0`` is passed to ``xopen()``, no external process is used.
For gzip files, this will then use `python-isal
<https://github.com/pycompression/python-isal>`_ (which binds isa-l) if
it is installed (since ``python-isal`` is a dependency of ``xopen``,
this should always be the case).
``python-isal`` does not support compression levels
greater than 3, so if no external tool is available or ``threads`` has been set to 0,
Python’s built-in ``gzip.open`` is used.

For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression.

For bz2 files, `pbzip2 (parallel bzip2) <http://compression.ca/pbzip2/>`_ is used.

``xopen`` falls back to Python’s built-in functions
(``gzip.open``, ``lzma.open``, ``bz2.open``)
if none of the other methods can be used.

The file format to use is determined from the file name if the extension is recognized
(``.gz``, ``.bz2``, ``.xz`` or ``.zst``).
When reading a file without a recognized file extension, xopen attempts to detect the format
by reading the first couple of bytes from the file.
Supported compression formats are:

- gzip (``.gz``)
- bzip2 (``.bz2``)
- xz (``.xz``)
- Zstandard (``.zst``) (optional)

``xopen`` is compatible with Python versions 3.8 and later.


Usage
-----
Example usage
-------------

Open a file for reading::

@@ -72,6 +50,107 @@ and avoid using an external process::
f.write(b"Hello")


The ``xopen`` function
----------------------

The ``xopen`` module offers a single function named ``xopen`` with the following
signature::

    xopen(
        filename: FilePath,
        mode: Literal["r", "w", "a", "rt", "rb", "wt", "wb", "at", "ab"] = "r",
        compresslevel: Optional[int] = None,
        threads: Optional[int] = None,
        *,
        encoding: str = "utf-8",
        errors: Optional[str] = None,
        newline: Optional[str] = None,
        format: Optional[str] = None,
    ) -> IO

``xopen`` detects whether a file is compressed, if necessary delegates opening
it to an appropriate function, and returns an open file-like object.

When writing, the file format is chosen based on the file name extension:
``.gz``, ``.bz2``, ``.xz``, ``.zst``.
If the extension is not recognized, no compression is used.

When reading, the format is detected from the file name extension if it is
recognized; otherwise, the format is detected from the file contents.
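
Detection from the contents can be pictured as a comparison of the first few
bytes of the file against the well-known signature ("magic bytes") of each
format. The following is only a minimal sketch of that idea using the standard
library; ``MAGIC``, ``detect_format`` and ``detect_file_format`` are
hypothetical names chosen for illustration and are not part of the ``xopen``
API, and the real detection logic may differ.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading signature bytes of each supported format.
# Hypothetical helper table for illustration, not xopen's own code.
MAGIC = {
    "gz": b"\x1f\x8b",
    "bz2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
    "zst": b"\x28\xb5\x2f\xfd",
}


def detect_format(head: bytes) -> Optional[str]:
    """Return the format name whose signature starts ``head``, or None."""
    for name, magic in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def detect_file_format(path: str) -> Optional[str]:
    """Read only the first few bytes of ``path`` and guess its format."""
    with open(path, "rb") as f:
        return detect_format(f.read(6))


# Data compressed with the standard library carries these signatures.
print(detect_format(gzip.compress(b"data")))
print(detect_format(bz2.compress(b"data")))
print(detect_format(lzma.compress(b"data")))
print(detect_format(b"plain text"))
```

Reading only a handful of bytes keeps the check cheap and independent of the
file name.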

Parameters
~~~~~~~~~~

**filename**:
str, bytes, or `os.PathLike <https://docs.python.org/3/library/os.html#os.PathLike>`_
naming the file to open.

If set to ``"-"``, standard output (in mode ``"w"``) or
standard input (mode ``"r"``) is returned.

**mode**, **encoding**, **errors**, **newline**:
These parameters have the same meaning as in Python’s built-in
`open function <https://docs.python.org/3/library/functions.html#open>`_
except that the default encoding is always UTF-8 instead of the
preferred locale encoding.
``encoding``, ``errors`` and ``newline`` are only used when opening a file in text mode.

**compresslevel**:
The compression level for writing to gzip, xz and Zstandard files.
If set to None, a default depending on the format is used:
gzip: 6, xz: 6, Zstandard: 3.

This parameter is ignored for other compression formats.

**format**:
Override the autodetection of the input or output format.
Possible values are: ``"gz"``, ``"xz"``, ``"bz2"``, ``"zst"``.

**threads**:
When ``threads`` is ``None`` (the default), compressed file formats are read or written
using a pipe to a subprocess running an external tool such as
``pbzip2`` or ``gzip`` (see ``PipedGzipWriter``, ``PipedGzipReader`` etc.).
If the external tool supports multiple threads, *threads* can be set to an int
specifying the number of threads to use.
If no external tool supporting the compression format is available, the file is
opened by calling the appropriate Python function
(that is, no subprocess is spawned).

Set threads to 0 to force opening the file without using a subprocess.
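
The decision described for **threads** can be sketched roughly as follows.
This is an illustration under stated assumptions, not the implementation used
by ``xopen``: the function name ``gzip_compress`` is hypothetical, it
compresses an in-memory byte string instead of returning a file object, and it
only looks for ``pigz`` and ``gzip`` on the ``PATH``.

```python
import gzip
import os
import shutil
import subprocess
from typing import Optional


def gzip_compress(data: bytes, threads: Optional[int] = None) -> bytes:
    """Compress ``data`` to gzip, preferring an external tool.

    threads=None: use an external tool if one is found.
    threads=0:    never spawn a subprocess.
    threads=N:    ask the tool for N threads if it supports that (pigz).
    """
    tool = None
    if threads != 0:
        tool = shutil.which("pigz") or shutil.which("gzip")
    if tool is None:
        # No subprocess wanted or no tool available: use Python's gzip module.
        return gzip.compress(data)

    cmd = [tool, "-c"]  # -c: write the compressed stream to standard output
    if threads and os.path.basename(tool).startswith("pigz"):
        cmd += ["-p", str(threads)]  # pigz option for the number of threads
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


payload = b"Hello, xopen!\n" * 1000
# Whichever path was taken, the result is a valid gzip stream.
print(gzip.decompress(gzip_compress(payload)) == payload)
print(gzip.decompress(gzip_compress(payload, threads=0)) == payload)
```

The output is a regular gzip stream whether or not a subprocess was used, so
the choice affects speed rather than the file format.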


Compression backends
--------------------


gzip
~~~~



For gzip files, ``xopen`` falls back on the threaded methods of the
``python-isal`` library if supported. Alternatively, a pipe can be opened to
an external tool, such as `pigz <https://zlib.net/pigz/>`_, which is a parallel
version of ``gzip``.

If ``threads=0`` is passed to ``xopen()``, no external process is used.
For gzip files, this will then use `python-isal
<https://github.com/pycompression/python-isal>`_ (which binds isa-l) if
it is installed (since ``python-isal`` is a dependency of ``xopen``,
this should always be the case).
``python-isal`` does not support compression levels greater than 3.
For higher levels, Python’s built-in ``gzip.open`` is therefore used
if no external tool is available or ``threads`` has been set to 0.

For xz files, a pipe to the ``xz`` program is used because it has built-in support for multithreaded compression.

For bz2 files, `pbzip2 (parallel bzip2) <http://compression.ca/pbzip2/>`_ is used.

``xopen`` falls back to Python’s built-in functions
(``gzip.open``, ``lzma.open``, ``bz2.open``)
if none of the other methods can be used.
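
The fallback to the built-in functions can be pictured as a simple dispatch on
the file name extension. The following is a minimal sketch under stated
assumptions: ``open_builtin`` and ``_OPENERS`` are hypothetical names, only
``str`` paths are handled, and Zstandard is left out because the standard
library modules used here do not cover it. It is not the code path ``xopen``
itself takes.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Extension -> standard library opener (hypothetical table for illustration).
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_builtin(path: str, mode: str = "rt", encoding: str = "utf-8"):
    """Open ``path`` with the matching built-in opener, without a subprocess.

    Unrecognized extensions are opened as plain, uncompressed files.
    """
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext, open)
    if "b" in mode:
        return opener(path, mode)
    if "t" not in mode:
        # gzip.open, bz2.open and lzma.open treat "r"/"w" as binary,
        # so request text mode explicitly.
        mode += "t"
    return opener(path, mode, encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "example.txt.gz")
    with open_builtin(name, "wt") as f:
        f.write("Hello\n")
    with open_builtin(name, "rt") as f:
        print(f.read(), end="")
```

Because all three openers accept the same ``mode`` and ``encoding`` arguments
as ``open``, the caller does not need to know which one was selected.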


Reproducibility
---------------
