Use only level 1 and 2 with python-isal #142
Merged
Zlib compression level 0 outputs uncompressed deflate format, which consists of blocks of up to 64K, each laid out as follows:

- 1 byte signifying that the data is uncompressed. (Actually it is a two-bit code, but I am not going to rewrite the spec here.)
- 2 bytes with an unsigned 16-bit integer that holds the length of the following block.
- 2 bytes with that 16-bit integer complemented, for parity checking.
- Then the uncompressed block of up to 64K in length.
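For illustration (not part of the PR): a small Python sketch that inspects the raw deflate stream zlib produces at level 0, showing the stored-block header layout described above.

```python
import zlib

data = b"A" * 1000

# Raw deflate stream (wbits=-15 drops the zlib wrapper) at level 0: stored blocks only.
deflater = zlib.compressobj(0, zlib.DEFLATED, -15)
raw = deflater.compress(data) + deflater.flush()

btype = (raw[0] >> 1) & 0b11                  # two-bit block type from the first header byte
length = int.from_bytes(raw[1:3], "little")   # LEN: length of the stored block
nlen = int.from_bytes(raw[3:5], "little")     # NLEN: one's complement of LEN

print(btype)                                  # 0 -> "stored", i.e. no compression
print(length, length + nlen == 0xFFFF)        # block length and its parity check
print(raw[5:5 + length] == data[:length])     # payload is the input, verbatim
```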
ISA-L, however, does compress the data at level 0, using its own algorithm. This is an area where zlib and ISA-L differ. That is okay, but xopen should behave somewhat consistently: the output for the same parameters should not depend on the backend. Also, level 0 should be quite fast on zlib and zlib-ng as well, so there is no reason to use ISA-L for speed here.
Next is level 3. For FASTQ data, I find that this has very little extra value in terms of compression. Other libraries, such as zlib-ng and zlib, do a better job here. It is also much slower than levels 1 and 2, so it is a bit of a useless level for FASTQ data. ISA-L levels 1 and 2 outperform everything else by huge margins, and since level 1 is the default, I don't see why level 3 can't be handled by a library that compresses better.
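To make the intended policy concrete, here is a minimal sketch of the backend selection described above. The function name `_choose_gzip_backend` and the string return values are hypothetical, not xopen's actual API; only the rule itself (ISA-L for levels 1 and 2, zlib-ng or zlib for everything else) comes from this PR.

```python
# Hypothetical sketch, not xopen's real code: pick a gzip backend per compression level.
def _choose_gzip_backend(compresslevel: int) -> str:
    try:
        import isal  # python-isal
        have_isal = True
    except ImportError:
        have_isal = False

    # Use ISA-L only for levels 1 and 2, where it clearly wins;
    # level 0 and level 3 are better served by zlib-ng or zlib.
    if have_isal and compresslevel in (1, 2):
        return "python-isal"

    try:
        import zlib_ng  # python-zlib-ng
        return "zlib-ng"
    except ImportError:
        return "zlib"
```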
All right: benchmarks. These were not done very cleanly (processes running in the background, n=1), but they at least give a good ballpark indication.
before:
after:
The benchmarks show that there was also a fault in the logic introduced at some point (EDIT: due to an if->elif replacement in the refactoring): if python-isal with threads=0 threw a ValueError for a bad compression level, zlib-ng was skipped and zlib was used instead. This is now also fixed.
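A simplified, hypothetical reconstruction of the kind of bug described (names and structure are illustrative, not xopen's actual code): because the second branch is an `elif`, it is never reached once python-isal is installed, so the ValueError fallback lands on plain zlib instead of zlib-ng.

```python
import gzip

try:
    from isal import igzip        # python-isal
except ImportError:
    igzip = None
try:
    from zlib_ng import gzip_ng   # python-zlib-ng
except ImportError:
    gzip_ng = None


def _open_gz_broken(path, mode, compresslevel):
    # Hypothetical illustration of the broken fallback chain.
    if igzip is not None:
        try:
            return igzip.open(path, mode, compresslevel=compresslevel)
        except ValueError:
            # python-isal rejects this compression level; fall back below.
            pass
    elif gzip_ng is not None:
        # BUG: as an `elif`, this branch is skipped whenever python-isal is
        # installed, even after the ValueError above, so zlib-ng never runs.
        return gzip_ng.open(path, mode, compresslevel=compresslevel)
    return gzip.open(path, mode, compresslevel=compresslevel)
```

With the zlib-ng branch restored to a plain `if`, the fallback tries zlib-ng before plain zlib, which is presumably what the fix restores.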
As you can see, zlib-ng gives a nice improvement over python-isal in terms of compressed size at level 3. "Compression" at level 0 is now non-existent, as expected, and as a result it is a lot faster, also as expected.
Overall I am very happy with the minimal amount of code required to effect these changes. Python-isal is a special case, so I don't mind a bit more special-casing in the code.