
parallel implementations and clustering of reads #1

Open
darked89 opened this issue Jun 6, 2023 · 2 comments

Comments

darked89 commented Jun 6, 2023

Hello,

There are mature parallel compressors included in the major Linux distros, such as pigz, pbzip2, and lbzip2. These compress faster while being on par in compression ratio. The last one (lbzip2) also seems to be faster at decompression.
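For illustration, this is roughly how those drop-in parallel tools are invoked on a FASTQ file (reads.fastq is just a placeholder name; -p/-n set the number of worker threads):

```sh
# Parallel compression of a (placeholder) FASTQ file;
# outputs are reads.fastq.gz / reads.fastq.bz2
pigz -p 8 reads.fastq      # gzip-compatible output
pbzip2 -p8 reads.fastq     # bzip2-compatible output
lbzip2 -n 8 reads.fastq    # bzip2-compatible output

# Parallel decompression
pigz -d -p 8 reads.fastq.gz
lbzip2 -d -n 8 reads.fastq.bz2
```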

For achieving better compression ratios, it also helps to cluster reads by sequence with clumpify from BBMap. This tends to speed up downstream mapping a bit as well.
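As a minimal sketch of that clustering step (file names are placeholders):

```sh
# Group similar reads next to each other so the downstream compressor
# finds more redundancy; clumpify reads and writes gzipped FASTQ directly
clumpify.sh in=reads.fastq.gz out=reads.clumped.fastq.gz
```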

Hope it helps

DK

mbhall88 (Owner) commented Jun 6, 2023

Hi @darked89.

Thanks for the suggestions. I did contemplate adding the parallel compressors you mentioned, but figured I would stick to the standard single-threaded implementations for simplicity. (zstd also has a multi-threading option.) Reading the pbzip2 docs, it seems it only does decompression? I will have a think about adding a section on parallel (de)compression (really just a matter of whether I get the time).
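(For reference, a multi-threaded zstd invocation looks something like this; the thread count, level, and file name are arbitrary placeholders:)

```sh
# Compress with 4 worker threads at compression level 19
zstd -T4 -19 reads.fastq -o reads.fastq.zst

# Decompress
zstd -d reads.fastq.zst -o reads.fastq
```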

Regarding clustering the reads: I'm sure you're correct, and it is very interesting, but again, it adds to the complexity of compression, and I wanted this benchmark to reflect the "standard" user/scenario. The other thing that would need to be accounted for, alongside compression ratio and speed, is the time taken to cluster the reads.

Thanks again.

lpsantil commented Sep 14, 2024

pbzip2 does compression as well. It has been very fast in my recent use of it.

Have you looked into lz4 (https://github.com/lz4/lz4) and LZHAM (https://github.com/richgel999/lzham_codec)? While lz4 achieves compression ratios in line with, or just below, the lower levels of gzip, it also achieves decompression speeds in the GB/s range, usually within an order of magnitude of memory-copy speed (https://github.com/lz4/lz4/tree/dev?tab=readme-ov-file#benchmarks).
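For a rough idea of typical lz4 usage (file names are placeholders; -9 trades compression speed for ratio, while decompression stays fast):

```sh
# Fast default compression
lz4 reads.fastq reads.fastq.lz4

# Higher-ratio mode (slower to compress, still very fast to decompress)
lz4 -9 reads.fastq reads.fastq.lz4

# Decompress
lz4 -d reads.fastq.lz4 reads.fastq
```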

mbhall88 mentioned this issue Sep 18, 2024