
parallel implementations and clustering of reads #1

Open
darked89 opened this issue Jun 6, 2023 · 2 comments

Comments

darked89 commented Jun 6, 2023

Hello,

There are mature parallel compressors included in the major Linux distros, such as pigz, pbzip2, and lbzip2. These compress faster while being on par in compression ratio. The last one (lbzip2) also seems to be faster at decompression.
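For illustration, this is roughly how those drop-in parallel tools are invoked on a FASTQ file (reads.fastq is just a placeholder name; -p/-n set the number of worker threads):

```sh
# Parallel compression of a (placeholder) FASTQ file;
# outputs are reads.fastq.gz / reads.fastq.bz2
pigz -p 8 reads.fastq      # gzip-compatible output
pbzip2 -p8 reads.fastq     # bzip2-compatible output
lbzip2 -n 8 reads.fastq    # bzip2-compatible output

# Parallel decompression
pigz -d -p 8 reads.fastq.gz
lbzip2 -d -n 8 reads.fastq.bz2
```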

For achieving better compression ratios, it also helps to cluster reads by sequence with clumpify from BBMap. This tends to speed up downstream mapping a bit as well.
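As a minimal sketch of that clustering step (file names are placeholders):

```sh
# Group similar reads next to each other so the downstream compressor
# finds more redundancy; clumpify reads and writes gzipped FASTQ directly
clumpify.sh in=reads.fastq.gz out=reads.clumped.fastq.gz
```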

Hope it helps

DK

mbhall88 (Owner) commented Jun 6, 2023

Hi @darked89.

Thanks for the suggestions. I did contemplate adding the parallel compressors you mentioned, but figured I would stick to the standard single-threaded implementations for simplicity. (zstd also has a multi-threading option.) Reading the pbzip2 docs, it seems it only does decompression? I will have a think about adding a section on parallel (de)compression (really just a matter of whether I get the time).
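(For reference, a multi-threaded zstd invocation looks something like this; the thread count, level, and file name are arbitrary placeholders:)

```sh
# Compress with 4 worker threads at compression level 19
zstd -T4 -19 reads.fastq -o reads.fastq.zst

# Decompress
zstd -d reads.fastq.zst -o reads.fastq
```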

Regarding clustering the reads: I'm sure you're correct, and it is very interesting, but again, it adds to the complexity of compression, and I wanted this benchmark to reflect the "standard" user/scenario. The other thing that would need to be accounted for, alongside compression ratio and speed, is the time taken to cluster the reads.

Thanks again.

lpsantil commented Sep 14, 2024

pbzip2 does compression as well. It has been very fast in my recent use of it.

Have you looked into lz4 (https://github.com/lz4/lz4) and LZHAM (https://github.com/richgel999/lzham_codec)? While lz4 achieves compression ratios in line with, or just below, the lower levels of gzip, it also achieves decompression speeds in the GB/s range, usually within an order of magnitude of memory-copy speed (https://github.com/lz4/lz4/tree/dev?tab=readme-ov-file#benchmarks).
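For a rough idea of typical lz4 usage (file names are placeholders; -9 trades compression speed for ratio, while decompression stays fast):

```sh
# Fast default compression
lz4 reads.fastq reads.fastq.lz4

# Higher-ratio mode (slower to compress, still very fast to decompress)
lz4 -9 reads.fastq reads.fastq.lz4

# Decompress
lz4 -d reads.fastq.lz4 reads.fastq
```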

mbhall88 mentioned this issue Sep 18, 2024