
Improved checksum calculations #187

Open
bosterholz opened this issue Jun 20, 2022 · 2 comments
Labels
enhancement New feature or request
Milestone

Comments

@bosterholz
Collaborator

The md5sum process currently in use is single-threaded and takes a very long time to compute the checksum of the 240 GB nr database.
It would be nice to switch to a checksum algorithm or program that can be parallelized to speed this up.

@bosterholz added the enhancement label on Jun 20, 2022
@pbelmann
Member

Good catch! Maybe xargs is the easiest way to solve this.
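A minimal sketch of how the xargs idea could look, assuming GNU coreutils (`stat -c`, `nproc`, `dd` with `bs=1M`) and an `xargs` that supports `-P`. The function name `parallel_md5`, the chunk size and the job count are hypothetical and not part of the project. It hashes fixed-size chunks of one file in parallel and then hashes the ordered list of chunk digests, so the result is deliberately not the same value as `md5sum` of the whole file.

```shell
# Hypothetical helper: parallel_md5 FILE [CHUNK_MIB] [JOBS]
# Prints one digest derived from the MD5 of each CHUNK_MIB-sized chunk.
# NOTE: this is a hash of chunk hashes, NOT the value `md5sum FILE` prints.
parallel_md5() {
    local file chunk_mib njobs size chunk_bytes nchunks
    file="$1"
    chunk_mib="${2:-1024}"        # chunk size in MiB (assumed default)
    njobs="${3:-$(nproc)}"        # number of parallel workers

    size=$(stat -c %s "$file") || return 1
    chunk_bytes=$(( $chunk_mib * 1048576 ))
    nchunks=$(( ($size + $chunk_bytes - 1) / $chunk_bytes ))
    [ "$nchunks" -gt 0 ] || nchunks=1   # an empty file still gets one chunk

    # One worker per chunk index: read the chunk with dd, hash it, and
    # print "INDEX DIGEST". Sorting by index restores the chunk order
    # before the final hash over all chunk digests.
    seq 0 $(( $nchunks - 1 )) |
        FILE="$file" CHUNK_MIB="$chunk_mib" xargs -P "$njobs" -I{} sh -c '
            digest=$(dd if="$FILE" bs=1M skip=$(( $1 * $CHUNK_MIB )) count="$CHUNK_MIB" 2>/dev/null | md5sum | cut -d" " -f1)
            printf "%s %s\n" "$1" "$digest"
        ' _ {} |
        sort -n | cut -d" " -f2 | md5sum | cut -d" " -f1
}
```

It could be called as `parallel_md5 nr_2022-04-02_mmseqs_taxonomy.tar 1024 8`. Because the digest depends on the chunk size, the same chunk size would have to be used when creating and when verifying a checksum, and any checksums already stored from plain `md5sum` would not match. Whether this is actually faster depends on whether the storage can serve several readers at once.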

@pbelmann added this to the Publication milestone on Jun 29, 2022
@bosterholz
Collaborator Author

I tried two different algorithms (cksum and md5sum), but their runtimes were very close, and neither maxed out I/O.
We should take a look at parallel implementations, as it seems that the single core in use could be the bottleneck.

time cksum nr_2022-04-02_mmseqs_taxonomy.tar
2820021559 280000174080 nr_2022-04-02_mmseqs_taxonomy.tar

real    65m15.818s
user    21m25.004s
sys     2m6.584s


time md5sum nr_2022-04-02_mmseqs_taxonomy.tar
35b7bc1a96f0b337c12713d4d3d4b4d3  nr_2022-04-02_mmseqs_taxonomy.tar

real    60m19.030s
user    13m45.796s
sys     2m22.600s
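
To make the two runs easier to compare, here is a small sketch that turns the `time` figures above into throughput and CPU utilisation. The helper names `to_seconds` and `report` are hypothetical. The byte count 280000174080 is taken from the cksum output and assumed to apply to the md5sum run as well, since it is the same file.

```shell
# to_seconds converts a `time`-style duration such as 65m15.818s to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# report BYTES REAL USER SYS
# Prints throughput in MB/s (10^6 bytes) and CPU utilisation,
# computed as (user + sys) / real.
report() {
    awk -v bytes="$1" -v real="$(to_seconds "$2")" \
        -v user="$(to_seconds "$3")" -v sys="$(to_seconds "$4")" \
        'BEGIN { printf "%.1f MB/s, %.0f%% CPU\n", bytes / real / 1e6, (user + sys) / real * 100 }'
}

report 280000174080 65m15.818s 21m25.004s 2m6.584s    # cksum:  71.5 MB/s, 36% CPU
report 280000174080 60m19.030s 13m45.796s 2m22.600s   # md5sum: 77.4 MB/s, 27% CPU
```

A ratio of (user + sys) / real well below 100% would mean the process was not computing for the whole run, so it may be worth checking where the remaining wall-clock time went before settling on a parallel implementation.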
