Test 4,546 Salmonella genomes? #6

Open
jermp opened this issue Oct 26, 2023 · 3 comments

Comments

@jermp

jermp commented Oct 26, 2023

Dear all,

I'm trying to build your compressed representation (for k=31) on a rather small pangenome of 4,546 Salmonella genomes, which can be downloaded from https://zenodo.org/records/1323684.
Can you please try to build your archive on the same data?

Specifically, the pipeline ran for ~5h before aborting with "no space left on device", which is very strange because I have over 1.5 TB available. I've also noticed that the pipeline writes some very large intermediate files, e.g. 186 GB. Can you confirm this?
Are there any parameters I need to set (I've set -k 31 and -j 8)?
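
A "no space left on device" error can appear even when one disk has plenty of room, for example if the temporary files land on a different, smaller filesystem, or if the filesystem runs out of inodes rather than bytes. A minimal sketch for checking both is below. It assumes a POSIX system; which directories the pipeline actually writes to is not stated in this thread, so the two paths checked here (the current directory and the system temp directory) are only guesses.

```python
import os
import shutil
import tempfile


def report_space(path):
    """Return free bytes and free inodes for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free (bytes)
    stats = os.statvfs(path)          # POSIX only
    return {
        "free_bytes": usage.free,
        "free_inodes": stats.f_favail,  # inodes available to unprivileged users
    }


# Hypothetical locations: replace with the directories the pipeline really uses.
for label, path in (("working dir", "."), ("temp dir", tempfile.gettempdir())):
    info = report_space(path)
    print(f"{label} ({path}): "
          f"{info['free_bytes'] / 1e9:.1f} GB free, "
          f"{info['free_inodes']} inodes free")
```

If the two directories report very different numbers, the intermediate files may be going to the smaller filesystem.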

Thanks!
Best,
-Giulio

@amatur
Member

amatur commented Oct 27, 2023

Hi Giulio,

The current pipeline is unfortunately not well optimized for intermediate disk usage, but there are some easy fixes. High disk usage is expected, since we dump the intermediate uncompressed color matrix to disk (which is not even gzipped). The other issue is that the current version in github only supports up to 128 colors (I realize this constraint is not documented anywhere).
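
To get a feel for how quickly a dense, uncompressed color matrix grows, here is a rough back-of-the-envelope sketch. The on-disk layout is not described in this thread, so the function simply assumes one fixed-size cell per (k-mer, color) pair; `bits_per_cell` and the k-mer count are hypothetical values chosen for illustration, not measurements from the Salmonella dataset.

```python
def dense_color_matrix_bytes(num_kmers, num_colors, bits_per_cell=8):
    """Size in bytes of a dense k-mer x color matrix.

    Assumes each row is padded up to a whole number of bytes.
    """
    row_bits = num_colors * bits_per_cell
    row_bytes = (row_bits + 7) // 8   # round up to whole bytes
    return num_kmers * row_bytes


NUM_COLORS = 4546                 # genomes in the dataset from the issue
HYPOTHETICAL_KMERS = 10_000_000   # illustrative only, not a measured count

for bits in (1, 8):               # 1 = bit-packed, 8 = one byte/char per cell
    size = dense_color_matrix_bytes(HYPOTHETICAL_KMERS, NUM_COLORS, bits)
    print(f"{bits} bit(s) per cell: {size / 1e9:.2f} GB")
```

Under these assumptions the size scales linearly with both the number of distinct k-mers and the number of colors, so a text-like encoding with one byte per cell is eight times larger than a bit-packed one.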

Currently I am working on fixing these two issues. I have an experimental implementation that supports a larger number of colors. I will test whether it works on this dataset and then update the repo with the fixes.
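
For illustration only, here is one generic way to represent a color set without a fixed width: a bitset sized from the number of colors at run time. This is not the repository's code, and the thread does not say how the 128-color limit arises or how the experimental implementation removes it; the `ColorSet` class and its methods are hypothetical names.

```python
class ColorSet:
    """Variable-width bitset with one bit per color."""

    def __init__(self, num_colors):
        self.num_colors = num_colors
        self.bits = bytearray((num_colors + 7) // 8)  # round up to whole bytes

    def add(self, color):
        if not 0 <= color < self.num_colors:
            raise IndexError(f"color {color} out of range")
        self.bits[color // 8] |= 1 << (color % 8)

    def __contains__(self, color):
        if not 0 <= color < self.num_colors:
            return False
        return bool(self.bits[color // 8] & (1 << (color % 8)))

    def count(self):
        """Number of colors present in the set."""
        return sum(bin(byte).count("1") for byte in self.bits)


colors = ColorSet(4546)      # one bit per genome in the dataset from the issue
for c in (0, 128, 4545):     # indices past 127 work the same as the others
    colors.add(c)
print(colors.count(), 128 in colors, 129 in colors)
```

A set built this way costs `ceil(num_colors / 8)` bytes per k-mer before any compression, whatever the number of colors.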

Thanks,
Amatur

@jermp
Author

jermp commented Oct 27, 2023

Hi @amatur,
thank you for the answer and confirmation about the space usage.

> we dump the intermediate uncompressed color matrix to disk (which is not even gzipped)

Yes, I think this is a severe limitation, because it would prevent the tool from being used even on small files.

> The other issue is that the current version in github only supports up to 128 colors.

Oh, that's why! Let me know.

Best,
-Giulio

@jermp
Author

jermp commented Nov 30, 2023

Hi @amatur and @yoann-dufresne,
any update on this matter?
