Guidance for working with large reference data sets #37
Comments
Wow, that is a large data set. I'll have to think about your questions. Maybe @Funatiq can also chime in. To point 4: one thing I would advise right away is to keep the kmer length at 16 and the kmer data type at 32 bits. The difference in classification accuracy is not that big, but the runtime and memory impact will be quite substantial.
In summary: I would first try to build one single partition with k=16 (32-bit kmers) and perform a few experiments to estimate the runtime performance and accuracy based on this single partition. Once everything works satisfactorily for this single partition, I would then build all partitions. I'll be honest, we have never built a database in the TB range.
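For concreteness, such a single-partition test could look roughly like the sketch below. This is only a sketch: the database name, genome paths, taxonomy directory and read set are placeholders, and the defaults of k=16 with 32-bit kmers apply simply because they are not overridden.

```bash
# Rough sketch: build and benchmark one partition with default settings
# (k=16, 32-bit kmer type) before committing to the full multi-TB build.
# All names and paths below are placeholders.
./metacache build refseq_part01 partitions/part01/*.fna \
    -taxonomy ncbi_taxonomy/

# Query a small test read set against this single partition to gauge
# runtime and classification quality.
./metacache query refseq_part01 test_reads.fq -out part01_results.txt
```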
Thanks a lot for the detailed answer!
You just might need to experiment a bit with all of that. I guess I would start with larger partitions (1TB or more) and reduce the partition size in case of any problem / poor performance / poor classification results.
BTW: Can you tell me a bit about your use case and the hardware specs of your system(s)?
I am working with ~500K genomes comprising about ~1.5TB. Is there any way to speed up the index building? I have created another issue related to memory mapping (#43).
@ChillarAnand: If you have access to a GPU system, you can use the GPU version of MetaCache, which is able to build even very large database files within seconds to a few minutes. This will, however, also require partitioning your database so that the partitions fit into the GPU memory. It is even faster if you have access to a multi-GPU system like an NVIDIA DGX. The produced database files are compatible with the CPU version, so you can build databases on the GPU and then query them on the CPU. Regarding database loading, see my comment in #43.
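As an illustration of this build-on-GPU / query-on-CPU split, a sketch under the assumption that the GPU build exposes the same build/query subcommands as the CPU version; the binary name, paths and database names are placeholders:

```bash
# On the GPU system: build one database per partition (each partition
# must fit into GPU memory). Placeholder paths and names.
./metacache build part01_db partitions/part01/*.fna -taxonomy ncbi_taxonomy/

# On a CPU-only system: the produced database file can then be queried
# with the CPU version as usual.
./metacache query part01_db sample_reads.fq -out sample_vs_part01.txt
```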
Hi,
I am trying to build a database from RefSeq and GenBank genomes.
The total size of the ~1.9 million compressed genomes is ~8.5T. Since the data set contains many genomes, some of which have extremely long chromosomes, I built MetaCache with:
make MACROS="-DMC_TARGET_ID_TYPE=uint64_t -DMC_WINDOW_ID_TYPE=uint64_t -DMC_KMER_TYPE=uint64_t"
- What will the peak memory consumption during the build be when partitioning into partitions of x M size? At the moment I am running ${p2mc}/metacache-partition-genomes ${p2g} 1600000, since I have at most ~1.9T of memory available. Is the partition size reasonable?
- Does it matter for the partition size calculation whether the genomes are compressed or not?
- Would it be beneficial for build time and memory consumption to create more, smaller partitions instead of fewer large ones? There was a similar question in #33 ("Merging results from querying partitioned database"), with advice to build fewer partitions to keep the merging time in check. Should I then try to find the maximum partition size that will fit into memory during building?
- Since I am partitioning anyway, do I actually need to compile with uint64_t, or should I check the largest partition, sequence-count-wise, and see if I can get away with uint32_t? (One way to check the per-partition sequence counts is sketched at the end of this post.)
- Would you expect performance differences between querying a single db, a few partitions, and many partitions, using the merge functionality with the latter two?
- I chose -kmerlen 20 based on Figure SF1 from the publication. Would you advise against this in favor of the default value of 16, maybe to keep the computational resource demand, query speed etc. at a reasonable level? Should other sketching parameters be adjusted as well for a value of 20, or are the defaults fine?
- Since the reference data set is large, should some of the advanced options be set/adjusted, e.g. -remove-overpopulated-features? If so, what values should be chosen based on the reference data?

I would be very grateful for any guidance you have.
Best
Oskar
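Regarding the uint32_t vs. uint64_t question above: a quick way to check whether a smaller ID type could suffice is to count the sequence records per partition and compare the largest count against the maximum of the intended type (uint32_t tops out at 4,294,967,295). A minimal sketch, assuming gzip-compressed FASTA files grouped into one directory per partition; adjust the paths and glob to whatever layout your partitioning output actually has:

```bash
# Sketch: count FASTA records per partition to see whether the largest
# partition stays below the limit of the intended target ID type.
# Assumptions: one directory per partition, gzip-compressed FASTA files.
for part in partitions/partition_*; do
    count=$(zcat "$part"/*.gz | grep -c '^>')
    echo "$part: $count sequences"
done
```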