Request to update to the latest sourmash_plugin_branchwater features and performance improvements in yacht #124

Open
tnmquann opened this issue Oct 9, 2024 · 3 comments

Comments

@tnmquann

tnmquann commented Oct 9, 2024

Hi,

I've been using yacht as a way to reduce false positives in sourmash, and I wanted to ask if it's possible to update the tool to incorporate the latest features from sourmash_plugin_branchwater? This would be helpful for a couple of reasons:

  • Currently, the newest version of yacht only supports processing one sample at a time, which becomes time-consuming when working with many samples.
  • As highlighted in the tutorial, the training process is indeed time-consuming, especially with large databases. I've been training GTDB-R220 (all genomes) for nearly a week without results, whereas training on the genomic representatives version only took me about a morning. This performance gap is significant.

I believe incorporating improvements such as support for the new RocksDB data format and the use of manysketch and/or fastmultigather could help reduce processing times and allow multiple samples to be handled simultaneously.

Thanks for the great tool, and I'm looking forward to potential improvements in future releases!

@dkoslicki
Member

Thanks for the suggestion @tnmquann! We (@mahmudhera and @chunyuma) have recently been working on this exact issue, but from a different direction: the reference database formation step in yacht train contains an inherently quadratic step, in that all genomes need to be compared to all others to identify those that are within the ANI threshold. Taking a different algorithmic approach than anything in branchwater, we've been able to reduce the training time on a dataset of ~2.7M genomes from a month to about 3 days on a 128-core server. It will take a while, but we will eventually make that an official part of YACHT.
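To make the "inherently quadratic" point concrete, here is a minimal sketch of a naive all-vs-all comparison. It is not YACHT's implementation: the Jaccard index over toy hash sets is only a stand-in for an ANI estimate, and the names `num_pairs`, `jaccard`, and `similar_pairs` are hypothetical.

```python
from itertools import combinations


def num_pairs(n: int) -> int:
    """Number of unordered genome pairs an all-vs-all comparison must visit."""
    return n * (n - 1) // 2


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two hash sets (a stand-in for an ANI estimate)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(sketches: dict, threshold: float) -> list:
    """Naive O(n^2) scan: report every pair at or above the threshold."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(sketches.items(), 2):
        if jaccard(a, b) >= threshold:
            hits.append((name_a, name_b))
    return hits


# Toy data, invented purely for illustration.
toy = {"g1": {1, 2, 3, 4}, "g2": {1, 2, 3, 5}, "g3": {7, 8, 9}}
print(similar_pairs(toy, 0.5))
print(num_pairs(2_700_000))  # pair count for ~2.7M genomes
```

For roughly 2.7M genomes the pair count is about 3.6 trillion, which is why an algorithmic change matters more here than faster per-pair code.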

As for "only supporting one sample at a time": since each yacht run on a sample is independent of every other sample, doesn't something like GNU parallel or xargs -P work? Doing it inside yacht run itself wouldn't save much time at all, apart from the small amount of time needed to load the reference/training database.
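For anyone who prefers a script over GNU parallel or xargs -P, the same idea can be sketched in Python: launch one independent process per sample and cap how many run at once. The helper names are hypothetical, and the `--json`, `--sample_file`, and `--out` flags are written from memory of the YACHT README, so treat them as assumptions and check them against `yacht run --help`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd: list[str]) -> int:
    """Run one command to completion and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def run_all(commands: list[list[str]], max_workers: int = 4) -> list[int]:
    """Run independent commands, at most max_workers at a time.

    Threads are enough here: each one only waits on a child process.
    Exit codes come back in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


def yacht_command(config_json: str, sample_sig: str, out_path: str) -> list[str]:
    """Build a `yacht run` command line (flag names are assumptions)."""
    return ["yacht", "run", "--json", config_json,
            "--sample_file", sample_sig, "--out", out_path]


# Hypothetical usage; file names are placeholders:
# samples = ["s1.sig.zip", "s2.sig.zip"]
# cmds = [yacht_command("config.json", s, s + ".xlsx") for s in samples]
# exit_codes = run_all(cmds, max_workers=8)
```

Each process still loads the training database on its own, which is the small overhead mentioned above.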

@tnmquann
Author

Hi @dkoslicki , thank you for letting me know about the upcoming release, and it’s exciting to hear about the algorithmic improvements to reduce training time. I’ll be looking forward to seeing that in action when it’s ready.

For running multiple samples, I’m currently using GNU parallel to process them simultaneously, as you suggested. However, I just had a sudden thought: would there be any significant time savings if the database were loaded once for all queries, similar to how the fastmultigather module operates, with multithreading then used to process multiple samples at the same time? Just a curiosity that popped up while working with YACHT.

Thanks again, and really excited for the next release.

@dkoslicki
Member

We have experimented with loading the database once and using multithreading to process multiple samples, and found that the gains were negligible (on the order of seconds). This might be helpful when you have a massive reference database, which typically occurs with a very high ANI value (e.g., 0.99995), but in such cases a more targeted approach seems better (focusing on a specific clade or clades).
