Expected Runtimes for HAWK #8

Open
jasongallant opened this issue Nov 8, 2018 · 1 comment

Comments

@jasongallant

Hello- thanks for developing this very intriguing software.

I'm currently exploring its use for several resequenced genomes (~1 GB each), and I'm curious what the expected runtimes are. I am running the runHAWK script in a shared HPC configuration and am unsure what the wall time and memory requirements are. Any pointers or benchmarks?

Second, it is not immediately clear what the strings output by the HAWK.cpp program indicate: they are many orders of magnitude larger than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarity on this may help me benchmark the program on my own data.

@atifrahman
Owner

Thanks for using it!

The run times depend on the number of samples and the read coverage in each sample, in addition to the size of the genome. Our analysis of ~200 YRI and TSI samples from the 1000 Genomes Project took around 12 days (mostly for k-mer counting) using 30 cores. The analysis of the E. coli ampicillin resistance data set took about 2 days. It should run in 64 GB of memory, and the memory requirement can be lowered by decreasing valInc in hawk.cpp.
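For sizing a scheduler request, the human-data figures quoted above can be turned into a rough core-hour budget. The sketch below is only arithmetic on those numbers; the helper name `core_hours` is hypothetical, and it assumes every core was busy for the whole run, so the result is an upper bound. The per-sample figure assumes cost scales linearly with sample count, which this thread does not confirm.

```python
def core_hours(days: float, cores: int) -> float:
    """Upper-bound core-hours, assuming every core is busy for the whole run."""
    return float(days * 24 * cores)

# Figures quoted in the reply: ~200 human samples, about 12 days on 30 cores.
human_run = core_hours(days=12, cores=30)
print(human_run)        # 8640.0

# Naive per-sample figure (assumes linear scaling in the number of samples).
print(human_run / 200)  # 43.2
```

Because coverage and genome size also affect run time, a figure like this is best treated as a starting point for a first wall-time request rather than a prediction.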

The case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff files output by hawk.cpp contain the k-mers that passed Bonferroni correction (before correcting for co-factors). Unless something is going wrong, they should contain a much smaller number of k-mers than the total number of k-mers. The files may still be large because each entry holds the k-mer string, its p-value, and the counts of the k-mer in each sample.
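One way to check that the significant k-mers are indeed a small fraction of the total is to count the records in each file. This is a minimal sketch that assumes one k-mer record per non-empty line, which the thread does not confirm; the function names are hypothetical, and the total k-mer count has to come from your own k-mer counting step.

```python
def count_records(path: str) -> int:
    """Count non-empty lines, assuming (unconfirmed) one k-mer record per line."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.strip())

def fraction_significant(n_significant: int, n_total: int) -> float:
    """Share of all k-mers that appear in a .kmerDiff file."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_significant / n_total
```

For example, `fraction_significant(count_records("case_out_w_bonf.kmerDiff"), n_total)` should come out well below 1 if the run behaved as described; a value near or above 1 would suggest the file is being read differently than assumed here.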
