Expected Runtimes for HAWK #8

jasongallant · 2018-11-08T02:57:29Z

Hello- thanks for developing this very intriguing software.

I'm currently exploring its use for several resequenced genomes (~1 GB each), and I'm curious what runtimes to expect. I am running it in a shared HPC configuration and am unsure what the wall time and memory requirements are when running the runHAWK script. Any pointers or benchmarks?

Second, it is not immediately clear what the strings output by the HAWK.cpp program indicate: they are many orders of magnitude larger than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarification would help me benchmark the program on my own data.

atifrahman · 2018-11-08T07:15:52Z

Thanks for using it!

The run times will depend on the number of samples and the read coverage in each sample, in addition to the size of the genome. Our analysis of ~200 YRI and TSI samples from the 1000 Genomes Project took around 12 days (mostly spent on k-mer counting) using 30 cores. The analysis of the E. coli ampicillin resistance data set took about 2 days. It should run in 64 GB of memory, and the memory requirement can be lowered by decreasing valInc in hawk.cpp.

The case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff files output by hawk.cpp contain the k-mers that passed Bonferroni correction (before correcting for co-factors). Unless something is going wrong, they should contain far fewer k-mers than the total number of k-mers. The files may still be large because each entry holds the k-mer string, p-values, and the count of the k-mer in each sample.
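
For readers unfamiliar with the term, here is a minimal sketch of what "passing Bonferroni correction" generally means: a test is kept only if its p-value is at most alpha divided by the number of tests. This is a generic illustration, not HAWK's code. The function name `bonferroni_filter`, the significance level `alpha=0.05`, and the example p-values are all assumptions, and the thread does not say which alpha or test count hawk.cpp uses.

```python
def bonferroni_filter(pvalues, alpha=0.05):
    """Return indices of p-values that pass a Bonferroni-corrected threshold.

    Hypothetical helper for illustration; alpha=0.05 is an assumed default.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Bonferroni: divide the significance level by the number of tests.
    threshold = alpha / m
    return [i for i, p in enumerate(pvalues) if p <= threshold]


# Made-up p-values, one per k-mer tested.
example_pvalues = [1e-9, 0.04, 0.5, 1e-3]
passing = bonferroni_filter(example_pvalues)
print(passing)
```

Because the threshold shrinks as the number of tested k-mers grows, only a small fraction of k-mers would normally survive, which is consistent with the expectation above that these files hold far fewer k-mers than the full set.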
