How to reduce the runtime? #1

Open
spcspin opened this issue Aug 23, 2023 · 1 comment

Comments

spcspin commented Aug 23, 2023

The most time-consuming part of the workflow is variant calling with HaplotypeCaller, so we focused on that step.

Here are the approaches we tried:

  1. HaplotypeCallerSpark:
    HaplotypeCallerSpark is a tool designed by the GATK team to replace the multithreading functionality of GATK3. However, it is still in
    BETA, and many attempts to run it on the bee data caused problems. After a discussion with the GATK team, it was confirmed that the
    problem lies in HaplotypeCallerSpark itself.

    The discussion link

  2. Break the reference into smaller chunks for HaplotypeCaller:
    Scatter the intervals based on the N-masked regions of the reference genome, run HaplotypeCaller on each interval, and collect the
    per-interval calls at the end with the GatherVcfs tool.

  3. Optimize Java settings:
    We tried adjusting the JVM parameters related to garbage collection and heap size:

    • -XX:ParallelGCThreads
    • Heap space: -Xmx

    There was not much difference in the results.

  4. CPU utilization:
    Using the --native-pair-hmm-threads option in HaplotypeCaller also made little difference in the results.
    For items 3 and 4, refer to this website

  5. Try different variant calling tools:
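
For reference, here is a rough sketch of how the command lines for items 2 to 4 could be put together. This is not the exact pipeline we ran: the file names, the interval strings, and the default values for `-Xmx`, `-XX:ParallelGCThreads` and `--native-pair-hmm-threads` are placeholders chosen for illustration. The script only prints the commands; it does not run GATK.

```python
import shlex


def haplotypecaller_cmd(ref, bam, interval, out_vcf,
                        xmx="8g", gc_threads=2, hmm_threads=4):
    """Build one HaplotypeCaller command restricted to a single interval.

    The default values are placeholders, not tuned recommendations.
    """
    java_opts = f"-Xmx{xmx} -XX:ParallelGCThreads={gc_threads}"
    return [
        "gatk", "--java-options", java_opts,
        "HaplotypeCaller",
        "-R", ref,
        "-I", bam,
        "-L", interval,
        "--native-pair-hmm-threads", str(hmm_threads),
        "-O", out_vcf,
    ]


def gather_cmd(shard_vcfs, merged_vcf):
    """Build the GatherVcfs command; shard_vcfs must be in genomic order."""
    cmd = ["gatk", "GatherVcfs"]
    for vcf in shard_vcfs:
        cmd += ["-I", vcf]
    cmd += ["-O", merged_vcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical intervals; in practice they come from the N-masked regions.
    intervals = ["chr1:1-500000", "chr1:500101-1200000"]
    shard_vcfs = []
    for i, interval in enumerate(intervals):
        out = f"shard_{i:04d}.vcf.gz"
        shard_vcfs.append(out)
        print(shlex.join(haplotypecaller_cmd("ref.fa", "sample.bam", interval, out)))
    print(shlex.join(gather_cmd(shard_vcfs, "sample.vcf.gz")))
```

The per-interval commands are independent of each other, so they can be submitted as separate jobs. GatherVcfs expects its inputs in genomic order, which is why the shard names carry a zero-padded index.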

spcspin commented Aug 31, 2023

On the low-integrity problem that arises when the reference genome is broken into smaller chunks for HaplotypeCaller, the GATK team replied as follows:
Depending on how you scatter your intervals it should still hold true. Worst case scenario you may have to run HaplotypeCaller per contig/chromosome which is probably the safest way but if your reference is split by long repeats of N then you may want to split your intervals based on the positions of N repeats.
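
To make the "split on N repeats" idea concrete, here is a minimal sketch that turns one contig sequence into intervals bounded by long runs of N. The function name `non_n_intervals` and the `min_n_run` threshold are made up for this example, and the sequence is assumed to be already loaded into memory as a string. GATK also bundles a Picard tool, ScatterIntervalsByNs, for this job; the code below only illustrates the logic and is not a replacement for it.

```python
import re


def non_n_intervals(contig, seq, min_n_run=100):
    """Return 1-based inclusive intervals ("contig:start-end") covering seq,
    cut at every run of N that is at least min_n_run bases long.

    Shorter N runs stay inside an interval.
    """
    intervals = []
    start = 0  # 0-based start of the current non-N stretch
    for match in re.finditer(r"[Nn]{%d,}" % min_n_run, seq):
        if match.start() > start:
            intervals.append(f"{contig}:{start + 1}-{match.start()}")
        start = match.end()
    if start < len(seq):
        intervals.append(f"{contig}:{start + 1}-{len(seq)}")
    return intervals


if __name__ == "__main__":
    toy = "ACGTACGT" + "N" * 5 + "ACGT"  # toy contig, 17 bases
    print(non_n_intervals("chr1", toy, min_n_run=5))
    # ['chr1:1-8', 'chr1:14-17']
```

Each returned string can be passed to HaplotypeCaller with `-L`. Because the cuts fall inside N runs, no interval boundary lands in the middle of callable sequence.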
