Run MarkDuplicatesSpark by library #234

Open
tyamaguchi-ucla opened this issue Aug 8, 2022 · 4 comments
Labels: enhancement (New feature or request)

tyamaguchi-ucla (Contributor) commented Aug 8, 2022

This is not too urgent, but we probably want to implement the following steps:

  1. Run MarkDuplicatesSpark by library (or SAMtools markdup)
  2. Remove intermediate files
  3. Merge BAMs with SAMtools merge

so that we can process large samples with multiple libraries (e.g. CPCG0196-F1) within 2TB of scratch.

We could parallelize step 1 for intermediate-size samples with multiple libraries (e.g. CPCG0196-B1), but I'm not sure this would always be faster because the library-level BAMs still need to be merged.
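A minimal sketch of the per-library flow described above, assuming placeholder library IDs, file names, and thread counts (the real flags would follow whatever align-DNA/metapipeline-DNA already uses):

```bash
# Sketch only: mark duplicates per library, clean /scratch between libraries,
# then merge once at the sample level. Names below are placeholders.
set -euo pipefail

LIBRARIES=("libA" "libB")     # hypothetical library IDs for one sample
DEDUP_BAMS=()

for LIB in "${LIBRARIES[@]}"; do
    # 1. Mark duplicates on the library-level BAM
    gatk MarkDuplicatesSpark \
        --input "aligned_${LIB}.bam" \
        --output "dedup_${LIB}.bam" \
        --tmp-dir "/scratch/md_tmp_${LIB}"

    # 2. Remove that library's intermediates so /scratch only ever holds one
    #    library's Spark spill files at a time
    rm -rf "/scratch/md_tmp_${LIB}" "aligned_${LIB}.bam"

    DEDUP_BAMS+=("dedup_${LIB}.bam")
done

# 3. Merge the deduplicated library-level BAMs into one sample-level BAM
samtools merge -@ 4 sample.dedup.bam "${DEDUP_BAMS[@]}"
samtools index sample.dedup.bam
```

Running the libraries sequentially keeps the peak /scratch footprint to one library's intermediates; parallelizing step 1 would speed things up at the cost of holding several libraries' intermediates at once.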

"""
It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.

Originally posted by @tyamaguchi-ucla in #229 (comment)

tyamaguchi-ucla (Contributor, Author) commented:

@nkwang24 @yashpatel6 For multi-library samples, this approach would help, although it will take some time to implement. It would also be helpful to understand how /scratch usage compares between MarkDuplicatesSpark and SAMtools markdup.

tyamaguchi-ucla added the enhancement (New feature or request) label May 8, 2023
nkwang24 commented May 8, 2023

@tyamaguchi-ucla agreed. I wrote a script to periodically log the scratch use over the course of a metapipeline run, but as I commented in #229, I can't access the files generated by Spark. The best I've been able to do is correlate sample size with where in metapipeline the failures occur. Based on what I've gathered so far using the latest metapipeline PR, it looks like align-DNA lets samples of up to ~450Gb through. Of these, call-gSNP lets ~400Gb through.
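For reference, a hypothetical version of that kind of periodic /scratch logger (the interval and log destination are assumptions, not the actual script):

```bash
# Log /scratch usage at a fixed interval so failures can later be lined up
# against pipeline stages. Interval and log path are placeholders.
INTERVAL=300                         # seconds between samples
LOGFILE="${HOME}/scratch_usage.log"  # assumed log destination outside /scratch

while true; do
    printf '%s\t%s\n' "$(date -Iseconds)" \
        "$(df -h --output=used,avail /scratch | tail -n 1)" >> "${LOGFILE}"
    sleep "${INTERVAL}"
done
```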

It's really hard to get a good idea of what's going on, as my tests have been somewhat inconsistent and confounded by stochastic node-level errors. Possible sources of inconsistency:

  1. The limit may depend on tumor vs. normal FASTQ size rather than total FASTQ size
  2. @yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in the time it takes for processes to write their outputs to /hot? If the processes are asynchronous, this might affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted.

yashpatel6 (Contributor) commented:

"""
@yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in the time it takes for processes to write their outputs to /hot? If the processes are asynchronous, this might affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted.
"""

We discussed this briefly in the metapipeline-DNA meeting but just to log the discussion here:

The intermediate file deletion within individual pipelines doesn't depend on the write to /hot: when intermediate file deletion is enabled, the deleted files are never written to /hot, and conversely, any output files that are written to /hot in that case aren't subject to the deletion process.

From an inter-pipeline deletion perspective, there's only one case where this happens: align-DNA output is deleted from /scratch once it has been copied to /hot and used by the first step of call-gSNP. This process could potentially be impacted by latency: if the copy to /hot takes a long time, call-gSNP may continue while the deletion process is still waiting for the files to finish being copied. This specific case can actually be traced from the .command.log of the failing sample/patient by checking whether the deletion process had completed by the time the pipeline failed.
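A rough illustration of that check, assuming placeholder Nextflow work-directory paths (the actual text written by the deletion process would need to be matched against whatever metapipeline-DNA logs):

```bash
# Compare when the deletion task's log was last written against the failing
# call-gSNP task's log; both paths below are placeholders.
DELETION_WORKDIR=/scratch/work/aa/0123456789abcdef   # deletion process task
FAILED_WORKDIR=/scratch/work/bb/fedcba9876543210     # failing call-gSNP task

# Last-modified times of the two .command.log files give a rough ordering
stat --format='%y  %n' \
    "${DELETION_WORKDIR}/.command.log" \
    "${FAILED_WORKDIR}/.command.log"

# Inspect the tail of the deletion log to see whether it had completed
tail -n 20 "${DELETION_WORKDIR}/.command.log"
```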

tyamaguchi-ucla (Contributor, Author) commented:

broadinstitute/gatk#8134 is another good reason to consider using samtools markdup instead, if benchmarking looks promising.
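For comparison, the standard samtools duplicate-marking chain looks roughly like this (file names, thread counts, and temp directories are placeholders; the open question is how its /scratch footprint compares to MarkDuplicatesSpark):

```bash
# Name-collate, add mate-score tags, coordinate-sort, then mark duplicates.
# Sketch only, not the pipeline's actual command line.
samtools collate -@ 4 -O aligned.bam \
    | samtools fixmate -@ 4 -m - - \
    | samtools sort -@ 4 -T /scratch/sort_tmp - \
    | samtools markdup -@ 4 -T /scratch/markdup_tmp - dedup.bam
samtools index dedup.bam
```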
