Run MarkDuplicatesSpark by library #234

Open
tyamaguchi-ucla opened this issue Aug 8, 2022 · 4 comments
Labels: enhancement (New feature or request)

tyamaguchi-ucla (Contributor) commented Aug 8, 2022

This is not too urgent, but we probably want to implement the following steps:

  1. Run MarkDuplicatesSpark by library (or SAMtools markdup)
  2. Remove intermediate files
  3. Merge BAMs with SAMtools merge

so that we can process large samples with multiple libraries (e.g. CPCG0196-F1) within 2TB of scratch.

We could parallelize step 1 for intermediate-size samples with multiple libraries (e.g. CPCG0196-B1), but I'm not sure this would always be faster because the library-level BAMs still need to be merged.
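A minimal sketch of the per-library flow described above, assuming placeholder library IDs, file names, and thread counts (the real flags would follow whatever align-DNA/metapipeline-DNA already uses):

```bash
# Sketch only: mark duplicates per library, clean /scratch between libraries,
# then merge once at the sample level. Names below are placeholders.
set -euo pipefail

LIBRARIES=("libA" "libB")     # hypothetical library IDs for one sample
DEDUP_BAMS=()

for LIB in "${LIBRARIES[@]}"; do
    # 1. Mark duplicates on the library-level BAM
    gatk MarkDuplicatesSpark \
        --input "aligned_${LIB}.bam" \
        --output "dedup_${LIB}.bam" \
        --tmp-dir "/scratch/md_tmp_${LIB}"

    # 2. Remove that library's intermediates so /scratch only ever holds one
    #    library's Spark spill files at a time
    rm -rf "/scratch/md_tmp_${LIB}" "aligned_${LIB}.bam"

    DEDUP_BAMS+=("dedup_${LIB}.bam")
done

# 3. Merge the deduplicated library-level BAMs into one sample-level BAM
samtools merge -@ 4 sample.dedup.bam "${DEDUP_BAMS[@]}"
samtools index sample.dedup.bam
```

Running the libraries sequentially keeps the peak /scratch footprint to one library's intermediates; parallelizing step 1 would speed things up at the cost of holding several libraries' intermediates at once.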

"""
It looks like both of the logs indicated 2TB scratch wasn't enough and we know MarkDuplicatesSpark generates quite a bit of intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove intermediate files and then samtools merge.

Originally posted by @tyamaguchi-ucla in #229 (comment)

tyamaguchi-ucla (Contributor, Author) commented:

@nkwang24 @yashpatel6 For multi-library samples, this approach would help, although it will take some time to implement. It would also be helpful to understand how /scratch usage compares between MarkDuplicatesSpark and SAMtools markdup.

tyamaguchi-ucla added the enhancement (New feature or request) label May 8, 2023
nkwang24 commented May 8, 2023

@tyamaguchi-ucla agreed. I wrote a script to periodically log the scratch use over the course of a metapipeline run, but as I commented in #229, I can't access the files generated by Spark. The best I've been able to do is correlate sample size with where in metapipeline the failures occur. Based on what I've gathered so far using the latest metapipeline PR, it looks like align-DNA lets samples of up to ~450Gb through. Of these, call-gSNP lets ~400Gb through.
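For reference, a hypothetical version of that kind of periodic /scratch logger (the interval and log destination are assumptions, not the actual script):

```bash
# Log /scratch usage at a fixed interval so failures can later be lined up
# against pipeline stages. Interval and log path are placeholders.
INTERVAL=300                         # seconds between samples
LOGFILE="${HOME}/scratch_usage.log"  # assumed log destination outside /scratch

while true; do
    printf '%s\t%s\n' "$(date -Iseconds)" \
        "$(df -h --output=used,avail /scratch | tail -n 1)" >> "${LOGFILE}"
    sleep "${INTERVAL}"
done
```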

It's really hard to get a good idea of what's going on, as my tests have been somewhat inconsistent and confounded by stochastic node-level errors. Possible sources of inconsistency:

  1. The limit may depend on tumor vs. normal FASTQ size rather than total FASTQ size
  2. @yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in the time it takes for processes to write their outputs to /hot? If the processes are asynchronous, this might affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted.

yashpatel6 (Contributor) commented:

"""
@yashpatel6 For the intermediate file deletions that you implemented, is it possible that network congestion causes variability in the time it takes for processes to write their outputs to /hot? If the processes are asynchronous, this might affect how long the intermediate files stay in /scratch, and the next process might start filling up /scratch before they can be deleted.
"""

We discussed this briefly in the metapipeline-DNA meeting but just to log the discussion here:

The intermediate file deletion within individual pipelines doesn't depend on the write to /hot: when intermediate file deletion is enabled, the deleted files are never written to /hot, and conversely, any output files that are written to /hot in that case aren't subject to the deletion process.

From an inter-pipeline deletion perspective, there's only one case where this happens: align-DNA output is deleted from /scratch once it has been copied to /hot and used by the first step of call-gSNP. This process could potentially be impacted by latency: if the copy to /hot takes a long time, call-gSNP may continue while the deletion process is still waiting for the files to finish being copied. This specific case can actually be traced from the .command.log of the failing sample/patient by checking whether the deletion process had completed by the time the pipeline failed.
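A rough illustration of that check, assuming placeholder Nextflow work-directory paths (the actual text written by the deletion process would need to be matched against whatever metapipeline-DNA logs):

```bash
# Compare when the deletion task's log was last written against the failing
# call-gSNP task's log; both paths below are placeholders.
DELETION_WORKDIR=/scratch/work/aa/0123456789abcdef   # deletion process task
FAILED_WORKDIR=/scratch/work/bb/fedcba9876543210     # failing call-gSNP task

# Last-modified times of the two .command.log files give a rough ordering
stat --format='%y  %n' \
    "${DELETION_WORKDIR}/.command.log" \
    "${FAILED_WORKDIR}/.command.log"

# Inspect the tail of the deletion log to see whether it had completed
tail -n 20 "${DELETION_WORKDIR}/.command.log"
```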

tyamaguchi-ucla (Contributor, Author) commented:

broadinstitute/gatk#8134 is another good reason to consider using samtools markdup instead, if benchmarking looks promising.
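For comparison, the standard samtools duplicate-marking chain looks roughly like this (file names, thread counts, and temp directories are placeholders; the open question is how its /scratch footprint compares to MarkDuplicatesSpark):

```bash
# Name-collate, add mate-score tags, coordinate-sort, then mark duplicates.
# Sketch only, not the pipeline's actual command line.
samtools collate -@ 4 -O aligned.bam \
    | samtools fixmate -@ 4 -m - - \
    | samtools sort -@ 4 -T /scratch/sort_tmp - \
    | samtools markdup -@ 4 -T /scratch/markdup_tmp - dedup.bam
samtools index dedup.bam
```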
