
run_MarkDuplicatesSpark_GATK error exit status (3) with CPCG0196-F1 #229

Open
jarbet opened this issue Jul 29, 2022 · 8 comments
jarbet commented Jul 29, 2022

Describe the bug

Pipeline failed when testing on CPCG0196-F1: run_MarkDuplicatesSpark_GATK terminated with error exit status (3). First noticed here.

  • Pipeline release version: unreleased, this branch (note this branch has not implemented any retry methods for memory)
  • Cluster you are using: Slurm
  • Node type: F72
  • Submission method: python submission script
  • Actual submission script (python submission script, "nextflow run ...", etc.)

Testing info/results:

  • BWA-MEM2 (failed after 19 hours)

    • submission script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/testing_CPCG0196-F1.sh
    • sample: CPCG0196-F1
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
    • log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
  • HISAT2 (failed after 22 hours)

    • submission script: same
    • sample: same
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
    • log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log

Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

Error executing process > 'align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK'
Caused by:
Process align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (3)

But only HISAT2 reports the following (several times) for run_MarkDuplicatesSpark_GATK:

No space left on device

@tyamaguchi-ucla

It looks like both logs indicate that 2TB of scratch wasn't enough, and we know MarkDuplicatesSpark generates quite a few intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove the intermediate files, and then merge with samtools merge.
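The library-level approach could look roughly like this. This is only a sketch: the per-library BAM names and the scratch layout are assumptions, not pipeline code, and splitting by read group would still have to happen upstream.

```shell
# Sketch: run MarkDuplicatesSpark once per library, freeing scratch
# between runs, then merge the deduplicated BAMs.
# lib1.bam / lib2.bam are hypothetical per-library inputs.
for lib in lib1 lib2; do
    gatk MarkDuplicatesSpark \
        --input "${lib}.bam" \
        --output "${lib}.dedup.bam" \
        --tmp-dir "/scratch/${lib}"
    rm -rf "/scratch/${lib}"   # reclaim scratch before the next library
done

# Merge the per-library results into a single BAM.
samtools merge -@ 4 merged.dedup.bam lib1.dedup.bam lib2.dedup.bam
```

This ordering is valid because PCR and optical duplicates only arise within a library, so marking duplicates per library and merging afterwards gives the same duplicate flags, while the peak scratch footprint is bounded by the largest single library rather than the whole sample.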

@tyamaguchi-ucla

tyamaguchi-ucla commented Jul 29, 2022

Also, it looks like there have been no major updates to MarkDuplicatesSpark since 4.2.4.1 (our current version; the latest is 4.2.6.1).

@yashpatel6

Yeah, barring special nodes with expanded disk space or moving MarkDuplicatesSpark to run per library one by one, this will be a hard problem to fix.

@tyamaguchi-ucla

I think we want to implement #234 in the long run, but we could also try the -Dsamjdk.compression_level option, although I couldn't find the default compression level documented for MarkDuplicatesSpark.

--java-options -Dsamjdk.compression_level=X

https://gatk.broadinstitute.org/hc/en-us/community/posts/360061711971-How-to-set-a-COMPRESSION-LEVEL-of-ApplyBQSR
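Since BAM output is BGZF (Deflate) compressed, the trade-off this flag controls can be illustrated with plain gzip, which uses the same algorithm. The numbers are illustrative only, not BAM measurements: lower levels write faster but produce larger files, i.e. more scratch usage.

```shell
# Compare Deflate compression levels on compressible data.
tmp=$(mktemp)
yes ACGT | head -c 1000000 > "$tmp"
for level in 1 5 9; do
    printf 'level %s: %s bytes\n' "$level" "$(gzip -"$level" -c "$tmp" | wc -c)"
done
rm -f "$tmp"
```

So raising the level trades CPU time during compression for smaller intermediate and output files, which is why a higher samjdk level is worth testing against the "No space left on device" failures.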

@jarbet

jarbet commented Sep 19, 2022

Currently testing with --java-options -Dsamjdk.compression_level=6

@nkwang24

nkwang24 commented May 8, 2023

@jarbet Did changing the compression level help? I'm running into the same issue with a subset of CPCG. It looks like samples with total FASTQ size above ~400 GB fail with the current Spark configuration, and the FASTQ size distribution of CPCG overlaps this limit, with roughly one third of the cohort being too large.

I was trying to monitor scratch usage, but the intermediate files generated by Spark are assigned to nfsnobody with no read access, so I can't query directory size.

nfsnobody is a user account that is used by NFS (Network File System) when it cannot map a remote user to a local user. This can happen when the remote user does not exist on the local system or when the local system cannot authenticate the remote user. When this happens, NFS uses the nfsnobody account instead of the remote user’s account.

Not sure if there's a way to properly map the users so this doesn't happen, but this is probably low priority.
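One workaround for the permission problem: df reports usage at the filesystem level, so it does not need read access to the nfsnobody-owned files the way du does. Scratch consumption can still be watched during a run (the /scratch mount point is an assumption about the node layout):

```shell
# Filesystem-level usage; works even when du can't read the files.
df -h /scratch
# Poll once a minute while MarkDuplicatesSpark runs:
# watch -n 60 df -h /scratch
```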

@nkwang24

nkwang24 commented May 8, 2023

@tyamaguchi-ucla mentioned that it could potentially be possible to have Spark parallelize less, theoretically reducing data copying and scratch usage. The parameters for this are located in the F72.config and not template.config or default.config. @yashpatel6 would reducing the number of cpus allowed for the run_MarkDuplicatesSpark_GATK process reduce scratch usage?
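If lowering parallelism is worth testing, a process-level override in the style of the pipeline's node configs might look like the sketch below. This is an assumption, not pipeline code: the selector name would need to match what F72.config actually uses, the CPU count is hypothetical, and whether fewer Spark workers actually reduces scratch usage would have to be verified empirically.

```groovy
// Hypothetical override: cap cores for the mark-duplicates process only.
process {
    withName: 'run_MarkDuplicatesSpark_GATK' {
        cpus = 36  // hypothetical: half of an F72 node's 72 cores
    }
}
```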

If not, it looks like ~1/3 of CPCG will need to be run without Spark or dependent upon upgrades to our F72 scratch size.

@nkwang24

nkwang24 commented May 8, 2023

If we can conclude that the only way around this is by increasing scratch size, I can write up a cost-benefit analysis of upgrading scratch vs. having to run larger samples with Picard and send it to Paul.

It looks like the current metapipeline bottlenecks are scratch space during align-DNA mark duplicates Spark and call-gSNP recalibrate/reheader steps so it might be necessary to expand scratch regardless unless we can make optimizations at both of these steps.
