
run_MarkDuplicatesSpark_GATK error exit status (3) with CPCG0196-F1 #229

Open
jarbet opened this issue Jul 29, 2022 · 8 comments
jarbet commented Jul 29, 2022

Describe the bug

Pipeline failed when testing on CPCG0196-F1: run_MarkDuplicatesSpark_GATK terminated with error exit status (3). First noticed here.

  • Pipeline release version: unreleased, this branch (note this branch has not implemented any retry methods for memory)
  • Cluster you are using: Slurm
  • Node type: F72
  • Submission method: python submission script
  • Actual submission script (python submission script, "nextflow run ...", etc.)

Testing info/results:

  • BWA-MEM2 (failed after 19 hours)

    • submission script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/testing_CPCG0196-F1.sh
    • sample: CPCG0196-F1
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/CPCG0196-F1.csv
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174134Z/nextflow-log/report.html
    • log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/BWA-MEM2-CPCG0196-F1.log
  • HISAT2 (failed after 22 hours)

    • submission script: same
    • sample: same
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/align-DNA-8.0.0/CPCG0196-F1/log-align-DNA-8.0.0-20220725T174652Z/nextflow-log/report.html
    • log: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-samtools-sort-mem/HISAT2-CPCG0196-F1.log

Note that BWA-MEM2 and HISAT2 give slightly different error messages. Both say the following:

Error executing process > 'align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK'
Caused by:
Process align_DNA_HISAT2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (3)

But only HISAT2 reports the following (several times) for run_MarkDuplicatesSpark_GATK:

No space left on device

@tyamaguchi-ucla

It looks like both logs indicate that 2TB of scratch wasn't enough, and we know MarkDuplicatesSpark generates quite a few intermediate files. I don't think we can do much unless we run MarkDuplicatesSpark at the library level, remove the intermediate files, and then merge with samtools merge.
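The library-level approach could look roughly like this. This is only a sketch: the per-library BAM names and the scratch layout are assumptions, not pipeline code, and splitting by read group would still have to happen upstream.

```shell
# Sketch: run MarkDuplicatesSpark once per library, freeing scratch
# between runs, then merge the deduplicated BAMs.
# lib1.bam / lib2.bam are hypothetical per-library inputs.
for lib in lib1 lib2; do
    gatk MarkDuplicatesSpark \
        --input "${lib}.bam" \
        --output "${lib}.dedup.bam" \
        --tmp-dir "/scratch/${lib}"
    rm -rf "/scratch/${lib}"   # reclaim scratch before the next library
done

# Merge the per-library results into a single BAM.
samtools merge -@ 4 merged.dedup.bam lib1.dedup.bam lib2.dedup.bam
```

This ordering is valid because PCR and optical duplicates only arise within a library, so marking duplicates per library and merging afterwards gives the same duplicate flags, while the peak scratch footprint is bounded by the largest single library rather than the whole sample.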

@tyamaguchi-ucla

tyamaguchi-ucla commented Jul 29, 2022

Also, it looks like there have been no major updates to MarkDuplicatesSpark since 4.2.4.1 (our current version; the latest is 4.2.6.1).

@yashpatel6

Yeah, barring special nodes with expanded disk space or moving MarkDuplicatesSpark to run per library one by one, this will be a hard problem to fix.

@tyamaguchi-ucla

I think we want to implement #234 in the long run, but we could also try the -Dsamjdk.compression_level option, although I couldn't find the default compression level documented for MarkDuplicatesSpark.

--java-options -Dsamjdk.compression_level=X

https://gatk.broadinstitute.org/hc/en-us/community/posts/360061711971-How-to-set-a-COMPRESSION-LEVEL-of-ApplyBQSR
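Since BAM output is BGZF (Deflate) compressed, the trade-off this flag controls can be illustrated with plain gzip, which uses the same algorithm. The numbers are illustrative only, not BAM measurements: lower levels write faster but produce larger files, i.e. more scratch usage.

```shell
# Compare Deflate compression levels on compressible data.
tmp=$(mktemp)
yes ACGT | head -c 1000000 > "$tmp"
for level in 1 5 9; do
    printf 'level %s: %s bytes\n' "$level" "$(gzip -"$level" -c "$tmp" | wc -c)"
done
rm -f "$tmp"
```

So raising the level trades CPU time during compression for smaller intermediate and output files, which is why a higher samjdk level is worth testing against the "No space left on device" failures.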

@jarbet

jarbet commented Sep 19, 2022

Currently testing with --java-options -Dsamjdk.compression_level=6

@nkwang24

nkwang24 commented May 8, 2023

@jarbet Did changing the compression level help? I'm running into the same issue with a subset of CPCG. It looks like samples with total FASTQ size above ~400 GB fail with the current Spark configuration, and the FASTQ size distribution of CPCG overlaps this limit, with roughly one third of the cohort being too large.

I was trying to monitor scratch usage, but the intermediate files generated by Spark are assigned to nfsnobody with no read access, so I can't query directory size.

nfsnobody is a user account that is used by NFS (Network File System) when it cannot map a remote user to a local user. This can happen when the remote user does not exist on the local system or when the local system cannot authenticate the remote user. When this happens, NFS uses the nfsnobody account instead of the remote user’s account.

Not sure if there's a way to properly map the users so this doesn't happen, but this is probably low priority.
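One workaround for the permission problem: df reports usage at the filesystem level, so it does not need read access to the nfsnobody-owned files the way du does. Scratch consumption can still be watched during a run (the /scratch mount point is an assumption about the node layout):

```shell
# Filesystem-level usage; works even when du can't read the files.
df -h /scratch
# Poll once a minute while MarkDuplicatesSpark runs:
# watch -n 60 df -h /scratch
```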

@nkwang24

nkwang24 commented May 8, 2023

@tyamaguchi-ucla mentioned that it could potentially be possible to have Spark parallelize less, theoretically reducing data copying and scratch usage. The parameters for this are located in the F72.config and not template.config or default.config. @yashpatel6 would reducing the number of cpus allowed for the run_MarkDuplicatesSpark_GATK process reduce scratch usage?
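If lowering parallelism is worth testing, a process-level override in the style of the pipeline's node configs might look like the sketch below. This is an assumption, not pipeline code: the selector name would need to match what F72.config actually uses, the CPU count is hypothetical, and whether fewer Spark workers actually reduces scratch usage would have to be verified empirically.

```groovy
// Hypothetical override: cap cores for the mark-duplicates process only.
process {
    withName: 'run_MarkDuplicatesSpark_GATK' {
        cpus = 36  // hypothetical: half of an F72 node's 72 cores
    }
}
```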

If not, it looks like ~1/3 of CPCG will need to be run without Spark or dependent upon upgrades to our F72 scratch size.

@nkwang24

nkwang24 commented May 8, 2023

If we can conclude that the only way around this is by increasing scratch size, I can write up a cost-benefit analysis of upgrading scratch vs. having to run larger samples with Picard and send it to Paul.

It looks like the current metapipeline bottlenecks are scratch space during align-DNA mark duplicates Spark and call-gSNP recalibrate/reheader steps so it might be necessary to expand scratch regardless unless we can make optimizations at both of these steps.
