add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

jarbet · 2022-07-25T22:35:07Z

Description

Add retry method to run_MarkDuplicatesSpark_GATK and run_sort_SAMtools. If pipeline runs out of memory for either process, then it will retry with more memory.

Closes #225

Testing Results

a_mini_n1

Although retry is not needed for a_mini_n1, I want to test to make sure the newly added code doesn't break anything

Testing bash script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/testing-a_mini_n1.sh

BWA-MEM2
- input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/a_mini_n1.csv
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-a_mini_n1.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-a_mini_n1.log
HISAT2
- input csv: same
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/HISAT2-a_mini_n1.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/HISAT2-a_mini_n1.log
BWA-MEM2 & HISAT2
- input csv: same
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-HISAT2-a_mini_n1.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-HISAT2-a_mini_n1.log

BE-1-RP3-ffpe_input

BWA-MEM2
- input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/BE-1-RP3-ffpe_input.csv
- config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.config
- output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-BE-1-RP3.log
  - /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T175629Z/nextflow-log/report.html
  - 39 processes succeeded, 1 failed (but passed after retry with 2x memory):
  Process align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (52) -- Execution is retried (1)

Checklist

I have read the code review guidelines and the code review best practice on GitHub check-list.
I have reviewed the Nextflow pipeline standards.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up the branch protection rule following the github standards before opening this pull request, or the branch protection rule has already been set up.
I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)
I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
I have tested the pipeline on at least one A-mini sample with aligner setting to BWA-MEM2, HISAT2, and both. The paths to the test config files and output directories were attached in the Testing Results section.

jarbet · 2022-07-26T16:21:05Z

@yashpatel6: I added the retry module, but it failed when testing BE-1-RP3-ffpe_input (see above for file info). The error messages I got were:

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.log

Unknown method invocation check_limits on ConfigObject type

WARN: There's no process matching config selector: run_MarkDuplicates_Spark -- Did you mean: run_MarkDuplicate_Picard?

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T015724Z/nextflow-log/report.html

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_sort_SAMtools (1)'
Caused by:
No signature of method: groovy.util.ConfigObject.check_limits() is applicable for argument types: (nextflow.util.MemoryUnit, String) values: [15 GB, memory]

So check_limits (used in retry.config) seems to be missing. Do I just need to add this code to my methods.config? Or are there more parts missing from my methods.config in order to get retry.config to work?

yashpatel6 · 2022-07-26T16:31:21Z

@yashpatel6: I added the retry module, but it failed when testing BE-1-RP3-ffpe_input (see above for file info). The error messages I got were:

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.log

Unknown method invocation check_limits on ConfigObject type

WARN: There's no process matching config selector: run_MarkDuplicates_Spark -- Did you mean: run_MarkDuplicate_Picard?

This warning is because the process in align-DNA is called run_MarkDuplicatesSpark_GATK I believe (in base.config)

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T015724Z/nextflow-log/report.html

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_sort_SAMtools (1)'
Caused by:
No signature of method: groovy.util.ConfigObject.check_limits() is applicable for argument types: (nextflow.util.MemoryUnit, String) values: [15 GB, memory]

So check_limits (used in retry.config) seems to be missing. Do I just need to add this code to my methods.config? Or are there more parts missing from my methods.config in order to get retry.config to work?

Right, I didn't realize align-DNA doesn't have that function. You should only need to add that and update the process name in base.config to get it work.

yashpatel6 · 2022-07-29T19:56:26Z

Are you re-running the CPCG0196-F1 sample that failed with this branch?

jarbet · 2022-07-29T21:24:02Z

Are you re-running the CPCG0196-F1 sample that failed with this branch?

Yeah it's been running for 17 hours so it should be done soon. However, @tyamaguchi-ucla is fairly certain the retry won't work with CPCG0196-F1, because the time it failed it gave a general error message: error exit status (3) which would not be caught by retry under the current configuration. Taka thought we should not try to catch (3) with retry since it is a general error message and not necessarily related to memory.

The only reason I didn't kill the job is because it is almost finished and I want to confirm whether Taka is right.

This PR is still ready for review though. I will release 8.1.0 after it is approved

tyamaguchi-ucla

Looks good to me. For CPCG0196-F1, I commented here #229 but most likely scratch space issue. @yashpatel6 anything else to add?

tyamaguchi-ucla · 2022-07-29T22:07:10Z

config/base.config

@@ -5,7 +5,12 @@
 process {
    cpus = { methods.check_max( 1 * task.attempt, 'cpus' ) }

-    errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
+    commonRetryCodes = [104, 134, 137, 139, 143, 247] // Common out-of-memory error codes


Was 247 a GATK specific (Java) or Nextflow exit code?

We added 247 in align-RNA; it's a Nextflow-recommended out of memory exit code (from the table here). I just opened a PR to add it to the NF template repo.

yashpatel6

We'll want to add this to F16.config as well but looks good otherwise!

Jaron Arbet added 3 commits July 25, 2022 15:31

add retry method for run_MarkDuplicatesSpark_GATK

6dd954d

add retry submodule to external folder

8680dad

add retry method to run_sort_SAMtools

7752deb

add check_limits to methods.config for retry.config

e489de0

jarbet mentioned this pull request Jul 26, 2022

SAMtools sort memory allocation #199

Closed

jarbet changed the title ~~add retry method for run_MarkDuplicatesSpark_GATK~~ add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools Jul 29, 2022

update changelog

fa481bd

jarbet marked this pull request as ready for review July 29, 2022 17:15

jarbet requested a review from a team as a code owner July 29, 2022 17:15

jarbet requested review from yashpatel6 and tyamaguchi-ucla July 29, 2022 17:15

This was referenced Jul 29, 2022

Error while running run_MarkDuplicatesSpark_GATK (different issue) #228

Closed

run_MarkDuplicatesSpark_GATK error exit status (3) with CPCG0196-F1 #229

Open

tyamaguchi-ucla assigned tyamaguchi-ucla and yashpatel6 Jul 29, 2022

tyamaguchi-ucla approved these changes Jul 29, 2022

View reviewed changes

yashpatel6 approved these changes Jul 29, 2022

View reviewed changes

jarbet merged commit de25154 into main Jul 30, 2022

jarbet deleted the jarbet-mark-dup-spark-mem-retry branch August 3, 2022 22:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

jarbet commented Jul 25, 2022 •

edited

Loading

jarbet commented Jul 26, 2022

yashpatel6 commented Jul 26, 2022

yashpatel6 commented Jul 29, 2022

jarbet commented Jul 29, 2022

tyamaguchi-ucla left a comment

tyamaguchi-ucla Jul 29, 2022

yashpatel6 Jul 29, 2022

yashpatel6 left a comment

add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

Conversation

jarbet commented Jul 25, 2022 • edited Loading

Description

Closes #225

Testing Results

a_mini_n1

BE-1-RP3-ffpe_input

Checklist

jarbet commented Jul 26, 2022

yashpatel6 commented Jul 26, 2022

yashpatel6 commented Jul 29, 2022

jarbet commented Jul 29, 2022

tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

tyamaguchi-ucla Jul 29, 2022

Choose a reason for hiding this comment

yashpatel6 Jul 29, 2022

Choose a reason for hiding this comment

yashpatel6 left a comment

Choose a reason for hiding this comment

jarbet commented Jul 25, 2022 •

edited

Loading