Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools #226

Merged
merged 5 commits into from
Jul 30, 2022

Conversation

jarbet
Copy link
Contributor

@jarbet jarbet commented Jul 25, 2022

Description

Add retry method to run_MarkDuplicatesSpark_GATK and run_sort_SAMtools. If pipeline runs out of memory for either process, then it will retry with more memory.

Closes #225

Testing Results

a_mini_n1

Although retry is not needed for a_mini_n1, I want to test to make sure the newly added code doesn't break anything

Testing bash script: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/testing-a_mini_n1.sh

  • BWA-MEM2
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/a_mini_n1.csv
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-a_mini_n1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-a_mini_n1.log
  • HISAT2
    • input csv: same
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/HISAT2-a_mini_n1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/HISAT2-a_mini_n1.log
  • BWA-MEM2 & HISAT2
    • input csv: same
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-HISAT2-a_mini_n1.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-HISAT2-a_mini_n1.log

BE-1-RP3-ffpe_input

  • BWA-MEM2
    • input csv: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/input/csv/BE-1-RP3-ffpe_input.csv
    • config: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.config
    • output: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BWA-MEM2-BE-1-RP3.log
      • /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T175629Z/nextflow-log/report.html
      • 39 processes succeeded, 1 failed (but passed after retry with 2x memory):

      Process align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK terminated with an error exit status (52) -- Execution is retried (1)

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up the branch protection rule following the github standards before opening this pull request, or the branch protection rule has already been set up.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample with aligner setting to BWA-MEM2, HISAT2, and both. The paths to the test config files and output directories were attached in the Testing Results section.

@jarbet
Copy link
Contributor Author

jarbet commented Jul 26, 2022

@yashpatel6: I added the retry module, but it failed when testing BE-1-RP3-ffpe_input (see above for file info). The error messages I got were:

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.log

Unknown method invocation check_limits on ConfigObject type

WARN: There's no process matching config selector: run_MarkDuplicates_Spark -- Did you mean: run_MarkDuplicate_Picard?

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T015724Z/nextflow-log/report.html

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_sort_SAMtools (1)'
Caused by:
No signature of method: groovy.util.ConfigObject.check_limits() is applicable for argument types: (nextflow.util.MemoryUnit, String) values: [15 GB, memory]

So check_limits (used in retry.config) seems to be missing. Do I just need to add this code to my methods.config? Or are there more parts missing from my methods.config in order to get retry.config to work?

@yashpatel6
Copy link
Contributor

@yashpatel6: I added the retry module, but it failed when testing BE-1-RP3-ffpe_input (see above for file info). The error messages I got were:

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/BE-1-RP3.log

Unknown method invocation check_limits on ConfigObject type

WARN: There's no process matching config selector: run_MarkDuplicates_Spark -- Did you mean: run_MarkDuplicate_Picard?

This warning is because the process in align-DNA is called run_MarkDuplicatesSpark_GATK I believe (in base.config)

/hot/software/pipeline/pipeline-align-DNA/Nextflow/development/unreleased/jarbet-mark-dup-spark-mem-retry/align-DNA-8.0.0/BE-1-RP3-ffpe_input/log-align-DNA-8.0.0-20220726T015724Z/nextflow-log/report.html

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_sort_SAMtools (1)'
Caused by:
No signature of method: groovy.util.ConfigObject.check_limits() is applicable for argument types: (nextflow.util.MemoryUnit, String) values: [15 GB, memory]

So check_limits (used in retry.config) seems to be missing. Do I just need to add this code to my methods.config? Or are there more parts missing from my methods.config in order to get retry.config to work?

Right, I didn't realize align-DNA doesn't have that function. You should only need to add that and update the process name in base.config to get it work.

@jarbet jarbet changed the title add retry method for run_MarkDuplicatesSpark_GATK add retry method for run_MarkDuplicatesSpark_GATK and run_sort_SAMtools Jul 29, 2022
@yashpatel6
Copy link
Contributor

Are you re-running the CPCG0196-F1 sample that failed with this branch?

@jarbet
Copy link
Contributor Author

jarbet commented Jul 29, 2022

Are you re-running the CPCG0196-F1 sample that failed with this branch?

Yeah it's been running for 17 hours so it should be done soon. However, @tyamaguchi-ucla is fairly certain the retry won't work with CPCG0196-F1, because the time it failed it gave a general error message: error exit status (3) which would not be caught by retry under the current configuration. Taka thought we should not try to catch (3) with retry since it is a general error message and not necessarily related to memory.

The only reason I didn't kill the job is because it is almost finished and I want to confirm whether Taka is right.

This PR is still ready for review though. I will release 8.1.0 after it is approved

Copy link
Contributor

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. For CPCG0196-F1, I commented here #229 but most likely scratch space issue. @yashpatel6 anything else to add?

@@ -5,7 +5,12 @@
process {
cpus = { methods.check_max( 1 * task.attempt, 'cpus' ) }

errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
commonRetryCodes = [104, 134, 137, 139, 143, 247] // Common out-of-memory error codes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was 247 a GATK specific (Java) or Nextflow exit code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added 247 in align-RNA; it's a Nextflow-recommended out of memory exit code (from the table here). I just opened a PR to add it to the NF template repo.

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to add this to F16.config as well but looks good otherwise!

@jarbet jarbet merged commit de25154 into main Jul 30, 2022
@jarbet jarbet deleted the jarbet-mark-dup-spark-mem-retry branch August 3, 2022 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory error for run_MarkDuplicatesSpark_GATK
3 participants