Multiple input bams with the same name in MarkDuplicates Spark causes crash #192

graceooh · 2022-05-24T19:03:26Z

Describe the bug
Multiple input BAMs with the same name are passed to MarkDuplicatesSpark (GATK), which makes the process fail.

Pipeline release version: 8.0.0
Cluster you are using: Slurm
Node type: F72
Submission method: submission script
Actual submission script:
nextflow_script=/hot/users/jieunoh/copy-align-dna/pipeline-align-DNA/main.nf
sample_name=BE-4
pipeline_run_name_blood=$sample_name-Blood
sample_config="/hot/users/jieunoh/gap6_pilot/align-DNA-input/2021-9126_blood/input_config_dir/$sample_name.config"
python3 /hot/users/jieunoh/tools/tool-submit-nf/submit_nextflow_pipeline.py \
    --nextflow_script $nextflow_script \
    --nextflow_config $sample_config \
    --pipeline_run_name $pipeline_run_name_blood \
    --partition_type F72 \
    --email [email protected]
Sbatch or qsub command and logs if applicable: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/BE-4-Blood/align-DNA-8.0.0/BE-4-Blood/log-align-DNA-8.0.0-20220524T072853Z
Config files: /hot/users/jieunoh/gap6_pilot/align-DNA-input/2021-9126_blood/input_config_dir/BE-4.config
Any logs produced by the pipeline: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.log
and /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.error

To Reproduce
Steps to reproduce the behavior:

1. Run the submission script above.
2. Open the log: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.log
3. See the error:

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK'

Caused by:
Process align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK input file name collision -- There are multiple input files for each of the following file names: BE-4-Blood-L001.sorted.bam

Expected behavior
Some sorted BAMs end up with the same name, which causes a collision when they are passed to MarkDuplicatesSpark. We probably need to sort out how information is specified in input.csv (library_identifier). At the moment I am using library_identifier as the external ID, but this causes a clash in BAM names when more than one FASTQ has been processed on the same lane.

(#143 "Sample ->Internal ID and we could include External ID in library_identifier, which will be passed to RG identifier.")
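The clash described above can be checked before submitting a run. The following is a minimal sketch, not the pipeline's own code: it assumes the per-lane BAM name is built as `<library_identifier>-<lane>.sorted.bam` (inferred only from the file name in the error message), and the column names `library_identifier`, `lane`, and `read1_fastq` and the FASTQ file names are hypothetical stand-ins for whatever input.csv actually contains.

```python
import csv
import io
from collections import Counter


def derived_bam_name(row):
    # Assumed naming pattern, inferred from "BE-4-Blood-L001.sorted.bam"
    # in the error message; the real pipeline may build names differently.
    return f"{row['library_identifier']}-{row['lane']}.sorted.bam"


def find_colliding_bam_names(rows):
    """Return the derived BAM names that more than one input row maps to."""
    counts = Counter(derived_bam_name(row) for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical input.csv content: two FASTQs share library and lane.
example_csv = """library_identifier,lane,read1_fastq
BE-4-Blood,L001,run1_R1.fastq.gz
BE-4-Blood,L001,run2_R1.fastq.gz
BE-4-Blood,L002,run1_L002_R1.fastq.gz
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
print(find_colliding_bam_names(rows))
```

Any name this reports would be handed to the dedup step more than once, which is the situation Nextflow rejects as an input file name collision.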

tyamaguchi-ucla · 2022-05-26T00:03:17Z

I think using L001-01 and L001-02 for the lane column will solve the issue, as I commented on another issue.
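As a rough illustration of that suggestion, the sketch below appends `-01`, `-02`, ... to the lane value only where the same library and lane pair appears more than once. It is an assumption-laden example rather than part of the pipeline: the column names `library_identifier` and `lane` are hypothetical, and the rows are plain dictionaries rather than a real input.csv.

```python
from collections import Counter, defaultdict


def suffix_repeated_lanes(rows):
    """Return copies of rows with repeated (library, lane) pairs made unique."""
    totals = Counter((r["library_identifier"], r["lane"]) for r in rows)
    seen = defaultdict(int)
    result = []
    for r in rows:
        key = (r["library_identifier"], r["lane"])
        new_row = dict(r)  # copy so the caller's rows are not modified
        if totals[key] > 1:
            seen[key] += 1
            # e.g. L001 -> L001-01, L001-02
            new_row["lane"] = f"{r['lane']}-{seen[key]:02d}"
        result.append(new_row)
    return result


# Hypothetical rows: the first two share library and lane.
example_rows = [
    {"library_identifier": "BE-4-Blood", "lane": "L001"},
    {"library_identifier": "BE-4-Blood", "lane": "L001"},
    {"library_identifier": "BE-4-Blood", "lane": "L002"},
]

fixed_rows = suffix_repeated_lanes(example_rows)
print([r["lane"] for r in fixed_rows])
```

Lanes that are already unique are left untouched, so only the clashing rows change.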

graceooh · 2022-05-26T18:30:21Z

I see! The case was slightly different because the input.csvs are different for align-RNA and DNA. I'll append it to the lanes. Thank you! 😄

tyamaguchi-ucla · 2022-05-26T18:34:15Z

> I see! The case was slightly different because the input.csvs are different for align-RNA and DNA. I'll append it to the lanes. Thank you! 😄

Yes, we're aware of the inconsistency and are trying to resolve the issue.

graceooh · 2022-05-26T18:43:38Z

Yup! I appended the numbers only for the read_group_identifier for align-DNA. I think it will be useful to know we also need to add it on the lanes column too. 😄

tyamaguchi-ucla · 2022-05-26T19:00:04Z

> I appended the numbers only for the read_group_identifier for align-DNA. I think it will be useful to know we also need to add it on the lanes column too. 😄

Sure, but do you understand why the original run failed, based on how the dedup process works, given that you've been working on align-DNA for a while?

graceooh · 2022-05-26T20:41:35Z

Yes I do! Thanks for checking though! :)

graceooh assigned tyamaguchi-ucla May 25, 2022

graceooh closed this as completed May 26, 2022
