Multiple input bams with the same name in MarkDuplicates Spark causes crash #192

Closed
graceooh opened this issue May 24, 2022 · 6 comments
@graceooh
Contributor

graceooh commented May 24, 2022

Describe the bug
Multiple input BAMs with the same file name are passed to MarkDuplicatesSpark (GATK), which causes a file name collision.

  • Pipeline release version: 8.0.0
  • Cluster you are using: Slurm
  • Node type: F72
  • Submission method: submission script
  • Actual submission script:
    nextflow_script=/hot/users/jieunoh/copy-align-dna/pipeline-align-DNA/main.nf
    sample_name=BE-4
    pipeline_run_name_blood=$sample_name-Blood
    sample_config="/hot/users/jieunoh/gap6_pilot/align-DNA-input/2021-9126_blood/input_config_dir/$sample_name.config"
    python3 /hot/users/jieunoh/tools/tool-submit-nf/submit_nextflow_pipeline.py \
        --nextflow_script $nextflow_script \
        --nextflow_config $sample_config \
        --pipeline_run_name $pipeline_run_name_blood \
        --partition_type F72 \
        --email [email protected]
  • Sbatch or qsub command and logs if applicable: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/BE-4-Blood/align-DNA-8.0.0/BE-4-Blood/log-align-DNA-8.0.0-20220524T072853Z
  • Config files: /hot/users/jieunoh/gap6_pilot/align-DNA-input/2021-9126_blood/input_config_dir/BE-4.config
  • Any logs produced by the pipeline: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.log
    and /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.error

To Reproduce
Steps to reproduce the behavior:

  1. Run submission script attached above.
  2. Scroll down in the log: /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.log
  3. See error:

Error executing process > 'align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK'

Caused by:
Process align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK input file name collision -- There are multiple input files for each of the following file names: BE-4-Blood-L001.sorted.bam
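
Nextflow stages every input file into the task's work directory under its file name, so two inputs with the same name collide even when they come from different directories. The sketch below shows that check in isolation. It is not the pipeline's code: the function name `find_name_collisions` and all directory paths are hypothetical, and the `L002` file is invented for contrast. Only the colliding file name is taken from the error above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Group paths by file name and return only the names shared by several paths."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Hypothetical upstream outputs; the directories are made up for illustration.
sorted_bams = [
    "/work/aa/111/BE-4-Blood-L001.sorted.bam",
    "/work/bb/222/BE-4-Blood-L001.sorted.bam",
    "/work/cc/333/BE-4-Blood-L002.sorted.bam",
]

collisions = find_name_collisions(sorted_bams)
for name, group in collisions.items():
    print(f"{name}: {len(group)} inputs")
```

Running a check like this on the sorted BAM list before the dedup step would flag the clash early, instead of failing when the process is launched.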

Expected behavior
Some sorted BAMs have the same name, which causes a problem when they are passed as input to MarkDuplicatesSpark. We probably need to sort out how we specify information in input.csv (library_identifier). At the moment I am using library_identifier as the external ID, but this causes a clash in BAM names when more than one FASTQ has been processed on the same lane.

(#143 "Sample ->Internal ID and we could include External ID in library_identifier, which will be passed to RG identifier.")

@tyamaguchi-ucla
Contributor

I think using L001-01 and L001-02 for the lane column will solve the issue, as I commented on another issue?
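
A minimal sketch of that suffixing, assuming the sorted BAM name is derived from the library identifier and the lane (suggested by `BE-4-Blood-L001.sorted.bam`, but not confirmed here). The column name `lane`, the helper `make_lanes_unique`, and the sample rows are hypothetical; `library_identifier` is the column named in the issue.

```python
from collections import Counter


def make_lanes_unique(rows, key_fields=("library_identifier", "lane")):
    """Append -01, -02, ... to the lane of every row whose key is shared with another row."""
    totals = Counter(tuple(row[f] for f in key_fields) for row in rows)
    seen = Counter()
    result = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        new_row = dict(row)  # copy so the input rows are left untouched
        if totals[key] > 1:
            seen[key] += 1
            new_row["lane"] = f"{row['lane']}-{seen[key]:02d}"
        result.append(new_row)
    return result


# Hypothetical input.csv rows: two FASTQ sets share lane L001.
rows = [
    {"library_identifier": "BE-4-Blood", "lane": "L001"},
    {"library_identifier": "BE-4-Blood", "lane": "L001"},
    {"library_identifier": "BE-4-Blood", "lane": "L002"},
]

fixed = make_lanes_unique(rows)
print([row["lane"] for row in fixed])
```

Only lanes that actually clash get a suffix, so rows that were already unique keep their original lane value.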

@graceooh
Contributor Author

I see! The case was slightly different because the input.csvs are different for align-RNA and DNA. I'll append it to the lanes. Thank you! 😄

@tyamaguchi-ucla
Contributor

> I see! The case was slightly different because the input.csvs are different for align-RNA and DNA. I'll append it to the lanes. Thank you! 😄

Yes, we're aware of the inconsistency and are trying to resolve the issue.

@graceooh
Contributor Author

Yup! I appended the numbers only for the read_group_identifier for align-DNA. I think it will be useful to know we also need to add it on the lanes column too. 😄

@tyamaguchi-ucla
Contributor

> I appended the numbers only for the read_group_identifier for align-DNA. I think it will be useful to know we also need to add it on the lanes column too. 😄

Sure, but do you understand why the original run failed, based on how the dedup process works? You've been working on align-DNA for a while.

@graceooh
Contributor Author

Yes I do! Thanks for checking though! :)
