Multiple input bams with the same name in MarkDuplicates Spark causes crash #192
Comments
I think using
I see! The case was slightly different because the input.csvs are different for align-RNA and DNA. I'll append it to the lanes. Thank you! 😄
Yes, we're aware of the inconsistency and are trying to resolve the issue.
Yup! I appended the numbers only to the read_group_identifier for align-DNA. I think it will be useful to know that we also need to add them to the lanes column. 😄
Sure, but do you understand why the original run failed, based on how the dedup process works, given that you've been working on align-DNA for a while?
Yes I do! Thanks for checking though! :)
Describe the bug
Multiple input bams with the same name. (MarkDuplicates Spark GATK)
nextflow_script=/hot/users/jieunoh/copy-align-dna/pipeline-align-DNA/main.nf
sample_name=BE-4
pipeline_run_name_blood=$sample_name-Blood
sample_config="/hot/users/jieunoh/gap6_pilot/align-DNA-input/2021-9126_blood/input_config_dir/$sample_name.config"
python3 /hot/users/jieunoh/tools/tool-submit-nf/submit_nextflow_pipeline.py \
    --nextflow_script $nextflow_script \
    --nextflow_config $sample_config \
    --pipeline_run_name $pipeline_run_name_blood \
    --partition_type F72 \
    --email [email protected]
and /hot/users/jieunoh/gap6_outputs/align-dna-outputs/2021-9126_blood/log_files/BE-4-Blood/BE-4-Blood.error
To Reproduce
Steps to reproduce the behavior:
Error executing process > 'align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK'
Caused by:
Process align_DNA_BWA_MEM2_workflow:run_MarkDuplicatesSpark_GATK input file name collision -- There are multiple input files for each of the following file names: BE-4-Blood-L001.sorted.bam
Expected behavior
Some sorted BAMs end up with the same name, which causes a problem when they are passed to MarkDuplicates Spark. We probably need to sort out how we specify information in input.csv (library_identifier). At the moment library_identifier is used as the external ID, but this causes a clash in BAM names when more than one FASTQ has been processed on the same lane.
(#143 "Sample ->Internal ID and we could include External ID in library_identifier, which will be passed to RG identifier.")
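As a rough illustration of the clash described above, here is a minimal Python sketch that checks input.csv rows for colliding sorted BAM names before a run and appends a running number to the lane where needed. It assumes the sorted BAM name is built as library_identifier-lane.sorted.bam (inferred only from the colliding file name in the error) and that the lane column is called lane; the helper names and the FASTQ file names are hypothetical and are not taken from the pipeline.

```python
from collections import Counter, defaultdict


def sorted_bam_name(library, lane):
    # ASSUMPTION: naming scheme inferred from "BE-4-Blood-L001.sorted.bam";
    # the pipeline's actual rule may differ.
    return f"{library}-{lane}.sorted.bam"


def find_collisions(rows):
    """Return sorted BAM names that more than one input row would produce."""
    counts = Counter(
        sorted_bam_name(r["library_identifier"], r["lane"]) for r in rows
    )
    return sorted(name for name, n in counts.items() if n > 1)


def disambiguate_lanes(rows):
    """Append a running number to the lane of rows sharing (library, lane)."""
    totals = Counter((r["library_identifier"], r["lane"]) for r in rows)
    seen = defaultdict(int)
    fixed = []
    for r in rows:
        key = (r["library_identifier"], r["lane"])
        new_row = dict(r)  # copy so the caller's rows are left untouched
        if totals[key] > 1:
            seen[key] += 1
            new_row["lane"] = f"{r['lane']}-{seen[key]}"
        fixed.append(new_row)
    return fixed


# Hypothetical rows: two FASTQs sequenced on the same lane for one library.
rows = [
    {"library_identifier": "BE-4-Blood", "lane": "L001", "read1_fastq": "run1_R1.fastq.gz"},
    {"library_identifier": "BE-4-Blood", "lane": "L001", "read1_fastq": "run2_R1.fastq.gz"},
    {"library_identifier": "BE-4-Blood", "lane": "L002", "read1_fastq": "run3_R1.fastq.gz"},
]

print(find_collisions(rows))
print([r["lane"] for r in disambiguate_lanes(rows)])
```

The same suffix could also be appended to read_group_identifier, as mentioned in the comments, so that read groups and file names stay consistent with each other.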