Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open a GH issue for the Spark empty BAM case #232

Closed
tyamaguchi-ucla opened this issue Aug 1, 2022 · 13 comments
Closed

Open a GH issue for the Spark empty BAM case #232

tyamaguchi-ucla opened this issue Aug 1, 2022 · 13 comments
Assignees

Comments

@tyamaguchi-ucla
Copy link
Contributor

  • Create an issue at https://github.com/broadinstitute/gatk with enough details if you haven't ( check if there are similar issues before creating an issue - Yash and I did a quick search but didn't find any).
@tyamaguchi-ucla
Copy link
Contributor Author

#223

@madisonjordan
Copy link

madisonjordan commented Dec 14, 2022

Issue Draft

Title

MarkDuplicatesSpark error when FASTQ headers contain another @ in string

Content

Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.

Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1

@tyamaguchi-ucla
Copy link
Contributor Author

tyamaguchi-ucla commented Dec 15, 2022

@madisonjordan Looks good! I think it's worth mentioning that the example was found in one of the ICGC (International Cancer Genome Consortium) samples and there are many others in ICGC datasets.

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

@yashpatel6
Copy link
Contributor

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Agreed, we do have the option to just not mark duplicates but it would be good to have a working option for these types of samples in the pipeline.

@tyamaguchi-ucla
Copy link
Contributor Author

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Agreed, we do have the option to just not mark duplicates but it would be good to have a working option for these types of samples in the pipeline.

Yup, my only concern is that some other GATK commands may have the same issue so we'll have to test it out first.

@madisonjordan
Copy link

I'll create the issue right now with the suggested change for where the data came from.

@jarbet
Copy link
Contributor

jarbet commented Dec 15, 2022

@madisonjordan Looks good! I think it's worth mentioning that the example was found in one of the ICGC (International Cancer Genome Consortium) samples and there are many others in ICGC datasets.

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Sounds good. Here is the test example Alfredo made that we can use for testing/development:

/hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

If any of you know of other example fastq files with @ in the header, or other problems with the header, please share.

EDIT: Before adding samtools markup we should confirm whether or not Picard's mark duplicates can handle these headers. Alfredo mentioned it worked when he manually ran Picard's mark duplicates, but that it does not work in the pipeline. Maybe there is a way we can modify the arguments to get it to work. For example, modifying SORTING_COLLECTION_SIZE_RATIO - thank you @madisonjordan for finding this!

@madisonjordan
Copy link

madisonjordan commented Dec 15, 2022

Logs:
Also wondering if we can post any logs? They said to post them if applicable, but I wanted to make sure what to post in case it contains sensitive information. Also not sure where the full log is for failed jobs, as will all nextflow stuff.
The information might be in the nextflow log but that might have mixed information with other tools if it was ran at the same time or duplicated info just from what I've seen in my nextflow runs.


Here's what we have now:

Bug Report

Affected tool(s) or class(es)

MarkDuplicatesSpark

Affected version(s)

  • GATK version 4.1.9.0

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate.

Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.

Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1

Steps to reproduce

Command:

gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT

Expected behavior

Finish MarkDuplicatesSpark successfully and output a valid bam file.

Actual behavior

The bam file is empty.

@tyamaguchi-ucla
Copy link
Contributor Author

EDIT: Before adding samtools markup we should confirm whether or not Picard's mark duplicates can handle these headers. Alfredo mentioned it worked when he manually ran Picard's mark duplicates, but that it does not work in the pipeline.

@jarbet That's right but the issue of picard is single-threading and it's too slow even if it works.

@tyamaguchi-ucla
Copy link
Contributor Author

It looks like Alfredo provided enough info here #223

  1. GATK Version

Can you check align-DNA 8.0.0? (Ignore 3.X -> IndelRealignment)

  1. Command & Logs

We should be able to get the actual command from the process logs without showing cluster paths if the output dir is still on the cluster. I believe same for the logs.

@madisonjordan
Copy link

madisonjordan commented Dec 16, 2022

It looks like Alfredo provided enough info here #223

  1. GATK Version

Can you check align-DNA 8.0.0? (Ignore 3.X -> IndelRealignment)

  1. Command & Logs

We should be able to get the actual command from the process logs without showing cluster paths if the output dir is still on the cluster. I believe same for the logs.

Oh I see his comment of the version and the manual command he used that failed too. That should be fine. Thanks!
I'll see if I can find the log. otherwise I'll just post it and see. Oh wait I found the main Nextflow log path he posted.

@tyamaguchi-ucla is the info in this log okay? I had to cut out a lot of content because it was too long but I got the main parts.

00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
...
01:22 DEBUG: [kryo] Read: CompressedMapStatus
01:22 DEBUG: [kryo] Write: CompressedMapStatus
...
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: WrappedArray([])
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: scala.Tuple3[]
...
02:42 DEBUG: [kryo] Write object reference 1941: HLA-A*24:152
02:42 DEBUG: [kryo] Write object reference 1945: chrUn_JTFH01001224v1_decoy
02:42 DEBUG: [kryo] Write object reference 1949: HLA-B*14:01:01
02:42 DEBUG: [kryo] Write object reference 1953: chr5_GL949742v1_alt
...
02:42 DEBUG: [kryo] Write object reference 1942: SAMSequenceRecord(name=HLA-A*24:152,length=3176,dict_index=2919,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1946: SAMSequenceRecord(name=chrUn_JTFH01001224v1_decoy,length=1051,dict_index=2066,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1950: SAMSequenceRecord(name=HLA-B*14:01:01,length=3312,dict_index=2999,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1954: SAMSequenceRecord(name=chr5_GL949742v1_alt,length=226852,dict_index=241,assembly=null,alternate_names=[])
...
02:42 DEBUG: [kryo] Write: Array[java.lang.Object]
02:42 DEBUG: [kryo] Write: Object[]
02:42 DEBUG: [kryo] Write: byte[]
02:42 DEBUG: [kryo] Write: WrappedArray([])
...
03:20 DEBUG: [kryo] Read: CompressedMapStatus
03:20 DEBUG: [kryo] Write: Array[java.lang.Object]
03:20 DEBUG: [kryo] Write: Object[]
03:20 DEBUG: [kryo] Write: byte[]
03:20 DEBUG: [kryo] Read: Array[java.lang.Object]
03:20 DEBUG: [kryo] Read: Object[]
03:21 DEBUG: [kryo] Write: TaskCommitMessage
03:21 DEBUG: [kryo] Read: TaskCommitMessage

@tyamaguchi-ucla
Copy link
Contributor Author

@tyamaguchi-ucla is the info in this log okay? I had to cut out a lot of content because it was too long but I got the main parts.

Yup, looks good. It seems that no notification will be sent out (even tagging people) when the editing feature is used.

@madisonjordan
Copy link

Done. Putting the link to the GATK issue created in our issue 223.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants