Open a GH issue for the Spark empty BAM case #232

tyamaguchi-ucla · 2022-08-01T19:29:37Z

Create an issue at https://github.com/broadinstitute/gatk with enough details if you haven't ( check if there are similar issues before creating an issue - Yash and I did a quick search but didn't find any).

tyamaguchi-ucla · 2022-12-09T23:48:20Z

#223

madisonjordan · 2022-12-14T21:26:08Z

Issue Draft

Title

MarkDuplicatesSpark error when FASTQ headers contain another @ in string

Content

Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.

Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1

tyamaguchi-ucla · 2022-12-15T04:05:38Z

@madisonjordan Looks good! I think it's worth mentioning that the example was found in one of the ICGC (International Cancer Genome Consortium) samples and there are many others in ICGC datasets.

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

yashpatel6 · 2022-12-15T15:25:25Z

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Agreed, we do have the option to just not mark duplicates but it would be good to have a working option for these types of samples in the pipeline.

tyamaguchi-ucla · 2022-12-15T15:35:52Z

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Agreed, we do have the option to just not mark duplicates but it would be good to have a working option for these types of samples in the pipeline.

Yup, my only concern is that some other GATK commands may have the same issue so we'll have to test it out first.

madisonjordan · 2022-12-15T17:28:26Z

I'll create the issue right now with the suggested change for where the data came from.

jarbet · 2022-12-15T17:29:39Z

@madisonjordan Looks good! I think it's worth mentioning that the example was found in one of the ICGC (International Cancer Genome Consortium) samples and there are many others in ICGC datasets.

As we expect many more samples like this and want to avoid fixing the headers internally, we may want to consider adding samtools markdup in addition to MarkDuplicatesSpark if they have no intention to fix the issue. @yashpatel6 @jarbet

Sounds good. Here is the test example Alfredo made that we can use for testing/development:

/hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

If any of you know of other example fastq files with @ in the header, or other problems with the header, please share.

EDIT: Before adding samtools markup we should confirm whether or not Picard's mark duplicates can handle these headers. Alfredo mentioned it worked when he manually ran Picard's mark duplicates, but that it does not work in the pipeline. Maybe there is a way we can modify the arguments to get it to work. For example, modifying SORTING_COLLECTION_SIZE_RATIO - thank you @madisonjordan for finding this!

madisonjordan · 2022-12-15T17:44:51Z

Logs:
Also wondering if we can post any logs? They said to post them if applicable, but I wanted to make sure what to post in case it contains sensitive information. Also not sure where the full log is for failed jobs, as will all nextflow stuff.
The information might be in the nextflow log but that might have mixed information with other tools if it was ran at the same time or duplicated info just from what I've seen in my nextflow runs.

Here's what we have now:

Bug Report

Affected tool(s) or class(es)

MarkDuplicatesSpark

Affected version(s)

GATK version 4.1.9.0

Description

Describe the problem below. Provide screenshots , stacktrace , logs where appropriate.

Headers with another @ character fail to create a valid bam using MarkDuplicatesSpark. The bam file is empty. But the header will work when using samtools markdup instead. The following example was found in one of many samples we found in ICGC datasets.

Example header:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1

Steps to reproduce

Command:

gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT

Expected behavior

Finish MarkDuplicatesSpark successfully and output a valid bam file.

Actual behavior

The bam file is empty.

tyamaguchi-ucla · 2022-12-16T02:09:48Z

EDIT: Before adding samtools markup we should confirm whether or not Picard's mark duplicates can handle these headers. Alfredo mentioned it worked when he manually ran Picard's mark duplicates, but that it does not work in the pipeline.

@jarbet That's right but the issue of picard is single-threading and it's too slow even if it works.

tyamaguchi-ucla · 2022-12-16T02:17:10Z

It looks like Alfredo provided enough info here #223

GATK Version

Can you check align-DNA 8.0.0? (Ignore 3.X -> IndelRealignment)

Command & Logs

We should be able to get the actual command from the process logs without showing cluster paths if the output dir is still on the cluster. I believe same for the logs.

madisonjordan · 2022-12-16T17:17:38Z

It looks like Alfredo provided enough info here #223

GATK Version

Can you check align-DNA 8.0.0? (Ignore 3.X -> IndelRealignment)

Command & Logs

We should be able to get the actual command from the process logs without showing cluster paths if the output dir is still on the cluster. I believe same for the logs.

Oh I see his comment of the version and the manual command he used that failed too. That should be fine. Thanks!
I'll see if I can find the log. otherwise I'll just post it and see. Oh wait I found the main Nextflow log path he posted.

@tyamaguchi-ucla is the info in this log okay? I had to cut out a lot of content because it was too long but I got the main parts.

00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Read: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
00:05 DEBUG: [kryo] Write: Object[]
...
01:22 DEBUG: [kryo] Read: CompressedMapStatus
01:22 DEBUG: [kryo] Write: CompressedMapStatus
...
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: WrappedArray([])
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: scala.Tuple3[]
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Read: WrappedArray([])
02:25 DEBUG: [kryo] Write: scala.Tuple3[]
...
02:42 DEBUG: [kryo] Write object reference 1941: HLA-A*24:152
02:42 DEBUG: [kryo] Write object reference 1945: chrUn_JTFH01001224v1_decoy
02:42 DEBUG: [kryo] Write object reference 1949: HLA-B*14:01:01
02:42 DEBUG: [kryo] Write object reference 1953: chr5_GL949742v1_alt
...
02:42 DEBUG: [kryo] Write object reference 1942: SAMSequenceRecord(name=HLA-A*24:152,length=3176,dict_index=2919,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1946: SAMSequenceRecord(name=chrUn_JTFH01001224v1_decoy,length=1051,dict_index=2066,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1950: SAMSequenceRecord(name=HLA-B*14:01:01,length=3312,dict_index=2999,assembly=null,alternate_names=[])
02:42 DEBUG: [kryo] Write object reference 1954: SAMSequenceRecord(name=chr5_GL949742v1_alt,length=226852,dict_index=241,assembly=null,alternate_names=[])
...
02:42 DEBUG: [kryo] Write: Array[java.lang.Object]
02:42 DEBUG: [kryo] Write: Object[]
02:42 DEBUG: [kryo] Write: byte[]
02:42 DEBUG: [kryo] Write: WrappedArray([])
...
03:20 DEBUG: [kryo] Read: CompressedMapStatus
03:20 DEBUG: [kryo] Write: Array[java.lang.Object]
03:20 DEBUG: [kryo] Write: Object[]
03:20 DEBUG: [kryo] Write: byte[]
03:20 DEBUG: [kryo] Read: Array[java.lang.Object]
03:20 DEBUG: [kryo] Read: Object[]
03:21 DEBUG: [kryo] Write: TaskCommitMessage
03:21 DEBUG: [kryo] Read: TaskCommitMessage

tyamaguchi-ucla · 2022-12-17T01:13:21Z

@tyamaguchi-ucla is the info in this log okay? I had to cut out a lot of content because it was too long but I got the main parts.

Yup, looks good. It seems that no notification will be sent out (even tagging people) when the editing feature is used.

madisonjordan · 2022-12-19T16:35:18Z

Done. Putting the link to the GATK issue created in our issue 223.

tyamaguchi-ucla assigned Alfredo-Enrique and jarbet Aug 1, 2022

tyamaguchi-ucla assigned madisonjordan and unassigned Alfredo-Enrique Dec 14, 2022

madisonjordan closed this as completed Dec 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Open a GH issue for the Spark empty BAM case #232

Open a GH issue for the Spark empty BAM case #232

tyamaguchi-ucla commented Aug 1, 2022

tyamaguchi-ucla commented Dec 9, 2022

madisonjordan commented Dec 14, 2022 •

edited

Loading

tyamaguchi-ucla commented Dec 15, 2022 •

edited

Loading

yashpatel6 commented Dec 15, 2022

tyamaguchi-ucla commented Dec 15, 2022

madisonjordan commented Dec 15, 2022

jarbet commented Dec 15, 2022 •

edited

Loading

madisonjordan commented Dec 15, 2022 •

edited

Loading

tyamaguchi-ucla commented Dec 16, 2022

tyamaguchi-ucla commented Dec 16, 2022

madisonjordan commented Dec 16, 2022 •

edited

Loading

tyamaguchi-ucla commented Dec 17, 2022

madisonjordan commented Dec 19, 2022

Open a GH issue for the Spark empty BAM case #232

Open a GH issue for the Spark empty BAM case #232

Comments

tyamaguchi-ucla commented Aug 1, 2022

tyamaguchi-ucla commented Dec 9, 2022

madisonjordan commented Dec 14, 2022 • edited Loading

Issue Draft

Title

Content

tyamaguchi-ucla commented Dec 15, 2022 • edited Loading

yashpatel6 commented Dec 15, 2022

tyamaguchi-ucla commented Dec 15, 2022

madisonjordan commented Dec 15, 2022

jarbet commented Dec 15, 2022 • edited Loading

madisonjordan commented Dec 15, 2022 • edited Loading

Bug Report

Affected tool(s) or class(es)

Affected version(s)

Description

Steps to reproduce

Expected behavior

Actual behavior

tyamaguchi-ucla commented Dec 16, 2022

tyamaguchi-ucla commented Dec 16, 2022

madisonjordan commented Dec 16, 2022 • edited Loading

tyamaguchi-ucla commented Dec 17, 2022

madisonjordan commented Dec 19, 2022

madisonjordan commented Dec 14, 2022 •

edited

Loading

tyamaguchi-ucla commented Dec 15, 2022 •

edited

Loading

jarbet commented Dec 15, 2022 •

edited

Loading

madisonjordan commented Dec 15, 2022 •

edited

Loading

madisonjordan commented Dec 16, 2022 •

edited

Loading