Input csv fields: redundancy ok? #143

alkaZeltser · 2021-10-22T23:30:11Z

I'm constructing a csv input file for a sample (from an ILLUMINA sequencer) and am a little confused about some of the fields.

Referencing gatk, it seems that there is a lot of redundancy in the input fields. For example, both the read_group_identifier (ID) and platform_unit (PU) fields are constructed from the flowcell ID and lane number (for ILLUMINA reads). The lane number is then provided separately as another field, to be concatenated with the ID field. So should the ID field really just be the flowcell in my case?

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

For example, an input csv for a sample with the following fastq filename:
FD00123067_S14_L001_R1_001.fastq.gz

And the following fastq header:
@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG

parsed using the following ILLUMINA header format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Would look like this:

index	read_group_identifier	sequencing_center	library_identifier	platform_technology	platform_unit	sample	lane	read1_fastq	read2_fastq
1	HKTWMDRXY	UNGC	FD00123067	ILLUMINA	HKTWMDRXY.1	FD00123067(original) or BZPRGPT1000001-N001-B01-F(internal ID)	1	/path/to/fastq/pair/r1.fastq.gz	/path/to/fastq/pair/r2.fastq.gz
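For reference, here is a minimal Python sketch of how the flowcell ID and lane could be pulled out of a header in the format quoted above. The names `HEADER_RE`, `IlluminaHeader` and `parse_illumina_header` are made up for illustration and are not part of the pipeline; the pattern only covers the header layout shown in this post.

```python
import re
from typing import NamedTuple

# Pattern for the header layout quoted in this post:
# @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:...
# Other header styles (e.g. older Illumina formats) are not handled.
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run_number>\d+):(?P<flowcell_id>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x_pos>\d+):(?P<y_pos>\d+)"
    r"(?:\s+\S+)?$"
)


class IlluminaHeader(NamedTuple):
    """Subset of header fields that matter for the read group columns."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Parse the first line of a FASTQ record (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized FASTQ header: {line!r}")
    return IlluminaHeader(
        instrument=match["instrument"],
        run_number=int(match["run_number"]),
        flowcell_id=match["flowcell_id"],
        lane=int(match["lane"]),
    )


header = parse_illumina_header(
    "@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG"
)
# platform_unit as used in the row above: flowcell ID + '.' + lane
platform_unit = f"{header.flowcell_id}.{header.lane}"
print(platform_unit)  # HKTWMDRXY.1
```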


zhuchcn · 2021-10-24T23:18:17Z

I'm not an expert on FASTQ headers, but it seems the read_group_identifier and platform_unit don't have to be the same. Here is an example of the input CSV file created from a CPTAC BAM. It looks like their read_group_identifier is just the first 4 letters of the platform_unit, plus some extra characters (I don't know where they come from) to resolve conflicts. The platform_unit doesn't have to be unique, and I guess GATK uses it internally. @tyamaguchi-ucla can probably give more informed comments.

> Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

tyamaguchi-ucla · 2021-10-25T19:52:35Z

Hi guys, @zhuchcn @alkaZeltser

As discussed in the last NF WG (https://confluence.mednet.ucla.edu/display/BOUTROSLAB/2021-10-20+Nextflow+Working+Group+Meeting+Notes), I suggest that we use something like

read_group_identifier -> library_identifier + '.' + lane number (the lane can be retrieved from the FASTQ). Must be unique; required for BQSR.
sequencing_center -> This one is almost impossible to automate. I guess the samples were sequenced at UNGC?
library_identifier -> Usually in the file name (required for MarkDuplicates); hard to automate.
platform_unit -> flowcell_id (see FASTQ read IDs) + '.' + lane number (it can be retrieved from the FASTQ)

So, here we probably want to update the read_group_identifier although the current CSV would work perfectly fine.
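As a rough sketch of the convention above, assuming the library identifier is supplied by hand (since it is hard to automate) and the flowcell ID and lane were already read from the FASTQ header. The function `build_read_group_fields` is a hypothetical helper, not something that exists in the pipeline.

```python
def build_read_group_fields(library_identifier: str, flowcell_id: str, lane: int) -> dict:
    """Assemble the read-group-related CSV columns (hypothetical helper).

    read_group_identifier = library_identifier + '.' + lane
    platform_unit         = flowcell_id + '.' + lane
    """
    return {
        "read_group_identifier": f"{library_identifier}.{lane}",
        "library_identifier": library_identifier,
        "platform_unit": f"{flowcell_id}.{lane}",
        "lane": lane,
    }


# Values taken from the example in the original post.
fields = build_read_group_fields("FD00123067", "HKTWMDRXY", 1)
print(fields["read_group_identifier"])  # FD00123067.1
print(fields["platform_unit"])          # HKTWMDRXY.1
```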

> Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?
>
> I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

Yeah, I was thinking about this as well. I think it would be nice to have both external/internal ID in the BAM header so I'm thinking of using

Sample -> Internal ID, and we could include the External ID in library_identifier, which will be passed to the RG identifier.

For lanes, I think we may want to standardize the field and use L00 + lane number for readability instead of a bare integer.
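A small sketch of what that could look like. One assumption here: the lane is zero-padded to three digits, matching the `L001` style already seen in the FASTQ file names, rather than literally prepending `L00` (which would give `L0010` for lane 10). `format_lane` is a hypothetical name.

```python
def format_lane(lane: int) -> str:
    """Format an integer lane as L001, L002, ... (assumed three-digit padding)."""
    if lane < 1:
        raise ValueError(f"Lane must be a positive integer, got {lane}")
    return f"L{lane:03d}"


print(format_lane(1))  # L001
```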

Some references

https://samtools.github.io/hts-specs/SAMv1.pdf
https://en.wikipedia.org/wiki/FASTQ_format
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
Sentieon® recommendations

https://support.sentieon.com/appnotes/read_groups/

graceooh · 2022-05-23T05:53:46Z

@tyamaguchi-ucla Hi, I have a problem with the read_group_identifier convention.
It says:
read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)

but there is a case where I don't have unique read_group_identifiers if I use this convention.

Example:
less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.aee418f085464ff89488437ce340b52a/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:341:HMNVTDRXY:1:2101:1271:1000 1:N:0:ACAGGTAT+ATGGTGGC

less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.018ae688d10d49f7bdca5bb4932df2ab/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:337:H5VJWDSX3:1:1101:3134:1000 1:N:0:ACAGGTAT+ATGGTGGC

Using this library_identifier + '.' + lane # convention, I will end up with two non-unique read_group_identifiers:
BE-1-Blood.1 (in both cases the library name is BE-1-Blood and the lane is lane 1).
Could you please advise?
Thank you!

tyamaguchi-ucla · 2022-05-23T15:54:18Z

@graceooh isn't it the same case we saw in the PRESTO dataset (the same library sequenced twice using the same lane)? Maybe you can check with Sarah and see how the samples were processed?

graceooh · 2022-05-23T17:05:16Z

Yes that's right! OK I'll check and add -01 like we did for PRESTO then. Thanks!
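For anyone hitting the same case, here is a rough sketch of checking a list of read_group_identifiers for collisions before building the CSV. The numbering scheme in `add_run_suffix` (appending -01, -02, ... to every repeated ID) is only an assumption for illustration; the thread just says "-01" was added for PRESTO, so the exact convention used there may differ. Both function names are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(read_group_ids):
    """Return the read_group_identifiers that occur more than once."""
    counts = Counter(read_group_ids)
    return sorted(rg for rg, n in counts.items() if n > 1)


def add_run_suffix(read_group_ids):
    """Append -01, -02, ... to repeated IDs only (assumed scheme, see above)."""
    duplicates = set(find_duplicate_ids(read_group_ids))
    seen = Counter()
    result = []
    for rg in read_group_ids:
        if rg in duplicates:
            seen[rg] += 1
            result.append(f"{rg}-{seen[rg]:02d}")
        else:
            result.append(rg)
    return result


# Same library (BE-1-Blood) sequenced on lane 1 of two different flowcells.
ids = ["BE-1-Blood.1", "BE-1-Blood.1"]
print(find_duplicate_ids(ids))  # ['BE-1-Blood.1']
print(add_run_suffix(ids))      # ['BE-1-Blood.1-01', 'BE-1-Blood.1-02']
```

Note that the platform_unit (flowcell ID + lane) already differs between the two runs, so only the read_group_identifier needs disambiguating.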

tyamaguchi-ucla self-assigned this Jan 21, 2022

tyamaguchi-ucla added the question Further information is requested label Jan 21, 2022

tyamaguchi-ucla pinned this issue Jan 21, 2022

graceooh mentioned this issue May 24, 2022

Multiple input bams with the same name in MarkDuplicates Spark causes crash #192

Closed
