Input csv fields: redundancy ok? #143

Open
alkaZeltser opened this issue Oct 22, 2021 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@alkaZeltser
Contributor

I'm constructing a csv input file for a sample (from an ILLUMINA sequencer) and am a little confused about some of the fields.

Referencing gatk, it seems that there is a lot of redundancy in the input fields. For example both the read_group_identifier (ID) and platform_unit (PU) fields are constructed using the flowcell ID and lane number (for ILLUMINA reads). Then the lane number is provided separately as another field, to be concatenated with the ID field. Therefore the ID field should really just be the flowcell in my case?

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

For example, an input csv for a sample with the following fastq filename:
FD00123067_S14_L001_R1_001.fastq.gz

And the following fastq header :
@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG

parsed using the following ILLUMINA header format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Would look like this:

| index | read_group_identifier | sequencing_center | library_identifier | platform_technology | platform_unit | sample | lane | read1_fastq | read2_fastq |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | HKTWMDRXY | UNGC | FD00123067 | ILLUMINA | HKTWMDRXY.1 | FD00123067 (original) or BZPRGPT1000001-N001-B01-F (internal ID) | 1 | /path/to/fastq/pair/r1.fastq.gz | /path/to/fastq/pair/r2.fastq.gz |
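
For reference, here is a minimal Python sketch of how the flowcell ID and lane could be pulled out of the header above. It assumes the seven-field read ID layout shown earlier; `IlluminaHeader` and `parse_illumina_header` are hypothetical names for illustration, not part of the pipeline.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Fields of the read ID, i.e. the part of the header before the space."""
    instrument: str
    run_number: str
    flowcell_id: str
    lane: int
    tile: str
    x_pos: str
    y_pos: str


def parse_illumina_header(header: str) -> IlluminaHeader:
    """Parse an Illumina FASTQ header line (hypothetical helper)."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    # Drop the leading '@' and everything after the first space.
    read_id = header[1:].split(" ", 1)[0]
    fields = read_id.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run_number, flowcell_id, lane, tile, x_pos, y_pos = fields
    return IlluminaHeader(
        instrument, run_number, flowcell_id, int(lane), tile, x_pos, y_pos
    )


header = "@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG"
parsed = parse_illumina_header(header)
platform_unit = f"{parsed.flowcell_id}.{parsed.lane}"
print(platform_unit)  # HKTWMDRXY.1
```
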
@zhuchcn
Member

zhuchcn commented Oct 24, 2021

I'm not an expert on FASTQ headers, but it seems like the read_group_identifier and platform_unit don't have to be the same. Here is an example of the input CSV file created from a CPTAC BAM. It looks like their read_group_identifier is just the first 4 letters of the platform_unit, with some extra characters (I don't know where they come from) to resolve conflicts. The platform_unit doesn't have to be unique, and I guess GATK is using it internally. @tyamaguchi-ucla can probably give more informed comments.

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

@tyamaguchi-ucla
Contributor

Hi guys, @zhuchcn @alkaZeltser

As discussed in the last NF WG (https://confluence.mednet.ucla.edu/display/BOUTROSLAB/2021-10-20+Nextflow+Working+Group+Meeting+Notes ), I suggest that we use something like

  • read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)
  • sequencing_center -> This one is almost impossible to automate. I guess the samples were sequenced at UNGC?
  • library_identifier -> Usually in the file name (required for MarkDuplicates) - hard to automate.
  • platform_unit -> flowcell_id (see FASTQ read IDs) + '.' + lane# (it can be retrieved from FASTQ)

So, here we probably want to update the read_group_identifier although the current CSV would work perfectly fine.
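
A minimal sketch of the convention above, assuming the library identifier, flowcell ID and lane are already known (for example from the file name and the FASTQ read IDs). `build_read_group_fields` is a hypothetical helper, not an existing pipeline function, and the values come from the example earlier in this issue.

```python
def build_read_group_fields(library_identifier: str, flowcell_id: str, lane: int) -> dict:
    """Build the two derived CSV fields following the suggested convention."""
    return {
        # library_identifier + '.' + lane number (must be unique)
        "read_group_identifier": f"{library_identifier}.{lane}",
        # flowcell_id + '.' + lane number
        "platform_unit": f"{flowcell_id}.{lane}",
    }


fields = build_read_group_fields("FD00123067", "HKTWMDRXY", 1)
print(fields["read_group_identifier"])  # FD00123067.1
print(fields["platform_unit"])          # HKTWMDRXY.1
```
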

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

Yeah, I was thinking about this as well. I think it would be nice to have both external/internal ID in the BAM header so I'm thinking of using

sample -> internal ID, and we could include the external ID in library_identifier, which will be passed to the RG identifier.

For lanes, I think we may want to standardize the field and use L00 + lane number for readability instead of a bare integer.
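
Something like the sketch below could do the formatting. It assumes zero-padding to three digits, as in Illumina FASTQ file names (e.g. `L001`), which matches "L00 + lane number" for single-digit lanes; `format_lane` is a hypothetical helper name.

```python
def format_lane(lane: int) -> str:
    """Format an integer lane as 'L' plus a three-digit, zero-padded number."""
    if lane < 1:
        raise ValueError(f"lane must be a positive integer, got {lane}")
    # Assumption: three-digit padding, mirroring the L001 part of FASTQ file names.
    return f"L{lane:03d}"


print(format_lane(1))  # L001
```
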

  • Some references

https://samtools.github.io/hts-specs/SAMv1.pdf
https://en.wikipedia.org/wiki/FASTQ_format
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
Sentieon® recommendations

https://support.sentieon.com/appnotes/read_groups/

tyamaguchi-ucla self-assigned this Jan 21, 2022
tyamaguchi-ucla added the question (Further information is requested) label Jan 21, 2022
tyamaguchi-ucla pinned this issue Jan 21, 2022
@graceooh
Contributor

@tyamaguchi-ucla Hi, I have a problem with the read_group_identifier convention.
It says:
read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)

but there is a case where I don't have unique read_group_identifiers if I use this convention.

Example:
less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.aee418f085464ff89488437ce340b52a/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:341:HMNVTDRXY:1:2101:1271:1000 1:N:0:ACAGGTAT+ATGGTGGC

less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.018ae688d10d49f7bdca5bb4932df2ab/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:337:H5VJWDSX3:1:1101:3134:1000 1:N:0:ACAGGTAT+ATGGTGGC

Using this library_identifier + '.' + lane # convention, I will end up with two non-unique read_group_identifiers:
BE-1-Blood.1 (in both cases the library name is BE-1-Blood and the lane is lane 1).
Please could you advise?
Thank you!

@tyamaguchi-ucla
Contributor

tyamaguchi-ucla commented May 23, 2022

@graceooh isn't it the same case we saw in the PRESTO dataset (the same library sequenced twice using the same lane)? Maybe you can check with Sarah and see how the samples were processed?

@graceooh
Contributor

Yes that's right! OK I'll check and add -01 like we did for PRESTO then. Thanks!
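
A rough sketch of how the collision could be caught before running the pipeline. The exact PRESTO suffix scheme is not spelled out in this thread, so the `-01`, `-02` numbering applied to every repeated ID here is an assumption; the helper names and the `BE-1-Blood.2` entry are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(read_group_ids):
    """Return the read_group_identifiers that occur more than once."""
    counts = Counter(read_group_ids)
    return sorted(rg for rg, n in counts.items() if n > 1)


def add_run_suffix(read_group_ids):
    """Append -01, -02, ... to repeated IDs (assumed scheme, not confirmed)."""
    duplicated = set(find_duplicate_ids(read_group_ids))
    seen = Counter()
    result = []
    for rg in read_group_ids:
        if rg in duplicated:
            seen[rg] += 1
            result.append(f"{rg}-{seen[rg]:02d}")
        else:
            result.append(rg)
    return result


# Two runs of the same library on lane 1, plus a hypothetical lane 2 entry.
ids = ["BE-1-Blood.1", "BE-1-Blood.1", "BE-1-Blood.2"]
print(find_duplicate_ids(ids))
print(add_run_suffix(ids))
```
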
