Potentially serious problem with read group definitions #308

sorelfitzgibbon · 2024-04-25T18:56:04Z

Line 71 in 76c63a1

// the library, sample and lane are used as keys downstream to group into

Library, sample, and lane are not sufficient to identify a unique read group. The same sample from the same library can be sequenced on two different machine runs and have the same lane number in both. In this case the pipeline fails with an error, but the user is likely to work around it by giving the libraries distinct names. Because duplicates would then not be marked across the two runs, this can lead to massive numbers of false positives, depending on the background duplication rate.


sorelfitzgibbon · 2024-04-25T19:02:03Z

Additionally, my metapipeline run with this situation failed only after aligning the normal and 4 of 5 tumor FASTQs, whereas I assume it would be trivial to check for this early in the run.
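A minimal sketch of such an early check, written in Python purely for illustration (the pipeline itself is Nextflow). The record layout and the field names `library`, `sample`, `lane`, and `read1_fastq` are assumptions, not the pipeline's actual input schema. The idea is to group the input entries by the key the pipeline uses downstream and report any key that maps to more than one FASTQ before any alignment starts.

```python
from collections import defaultdict


def find_read_group_collisions(records):
    """Return {(library, sample, lane): [fastq paths]} for keys used more than once.

    `records` is a list of dicts with the hypothetical keys
    'library', 'sample', 'lane' and 'read1_fastq'.
    """
    seen = defaultdict(list)
    for rec in records:
        key = (rec["library"], rec["sample"], rec["lane"])
        seen[key].append(rec["read1_fastq"])
    # Only keys shared by two or more inputs are collisions.
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


if __name__ == "__main__":
    inputs = [
        {"library": "lib1", "sample": "S1", "lane": "1", "read1_fastq": "runA_R1.fq.gz"},
        {"library": "lib1", "sample": "S1", "lane": "1", "read1_fastq": "runB_R1.fq.gz"},
        {"library": "lib1", "sample": "S1", "lane": "2", "read1_fastq": "runA_L2_R1.fq.gz"},
    ]
    for key, paths in find_read_group_collisions(inputs).items():
        print("Collision for", key, "->", paths)
```

Running a check like this during input validation would surface the problem before any compute is spent on alignment.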

yashpatel6 · 2024-04-25T21:18:23Z

This is a good point; with some discussion, we'll want to add something like:

- Try to automatically extract the run identifier from the FASTQ header
- If it fails, print out a warning and tack on a counter to the lane to distinguish the files
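The two steps above could be sketched roughly as follows, in Python for illustration only. The regular expression assumes the Illumina CASAVA 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y ...`); other FASTQ header formats will not match and fall through to the counter. The function names and the label formats (`instrument_run_flowcell.lane` and `lane.counter`) are hypothetical choices, not anything the pipeline currently does.

```python
import re
import warnings
from collections import Counter

# Assumed layout: @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>[:<UMI>] <rest>
CASAVA_18 = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):\d+:\d+:\d+(?::\S+)?(?:\s|$)"
)


def run_id_from_header(header):
    """Return 'instrument_run_flowcell' or None if the header is not recognised."""
    match = CASAVA_18.match(header.strip())
    if match is None:
        return None
    return "_".join(match.group("instrument", "run", "flowcell"))


def lane_labels(entries):
    """Map (lane, first_fastq_header) pairs to labels that distinguish the files.

    Falls back to a per-lane counter, with a warning, when no run
    identifier can be extracted from the header.
    """
    fallback_counts = Counter()
    labels = []
    for lane, header in entries:
        run_id = run_id_from_header(header)
        if run_id is None:
            fallback_counts[lane] += 1
            warnings.warn(
                f"Could not extract run identifier from {header!r}; "
                f"appending counter to lane {lane}"
            )
            labels.append(f"{lane}.{fallback_counts[lane]}")
        else:
            labels.append(f"{run_id}.{lane}")
    return labels
```

Note that the counter only separates files within one invocation; unlike a run identifier taken from the header, it carries no information about which sequencing run a file came from.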

tyamaguchi-ucla · 2024-04-25T21:49:38Z

Yes, this is what dataset registration is trying to mitigate (only internally). There are many different FASTQ formats and it's hard to accommodate everything but it would be useful to add some checks/automation. Also, this pipeline adds .Seq# to the RG to make it unique so it's unlikely to cause serious issues as long as the library information is correctly provided.

yashpatel6 · 2024-04-25T22:08:10Z

Generally, yes. The error align-DNA ends up running into is a filename collision, since the initial alignments are named based on a combination of library and lane.

tyamaguchi-ucla · 2024-04-25T22:17:45Z

I think you meant this line -

pipeline-align-DNA/module/align_DNA_BWA_MEM2.nf

Line 61 in a6769c3

    
           lane_level_bam = generate_standard_filename(params.bwa_version, params.dataset_id, params.sample_id, [additional_information: "${library}-${lane}.bam"])

(also for HISAT2)

Then, yes, this should be updated. Internally registered FASTQs should be fine as they should have unique lane info for each library.
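To illustrate the collision and one possible shape of the update, here is a small Python sketch. It does not reproduce `generate_standard_filename`; it only mimics the `"${library}-${lane}.bam"` suffix from the line quoted above, and the optional `run_id` argument is a hypothetical addition.

```python
def lane_level_suffix(library, lane, run_id=None):
    """Build the lane-level BAM suffix, optionally including a run identifier.

    Without run_id this mirrors the "${library}-${lane}.bam" pattern; with it,
    the run identifier (a hypothetical extra field) is placed between the two.
    """
    parts = [library, lane] if run_id is None else [library, run_id, lane]
    return "-".join(parts) + ".bam"


if __name__ == "__main__":
    # Same library and lane, two different sequencing runs.
    print(lane_level_suffix("lib1", "1"), lane_level_suffix("lib1", "1"))
    print(lane_level_suffix("lib1", "1", "runA"), lane_level_suffix("lib1", "1", "runB"))
```

With only library and lane, both runs map to the same name; adding any field that differs between runs removes the collision.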

sorelfitzgibbon added the high priority label Apr 25, 2024
