Potentially serious problem with read group definitions #308

Open
sorelfitzgibbon opened this issue Apr 25, 2024 · 5 comments
Labels
high priority high priority issues

Comments

@sorelfitzgibbon

// the library, sample and lane are used as keys downstream to group into

Library, sample and lane are not sufficient to identify unique read groups. The same sample from the same library can be sequenced on two different machine runs and have the same lane number in both. In this case the pipeline yields an error, but the user is likely to work around it by giving the libraries distinct names. Because the two runs then appear to come from different libraries, duplicates are not marked across them, which can lead to massive numbers of false positives, depending on the background duplication rate.

sorelfitzgibbon added the high priority label Apr 25, 2024
@sorelfitzgibbon
Author

Additionally, my metapipeline run in this situation failed only after it had aligned the normal and 4 of 5 tumor FASTQs, whereas I assume it would be trivial to check for this early in the run.
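
An early check of the kind suggested here could look roughly like the Python sketch below. It is not the pipeline's code: the `FastqInput` record and its field names are hypothetical stand-ins for whatever the input file provides, and the only assumption is that sample, library and lane are known before alignment starts.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FastqInput(NamedTuple):
    """One input row. Field names are hypothetical, not the pipeline's schema."""
    sample: str
    library: str
    lane: str
    read1: str


def find_key_collisions(
    inputs: Iterable[FastqInput],
) -> Dict[Tuple[str, str, str], List[str]]:
    """Return every (sample, library, lane) key claimed by more than one FASTQ."""
    by_key = defaultdict(list)
    for row in inputs:
        by_key[(row.sample, row.library, row.lane)].append(row.read1)
    # Keep only the keys that are shared by two or more input files.
    return {key: paths for key, paths in by_key.items() if len(paths) > 1}


if __name__ == "__main__":
    rows = [
        FastqInput("T1", "libA", "L001", "run1/T1_L001_R1.fastq.gz"),
        FastqInput("T1", "libA", "L001", "run2/T1_L001_R1.fastq.gz"),
        FastqInput("T1", "libA", "L002", "run1/T1_L002_R1.fastq.gz"),
    ]
    for key, paths in find_key_collisions(rows).items():
        print(f"Ambiguous read group key {key}: {paths}")
```

Running a check like this over all inputs before the first alignment would surface the problem in seconds rather than after most of the alignments have finished.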

@yashpatel6
Contributor

This is a good point; with some discussion, we'll want to add something like:

  • Try to automatically extract the run identifier from the FASTQ header
  • If it fails, print out a warning and tack on a counter to the lane to distinguish the files
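
The two steps above can be sketched in Python as follows. This is an illustration under assumptions, not the planned implementation: it only recognises the Illumina CASAVA 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y ...`), and the names `run_id_from_header`, `first_header` and `LaneLabeller`, as well as the format of the returned label, are hypothetical.

```python
import gzip
import re
import warnings
from collections import Counter
from typing import Optional

# Illumina CASAVA 1.8+ header layout:
# @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>
ILLUMINA_HEADER = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):\d+:\d+:\d+"
)


def run_id_from_header(header: str) -> Optional[str]:
    """Return 'instrument.run.flowcell', or None if the header is not recognised."""
    match = ILLUMINA_HEADER.match(header)
    if match is None:
        return None
    return f"{match['instrument']}.{match['run']}.{match['flowcell']}"


def first_header(path: str) -> str:
    """Read the first line of a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\n")


class LaneLabeller:
    """Build lane labels, falling back to a per-(library, lane) counter."""

    def __init__(self) -> None:
        self._seen = Counter()

    def label(self, library: str, lane: str, header: str) -> str:
        run_id = run_id_from_header(header)
        if run_id is not None:
            return f"{lane}.{run_id}"
        # Fallback: no run identifier found, so number the files instead.
        self._seen[(library, lane)] += 1
        count = self._seen[(library, lane)]
        warnings.warn(
            f"No run identifier in FASTQ header {header!r}; "
            f"using counter {count} for library {library}, lane {lane}",
            stacklevel=2,
        )
        return f"{lane}.{count}"
```

With the counter fallback, two files that share a library and lane still get distinct labels, but the labels then depend on input order, which is one reason to prefer the header-derived identifier whenever it is available.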

@tyamaguchi-ucla
Contributor

Yes, this is what dataset registration is trying to mitigate (only internally). There are many different FASTQ formats and it's hard to accommodate all of them, but it would be useful to add some checks/automation. Also, this pipeline adds .Seq# to the RG to make it unique, so it's unlikely to cause serious issues as long as the library information is correctly provided.

@yashpatel6
Contributor

Generally, yes. The error align-DNA ends up running into is a filename collision, since the initial alignments are named based on a combination of library and lane.

@tyamaguchi-ucla
Contributor

I think you meant this line:

lane_level_bam = generate_standard_filename(params.bwa_version, params.dataset_id, params.sample_id, [additional_information: "${library}-${lane}.bam"])
(also for HISAT2)

Then, yes, this should be updated. Internally registered FASTQs should be fine as they should have unique lane info for each library.
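
To illustrate the collision, the Python sketch below mimics only the `"${library}-${lane}.bam"` suffix from the line quoted above and shows one possible way to extend it. The helper `lane_level_suffix`, the optional `run_id` argument and the position of the run identifier in the name are all assumptions for illustration; this is not how `generate_standard_filename` works and not a decided naming scheme.

```python
from typing import Optional


def lane_level_suffix(library: str, lane: str, run_id: Optional[str] = None) -> str:
    """Build a lane-level BAM suffix, optionally including a run identifier."""
    parts = [library, lane] if run_id is None else [library, run_id, lane]
    return "-".join(parts) + ".bam"


if __name__ == "__main__":
    # Same library and lane sequenced on two runs: the current scheme collides.
    print(lane_level_suffix("libA", "L001"))
    print(lane_level_suffix("libA", "L001"))
    # Including a (hypothetical) run identifier keeps the names distinct.
    print(lane_level_suffix("libA", "L001", run_id="run1"))
    print(lane_level_suffix("libA", "L001", run_id="run2"))
```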
