Add genotype filtering Terra workflow configs and documentation #695

Merged
mwalker174 merged 4 commits into main from mw_terra_genotype_filtering
Oct 21, 2024

Conversation

mwalker174
Collaborator

  • Adds JoinRawCalls, SVConcordance, and FilterGenotypes Terra workflow configurations. Note that the workflow numbering in Dockstore and Terra is not yet updated.
  • Updates genotype filtering documentation.

Tested on our "bwa-melt" benchmarking workspace, which had a recent cleaned VCF to work from.

@mwalker174 mwalker174 requested a review from epiercehoffman July 2, 2024 13:53
Member

Would it be helpful to add this documentation to the website, too?

Collaborator

Yes, we'll want everything on the website by the time we release the featured workspace. I think it makes sense to get the README and dashboard updated so we can update the template Terra workspace, then update the website after. Does that sound good?

Collaborator

@epiercehoffman epiercehoffman left a comment

Thanks for writing all of this up. Mostly looks good, though I flagged a few places to fix. In addition, the pipeline diagram should be updated to include these workflows.

* [JoinRawCalls](#join-raw-calls) - Merges unfiltered calls across batches
* [SVConcordance](#svconcordance) - Calculates genotype concordance with raw calls
* [FilterGenotypes](#filter-genotypes) - Performs genotype filtering
* [AnnotateVcf](#annotate-vcf) - Functional and allele frequency annotation
* [Module 09](#module09) - QC and Visualization
Collaborator

Should we delete this?

Collaborator Author

Yeah I will clean out the readme with other unrelated changes when I do a full readme->website transfer later.

README.md Outdated
Computes genotype concordance metrics between all variants in the joint call set and raw calls.

#### Prerequisites:
* [MakeCohortVcf](#make-cohort-vcf)
Collaborator

We should probably update the README with the four subsections of MakeCohortVcf and change this prereq to CleanVcf. At the very least, the MakeCohortVcf section should describe the subworkflows.

Collaborator Author

You're right. I think this is a bit beyond the scope of this PR but I am going to create a new ticket for the website update and will mention this.

Comment on lines 6 to 16
"JoinRawCalls.clustered_depth_vcfs" : "${this.clustered_depth_vcf}",
"JoinRawCalls.clustered_depth_vcf_indexes" : "${this.clustered_depth_vcf_index}",

"JoinRawCalls.clustered_manta_vcfs" : "${this.clustered_manta_vcf}",
"JoinRawCalls.clustered_manta_vcf_indexes" : "${this.clustered_manta_vcf_index}",

"JoinRawCalls.clustered_wham_vcfs" : "${this.clustered_wham_vcf}",
"JoinRawCalls.clustered_wham_vcf_indexes" : "${this.clustered_wham_vcf_index}",

"JoinRawCalls.clustered_melt_vcfs" : "${this.clustered_melt_vcf}",
"JoinRawCalls.clustered_melt_vcf_indexes" : "${this.clustered_melt_vcf_index}",
Collaborator

@epiercehoffman epiercehoffman Sep 13, 2024

Suggested change
- "JoinRawCalls.clustered_depth_vcfs" : "${this.clustered_depth_vcf}",
- "JoinRawCalls.clustered_depth_vcf_indexes" : "${this.clustered_depth_vcf_index}",
- "JoinRawCalls.clustered_manta_vcfs" : "${this.clustered_manta_vcf}",
- "JoinRawCalls.clustered_manta_vcf_indexes" : "${this.clustered_manta_vcf_index}",
- "JoinRawCalls.clustered_wham_vcfs" : "${this.clustered_wham_vcf}",
- "JoinRawCalls.clustered_wham_vcf_indexes" : "${this.clustered_wham_vcf_index}",
- "JoinRawCalls.clustered_melt_vcfs" : "${this.clustered_melt_vcf}",
- "JoinRawCalls.clustered_melt_vcf_indexes" : "${this.clustered_melt_vcf_index}",
+ "JoinRawCalls.clustered_depth_vcfs" : "${this.sample_sets.clustered_depth_vcf}",
+ "JoinRawCalls.clustered_depth_vcf_indexes" : "${this.sample_sets.clustered_depth_vcf_index}",
+ "JoinRawCalls.clustered_manta_vcfs" : "${this.sample_sets.clustered_manta_vcf}",
+ "JoinRawCalls.clustered_manta_vcf_indexes" : "${this.sample_sets.clustered_manta_vcf_index}",
+ "JoinRawCalls.clustered_wham_vcfs" : "${this.sample_sets.clustered_wham_vcf}",
+ "JoinRawCalls.clustered_wham_vcf_indexes" : "${this.sample_sets.clustered_wham_vcf_index}",
+ "JoinRawCalls.clustered_melt_vcfs" : "${this.sample_sets.clustered_melt_vcf}",
+ "JoinRawCalls.clustered_melt_vcf_indexes" : "${this.sample_sets.clustered_melt_vcf_index}",

Can you cross-reference these JSONs with the ones in the existing Terra workspace to avoid issues like this? Or were they all configured for a single batch? In that case AoU Phase 2 could work for checking, though I hope I got them all

Also, we should probably add Scramble VCF inputs and double-check that the workflow runs OK when an empty array is provided for one set of caller VCFs.

Collaborator Author

Good catch. This looks consistent with the AoU workspace and I've changed the default to Scramble even though that won't be default behavior until #722 goes in. I think this has been running with empty Scramble inputs so that shouldn't be a problem.
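
To illustrate why the `sample_sets` hop matters in the suggestion above, here is a minimal Python sketch of how a `${this...}` expression might resolve against a cohort-level row. It is only a simplified model, not Terra's actual resolver: the `cohort` dictionary, the `gs://bucket/...` paths, and the `resolve` helper are all hypothetical.

```python
import re

# Hypothetical, simplified model of a cohort-level (sample_set_set) row that
# owns one sample_set row per batch. Paths are made up for illustration.
cohort = {
    "sample_set_set_id": "cohort_1",
    "sample_sets": [
        {"sample_set_id": "batch_1", "clustered_depth_vcf": "gs://bucket/batch_1.depth.vcf.gz"},
        {"sample_set_id": "batch_2", "clustered_depth_vcf": "gs://bucket/batch_2.depth.vcf.gz"},
    ],
}


def resolve(expression, entity):
    """Resolve a '${this.a.b}'-style expression against a nested dict.

    A list met along the path fans out, so the result becomes a list with one
    value per member. Returns None when an attribute is missing.
    """
    match = re.fullmatch(r"\$\{this\.([\w.]+)\}", expression)
    if match is None:
        raise ValueError(f"not a this-expression: {expression!r}")
    current = entity
    for name in match.group(1).split("."):
        if isinstance(current, list):
            current = [item.get(name) if isinstance(item, dict) else None for item in current]
        elif isinstance(current, dict):
            current = current.get(name)
        else:
            return None
    return current


# Looked up directly on the cohort row, the batch-level attribute is absent.
print(resolve("${this.clustered_depth_vcf}", cohort))
# Going through sample_sets yields one VCF path per batch.
print(resolve("${this.sample_sets.clustered_depth_vcf}", cohort))
```

In this model the first expression finds nothing because the attribute lives on the batch rows, while the second produces an array with one entry per batch, which is the shape the `clustered_*_vcfs` inputs expect.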

@@ -0,0 +1,41 @@
{
"FilterGenotypes.vcf": "${this.concordance_vcf}",
"FilterGenotypes.output_prefix": "${this.sample_set_id}",
Collaborator

Suggested change
- "FilterGenotypes.output_prefix": "${this.sample_set_id}",
+ "FilterGenotypes.output_prefix": "${this.sample_set_set_id}",
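
Mistakes like this one could be caught before review with a small check over the input JSONs. The sketch below is one possible approach, assuming the set of batch-level attribute names is known in advance; the function name `find_batch_level_references` and the example attribute set are hypothetical and not part of the repository.

```python
import re

THIS_EXPR = re.compile(r"\$\{this\.([\w.]+)\}")


def find_batch_level_references(config, batch_attributes):
    """Return input names whose value reads a batch-level attribute directly off `this`.

    In a cohort-level config, such references should instead use the cohort's
    own attribute or go through `this.sample_sets.<attribute>`.
    """
    problems = []
    for input_name, value in config.items():
        match = THIS_EXPR.fullmatch(value) if isinstance(value, str) else None
        if match and match.group(1) in batch_attributes:
            problems.append(input_name)
    return problems


# Assumed batch-level attributes, for illustration only.
batch_attributes = {"sample_set_id", "clustered_depth_vcf", "clustered_depth_vcf_index"}

config = {
    "FilterGenotypes.vcf": "${this.concordance_vcf}",
    "FilterGenotypes.output_prefix": "${this.sample_set_id}",
    "JoinRawCalls.clustered_depth_vcfs": "${this.sample_sets.clustered_depth_vcf}",
}

print(find_batch_level_references(config, batch_attributes))
```

Only exact single-hop references are flagged, so expressions that already go through `sample_sets` pass the check.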

@@ -5,6 +5,7 @@ gatk_docker {{ dockers.gatk_docker }}
gatk_docker_pesr_override {{ dockers.gatk_docker_pesr_override }}
gcnv_gatk_docker {{ dockers.gatk_docker }}
genomes_in_the_cloud_docker {{ dockers.genomes_in_the_cloud_docker }}
gq_recalibrator_docker {{ dockers.gq_recalibrator_docker }}
Collaborator

Can we get this merged with the other GATK docker soon?

Collaborator Author

Unlikely, since this tool is still on a branch.

Typo

Typo 2

Add to overview

Round out readme

Typo 3

Add detail

Minor changes

Fix workspace table

Update inputs

Fix templates
@mwalker174 mwalker174 force-pushed the mw_terra_genotype_filtering branch from 7237b4a to 05b8025 Compare September 30, 2024 15:03
Collaborator

@epiercehoffman epiercehoffman left a comment

Thanks for addressing my comments! This is looking good. We should also get these workflows added to the dockstore.yml and the template workspace - I can also do that in my upcoming ManualReview PR though.

README.md Outdated
@@ -541,7 +541,7 @@ See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Qualit

All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field.

- A more positive SL score indicates higher probability of correctness of the given genotype. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).
+ A more positive SL score indicates higher probability that the give genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).
Collaborator

Suggested change
- A more positive SL score indicates higher probability that the give genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).
+ A more positive SL score indicates higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).
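
As a rough illustration of SL thresholds that depend on SV type and size, the README sentence above could be sketched as follows. This is not the FilterGenotypes implementation: the threshold values, the size bins, and the `passes_sl_filter` helper are all hypothetical placeholders.

```python
import math

# Hypothetical rules: (SV type, min size inclusive, max size exclusive, minimum SL).
# The numbers are placeholders, not values recommended by the pipeline.
EXAMPLE_THRESHOLDS = [
    ("DEL", 0, 500, 10.0),
    ("DEL", 500, math.inf, -5.0),
    ("DUP", 0, math.inf, 0.0),
]


def passes_sl_filter(sl, svtype, svlen, thresholds=EXAMPLE_THRESHOLDS):
    """Return True if a genotype's SL score meets the cutoff for its SV type and size."""
    for rule_type, min_size, max_size, min_sl in thresholds:
        if rule_type == svtype and min_size <= svlen < max_size:
            return sl >= min_sl
    # No rule covers this type and size; this sketch leaves the genotype unfiltered.
    return True


# The same SL score can fail for a small deletion and pass for a large one.
print(passes_sl_filter(5.0, "DEL", 300))
print(passes_sl_filter(5.0, "DEL", 1000))
```

Keeping the rules in a plain table makes it easy to tune cutoffs per type and size bin after reviewing the QC plots.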

@mwalker174 mwalker174 merged commit c844d78 into main Oct 21, 2024
5 checks passed
@mwalker174 mwalker174 deleted the mw_terra_genotype_filtering branch October 21, 2024 15:25
3 participants