*[MakeCohortVcf](#make-cohort-vcf): Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup
162
-
*[Module 07](#module07): Downstream filtering, including minGQ, batch effect check, outlier samples removal and final recalibration;
163
-
*[AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets;
164
+
*[JoinRawCalls](#join-raw-calls): Merges unfiltered calls across batches
165
+
*[SVConcordance](#svconcordance): Calculates genotype concordance with raw calls
*[AnnotateVcf](#annotate-vcf): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets
164
168
*[Module 09](#module09): Visualization, including scripts that generate IGV screenshots and rd plots.
165
169
* Additional modules to be added: de novo and mosaic scripts
## <a name="module07">Module 07</a> (in development)
475
-
Apply downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional and users should decide based on the specific purpose of their projects.
478
+
## <a name="join-raw-calls">JoinRawCalls</a>
476
479
477
-
Filtering methods include:
478
-
* minGQ - remove variants based on the genotype quality across populations.
479
-
Note: Trio families are required to build the minGQ filtering model in this step. We provide tables pre-trained with the 1000 genomes samples at different FDR thresholds for projects that lack family structures, and they can be found at the paths below. These tables assume that GQ has a scale of [0,999], so they will not work with newer VCFs where GQ has a scale of [0,99].
480
+
Merges raw unfiltered calls across batches. Concordance between these genotypes and the joint call set is usually indicative of variant quality and is used downstream for genotype filtering.
See the SV "Genotype Filter" section on page 34 of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7) for further details on model training.
541
+
542
+
All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQs on [1, 99]. Note that the rescaled GQs should *not* be interpreted as probabilities. Original genotype qualities are retained in the OGQ field.
543
+
544
+
A more positive SL score indicates higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the [MainVcfQc](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/MainVcfQc.wdl) workflow to review call set quality (see below for recommended practices).
545
+
546
+
This workflow can be run in one of two modes:
547
+
548
+
1. (Recommended) The user explicitly provides a set of SL cutoffs through the `sl_filter_args` parameter, e.g.
Genotypes with SL scores less than the cutoffs are set to no-call (`./.`). The above values were taken directly from Appendix N of the [All of Us Genomic Quality Report C2022Q4R9 CDR v7](https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7). Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in [this script](https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/apply_sl_filter.py) for all available options.
553
+
554
+
2. (Advanced) The user provides truth labels for a subset of non-reference calls, and SL cutoffs are automatically optimized. These truth labels should be provided as a json file in the following format:
555
+
```
556
+
{
557
+
"sample_1": {
558
+
"good_variant_ids": ["variant_1", "variant_3"],
559
+
"bad_variant_ids": ["variant_5", "variant_10"]
560
+
},
561
+
"sample_2": {
562
+
"good_variant_ids": ["variant_2", "variant_13"],
563
+
"bad_variant_ids": ["variant_8", "variant_11"]
564
+
}
565
+
}
566
+
```
567
+
where "good_variant_ids" and "bad_variant_ids" are lists of variant IDs corresponding to non-reference (i.e. het or hom-var) sample genotypes that are true positives and false positives, respectively. SL cutoffs are optimized by maximizing the [F-score](https://en.wikipedia.org/wiki/F-score) with "beta" parameter `fmax_beta`, which modulates the weight given to precision over recall (lower values give higher precision).
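
As a rough illustration of how an F-score with a "beta" parameter trades precision against recall when choosing a cutoff, the sketch below scans candidate SL cutoffs for one group of labeled genotypes. This is not the workflow's implementation: the function names `f_beta` and `best_cutoff` and the example scores are hypothetical, and the sketch assumes a single cutoff rather than cutoffs that depend on SV type and size.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from true positive, false positive, and false negative counts."""
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denom if denom > 0 else 0.0


def best_cutoff(good_scores, bad_scores, beta):
    """Return (cutoff, score) maximizing F-beta.

    good_scores: SL scores of genotypes labeled as true positives.
    bad_scores:  SL scores of genotypes labeled as false positives.
    A genotype is kept when its SL score is >= cutoff (assumption of this sketch).
    """
    best = (None, -1.0)
    for cutoff in sorted(set(good_scores) | set(bad_scores)):
        tp = sum(s >= cutoff for s in good_scores)
        fn = len(good_scores) - tp
        fp = sum(s >= cutoff for s in bad_scores)
        score = f_beta(tp, fp, fn, beta)
        if score > best[1]:
            best = (cutoff, score)
    return best


# Hypothetical SL scores for labeled genotypes.
good = [5, 8, 10, 12]
bad = [-3, 1, 6]
cutoff_balanced, _ = best_cutoff(good, bad, beta=1.0)
cutoff_precise, _ = best_cutoff(good, bad, beta=0.3)
print(cutoff_balanced, cutoff_precise)
```

With these made-up scores, the lower beta selects the higher cutoff, which removes the remaining false positive at the cost of one true positive.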
568
+
569
+
In both modes, the workflow additionally filters variants based on the "no-call rate", the proportion of genotypes that were filtered in a given variant. Variants exceeding the `no_call_rate_cutoff` are assigned a `HIGH_NCR` filter status.
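
The two filtering steps described above can be sketched for a single variant as follows. This is a minimal illustration under stated assumptions, not the logic of `apply_sl_filter.py`: the helper names, the example SL cutoff, and the example `no_call_rate_cutoff` value are hypothetical, the no-call rate is computed over all of the variant's genotypes, and the example uses only non-reference genotypes because the handling of hom-ref genotypes is not covered here.

```python
NO_CALL = "./."


def apply_sl_cutoff(genotypes, sl_scores, sl_cutoff):
    """Set genotypes whose SL score is below the cutoff to no-call."""
    return [gt if sl >= sl_cutoff else NO_CALL
            for gt, sl in zip(genotypes, sl_scores)]


def no_call_rate(genotypes):
    """Proportion of a variant's genotypes that are no-call."""
    if not genotypes:
        return 0.0
    return sum(gt == NO_CALL for gt in genotypes) / len(genotypes)


def filter_status(genotypes, no_call_rate_cutoff):
    """Flag the variant as HIGH_NCR when its no-call rate exceeds the cutoff."""
    if no_call_rate(genotypes) > no_call_rate_cutoff:
        return "HIGH_NCR"
    return "PASS"


# Hypothetical genotypes and SL scores for one variant across four samples.
gts = ["0/1", "0/1", "1/1", "0/1"]
sls = [10, -5, 3, -20]
filtered = apply_sl_cutoff(gts, sls, sl_cutoff=0)
print(filtered, no_call_rate(filtered), filter_status(filtered, 0.05))
```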
570
+
571
+
We recommend users observe the following basic criteria to assess the overall quality of the filtered call set:
485
572
486
-
* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
487
-
* FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
488
-
* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
573
+
* Number of PASS variants (excluding BND) between 7,000 and 11,000.
574
+
* At least 75% of variants in Hardy-Weinberg equilibrium (HWE). Note that this could be lower, depending on how closely the cohort adheres to the assumptions of the Hardy-Weinberg model. However, HWE is expected to at least improve after filtering.
575
+
* Low *de novo* inheritance rate (if applicable), typically 5-10%.
576
+
577
+
These criteria can be assessed from the plots in the `main_vcf_qc_tarball` output, which is generated by default.
578
+
579
+
#### Prerequisites:
580
+
* [SVConcordance](#svconcordance)
581
+
582
+
#### Inputs:
583
+
* VCF with genotype concordance annotations URI ([SVConcordance](#svconcordance))
584
+
* Ploidy table URI ([JoinRawCalls](#join-raw-calls))
585
+
* GQRecalibrator model URI
586
+
* Either a set of SL cutoffs or truth labels
587
+
588
+
#### Outputs:
589
+
* Filtered VCF
590
+
* Call set QC plots (optional)
591
+
* Optimized SL cutoffs with filtering QC plots and data tables (if running mode [2] with truth labels)
592
+
* VCF with only SL annotation and GQ recalibration (before filtering)
489
593
490
-
## <a name="annotate-vcf">AnnotateVcf</a> (in development)
594
+
## <a name="annotate-vcf">AnnotateVcf</a>
491
595
*Formerly Module08Annotation*
492
596
493
597
Add annotations, such as the inferred function and allele frequencies of variants, to final VCF.
17. `17-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
64
+
16. `16-JoinRawCalls`: Combines unfiltered calls (from step 5) across batches
65
+
17. `17-SVConcordance`: Annotates variants with genotype concordance against raw calls
66
+
18. `18-FilterGenotypes`: Performs genotype filtering to improve precision and generates QC plots
67
+
19. `19-AnnotateVcf`: Cohort VCF annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
66
68
67
-
Additional downstream modules, such as those for filtering and visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information.
69
+
Additional downstream modules, such as those for visualization, are under development. They are not included in this workspace at this time, but the source code can be found in the [GATK-SV GitHub repository](https://github.com/broadinstitute/gatk-sv). See **Downstream steps** towards the bottom of this page for more information.
68
70
69
71
Extra workflows (Not part of canonical pipeline, but included for your convenience. May require manual configuration):
70
-
* `PlotSVCountsPerSample: Plot SV counts per sample per SV type
72
+
* `MainVcfQc`: Generates VCF QC reports (is run during 18-FilterGenotypes by default)
73
+
* `PlotSVCountsPerSample`: Plot SV counts per sample per SV type
71
74
* `FilterOutlierSamples`: Filter outlier samples (in terms of SV counts) from a single VCF. Recommended to run `PlotSVCountsPerSample` beforehand (configured with the single VCF you want to filter) to enable IQR cutoff choice.
72
75
73
76
For detailed instructions on running the pipeline in Terra, see **Step-by-step instructions** below.
@@ -202,11 +205,11 @@ Read the full MergeBatchSites documentation [here](https://github.com/broadinsti
202
205
Read the full GenotypeBatch documentation [here](https://github.com/broadinstitute/gatk-sv#genotype-batch).
203
206
* Use the same `sample_set` definitions you used for `03-TrainGCNV` through `08-FilterBatchSamples`.
204
207
205
-
#### 11-RegenotypeCNVs, 12-CombineBatches, 13-ResolveComplexVariants, 14-GenotypeComplexVariants, 15-CleanVcf, 16-MainVcfQc, and 17-AnnotateVcf
Read the full documentation for [RegenotypeCNVs](https://github.com/broadinstitute/gatk-sv#regenotype-cnvs), [MakeCohortVcf](https://github.com/broadinstitute/gatk-sv#make-cohort-vcf) (which includes `CombineBatches`, `ResolveComplexVariants`, `GenotypeComplexVariants`, `CleanVcf`, `MainVcfQc`), and [AnnotateVcf](https://github.com/broadinstitute/gatk-sv#annotate-vcf) on the README.
210
+
Read the full documentation for [RegenotypeCNVs](https://github.com/broadinstitute/gatk-sv#regenotype-cnvs), [MakeCohortVcf](https://github.com/broadinstitute/gatk-sv#make-cohort-vcf) (which includes `CombineBatches`, `ResolveComplexVariants`, `GenotypeComplexVariants`, `CleanVcf`), [`JoinRawCalls`](https://github.com/broadinstitute/gatk-sv#join-raw-calls), [`SVConcordance`](https://github.com/broadinstitute/gatk-sv#svconcordance), [`FilterGenotypes`](https://github.com/broadinstitute/gatk-sv#filter-genotypes), and [AnnotateVcf](https://github.com/broadinstitute/gatk-sv#annotate-vcf) on the README.
208
211
* Use the same cohort `sample_set_set` you created and used for `09-MergeBatchSites`.
209
212
210
213
#### Downstream steps
211
214
212
-
Additional downstream steps are under development. Read about some of them on the README [here](https://github.com/broadinstitute/gatk-sv#module07). Please note that the VCF produced by `15-CleanVcf` (and annotated by `17-AnnotateVcf`) prioritizes sensitivity, but additional downstream filtration is recommended to improve specificity. Filtration methods are under active development by the GATK-SV team; stay tuned for updates.
215
+
Additional downstream steps are under development.