
VCF import


VariantGrid reads VCF files and inserts the records into a SQL database. A VCF file can have 2M rows with 100+ samples (400M reference/alt genotypes), and some labs produce thousands of VCFs a year.

The difficult part is doing this at speed while ensuring each variant has a single record, so that you can see every sample that ever contained that variant.

The image below shows an exome import. Colors represent different steps; each line spans a task's start/end time, and a new vertical row is started if there is any overlap (to show parallel processing).

[Image: VCF import task times]

Not everything is perfectly parallel: some stages must finish before others can start.

For instance, you must normalize variants and insert previously unknown variants before you can annotate them or insert genotypes against their Variant IDs. Here are descriptions of some steps:

Create Data from VCF Header

  • Stores header text
  • Creates VCF and sample database objects
  • Finds which FORMAT/INFO fields map to AD/DP/GQ/PL (these differ between some VCFs)
  • Converts VCF filters into single-character codes, and writes the filter descriptions and code mappings to the database
  • Stores the importer version used, e.g.: PythonKnownVariantsImporter (v.4). Git: 7cbb9ae2876168b24924e5b97a785e4925a9d2f0. Uses cyvcf2 (v.0.8.9)
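
As a rough illustration, a filter mapping like the one shown later in "VCFLocusFilter SQL COPY" could be built from the header with cyvcf2. This is a hedged sketch, not VariantGrid's actual code; the regex and the code-assignment scheme are assumptions:

    import re
    import string
    from cyvcf2 import VCF

    FILTER_PATTERN = re.compile(r'##FILTER=<ID=([^,]+),Description="([^"]*)"')

    def build_filter_code_map(vcf_path):
        """Assign a single printable character to each non-PASS FILTER in the header."""
        code_chars = iter(string.ascii_letters + string.digits + string.punctuation)
        mappings = []
        for filter_id, description in FILTER_PATTERN.findall(VCF(vcf_path).raw_header):
            if filter_id != "PASS":  # only non-PASS filters are stored
                mappings.append({"filter_code": next(code_chars),
                                 "filter_id": filter_id,
                                 "description": description})
        return mappings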

Preprocess VCF

5 sub-steps running in a pipeline:

  • vcf_clean_and_filter - handles contigs, cleans the file, and tracks what is discarded
  • vt decompose - converts rows with multiple alts into multiple rows of ref/alt
  • vt normalize - ensures each variant has a standard representation, so variants from multiple files "match up"
  • vt uniq - drops duplicate variants (e.g. after normalisation)
  • GNU split - splits the VCF into 50k-line chunks so they can be processed in parallel
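
Conceptually, the pipeline is similar to the sketch below (the flags are vt's standard ones and are assumptions here; the real vcf_clean_and_filter step and per-chunk header handling are omitted):

    import subprocess

    def preprocess_vcf(vcf_path, reference_fasta, out_prefix, lines_per_chunk=50_000):
        # vt decompose: one alt per row; vt normalize: canonical representation;
        # vt uniq: drop duplicates; split: fixed-size chunks for parallel workers.
        pipeline = (
            f"vt decompose -s {vcf_path}"
            f" | vt normalize -r {reference_fasta} -"
            f" | vt uniq -"
            f" | split -l {lines_per_chunk} - {out_prefix}"
        )
        subprocess.run(pipeline, shell=True, check=True)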

Any variants lost or modified in the above process are tracked with the following database records.

  • VCFSkippedContigs - stores the contig name and number of records skipped by vcf_format_chrom
  • ModifiedImportedVariant - from vt decompose/normalize. Below is a sample record for a variant that was inserted as "19:42398206 A>AGT":

    old_multiallelic: '19:42398206:AGT/AGTGTGT/AGTGT'
    old_variant: '19:42398206:AGT/AGTGT'

There is not yet any way to search or view these via the GUI (but they are being stored).

Create Unknown Loci and Variants

We use Redis to store Variant primary keys (from Postgres), keyed by a variant hash (e.g. "1_24929292_G_C"). This takes a few GB of RAM, but multiple workers can start up without loading any state and query it in parallel to check whether a variant has been seen before.
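
A minimal sketch of the lookup (the key naming is an assumption; VariantGrid's actual keys/encoding may differ):

    import redis

    r = redis.Redis()

    def variant_pks(variant_hashes):
        """Map hashes like '1_24929292_G_C' to Postgres primary keys in one round trip."""
        pks = r.mget(variant_hashes)           # batched: one network call
        return dict(zip(variant_hashes, pks))  # None => variant never seen before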

If a variant/locus is unknown, we write it to a file and launch jobs to insert the unknowns. Unknown-insert jobs run one at a time on the "variant_id_single_worker" queue to avoid race conditions (inserting duplicate records, or Redis and the database getting out of sync). Each job (sketched after this list):

  • Reads the unknown variants CSV and re-checks whether each variant is in Redis (the unknown-variant generation tasks run in parallel, so a variant may have been inserted by another VCF since the file was written)
  • Writes the remaining unknowns to CSV and inserts them into Postgres (each Locus/Variant is assigned an auto-increment integer primary key)
  • Reads the new Loci/Variants back and inserts them into Redis
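
A hedged sketch of that flow (table, column and key names are illustrative, not the actual schema):

    import csv
    import redis
    from psycopg2.extras import execute_values

    def insert_unknown_variants(unknown_csv_path, conn):
        """Single-worker job: race-free because only one runs at a time."""
        r = redis.Redis()
        with open(unknown_csv_path) as f:
            hashes = [row[0] for row in csv.reader(f)]
        # Re-check Redis: a parallel import may have inserted these since the CSV was written.
        still_unknown = [h for h, pk in zip(hashes, r.mget(hashes)) if pk is None]
        if still_unknown:
            with conn.cursor() as cur:
                rows = execute_values(
                    cur,
                    "INSERT INTO variant (variant_hash) VALUES %s RETURNING id, variant_hash",
                    [(h,) for h in still_unknown],
                    fetch=True)
            conn.commit()
            r.mset({variant_hash: pk for pk, variant_hash in rows})  # publish new PKs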

CheckStartAnnotationTask

Waits for the tasks in the 'Unknown Variants' stage to finish.

  • If at least one "Create Unknown Loci and Variants" task was run during this import, run the annotation scheduler, which dumps any unannotated variants to disk and runs the annotation pipeline.

Schedule Parallel VCF Processing tasks

Waits for all 'Unknown Variants' stage tasks to finish, so that we know every variant in the VCF file has been inserted into the database and is in Redis.

Launches a task for each sub-VCF file made by GNU split.

Process VCF File

  • Jobs run in parallel on the sub-VCFs, using cyvcf2
  • Makes batched Redis calls to map variant strings to integer primary keys (see the sketch after this list)
  • Writes out CSVs of CohortVariantZygosityCounts and launches tasks to insert them
  • Writes out VCF filter records
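
A simplified sketch of a worker's read loop (the real task batches its Redis lookups and writes CSVs for COPY; the hash format and file name here are assumptions):

    from cyvcf2 import VCF

    HOM_REF, HET, UNKNOWN, HOM_ALT = 0, 1, 2, 3   # cyvcf2's gt_types encoding

    zygosity_counts = {}                          # variant hash -> [num_het, num_hom]
    for v in VCF("sub_vcf_chunk.vcf"):            # one alt per row after vt decompose
        variant_hash = f"{v.CHROM}_{v.POS}_{v.REF}_{v.ALT[0]}"
        counts = zygosity_counts.setdefault(variant_hash, [0, 0])
        for gt in v.gt_types:                     # one entry per sample
            if gt == HET:
                counts[0] += 1
            elif gt == HOM_ALT:
                counts[1] += 1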

CohortGenotype SQL COPY

Inserts records generated by previous step. These have information from multiple samples packed together in a single database row, so we can perform fast multi-sample zygosity queries.

Gemini uses this technique, but we store our data as Postgres arrays instead of binary blobs so we can do everything in SQL.
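
For example, a multi-sample zygosity query can stay entirely in SQL by indexing into the packed arrays (table/column names below are illustrative, not the actual schema):

    # Find variants where the cohort's 3rd sample is HET and its 7th is HOM ALT,
    # directly against the packed per-cohort arrays (Postgres arrays are 1-based).
    sql = """
    SELECT variant_id
    FROM cohort_genotype
    WHERE cohort_id = %(cohort_id)s
      AND samples_zygosity[3] = 'HET'
      AND samples_zygosity[7] = 'HOM'
    """
    # cursor.execute(sql, {"cohort_id": 42})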

VCFLocusFilter SQL COPY

  • Inserts VCF filters generated by "Process VCF File"
  • As these are one per VCF row, we store them per Locus (not per Variant)
  • They are single-character codes, generated in "Create Data from VCF Header", with mappings that look like:

    {'filter_code': '$',
     'description': 'Truth sensitivity tranche level for SNP model at VQS Lod: -1.4325 <= x < 2.3577',
     'filter_id': 'VQSRTrancheSNP99.00to99.90'}

We only store non-PASS filters. The final records are quite small: an integer (locus ID) and one character per filter. We can filter quickly using Postgres regular expressions.
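
For example (table/column names are assumptions), matching the packed codes against a character class finds loci that failed any of a set of filters:

    import re

    filter_codes = ["$", "&"]                    # codes assigned at header-parse time
    pattern = f"[{re.escape(''.join(filter_codes))}]"
    sql = "SELECT locus_id FROM vcf_locus_filter WHERE filters ~ %s"
    # cursor.execute(sql, [pattern])  matches rows failing either filter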

GenotypeVCFCheckAnnotationTask

Checks whether every variant in the VCF has been annotated. This runs after all of the above steps (the data insertion stage) have finished.

  • Creates a database record UploadedVCFPendingAnnotation for the Uploaded VCF

Runs attempt_schedule_annotation_stage_steps which:

  • Finds the lowest unannotated variant (for the latest annotation version) in the database
  • Compares the lowest unannotated variant ID to the maximum variant ID inserted for the VCF (each "Process VCF File" task updates UploadedVCF with the highest variant ID it used)
  • If all variants are annotated, closes this step; otherwise leaves it open

Every time the annotation scheduler finishes, it looks for unfinished UploadedVCFPendingAnnotation objects, then calls attempt_schedule_annotation_stage_steps().
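
The core comparison reduces to something like this sketch (parameter names are illustrative; the real code queries the database and closes the UploadedVCFPendingAnnotation record):

    def annotation_step_can_close(lowest_unannotated_variant_id, max_inserted_variant_id):
        """True if every variant this VCF inserted has been annotated."""
        if lowest_unannotated_variant_id is None:
            return True   # nothing is unannotated at all
        # All of this VCF's variants have IDs <= max_inserted_variant_id, so if the
        # lowest unannotated ID is above that, this VCF is fully annotated.
        return lowest_unannotated_variant_id > max_inserted_variant_id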

UpdateVariantZygosityCountsTask

This runs after all data is inserted. It keeps track of the total number of het/hom samples that share each variant in the database, updating from the CohortVariantZygosityCount het/hom counts.

This works like reference counting: the CohortVariantZygosityCount het/hom counts are subtracted when the VCF/samples are deleted.
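
In sketch form (names are illustrative), the same per-cohort counts are applied with opposite signs on import and delete:

    def apply_cohort_counts(global_counts, cohort_counts, sign=+1):
        """Add (import) or subtract (delete) a cohort's het/hom counts per variant."""
        for variant_id, (het, hom) in cohort_counts.items():
            totals = global_counts.setdefault(variant_id, [0, 0])
            totals[0] += sign * het
            totals[1] += sign * hom

    # On VCF import:  apply_cohort_counts(totals, cohort, +1)
    # On VCF delete:  apply_cohort_counts(totals, cohort, -1)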

SampleLocusCountsTask

This runs after all data is inserted. Counts how many alts per locus a sample had, e.g.:

Alts per locus   Count
1 (HOM)          1102186
2 (HET)          110271
3                4324
4                953
5                245

This is useful for looking for germline sample contamination.
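
The counting itself is just a two-level histogram, roughly (a sketch, not the actual implementation):

    from collections import Counter

    def sample_locus_counts(alt_locus_ids):
        """alt_locus_ids: one locus ID per alt genotype call in a sample."""
        alts_per_locus = Counter(alt_locus_ids)   # locus -> number of alts
        return Counter(alts_per_locus.values())   # alts-per-locus -> number of loci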

Calculate Sample Stats

Collects information about the variants in a sample. Runs after annotation has completed, as it counts e.g. ClinVar pathogenic variants, genes that have an OMIM phenotype, SnpEff impacts etc., for both all variants and those with FILTER=PASS.

These can be viewed on the Sample and VCF pages, and are also used to look up node counts in an analysis rather than calculating them each time.

ImportGenotypeVCFSuccessTask

Runs after all other steps are completed. Sets VCF and samples to "SUCCESS" and sends a message/email saying the data is ready.
