Variants and Alleles

Dave Lawrence edited this page Oct 14, 2020 · 3 revisions

GenomeBuild and Locus

A GenomeBuild (e.g. GRCh37) has Contigs representing its version of each chromosome (1, 2, X, etc.). Some contigs are shared between multiple builds: for example, HG19 is the same as GRCh37 except for MT and chrY, and GRCh37 and GRCh38 share an MT contig.

(Figure: Contig schema)

Variant

Each Locus (contig/position/ref) and Variant (locus/alt) is unique in the database, so we can link records from different VCFs together.
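
As a rough illustration of why the uniqueness matters, here is a minimal sketch using an in-memory SQLite database as a stand-in for Postgres. The table layout, column names and the `get_or_create_variant` helper are assumptions made for this example, not the project's actual schema (in particular, ref and alt are stored as plain text here rather than as foreign keys to Sequence), and the coordinates are made up.

```python
import sqlite3

# In-memory SQLite used purely as a stand-in; the real schema lives in Postgres.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locus (
    id INTEGER PRIMARY KEY,
    contig TEXT NOT NULL,
    position INTEGER NOT NULL,
    ref TEXT NOT NULL,
    UNIQUE (contig, position, ref)
);
CREATE TABLE variant (
    id INTEGER PRIMARY KEY,
    locus_id INTEGER NOT NULL REFERENCES locus(id),
    alt TEXT NOT NULL,
    UNIQUE (locus_id, alt)
);
""")


def get_or_create_variant(contig, position, ref, alt):
    """Hypothetical helper: return the id of the single row for this variant."""
    # The unique constraints turn a repeated insert into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO locus (contig, position, ref) VALUES (?, ?, ?)",
        (contig, position, ref),
    )
    (locus_id,) = conn.execute(
        "SELECT id FROM locus WHERE contig = ? AND position = ? AND ref = ?",
        (contig, position, ref),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO variant (locus_id, alt) VALUES (?, ?)",
        (locus_id, alt),
    )
    (variant_id,) = conn.execute(
        "SELECT id FROM variant WHERE locus_id = ? AND alt = ?",
        (locus_id, alt),
    ).fetchone()
    return variant_id


# The same change seen in two different VCFs resolves to one Variant id,
# so records from both files can be joined on it.
from_vcf_a = get_or_create_variant("1", 100, "A", "T")
from_vcf_b = get_or_create_variant("1", 100, "A", "T")
```

Because both calls return the same id, anything attached to that Variant (samples, annotations) is shared across the VCFs that contain it.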

(Figure: Variant / Allele schema)

Sequence

Postgres has a size limit of roughly 1-2k on fields used for constraints (i.e. smaller than a big indel), so base sequences (Locus.ref and Variant.alt) use foreign keys to the Sequence table, which has a constraint on its md5sum instead. We can also cache the sequence length there.
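
The idea can be sketched as follows, again with in-memory SQLite standing in for Postgres. The column names (`seq`, `seq_md5_hash`, `length`), the `get_or_create_sequence` helper, and the assumption that the constraint on the hash is a unique constraint are all illustrative guesses rather than the project's actual definitions.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,
    seq TEXT NOT NULL,                  -- full bases, never used in a constraint
    seq_md5_hash TEXT NOT NULL UNIQUE,  -- fixed 32 hex characters, safe to constrain
    length INTEGER NOT NULL             -- cached so the length is cheap to query
)
""")


def get_or_create_sequence(seq):
    """Hypothetical helper: return the id of the row holding this sequence."""
    md5 = hashlib.md5(seq.encode("ascii")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO sequence (seq, seq_md5_hash, length) VALUES (?, ?, ?)",
        (seq, md5, len(seq)),
    )
    (sequence_id,) = conn.execute(
        "SELECT id FROM sequence WHERE seq_md5_hash = ?", (md5,)
    ).fetchone()
    return sequence_id


# A large insertion is deduplicated by its hash, not by the bases themselves.
big_insertion = "ACGT" * 2000
first_id = get_or_create_sequence(big_insertion)
second_id = get_or_create_sequence(big_insertion)
```

Locus.ref and Variant.alt then only need to store a small integer key, so their own unique constraints stay well under any index size limit regardless of how long the underlying sequence is.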

A reference variant has an alt sequence of "=" (Variant.REFERENCE_ALT). A sentinel is faster than checking ref/alt equality across tables, and a unique constraint cannot be enforced on NULL.
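
The NULL problem can be demonstrated with a small sketch. SQLite is used here as a stand-in; like Postgres in its default behaviour, it treats NULLs as distinct from each other inside a unique constraint. The table is simplified for the example (alt is stored as text rather than as a foreign key to Sequence), and `try_insert` is a hypothetical helper.

```python
import sqlite3

REFERENCE_ALT = "="  # mirrors Variant.REFERENCE_ALT from the text above

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE variant (
    id INTEGER PRIMARY KEY,
    locus_id INTEGER NOT NULL,
    alt TEXT,                  -- nullable only so the NULL case can be shown
    UNIQUE (locus_id, alt)
)
""")


def try_insert(locus_id, alt):
    """Hypothetical helper: True if the row was inserted, False if rejected."""
    try:
        conn.execute(
            "INSERT INTO variant (locus_id, alt) VALUES (?, ?)", (locus_id, alt)
        )
        return True
    except sqlite3.IntegrityError:
        return False


# NULL alt: the duplicate is accepted, so uniqueness is silently lost.
null_results = [try_insert(1, None), try_insert(1, None)]

# Sentinel alt: the second insert is rejected by the unique constraint.
sentinel_results = [try_insert(2, REFERENCE_ALT), try_insert(2, REFERENCE_ALT)]

# Reference variants can be found by filtering on one column, with no
# comparison of ref and alt across tables.
reference_count = conn.execute(
    "SELECT COUNT(*) FROM variant WHERE alt = ?", (REFERENCE_ALT,)
).fetchone()[0]
```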

Liftover

An Allele is a change independent of genome build, i.e. the GRCh37 and GRCh38 Variants for the same change point to the same Allele.

An Allele usually has a ClinGenAllele (the ID and JSON response from the ClinGen Allele Registry), but the registry lookup can sometimes fail, so ClinGenAllele needs to be optional.
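
The relationship can be sketched with plain dataclasses. The class and field names below (`variants_by_build`, `link_variant`, `api_response`) are assumptions made for illustration and are not the project's actual models; the variant strings are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ClinGenAllele:
    id: int             # ID from the ClinGen Allele Registry
    api_response: dict  # JSON response from the registry


@dataclass
class Allele:
    # Optional because the registry lookup can fail.
    clingen_allele: Optional[ClinGenAllele] = None
    # One Variant per genome build, all describing the same change.
    variants_by_build: Dict[str, str] = field(default_factory=dict)

    def link_variant(self, genome_build: str, variant: str) -> None:
        self.variants_by_build[genome_build] = variant

    def variant_for_build(self, genome_build: str) -> Optional[str]:
        return self.variants_by_build.get(genome_build)


# An Allele created without a ClinGenAllele still links both builds.
allele = Allele()
allele.link_variant("GRCh37", "placeholder GRCh37 variant")
allele.link_variant("GRCh38", "placeholder GRCh38 variant")
```

Going from a Variant in one build to its counterpart in another is then a matter of looking up the shared Allele and asking it for the other build's Variant.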

See Liftover for details on how variants from different builds are created and linked to the Allele.
