Variants and Alleles
A GenomeBuild (eg GRCh37) has Contigs representing its version of each chromosome (1, 2, X etc). Some contigs are shared between multiple builds, eg HG19 is the same as GRCh37 except for MT and chrY. GRCh37 and GRCh38 share an MT contig.
Each Locus (contig/position/ref) and Variant (locus/alt) is unique in the database, so we can link records from different VCFs together.
Postgres has a 1-2k size limit on fields used for constraints (ie smaller than a big indel), so base sequences (Locus.ref and Variant.alt) use foreign keys to the Sequence table, which has a unique constraint on its md5sum. We can also cache the length here.
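The Sequence indirection can be sketched as below. This is only an illustration using Python's built-in sqlite3 rather than Postgres or the project's actual models; the table and column names (sequence, seq_md5_hash, length) and the helper get_or_create_sequence are assumptions made up for the example.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence (
        id INTEGER PRIMARY KEY,
        seq TEXT NOT NULL,                  -- may be arbitrarily long
        seq_md5_hash TEXT NOT NULL UNIQUE,  -- fixed size, safe to constrain
        length INTEGER NOT NULL             -- cached length of seq
    )
    """
)


def get_or_create_sequence(conn, seq):
    """Return the id of the row holding seq, inserting it if needed (hypothetical helper)."""
    md5 = hashlib.md5(seq.encode()).hexdigest()
    row = conn.execute(
        "SELECT id FROM sequence WHERE seq_md5_hash = ?", (md5,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO sequence (seq, seq_md5_hash, length) VALUES (?, ?, ?)",
        (seq, md5, len(seq)),
    )
    return cur.lastrowid


# A 20kb insertion: too big to constrain directly, but its hash is always 32 chars.
big_insertion = "ACGT" * 5000
first_id = get_or_create_sequence(conn, big_insertion)
second_id = get_or_create_sequence(conn, big_insertion)
```

Both calls return the same id, so Locus.ref and Variant.alt can point at one shared Sequence row no matter how long the bases are.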
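A reference variant has an alt sequence of "=" (Variant.REFERENCE_ALT). We use a sentinel because it is faster than checking ref/alt equality across tables, and because a NULL alt would defeat the unique constraint (by default Postgres treats NULLs as distinct, so duplicate rows would be allowed).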
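The NULL problem is easy to reproduce. The sketch below uses sqlite3 (which, like Postgres by default, treats NULLs as distinct in unique constraints) and a simplified, hypothetical variant table that stores alt as text rather than as a foreign key to Sequence.

```python
import sqlite3

REFERENCE_ALT = "="  # mirrors Variant.REFERENCE_ALT

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant ("
    "locus_id INTEGER NOT NULL, alt TEXT, UNIQUE (locus_id, alt))"
)

# NULL alt: both inserts succeed, so the reference variant gets duplicated.
conn.execute("INSERT INTO variant (locus_id, alt) VALUES (1, NULL)")
conn.execute("INSERT INTO variant (locus_id, alt) VALUES (1, NULL)")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM variant WHERE alt IS NULL"
).fetchone()[0]

# Sentinel alt: the second insert violates the unique constraint.
conn.execute("INSERT INTO variant (locus_id, alt) VALUES (2, ?)", (REFERENCE_ALT,))
try:
    conn.execute("INSERT INTO variant (locus_id, alt) VALUES (2, ?)", (REFERENCE_ALT,))
    sentinel_duplicate_rejected = False
except sqlite3.IntegrityError:
    sentinel_duplicate_rejected = True
```

With NULL the table ends up with two "reference" rows for the same locus; with the sentinel the database rejects the duplicate.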
An Allele is a change independent of genome build, ie the GRCh37 and GRCh38 Variants for the same change point to the same Allele.
An Allele usually has a ClinGenAllele (ID and JSON response from ClinGen Allele Registry), but sometimes that can fail, so we need ClinGenAllele to be optional.
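As a rough picture of these relationships, here is a sketch using plain dataclasses. It is not the project's model code: the field names, the variant_for_build helper and the coordinates are all hypothetical, and the positions are invented placeholders rather than a real liftover result.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (contig, position, ref, alt): simplified stand-in for a Variant
VariantCoordinate = Tuple[str, int, str, str]


@dataclass
class ClinGenAllele:
    clingen_id: str      # ID from the ClinGen Allele Registry
    api_response: dict   # JSON response stored alongside the ID


@dataclass
class Allele:
    # Optional, because the registry lookup can fail.
    clingen_allele: Optional[ClinGenAllele] = None
    # One Variant per genome build, keyed by build name.
    variants: Dict[str, VariantCoordinate] = field(default_factory=dict)

    def variant_for_build(self, build_name: str) -> Optional[VariantCoordinate]:
        """Return this allele's variant in the given build, if one is linked."""
        return self.variants.get(build_name)


# Placeholder coordinates for illustration only.
allele = Allele()
allele.variants["GRCh37"] = ("1", 1000, "A", "T")
allele.variants["GRCh38"] = ("1", 2000, "A", "T")

grch37_variant = allele.variant_for_build("GRCh37")
grch38_variant = allele.variant_for_build("GRCh38")
```

The Allele is usable even though clingen_allele is None here, which is the case the optional link has to cover.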
See Liftover for details on how variants from different builds are created and linked to the Allele.