Skip to content

Data partitioning

Dave Lawrence edited this page Oct 14, 2020 · 3 revisions

Partitioning is a database scaling technique, that divides tables based on a constraint (Eg all of the genotype calls for a VCF go into a separate partition)

This makes queries much faster (the optimiser jumpst to the partition instead of checking every row in a billion row table) and delete (by dropping the whole partition in a single operation)

Postgres implementation

We use PostgreSQL partitioning via inheritance which means we have to explicitly manage creating and inserting data into partitions.

Declarative Partitioning manages data insertion according to constraints, and was introduced in Postgres 10 but didn't support foreign keys (which we need). Postgres 12 (2019) added support for FKs in declarative partitions so we should investigate moving to them.

Data insertion

As we have to explicitly specify the partition name during data insertion, don't create data via create() or bulk_create() but instead write the data to a CSV and then insert it (explicitly using partition name) via psql command line tool.

Django

The model that the partition is constrained by (not the model in the partition) should subclass RelatedModelsPartitionModel

To get the performance benefits, the SQL query planner / optimiser must know the data is only in a partition (otherwise it looks in the whole table)

The cleanest way to do this is via Django's FilteredRelation which allows you to modify the JOIN ON with another constraint (ie the same one as the partition). When adding filters, you need to refer to the annotation instead of the normal relation to use this partition, otherwise it will add another join to the whole table.

Unfortunately, you can't use a FilteredRelation to span relations (join across multiple tables) and thus it can't be used eg for some analysis queries which join with anything other than Variant (eg EnsemblGeneAnnotation which joins via variantannotation__gene)

In that case, implement RelatedModelsPartitionModel.sql_partition_transformer which does a find/replace on SQL generated on Django QuerySets to explicitly reference the partition instead of the table. Warning: You can only ever join to 1 partition per query - be very careful with joining querysets, eg MergeNode.

Clone this wiki locally