Manuscript at: Chen et al., 2023.
Note: Pre-computed genomic constraint scores (called Gnocchi: Genomic Non-Coding Constraint of HaploInsufficient variation) are available in Supplementary Datasets 2 and 3 of the paper and can also be downloaded from the official gnomAD website.
This repository provides:
- The pipeline for computing the constraint scores from gnomAD v3.1.2 whole-genome sequencing data (N=76,156).
- The code for generating the figures presented in the manuscript.
The main components of the computation pipeline are written in Hail 0.2, which scales to large datasets such as gnomAD. The script run_nc_constraint_gnomad_v31_main.py implements QC and the mutational model described in Chen et al., 2023. The model uses the gnomAD genomes and a dataset of every possible variant (~6 billion well-covered variants) to detect depletions of variation (constraint), and the script generates constraint scores at a 1kb scale across the human genome.
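To make "depletion of variation" concrete, here is a minimal sketch in plain Python. It is not the pipeline's Hail code. It assumes the score takes the form of a signed chi-square-style deviation between expected and observed variant counts in a window; the exact definition is the one given in Chen et al., 2023. The function name constraint_z and the window counts are hypothetical.

```python
import math


def constraint_z(observed: float, expected: float) -> float:
    """Signed deviation of observed from expected variant counts.

    ASSUMPTION: Z = (expected - observed) / sqrt(expected), so the score is
    positive when fewer variants are observed than expected (constraint).
    Consult the manuscript for the authoritative definition.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (expected - observed) / math.sqrt(expected)


# Hypothetical 1kb windows, each with (observed, expected) variant counts.
windows = {
    "window_a": (80, 100.0),   # fewer variants than expected
    "window_b": (100, 100.0),  # matches expectation
    "window_c": (121, 100.0),  # more variants than expected
}

scores = {name: constraint_z(obs, exp) for name, (obs, exp) in windows.items()}

for name, z in scores.items():
    print(f"{name}: Z = {z:+.2f}")
```

Under this assumed form, a window with fewer observed variants than expected gets a positive score, and a window with an excess gets a negative one.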
The constraint scores and all intermediate files can be generated by a single run of
python run_nc_constraint_gnomad_v31_main.py \
  -output_bucket "gs://bucket_name/" \
  -output_dir "/path/to/local/dir_name"
which will access the public gnomAD data hosted on Google Cloud and save all output files (including plain text files and Hail tables) to the designated -output_bucket. A copy of key outputs (tables of estimated mutation rates, possible/observed/expected variant counts, and constraint Z scores) will be saved locally to -output_dir.
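If you prefer to launch the run from Python, for example from a wrapper that sets up the environment first, the command can be assembled as an argument list. This is only a sketch: the helper build_pipeline_command is hypothetical and not part of this repository, and it uses only the two flags documented above.

```python
import shlex
import sys


def build_pipeline_command(output_bucket, output_dir,
                           script="run_nc_constraint_gnomad_v31_main.py"):
    """Return the argument list for one pipeline run (hypothetical helper)."""
    # Google Cloud Storage locations are addressed with the gs:// scheme.
    if not output_bucket.startswith("gs://"):
        raise ValueError("output_bucket should be a gs:// path")
    return [
        sys.executable, script,
        "-output_bucket", output_bucket,
        "-output_dir", output_dir,
    ]


cmd = build_pipeline_command("gs://bucket_name/", "/path/to/local/dir_name")

# Show the script and its arguments in a shell-safe form.
print(shlex.join(cmd[1:]))

# To actually start the run (requires Hail and access to Google Cloud):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.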
All figures presented in the manuscript can be reproduced using generate_manuscript_figures.py, which takes a figure number (a single number from 1-5) or “all” (to generate all figures at once). The script will automatically download the necessary input files from the public Google Cloud bucket gs://gnomad-nc-constraint-v31-paper/fig_tables/ to a local directory fig_tables/ if they are not already present. Please save generate_manuscript_figures.py and fig_utils.py in the same directory. For example,
python generate_manuscript_figures.py -fig 1
will generate Figure_1a.pdf and Figure_1b.pdf in the current directory. Similarly, all extended figures in the manuscript can be reproduced by running
python generate_manuscript_efigures.py -efig x
where x can be a single number from 1-8 or “all”.
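The selection rule (a single number or “all”) and the "download only if missing" behaviour can be sketched as follows. This is an illustration of the described behaviour, not the scripts' actual code: the helpers resolve_selection and missing_inputs and the file name example_table.txt are hypothetical.

```python
from pathlib import Path
from typing import List


def resolve_selection(value: str, max_number: int) -> List[int]:
    """Map a command-line value to figure numbers: 'all' or one number in 1..max_number."""
    if value.strip().lower() == "all":
        return list(range(1, max_number + 1))
    try:
        number = int(value)
    except ValueError:
        raise ValueError(f"expected a number or 'all', got {value!r}") from None
    if not 1 <= number <= max_number:
        raise ValueError(f"figure number must be between 1 and {max_number}")
    return [number]


def missing_inputs(filenames: List[str], local_dir: str = "fig_tables") -> List[str]:
    """Return the input files that are not yet present in the local directory."""
    base = Path(local_dir)
    return [name for name in filenames if not (base / name).exists()]


main_figures = resolve_selection("all", max_number=5)    # main figures 1-5
one_efigure = resolve_selection("3", max_number=8)       # extended figures 1-8

# Only files reported here would need to be fetched from the public bucket.
to_fetch = missing_inputs(["example_table.txt"], local_dir="no_such_directory")
print(main_figures, one_efigure, to_fetch)
```

Checking for existing files first means repeated runs reuse the local fig_tables/ directory instead of downloading the inputs again.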