This repository contains all scripts for the manuscript "Decoding common and rare non-coding variant effects across cellular and developmental contexts" by Marderstein, Kundu et al.
For any questions, please contact:
Andrew Marderstein & Soumya Kundu 📧 mardera1@mskcc.org, soumyak@stanford.edu
The scripts in the preprocess/ directory generate all of the outputs used in the downstream analyses.
- 0_process_data/ – Scripts for processing the scATAC-seq data.
- 1_train_chrombpnet/ – Scripts for training the ChromBPNet models.
- 2_score_variants/ – Scripts for scoring the rare, common, and ASD variants.
- 3_shap_variants/ – Scripts for generating DeepLIFT / DeepSHAP contribution scores for both alleles of each variant.
- 4_shap_peaks/ – Scripts for generating DeepLIFT / DeepSHAP contribution scores for scATAC-seq peaks.
- 5_run_modisco/ – Scripts for running TF-MoDISco to identify the motif patterns learned by each model.
- 6_cluster_motifs/ – Scripts for running MotifCompendium to cluster the motif patterns from TF-MoDISco.
- 7_run_finemo/ – Scripts for running Fi-NeMo to identify motif instances in the genome.
The scripts in the analysis/ directory generate all of the results presented in the manuscript.
These scripts process ChromBPNet variant scoring outputs and compile annotation tables for downstream analyses.
- Extract outputs: Run 1_pull_scores.sh to extract relevant ChromBPNet outputs.
- Analyze model performance:
  - 2a_model_performance.R evaluates performance metrics.
  - 2b_model_performance_plot.R identifies model outliers.
- Annotate variants:
  - Use 3a_bed2vcf.Rare.CADD_VEP.R to run CADD and VEP.
  - Process outputs with 3c_Process_CADD_VEP.R.
- Merge results: Run 4_mergeData.R to integrate annotations, merge scores, and remove outliers.
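
The merge step can be pictured roughly as follows. This is a minimal Python sketch for illustration only, not the contents of 4_mergeData.R (which is an R script). The record layout, the field names (`variant_id`, `model`, `score`, `cadd`, `vep`), the model names, and the choice of an inner join are all assumptions made for the example.

```python
def merge_variant_tables(scores, annotations, outlier_models):
    """Join per-variant scores with annotations, skipping outlier models.

    scores:         list of dicts with keys 'variant_id', 'model', 'score'
                    (hypothetical schema)
    annotations:    dict mapping variant_id -> dict of annotation fields
    outlier_models: set of model names flagged as outliers
    """
    merged = []
    for row in scores:
        if row["model"] in outlier_models:
            continue  # drop scores produced by models flagged as outliers
        annot = annotations.get(row["variant_id"])
        if annot is None:
            continue  # inner join: keep only variants that have annotations
        merged.append({**row, **annot})
    return merged


# Toy inputs; all identifiers and values are made up for illustration.
scores = [
    {"variant_id": "chr1:100:A:G", "model": "model_a", "score": 0.42},
    {"variant_id": "chr1:100:A:G", "model": "model_b", "score": -0.10},
    {"variant_id": "chr2:200:C:T", "model": "model_a", "score": 0.05},
]
annotations = {"chr1:100:A:G": {"cadd": 12.3, "vep": "intron_variant"}}

merged = merge_variant_tables(scores, annotations, outlier_models={"model_b"})
print(merged)
```

In this toy run only the `model_a` score for `chr1:100:A:G` survives: the `model_b` row is dropped as an outlier and `chr2:200:C:T` has no annotation.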
These scripts correspond to the manuscript section "Variants effects are shaped by genomic context and TF binding".
They analyze:
- Genomic context – A variant’s proximity to transcribed regions.
- Cell-type specificity – How constrained or widespread variant effects are.
- Regulatory magnitude – The extent of chromatin accessibility and TF binding changes.
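
As a rough illustration of the last two axes, the sketch below summarizes one variant's predicted effects across cell types. It is a hypothetical Python example, not code from this repository: the function name, the cell-type labels, the effect values, and the 0.5 threshold are all assumptions.

```python
def summarize_variant(effects, threshold=0.5):
    """Summarize one variant's effects across cell types.

    effects:   dict mapping cell type -> signed effect score (hypothetical units)
    threshold: absolute effect size above which a cell type counts as affected
               (an arbitrary value chosen for this example)
    """
    if not effects:
        raise ValueError("effects must contain at least one cell type")
    affected = [ct for ct, e in effects.items() if abs(e) >= threshold]
    return {
        # Cell-type specificity: how many contexts show an effect
        "n_affected": len(affected),
        "fraction_affected": len(affected) / len(effects),
        # Regulatory magnitude: the largest absolute effect in any context
        "max_abs_effect": max(abs(e) for e in effects.values()),
    }


# Made-up scores for a single variant.
effects = {"microglia": 1.2, "neuron": 0.1, "cardiomyocyte": -0.7, "astrocyte": 0.0}
summary = summarize_variant(effects)
print(summary)
```

A variant with a low `fraction_affected` would be read as cell-type-specific, and one with a high value as having widespread effects.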
These scripts support the manuscript sections:
- "Context-specific models reveal regulatory effects of fine-mapped eQTLs"
- "Pinpointing disease-relevant variants using cell-type-specific chromatin models"
- "Microglia-driven mechanisms of Alzheimer’s disease risk"
We use ChromBPNet to identify candidate causal variants affecting gene regulation and disease risk.
These scripts correspond to:
- "Ultra-rare variants show larger and more shared regulatory effects than common variants"
- "Specific motifs influence constraint of fetal neuron regulation"
They compare rare and common variant effects to understand the selective pressures that influence allele frequency distributions across human populations.
These scripts correspond to:
- "FLARE: a functional genomic model of constraint"
- "FLARE prioritizes de novo non-coding mutations in autism"
Since PhyloP scores are not context-specific, FLARE models the relationship between genomic context, regulatory effects, and evolutionary conservation within cell-type-specific contexts. FLARE:
- Disentangles accessibility and regulatory effects from conservation.
- Integrates multiple functional genomic features into a unified model.
- Captures regulatory potential across multiple cell types.
The FLARE method is available in its own, separate FLARE repository.
These scripts show additional FLARE applications and correspond to:
- "FLARE prioritizes de novo non-coding mutations in congenital heart disease"
- "FLARE prioritizes variants underlying expression outliers in the adult brain"
- "FLARE captures common variant heritability in schizophrenia"
Marderstein^, Kundu^, et al. Mapping the regulatory effects of common and rare non-coding variants across cellular and developmental contexts in the brain and heart.