This repository contains the code for training and running APARENT, a deep neural network that can predict human 3' UTR Alternative Polyadenylation (APA), annotate genetic variants based on their impact on APA regulation, and engineer new polyadenylation signals according to target isoform abundances or cleavage profiles.
APARENT was described in Bogard et al., Cell 2019 (in press).
The model was trained on >3.5 million randomized 3' UTR poly-A signals expressed on minigene reporters in HEK293 cells.
Forward-engineering of new poly-A signals is done using the included SeqProp (Stochastic Sequence Backpropagation) software, which implements a gradient-based input optimization algorithm and uses APARENT as the predictor.
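To give a feel for what gradient-based input optimization means, here is a minimal NumPy sketch. It is not the SeqProp implementation and does not call APARENT: the predictor is a hypothetical toy function (a sigmoid over a linear score), and the names `toy_predictor`, `optimize_sequence` and `toy_weights` are made up for illustration. The idea shown is only the general one: keep the predictor fixed, treat the sequence as trainable logits, and follow the gradient of a target loss with respect to those logits.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: turns each row of logits into nucleotide probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def toy_predictor(pwm, weights):
    """Hypothetical stand-in for a trained predictor (NOT APARENT)."""
    return 1.0 / (1.0 + np.exp(-np.sum(weights * pwm)))

def optimize_sequence(weights, target, n_steps=300, lr=1.0):
    """Gradient descent on sequence logits so the toy prediction approaches `target`."""
    logits = np.zeros_like(weights)  # start from a uniform nucleotide distribution
    losses = []
    for _ in range(n_steps):
        pwm = softmax(logits)
        pred = toy_predictor(pwm, weights)
        losses.append((pred - target) ** 2)
        # Chain rule: loss -> prediction -> linear score -> PWM -> logits.
        grad_pwm = 2.0 * (pred - target) * pred * (1.0 - pred) * weights
        grad_logits = pwm * (grad_pwm - np.sum(grad_pwm * pwm, axis=1, keepdims=True))
        logits -= lr * grad_logits
    pwm = softmax(logits)
    return pwm, toy_predictor(pwm, weights), losses

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(20, 4))  # 20 positions x 4 nucleotides (A, C, G, T)
# Centre each position so the uniform starting PWM has a prediction of exactly 0.5.
toy_weights -= toy_weights.mean(axis=1, keepdims=True)

pwm, final_pred, losses = optimize_sequence(toy_weights, target=0.9)
designed = "".join("ACGT"[i] for i in pwm.argmax(axis=1))
print(designed, round(float(final_pred), 3))
```

In the real setting the toy predictor would be replaced by the trained network and the gradient would come from the deep learning framework rather than being derived by hand.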
Links to the IPython Notebooks containing all of the analyses performed in the paper appear further down this page, along with a link to the repository containing all of the processed data used by the notebooks.
Contact jlinder2 (at) cs.washington.edu for any questions about the model or data.
We host a publicly accessible web application where users can predict APA isoform abundance and variant effects with APARENT and visualize the results.
The web prediction tool is located at https://apa.cs.washington.edu.
APARENT can be installed by cloning or forking the GitHub repository:
git clone https://github.com/johli/aparent.git
cd aparent
python setup.py install
APARENT requires the following packages:
- TensorFlow >= 1.13.1
- Keras >= 2.2.4
- SciPy >= 1.2.1
- NumPy >= 1.16.2
- Isolearn >= 0.2.0 (GitHub)
- [Optional] SeqProp >= 0.1 (GitHub)
APARENT is built as a Keras Model, so it can be loaded and run with standard Keras function calls. See the example usage notebooks below for a tutorial on how to use the model for APA and variant effect prediction.
This simple example illustrates how to predict the isoform abundance and cleavage profile of an input APA event:
from keras.models import load_model
from aparent.predictor import get_apadb_encoder

# Load the APADB-tuned APARENT model and its input encoder
apadb_model = load_model('../saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5')
apadb_encoder = get_apadb_encoder()

# Example APA sites (gene = PSMC6)
# Proximal and distal PAS sequences
seq_prox = 'AGATAGTGGTATAAGAAAGCATTTCTTATGACTTATTTTGTATCATTTGTTTTCCTCATCTAAAAAGTTGAATAAAATCTGTTTGATTCAGTTCTCCTACATATATATTCTTGTCTTTTCTGAGTATATTTACTGTGGTCCTTTAGGTTCTTTAGCAAGTAAACTATTTGATAACCCAGATGGATTGTGGATTTTTGAATATTAT'
seq_dist = 'TGGATTGTGGATTTTTGAATATTATTTTAAAATAGTACACATACTTAATGTTCATAAGATCATCTTCTTAAATAAAACATGGATGTGTGGGTATGTCTGTACTCCTCCTTTCAGAAAGTGTTTACATATTCTTCATCTACTGTGATTAAGCTCATTGTTGGTTAATTGAAAATATACATGCACATCCATAACTTTTTAAAGAGTA'

# Distance between the proximal and distal sites
site_distance = 180

# Proximal and distal cut intervals within each sequence defining the isoforms
prox_cut_start, prox_cut_end = 80, 105
dist_cut_start, dist_cut_end = 80, 105

# Predict with the APADB-tuned APARENT model
encoded_input = apadb_encoder(
    [seq_prox], [seq_dist],
    [prox_cut_start], [prox_cut_end],
    [dist_cut_start], [dist_cut_end],
    [site_distance]
)
iso_pred, cut_prox, cut_dist = apadb_model.predict(x=encoded_input)

print("Predicted proximal vs. distal isoform % (APADB) = " + str(iso_pred[0, 0]))
The two notebooks listed below illustrate how to use the APARENT Keras models to predict APA given a proximal and distal site, and to predict APA variant effects, respectively. We recommend using the following two model versions:
saved_models/aparent_large_lessdropout_all_libs_no_sampleweights.h5
The base version of APARENT. Given an input sequence, predicts the (non-normalized) isoform abundance and cleavage distribution. It is non-normalized in the sense that predictions are not scaled w.r.t. a particular distal site, but rather the average distal bias of the training MPRA data. The main use of this model is to predict the effect of variants, by calculating the odds ratio between variant and wildtype isoform predictions.
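As a minimal sketch of the variant-effect calculation described above, the snippet below computes the log odds ratio between a variant and a wildtype isoform prediction. The function name `isoform_log_odds_ratio`, the clipping constant `eps` and the two example values are assumptions made for illustration; the example values are not model output. Taking the natural logarithm of the odds ratio is a presentation choice here, so that 0 means no predicted effect.

```python
import numpy as np

def isoform_log_odds_ratio(p_variant, p_wildtype, eps=1e-6):
    """Log odds ratio between variant and wildtype isoform proportions."""
    # Clip to avoid division by zero for predictions at exactly 0 or 1.
    p_var = np.clip(np.asarray(p_variant, dtype=float), eps, 1.0 - eps)
    p_wt = np.clip(np.asarray(p_wildtype, dtype=float), eps, 1.0 - eps)
    return np.log(p_var / (1.0 - p_var)) - np.log(p_wt / (1.0 - p_wt))

# Hypothetical predictions, for illustration only (not model output).
wildtype_pred = 0.50
variant_pred = 0.80
print(isoform_log_odds_ratio(variant_pred, wildtype_pred))
```

A positive value means the variant is predicted to increase usage of the site relative to wildtype, and a negative value means it is predicted to decrease it. Because both predictions carry the same average distal bias, that bias cancels in the ratio.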
saved_models/aparent_apadb_fitted_large_lessdropout_no_sampleweights.h5
A siamese APARENT network model, expecting both proximal and distal sequences as input. APARENT scores each site independently. The scores are weighted and combined with the log site distance, where the combination weights have been fitted on the Pooled-Tissue APADB data.
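The combination step can be pictured roughly as follows. This is only a sketch under stated assumptions: the exact functional form used by the model is not spelled out here, so the code assumes a logistic function over a weighted sum, and the weight values, the bias and the function name `combine_site_scores` are hypothetical placeholders rather than the fitted APADB parameters.

```python
import numpy as np

def combine_site_scores(score_prox, score_dist, site_distance,
                        w_prox=1.0, w_dist=-1.0, w_log_distance=0.5, bias=0.0):
    """Combine two independent site scores and the log site distance.

    All weights are hypothetical; the real model uses weights fitted on APADB data.
    """
    logit = (w_prox * score_prox
             + w_dist * score_dist
             + w_log_distance * np.log(site_distance)
             + bias)
    # Squash to (0, 1) so the result reads as a proximal isoform proportion.
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical scores for illustration only.
print(combine_site_scores(score_prox=2.0, score_dist=1.0, site_distance=180))
```

With this form, a stronger proximal score raises the predicted proximal proportion, and the distance term shifts the balance between the two sites regardless of their individual strengths.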
Notebook 1: APA Isoform & Cleavage Prediction
Notebook 2: APA Variant Effect Prediction
Note: The recommended models are not the version evaluated in the paper; they have been trained on all MPRA libraries (no libraries have been held out) in order to make the best APA predictor possible.
The Legacy Model is the version evaluated in the paper, which we provide here for reproducibility. The model architecture itself has not changed since the Legacy version, but the newest version has been trained on all MPRA libraries. The Legacy models (base version and APADB-fitted version) are located at saved_models/legacy_models/.
The Legacy model was originally built and trained using Theano. Theano is no longer under active development, so we have ported the original model to Keras. The original Theano training code can be found in the repository below:
The raw sequencing data for the 3' UTR MPRA libraries are found at GEO accession GSE113849.
The Legacy Data is the version of the processed data analyzed in the paper, which we provide here for reproducibility. The newest version of the data has been re-processed with the following additional improvements:
- Exact cleavage positions have been mapped for the Alien1 Random MPRA Sublibrary.
- A 20 nt random barcode upstream of the USE in the Alien1 Sublibrary has been included in the sequence.
Processed Data Repository
Processed Data Repository (legacy)
Note: The "Processed Data Repository" also includes the Legacy data, but the data has been re-formatted such that it is easier to work with in Keras.
The following collection of IPython Notebooks contains all of the analyses performed in the paper. To aid reproducibility, we have used the Legacy APARENT model and Legacy Data in all of the notebooks.
Log Odds Ratio Analysis of hexamers in the Random MPRA libraries and Linear Logistic Hexamer Regression.
Notebook 1a: Isoform Log Odds Ratio Analysis (Alien1 Library)
Notebook 1b: Isoform Log Odds Ratio Analysis (Alien2 Library)
Notebook 2: Cleavage Log Odds Ratio Analysis (Alien1 Library)
Notebook 3a: Hexamer Logistic Regression (Combined Library)
Notebook 3b: Hexamer Logistic Regression (TOMM5 Library only)
Notebook 3c: Hexamer Logistic Regression (Alien1 Library only)
Notebook 3d: Hexamer Logistic Regression (Alien2 Library only)
Evaluation of APARENT on the Random MPRA libraries, and Convolutional Layer 1 & 2 visualizations.
Notebook 1: MPRA Prediction Evaluation
Notebook 2a: Conv Layer 1 and 2 Analysis (Alien1 Library)
Notebook 2b: Conv Layer 1 and 2 Analysis (Alien2 Library)
Notebook 3: CSE Hexamer Filter (Conv Layer 1)
Notebook 4: Cleavage Motifs (Conv Layer 1)
Engineering of PAS sequences according to target isoform and cleavage objectives (and DeepDream).
Notebook 1: Target Isoform Sequence Optimization
Notebook 2: Target Cleavage Sequence Optimization
Notebook 3: Dense Layer Sequence Visualization (DeepDream-Style)
Analysis of the Designed MPRA library, including Forward-engineering, Native PAS prediction, and Variant analysis.
Notebook 0a: Basic MPRA Library Statistics
Notebook 0b: MPRA LoFi vs. HiFi Replicates
Notebook 1a: SeqProp Target Isoforms (Summary)
Notebook 1b: SeqProp Target Isoforms (Detailed)
Notebook 2a: SeqProp Target Cut (Summary)
Notebook 2b: SeqProp Target Cut (Detailed)
Notebook 3: Human Wildtype APA Prediction
Notebook 4a: Human Variant Analysis (Summary)
Notebook 4b: Disease-Implicated Variants/UTRs (Detailed)
Notebook 4c: Cleavage-Altering Variants (Detailed)
Notebook 5a: Complex Functional Variants (Summary)
Notebook 5b: Complex Functional Variants (Canonical CSE)
Notebook 5c: Complex Functional Variants (Cryptic CSE)
Notebook 5d: Complex Functional Variants (CFIm25)
Notebook 5e: Complex Functional Variants (CstF)
Notebook 5f: Complex Functional Variants (Folding)
Notebook Bonus: TGTA Motif Saturation Mutagenesis
Analysis of native human APA (APADB and Leslie APA Atlas), including cell-type specific APA prediction evaluation.
Data sources: (APADB | Leslie)
Notebook 0: Basic Data Statistics
Notebook 1: Differential Usage Analysis
Notebook 2: Cleavage Site Prediction
Notebook 3: APA Isoform Prediction
Notebook 4: APA Isoform Prediction (Cross-Validation)