This is an extension of our earlier work on Bichrom. We have added a Graph Attention Network (GAT) to predict TF-DNA binding using pre-existing contact matrix data. The GAT implementation follows this paper.
Please refer to bichrom_gat.yaml for the required packages.
Clone the repository and navigate to Bichrom-GAT-Scripts/code:
cd Bichrom-GAT-Scripts/code
- config.py: Configuration file containing the parameters for data construction & training.
- construct_data.py: Script for generating the training and test sets.
- construct_data.sh: Script for submitting the data-construction job to the cluster.
- predict_chip_seq_track_from_seqnet.py: Script for predicting TF bigwig values from a trained seq-net model.
- predict_chip_seq_track_from_seqnet.sh: Script for submitting the prediction job to the cluster.
- predict_on_external_data.py: Script for evaluating performance of a trained GAT-net/Bimodal-net model.
- seq_and_bimodal_networks_ResNet.py: Definitions of all models.
- train_bichrom.py: Script for training both seq-net & GAT-net.
- train_bichrom.sh: Script for submitting the training job to the cluster.
- utils.py: Helper functions for data construction and training.
You can find example input files in the example_input directory (a quick path check is sketched after this list):
- MultiGPS .events file
- ChIP-seq .bigwig file
- Genome sizes file (.info)
- Genome .fasta file
- Blacklist regions .bed file
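As a quick sanity check before running anything, you can verify that every configured input path exists. This is a minimal sketch assuming config.py (described in the next section) is importable from the code directory; the attribute names match the example configuration below.

```python
# Optional sanity check: confirm the configured input files exist.
# Assumes this is run from Bichrom-GAT-Scripts/code so config.py is importable.
import os
import config

input_paths = [config.info, config.fa, config.peaks, config.blacklist,
               config.cool_file_path, *config.bigwig_tracks_path_list]
for path in input_paths:
    status = "OK     " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")
```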
Some important parameters that you may need to modify are given below; please refer to code/config.py for additional parameters. A worked example of the window sizes these parameters imply follows the listing.
# Number of Epochs
EPOCHS = 2
# batch size
batchsize = 64
# Fraction of data to be used for constructing the training and test sets
frac = 0.002
# stride for generating onehot_seq_dict & chip-seq hdf5 file
stride = 50
# prediction region length
window_len = 400
# context around prediction region
context_window_len = 10_000
# stride used for chopping genome during training set construction
chop_genome_stride=50
# number of times to oversample chip-seq (positive) regions in training set
num_oversample_chip_seq_peaks = 5
# output path
out_path = "../example_output"
# input data path
in_path = "../example_input"
# experiment name (a directory with this name will be created in out_path)
exp_name="experiment_name"
# training data directory name (directory with this name will be created in out_path/exp_name)
training_data_dir_name="training_data"
# train out directory name (directory with this name will be created in out_path/exp_name)
train_out_dir_name="train_out"
# test results directory name (a directory with this name will be created in out_path/exp_name/train_out)
test_out_dir_name="test_set_performance"
# path to genome sizes file
info = f"{in_path}/mm10.info"
# path to genome fasta file
fa = f"{in_path}/mm10.fa"
# paths to ChIP-seq bigwig files, given as a list
bigwig_tracks_path_list = [
    f"{in_path}/Ascl1_R1_R2_R3_rep_avg.bw"
]
# path to MultiGPS .events file
peaks = f"{in_path}/multigps_2023-03-30-05-43-08_ES.events"
# path to blacklist regions .bed file
blacklist = f"{in_path}/mm10_blacklist.bed"
# list of training chromosomes
training_chrom_list = ["chr1"]
# list of validation chromosomes
val_chroms = ["chr17"]
# list of test chromosomes
test_chroms = ["chr10"]
# path to .cool file
cool_file_path = f"{in_path}/GSE130275_mESC_WT_combined_1.3B_400_normalized.cool"
# resolution of .cool file
resolution = window_len
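For reference, the example parameters above imply the derived sizes below. The graph-node count is our assumption based on resolution == window_len; the authoritative logic lives in construct_data.py.

```python
# Worked example of the sizes implied by the example config above.
window_len = 400
context_window_len = 10_000
resolution = window_len

total_input_len = window_len + context_window_len   # 10,400 bp seen by seq-net
seqnet_output_bins = total_input_len // 100          # 104 binned ChIP-seq values
graph_nodes = total_input_len // resolution          # 26 nodes per region graph (assumed)

print(total_input_len, seqnet_output_bins, graph_nodes)  # 10400 104 26
```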
To reduce the training time, we have incorporated a few strategies during data construction (a minimal sketch of these steps follows the list):
- Pre-compute one-hot encoded matrices and store them into a dictionary.
- Convert TF .bigwig file into HDF5 file.
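A minimal sketch of these two pre-computation steps, assuming pyfaidx for FASTA access, pyBigWig for the ChIP-seq track, and h5py for the HDF5 store. Names such as onehot_encode and chip_seq.h5 are illustrative; the actual implementation lives in construct_data.py and utils.py.

```python
# Illustrative pre-computation sketch (not the actual construct_data.py logic).
import numpy as np
import h5py
import pyBigWig
from pyfaidx import Fasta

BASE_TO_COL = {"A": 0, "C": 1, "G": 2, "T": 3}

def onehot_encode(seq):
    """One-hot encode a DNA sequence into a (len(seq), 4) float32 matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        col = BASE_TO_COL.get(base)
        if col is not None:        # ambiguous bases (N) stay all-zero
            mat[i, col] = 1.0
    return mat

def precompute(fa_path, bw_path, chroms, stride=50, out_h5="chip_seq.h5"):
    """Pre-compute one-hot windows and dump the bigwig signal into HDF5."""
    genome = Fasta(fa_path)
    bw = pyBigWig.open(bw_path)
    onehot_seq_dict = {}
    with h5py.File(out_h5, "w") as h5:
        for chrom in chroms:
            chrom_len = len(genome[chrom])
            # store the whole-chromosome signal once; training windows are
            # later sliced from this array instead of re-querying the bigwig
            vals = np.nan_to_num(bw.values(chrom, 0, chrom_len), nan=0.0)
            h5.create_dataset(chrom, data=vals.astype(np.float32))
            for start in range(0, chrom_len - stride + 1, stride):
                seq = str(genome[chrom][start:start + stride])
                onehot_seq_dict[(chrom, start)] = onehot_encode(seq)
    bw.close()
    return onehot_seq_dict
```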
To improve the generalization of the GAT model, we construct an unbound (-ve) training set that is much larger than the bound (+ve) training set. We try to generate enough -ve samples so that every batch contains unique -ve samples (although uniqueness is not guaranteed); a rough sizing calculation is sketched below.
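A back-of-the-envelope for sizing the unbound set, assuming (hypothetically) that each batch is split evenly between +ve and -ve samples and that -ve samples are not reused across epochs; the actual sampling logic is in construct_data.py.

```python
# Hypothetical sizing arithmetic; the half/half batch composition is an
# assumption, not the documented behaviour of construct_data.py.
EPOCHS = 2
batchsize = 64
num_pos = 10_000                                  # e.g. oversampled peak windows
batches_per_epoch = num_pos // (batchsize // 2)   # +ve half fills 312 batches
neg_needed = EPOCHS * batches_per_epoch * (batchsize // 2)
print(neg_needed)                                 # 19,968 (~EPOCHS * num_pos)
```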
# Activate conda environment
source activate bichrom
Run:
./construct_data.sh
construct_data.py will produce the following files, including the train and test bed files, in the specified output directory; a quick class-balance check is sketched after the list.
- common_data: Directory containing onehot_seq dictionary & chip-seq hdf5 file.
- training_df_seq.bed: Bed file containing the regions for training seq-net.
- training_df_bimodal_bound.bed: Bed file containing bound regions (+ve samples) for training GAT-net/Bimodal-net.
- training_df_bimodal_unbound.bed: Bed file containing unbound regions (-ve samples) for training GAT-net/Bimodal-net.
- test_df_internal.bed: Internal test-set regions bed file.
- test_df_external.bed: External test-set regions bed file.
- val_df_external.bed: Validation regions bed file.
- stats.txt: Number of +ve and -ve samples in the bed files.
- config.py: Copy of the configuration file.
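After data construction finishes, a quick way to check the class balance is to count rows in the bound/unbound bed files. The paths below assume the example config (out_path/exp_name/training_data_dir_name); stats.txt reports the same numbers.

```python
# Quick class-balance check on the constructed training set.
# header=None assumes plain BED files without a header row.
import pandas as pd

base = "../example_output/experiment_name/training_data"
bound = pd.read_csv(f"{base}/training_df_bimodal_bound.bed", sep="\t", header=None)
unbound = pd.read_csv(f"{base}/training_df_bimodal_unbound.bed", sep="\t", header=None)
print(f"+ve (bound) regions:   {len(bound)}")
print(f"-ve (unbound) regions: {len(unbound)}")
```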
There are two networks in this approach:
- Sequence-only Network (seq-net): This network is trained first on the sequence data to predict the ChIP-seq track. It also serves as the feature generator for the nodes in the contact matrix. The input to seq-net is a region of length config.window_len (the prediction window), which is expanded on both sides by half of config.context_window_len before training. Seq-net is therefore trained on sequences of length config.window_len + config.context_window_len. The model predicts the ChIP-seq bigwig values averaged over bins of size 100. For example, if the window length is 400bp and the context window length is 10,000bp, the input is expanded by 5,000bp on each side (final input size = 10,400bp), and the seq-net output has size 104 (10,400/100).
- GAT Network (GAT-net/Bimodal-net): This network is trained on the contact matrix (adjacency matrix) and the features extracted from a trained seq-net to predict the binding probability. You will need to add or remove layers depending on the size of config.window_len and config.window_len + config.context_window_len (a minimal sketch of a GAT block follows below).
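For orientation, here is a minimal, hypothetical GAT classifier in PyTorch Geometric; the actual architecture (layer count, feature sizes, ResNet-based seq-net features) is defined in seq_and_bimodal_networks_ResNet.py and will differ.

```python
# Toy GAT sketch (PyTorch Geometric); not the architecture used in this repo.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class ToyGATNet(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.fc = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, edge_index, batch):
        # x: per-node features extracted by the trained seq-net
        # edge_index: edges derived from the Hi-C contact (adjacency) matrix
        x = F.elu(self.gat1(x, edge_index))
        x = F.elu(self.gat2(x, edge_index))
        x = global_mean_pool(x, batch)      # one embedding per region graph
        return torch.sigmoid(self.fc(x))    # TF binding probability
```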
Run:
./train_bichrom.sh
The Bichrom output directory contains the following (a short inspection sketch follows this listing):
- seqnet:
- records the training loss of seq-net at each epoch.
- stores models (PyTorch object) checkpointed after each epoch.
- also stores the best performing model (i.e., minimum train loss).
- creates train_hist_seq.csv, which contains the average training loss at every epoch.
- bimodal:
- records the training loss of GAT-net/Bimodal-net at each epoch.
- stores models (PyTorch object) checkpointed after each epoch.
- also stores the best performing model (i.e., minimum train loss).
- creates train_hist_seq.csv, which contains the average training loss at every epoch.
- test_set_performance:
- internal_test_set_performance: stores internal test-set performance of the best GAT-net/Bimodal-net
- test_set_metrics.txt: stores AUC ROC, AUC PRC, Confusion matrix, and number of +ve & -ve predictions at 0.5 cut-off
- test_set_metrics.csv: stores epoch, AUC ROC, and AUC PRC
- test_set_probs_bimodal.txt: stores probabilities predicted by best GAT-net/Bimodal-net
- external_test_set_performance: stores external test-set performance of the best GAT-net/Bimodal-net
- test_set_metrics.txt: stores AUC ROC, AUC PRC, Confusion matrix, and number of +ve & -ve predictions at 0.5 cut-off
- test_set_metrics.csv: stores epoch, AUC ROC, and AUC PRC
- test_set_probs_bimodal.txt: stores probabilities predicted by best GAT-net/Bimodal-net
- ids: training sample IDs (for debugging purposes)
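To inspect a training run, you can load the saved history and a checkpointed model directly. The paths follow the example config, and the checkpoint filename below is an assumption based on the example path used in the next section.

```python
# Inspect seq-net training output (checkpoint filename pattern is an assumption).
import pandas as pd
import torch

seqnet_dir = "../example_output/experiment_name/train_out/seqnet"
hist = pd.read_csv(f"{seqnet_dir}/train_hist_seq.csv")
print(hist.tail())

# Checkpoints are stored as full PyTorch model objects; newer PyTorch versions
# may require weights_only=False when loading such objects.
model = torch.load(f"{seqnet_dir}/best_seq_model_epoch_1_train_loss_0.32.pt",
                   map_location="cpu")
model.eval()
```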
Modify the following parameters in the predict_chip_seq_track_from_seqnet.py file.
# chromosome for which to predict bigwig track
_chr="chr10"
# trained seq-net model path
seq_model_path=f"{config.train_out_path}/seqnet/best_seq_model_epoch_1_train_loss_0.32.pt"
# output directory path
out_path=f"{config.exp_path}/predict_chip_seq_track_from_seqnet"
# window length of each sample. This should be equal to the total window length that seq-net was trained on (prediction window length + context window length)
chop_genome_window_size=config.window_len+config.context_window_len
Then predict the ChIP-seq track for the selected chromosome.
Run:
./predict_chip_seq_track_from_seqnet.sh
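Once the job finishes, the predicted track can be inspected with pyBigWig. The output filename below is hypothetical; check the configured out_path directory for the actual name.

```python
# Peek at the predicted ChIP-seq track (output filename is hypothetical).
import pyBigWig

bw = pyBigWig.open("../example_output/experiment_name/"
                   "predict_chip_seq_track_from_seqnet/chr10_predicted.bw")
print(bw.header())
print(bw.stats("chr10", 3_000_000, 3_010_000))   # mean signal over a 10 kb window
bw.close()
```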