Learning Evidence of Pairwise Association from Epigenomic and TF binding data (LEPAE) quantifies evidence of association for pairs of genomic windows based on large-scale epigenomic and TF binding data while considering distance information.
Score tracks are provided in `.hic` format.
This pipeline is written in Snakemake, which enables it to run in various execution environments. It is highly recommended to run it in a high-performance computing (HPC) environment; we used UCLA's Hoffman2 cluster.
If the following dependencies are available, there is no installation step other than cloning this repository.
- Snakemake 6.8.1
- Python 3.9.6
- Numpy 1.22.2
- Pandas 1.3.2
- Pytorch 1.10.0+cu102
- Scikit-learn 0.24.2
- Bedtools 2.30.0
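To confirm the dependencies are available, a quick check like the following can be run (commands are illustrative; adjust them to how your environment exposes each tool):

```bash
# Print the version of each dependency to compare against the list above.
snakemake --version   # expect 6.8.1
python --version      # expect 3.9.6
python -c "import numpy, pandas, torch, sklearn; print(numpy.__version__, pandas.__version__, torch.__version__, sklearn.__version__)"
bedtools --version    # expect v2.30.0
```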
```bash
snakemake \
-j 100 \
--configfile parameters_1kb.yaml \
--cluster-config cluster.json \
--cluster "qsub -V -l h_rt={cluster.time},h_data={cluster.memory} -pe shared {cluster.threads} -N {cluster.name} -j y -o {cluster.output}" \
--latency-wait 60 \
--local-cores 1 \
--use-conda \
--restart-times 3
```
- `-j 100` limits the number of parallel jobs to 100. Adjust this number based on resource availability.
- `--configfile parameters_1kb.yaml` specifies the model configuration file.
- `--cluster-config cluster.json` specifies the cluster configuration file.
- `--cluster "qsub -V -l h_rt={cluster.time},h_data={cluster.memory} -pe shared {cluster.threads} -N {cluster.name} -j y -o {cluster.output}"` specifies the cluster submission command.
- `--latency-wait 60` waits 60 seconds before checking the status of the submitted jobs. This is optional; remove or adjust this number accordingly.
- `--local-cores 1` limits the number of local cores used to 1. This is optional.
- `--use-conda` enables using the conda environment defined in env.yaml for some of the jobs. Snakemake will automatically create the conda environment based on the YAML file. Depending on the computing environment, this YAML file may need to be modified. See here for more details. If all the prerequisites are already installed and available to the pipeline, this flag is not necessary.
- `--restart-times 3` restarts failed jobs up to 3 times. This is optional.
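The cluster configuration file supplies the values referenced by the `{cluster.*}` placeholders in the submission command. A minimal sketch of what such a file could look like (the field values below are illustrative, not the entries of the repository's cluster.json; rule-specific entries can override `__default__`):

```json
{
    "__default__": {
        "time": "8:00:00",
        "memory": "4G",
        "threads": 1,
        "name": "lepae",
        "output": "logs/"
    }
}
```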
As a point of reference, the 10-kb resolution model using parameters_10kb.yaml took ~22 hours to complete using 100 parallel jobs on UCLA's Hoffman2 cluster. Note that the run time will vary depending on the computing environment, resource availability, and input data.
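Before committing that much compute, Snakemake's standard dry-run mode can preview the jobs that would be executed without running anything:

```bash
# List the planned jobs and the reason each would run, without executing them.
snakemake -n -r --configfile parameters_1kb.yaml
```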
Name | Description | Example(s) | Where to specify |
---|---|---|---|
Snakefile | Pipeline definition | Snakefile | NA |
Config YAML | Model configuration | demo/demo.yaml, parameters_1kb.yaml, parameters_10kb.yaml | --configfile or inside Snakefile |
Cluster JSON | Cluster configuration | cluster.json | --cluster-config |
Chromosome size | Chromosome sizes | hg38.chrom.sizes | Config YAML |
DNase-seq and ChIP-seq datasets | Text file listing URLs of gzipped .bed files containing peak calls from DNase-seq and ChIP-seq experiments | files_hg38_DNaseChIPseq.txt | Config YAML |
ChromHMM annotations | Text file listing URLs of gzipped .bed files containing ChromHMM tracks | files_hg38_ChromHMM.txt | Config YAML |
The 1-kb resolution LEPAE score was generated using parameters_1kb.yaml and the 10-kb resolution LEPAE score was generated using parameters_10kb.yaml. The other files are used for both resolutions. Input and output paths in the config YAML and cluster JSON files should be updated accordingly.
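The two dataset lists are plain text files of URLs to gzipped .bed files, presumably one URL per line. A sketch with placeholder URLs (not entries from the actual files):

```text
https://example.org/peaks/dnase_experiment_1.bed.gz
https://example.org/peaks/tf_chipseq_experiment_1.bed.gz
https://example.org/peaks/tf_chipseq_experiment_2.bed.gz
```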
```yaml
# parameters_1kb.yaml
env: "cluster" # Or "local" to run in a local environment. Used to load bedtools module (See below)
min_dist: 1000 # Minimum pairwise distance in bp. Typically same as window_size
max_dist: 100000 # Maximum pairwise distance in bp
window_size: 1000 # Window size in bp
seed: 413666 # Random seed for reproducibility
training_data_size: 50000 # Number of training data points
tuning_data_size: 5000 # Number of tuning data points
num_random_search: 10 # Number of combinations of hyperparameters to try during random search
ensemble_size: 5 # Number of models to train in an ensemble
num_dnasechipseq_experiments: 3337 # Number of DNase-seq and ChIP-seq experiments
num_chromhmm_experiments: 127 # Number of ChromHMM datasets
num_chromhmm_states: 25 # Number of ChromHMM states
paths: # Paths to input files and output directory
    dnasechipseq_experiment_paths_file: "/u/project/ernst/skwon94/pairwise_scoring/files_hg38_DNaseChIPseq.txt"
    chromhmm_experiment_paths_file: "/u/project/ernst/skwon94/pairwise_scoring/files_hg38_ChromHMM.txt"
    chrom_size_file: "/u/project/ernst/skwon94/pairwise_scoring/hg38.chrom.sizes"
    output_dir: "/u/project/ernst/skwon94/pairwise_scoring/model/gw_1kb"
```
If `env` is set to `"cluster"` in the config YAML file, the `bedtools` module, which is available in our HPC environment, is loaded before running any `bedtools` command, as shown below. Depending on your environment, this may not be necessary or may need to change (e.g., use a conda environment or Docker image instead).
```bash
# Snakefile
...
if [ "{env}" = "cluster" ]; then
    . /u/local/Modules/default/init/modules.sh; module load bedtools
fi
bedtools ...
```
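If your environment does not provide environment modules, one alternative (a sketch, assuming bedtools is installed via conda) is to set `env: "local"` in the config YAML and make sure bedtools is on the PATH before invoking Snakemake:

```bash
# Install bedtools into the active conda environment so the plain
# `bedtools` invocation in the Snakefile works, then set env: "local".
conda install -c bioconda bedtools=2.30.0
bedtools --version   # verify bedtools is on PATH
```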
This demo is for a local environment and has been tested in a Linux environment on a Windows laptop.
Create a conda environment with dependencies and run the demo inside the environment as follows:
```bash
conda create -c conda-forge -c bioconda -n lepae snakemake=6.8.1 python=3.9.6 numpy=1.22.2 pandas=1.3.2 pytorch=1.10.0 scikit-learn=0.24.2 bedtools=2.30.0 tabulate=0.8
conda activate lepae
snakemake --configfile demo/demo.yaml -c1
```
In total, 149 steps run in ~15 minutes. The output file `demo/model/prediction/NN/prediction_min10000_max30000_window10000.bedpe.gz` contains the final score predictions for pairwise distances of 10, 20, and 30 kb at 10-kb resolution.
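To take a quick look at the demo output (the score is in column 11 of the .bedpe file, per the conversion command later in this document):

```bash
# Print the first few scored window pairs from the demo prediction file.
gzip -cd demo/model/prediction/NN/prediction_min10000_max30000_window10000.bedpe.gz | head -n 5
```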
Given the minimal set of input datasets and small training data size, the trained models are not expected to produce meaningful predictions. To take full advantage of this pipeline, scale up to thousands of datasets and a larger training data size, as done in parameters_1kb.yaml and parameters_10kb.yaml, in an HPC environment.
The main output directories and files are listed below in the order they are generated. The final output file containing the LEPAE score is highlighted in bold. Scores are in .bedpe format.
Name | Description |
---|---|
`model/window/all_windows.bed.gz` | Gzipped .bed file containing all genomic windows of the specified window size |
`model/feature/intersect/hg38_DNaseChIPseq/{experiment_name}.bed.gz` | Gzipped .bed file containing DNase-seq and ChIP-seq peak calls intersected with genomic windows |
`model/feature/intersect/hg38_ChromHMM/{annotation_name}.bed.gz` | Gzipped .bed file containing ChromHMM state annotations intersected with genomic windows |
`model/data/{chrom}.window` | Chromosome-specific tab-separated file with each row corresponding to a genomic window. Each row follows the format `chrom start end window_index \| feature_index_1 feature_index_2 feature_index_3 ...`. The list after `\|` consists of the indices of features that overlap with the genomic window |
`model/classifier/NN/even_heldout/` | Directory containing results from hyperparameter tuning and trained models for pairs in even chromosomes |
`model/classifier/NN/odd_heldout/` | Directory containing results from hyperparameter tuning and trained models for pairs in odd chromosomes |
`model/prediction/NN/even_heldout/{chrom}/` | Chromosome-specific directory containing gzipped .bedpe files of individual and aggregated predictions from models with even chromosomes held out during training |
`model/prediction/NN/odd_heldout/{chrom}/` | Chromosome-specific directory containing gzipped .bedpe files of individual and aggregated predictions from models with odd chromosomes held out during training |
**`model/prediction/NN/prediction_min{min_dist}_max{max_dist}_window{window_size}.bedpe.gz`** | Gzipped .bedpe file of the final LEPAE score for all pairs of genomic windows of the specified window size and within the specified distance range |
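As an example of downstream use, the final .bedpe output can be filtered with standard command-line tools. A sketch (the file name follows parameters_1kb.yaml; the coordinates are illustrative, and the score is taken from column 11 as in the conversion command below):

```bash
# Report pairs whose first window falls in chr1:1,000,000-2,000,000, with scores.
gzip -cd model/prediction/NN/prediction_min1000_max100000_window1000.bedpe.gz \
    | awk -v OFS="\t" '$1 == "chr1" && $2 >= 1000000 && $3 <= 2000000 {print $1, $2, $3, $4, $5, $6, $11}'
```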
This requires having the Juicer tool installed.
First, convert the .bedpe file into a "short with score" (.sws) file using the following command. Replace `{final_bedpe}` and `{output_prefix}` with the actual file path and desired output prefix, respectively.
```bash
gzip -cd {final_bedpe} | awk -v OFS="\t" '{print 0,$1,$2,0,0,$1,$5,1,$11}' > {output_prefix}.sws
```
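For reference, the "short with score" format expected by Juicer's pre command has nine fields: str1, chr1, pos1, frag1, str2, chr2, pos2, frag2, score. The awk command above fills these from the .bedpe columns as sketched here:

```bash
# bedpe -> short-with-score mapping used above (intra-chromosomal pairs):
#   str1=0, chr1=$1, pos1=$2, frag1=0,
#   str2=0, chr2=$1 (same chromosome), pos2=$5, frag2=1, score=$11
# The dummy fragment fields are set to the differing values 0 and 1.
head -n 2 {output_prefix}.sws
```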
Then, convert the .sws file to a .hic file using the Juicer tool's `pre` command. Replace `{input_sws}` and `{output_hic}` with the actual file paths. Adjust the resolution (`-r` option) as needed. Juicer tool version 1.22.01 was used in this work.
```bash
java -Xmx6g -jar source/juicer_tools_1.22.01.jar pre -r {window_size} -n --random-seed {seed} {input_sws} {output_hic} hg38
```