atac_seq2

Process and analyze your ATAC-Seq datasets

1. Prepare your work environment

# clone this repo to a new working directory
git clone git@github.com:maxsonBraunLab/atac_seq2.git

# cd into the cloned repo and make a new dir for your FASTQ files
cd atac_seq2
mkdir -p samples/raw

# cd into 'samples/raw' and symlink your FASTQ files
cd samples/raw
ln -s /absolute/path/to/files/condition1_replicate1_R1.fastq.gz .
ln -s /absolute/path/to/files/condition1_replicate1_R2.fastq.gz .
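
Before moving on, it can help to confirm that the symlinks resolve to real files. This is an optional sanity check, not part of the pipeline itself:

# from inside samples/raw, list the symlinks and their targets
ls -l
# report any broken symlinks (GNU find)
find . -xtype l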

2. Prepare your conda environment

Follow this step if you do not already have a conda environment with Snakemake installed.

# while using base conda env, create your snakemake environment
conda install -c conda-forge mamba # installs into base env
mamba create -c conda-forge -c bioconda -n snakemake snakemake # installs snakemake into new env

# activate the new environment before running the pipeline
conda activate snakemake
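
To confirm the environment is ready, you can check that Snakemake is on your PATH (an optional check):

# print the installed Snakemake version
snakemake --version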

3. Prepare your pipeline configuration

Edit the config.yaml file to specify which organism to use and other pipeline parameters.

Edit the config/metadata.csv file to specify which replicates belong with which condition in DESeq2.
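
For illustration only, a metadata file of this kind maps each sample to its condition. The column names below are hypothetical; use the column names in the config/metadata.csv template that ships with this repo:

# hypothetical layout for config/metadata.csv -- check the repo's template for the real column names
cat > config/metadata.csv << 'EOF'
sample,condition
MOLM24D_1,MOLM24D
MOLM24D_2,MOLM24D
SETBP1_CSF3R_Mutant_1,SETBP1_CSF3R_Mutant
SETBP1_CSF3R_Mutant_2,SETBP1_CSF3R_Mutant
EOF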

4. Run the pipeline

You can run the pipeline using an interactive node like this:

srun --cores=20 --mem=64G --time=24:00:00 --pty bash
conda activate snakemake
snakemake -j 20 --use-conda

This is sufficient for small jobs or running small parts of the pipeline, but not appropriate for the entire process.
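
Before committing to a full run, you can ask Snakemake to show what it would do without executing anything. This uses Snakemake's standard -n/--dry-run flag and is suggested here as an optional check:

# preview the jobs that would run, without executing them
snakemake -n -j 20 --use-conda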

You can run the pipeline in batch mode like this:

snakemake -j 64 --use-conda --rerun-incomplete --latency-wait 60 \
  --cluster-config cluster.yaml \
  --cluster "sbatch -p {cluster.partition} -N {cluster.N} -t {cluster.t} -J {cluster.J} -c {cluster.c} --mem={cluster.mem}" \
  -s Snakefile

# if the above does not work, try the command below
snakemake -j 64 --use-conda --rerun-incomplete --latency-wait 60 --cluster-config cluster.yaml --profile .slurm

This will submit up to 64 jobs to the exacloud cluster and is appropriate for the computationally intensive steps (read alignment, peak calling, and calling differentially open chromatin regions).
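
The {cluster.partition}, {cluster.N}, {cluster.t}, {cluster.J}, {cluster.c}, and {cluster.mem} placeholders above are filled from cluster.yaml. The repo provides its own cluster.yaml; the sketch below only illustrates the expected shape, and all values are placeholders to adjust for your cluster:

# hypothetical cluster.yaml sketch -- this repo ships its own cluster.yaml, treat these values as illustrative
__default__:
  partition: exacloud   # SLURM partition to submit to (example value)
  N: 1                  # nodes per job
  t: 24:00:00           # wall-clock time limit
  J: atac_seq2          # job name
  c: 8                  # cores per job
  mem: 32G              # memory per job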

Pipeline Summary

Inputs

  • Reads in this specific format: {condition}_{replicate}_{dir}.fastq.gz (see the example after this list)
    • Condition = experimental treatment, such as 'MOLM24D', 'SETBP1_CSF3R_Mutant', or 'SETBP1_CSF3R_Control_8hr'. Multiple underscores and mixing text with numbers are OK.
    • Replicate = biological replicate. Acceptable values are integers >= 1.
    • Dir = read direction. Acceptable values are ['R1', 'R2'].
    • Reads must be placed in the samples/raw directory.
  • Adapter file in FASTA format for adapter trimming.
  • Reference genome in FASTA format.
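
For example, two conditions with two replicates each would appear in samples/raw as follows (filenames are illustrative, reusing the example conditions above):

# example contents of samples/raw for two conditions x two replicates
MOLM24D_1_R1.fastq.gz
MOLM24D_1_R2.fastq.gz
MOLM24D_2_R1.fastq.gz
MOLM24D_2_R2.fastq.gz
SETBP1_CSF3R_Mutant_1_R1.fastq.gz
SETBP1_CSF3R_Mutant_1_R2.fastq.gz
SETBP1_CSF3R_Mutant_2_R1.fastq.gz
SETBP1_CSF3R_Mutant_2_R2.fastq.gz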

Outputs

  • Quality Control
    • Fragment length distribution plot
    • Fraction of Reads in Peaks (FRiP) per sample
    • PCA of all replicates
  • Table of QC metrics per sample (e.g. number of reads before and after removing mitochondrial reads, duplicate reads, poorly mapping reads)
  • Counts table of peaks (rows are intervals, columns are samples)
  • Consensus peaks among n replicates (n is configurable)
  • Read pileup tracks in bigWig format (in progress)
  • Differentially open chromatin regions for all unique combinations of conditions.
    • Instead of specifying contrasts explicitly, the pipeline will assess all unique combinations of conditions.
  • Processed data (tracks, QC metrics, counts table) are in the data directory.

Methods

.

References