Clinical research pipeline for exploring variants in whole genome (WGS) and exome (WES) sequencing data
crg2 is a research pipeline aimed at discovering clinically relevant variants in whole genome and exome sequencing data. It aims to provide reproducible results, be computationally efficient, and transparent in it's workflow.
crg2 uses Snakemake and Conda to manage jobs and software dependencies.
A note about the config files: the values in config_hpf.yaml refer to tool/filepaths on SickKid's HPC4Health tenancy (hpf), while the values in config_cheo_ri.yaml refer to tool/filepaths on CHEO's HPC4Health tenancy. For simplicity, we refer below only to config_hpf.yaml; if you are running crg2 on CHEO's tenancy, use config_cheo_ri.yaml instead.
- If you are running crg2 on the hpf, skip to the 'Running the pipeline' section.
- Download and setup Anaconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
- Install Snakemake 5.10.0 via Conda: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
- Git clone this repo, crg, and cre. crg2 uses various scripts from other two repos to generate final reports.
- Make a directory for Conda to install all its environments and executables in, for example:
mkdir ~/crg2-conda
- Navigate to the crg2 directory. Install all software dependencies using:
- WGS:
cd crg2 snakemake --use-conda -s Snakefile --conda-prefix ~/crg2-conda --create-envs-only
- WES: This will install additional tools like freebayes, platypus, mosdepth and gatk3.
cd crg2 snakemake --use-conda -s cre.Snakefile --conda-prefix ~/crg2-conda --create-envs-only
Make sure to replace ~/crg2-conda
with the path made in step 4. This will take a while.
-
Install these plugins for VEP:
LoF, MaxEntScan, SpliceRegion
. Refer to this page for installation instructions: https://useast.ensembl.org/info/docs/tools/vep/script/vep_plugins.html. The INSTALL.pl script has been renamed to vep_install in the VEP's Conda build. It is located in the conda environment directory, undershare/ensembl-vep-99.2-0/vep_install
. Therefore, your command should be similar to:fb5f2eb3/share/ensembl-vep-99.2-0/vep_install -a p --PLUGINS LoF,MaxEntScan,SpliceRegion
-
Git clone cre:
git clone https://github.com/ccmbioinfo/cre
to a safe place. -
Replace the VEP paths to the VEP directory installed from step 6. Replace the cre path in crg2/config_hpf.yaml with the one from step 7.
-
AnnotSV 2.1 is required for SV report generation.
- Download AnnotSV:
wget https://lbgi.fr/AnnotSV/Sources/AnnotSV_2.1.tar.gz
- Unpack :
tar -xzvf AnnotSV_2.1.tar.gz
- Set the value of $ANNOTSV in your .bashrc:
export ANNOTSV=/path_of_AnnotSV_installation/bin
- Modify AnnotSV_2.1/configfile:
- set
-bedtools: bedtools
- set
-overlap: 50
- set
-reciprocal yes
- set
-svtBEDcol: 4
- set
- To generate a gene panel from an HPO text file exported from PhenomeCentral or G4RD, add the HPO filepath to
config["run"]["hpo"]
. You will also need to generate Ensembl and RefSeq gene files as well as an HGNC gene mapping file.
- Download and unzip Ensembl gtf:
wget -qO- http://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz | gunzip -c > Homo_sapiens.GRCh37.87.gtf
- Download and unzip RefSeq gff:
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz | gunzip -c > GRCh37_latest_genomic.gff
- Download RefSeq chromosome mapping file:
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_assembly_report.txt
- Run script to parse the above files:
python scripts/clean_gtf.py --ensembl_gtf /path/to/Homo_sapiens.GRCh37.87.gtf --refseq_gff3 /path/to/GRCh37_latest_genomic.gff --refseq_assembly /path/to/GRCh37_latest_assembly_report.txt
- Add the paths to the output files, Homo_sapiens.GRCh37.87.gtf_subset.csv and GRCh37_latest_genomic.gff_subset.csv, to the
config["gene"]["ensembl"]
andconfig["gene"]["refseq"]
fields. - You will also need the HGNC alias file: download this from https://www.genenames.org/download/custom/ using the default fields. Add the path this file to
config["gene"]["hgnc"]
.
-
crg2 uses mity to call and annotate small mitochondrial variants. Mity is not available via conda unfortunately, so it must be installed manually into an environment. The first time you run crg2 WGS, snakemake will build the conda environments specified in wrappers/mity/*/environment.yaml, which include the dependencies necessary to run mity. You can find the conda environment associated with rule
mity_call
by looking at the slurm log file for that rule after the pipeline tries and fails to run mity. Activate this environment, then manually install mity into this conda environment:pip install mitywgs==0.4.0
Replaceconfig["tools"]["mity"]
with the path to the conda environment binaries, e.g. "/srv/shared/conda_envs/crg2-conda/8a9bda62/bin/". You may also need to add a shebang at the top of the mity program (in this example, /srv/shared/conda_envs/crg2-conda/8a9bda62/bin/mity):#!/srv/shared/conda_envs/crg2-conda/8a9bda62/bin/python3
-
To generate a mobile element insertion report (WGS only), MELT installation is required and some paths must be added to config_hpf.yaml:
- Download MELT from https://melt.igs.umaryland.edu/downloads.php.
- Unpack the .tar.gz file:
tar zxf MELTvX.X.tar.gz
This should create a MELTvX.X directory in your current directory. - In config_hpf.yaml, add the path to the MELTvX.X directory to
config[“tools”][”melt”]
. - Generate a transposon reference text file containing a list of full paths to the mobile element references. For example:
ls <full_path_to>/MELTv2.2.2/me_refs/1KGP_Hg19/*_MELT.zip > transposon_file_list.txt
- In config_hpf.yaml, add the full path to the transposon reference text file to
config[“ref”][“melt_element_ref”]
. - In config_hpf.yaml, add the full path to hg19.genes.bed, containing the gene annotation for the FASTA reference to
config[“annotation”][“melt”][“genes”]
. The file can be found at:<full_path_to>/MELTv2.2.2/add_bed_files/1KGP_Hg19/hg19.genes.bed
-
Map fastqs to the human decoy genome GRCh37d5
-
Picard MarkDuplicates, but don't remove reads
-
GATK4 base recalibration
-
Remove reads mapped to decoy chromosomes
-
Call SNV's and generate gVCFs
-
Merge gVCF's and perform joint genotyping
-
Filter against GATK best practices filters
-
Decompose multiallelics, sort and uniq the filtered VCF using vt
-
Annotate using vcfanno and VEP
-
Generate a gemini db using vcf2db.py
-
Generate a cre report using cre.sh
-
Call variants using GATK, Freebayes, Platypus, and SAMTools.
-
Apply caller specific filters and retain PASS variants
-
Decompose multiallelics, sort and uniq filtered VCF using vt
-
Retain variants called by GATK or 2 other callers; Annotate caller info in VCF with INFO/CALLER and INFO/NUMCALLS.
-
Annotate using vcfanno and VEP
-
Generate a gemini db using vcf2db.py
-
Generate a cre report using cre.sh
-
Repeat the above 1-7 steps using GATK MUTECT2 variant calls to generate a mosaic variant report
-
Call SV's using Manta, Smoove and Wham
-
Merge calls using MetaSV
-
Annotate VCF using snpEff and SVScores
-
Split multi-sample VCF into individual sample VCFs
-
Generate an annotated report (produces filtered, i.e. SV called by at least two callers, and unfiltered reports)
-
Filter Manta SV calls to exclude all variants but BNDs
-
Annotate VCF using snpEff
-
Generate an annotated BND report
A. ExpansionHunter: known repeat location
- Identify repeat expansions in sample BAM/CRAMs
- Annotate repeats with disease threshold, gene name, repeat sizes from 1000Genome (mean& median)
- Generate per-family report as Excel file
B. ExpansionHunterDenovo: denovo repeats
- Identify denovo repeat in sample BAM/CRAMs
- Combine individual JSONs from current family and 1000Genomes to a multi-sample TSV
- Run DBSCAN clustering to identify outlier repeats
- Annotate with gnoMAD, OMIM, ANNOVAR
- Generate per-family report as Excel file
- Call MT variants using mity call
- Normalize MT variants using mity normalize
- Annotate MT variants using vcfanno
- Generate mitochondrial variant report
- Preprocess CRAMs
- Call MEIs in individuals
- Group and genotype MEIs across a family
- Filter out ac0 calls
- Annotate variants
- Generate MELT report
- Make a folder in a directory with sufficient space. Copy over the template files crg2/samples.tsv, crg2/units.tsv, crg2/config_hpf.yaml, crg2/dnaseq_slurm_hpf.sh, crg2/slurm_profile/slurm-config.yaml . You may need to re-copy config_hpf.yaml and slurm-config.yaml if the files were recently updated in the repo from previous crg2 runs. Note that 'slurm-config.yaml' is for submitting each rule as cluster jobs, so ignore this if not running on cluster.
mkdir NA12878
cp crg2/samples.tsv crg2/units.tsv crg2/config_hpf.yaml crg2/dnaseq_slurm_hpf.sh crg2/slurm_profile/slurm-config.yaml NA12878
cd NA12878
- Set up pipeline run
- Reconfigure 'samples.tsv', 'units.tsv' to reflect sample names and input files.
- Modify 'config_hpf.yaml':
- change
project
to refer to the family ID (here, NA12878). - set
pipeline
towes
,wgs
,annot
ormity
for exome sequences, whole genome sequences, to simply annotate a VCF respectively or to generate mitochondrial reports - inclusion of a panel bed file (
panel
) or hpofile (hpo
) will generate 2 SNV reports with all variants falling within these regions, one which includes variants in flanking regions as specified byflank
. - inclusion of a
ped
file with parents and a proband(s) will allow generation of a genome-wide de novo report ifpipeline
iswgs
. minio
refers to the path of the file system mount that backs MinIO in the CHEO HPC4Health tenancy. Exome reports (and coverage reports, if duplication percentage is >20%) will be copied to this path if specified.
- change
samples.tsv
sample
NA12878
units.tsv
sample platform fq1 fq2 bam cram
NA12878 ILLUMINA /hpf/largeprojects/ccmbio/GIAB_benchmark_datasets/hg19/WGS/NA12878/RMNISTHS_30xdownsample.bam
config_hpf.yaml
run:
project: "NA12878"
samples: samples.tsv
units: units.tsv
ped: "" # leave this string empty if there is no ped
panel: "" # three-column BED file based on hpo file; leave this string empty if there is no panel
hpo: "" # five-column TSV with HPO terms; leave this string empty is there are no hpo terms
flank: 100000
gatk: "gatk"
pipeline: "wgs" #either wes (exomes) or wgs (genomes) or annot (to annotate and produce reports for an input vcf) or mity (to generate mitochondrial reports)
minio: ""
PT_credentials: ""
...
- Activate the conda environment with Snakemake 5.10.0
(base) [dennis.kao@qlogin5 crg2]$ conda activate snakemake
(snakemake) [dennis.kao@qlogin5 crg2]$ snakemake -v
5.10.0
- Test that the pipeline will run by adding the flag "-n" to the command in dnaseq_slurm_hpf.sh and running it.
(snakemake) [dennis.kao@qlogin5 crg2]$ sh dnaseq_slurm_hpf.sh
Building DAG of jobs...
Job counts:
count jobs
1 EH
1 EH_report
1 EHdn
1 EHdn_report
1 all
1 allsnvreport
1 bam_to_cram
1 bcftools_stats
1 bgzip
25 call_variants
25 combine_calls
1 fastq_screen
1 fastqc
25 genotype_variants
2 hard_filter_calls
1 input_prep
1 manta
1 map_reads
1 mark_duplicates
1 md5
1 merge_calls
1 merge_variants
1 metasv
1 mito_vcfanno
1 mity_call
1 mity_normalise
1 mity_report
1 mosdepth
1 multiqc
1 pass
1 peddy
1 qualimap
1 recalibrate_base_qualities
1 remove_decoy
3 samtools_index
1 samtools_index_cram
1 samtools_stats
2 select_calls
1 smoove
2 snpeff
1 subset
1 svreport
2 svscore
1 tabix
1 vcf2db
1 vcfanno
1 vep
1 verifybamid2
1 vt
1 wham
1 write_version
129
[rule inputs and outputs removed for brevity]
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
...
-
Submit pipeline job to the cluster
Job scheduler: PBS on SickKids hpf
- Serialized jobs:
qsub dnaseq.pbs
- Parallelized jobs:
qsub dnaseq_cluster.pbs
Refer to
pbs_profile/cluster.md
document for detailed documentation for cluster integration.Job scheduler: Slurm
- Parallelized jobs on SickKids hpf:
sbatch dnaseq_slurm_hpf.sh
- Parallelized jobs on CHEO-RI space:
sbatch dnaseq_slurm_cheo_ri.sh
dnaseq_slurm_api.sh
is called by Stager when exome analyses are requested. It automatically sets up (exome_setup_stager.py
) and kicks off the crg2 WES pipeline via the slurm API using linked files that have been uploaded to MinIO. - Serialized jobs:
Column descriptions and more info on how variants are filtered can be found in the CCM Sharepoint.
The WES pipeline generates 5 reports:
-
wes.regular - report on coding SNVs in exonic regions
-
wes.synonymous - report on synonymous SNVs in exonic regions
-
clinical.wes.regular - same as wes.regular but with more stringent filters
-
clinical.wes.synonymous - same as wes.synonymous but with more stringent filters
-
wes.mosaic - putative mosaic variants
The WGS pipeline generates up to 11 reports:
The SNV reports can be found in the directories:
- report/coding/{PROJECT_ID}/{PROJECT_ID}.*wes*.csv
- report/panel/{PROJECT_ID}/{PROJECT_ID}.*wgs*.csv
- report/panel-flank/{PROJECT_ID}/{PROJECT_ID}.*wgs*.csv
- report/denovo/{PROJECT_ID}/{PROJECT_ID}.*wgs*.csv
The SV reports can be found in the directory (SV, unfiltered SV, and BND):
- report/sv/{PROJECT_ID}.wgs.{VER}.{DATE}.tsv
The STR reports can be found in:
- report/str/{PROJECT_ID}.EH.v1.1.{DATE}.xlsx
- report/str/{PROJECT_ID}.EHDN.{DATE}.xlsx
The mitochondrial report can be found in the directory:
- report/mitochondrial/{PROJECT_ID}.mitochondrial.report.csv
The MELT mobile element report can be found in the directory:
- report/MELT/{PROJECT_ID}.wgs.{DATE}.csv
parser_genomes.py
script can be used to automate the above process for a batch of genomes.
usage: parser_genomes.py [-h] -f FILE -s {fastq,mapped,recal,decoy_rm} -d path
Reads sample info from TSV file (-f) and creates directory (-d) necessary to start crg2 from step (-s) requested.
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Five column TAB-seperated sample info file; template sample file: crg2/sample_info.tsv
-s {fastq,mapped,recal,decoy_rm}, --step {fastq,mapped,recal,decoy_rm}
start running from this folder creation(step)
-d path, --dir path Absolute path where crg2 directory struture will be created under familyID as base directory
The script performs the following operations for each familyID present in the sample info file
- create run folder: <familyID>
- copy the following files to run folder and update settings where applicable:
- config_hpf.yaml: update run name, input type, panel and ped (if avaialble in /hpf/largeprojects/ccmbio/dennis.kao/gene_data/{HPO,Pedigrees})
- units.tsv: add sample name, and input file paths
- samples.tsv: add sample names
- dnaseq_slurm_hpf.sh: rename job
- slurm-config.yaml
- submit Snakemake job
exome_reanalysis.py
script can be used to automate the above process for exome re-analysis.
usage: exome_reanalysis.py [-h] -f FILE -d path
Reads sample info from Stager analysis csv file (-f) and creates directory (-d) necessary to run crg2.
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Analyses csv output from STAGER
-d path, --dir path Absolute path where crg2 directory structure will be created under familyID/analysisID as base directory
The script parses an analysis request csv from Stager for exome re-analyses and sets up necessary directories (under 2nd argument), files as below:
- create family analysis directory under directory provided
- copy config_cheo_ri.yaml, slurm-config.yaml and dnaseq_slurm_cheo_ri.sh from crg2 repo and replace necessary strings
- search results directory /srv/shared/hpf/exomes/results for crams from previous analyses
- create units.tsv and samples.tsv for snakemake
- submit job if all the above goes well
genome_reanalysis.py
script can be used to automate the above process for genome re-analysis.
usage: genome_reanalysis.py [-h] -f FILE -d path
Reads sample info from Stager analysis csv file (-f) and creates directory (-d) necessary to run crg2.
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Analyses csv output from STAGER
-d path, --dir path Absolute path where crg2 directory structure will be created under familyID as base directory
The script parses an analysis request csv from Stager for genome re-analyses and sets up necessary directories (under 2nd argument), files as below:
- create family analysis directory under directory provided
- copy config_cheo_ri.yaml, slurm-config.yaml and dnaseq_slurm_cheo_ri.sh from crg2 repo and replace necessary strings
- search results directories for crams from previous analyses
- create units.tsv and samples.tsv for snakemake
- submit job if all the above goes well
The following output files are not included in the main Snakefile and can be requested in snakemake
command-line.
- HPO annotated reports:
Reports from coding, panel and panel-flank can be annotated with HPO terms whenever HPO file is available with us. This is done mainly for monthly GenomeRounds. HPO annotated TSV files are created in directory:
report/hpo_annotated
using the following command:
snakemake --use-conda -s $SF --conda-prefix $CP --profile ${PBS} -p report/hpo_annotated
The reason to not include this output in Snakefile is that the output of rule allsnvreport
is a directory and hence snakemake will not check for the creation of final csv reports. So, users are required to make sure the csv reports are created in the three folders above, and then request for the output of rule annotate_hpo
separately on command-line (or append to the dnaseq_cluster.pbs).
CNV and SV comparison outputs are not yet part of the pipeline. Please follow the steps 7 & 8 in crg to generate the following three TSVs (this is also required for GenomeRounds)
- <FAMILYID>.<DATE>.cnv.withSVoverlaps.tsv
- <FAMILYID>.unfiltered.wgs.sv.<VER>.<DATE>.withCNVoverlaps.tsv
- <FAMILYID>.wgs.sv.<VER>.<DATE>.withCNVoverlaps.tsv
SNV calls from WES and WGS pipeline can be benchmarked using the GIAB dataset HG001_NA12878 (family_sample) and truth calls from NISTv3.3.2
- Copy all required files for run as here.
The inputs in
units.tsv
are downsampled for testing purposes. Edit the tsv to use the inputs from HPF:ccmmarvin_shared/validation/benchmarking/benchmark-datasets
or CHEO-RI:/srv/shared/data/benchmarking-datasets/HG001
. - Copy
crg2/benchmark_hpf.tsv
orcrg2/benchmark_cheo_ri.tsv
to current directory. Note: benchmark_x.tsv uses HG001_NA12878 as family_sample name, so you should edit the "project" name inconfig_hpf.yaml
- Edit the
config_hpf.yaml
to set "wes" or "wgs" pipeline - Edit
dnaseq_slurm_hpf.sh
to include the targetvalidation/HG001
A guideline to developing crg2 can be found in this document.