Haloplex
NB: the haloplex pipeline makefiles are tailored for use with data generated at SciLife, Stockholm. Processing is easiest when the data are organized as follows (example data are available from the repository https://github.com/percyfal/ngs.test.data):
|-- P001_101_index3
|   |-- 120924_AC003CCCXX
|   |   |-- P001_101_index3_TGACCA_L001_R1_001.fastq.gz
|   |   |-- P001_101_index3_TGACCA_L001_R2_001.fastq.gz
|   |   `-- SampleSheet.csv
|   |-- 121015_BB002BBBXX
|   |   |-- P001_101_index3_TGACCA_L001_R1_001.fastq.gz
|   |   |-- P001_101_index3_TGACCA_L001_R2_001.fastq.gz
|   |   `-- SampleSheet.csv
|-- P001_102_index6
|   `-- 120924_AC003CCCXX
|       |-- P001_102_index6_ACAGTG_L002_R1_001.fastq.gz
|       |-- P001_102_index6_ACAGTG_L002_R2_001.fastq.gz
|       `-- SampleSheet.csv
If your data are organized differently, you can restructure them to comply with this directory layout. There are also options that govern how biomake searches for input files and, if need be, you can name the locations of the input files explicitly.
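If the fastq files sit in one flat directory, a small shell function can symlink them into the layout shown above. The following is a minimal sketch and not part of biomake: the function name `link_into_layout` is hypothetical, the flowcell directory name must be supplied by hand because it is not encoded in the file names, and the file name pattern `<sample>_<index>_L<lane>_R<read>_<part>.fastq.gz` is an assumption inferred from the example tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of biomake): symlink flat fastq files into
# <project_root>/<sample>/<flowcell>/, the layout shown in the tree above.
# Assumes file names follow <sample>_<index>_L<lane>_R<read>_<part>.fastq.gz.
link_into_layout() {
    local root=$1 flowcell=$2
    shift 2
    local f base sample dir
    for f in "$@"; do
        base=$(basename "$f")
        # Strip the trailing _<index>_L<lane>_R<read>_<part>.fastq.gz
        sample=${base%_*_L*_R*_*.fastq.gz}
        if [ "$sample" = "$base" ]; then
            echo "skipping $f: unexpected file name" >&2
            continue
        fi
        mkdir -p "$root/$sample/$flowcell"
        # Link by absolute path so the link resolves from the new directory
        dir=$(cd "$(dirname "$f")" && pwd)
        ln -sf "$dir/$base" "$root/$sample/$flowcell/$base"
    done
}

# Example:
#   link_into_layout /path/to/project 120924_AC003CCCXX /path/to/flat/*.fastq.gz
```

Symlinks leave the original files untouched, so the layout can be removed and rebuilt if the sample or flowcell names turn out to be wrong.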
Copy the file example/Makefile.pipeline.halo to the root directory where the project data reside and rename it (or link it) to Makefile. The name Makefile is necessary for the auto-generated batch submission scripts to work, since they invoke make without the '-f' option. Uncomment the relevant sections and set the variables. See the included Makefile (Makefile.halo) for further options.
The pipeline currently does the following:
- makes flowcell targets *.sort.rg.bam (phony target flowcells)
  - adapter trimming
  - resyncing mates
  - alignment with bwa mem
  - sorting and adding of read group information
- makes sample targets *.sort.merge.realign.recal.clip.bam (phony target samples)
  - calls raw genotypes to identify realignment target regions
  - indel realignment
  - base recalibration
  - read clipping
- merges data and makes target all.filtered.eval_metrics (phony target all)
  - merges samples to all.bam
  - genotyping
  - variant filtration with Haloplex-specific hard filters
  - variant evaluation
- calculates four picard metrics (alignment, duplication, insertion, and hybrid selection) on flowcell- and sample-level bam files. For convenience, the metrics are summarized in four files (*metrics.txt). Various QC metrics are also plotted (to files *metrics.pdf) for easier identification of potentially problematic samples.
TODO: fastqc
A minimal makefile could look like this:
#-*- makefile -*-
# Makefile variable: make sure to use make version >=3.82
MAKE=/path/to/make-3.82/make
# SLURM settings - for submission with sbatch
# NB: account must be set, all other variables have defaults which may
# or may not do what you want
SLURM_ACCOUNT=Account_name
SLURM_WORKDIR=/path/to/inputdata
SLURM_TIME=40:00:00
SLURM_PARTITION=node
SLURM_MAILUSER=me@mail.com
SLURM_MAKE_J=16
SLURM_MODULES=samtools/0.1.19 picard/1.92 cutadapt/1.2.1 bwa/0.7.5a GATK/2.7.2
SLURM_MAKE_OPTIONS=-k
SLURM_CLUSTER=clustername
SAMPLE_PREFIX=P001
# Alternatively, set samples manually
# SAMPLES=
READ1_LABEL=_1
READ2_LABEL=_2
# If computer has enough memory (>6GB/core), parallelize via make
HALO_BATCH_SIZE=16
THREADS=1
# Search paths for GATK and picard needed
GATK_HOME=/sw/apps/bioinfo/GATK/2.7.2
PICARD_HOME=/sw/apps/bioinfo/picard/1.92/kalkyl
TARGET_REGIONS=/path/to/targets.interval_list
BAIT_REGIONS=/path/to/baits.interval_list
DBSNP=/path/to/dbsnp_137.vcf
REF=/path/to/hg19.fa
BWA_REF=/path/to/bwa/hg19.fa
# Additional prerequisite resyncMates.pl
# Script located at https://github.com/percyfal/ratatosk.ext.scilife/blob/master/scripts/resyncMates.pl
RESYNCMATES=/path/to/resyncMates.pl
# Annovar
ANNOVAR_HOME=/path/to/annovar
ANNOVAR_TABLE_OPTIONS=--nastring NA --protocol refGene,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,avsift,ljb_all -operation g,r,r,f,f,f,f,f --otherinfo
# Include the haloplex pipeline Makefile. Look in this file for further options.
include /path/to/biomake/Makefile.halo
# Include variation make file for annovar and friends
include /path/to/biomake/Makefile.variation
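Since the makefile requires make version 3.82 or later, it can be worth checking the binary that MAKE points to before submitting jobs. The sketch below is one way to do that from the shell. The function names `version_ge` and `make_version` are hypothetical helpers, not part of biomake, and the comparison assumes that `sort -V` (GNU coreutils version sort) is available.

```shell
#!/usr/bin/env bash
# Return success if version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version number from the first line of `make --version`,
# e.g. "GNU Make 3.82" -> "3.82". The make binary defaults to `make`.
make_version() {
    "${1:-make}" --version | head -n 1 | sed 's/[^0-9]*\([0-9][0-9.]*\).*/\1/'
}

# Example:
#   version_ge "$(make_version /path/to/make-3.82/make)" 3.82 \
#       || echo "make is too old" >&2
```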
Once the makefile has been set up properly, running the pipeline is just a matter of running three commands.
Run either 1a (sbatch submission) or 1b.
The halo-sbatch target is a special target that groups samples into batches and creates sbatch files that make the samples target for each sample group. The size of the batch can be modified via the HALO_BATCH_SIZE variable. Running
make halo-sbatch
will partition the samples into batches of HALO_BATCH_SIZE samples (16 in the example makefile above) and run make samples on each batch.
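To get a feel for how a batch size divides up a sample list, the grouping can be imitated in the shell. This is an illustration only: `batch_samples` is a hypothetical function, and the actual halo-sbatch recipe in biomake may group samples differently.

```shell
#!/usr/bin/env bash
# Illustration only: split a list of sample names into batches of a given
# size, printing one batch per line. Not biomake's own implementation.
batch_samples() {
    local size=$1
    shift
    # xargs without a command echoes its input, at most $size items per line
    printf '%s\n' "$@" | xargs -n "$size"
}

# Example: five samples in batches of two
#   batch_samples 2 P001_101 P001_102 P001_103 P001_104 P001_105
# prints
#   P001_101 P001_102
#   P001_103 P001_104
#   P001_105
```

The last batch is simply smaller when the number of samples is not a multiple of the batch size.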
Alternatively, you can make the samples targets interactively:
make samples
The previous command ends by removing intermediate files in the flowcell directories. This changes timestamps, so make believes that there are source files newer than the freshly generated targets. Hence, before proceeding, we need to update the timestamps of the samples targets:
make -t samples
This runs a touch command on the samples targets.
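The timestamp rule behind this step can be reproduced with plain files. The sketch below mimics the single comparison make performs for one target and one prerequisite. `is_stale` is a hypothetical helper and the file names in the example are placeholders.

```shell
#!/usr/bin/env bash
# make considers a target out of date when it is missing or when a
# prerequisite has a newer modification time. is_stale mimics that
# comparison for one target/prerequisite pair.
is_stale() {
    local target=$1 prereq=$2
    [ ! -e "$target" ] || [ "$prereq" -nt "$target" ]
}

# Example with placeholder file names:
#   touch -t 202001010000 target.bam   # target built first
#   touch -t 202001020000 source.bam   # prerequisite modified afterwards
#   is_stale target.bam source.bam && echo "make would rebuild target.bam"
#   touch target.bam                   # what `make -t` does for each target
#   is_stale target.bam source.bam || echo "target.bam is up to date"
```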
The final step merges the samples, performs variant calling, filtering and evaluation.
make halo
or if you want to submit an sbatch job
make halo.sbatch
The end result is a file called halo.filtered.vcf.
TODO.
There are several downstream analyses that could be performed. Makefile.variation contains recipes for some of these analyses. I'll give the example for ANNOVAR here.
Set the following two variables
ANNOVAR_HOME=/path/to/annovar
ANNOVAR_TABLE_OPTIONS=--nastring NA --protocol refGene,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,avsift,ljb_all -operation g,r,r,f,f,f,f,f --otherinfo
and include the makefile
include /path/to/biomake/Makefile.variation
Now, by running the following command
make halo.filtered.avinput.hg19_multianno.txt
annovar will run on halo.filtered.vcf.
If you have a local annovar installation, there is a shorthand for installing the necessary databases:
make -n annovar-setupdb -f /path/to/biomake/Makefile.variation
prints the commands that would be run (NB: here ANNOVAR_HOME=./ !):
./annotate_variation.pl -buildver hg19 -downdb dgv ./humandb
./annotate_variation.pl -buildver hg19 -downdb genomicSuperDups ./humandb
./annotate_variation.pl -buildver hg19 -downdb gwascatalog ./humandb
./annotate_variation.pl -buildver hg19 -downdb tfbs ./humandb
./annotate_variation.pl -buildver hg19 -downdb wgEncodeRegTfbsClustered ./humandb
./annotate_variation.pl -buildver hg19 -downdb wgEncodeRegDnaseClustered ./humandb
./annotate_variation.pl -buildver hg19 -downdb phastConsElements46way ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2012apr ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar cosmic64 ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp6500si_all ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp6500si_ea ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_all ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp137 ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsift ./humandb
You can modify the build version via the ANGSD_BUILDVER variable. Execute
make -f /path/to/biomake/Makefile.variation variation-settings
for a list of variables.
Run with the '-n' flag to print the commands without executing them:
make -n halo
Make halo target with one of the following commands:
make halo
make halo.filtered.eval_metrics
make flowcells
make samples
The clean target removes everything.
make clean
make halo SAMPLES=P001_101_index3
TODO.