Per Unneberg edited this page Nov 27, 2013 · 7 revisions

NB: the haloplex pipeline makefiles are tailored for data generated at SciLife, Stockholm. In practice this means that the pipeline expects data to be organized as follows (example data is available from the repository https://github.com/percyfal/ngs.test.data):

|-- P001_101_index3
|   |-- 120924_AC003CCCXX
|   |   |-- P001_101_index3_TGACCA_L001_R1_001.fastq.gz
|   |   |-- P001_101_index3_TGACCA_L001_R2_001.fastq.gz
|   |   `-- SampleSheet.csv
|   |-- 121015_BB002BBBXX
|   |   |-- P001_101_index3_TGACCA_L001_R1_001.fastq.gz
|   |   |-- P001_101_index3_TGACCA_L001_R2_001.fastq.gz
|   |   `-- SampleSheet.csv
|-- P001_102_index6
|   `-- 120924_AC003CCCXX
|       |-- P001_102_index6_ACAGTG_L002_R1_001.fastq.gz
|       |-- P001_102_index6_ACAGTG_L002_R2_001.fastq.gz
|       `-- SampleSheet.csv

If your data is organized differently, you can restructure it to comply with this directory layout, or use the options that govern how biomake searches for input files. If need be, you can also name the locations of the input files explicitly.
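The file names in the layout above encode the sample name, the index sequence, the lane and the read number. As a rough illustration of that convention, the fields could be split apart in plain shell as sketched here. The helper name parse_fastq_name is hypothetical and not part of biomake, and the sketch assumes every file name ends in _001.fastq.gz as in the example data.

```shell
# Hypothetical helper (not part of biomake): split a fastq file name of the
# form <sample>_<index>_L<lane>_R<read>_001.fastq.gz into its fields.
parse_fastq_name() {
    name=$(basename "$1" .fastq.gz)   # e.g. P001_101_index3_TGACCA_L001_R1_001
    rest=${name%_*}                   # drop the trailing _001
    read_no=${rest##*_}               # R1 or R2
    rest=${rest%_*}
    lane=${rest##*_}                  # e.g. L001
    rest=${rest%_*}
    index_seq=${rest##*_}             # e.g. TGACCA
    sample=${rest%_*}                 # e.g. P001_101_index3
    printf '%s %s %s %s\n' "$sample" "$index_seq" "$lane" "$read_no"
}

parse_fastq_name P001_101_index3/120924_AC003CCCXX/P001_101_index3_TGACCA_L001_R1_001.fastq.gz
```

A check like this can be handy when restructuring data from another source, since it shows quickly whether a file name follows the expected pattern.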

Instructions

Copy the file example/Makefile.pipeline.halo to the root directory where the project data resides and rename it (or link it) to Makefile. The name Makefile is required for the auto-generated batch submission scripts to work, since they invoke make without the '-f' option. Uncomment the relevant sections and set the variables. See the included makefile (Makefile.halo) for further options.
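The copy-and-link step could look roughly like the following sketch. The function name setup_makefile is made up for illustration, and it assumes that the example directory sits at the top level of your biomake checkout; adjust the paths if your checkout differs.

```shell
# Sketch only (setup_makefile is a hypothetical helper):
#   $1 = assumed location of the biomake checkout
#   $2 = project root directory holding the sample directories
setup_makefile() {
    biomake_dir=$1
    project_dir=$2
    # Keep the copy under its original name and link it to Makefile, so that
    # a plain "make" (without -f) picks it up.
    cp "$biomake_dir/example/Makefile.pipeline.halo" "$project_dir/Makefile.pipeline.halo" &&
        ln -sf Makefile.pipeline.halo "$project_dir/Makefile"
}

# Usage: setup_makefile /path/to/biomake /path/to/project
```

Linking rather than renaming keeps it obvious which template the project makefile came from.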

The pipeline currently does the following:

  1. makes flowcell targets *.sort.rg.bam (phony target flowcells)

    1. adapter trimming
    2. resyncing mates
    3. alignment with bwa mem
    4. sorting and adding of read group information
  2. makes sample targets *.sort.merge.realign.recal.clip.bam (phony target samples)

    1. calls raw genotypes to identify realignment target regions
    2. indel realignment
    3. base recalibration
    4. read clipping
  3. merges data and makes target halo.filtered.eval_metrics (phony target halo)

    1. merges samples to all.bam
    2. genotyping
    3. variant filtration with Haloplex-specific hard filters
    4. variant evaluation
  4. calculates four picard metrics (alignment, duplication, insert size, and hybrid selection) on flowcell- and sample-level bam files. For convenience, the metrics are summarized in four files (*metrics.txt). Various QC metrics are also plotted (to files *metrics.pdf) to make potentially problematic samples easier to identify.

TODO: fastqc

A minimal makefile

A minimal makefile could look like this:

#-*- makefile -*-

# Makefile variable: make sure to use make version >=3.82
MAKE=/path/to/make-3.82/make

# SLURM settings - for submission with sbatch 
# NB: account must be set, all other variables have defaults which may
# or may not do what you want
SLURM_ACCOUNT=Account_name
SLURM_WORKDIR=/path/to/inputdata
SLURM_TIME=40:00:00
SLURM_PARTITION=node
SLURM_MAILUSER=me@mail.com
SLURM_MAKE_J=16
SLURM_MODULES=samtools/0.1.19 picard/1.92 cutadapt/1.2.1 bwa/0.7.5a GATK/2.7.2
SLURM_MAKE_OPTIONS=-k
SLURM_CLUSTER=clustername

SAMPLE_PREFIX=P001
# Alternatively, set samples manually
# SAMPLES=
READ1_LABEL=_1
READ2_LABEL=_2

# If computer has enough memory (>6GB/core), parallelize via make
HALO_BATCH_SIZE=16
THREADS=1

# Search paths for GATK and picard needed
GATK_HOME=/sw/apps/bioinfo/GATK/2.7.2
PICARD_HOME=/sw/apps/bioinfo/picard/1.92/kalkyl

TARGET_REGIONS=/path/to/targets.interval_list
BAIT_REGIONS=/path/to/baits.interval_list

DBSNP=/path/to/dbsnp_137.vcf
REF=/path/to/hg19.fa
BWA_REF=/path/to/bwa/hg19.fa

# Additional prerequisite resyncMates.pl
# Script located at https://github.com/percyfal/ratatosk.ext.scilife/blob/master/scripts/resyncMates.pl
RESYNCMATES=/path/to/resyncMates.pl

# Annovar
ANNOVAR_HOME=/path/to/annovar
ANNOVAR_TABLE_OPTIONS=--nastring NA --protocol refGene,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,avsift,ljb_all -operation g,r,r,f,f,f,f,f --otherinfo

# Include the haloplex pipeline Makefile. Look in this file for further options. 
include /path/to/biomake/Makefile.halo
# Include variation make file for annovar and friends
include /path/to/biomake/Makefile.variation
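Since the makefile above asks for make version 3.82 or later, it can be worth checking the make binary before submitting jobs. The following is a minimal sketch in POSIX shell; the helper names version_ge and check_make are hypothetical, and the parsing assumes GNU make's usual first output line of the form "GNU Make X.Y".

```shell
# Return success if dotted version $1 is at least $2 (numeric fields only).
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}; y=${b%%.*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        # Strip the leading field; becomes empty once no dot remains.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Return success if the make binary given as $1 reports version >= 3.82.
check_make() {
    v=$("$1" --version 2>/dev/null | sed -n '1s/^GNU Make \([0-9.]*\).*/\1/p')
    [ -n "$v" ] && version_ge "$v" 3.82
}

# Usage: check_make /path/to/make-3.82/make || echo "make is too old" >&2
```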

Running the pipeline

Once the makefile has been set up properly, running the pipeline is just a matter of running three commands.

1. make samples targets

Run either 1a (sbatch submission) or 1b.

1a

The halo-sbatch target is a special target that groups samples into batches and creates sbatch files that make the samples target for each sample group. The size of the batch can be modified via the HALO_BATCH_SIZE variable. Running

make halo-sbatch

will partition the samples into batches of HALO_BATCH_SIZE samples (16 in the minimal makefile above) and run make samples on each batch.
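To make the batching idea concrete, here is a small stand-alone sketch of how a list of samples is split into groups of a fixed size. It is not the way halo-sbatch is implemented in biomake; batch_samples is a hypothetical helper, and P001_103_index8 is a made-up sample name added only to show a partially filled last batch.

```shell
# Print the sample names given as arguments 2..n in batches of size $1,
# one batch per line. Illustration only, not biomake code.
batch_samples() {
    size=$1; shift
    n=0 line=
    for s in "$@"; do
        line="${line:+$line }$s"
        n=$((n + 1))
        if [ "$n" -eq "$size" ]; then
            printf '%s\n' "$line"
            n=0 line=
        fi
    done
    # Emit the last, possibly smaller, batch.
    [ -z "$line" ] || printf '%s\n' "$line"
}

batch_samples 2 P001_101_index3 P001_102_index6 P001_103_index8
```

Each output line corresponds to one group of samples that would be handled by a single batch job.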

1b

Alternatively, you can make the samples targets interactively:

make samples

2. Update samples timestamps

The previous command will end by removing intermediate files in the flowcell directories. This modifies timestamps, so make believes that there are source files newer than the freshly generated targets. Hence, before proceeding, we need to update the timestamps of the samples targets:

make -t samples

This runs a touch command on the samples targets.
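The timestamp logic can be reproduced with a few throwaway files. The sketch below uses hypothetical file names (input.bam, sample.bam) in a temporary directory and only illustrates the general rule make follows: a target is considered out of date when a prerequisite has a newer modification time, and touching the target resolves that.

```shell
# True if file $1 has a newer modification time than file $2.
is_stale() { [ -n "$(find "$1" -newer "$2")" ]; }

demo=$(mktemp -d)
touch -t 201311270900 "$demo/input.bam"    # prerequisite written at 09:00
touch -t 201311271000 "$demo/sample.bam"   # target built at 10:00

# A later change on the prerequisite side makes the target look stale.
touch -t 201311271100 "$demo/input.bam"
is_stale "$demo/input.bam" "$demo/sample.bam" && echo "target looks out of date"

# make -t does the equivalent of this touch for each target.
touch "$demo/sample.bam"
is_stale "$demo/input.bam" "$demo/sample.bam" || echo "target is up to date"
```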

3. Run variant calling and evaluation

The final step merges the samples, performs variant calling, filtering and evaluation.

make halo

or if you want to submit an sbatch job

make halo.sbatch

The end result is a file called halo.filtered.vcf.

Downstream analyses

1. QC evaluation

TODO.

2. Annovar

There are several downstream analyses that could be performed. Makefile.variation contains recipes for some of these analyses. I'll give the example for ANNOVAR here.

Set the following two variables

ANNOVAR_HOME=/path/to/annovar
ANNOVAR_TABLE_OPTIONS=--nastring NA --protocol refGene,phastConsElements46way,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp137,avsift,ljb_all -operation g,r,r,f,f,f,f,f --otherinfo

and include the makefile

include /path/to/biomake/Makefile.variation

Now, by running the following command

make halo.filtered.avinput.hg19_multianno.txt

annovar will run on halo.filtered.vcf.

If you have a local annovar installation, there is a shorthand for installing the necessary databases:

make -n annovar-setupdb -f /path/to/biomake/Makefile.variation

prints the commands that would be run (NB: here ANNOVAR_HOME=./ ; drop the '-n' flag to actually execute them):

./annotate_variation.pl -buildver hg19 -downdb dgv ./humandb
./annotate_variation.pl -buildver hg19 -downdb genomicSuperDups ./humandb
./annotate_variation.pl -buildver hg19 -downdb gwascatalog ./humandb
./annotate_variation.pl -buildver hg19 -downdb tfbs ./humandb
./annotate_variation.pl -buildver hg19 -downdb wgEncodeRegTfbsClustered ./humandb
./annotate_variation.pl -buildver hg19 -downdb wgEncodeRegDnaseClustered ./humandb
./annotate_variation.pl -buildver hg19 -downdb phastConsElements46way ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar 1000g2012apr ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar cosmic64 ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp6500si_all ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar esp6500si_ea ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ljb_all ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar snp137 ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene ./humandb
./annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsift ./humandb

You can modify the build version via the ANGSD_BUILDVER variable. Execute

make -f /path/to/biomake/Makefile.variation variation-settings

for a list of variables.

Example commands

1. Make entire pipeline

Run with the '-n' flag to print the commands without executing them:

make -n halo

Make halo target with one of the following commands:

make halo
make halo.filtered.eval_metrics

2. Make flowcell targets

make flowcells

3. Make sample targets

make samples

4. Clean up

The clean target removes everything.

make clean

5. Running a specific sample

make halo SAMPLES=P001_101_index3

6. Running a specific flowcell

TODO.