Merge pull request #324 from /issues/297-refactor-bwa
Refactor bwa (resolves #320, resolves #297, resolves #322)
hannes-ucsc authored Jul 11, 2016
2 parents fce430a + eadc6b5 commit d4bc20f
Showing 17 changed files with 520 additions and 360 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -22,5 +22,5 @@ The general dependencies for these pipelines are:
Our group utilizes genomics tools encapsulated within Docker containers for portability. Each of these
pipelines can be run locally on your laptop, on a baremetal cluster, or on a cloud provider.

If there are any questions please contact John Vivian (jtvivian@gmail.com).
If there are any questions please contact the Toil team at: Toil@googlegroups.com
If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.
8 changes: 4 additions & 4 deletions src/toil_scripts/adam_gatk_pipeline/align_and_call.py
@@ -178,13 +178,13 @@ def static_dag(job, uuid, rg_line, inputs):
'dir_suffix': inputs.dir_suffix}

# get head BWA alignment job function and encapsulate it
bwa = job.wrapJobFn(download_shared_files,
inputs.rg_line = rg_line
inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
bwa = job.wrapJobFn(download_reference_files,
inputs,
[uuid,
's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)],
's3://{s3_bucket}/alignment{dir_suffix}'.format(**args),
rg_line).encapsulate()
's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]).encapsulate()

# get head ADAM preprocessing job function and encapsulate it
adam_preprocess = job.wrapJobFn(static_adam_preprocessing_dag,
149 changes: 124 additions & 25 deletions src/toil_scripts/batch_alignment/README.md
@@ -1,47 +1,146 @@
## University of California, Santa Cruz Genomics Institute
### GATK-compatible Alignment
### Guide: Running the BWA Pipeline using Toil

This guide attempts to walk the user through running this pipeline from start to finish.

If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.

If there are any questions please contact John Vivian (jtvivian@gmail.com).
If you find any errors or corrections please feel free to make a pull request.
Feedback of any kind is appreciated.

## Overview

This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline.
A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci).
Fastqs are aligned to create a BAM that is compatible with GATK.

## Installation

Toil-scripts is now pip installable! Use `pip install toil-scripts` for the toil-stable version
or `pip install --pre toil-scripts` for the cutting-edge development version.

Type `toil-bwa` to get a basic help menu and instructions.

To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv:

- `virtualenv ~/toil-scripts`
- `source ~/toil-scripts/bin/activate`
- `pip install toil`
- `pip install toil-scripts`

If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos),
use virtualenv's `--system-site-packages` flag.

## Dependencies

This pipeline has been tested on Ubuntu 14.04, but should also run on other Unix-based systems. `apt-get` and `pip`
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have sudo
privileges you will need to build these tools from source, or bug a sysadmin (they don't mind).
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have `sudo`
privileges you will need to build these tools from source, or bug a sysadmin about how to get them.

#### General Dependencies

1. Python 2.7
2. Curl apt-get install curl
3. Docker http://docs.docker.com/engine/installation/
2. Curl apt-get install curl
3. Docker http://docs.docker.com/engine/installation/

#### Python Dependencies

1. Toil pip install toil
2. S3AM pip install --pre s3am (optional, for upload of BAMFILE to S3)
1. Toil pip install toil
2. S3AM pip install --pre s3am (optional, needed for uploading output to S3)

## Inputs

The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a
reference genome. The pipeline can be sped up by specifying URLs for the reference index files, which are generated
with `bwa index` and `samtools faidx`.
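
As a rough illustration of which index files are meant, the sketch below derives their expected URLs from a reference URL. It assumes the index files sit next to the reference and carry the suffixes that `bwa index` (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) and `samtools faidx` (`.fai`) append by default. The helper name `index_urls` is hypothetical and is not part of toil-scripts.

```python
# Hypothetical helper, not part of toil-scripts: list the index files that
# would accompany a reference FASTA, assuming the default suffixes.
# `bwa index` writes .amb, .ann, .bwt, .pac and .sa; `samtools faidx` writes .fai.
INDEX_SUFFIXES = ("amb", "ann", "bwt", "pac", "sa", "fai")


def index_urls(ref_url):
    """Return a dict mapping each index suffix to its expected URL."""
    return {suffix: "{}.{}".format(ref_url, suffix) for suffix in INDEX_SUFFIXES}


if __name__ == "__main__":
    urls = index_urls("s3://cgl-pipeline-inputs/alignment/hg19.fa")
    for key in INDEX_SUFFIXES:
        print("{}: {}".format(key, urls[key]))
```

If the index files live elsewhere, their URLs have to be written out by hand instead.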

## General Usage

## Output
1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory.
2. Parameterize the pipeline by editing the config.
3. Fill in the manifest with information pertaining to your samples.
4. Type `toil-bwa run [jobStore]` to execute the pipeline.

This pipeline produces a BAMFILE for a given sample.
## Example Commands

## Running / Help
Run sample(s) locally using the manifest
1. `toil-bwa generate`
2. Fill in config and manifest
3. `toil-bwa run ./example-jobstore`

It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline.
It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified.
To run a pipeline after dependencies have been installed, simply:
Toil options can be appended to `toil-bwa run`, for example:
`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data`

* `git clone https://github.com/BD2KGenomics/toil-scripts`
* `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh`
For a complete list of Toil options, just type `toil-bwa run -h`

Run a variety of samples locally
1. `toil-bwa generate-config`
2. Fill in config
3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \
test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz`

## Example Config

```
# BWA Alignment Pipeline configuration file
# This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
# Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run"
# URLs can take the form: http://, file://, s3://, gnos://.
# Comments (beginning with #) do not need to be removed. Optional parameters may be left blank
##############################################################################################################
# Required: Reference fasta file
ref: s3://cgl-pipeline-inputs/alignment/hg19.fa
# Required: Output location of sample. Can be full path to a directory or an s3:// URL
output-dir: /data/
# Required: The library entry to go in the BAM read group.
library: Illumina
# Required: Platform to put in the read group
platform: Illumina
# Required: Program Unit for BAM header. Required for use with GATK.
program_unit: 12345
# Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G
file-size: 50G
# Optional: If true, sorts bam
sort: True
# Optional. If true, trims adapters
trim: false
# Optional: Reference fasta file (amb) -- if not present will be generated
amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb
# Optional: Reference fasta file (ann) -- If not present will be generated
ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann
# Optional: Reference fasta file (bwt) -- If not present will be generated
bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt
# Optional: Reference fasta file (pac) -- If not present will be generated
pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac
# Optional: Reference fasta file (sa) -- If not present will be generated
sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa
# Optional: Reference fasta file (fai) -- If not present will be generated
fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai
# Optional: (string) Path to Key File for SSE-C Encryption
ssec:
# Optional: Use instead of library, program_unit, and platform.
rg-line:
# Optional: Alternate file for reference build (alt). Necessary for alt aware alignment
alt:
# Optional: If true, runs the pipeline in mock mode, generating a fake output bam
mock-mode:
```
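
The `file-size` value in the config is a number followed by a base-10 unit letter. The following is a minimal sketch of how such a string could be turned into a byte count. The function name `parse_file_size` is hypothetical, and the pipeline's own parsing may accept other forms (for example lowercase letters or decimals) that this sketch rejects.

```python
import re

# Base-10 multipliers for the unit letters named in the config comment.
_UNITS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_file_size(text):
    """Convert a string such as '50G' or '10M' into a number of bytes.

    Hypothetical helper for illustration only; it accepts a whole number
    followed by exactly one of K, M, G or T.
    """
    match = re.match(r"^(\d+)([KMGT])$", text.strip())
    if match is None:
        raise ValueError("expected a number followed by K, M, G or T, got %r" % text)
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    print(parse_file_size("50G"))
```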

Due to PYTHONPATH issues, help can be found by typing:
## Distributed Run

* `cd toil-scripts/src`
* `python -m toil_scripts.batch_alignment.bwa_alignment --help`

To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning,
then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050`
to use the AWS job store and mesos batch system.
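
The two job store arguments used in this guide differ only in form: `./example-jobstore` is a local path, while `aws:us-west-2:example-jobstore-bucket` names a provider, a region, and a job store name separated by colons. The sketch below only illustrates that structure. The function `split_job_store` is hypothetical, handles just these two forms, and is not how Toil itself parses the argument.

```python
def split_job_store(locator):
    """Split a job store argument into its visible parts.

    Illustrative only: recognises 'aws:<region>:<name>' and treats
    anything else as a local path.
    """
    parts = locator.split(":")
    if len(parts) == 3 and parts[0] == "aws":
        return {"type": "aws", "region": parts[1], "name": parts[2]}
    return {"type": "file", "path": locator}


if __name__ == "__main__":
    print(split_job_store("aws:us-west-2:example-jobstore-bucket"))
    print(split_job_store("./example-jobstore"))
```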