Refactor BWA (resolves #297, resolves #320, resolves #322)
Add missing uuid to config
Fix bad unpacking
Add sample URLs to logger
Add suffix option to test
Remove unused import
Move google group line to parent README
Add whitespace before map_job
Add comment explaining map_job
Fix missing trim attribute in test
Clean up logic for reference indices
Change point of contact to google group
Improve legibility with comments on their own lines
Unindent FollowOn job
Add logging for index creation
Add inline explaining alt file can't be generated
Clarify that FAI is created separately from BWA indices
Add docstring to move_file_job
Add inline comment to explain disk space
Reverse use of csv
Replace run_bwakit arguments with config
Fix imports and spacing in bwa_alignment
Add /data/ prefix
Add missing import
Move run_bwa_index and run_samtools_faidx to indexing.py
Remove unnecessary renaming
Colocate parameter logic
Add inline comments for conditionals and rg
Provide clarity for rg_line
Move required_length to common lib
Remove bad print statement
Add comments explaining conditional
Clarify samples type
Clarify file_size option
Remove attempt at humor (+5 squashed commits)
Squashed commits:
Move bwa_kit to tool library (resolves #297)
Add bwa_index and samtools_faidx to tool library
Indent Job.Runner, add sanity checks
options -> args (convention)
Print docstring help if no arguments provided
Remove config options
Change config formatting
Add tab-friendly config and manifest names
Replace old bwa_kit job with download_sample_and_align
Clean imports
Move top docstring to main()
Fix test call
Fix adam_gatk_pipeline's call to download_reference_files
Add job version of `move_files`
Update README.md
Remove deprecated launch scripts
Require only reference (resolves #320)
Change download_shared_files -> download_reference_files
Remove old reference requirements
Add single end support for BWA (resolves #322)
Edit manifest docs to include single-end
Add nargs range for single-end and paired-end
Replace parse_config job with parse_manifest function
jvivian committed Jul 7, 2016
1 parent 45ae6e2 commit 7b316cc
Showing 17 changed files with 520 additions and 360 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -22,5 +22,5 @@ The general dependencies for these pipelines are:
Our group utilizes genomics tools encapsulated within Docker containers for portability. Each of these
pipelines can be run locally on your laptop, on a baremetal cluster, or on a cloud provider.

If there are any questions please contact John Vivian (jtvivian@gmail.com).
If there are any questions please contact the Toil team at: Toil@googlegroups.com
If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.
8 changes: 4 additions & 4 deletions src/toil_scripts/adam_gatk_pipeline/align_and_call.py
@@ -178,13 +178,13 @@ def static_dag(job, uuid, rg_line, inputs):
'dir_suffix': inputs.dir_suffix}

# get head BWA alignment job function and encapsulate it
bwa = job.wrapJobFn(download_shared_files,
inputs.rg_line = rg_line
inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
bwa = job.wrapJobFn(download_reference_files,
inputs,
[uuid,
's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)],
's3://{s3_bucket}/alignment{dir_suffix}'.format(**args),
rg_line).encapsulate()
's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]).encapsulate()

# get head ADAM preprocessing job function and encapsulate it
adam_preprocess = job.wrapJobFn(static_adam_preprocessing_dag,
149 changes: 124 additions & 25 deletions src/toil_scripts/batch_alignment/README.md
@@ -1,47 +1,146 @@
## University of California, Santa Cruz Genomics Institute
### GATK-compatible Alignment
### Guide: Running the BWA Pipeline using Toil

This guide attempts to walk the user through running this pipeline from start to finish.

If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.

If there are any questions please contact John Vivian ([email protected]).
If you find any errors or corrections please feel free to make a pull request.
Feedback of any kind is appreciated.

## Overview

This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline.
A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci).
Fastqs are aligned to create a BAM that is compatible with GATK.

## Installation

Toil-scripts is now pip installable! Use `pip install toil-scripts` for the version built against stable Toil,
or `pip install --pre toil-scripts` for the cutting-edge development version.

Type `toil-bwa` to get a basic help menu and instructions.

To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv:

- `virtualenv ~/toil-scripts`
- `source ~/toil-scripts/bin/activate`
- `pip install toil`
- `pip install toil-scripts`

If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos),
use virtualenv's `--system-site-packages` flag.

## Dependencies

This pipeline has been tested on Ubuntu 14.04, but should also run on other unix based systems. `apt-get` and `pip`
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have sudo
privileges you will need to build these tools from source, or bug a sysadmin (they don't mind).
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have `sudo`
privileges you will need to build these tools from source, or bug a sysadmin about how to get them.

#### General Dependencies

1. Python 2.7
2. Curl apt-get install curl
3. Docker http://docs.docker.com/engine/installation/
2. Curl apt-get install curl
3. Docker http://docs.docker.com/engine/installation/

#### Python Dependencies

1. Toil pip install toil
2. S3AM pip install --pre s3am (optional, for upload of BAMFILE to S3)
1. Toil pip install toil
2. S3AM pip install --pre s3am (optional, needed for uploading output to S3)

## Inputs

The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a
reference genome. The pipeline can be sped up by specifying URLs for the reference index files, which are generated
with `bwa index` and `samtools faidx`.
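
As a rough illustration of the paragraph above, the sketch below checks which index URLs are absent from a parsed config and would therefore have to be generated. The helper name `missing_indices` and the dict-based config are assumptions made for this example, not the pipeline's actual code; the key names mirror the optional entries in the example config, and the FAI file is tracked separately because it comes from `samtools faidx` rather than `bwa index`.

```python
# Hypothetical helper, not part of toil-scripts: report which reference
# index files are missing from a config and would need to be generated.
BWA_INDEX_KEYS = ('amb', 'ann', 'bwt', 'pac', 'sa')  # produced by `bwa index`


def missing_indices(config):
    """Return (missing_bwa_keys, fai_missing) for a dict-like config.

    A key counts as missing when it is absent, None, or an empty string,
    which is how a blank optional value in the config would look once parsed.
    """
    missing_bwa = [key for key in BWA_INDEX_KEYS if not config.get(key)]
    fai_missing = not config.get('fai')  # produced by `samtools faidx`
    return missing_bwa, fai_missing


example_config = {
    'ref': 's3://cgl-pipeline-inputs/alignment/hg19.fa',
    'amb': 's3://cgl-pipeline-inputs/alignment/hg19.fa.amb',
    'ann': None,
    'fai': '',
}
missing_bwa, fai_missing = missing_indices(example_config)
print(missing_bwa)
print(fai_missing)
```

Supplying all six index URLs would leave both results empty or false, which is the case where the pipeline can skip index creation.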

## General Usage

## Output
1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory.
2. Parameterize the pipeline by editing the config.
3. Fill in the manifest with information pertaining to your samples.
4. Type `toil-bwa run [jobStore]` to execute the pipeline.

This pipeline produces a BAMFILE for a given sample.
## Example Commands

## Running / Help
Run sample(s) locally using the manifest
1. `toil-bwa generate`
2. Fill in config and manifest
3. `toil-bwa run ./example-jobstore`

It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline.
It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified.
To run a pipeline after dependencies have been installed, simply:
Toil options can be appended to `toil-bwa run`, for example:
`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data`

* `git clone https://github.com/BD2KGenomics/toil-scripts`
* `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh`
For a complete list of Toil options, just type `toil-bwa run -h`

Run a variety of samples locally
1. `toil-bwa generate-config`
2. Fill in config
3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \
test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz`
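
The `--sample` form above takes a UUID followed by one URL (single-end) or two URLs (paired-end). The sketch below shows one way such a specification could be validated before a run. The function `parse_sample` is hypothetical and is not the pipeline's own parser; the accepted schemes are taken from the URL forms listed in the example config.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

# URL schemes listed in the example config; treated here as an assumption.
ALLOWED_SCHEMES = ('http', 'file', 's3', 'gnos')


def parse_sample(fields):
    """Split [uuid, url] or [uuid, url1, url2] into (uuid, urls).

    Hypothetical validator: one URL means single-end, two mean paired-end.
    """
    if len(fields) not in (2, 3):
        raise ValueError('Expected a UUID followed by one or two URLs, '
                         'got {} field(s)'.format(len(fields)))
    uuid, urls = fields[0], list(fields[1:])
    for url in urls:
        scheme = urlparse(url).scheme
        if scheme not in ALLOWED_SCHEMES:
            raise ValueError('Unsupported URL scheme {!r} in {!r}'.format(scheme, url))
    return uuid, urls


uuid, urls = parse_sample(['test-uuid',
                           'file:///full/path/to/read1.fq.gz',
                           'file:///full/path/to/read2.fq.gz'])
print(uuid, len(urls))
```

Checking the field count up front gives a clearer error than letting a malformed sample fail later during download.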

## Example Config

```
# BWA Alignment Pipeline configuration file
# This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
# Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run"
# URLs can take the form: http://, file://, s3://, gnos://.
# Comments (beginning with #) do not need to be removed. Optional parameters may be left blank
##############################################################################################################
# Required: Reference fasta file
ref: s3://cgl-pipeline-inputs/alignment/hg19.fa
# Required: Output location of sample. Can be full path to a directory or an s3:// URL
output-dir: /data/
# Required: The library entry to go in the BAM read group.
library: Illumina
# Required: Platform to put in the read group
platform: Illumina
# Required: Program Unit for BAM header. Required for use with GATK.
program_unit: 12345
# Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G
file-size: 50G
# Optional: If true, sorts bam
sort: True
# Optional. If true, trims adapters
trim: false
# Optional: Reference fasta file (amb) -- if not present will be generated
amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb
# Optional: Reference fasta file (ann) -- If not present will be generated
ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann
# Optional: Reference fasta file (bwt) -- If not present will be generated
bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt
# Optional: Reference fasta file (pac) -- If not present will be generated
pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac
# Optional: Reference fasta file (sa) -- If not present will be generated
sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa
# Optional: Reference fasta file (fai) -- If not present will be generated
fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai
# Optional: (string) Path to Key File for SSE-C Encryption
ssec:
# Optional: Use instead of library, program_unit, and platform.
rg-line:
# Optional: Alternate file for reference build (alt). Necessary for alt aware alignment
alt:
# Optional: If true, runs the pipeline in mock mode, generating a fake output bam
mock-mode:
```
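
The `file-size` value in the config above is a number followed by a base-10 unit letter (`K`, `M`, `G`, or `T`). A minimal sketch of how such a value could be converted to bytes is shown below; the function name `human_to_bytes` is an assumption for this example and may not match the helper the pipeline uses internally.

```python
import re

# Base-10 multipliers, matching the "(base-10) [TGMK]" note in the config.
UNITS = {'K': 10 ** 3, 'M': 10 ** 6, 'G': 10 ** 9, 'T': 10 ** 12}


def human_to_bytes(size):
    """Convert a string such as '10M' or '150G' to a number of bytes.

    Hypothetical helper: accepts a number followed by one of K, M, G, T.
    """
    match = re.match(r'^\s*(\d+(?:\.\d+)?)\s*([KMGT])\s*$', str(size), re.IGNORECASE)
    if match is None:
        raise ValueError('Expected a number followed by K, M, G or T, '
                         'got {!r}'.format(size))
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.upper()])


print(human_to_bytes('50G'))
```

Because the estimate only needs to be approximate, rounding down to a whole number of bytes is sufficient here.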

Due to PYTHONPATH issues, help can be found by typing:
## Distributed Run

* `cd toil-scripts/src`
* `python -m toil_scripts.batch_alignment.bwa_alignment --help`

To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning,
then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050`
to use the AWS job store and mesos batch system.