Refactor bwa (resolves #320, resolves #297, resolves #322) #324

Merged 1 commit on Jul 11, 2016
2 changes: 1 addition & 1 deletion README.md
@@ -22,5 +22,5 @@ The general dependencies for these pipelines are:
Our group utilizes genomics tools encapsulated within Docker containers for portability. Each of these
pipelines can be run locally on your laptop, on a baremetal cluster, or on a cloud provider.

- If there are any questions please contact John Vivian (jtvivian@gmail.com).
+ If there are any questions please contact the Toil team at: Toil@googlegroups.com
If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.
8 changes: 4 additions & 4 deletions src/toil_scripts/adam_gatk_pipeline/align_and_call.py
@@ -178,13 +178,13 @@ def static_dag(job, uuid, rg_line, inputs):
'dir_suffix': inputs.dir_suffix}

# get head BWA alignment job function and encapsulate it
- bwa = job.wrapJobFn(download_shared_files,
+ inputs.rg_line = rg_line
+ inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
+ bwa = job.wrapJobFn(download_reference_files,
  inputs,
  [uuid,
  's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
- 's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)],
- 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args),
- rg_line).encapsulate()
+ 's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]).encapsulate()
Contributor:
What happens to rg_line now?

Contributor:
Er. Nevermind. I see that rg_line is now being passed through inputs.

Contributor:
Why isn't rg_line just in inputs to start with? If we're refactoring this, mightn't we factor out passing rg_line as a parameter?

Collaborator, Author:
> Why isn't rg_line just in inputs to start with?

Looking at the code, I'm not sure if it's even defined anywhere in align_and_call.

rg_line gets set from uuid_items[1], which is taken from uuid_rg from uuid_list, but uuid_list is parsed from the manifest which has no entry for read group... (confused yet?).

Contributor:
Ace!
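
To illustrate the change discussed in this thread, here is a minimal sketch of carrying `rg_line` and `output_dir` on the shared `inputs` object instead of passing them as extra positional arguments. It is not the pipeline's code: `download_reference_files` below is a hypothetical stand-in that only echoes what it receives, `build_bwa_call` is an invented helper name, and `inputs` is assumed to behave like an `argparse.Namespace`.

```python
from argparse import Namespace


def download_reference_files(inputs, sample):
    """Hypothetical stand-in for the real job function; it only reports its arguments."""
    uuid, fastq1, fastq2 = sample
    return {
        'uuid': uuid,
        'fastqs': [fastq1, fastq2],
        'rg_line': inputs.rg_line,
        'output_dir': inputs.output_dir,
    }


def build_bwa_call(inputs, uuid, rg_line, args):
    """Attach per-sample settings to `inputs`, then call the head function."""
    # Settings that used to be trailing positional arguments now travel on `inputs`.
    inputs.rg_line = rg_line
    inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
    sample = [uuid,
              's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
              's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]
    return download_reference_files(inputs, sample)


if __name__ == '__main__':
    # All values here are made up for illustration.
    args = {'uuid': 'sample-1', 's3_bucket': 'example-bucket',
            'sequence_dir': 'sequence', 'dir_suffix': '-test'}
    print(build_bwa_call(Namespace(), 'sample-1', '@RG\\tID:sample-1', args))
```

One design consequence worth noting: because `inputs` is mutated in place, reusing the same object across several samples would overwrite `rg_line` each time, so copying it per sample may be the safer choice.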


# get head ADAM preprocessing job function and encapsulate it
adam_preprocess = job.wrapJobFn(static_adam_preprocessing_dag,
149 changes: 124 additions & 25 deletions src/toil_scripts/batch_alignment/README.md
@@ -1,47 +1,146 @@
## University of California, Santa Cruz Genomics Institute
- ### GATK-compatible Alignment
+ ### Guide: Running the BWA Pipeline using Toil

This guide attempts to walk the user through running this pipeline from start to finish.

- If you find any errors or corrections please feel free to make a pull request. Feedback of any kind is appreciated.

- If there are any questions please contact John Vivian ([email protected]).
+ If you find any errors or corrections please feel free to make a pull request.
+ Feedback of any kind is appreciated.

## Overview

- This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline.
- A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci).
+ Fastqs are aligned to create a BAM that is compatible with GATK.

## Installation

Toil-scripts is now pip installable! `pip install toil-scripts` for a toil-stable version
or `pip install --pre toil-scripts` for cutting edge development version.

Type: `toil-bwa` to get basic help menu and instructions

To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv:

- `virtualenv ~/toil-scripts`
- `source ~/toil-scripts/bin/activate`
- `pip install toil`
- `pip install toil-scripts`

If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos),
use virtualenv's `--system-site-packages` flag.

## Dependencies

This pipeline has been tested on Ubuntu 14.04, but should also run on other unix based systems. `apt-get` and `pip`
- often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have sudo
- privileges you will need to build these tools from source, or bug a sysadmin (they don't mind).
+ often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have `sudo`
+ privileges you will need to build these tools from source, or bug a sysadmin about how to get them.

#### General Dependencies

1. Python 2.7
- 2. Curl apt-get install curl
- 3. Docker http://docs.docker.com/engine/installation/
+ 2. Curl apt-get install curl
+ 3. Docker http://docs.docker.com/engine/installation/

#### Python Dependencies

- 1. Toil pip install toil
- 2. S3AM pip install --pre s3am (optional, for upload of BAMFILE to S3)
+ 1. Toil pip install toil
+ 2. S3AM pip install --pre s3am (optional, needed for uploading output to S3)

## Inputs

The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a
reference genome. The pipeline can be sped up by specifying URLs for the reference index files, which are generated
with `bwa index` and `samtools faidx`.

## General Usage

- ## Output
1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory.
2. Parameterize the pipeline by editing the config.
3. Fill in the manifest with information pertaining to your samples.
4. Type `toil-bwa run [jobStore]` to execute the pipeline.

- This pipeline produces a BAMFILE for a given sample.
## Example Commands

- ## Running / Help
Run sample(s) locally using the manifest
1. `toil-bwa generate`
2. Fill in config and manifest
3. `toil-bwa run ./example-jobstore`

- It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline.
- It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified.
- To run a pipeline after dependencies have been installed, simply:
Toil options can be appended to `toil-bwa run`, for example:
`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data`

- * `git clone https://github.com/BD2KGenomics/toil-scripts`
- * `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh`
For a complete list of Toil options, just type `toil-bwa run -h`

Run a variety of samples locally
1. `toil-bwa generate-config`
2. Fill in config
3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \
test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz`
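
The `--sample` argument above takes a UUID followed by two FASTQ URLs. A minimal sketch of how such a triple could be checked before launching a run is shown below. The function name `parse_sample` and the validation rule are assumptions for illustration, not the pipeline's own code; the scheme list mirrors the URL forms named in the example config comments (`http://`, `file://`, `s3://`, `gnos://`). The sketch uses Python 3's `urllib.parse`, whereas the pipeline itself targets Python 2.7.

```python
from urllib.parse import urlparse

# URL schemes listed in the example config's comments.
SUPPORTED_SCHEMES = {'http', 'file', 's3', 'gnos'}


def parse_sample(tokens):
    """Split ['uuid', 'url1', 'url2'] into (uuid, [url1, url2]) after checking the URLs."""
    if len(tokens) != 3:
        raise ValueError('expected a UUID and two FASTQ URLs, got %d items' % len(tokens))
    uuid, urls = tokens[0], list(tokens[1:])
    for url in urls:
        # A bare path such as /data/read1.fq.gz has an empty scheme and is rejected.
        if urlparse(url).scheme not in SUPPORTED_SCHEMES:
            raise ValueError('unsupported or missing URL scheme in %r' % url)
    return uuid, urls


if __name__ == '__main__':
    print(parse_sample(['test-uuid',
                        'file:///full/path/to/read1.fq.gz',
                        'file:///full/path/to/read2.fq.gz']))
```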

## Example Config

```
# BWA Alignment Pipeline configuration file
Contributor:
Put in verbatim block? ```

Collaborator, Author:
I indented it because a ``` block wasn't working — or at least not formatting correctly in pycharm. I think the line of hashes screwed something up.

Contributor:
Ah, ok!

# This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
# Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run"
# URLs can take the form: http://, file://, s3://, gnos://.
# Comments (beginning with #) do not need to be removed. Optional parameters may be left blank
##############################################################################################################
# Required: Reference fasta file
ref: s3://cgl-pipeline-inputs/alignment/hg19.fa

# Required: Output location of sample. Can be full path to a directory or an s3:// URL
output-dir: /data/

# Required: The library entry to go in the BAM read group.
library: Illumina

# Required: Platform to put in the read group
platform: Illumina

# Required: Program Unit for BAM header. Required for use with GATK.
program_unit: 12345

# Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G
file-size: 50G

# Optional: If true, sorts bam
sort: True

# Optional. If true, trims adapters
trim: false

# Optional: Reference fasta file (amb) -- if not present will be generated
amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb

# Optional: Reference fasta file (ann) -- If not present will be generated
ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann

# Optional: Reference fasta file (bwt) -- If not present will be generated
bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt

# Optional: Reference fasta file (pac) -- If not present will be generated
pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac

# Optional: Reference fasta file (sa) -- If not present will be generated
sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa

# Optional: Reference fasta file (fai) -- If not present will be generated
fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai

# Optional: (string) Path to Key File for SSE-C Encryption
ssec:

# Optional: Use instead of library, program_unit, and platform.
rg-line:

# Optional: Alternate file for reference build (alt). Necessary for alt aware alignment
alt:

# Optional: If true, runs the pipeline in mock mode, generating a fake output bam
mock-mode:
```
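
The `file-size` entry in the config is described as a number followed by a base-10 suffix from `[TGMK]`, for example `10M` or `150G`. As a rough sketch of how such a value might be turned into a byte count, consider the helper below. The name `parse_file_size`, the acceptance of lowercase suffixes, and the acceptance of decimal numbers are all assumptions made for illustration; the pipeline's actual parsing may differ.

```python
import re

# Base-10 multipliers, matching the "(base-10) [TGMK]" wording in the config comment.
_UNITS = {'K': 10 ** 3, 'M': 10 ** 6, 'G': 10 ** 9, 'T': 10 ** 12}

_SIZE_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*([KMGT])\s*$', re.IGNORECASE)


def parse_file_size(text):
    """Convert a string such as '50G' into a number of bytes."""
    match = _SIZE_PATTERN.match(str(text))
    if match is None:
        raise ValueError('expected a number followed by one of K, M, G, T; got %r' % (text,))
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


if __name__ == '__main__':
    for example in ('10M', '50G', '150G'):
        print(example, parse_file_size(example))
```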

- Due to PYTHONPATH issues, help can be found by typing:
## Distributed Run

- * `cd toil-scripts/src`
- * `python -m toil_scripts.batch_alignment.bwa_alignment --help`

To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning,
then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050`
to use the AWS job store and mesos batch system.