Refactor bwa (resolves #320, resolves #297, resolves #322) #324
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,47 +1,146 @@ | ||
## University of California, Santa Cruz Genomics Institute | ||
### GATK-compatible Alignment | ||
### Guide: Running the BWA Pipeline using Toil | ||
|
||
This guide attempts to walk the user through running this pipeline from start to finish. | ||
|
||
If you find any errors or have corrections, please feel free to make a pull request. Feedback of any kind is appreciated. | ||
|
||
If there are any questions please contact John Vivian ([email protected]). | ||
If you find any errors or have corrections, please feel free to make a pull request. | ||
Feedback of any kind is appreciated. | ||
|
||
## Overview | ||
|
||
This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline. | ||
A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci). | ||
Fastqs are aligned to create a BAM that is compatible with GATK. | ||
|
||
## Installation | ||
|
||
Toil-scripts is now pip installable! Use `pip install toil-scripts` for the toil-stable version | ||
or `pip install --pre toil-scripts` for the cutting-edge development version. | ||
|
||
Type `toil-bwa` to get a basic help menu and instructions. | ||
|
||
To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv: | ||
|
||
- `virtualenv ~/toil-scripts` | ||
- `source ~/toil-scripts/bin/activate` | ||
- `pip install toil` | ||
- `pip install toil-scripts` | ||
|
||
If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos), | ||
use virtualenv's `--system-site-packages` flag. | ||
|
||
## Dependencies | ||
|
||
This pipeline has been tested on Ubuntu 14.04, but should also run on other Unix-based systems. `apt-get` and `pip` | ||
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have sudo | ||
privileges you will need to build these tools from source, or bug a sysadmin (they don't mind). | ||
often require `sudo` privilege, so if the below commands fail, try prepending `sudo`. If you do not have `sudo` | ||
privileges you will need to build these tools from source, or bug a sysadmin about how to get them. | ||
|
||
#### General Dependencies | ||
|
||
1. Python 2.7 | ||
2. Curl: `apt-get install curl` | ||
3. Docker: http://docs.docker.com/engine/installation/ | ||
2. Curl: `apt-get install curl` | ||
3. Docker: http://docs.docker.com/engine/installation/ | ||
|
||
#### Python Dependencies | ||
|
||
1. Toil: `pip install toil` | ||
2. S3AM: `pip install --pre s3am` (optional, for upload of BAMFILE to S3) | ||
1. Toil: `pip install toil` | ||
2. S3AM: `pip install --pre s3am` (optional, needed for uploading output to S3) | ||
|
||
## Inputs | ||
|
||
The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a | ||
reference genome. The pipeline can be sped up by specifying URLs for the reference index files, which are generated | ||
with `bwa index` and `samtools faidx`. | ||
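
As a rough illustration of which files those two commands leave next to a reference, the sketch below maps the index-related keys of the example config (`amb`, `ann`, `bwt`, `pac`, `sa`, `fai`) to URLs. The helper name `expected_index_files` is hypothetical and not part of toil-scripts, and it assumes each index file is named by appending its suffix to the reference URL, as in the example config.

```python
# Hypothetical helper, not part of toil-scripts: lists the index files that
# are expected to sit next to a reference fasta.
BWA_INDEX_SUFFIXES = ("amb", "ann", "bwt", "pac", "sa")  # written by `bwa index`
FAIDX_SUFFIX = "fai"                                     # written by `samtools faidx`


def expected_index_files(ref_url):
    """Return {config_key: url} for each index file.

    Assumption: every index file is named '<reference>.<suffix>', which
    matches the URLs used in the example config.
    """
    suffixes = BWA_INDEX_SUFFIXES + (FAIDX_SUFFIX,)
    return {suffix: "{}.{}".format(ref_url, suffix) for suffix in suffixes}


if __name__ == "__main__":
    ref = "s3://cgl-pipeline-inputs/alignment/hg19.fa"
    for key, url in sorted(expected_index_files(ref).items()):
        # Each printed line has the same "key: value" shape as the config.
        print("{}: {}".format(key, url))
```

If any of these entries is left blank in the config, the pipeline generates the corresponding file itself, which is the slower path described above.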
|
||
## General Usage | ||
|
||
## Output | ||
1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory. | ||
2. Parameterize the pipeline by editing the config. | ||
3. Fill in the manifest with information pertaining to your samples. | ||
4. Type `toil-bwa run [jobStore]` to execute the pipeline. | ||
|
||
This pipeline produces a BAMFILE for a given sample. | ||
## Example Commands | ||
|
||
## Running / Help | ||
Run sample(s) locally using the manifest | ||
1. `toil-bwa generate` | ||
2. Fill in config and manifest | ||
3. `toil-bwa run ./example-jobstore` | ||
|
||
It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline. | ||
It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified. | ||
To run a pipeline after dependencies have been installed, simply: | ||
Toil options can be appended to `toil-bwa run`, for example: | ||
`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data` | ||
|
||
* `git clone https://github.com/BD2KGenomics/toil-scripts` | ||
* `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh` | ||
For a complete list of Toil options, just type `toil-bwa run -h` | ||
|
||
Run a variety of samples locally | ||
1. `toil-bwa generate-config` | ||
2. Fill in config | ||
3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \ | ||
test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz` | ||
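
When many samples are launched this way, it may help to assemble the command in a script. The following is a minimal sketch, assuming only the flags shown in the examples above (`--sample` plus appended Toil options); the helper names `file_url` and `build_run_command` are hypothetical and not part of toil-scripts.

```python
from urllib.parse import quote


def file_url(path):
    """Turn an absolute local path into a file:// URL."""
    if not path.startswith("/"):
        # The example above uses full paths, so relative paths are rejected.
        raise ValueError("expected an absolute path, got: {!r}".format(path))
    return "file://" + quote(path)


def build_run_command(job_store, uuid, read1, read2, toil_options=()):
    """Assemble the argument list for `toil-bwa run` with one --sample."""
    command = ["toil-bwa", "run", job_store]
    command.extend(toil_options)  # e.g. --retryCount=1 --workDir=/data
    command.extend(["--sample", uuid, file_url(read1), file_url(read2)])
    return command


if __name__ == "__main__":
    cmd = build_run_command(
        "./example-jobstore",
        "test-uuid",
        "/full/path/to/read1.fq.gz",
        "/full/path/to/read2.fq.gz",
        toil_options=["--retryCount=1", "--workDir=/data"],
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.call` once the config has been filled in; building a list rather than a single string avoids shell quoting problems with unusual paths.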
|
||
## Example Config | ||
|
||
``` | ||
# BWA Alignment Pipeline configuration file | ||
# [Review comments on this line] "Put in verbatim block?" / "Ah, ok!" |
||
# This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon. | ||
# Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run" | ||
# URLs can take the form: http://, file://, s3://, gnos://. | ||
# Comments (beginning with #) do not need to be removed. Optional parameters may be left blank | ||
############################################################################################################## | ||
# Required: Reference fasta file | ||
ref: s3://cgl-pipeline-inputs/alignment/hg19.fa | ||
|
||
# Required: Output location of sample. Can be full path to a directory or an s3:// URL | ||
output-dir: /data/ | ||
|
||
# Required: The library entry to go in the BAM read group. | ||
library: Illumina | ||
|
||
# Required: Platform to put in the read group | ||
platform: Illumina | ||
|
||
# Required: Program Unit for BAM header. Required for use with GATK. | ||
program_unit: 12345 | ||
|
||
# Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G | ||
file-size: 50G | ||
|
||
# Optional: If true, sorts bam | ||
sort: True | ||
|
||
# Optional. If true, trims adapters | ||
trim: false | ||
|
||
# Optional: Reference fasta file (amb) -- if not present will be generated | ||
amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb | ||
|
||
# Optional: Reference fasta file (ann) -- If not present will be generated | ||
ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann | ||
|
||
# Optional: Reference fasta file (bwt) -- If not present will be generated | ||
bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt | ||
|
||
# Optional: Reference fasta file (pac) -- If not present will be generated | ||
pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac | ||
|
||
# Optional: Reference fasta file (sa) -- If not present will be generated | ||
sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa | ||
|
||
# Optional: Reference fasta file (fai) -- If not present will be generated | ||
fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai | ||
|
||
# Optional: (string) Path to Key File for SSE-C Encryption | ||
ssec: | ||
|
||
# Optional: Use instead of library, program_unit, and platform. | ||
rg-line: | ||
|
||
# Optional: Alternate file for reference build (alt). Necessary for alt aware alignment | ||
alt: | ||
|
||
# Optional: If true, runs the pipeline in mock mode, generating a fake output bam | ||
mock-mode: | ||
``` | ||
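
The `file-size` entry uses a number followed by a base-10 suffix. As a minimal sketch of how such a value could be interpreted, the function below converts it to bytes. `parse_file_size` is a hypothetical name and is not the parser the pipeline itself uses; it assumes whole numbers only.

```python
import re

# Base-10 multipliers, following the config comment: [TGMK], e.g. 10M, 150G.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
_SIZE_PATTERN = re.compile(r"^(\d+)([KMGT])$")


def parse_file_size(text):
    """Convert a value such as '50G' into a number of bytes.

    Assumption: only whole numbers followed by one of K, M, G, T are
    accepted; anything else raises ValueError.
    """
    match = _SIZE_PATTERN.match(str(text).strip().upper())
    if match is None:
        raise ValueError("unrecognized file size: {!r}".format(text))
    number, unit = match.groups()
    return int(number) * _MULTIPLIERS[unit]


if __name__ == "__main__":
    for example in ("10M", "150G", "50G"):
        print(example, "->", parse_file_size(example), "bytes")
```

Because the multipliers are base-10, `50G` here means 50 × 10^9 bytes rather than 50 GiB.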
|
||
Due to PYTHONPATH issues, help can be found by typing: | ||
## Distributed Run | ||
|
||
* `cd toil-scripts/src` | ||
* `python -m toil_scripts.batch_alignment.bwa_alignment --help` | ||
|
||
To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning, | ||
then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050` | ||
to use the AWS job store and Mesos batch system. |

What happens to `rg_line` now?

Er. Nevermind. I see that `rg_line` is now being passed through `inputs`.

Why isn't `rg_line` just in `inputs` to start with? If we're refactoring this, mightn't we factor out passing `rg_line` as a parameter?

Looking at the code, I'm not sure if it's even defined anywhere in `align_and_call`. `rg_line` gets set from `uuid_items[1]`, which is taken from `uuid_rg` from `uuid_list`, but `uuid_list` is parsed from the manifest which has no entry for read group... (confused yet?).

Ace!