Merge pull request #324 from /issues/297-refactor-bwa

Refactor bwa (resolves #320, resolves #297, resolves #322)
BD2KGenomics · Jul 11, 2016 · d4bc20f
2 parents fce430a + eadc6b5

Showing 17 changed files with 520 additions and 360 deletions.
diff --git a/README.md b/README.md
@@ -22,5 +22,5 @@ The general dependencies for these pipelines are:
 Our group utilizes genomics tools encapsulated within Docker containers for portability.  Each of these
 pipelines can be run locally on your laptop, on a baremetal cluster, or on a cloud provider. 
 
-If there are any questions please contact John Vivian (jtvivian@gmail.com).
+If there are any questions, please contact the Toil team at Toil@googlegroups.com.
 If you find any errors or corrections please feel free to make a pull request.  Feedback of any kind is appreciated.
diff --git a/src/toil_scripts/adam_gatk_pipeline/align_and_call.py b/src/toil_scripts/adam_gatk_pipeline/align_and_call.py
@@ -178,13 +178,13 @@ def static_dag(job, uuid, rg_line, inputs):
             'dir_suffix': inputs.dir_suffix}
 
     # get head BWA alignment job function and encapsulate it
-    bwa = job.wrapJobFn(download_shared_files,
+    inputs.rg_line = rg_line
+    inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
+    bwa = job.wrapJobFn(download_reference_files,
                         inputs,
                         [uuid,
                          's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
-                         's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)],
-                        's3://{s3_bucket}/alignment{dir_suffix}'.format(**args),
-                        rg_line).encapsulate()
+                         's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]).encapsulate()
 
     # get head ADAM preprocessing job function and encapsulate it
     adam_preprocess = job.wrapJobFn(static_adam_preprocessing_dag,
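
The hunk above stops passing `rg_line` and the output directory as positional arguments and stores them on the `inputs` object instead. A minimal sketch of that pattern follows. It assumes `inputs` behaves like an `argparse.Namespace` (the diff does not show its type); the helper names `attach_sample_settings` and `fastq_urls` and all example values are hypothetical and are not part of toil-scripts.

```python
from argparse import Namespace


def attach_sample_settings(inputs, rg_line, args):
    """Hypothetical helper: store per-sample values on the shared inputs object."""
    inputs.rg_line = rg_line
    # Same URL template as the added line in the diff.
    inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
    return inputs


def fastq_urls(args):
    """Hypothetical helper: build the paired-end FASTQ URLs with the diff's templates."""
    return ['s3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
            's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]


# Placeholder values for illustration only.
args = {'s3_bucket': 'example-bucket',
        'sequence_dir': 'sequence',
        'dir_suffix': '/run1',
        'uuid': 'test-uuid'}
rg_line = '@RG\tID:test-uuid\tSM:test-uuid'

inputs = attach_sample_settings(Namespace(), rg_line, args)
urls = fastq_urls(args)
```

With the settings carried on `inputs`, the wrapped job function only needs `inputs` and the sample list, which matches the shorter `job.wrapJobFn(...)` call in the new version of the hunk.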

diff --git a/src/toil_scripts/batch_alignment/README.md b/src/toil_scripts/batch_alignment/README.md
@@ -1,47 +1,146 @@
 ## University of California, Santa Cruz Genomics Institute
-### GATK-compatible Alignment
+### Guide: Running the BWA Pipeline using Toil
+
+This guide attempts to walk the user through running this pipeline from start to finish. 
+
+If you find any errors or corrections please feel free to make a pull request.  Feedback of any kind is appreciated.
 
-If there are any questions please contact John Vivian (jtvivian@gmail.com). 
-If you find any errors or corrections please feel free to make a pull request.  
-Feedback of any kind is appreciated.
 
 ## Overview
 
-This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline.
-A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci).
+Fastqs are aligned to create a BAM that is compatible with GATK.
+
+## Installation
+
+Toil-scripts is now pip installable! Run `pip install toil-scripts` for a toil-stable version
+or `pip install --pre toil-scripts` for the cutting-edge development version.
+
+Type `toil-bwa` to get a basic help menu and instructions.
 
+To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv: 
+
+- `virtualenv ~/toil-scripts` 
+- `source ~/toil-scripts/bin/activate`
+- `pip install toil`
+- `pip install toil-scripts`
+
+If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos),
+use virtualenv's `--system-site-packages` flag.
+
 ## Dependencies
 
 This pipeline has been tested on Ubuntu 14.04, but should also run on other Unix-based systems.  `apt-get` and `pip`
-often require `sudo` privilege, so if the below commands fail, try prepending `sudo`.  If you do not have sudo 
-privileges you will need to build these tools from source, or bug a sysadmin (they don't mind). 
+often require `sudo` privilege, so if the below commands fail, try prepending `sudo`.  If you do not have `sudo` 
+privileges you will need to build these tools from source, or bug a sysadmin about how to get them. 
 
 #### General Dependencies
 
     1. Python 2.7
-    2. Curl     apt-get install curl
-    3. Docker   http://docs.docker.com/engine/installation/
-    
+    2. Curl         apt-get install curl
+    3. Docker       http://docs.docker.com/engine/installation/
+
 #### Python Dependencies
 
-    1. Toil     pip install toil
-    2. S3AM     pip install --pre s3am (optional, for upload of BAMFILE to S3)
+    1. Toil         pip install toil
+    2. S3AM         pip install --pre s3am (optional, needed for uploading output to S3)
+
+## Inputs
+
+The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a 
+reference genome.  The pipeline can be sped up by specifying URLs for the reference index files, which are generated 
+with `bwa index` and `samtools faidx`.
+
+## General Usage
 
-## Output
+1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory.
+2. Parameterize the pipeline by editing the config.
+3. Fill in the manifest with information pertaining to your samples.
+4. Type `toil-bwa run [jobStore]` to execute the pipeline.
 
-This pipeline produces a BAMFILE for a given sample.
+## Example Commands
 
-## Running / Help
+Run sample(s) locally using the manifest:
+1. `toil-bwa generate`
+2. Fill in config and manifest
+3. `toil-bwa run ./example-jobstore`
 
-It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline. 
-It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified.
-To run a pipeline after dependencies have been installed, simply:
+Toil options can be appended to `toil-bwa run`, for example:
+`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data`
 
-* `git clone https://github.com/BD2KGenomics/toil-scripts`
-* `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh`
+For a complete list of Toil options, type `toil-bwa run -h`.
+
+Run a variety of samples locally:
+1. `toil-bwa generate-config`
+2. Fill in config
+3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \
+    test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz`
+
+## Example Config
+
+   ``` 
+    # BWA Alignment Pipeline configuration file
+    # This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
+    # Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run"
+    # URLs can take the form: http://, file://, s3://, gnos://.
+    # Comments (beginning with #) do not need to be removed. Optional parameters may be left blank
+    ##############################################################################################################
+    # Required: Reference fasta file
+    ref: s3://cgl-pipeline-inputs/alignment/hg19.fa
+    
+    # Required: Output location of sample. Can be full path to a directory or an s3:// URL
+    output-dir: /data/
+    
+    # Required: The library entry to go in the BAM read group.
+    library: Illumina
+    
+    # Required: Platform to put in the read group
+    platform: Illumina
+    
+    # Required: Program Unit for BAM header. Required for use with GATK.
+    program_unit: 12345
+    
+    # Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G
+    file-size: 50G
+    
+    # Optional: If true, sorts bam
+    sort: True
+    
+    # Optional: If true, trims adapters
+    trim: false
+    
+    # Optional: Reference fasta file (amb) -- If not present will be generated
+    amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb
+    
+    # Optional: Reference fasta file (ann) -- If not present will be generated
+    ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann
+    
+    # Optional: Reference fasta file (bwt) -- If not present will be generated
+    bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt
+    
+    # Optional: Reference fasta file (pac) -- If not present will be generated
+    pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac
+    
+    # Optional: Reference fasta file (sa) -- If not present will be generated
+    sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa
+    
+    # Optional: Reference fasta file (fai) -- If not present will be generated
+    fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai
+    
+    # Optional: (string) Path to Key File for SSE-C Encryption
+    ssec:
+    
+    # Optional: Use instead of library, program_unit, and platform.
+    rg-line:
+    
+    # Optional: Alternate file for reference build (alt). Necessary for alt aware alignment
+    alt:
+    
+    # Optional: If true, runs the pipeline in mock mode, generating a fake output bam
+    mock-mode:
+```
 
-Due to PYTHONPATH issues, help can be found by typing:
+## Distributed Run
 
-* `cd toil-scripts/src`
-* `python -m toil_scripts.batch_alignment.bwa_alignment --help`
-
+To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning, 
+then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050`
+to use the AWS job store and Mesos batch system.
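
The example config above describes `file-size` as a number followed by a base-10 unit letter from `[TGMK]`, e.g. `10M` or `150G`. The sketch below shows one way such a value could be converted to bytes. The function name `parse_file_size` is hypothetical, and accepting decimals, lowercase units, and surrounding whitespace is an assumption; the pipeline's own parsing is not shown in this diff and may differ.

```python
import re

# Base-10 multipliers, as stated in the config comment.
_UNITS = {'K': 10 ** 3, 'M': 10 ** 6, 'G': 10 ** 9, 'T': 10 ** 12}


def parse_file_size(text):
    """Hypothetical helper: convert a string such as '50G' to a byte count."""
    match = re.match(r'^\s*(\d+(?:\.\d+)?)\s*([KMGT])\s*$', str(text), re.IGNORECASE)
    if match is None:
        raise ValueError('Expected a number followed by one of T, G, M, K: %r' % (text,))
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


size_in_bytes = parse_file_size('50G')
```

Rejecting values without a unit letter up front keeps a typo in the config from silently becoming a resource request that is too small.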