Refactor BWA (resolves #297, resolves #320, resolves #322)

Add missing uuid to config
Fix bad unpacking
Add sample URLs to logger
Add suffix option to test
Remove unused import
Move google group line to parent README
Add whitespace before map_job
Add comment explaining map_job
Fix missing trim attribute in test
Clean up logic for reference indices
Change point of contact to google group
Improve legibility with comments on their own lines
Unindent FollowOn job
Add logging for index creation
Add inline explaining alt file can't be generated
Clarify that FAI is created separately from BWA indices
Add docstring to move_file_job
Add inline comment to explain disk space
Reverse use of csv
Replace run_bwakit arguments with config
Fix imports and spacing in bwa_alignment
Add /data/ prefix
Add missing import
Move run_bwa_index and run_samtools_faidx to indexing.py
Remove unnecessary renaming
Colocate parameter logic
Add inline comments for conditionals and rg
Provide clarity for rg_line
Move required_length to common lib
Remove bad print statement
Add comments explaining conditional
Clarify samples type
Clarify file_size option
Remove attempt at humor (+5 squashed commits)

Squashed commits:
Move bwa_kit to tool library (resolves #297)
Add bwa_index and samtools_faidx to tool library
Indent Job.Runner, add sanity checks
options -> args (convention)
Print docstring help if no arguments provided
Remove config options
Change config formatting
Add tab-friendly config and manifest names
Replace old bwa_kit job with download_sample_and_align
Clean imports
Move top docstring to main()
Fix test call
Fix adam_gatk_pipeline's call to download_reference_files
Add job version of `move_files`
Update README.md
Remove deprecated launch scripts
Require only reference (resolves #320)
Change download_shared_files -> download_reference_files
Remove old reference requirements
Add single end support for BWA (resolves #322)
Edit manifest docs to include single-end
Add nargs range for single-end and paired-end
Replace parse_config job with parse_manifest function
BD2KGenomics · Jul 7, 2016 · 7b316cc
1 parent 45ae6e2
commit 7b316cc

Showing 17 changed files with 520 additions and 360 deletions.
diff --git a/README.md b/README.md
@@ -22,5 +22,5 @@ The general dependencies for these pipelines are:
 Our group utilizes genomics tools encapsulated within Docker containers for portability.  Each of these
 pipelines can be run locally on your laptop, on a baremetal cluster, or on a cloud provider. 
 
-If there are any questions please contact John Vivian (jtvivian@gmail.com).
+If there are any questions please contact the Toil team at: Toil@googlegroups.com 
 If you find any errors or corrections please feel free to make a pull request.  Feedback of any kind is appreciated.
diff --git a/src/toil_scripts/adam_gatk_pipeline/align_and_call.py b/src/toil_scripts/adam_gatk_pipeline/align_and_call.py
@@ -178,13 +178,13 @@ def static_dag(job, uuid, rg_line, inputs):
             'dir_suffix': inputs.dir_suffix}
 
     # get head BWA alignment job function and encapsulate it
-    bwa = job.wrapJobFn(download_shared_files,
+    inputs.rg_line = rg_line
+    inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
+    bwa = job.wrapJobFn(download_reference_files,
                         inputs,
                         [uuid,
                          's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
-                         's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)],
-                        's3://{s3_bucket}/alignment{dir_suffix}'.format(**args),
-                        rg_line).encapsulate()
+                         's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]).encapsulate()
 
     # get head ADAM preprocessing job function and encapsulate it
     adam_preprocess = job.wrapJobFn(static_adam_preprocessing_dag,

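The hunk above drops two positional arguments from the wrapped job and sets them as attributes on `inputs` instead. A minimal sketch of that pattern follows, assuming `inputs` behaves like an `argparse.Namespace` and that `args` is a dict containing the keys used in the format strings. The helper name `prepare_bwa_inputs` and the bucket and directory values are hypothetical; the commit itself does this inline in `static_dag`.

```python
from argparse import Namespace


def prepare_bwa_inputs(inputs, uuid, rg_line, args):
    """Attach per-sample settings to `inputs` and build the sample list.

    Hypothetical helper for illustration only; not part of the commit.
    """
    # Values that used to be passed positionally now travel on `inputs`.
    inputs.rg_line = rg_line
    inputs.output_dir = 's3://{s3_bucket}/alignment{dir_suffix}'.format(**args)
    # The sample is still a list: the UUID followed by the paired FASTQ URLs.
    sample = [uuid,
              's3://{s3_bucket}/{sequence_dir}/{uuid}_1.fastq.gz'.format(**args),
              's3://{s3_bucket}/{sequence_dir}/{uuid}_2.fastq.gz'.format(**args)]
    return inputs, sample


# Placeholder values, assumed for the example.
args = {'s3_bucket': 'example-bucket', 'sequence_dir': 'sequence',
        'dir_suffix': '-test', 'uuid': 'test-uuid'}
inputs, sample = prepare_bwa_inputs(Namespace(), 'test-uuid',
                                    '@RG\\tID:test-uuid', args)
```

Carrying these settings on one object keeps the wrapped job's signature down to the inputs and the sample, which appears to be the intent of the change.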
diff --git a/src/toil_scripts/batch_alignment/README.md b/src/toil_scripts/batch_alignment/README.md
@@ -1,47 +1,146 @@
 ## University of California, Santa Cruz Genomics Institute
-### GATK-compatible Alignment
+### Guide: Running the BWA Pipeline using Toil
+
+This guide attempts to walk the user through running this pipeline from start to finish. 
+
+If you find any errors or corrections please feel free to make a pull request.  Feedback of any kind is appreciated.
 
-If there are any questions please contact John Vivian ([email protected]). 
-If you find any errors or corrections please feel free to make a pull request.  
-Feedback of any kind is appreciated.
 
 ## Overview
 
-This pipeline accepts two fastq files (by URL) to be aligned into a BAMFILE, which is the final output of the pipeline.
-A launch script is provided for 4 different references (b37, hg19, hg38, and hg38 no alternative loci).
+Fastqs are aligned to create a BAM that is compatible with GATK.
+
+## Installation
+
+Toil-scripts is now pip installable! `pip install toil-scripts` for a toil-stable version 
+or `pip install --pre toil-scripts` for cutting edge development version.
+
+Type: `toil-bwa` to get basic help menu and instructions
 
+To decrease the chance of versioning conflicts, install toil-scripts into a virtualenv: 
+
+- `virtualenv ~/toil-scripts` 
+- `source ~/toil-scripts/bin/activate`
+- `pip install toil`
+- `pip install toil-scripts`
+
+If Toil is already installed globally (true for CGCloud users), or there are global dependencies (like Mesos),
+use virtualenv's `--system-site-packages` flag.
+
 ## Dependencies
 
 This pipeline has been tested on Ubuntu 14.04, but should also run on other unix based systems.  `apt-get` and `pip`
-often require `sudo` privilege, so if the below commands fail, try prepending `sudo`.  If you do not have sudo 
-privileges you will need to build these tools from source, or bug a sysadmin (they don't mind). 
+often require `sudo` privilege, so if the below commands fail, try prepending `sudo`.  If you do not have `sudo` 
+privileges you will need to build these tools from source, or bug a sysadmin about how to get them. 
 
 #### General Dependencies
 
     1. Python 2.7
-    2. Curl     apt-get install curl
-    3. Docker   http://docs.docker.com/engine/installation/
-    
+    2. Curl         apt-get install curl
+    3. Docker       http://docs.docker.com/engine/installation/
+
 #### Python Dependencies
 
-    1. Toil     pip install toil
-    2. S3AM     pip install --pre s3am (optional, for upload of BAMFILE to S3)
+    1. Toil         pip install toil
+    2. S3AM         pip install --pre s3am (optional, needed for uploading output to S3)
+
+## Inputs
+
+The BWA pipeline requires input files in order to run. The only required input, aside from the sample(s), is a 
+reference genome.  The pipeline can be sped up by specifying URLs for the reference index files, which are generated 
+with `bwa index` and `samtools faidx`.
+
+## General Usage
 
-## Output
+1. Type `toil-bwa generate` to create an editable manifest and config in the current working directory.
+2. Parameterize the pipeline by editing the config.
+3. Fill in the manifest with information pertaining to your samples.
+4. Type `toil-bwa run [jobStore]` to execute the pipeline.
 
-This pipeline produces a BAMFILE for a given sample.
+## Example Commands
 
-## Running / Help
+Run sample(s) locally using the manifest
+1. `toil-bwa generate`
+2. Fill in config and manifest
+3. `toil-bwa run ./example-jobstore`
 
-It is recommended to use the associated launch scripts which provide default arguments needed to run the pipeline. 
-It is likely that the job store positional argument, `--workDir`, and `--output-dir` arguments will need to be modified.
-To run a pipeline after dependencies have been installed, simply:
+Toil options can be appended to `toil-bwa run`, for example:
+`toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data`
 
-* `git clone https://github.com/BD2KGenomics/toil-scripts`
-* `/toil-scripts/src/toil_scripts/batch_alignment/launch_bwa_hg38_no_alt.sh`
+For a complete list of Toil options, just type `toil-bwa run -h`
+
+Run a variety of samples locally
+1. `toil-bwa generate-config`
+2. Fill in config
+3. `toil-bwa run ./example-jobstore --retryCount=1 --workDir=/data --sample \
+    test-uuid file:///full/path/to/read1.fq.gz file:///full/path/to/read2.fq.gz`
+
+## Example Config
+
+   ``` 
+    # BWA Alignment Pipeline configuration file
+    # This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
+    # Edit the values in this configuration file and then rerun the pipeline: "toil-bwa run"
+    # URLs can take the form: http://, file://, s3://, gnos://.
+    # Comments (beginning with #) do not need to be removed. Optional parameters may be left blank
+    ##############################################################################################################
+    # Required: Reference fasta file
+    ref: s3://cgl-pipeline-inputs/alignment/hg19.fa
+    
+    # Required: Output location of sample. Can be full path to a directory or an s3:// URL
+    output-dir: /data/
+    
+    # Required: The library entry to go in the BAM read group.
+    library: Illumina
+    
+    # Required: Platform to put in the read group
+    platform: Illumina
+    
+    # Required: Program Unit for BAM header. Required for use with GATK.
+    program_unit: 12345
+    
+    # Required: Approximate input file size. Provided as a number followed by (base-10) [TGMK]. E.g. 10M, 150G
+    file-size: 50G
+    
+    # Optional: If true, sorts bam
+    sort: True
+    
+    # Optional. If true, trims adapters
+    trim: false
+    
+    # Optional: Reference fasta file (amb) -- if not present will be generated
+    amb: s3://cgl-pipeline-inputs/alignment/hg19.fa.amb
+    
+    # Optional: Reference fasta file (ann) -- If not present will be generated
+    ann: s3://cgl-pipeline-inputs/alignment/hg19.fa.ann
+    
+    # Optional: Reference fasta file (bwt) -- If not present will be generated
+    bwt: s3://cgl-pipeline-inputs/alignment/hg19.fa.bwt
+    
+    # Optional: Reference fasta file (pac) -- If not present will be generated
+    pac: s3://cgl-pipeline-inputs/alignment/hg19.fa.pac
+    
+    # Optional: Reference fasta file (sa) -- If not present will be generated
+    sa: s3://cgl-pipeline-inputs/alignment/hg19.fa.sa
+    
+    # Optional: Reference fasta file (fai) -- If not present will be generated
+    fai: s3://cgl-pipeline-inputs/alignment/hg19.fa.fai
+    
+    # Optional: (string) Path to Key File for SSE-C Encryption
+    ssec:
+    
+    # Optional: Use instead of library, program_unit, and platform.
+    rg-line:
+    
+    # Optional: Alternate file for reference build (alt). Necessary for alt aware alignment
+    alt:
+    
+    # Optional: If true, runs the pipeline in mock mode, generating a fake output bam
+    mock-mode:
+```
 
-Due to PYTHONPATH issues, help can be found by typing:
+## Distributed Run
 
-* `cd toil-scripts/src`
-* `python -m toil_scripts.batch_alignment.bwa_alignment --help`
-
+To run on a distributed AWS cluster, see [CGCloud](https://github.com/BD2KGenomics/cgcloud) for instance provisioning, 
+then run `toil-bwa run aws:us-west-2:example-jobstore-bucket --batchSystem=mesos --mesosMaster mesos-master:5050`
+to use the AWS job store and mesos batch system.
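
Two options in the example config above lend themselves to a short illustration: `file-size`, described as a number followed by a base-10 suffix from `[TGMK]`, and `rg-line`, which replaces `library`, `platform`, and `program_unit` when set. The sketch below shows one way such values could be interpreted. The function names `parse_file_size` and `build_rg_line` are hypothetical, and the read group field layout (ID, LB, PL, PU, SM joined by tabs, with the sample UUID as ID and SM) is an assumption rather than the pipeline's confirmed behavior.

```python
import re

# Base-10 multipliers for the suffixes named in the config comment ([TGMK]).
_MULTIPLIERS = {'K': 10 ** 3, 'M': 10 ** 6, 'G': 10 ** 9, 'T': 10 ** 12}


def parse_file_size(text):
    """Convert a string such as '50G' or '10M' into bytes (hypothetical helper)."""
    match = re.match(r'^\s*(\d+(?:\.\d+)?)\s*([TGMK])\s*$', str(text), re.IGNORECASE)
    if match is None:
        raise ValueError('Expected a number followed by T, G, M or K, got: %r' % (text,))
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix.upper()])


def build_rg_line(config, uuid):
    """Return the read group line for a sample (hypothetical helper).

    An explicit 'rg-line' wins; otherwise the line is assembled from the
    library, platform, and program_unit entries. The field layout here is
    an assumption made for illustration.
    """
    if config.get('rg-line'):
        return config['rg-line']
    fields = ['@RG',
              'ID:{}'.format(uuid),
              'LB:{}'.format(config['library']),
              'PL:{}'.format(config['platform']),
              'PU:{}'.format(config['program_unit']),
              'SM:{}'.format(uuid)]
    return '\t'.join(fields)


# Values taken from the example config; 'test-uuid' is the sample name used
# in the example command.
config = {'library': 'Illumina', 'platform': 'Illumina',
          'program_unit': 12345, 'file-size': '50G', 'rg-line': None}
size_in_bytes = parse_file_size(config['file-size'])
rg_line = build_rg_line(config, 'test-uuid')
```

Rejecting malformed sizes at parse time surfaces a typo in the config before any job is scheduled, rather than partway through a run.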