-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #309 from uclahs-cds/jsalmingo-readme-consistency
Fix README consistency
- Loading branch information
Showing
2 changed files
with
51 additions
and
23 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,27 @@ | ||
# Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads | ||
|
||
1. [Overview](#Overview) | ||
2. [How To Run](#How-To-Run) | ||
3. [Flow Diagram](#Flow-Diagram) | ||
4. [Pipeline Steps](#Pipeline-Steps) | ||
5. [Inputs](#Inputs) | ||
5. [Outputs](#Outputs) | ||
6. [Testing and Validation](#Testing-and-Validation) | ||
7. [References](#References) | ||
|
||
# Pipeline-align-DNA | ||
Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads | ||
|
||
- [Pipeline-align-DNA](#Pipeline-align-DNA) | ||
- [Overview](#Overview) | ||
- [How To Run](#How-To-Run) | ||
- [Flow Diagram](#Flow-Diagram) | ||
- [Pipeline Steps](#Pipeline-Steps) | ||
- [1. Alignment](#1-Alignment) | ||
- [2. Convert Align SAM File to BAM Format](#2-Convert-Align-SAM-File-To-BAM-Format) | ||
- [3. Sort BAM Files in Coordinate or Queryname Order](#3-Sort-BAM-Files-In-Coordinate-Or-Queryname-Order) | ||
- [4. Mark Duplicates in BAM Files](#4-Mark-Duplicates-in-BAM-Files) | ||
- [5. Index BAM Files](#5-Index-BAM-Files) | ||
- [Inputs](#Inputs) | ||
- [Outputs](#Outputs) | ||
- [Testing and Validation](#Testing-and-Validation) | ||
- [Test Data Set](#Test-Data-Set) | ||
- [Validation <version number\>](#Validation-Version-Number) | ||
- [Validation Tool](#Validation-Tool) | ||
- [References](#References) | ||
- [Discussions](#Discussions) | ||
- [Contributors](#Contributors) | ||
- [License](#License) | ||
|
||
## Overview | ||
|
||
The align-DNA Nextflow pipeline, aligns paired-end data utilizing [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) and/or [HISAT2](http://daehwankimlab.github.io/hisat2/main), [Picard](https://github.com/broadinstitute/picard) Tools and [SAMtools](https://github.com/samtools/samtools). The pipeline has been engineered to run in a 4 layer stack in a cloud-based scalable environment of CycleCloud, Slurm, Nextflow and Docker. Additionally, it has been validated with the SMC-HET dataset and reference GRCh38, where paired-end fastq’s were created with BAM Surgeon. | ||
|
@@ -109,8 +122,6 @@ python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py \ | |
|
||
## Flow Diagram | ||
|
||
A directed acyclic graph of your pipeline. | ||
|
||
>Following alignment, processes are run separately for each aligner used. | ||
![align-DNA flow diagram](docs/dna-align-flow.svg?raw=true) | ||
|
@@ -119,27 +130,27 @@ A directed acyclic graph of your pipeline. | |
|
||
## Pipeline Steps | ||
|
||
### 1A. Alignment | ||
### 1. Alignment | ||
|
||
The first step of the pipeline utilizes [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [HISAT2](http://daehwankimlab.github.io/hisat2/main) to align paired reads. BWA-MEM2 is the successor for the well-known aligner BWA. The bwa-mem2 mem command utilizes the `-M` option for marking shorter splits as secondary. This allows for compatibility with Picard Tools in downstream process and in particular prevents the underlying library of Picard Tools from recognizing these splits as duplicate reads (read names). Additionally, the `-t` option is utilized to increase the number of threads used for alignment. The number of threads used in this step is by default to allow at least 2.5Gb memory per CPU, because of the large memory usage by BWA-MEM2. This can be overwritten by setting the `bwa_mem_number_of_cpus` parameter from the config file. | ||
|
||
### 1B. Convert Align SAM File to BAM Format | ||
### 2. Convert Align SAM File to BAM Format | ||
|
||
SAMtools `view` command is used to convert the aligned SAM files into the compressed BAM format. The SAMtools `view` command utilizes the `-S` option for increasing the speed by removing duplicates and outputs the reads as they are ordered in the file. Additionally, the `-b` option ensures the output is in BAM format and the `-@` option is utilized to increase the number of threads. | ||
|
||
### 2. Sort BAM Files in Coordinate or Queryname Order | ||
### 3. Sort BAM Files in Coordinate or Queryname Order | ||
|
||
The next step of the pipeline utilizes SAMtools `sort` command to sort the aligned BAM files in coordinate order or queryname order that is needed for downstream duplicate marking tools. Specifically, the `sort_order` option is utilized to ensure the file is sorted in coordinate order for Picard or queryname order for Spark. | ||
|
||
For certain use-cases the pipeline may be configured to stop after this step using the `mark_duplicates=false` parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (like our custom Proseq-G Prostate panel). High coverage target enrichment sequencing (like Illumina's [protocol](https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/target-enrichment.html)) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates will severely reduce coverage, and it is recommended that the pipeline be configured to not mark duplicates in this case. | ||
|
||
## 3. Mark Duplicates in BAM Files | ||
### 4. Mark Duplicates in BAM Files | ||
|
||
If `mark_duplicates=true` then the next step of the pipeline utilizes Picard Tool’s `MarkDuplicates` command to mark duplicates in the BAM files. The `MarkDuplicates` command utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. Additionally, the `Program_Record_Id` is set to `MarkDuplicates`. | ||
|
||
A faster Spark implementation of `MarkDuplicates` can also be used (`MarkDuplicatesSpark` from GATK). The process matches the output of Picard's `MarkDuplicates` with significant runtime improvements. An important note, however, the Spark version requires more disk space and can fail with large inputs with multiple aligners being specified due to insufficient disk space. In such cases, Picard's `MarkDuplicates` should be used instead. | ||
|
||
## 4. Index BAM Files | ||
### 5. Index BAM Files | ||
|
||
After marking duplicated reads in BAM files, the BAM files are then indexed by using `--CREATE_INDEX true` for Picard's `MarkDuplicates`, or `--create-output-bam-index` for `MarkDuplicatesSpark`. This utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. | ||
|
||
|
@@ -217,7 +228,7 @@ After marking duplicated reads in BAM files, the BAM files are then indexed by u | |
|
||
This pipeline was tested using the synthesized SMC-HET dataset as well as a multi-lane real sample CPCG0196-B1, using reference genome version GRCh38. Some benchmarking has been done comparing BWA-MEM2 v2.1, v2.0, and the original BWA. BWA-MEM2 is able to reduce approximately half of the runtime comparing to the original BWA, with the output BAM almost identical. See [here](docs/benchmarking.rst) for the benchmarking. | ||
|
||
### Validation <version number\> | ||
### Validation <10.0.0> | ||
|
||
| metric | Result | | ||
|:-------|:-------| | ||
|
@@ -253,6 +264,10 @@ This pipeline was tested using the synthesized SMC-HET dataset as well as a mult | |
| pairs with other orientation | 0.9999726 | | ||
| pairs on different chromosomes | 1.0000416 | | ||
|
||
### Validation Tool | ||
|
||
Included is a template for validating your input files. For more information on the tool check out the following link: https://github.com/uclahs-cds/package-PipeVal | ||
|
||
--- | ||
|
||
## References | ||
|
@@ -263,15 +278,29 @@ Daehwan Kim, Ben Langmead, Steven L Salzberg. HISAT: a fast spliced aligner with | |
|
||
--- | ||
|
||
## Discussions | ||
|
||
- [Issue tracker](https://github.com/uclahs-cds/pipeline-align-DNA/issues/) to report errors and enhancement ideas. | ||
- Discussions can take place in [align-DNA discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/) | ||
- [align-DNA Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion. | ||
|
||
--- | ||
|
||
## Contributors | ||
|
||
Please see list of [Contributors](https://github.com/uclahs-cds/pipeline-align-DNA/graphs/contributors) at GitHub. | ||
|
||
--- | ||
|
||
## License | ||
|
||
Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected]) | ||
Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected]), Joseph Salmingo ([email protected]) | ||
|
||
Align-DNA is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license. | ||
|
||
Align-DNA aligned paired-end reads using the BWA-MEM2 and/or HISAT2 aligners. | ||
|
||
Copyright (C) 2020-2022 University of California Los Angeles ("Boutros Lab") All rights reserved. | ||
Copyright (C) 2020-2024 University of California Los Angeles ("Boutros Lab") All rights reserved. | ||
|
||
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. | ||
|
||
|