Skip to content

Commit

Permalink
Merge pull request #309 from uclahs-cds/jsalmingo-readme-consistency
Browse files Browse the repository at this point in the history
Fix README consistency
  • Loading branch information
j2salmingo authored Apr 30, 2024
2 parents a6769c3 + 59e91e0 commit 4cc44ce
Show file tree
Hide file tree
Showing 2 changed files with 51 additions and 23 deletions.
3 changes: 1 addition & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,9 +11,8 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
### Changed
- Update Picard version to 3.1.1
- Update BWA-MEM2, HISAT2 images to use SAMTools version 1.17

### [Changed]
- Update Nextflow configuration test workflows
- Update README.md to match template

## [10.0.0] - 2024-03-29
### Added
Expand Down
71 changes: 50 additions & 21 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,27 @@
# Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads

1. [Overview](#Overview)
2. [How To Run](#How-To-Run)
3. [Flow Diagram](#Flow-Diagram)
4. [Pipeline Steps](#Pipeline-Steps)
5. [Inputs](#Inputs)
5. [Outputs](#Outputs)
6. [Testing and Validation](#Testing-and-Validation)
7. [References](#References)

# Pipeline-align-DNA
Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads

- [Pipeline-align-DNA](#Pipeline-align-DNA)
- [Overview](#Overview)
- [How To Run](#How-To-Run)
- [Flow Diagram](#Flow-Diagram)
- [Pipeline Steps](#Pipeline-Steps)
- [1. Alignment](#1-Alignment)
- [2. Convert Align SAM File to BAM Format](#2-Convert-Align-SAM-File-To-BAM-Format)
- [3. Sort BAM Files in Coordinate or Queryname Order](#3-Sort-BAM-Files-In-Coordinate-Or-Queryname-Order)
- [4. Mark Duplicates in BAM Files](#4-Mark-Duplicates-in-BAM-Files)
- [5. Index BAM Files](#5-Index-BAM-Files)
- [Inputs](#Inputs)
- [Outputs](#Outputs)
- [Testing and Validation](#Testing-and-Validation)
- [Test Data Set](#Test-Data-Set)
- [Validation <version number\>](#Validation-Version-Number)
- [Validation Tool](#Validation-Tool)
- [References](#References)
- [Discussions](#Discussions)
- [Contributors](#Contributors)
- [License](#License)

## Overview

The align-DNA Nextflow pipeline, aligns paired-end data utilizing [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) and/or [HISAT2](http://daehwankimlab.github.io/hisat2/main), [Picard](https://github.com/broadinstitute/picard) Tools and [SAMtools](https://github.com/samtools/samtools). The pipeline has been engineered to run in a 4 layer stack in a cloud-based scalable environment of CycleCloud, Slurm, Nextflow and Docker. Additionally, it has been validated with the SMC-HET dataset and reference GRCh38, where paired-end fastq’s were created with BAM Surgeon.
Expand Down Expand Up @@ -109,8 +122,6 @@ python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py \

## Flow Diagram

A directed acyclic graph of your pipeline.

>Following alignment, processes are run separately for each aligner used.
![align-DNA flow diagram](docs/dna-align-flow.svg?raw=true)
Expand All @@ -119,27 +130,27 @@ A directed acyclic graph of your pipeline.

## Pipeline Steps

### 1A. Alignment
### 1. Alignment

The first step of the pipeline utilizes [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [HISAT2](http://daehwankimlab.github.io/hisat2/main) to align paired reads. BWA-MEM2 is the successor for the well-known aligner BWA. The bwa-mem2 mem command utilizes the `-M` option for marking shorter splits as secondary. This allows for compatibility with Picard Tools in downstream process and in particular prevents the underlying library of Picard Tools from recognizing these splits as duplicate reads (read names). Additionally, the `-t` option is utilized to increase the number of threads used for alignment. The number of threads used in this step is by default to allow at least 2.5Gb memory per CPU, because of the large memory usage by BWA-MEM2. This can be overwritten by setting the `bwa_mem_number_of_cpus` parameter from the config file.

### 1B. Convert Align SAM File to BAM Format
### 2. Convert Align SAM File to BAM Format

SAMtools `view` command is used to convert the aligned SAM files into the compressed BAM format. The SAMtools `view` command utilizes the `-S` option for increasing the speed by removing duplicates and outputs the reads as they are ordered in the file. Additionally, the `-b` option ensures the output is in BAM format and the `-@` option is utilized to increase the number of threads.

### 2. Sort BAM Files in Coordinate or Queryname Order
### 3. Sort BAM Files in Coordinate or Queryname Order

The next step of the pipeline utilizes SAMtools `sort` command to sort the aligned BAM files in coordinate order or queryname order that is needed for downstream duplicate marking tools. Specifically, the `sort_order` option is utilized to ensure the file is sorted in coordinate order for Picard or queryname order for Spark.

For certain use-cases the pipeline may be configured to stop after this step using the `mark_duplicates=false` parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (like our custom Proseq-G Prostate panel). High coverage target enrichment sequencing (like Illumina's [protocol](https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/target-enrichment.html)) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates will severely reduce coverage, and it is recommended that the pipeline be configured to not mark duplicates in this case.

## 3. Mark Duplicates in BAM Files
### 4. Mark Duplicates in BAM Files

If `mark_duplicates=true` then the next step of the pipeline utilizes Picard Tool’s `MarkDuplicates` command to mark duplicates in the BAM files. The `MarkDuplicates` command utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. Additionally, the `Program_Record_Id` is set to `MarkDuplicates`.

A faster Spark implementation of `MarkDuplicates` can also be used (`MarkDuplicatesSpark` from GATK). The process matches the output of Picard's `MarkDuplicates` with significant runtime improvements. An important note, however, the Spark version requires more disk space and can fail with large inputs with multiple aligners being specified due to insufficient disk space. In such cases, Picard's `MarkDuplicates` should be used instead.

## 4. Index BAM Files
### 5. Index BAM Files

After marking duplicated reads in BAM files, the BAM files are then indexed by using `--CREATE_INDEX true` for Picard's `MarkDuplicates`, or `--create-output-bam-index` for `MarkDuplicatesSpark`. This utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible.

Expand Down Expand Up @@ -217,7 +228,7 @@ After marking duplicated reads in BAM files, the BAM files are then indexed by u

This pipeline was tested using the synthesized SMC-HET dataset as well as a multi-lane real sample CPCG0196-B1, using reference genome version GRCh38. Some benchmarking has been done comparing BWA-MEM2 v2.1, v2.0, and the original BWA. BWA-MEM2 is able to reduce approximately half of the runtime comparing to the original BWA, with the output BAM almost identical. See [here](docs/benchmarking.rst) for the benchmarking.

### Validation <version number\>
### Validation <10.0.0>

| metric | Result |
|:-------|:-------|
Expand Down Expand Up @@ -253,6 +264,10 @@ This pipeline was tested using the synthesized SMC-HET dataset as well as a mult
| pairs with other orientation | 0.9999726 |
| pairs on different chromosomes | 1.0000416 |

### Validation Tool

Included is a template for validating your input files. For more information on the tool check out the following link: https://github.com/uclahs-cds/package-PipeVal

---

## References
Expand All @@ -263,15 +278,29 @@ Daehwan Kim, Ben Langmead, Steven L Salzberg. HISAT: a fast spliced aligner with

---

## Discussions

- [Issue tracker](https://github.com/uclahs-cds/pipeline-align-DNA/issues/) to report errors and enhancement ideas.
- Discussions can take place in [align-DNA discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/)
- [align-DNA Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion.

---

## Contributors

Please see list of [Contributors](https://github.com/uclahs-cds/pipeline-align-DNA/graphs/contributors) at GitHub.

---

## License

Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected])
Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected]), Joseph Salmingo ([email protected])

Align-DNA is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.

Align-DNA aligned paired-end reads using the BWA-MEM2 and/or HISAT2 aligners.

Copyright (C) 2020-2022 University of California Los Angeles ("Boutros Lab") All rights reserved.
Copyright (C) 2020-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Expand Down

0 comments on commit 4cc44ce

Please sign in to comment.