Merge pull request #309 from uclahs-cds/jsalmingo-readme-consistency

Fix README consistency
uclahs-cds · Apr 30, 2024 · 4cc44ce · 4cc44ce
2 parents a6769c3 + 59e91e0
commit 4cc44ce
Show file tree

Hide file tree

Showing 2 changed files with 51 additions and 23 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,9 +11,8 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 ### Changed
 - Update Picard version to 3.1.1
 - Update BWA-MEM2, HISAT2 images to use SAMTools version 1.17
-
-### [Changed]
 - Update Nextflow configuration test workflows
+- Update README.md to match template
 
 ## [10.0.0] - 2024-03-29
 ### Added

diff --git a/README.md b/README.md
@@ -1,14 +1,27 @@
-# Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads
-
-1. [Overview](#Overview)
-2. [How To Run](#How-To-Run)
-3. [Flow Diagram](#Flow-Diagram)
-4. [Pipeline Steps](#Pipeline-Steps)
-5. [Inputs](#Inputs)
-5. [Outputs](#Outputs)
-6. [Testing and Validation](#Testing-and-Validation)
-7. [References](#References)
-
+# Pipeline-align-DNA
+Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads
+
+- [Pipeline-align-DNA](#Pipeline-align-DNA)
+  - [Overview](#Overview)
+  - [How To Run](#How-To-Run)
+  - [Flow Diagram](#Flow-Diagram)
+  - [Pipeline Steps](#Pipeline-Steps)
+    - [1. Alignment](#1-Alignment)
+    - [2. Convert Align SAM File to BAM Format](#2-Convert-Align-SAM-File-To-BAM-Format)
+    - [3. Sort BAM Files in Coordinate or Queryname Order](#3-Sort-BAM-Files-In-Coordinate-Or-Queryname-Order)
+    - [4. Mark Duplicates in BAM Files](#4-Mark-Duplicates-in-BAM-Files)
+    - [5. Index BAM Files](#5-Index-BAM-Files)
+  - [Inputs](#Inputs)
+  - [Outputs](#Outputs)
+  - [Testing and Validation](#Testing-and-Validation)
+    - [Test Data Set](#Test-Data-Set)
+    - [Validation <version number\>](#Validation-Version-Number)
+    - [Validation Tool](#Validation-Tool)
+  - [References](#References)
+  - [Discussions](#Discussions)
+  - [Contributors](#Contributors)
+  - [License](#License)
+
 ## Overview
 
 The align-DNA Nextflow pipeline, aligns paired-end data utilizing [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) and/or [HISAT2](http://daehwankimlab.github.io/hisat2/main), [Picard](https://github.com/broadinstitute/picard) Tools and [SAMtools](https://github.com/samtools/samtools). The pipeline has been engineered to run in a 4 layer stack in a cloud-based scalable environment of CycleCloud, Slurm, Nextflow and Docker. Additionally, it has been validated with the SMC-HET dataset and reference GRCh38, where paired-end fastq’s were created with BAM Surgeon.
@@ -109,8 +122,6 @@ python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py \
 
 ## Flow Diagram
 
-A directed acyclic graph of your pipeline.
-
 >Following alignment, processes are run separately for each aligner used.
 
 ![align-DNA flow diagram](docs/dna-align-flow.svg?raw=true)
@@ -119,27 +130,27 @@ A directed acyclic graph of your pipeline.
 
 ## Pipeline Steps
 
-### 1A. Alignment
+### 1. Alignment
 
 The first step of the pipeline utilizes [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [HISAT2](http://daehwankimlab.github.io/hisat2/main) to align paired reads. BWA-MEM2 is the successor for the well-known aligner BWA. The bwa-mem2 mem command utilizes the `-M` option for marking shorter splits as secondary. This allows for compatibility with Picard Tools in downstream process and in particular prevents the underlying library of Picard Tools from recognizing these splits as duplicate reads (read names). Additionally, the `-t` option is utilized to increase the number of threads used for alignment. The number of threads used in this step is by default to allow at least 2.5Gb memory per CPU, because of the large memory usage by BWA-MEM2. This can be overwritten by setting the `bwa_mem_number_of_cpus` parameter from the config file.
 
-### 1B. Convert Align SAM File to BAM Format
+### 2. Convert Align SAM File to BAM Format
 
 SAMtools `view` command is used to convert the aligned SAM files into the compressed BAM format. The SAMtools `view` command utilizes the `-S` option for increasing the speed by removing duplicates and outputs the reads as they are ordered in the file.  Additionally, the `-b` option ensures the output is in BAM format and the `-@` option is utilized to increase the number of threads.
 
-### 2. Sort BAM Files in Coordinate or Queryname Order
+### 3. Sort BAM Files in Coordinate or Queryname Order
 
 The next step of the pipeline utilizes SAMtools `sort` command to sort the aligned BAM files in coordinate order or queryname order that is needed for downstream duplicate marking tools. Specifically, the `sort_order` option is utilized to ensure the file is sorted in coordinate order for Picard or queryname order for Spark.
 
 For certain use-cases the pipeline may be configured to stop after this step using the `mark_duplicates=false` parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (like our custom Proseq-G Prostate panel). High coverage target enrichment sequencing (like Illumina's [protocol](https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/target-enrichment.html)) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates will severely reduce coverage, and it is recommended that the pipeline be configured to not mark duplicates in this case.
 
-## 3. Mark Duplicates in BAM Files
+### 4. Mark Duplicates in BAM Files
 
 If `mark_duplicates=true` then the next step of the pipeline utilizes Picard Tool’s `MarkDuplicates` command to mark duplicates in the BAM files. The `MarkDuplicates` command utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. Additionally, the `Program_Record_Id` is set to `MarkDuplicates`.
 
 A faster Spark implementation of `MarkDuplicates` can also be used (`MarkDuplicatesSpark` from GATK). The process matches the output of Picard's `MarkDuplicates` with significant runtime improvements. An important note, however, the Spark version requires more disk space and can fail with large inputs with multiple aligners being specified due to insufficient disk space. In such cases, Picard's `MarkDuplicates` should be used instead.
 
-## 4. Index BAM Files
+### 5. Index BAM Files
 
 After marking duplicated reads in BAM files, the BAM files are then indexed by using `--CREATE_INDEX true` for Picard's `MarkDuplicates`, or `--create-output-bam-index` for `MarkDuplicatesSpark`. This utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible.
 
@@ -217,7 +228,7 @@ After marking duplicated reads in BAM files, the BAM files are then indexed by u
 
 This pipeline was tested using the synthesized SMC-HET dataset as well as a multi-lane real sample CPCG0196-B1, using reference genome version GRCh38. Some benchmarking has been done comparing BWA-MEM2 v2.1, v2.0, and the original BWA. BWA-MEM2 is able to reduce approximately half of the runtime comparing to the original BWA, with the output BAM almost identical. See [here](docs/benchmarking.rst) for the benchmarking.
 
-### Validation <version number\>
+### Validation <10.0.0>
 
 | metric | Result |
 |:-------|:-------|
@@ -253,6 +264,10 @@ This pipeline was tested using the synthesized SMC-HET dataset as well as a mult
 | pairs with other orientation | 0.9999726 |
 | pairs on different chromosomes | 1.0000416 |
 
+### Validation Tool
+
+Included is a template for validating your input files. For more information on the tool check out the following link: https://github.com/uclahs-cds/package-PipeVal
+
 ---
 
 ## References
@@ -263,15 +278,29 @@ Daehwan Kim, Ben Langmead, Steven L Salzberg. HISAT: a fast spliced aligner with
 
 ---
 
+## Discussions 
+
+- [Issue tracker](https://github.com/uclahs-cds/pipeline-align-DNA/issues/) to report errors and enhancement ideas.
+- Discussions can take place in [align-DNA discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/)
+- [align-DNA Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion.
+
+---
+
+## Contributors
+
+Please see list of [Contributors](https://github.com/uclahs-cds/pipeline-align-DNA/graphs/contributors) at GitHub.
+
+---
+
 ## License
 
-Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected])
+Authors: Benjamin Carlin, Chenghao Zhu ([email protected]), Aaron Holmes ([email protected]), Takafumi Yamaguchi ([email protected]), Aakarsh Anand ([email protected]), Yash Patel ([email protected]), Joseph Salmingo ([email protected])
 
 Align-DNA is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
 
 Align-DNA aligned paired-end reads using the BWA-MEM2 and/or HISAT2 aligners.
 
-Copyright (C) 2020-2022 University of California Los Angeles ("Boutros Lab") All rights reserved.
+Copyright (C) 2020-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
 
 This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.