From e4151549ed38580dda814fdd584cc402f064df76 Mon Sep 17 00:00:00 2001 From: Joseph Salmingo <134560406+j2salmingo@users.noreply.github.com> Date: Fri, 26 Apr 2024 21:48:10 -0700 Subject: [PATCH 1/4] Update README.md Convert half to new template --- README.md | 45 +++++++++++++++++++++++++++------------------ 1 file changed, 27 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 0c410ee..ed059d7 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,25 @@ -# Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads - -1. [Overview](#Overview) -2. [How To Run](#How-To-Run) -3. [Flow Diagram](#Flow-Diagram) -4. [Pipeline Steps](#Pipeline-Steps) -5. [Inputs](#Inputs) -5. [Outputs](#Outputs) -6. [Testing and Validation](#Testing-and-Validation) -7. [References](#References) - +# Pipeline-align-DNA +Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads + +- [Pipeline-align-DNA](#Pipeline-align-DNA) + - [Overview](#Overview) + - [How To Run](#How-To-Run) + - [Flow Diagram](#Flow-Diagram) + - [Pipeline Steps](#Pipeline-Steps) + - [1. Alignment](#1-) + - [2. Convert to BAM Format](#2-stepproccess-2) + - [3. Step/Proccess n](#3-stepproccess-n) + - [Inputs](#Inputs) + - [Outputs](#Outputs) + - [Testing and Validation](#Testing-and-Validation) + - [Test Data Set](#test-data-set) + - [Validation ](#validation-version-number) + - [Validation Tool](#validation-tool) + - [References](#References) + - [Discussions](#Discussions) + - [Contributors](#Contributors) + - [License](#License) + ## Overview The align-DNA Nextflow pipeline, aligns paired-end data utilizing [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) and/or [HISAT2](http://daehwankimlab.github.io/hisat2/main), [Picard](https://github.com/broadinstitute/picard) Tools and [SAMtools](https://github.com/samtools/samtools). The pipeline has been engineered to run in a 4 layer stack in a cloud-based scalable environment of CycleCloud, Slurm, Nextflow and Docker. Additionally, it has been validated with the SMC-HET dataset and reference GRCh38, where paired-end fastq’s were created with BAM Surgeon. @@ -109,8 +120,6 @@ python path/to/pipeline-align-DNA/script/write_dna_align_config_file.py \ ## Flow Diagram -A directed acyclic graph of your pipeline. - >Following alignment, processes are run separately for each aligner used. ![align-DNA flow diagram](docs/dna-align-flow.svg?raw=true) @@ -119,27 +128,27 @@ A directed acyclic graph of your pipeline. ## Pipeline Steps -### 1A. Alignment +### 1. Alignment The first step of the pipeline utilizes [BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [HISAT2](http://daehwankimlab.github.io/hisat2/main) to align paired reads. BWA-MEM2 is the successor for the well-known aligner BWA. The bwa-mem2 mem command utilizes the `-M` option for marking shorter splits as secondary. This allows for compatibility with Picard Tools in downstream process and in particular prevents the underlying library of Picard Tools from recognizing these splits as duplicate reads (read names). Additionally, the `-t` option is utilized to increase the number of threads used for alignment. The number of threads used in this step is by default to allow at least 2.5Gb memory per CPU, because of the large memory usage by BWA-MEM2. This can be overwritten by setting the `bwa_mem_number_of_cpus` parameter from the config file. -### 1B. Convert Align SAM File to BAM Format +### 2. Convert Align SAM File to BAM Format SAMtools `view` command is used to convert the aligned SAM files into the compressed BAM format. The SAMtools `view` command utilizes the `-S` option for increasing the speed by removing duplicates and outputs the reads as they are ordered in the file. Additionally, the `-b` option ensures the output is in BAM format and the `-@` option is utilized to increase the number of threads. -### 2. Sort BAM Files in Coordinate or Queryname Order +### 3. Sort BAM Files in Coordinate or Queryname Order The next step of the pipeline utilizes SAMtools `sort` command to sort the aligned BAM files in coordinate order or queryname order that is needed for downstream duplicate marking tools. Specifically, the `sort_order` option is utilized to ensure the file is sorted in coordinate order for Picard or queryname order for Spark. For certain use-cases the pipeline may be configured to stop after this step using the `mark_duplicates=false` parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (like our custom Proseq-G Prostate panel). High coverage target enrichment sequencing (like Illumina's [protocol](https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/target-enrichment.html)) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates will severely reduce coverage, and it is recommended that the pipeline be configured to not mark duplicates in this case. -## 3. Mark Duplicates in BAM Files +## 4. Mark Duplicates in BAM Files If `mark_duplicates=true` then the next step of the pipeline utilizes Picard Tool’s `MarkDuplicates` command to mark duplicates in the BAM files. The `MarkDuplicates` command utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. Additionally, the `Program_Record_Id` is set to `MarkDuplicates`. A faster Spark implementation of `MarkDuplicates` can also be used (`MarkDuplicatesSpark` from GATK). The process matches the output of Picard's `MarkDuplicates` with significant runtime improvements. An important note, however, the Spark version requires more disk space and can fail with large inputs with multiple aligners being specified due to insufficient disk space. In such cases, Picard's `MarkDuplicates` should be used instead. -## 4. Index BAM Files +## 5. Index BAM Files After marking duplicated reads in BAM files, the BAM files are then indexed by using `--CREATE_INDEX true` for Picard's `MarkDuplicates`, or `--create-output-bam-index` for `MarkDuplicatesSpark`. This utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. From f8f3f5388c6da6db8b7bb1a8e38da49af91bfb13 Mon Sep 17 00:00:00 2001 From: Joseph Salmingo <134560406+j2salmingo@users.noreply.github.com> Date: Fri, 26 Apr 2024 22:00:30 -0700 Subject: [PATCH 2/4] Fully Update README.md --- README.md | 40 ++++++++++++++++++++++++++++++---------- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index ed059d7..657a024 100644 --- a/README.md +++ b/README.md @@ -6,15 +6,17 @@ Call DNA Align Nextflow Pipeline for BWA Alignment of Paired-End Reads - [How To Run](#How-To-Run) - [Flow Diagram](#Flow-Diagram) - [Pipeline Steps](#Pipeline-Steps) - - [1. Alignment](#1-) - - [2. Convert to BAM Format](#2-stepproccess-2) - - [3. Step/Proccess n](#3-stepproccess-n) + - [1. Alignment](#1-Alignment) + - [2. Convert Align SAM File to BAM Format](#2-Convert-Align-SAM-File-To-BAM-Format) + - [3. Sort BAM Files in Coordinate or Queryname Order](#3-Sort-BAM-Files-In-Coordinate-Or-Queryname-Order) + - [4. Mark Duplicates in BAM Files](#4-Mark-Duplicates-in-BAM-Files) + - [5. Index BAM Files](#5-Index-BAM-Files) - [Inputs](#Inputs) - [Outputs](#Outputs) - [Testing and Validation](#Testing-and-Validation) - - [Test Data Set](#test-data-set) - - [Validation ](#validation-version-number) - - [Validation Tool](#validation-tool) + - [Test Data Set](#Test-Data-Set) + - [Validation ](#Validation-Version-Number) + - [Validation Tool](#Validation-Tool) - [References](#References) - [Discussions](#Discussions) - [Contributors](#Contributors) @@ -142,13 +144,13 @@ The next step of the pipeline utilizes SAMtools `sort` command to sort the align For certain use-cases the pipeline may be configured to stop after this step using the `mark_duplicates=false` parameter in the config file. This option is intended for datasets generated with targeted sequencing panels (like our custom Proseq-G Prostate panel). High coverage target enrichment sequencing (like Illumina's [protocol](https://www.illumina.com/techniques/sequencing/dna-sequencing/targeted-resequencing/target-enrichment.html)) results in a large amount of read duplication that is not an artifact of PCR amplification. Marking these reads as duplicates will severely reduce coverage, and it is recommended that the pipeline be configured to not mark duplicates in this case. -## 4. Mark Duplicates in BAM Files +### 4. Mark Duplicates in BAM Files If `mark_duplicates=true` then the next step of the pipeline utilizes Picard Tool’s `MarkDuplicates` command to mark duplicates in the BAM files. The `MarkDuplicates` command utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. Additionally, the `Program_Record_Id` is set to `MarkDuplicates`. A faster Spark implementation of `MarkDuplicates` can also be used (`MarkDuplicatesSpark` from GATK). The process matches the output of Picard's `MarkDuplicates` with significant runtime improvements. An important note, however, the Spark version requires more disk space and can fail with large inputs with multiple aligners being specified due to insufficient disk space. In such cases, Picard's `MarkDuplicates` should be used instead. -## 5. Index BAM Files +### 5. Index BAM Files After marking duplicated reads in BAM files, the BAM files are then indexed by using `--CREATE_INDEX true` for Picard's `MarkDuplicates`, or `--create-output-bam-index` for `MarkDuplicatesSpark`. This utilizes the `VALIDATION_STRINGENCY=LENIENT` option to indicate how errors should be handled and keep the process running if possible. @@ -262,6 +264,10 @@ This pipeline was tested using the synthesized SMC-HET dataset as well as a mult | pairs with other orientation | 0.9999726 | | pairs on different chromosomes | 1.0000416 | +### Validation Tool + +Included is a template for validating your input files. For more information on the tool check out the following link: https://github.com/uclahs-cds/package-PipeVal + --- ## References @@ -272,15 +278,29 @@ Daehwan Kim, Ben Langmead, Steven L Salzberg. HISAT: a fast spliced aligner with --- +## Discussions + +- [Issue tracker](https://github.com/uclahs-cds/pipeline-align-DNA/issues/) to report errors and enhancement ideas. +- Discussions can take place in [ discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/) +- [ Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion. + +--- + +## Contributors + +Please see list of [Contributors](https://github.com/uclahs-cds/pipeline-align-DNA/graphs/contributors) at GitHub. + +--- + ## License -Authors: Benjamin Carlin, Chenghao Zhu (ChenghaoZhu@mednet.ucla.edu), Aaron Holmes (AHolmes@mednet.ucla.edu), Takafumi Yamaguchi (TYamaguchi@mednet.ucla.edu), Aakarsh Anand (AakarshAnand@mednet.ucla.edu), Yash Patel (YashPatel@mednet.ucla.edu) +Authors: Benjamin Carlin, Chenghao Zhu (ChenghaoZhu@mednet.ucla.edu), Aaron Holmes (AHolmes@mednet.ucla.edu), Takafumi Yamaguchi (TYamaguchi@mednet.ucla.edu), Aakarsh Anand (AakarshAnand@mednet.ucla.edu), Yash Patel (YashPatel@mednet.ucla.edu), Joseph Salmingo (JSalmingo@mednet.ucla.edu) Align-DNA is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license. Align-DNA aligned paired-end reads using the BWA-MEM2 and/or HISAT2 aligners. -Copyright (C) 2020-2022 University of California Los Angeles ("Boutros Lab") All rights reserved. +Copyright (C) 2020-2024 University of California Los Angeles ("Boutros Lab") All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. From 56aa21af8e5a1b85adb28251a873673dccaa71fa Mon Sep 17 00:00:00 2001 From: Joseph Salmingo <134560406+j2salmingo@users.noreply.github.com> Date: Fri, 26 Apr 2024 22:04:29 -0700 Subject: [PATCH 3/4] Update links in README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 657a024..c474e5a 100644 --- a/README.md +++ b/README.md @@ -228,7 +228,7 @@ After marking duplicated reads in BAM files, the BAM files are then indexed by u This pipeline was tested using the synthesized SMC-HET dataset as well as a multi-lane real sample CPCG0196-B1, using reference genome version GRCh38. Some benchmarking has been done comparing BWA-MEM2 v2.1, v2.0, and the original BWA. BWA-MEM2 is able to reduce approximately half of the runtime comparing to the original BWA, with the output BAM almost identical. See [here](docs/benchmarking.rst) for the benchmarking. -### Validation +### Validation <10.0.0> | metric | Result | |:-------|:-------| @@ -281,8 +281,8 @@ Daehwan Kim, Ben Langmead, Steven L Salzberg. HISAT: a fast spliced aligner with ## Discussions - [Issue tracker](https://github.com/uclahs-cds/pipeline-align-DNA/issues/) to report errors and enhancement ideas. -- Discussions can take place in [ discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/) -- [ Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion. +- Discussions can take place in [align-DNA discussions](https://github.com/uclahs-cds/pipeline-align-DNA/discussions/) +- [align-DNA Pull Requests](https://github.com/uclahs-cds/pipeline-align-DNA/pulls) are also open for discussion. --- From 59e91e03ba94db98533e4311b260b0b67f05991d Mon Sep 17 00:00:00 2001 From: Joseph Salmingo <134560406+j2salmingo@users.noreply.github.com> Date: Tue, 30 Apr 2024 11:06:00 -0700 Subject: [PATCH 4/4] Update CHANGELOG.md --- CHANGELOG.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index d1b7f60..a383646 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,9 +11,8 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm ### Changed - Update Picard version to 3.1.1 - Update BWA-MEM2, HISAT2 images to use SAMTools version 1.17 - -### [Changed] - Update Nextflow configuration test workflows +- Update README.md to match template ## [10.0.0] - 2024-03-29 ### Added