Commit b8f4c6c

Merge pull request #56 from genepi/code_refactoring
Code refactoring
2 parents bae68d0 + a886f08 commit b8f4c6c

47 files changed: +500 -1110 lines changed

CHANGELOG.md

+2 -2
@@ -1,8 +1,8 @@
-# ecSeq/DNAseq
+# genepi/umi-pipeline-nf
 ---
 # Releases

 ---
 # Prereleases
 ## v0.1.0 -
-* Initialised repo
+* Initialised repo

README.md

+30 -13
@@ -1,18 +1,34 @@
-[<img width="200" align="right" src="docs/images/ecseq.jpg">](https://www.ecseq.com)
 [![Nextflow](https://img.shields.io/badge/nextflow-20.07.1-brightgreen.svg)](https://www.nextflow.io/)
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/)
-[![Docker](https://img.shields.io/docker/automated/ecseq/dnaseq.svg)](https://hub.docker.com/r/ecseq/dnaseq)

-umi-pipeline-nf Pipeline
+Umi-pipeline-nf
 ======================

-**umi-pipeline-nf** is based on a [snakemake pipeline](https://github.com/nanoporetech/pipeline-umi-amplicon) provided by [Oxford Nanopore Technologies (ONT)](https://nanoporetech.com/). To increase efficiency and usability the pipeline was transferred to [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation simple and results highly reproducible.
+**Umi-pipeline-nf** creates highly accurate single-molecule consensus sequences for unique molecular identifier (UMI)-tagged amplicon data.
+The pipeline can be run for the whole fastq_pass folder of your nanopore run and, per default, outputs the aligned consensus sequences of each UMI cluster in bam file. The optional variant calling creates a vcf file for all variants that are found in the consensus sequences.
+umi-pipeline-nf is based on the snakemake [ONT UMI analysis pipeline](https://github.com/nanoporetech/pipeline-umi-amplicon) (workflow originally developed by [Karst et al, Nat Methods 18:165–169, 2021](https://www.nature.com/articles/s41592-020-01041-y)). We transferred the pipeline to [Nextflow](https://www.nextflow.io) and included [additional functionalities](#main-adaptations).

-## Overview
-`umi-pipeline-nf` creates highly accurate single-molecule consensus sequences based on amplicon data tagged by unique molecular identifiers (UMIs). The pipeline can be run for the whole fastq_pass folder of your nanopore run and per default, the output are the aligned consensus sequences in bam file format.
-Additional flags can be set to perform a variant calling ( [freebayes](https://github.com/freebayes/freebayes), [lofreq](http://csb5.github.io/lofreq/) or [mutserve](https://mitoverse.readthedocs.io/mutserve/mutserve/) )
+## Workflow

-> See the [output documentation](docs/output.md) for more details of the results.
+1. Input reads are aligned against a reference genome.
+2. The flanking UMI sequences of all reads are extracted.
+3. The extracted UMIs are used to cluster the reads.
+4. Per cluster, highly accurate consensus sequences are created.
+5. The consensus sequences are aligned against the reference sequence.
+6. An optional variant calling step can be performed.
+
+> See the [output documentation](docs/output.md) for a detailed overview of the pipeline and its output files.
+
+## Main Adaptations
+
+* It comes with docker containers making **installation simple, portable** and **results highly reproducible**.
+* The pipeline is **optimized for parallelization**.
+* Read filtering strategy per UMI cluster was adapted to **preserve the highest quality reads**.
+* **Three commonly used variant callers** ([freebayes](https://github.com/freebayes/freebayes), [lofreq](http://csb5.github.io/lofreq/) or [mutserve](https://mitoverse.readthedocs.io/mutserve/mutserve/)) are supported by the pipeline.
+* The raw reads can be optionally **subsampled**.
+* The raw reads can be **filtered by read length and quality**.
+
+> See the [usage documentation](docs/usage.md) for all of the available parameters of the pipeline.


 ## Quick Start

@@ -21,19 +37,20 @@ Additional flags can be set to perform a variant calling ( [freebayes](https://g
 2. Download the pipeline and test it on a minimal dataset with a single command

 ```bash
-nextflow run AmstlerStephan/umi-pipeline-nf -profile test,docker
+nextflow run genepi/umi-pipeline-nf -profile test,docker
 ```

 3. Start running your own analysis!
 3.1 Download and adapt the config/custom.config with paths to your data (relative and absolute paths possible)

 ```bash
-nextflow run AmstlerStephan/umi-pipeline-nf -r main -c <custom.config> -profile docker
+nextflow run genepi/umi-pipeline-nf -r main -c <custom.config> -profile docker
 ```

-> See the [usage documentation](docs/usage.md) for all of the available options when running the pipeline.
-

 ### Credits

-These scripts were originally written for use by [GENEPI](https://genepi.i-med.ac.at/), by ([@StephanAmstler](https://github.com/AmstlerStephan)).
+The pipeline was written by ([@StephanAmstler](https://github.com/AmstlerStephan)).
+Nextflow template pipeline: [EcSeq](https://github.com/ecSeq).
+Original snakemake-based pipeline: [nanoporetech/pipeline-umi-amplicon](https://github.com/nanoporetech/pipeline-umi-amplicon).
+Original workflow: [SorenKarst/longread_umi](https://github.com/SorenKarst/longread_umi).
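
Reviewer note: steps 3 and 4 of the new README workflow (cluster reads by UMI, then build one consensus per cluster) can be illustrated with a toy sketch. This is not the pipeline's implementation; the clustering and consensus tools it actually uses are not shown in this diff. The sketch assumes exact UMI matches and equal-length reads, and the names `group_by_umi` and `majority_consensus` are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs by exact UMI match (toy assumption)."""
    clusters = defaultdict(list)
    for umi, seq in reads:
        clusters[umi].append(seq)
    return dict(clusters)


def majority_consensus(seqs):
    """Column-wise majority vote; assumes all sequences have equal length."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Made-up reads: the third read of the first cluster carries one error (G -> C).
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACCTAC"),
    ("CCTA", "TTGACA"),
    ("CCTA", "TTGACA"),
]

consensus = {
    umi: majority_consensus(seqs) for umi, seqs in group_by_umi(reads).items()
}
```

In this toy data the single-read error in the first cluster is outvoted by the two agreeing reads, which is the basic reason per-UMI consensus sequences can be more accurate than individual reads.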

bin/bam_to_phred.py

-149
This file was deleted.

bin/extract_umis.py

+26 -4
@@ -52,13 +52,26 @@ def parse_args(argv):
         help="Length of adapter",
     )
     parser.add_argument(
-        "-t", "--threads", dest="THREADS", type=int, default=1, help="Number of threads."
+        "-t",
+        "--threads",
+        dest="THREADS",
+        type=int,
+        default=1,
+        help="Number of threads."
     )
     parser.add_argument(
-        "--tsv", dest="TSV", action="store_true", help="write TSV output file"
+        "--tsv",
+        dest="TSV",
+        action="store_true",
+        help="write TSV output file"
     )
     parser.add_argument(
-        "-o", "--output", dest="OUT", type=str, required=False, help="Output directory"
+        "-o",
+        "--output",
+        dest="OUT",
+        type=str,
+        required=False,
+        help="Output directory"
     )
     parser.add_argument(
         "--output_format",
@@ -82,7 +95,10 @@ def parse_args(argv):
         help="Reverse UMI sequence",
     )
     parser.add_argument(
-        "INPUT_FA", type=str, default="/dev/stdin", help="Filtered Reads"
+        "INPUT_FA",
+        type=str,
+        default="/dev/stdin",
+        help="Filtered Reads"
     )

     args = parser.parse_args(argv)
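
Reviewer note: the two hunks above only reflow the `add_argument` calls, so parsing behaviour should be unchanged. The standalone sketch below rebuilds just the four options visible in this diff to show what they parse to. One deliberate difference is labeled in the code: as far as the hunk shows, `INPUT_FA` is declared without `nargs`, in which case argparse treats the positional as required and the `/dev/stdin` default would never apply. The sketch adds `nargs="?"` as an assumption about the intent. `build_parser` is a hypothetical helper, not part of `extract_umis.py`.

```python
import argparse


def build_parser():
    """Hypothetical stand-in containing only the options shown in the diff."""
    parser = argparse.ArgumentParser(description="Sketch of the extract_umis CLI")
    parser.add_argument(
        "-t", "--threads", dest="THREADS", type=int, default=1,
        help="Number of threads.",
    )
    parser.add_argument(
        "--tsv", dest="TSV", action="store_true", help="write TSV output file",
    )
    parser.add_argument(
        "-o", "--output", dest="OUT", type=str, required=False,
        help="Output directory",
    )
    # Assumption: nargs="?" is NOT in the diff. It is added here so that the
    # "/dev/stdin" default takes effect when no input file is given.
    parser.add_argument(
        "INPUT_FA", type=str, nargs="?", default="/dev/stdin",
        help="Filtered Reads",
    )
    return parser


args = build_parser().parse_args(["-t", "4", "--tsv", "reads.fastq"])
defaults = build_parser().parse_args([])
```

With the explicit arguments, `args.THREADS` is `4` and `args.INPUT_FA` is `"reads.fastq"`; with an empty argument list, `defaults.INPUT_FA` falls back to `"/dev/stdin"` only because of the added `nargs="?"`.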
@@ -109,8 +125,10 @@ def extract_umi(query_seq, query_qual, pattern, max_edit_dist, format):
     edit_dist = result["editDistance"]
     locs = result["locations"][0]
     umi = query_seq[locs[0]:locs[1]+1]
+
     if format == "fastq":
         umi_qual = query_qual[locs[0]:locs[1]+1]
+
     return edit_dist, umi, umi_qual


@@ -123,15 +141,18 @@ def extract_adapters(entry, max_adapter_length, format):
     if len(entry.sequence) > max_adapter_length:
         read_5p_seq = entry.sequence[:max_adapter_length]
         read_3p_seq = entry.sequence[-max_adapter_length:]
+
         if format == "fastq":
             read_5p_qual = entry.quality[:max_adapter_length]
             read_3p_qual = entry.quality[-max_adapter_length:]

     return read_5p_seq, read_3p_seq, read_5p_qual, read_3p_qual

+
 def get_read_name(entry):
     return entry.name.split(";")[0]

+
 def get_read_strand(entry):
     strand = entry.name.split("strand=")
     if len(strand) > 1:
@@ -140,6 +161,7 @@ def get_read_strand(entry):
     else:
         return "+"

+
 def combine_umis_fasta(seq_5p, seq_3p, strand):
     if strand == "+":
         return seq_5p + seq_3p
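
Reviewer note: the whitespace-only changes above leave the slicing logic in `extract_adapters` and `get_read_name` untouched. The sketch below reproduces that logic in standalone form to show what the slices return. The `Entry` namedtuple is a hypothetical stand-in for the record type the script receives, the `None` initialisation is an assumption (those lines fall outside the hunk), and the parameter is renamed `fmt` here only to avoid shadowing the `format` builtin.

```python
from collections import namedtuple

# Hypothetical record type; the real script gets entries from its FASTA/FASTQ reader.
Entry = namedtuple("Entry", ["name", "sequence", "quality"])


def extract_adapters(entry, max_adapter_length, fmt):
    # Assumption: results default to None when the read is too short.
    read_5p_seq = read_3p_seq = read_5p_qual = read_3p_qual = None
    if len(entry.sequence) > max_adapter_length:
        # First and last max_adapter_length bases hold the flanking adapters/UMIs.
        read_5p_seq = entry.sequence[:max_adapter_length]
        read_3p_seq = entry.sequence[-max_adapter_length:]
        if fmt == "fastq":
            read_5p_qual = entry.quality[:max_adapter_length]
            read_3p_qual = entry.quality[-max_adapter_length:]
    return read_5p_seq, read_3p_seq, read_5p_qual, read_3p_qual


def get_read_name(entry):
    # Everything before the first ";" is treated as the read name.
    return entry.name.split(";")[0]


entry = Entry("read1;strand=-", "AAACCCGGGTTT", "IIIIHHHHGGGG")
result = extract_adapters(entry, 4, "fastq")
short = extract_adapters(Entry("read2", "ACG", "III"), 4, "fastq")
```

Here `result` holds the 4-base prefix and suffix with their quality strings, while the 3-base read in `short` is not longer than `max_adapter_length` and yields `None` for all four values.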

bin/setup.py

-2
@@ -15,7 +15,6 @@
     description='Toolset to work with ONT amplicon sequencing using UMIs',
     zip_safe=False,
     install_requires=[
-        'tqdm',
         'pysam',
         'numpy',
         'pandas',
@@ -32,7 +31,6 @@
             'umi_extract = umi_amplicon_tools.extract_umis:main',
             'umi_reformat_consensus = umi_amplicon_tools.reformat_consensus:main',
             'umi_parse_clusters = umi_amplicon_tools.parse_clusters:main',
-            'umi_bam_to_phred = umi_amplicon_tools.bam_to_phred:main',
             'umi_stats = umi_amplicon_tools.umi_stats:main'
         ]
     },
