
Merge pull request #6 from TalusBio/global-rework
Global rework
ricomnl authored Feb 11, 2022
2 parents df6a8b8 + 4693c2d commit 92426ec
Showing 26 changed files with 539 additions and 293 deletions.
11 changes: 6 additions & 5 deletions .gitignore
@@ -43,9 +43,10 @@ testing/
testing*
*.pyc
experiment-bucket/
metadata-bucket/
mzml-bucket/
raw-bucket/
cache-bucket/
metadata/
mzml/
raw/
input_folder/
.pytest_cache/
.pytest_cache/
store
run.config
86 changes: 77 additions & 9 deletions README.md
@@ -1,11 +1,79 @@
# Nextflow - Encyclopedia
This repository contains the nextflow pipeline for Encyclopedia.
# NextFlow - EncyclopeDIA

# Git Flow
This repository contains Talus' NextFlow pipeline for EncyclopeDIA. It connects three open-source tools (msconvert, EncyclopeDIA, and MSstats) to go from raw mass spectrometry data to quantified peptides and proteins that are ready for statistical analysis.

## Dependencies
To run the pipeline locally, you'll need these dependencies:
- [NextFlow](https://www.nextflow.io/) - we typically install it with conda:

``` sh
conda install -c bioconda nextflow
```

- [Docker](https://www.docker.com/)[^1] - on macOS, it can be installed with [Homebrew](https://brew.sh/):

``` sh
brew install docker
```

[^1]: Note that Docker is only required to actually run the pipeline. Testing with NextFlow process stubs can be done without it.

## Usage
Generally, we launch this pipeline from our [pipeline launcher](https://share.streamlit.io/talusbio/talus-pipeline-launcher/main/apps/pipeline_launcher.py). However, it can also be run locally like any other NextFlow pipeline:

``` sh
nextflow run /path/to/nf-encyclopedia --<parameters>
```

Where `<parameters>` are the pipeline parameters. The pipeline has three required parameters (a full example invocation follows the parameter lists below):

- `ms_file_csv` - A comma-separated values (CSV) file listing the raw mass spectrometry data files. It must have three columns: `file`, `chrlib`, and `group`.
* `file` specifies the path of a raw MS data file.
* `chrlib` is either `true` or `false` and specifies whether the file is part of a chromatogram library ("library files") or used for quantitation ("quant files"), respectively.
  * `group` specifies an experiment group. Quant files will be searched only using library files from the same group. Any group with no library files will be searched directly with the DLIB instead. Additionally, the group specifies a subdirectory in which the pipeline results will be written. An example of such a file would be:
```
file, chrlib, group
data/a.raw, true, x
data/b.raw, true, y
data/c.raw, false, x
data/d.raw, false, y
data/e.raw, false, z
```

- `encyclopedia.fasta` - The FASTA file of protein sequences for EncyclopeDIA to use. This must match the provided DLIB.

- `encyclopedia.dlib` - The spectral library for EncyclopeDIA to use, in the DLIB format.

Other important optional parameters are:

- `aggregate` is either `true` or `false` (default: `false`). When set to `true`, the pipeline performs a single global EncyclopeDIA analysis encompassing all of the quant files. When set to `false`, a separate global EncyclopeDIA analysis is conducted for each group.
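
Putting these together, one way to supply the parameters is through a NextFlow params file. This is only a minimal sketch; the file names below (`params.yml`, `ms_files.csv`, `proteins.fasta`, `spectral_library.dlib`) are placeholders.

``` sh
# Write a params file with placeholder values -- adjust to your own files.
cat > params.yml << 'EOF'
ms_file_csv: ms_files.csv
encyclopedia:
  fasta: proteins.fasta
  dlib: spectral_library.dlib
aggregate: false
EOF

# Launch the pipeline with those parameters.
nextflow run /path/to/nf-encyclopedia -params-file params.yml
```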

## Development
### Running Tests
We use the [pytest](https://docs.pytest.org/en/7.0.x/contents.html) Python package to run our tests. It can be installed with either pip:

```sh
pip install pytest
```

or conda:

``` sh
conda install pytest
```

Once installed, tests can be run from the root directory of the workflow. These tests use the process stubs to test the workflow logic, but do not test the commands for the tools themselves. Run them with:

``` sh
pytest
```
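
The workflow logic can also be exercised by hand with NextFlow's stub mode, which runs each process's `stub:` block instead of the real command. A sketch, reusing the placeholder params file from the usage section above:

``` sh
nextflow run /path/to/nf-encyclopedia -params-file params.yml -stub-run
```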


### Git Flow
We use git flow for features, releases, fixes, etc. Here's an [introductory article](https://jeffkreeftmeijer.com/git-flow/) and a [cheatsheet](https://danielkummer.github.io/git-flow-cheatsheet/index.html).
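
For day-to-day feature work, that typically looks like the following sketch (standard git-flow commands; the feature name is a placeholder):

``` sh
git flow init                       # one-time setup of main/develop branches
git flow feature start my-feature   # branch off develop
# ...commit changes...
git flow feature finish my-feature  # merge back into develop and delete the branch
```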

# Release
### Releases
```
# See existing tags
git tag
@@ -17,8 +85,8 @@ git tag -a v0.0.1 -m "First version"
git push origin "v0.0.1"
```

# Issues
## Command `aws` cannot be found
## Known Issues
### Command `aws` cannot be found
Problem:
When creating the custom AMI, make sure to install the aws-cli outside of the `/usr/` directory; otherwise the Docker mount will overwrite the container's `/usr/` contents and render the Docker image unusable.
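
One way to do that with the AWS CLI v2 bundled installer (a sketch, not necessarily the exact commands used for the Talus AMI):

``` sh
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
# Install outside of /usr/ so Docker mounts cannot overwrite it.
./aws/install --install-dir /home/ec2-user/aws-cli --bin-dir /home/ec2-user/bin
```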

@@ -34,7 +102,7 @@ aws --version

Also mentioned here: https://github.com/nextflow-io/nextflow/issues/2322

## Command 'ps' required by nextflow to collect task metrics cannot be found
### Command 'ps' required by nextflow to collect task metrics cannot be found
Problem:
Nextflow needs certain tools installed on the system to collect metrics: https://www.nextflow.io/docs/latest/tracing.html#tasks.
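
The usual fix is to install the `procps` package (which provides `ps`) in the container image. A minimal sketch of the relevant command, assuming a Debian/Ubuntu-based image:

``` sh
apt-get update && apt-get install -y procps
```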

@@ -49,6 +117,6 @@ RUN apt-get update && \
Also mentioned here: https://github.com/replikation/What_the_Phage/issues/89

# Resources
## Resources
- Running Nextflow on AWS - https://t-neumann.github.io/pipelines/AWS-pipeline/
- Getting started with Nextflow - https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html
- Getting started with Nextflow - https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html
237 changes: 63 additions & 174 deletions main.nf
@@ -1,192 +1,80 @@
#!/usr/bin/env nextflow
include { msconvert as msconvert_narrow } from "./modules/msconvert.nf"
include { msconvert as msconvert_wide } from "./modules/msconvert.nf"
include { unique_peptides_proteins } from "./modules/unique_peptides_proteins.nf"
include { msstats } from "./modules/msstats.nf"

nextflow.enable.dsl = 2

process run_encyclopedia_local {
echo true
publishDir params.publish_dir, mode: "copy"
storeDir params.store_dir

input:
path mzml_gz_file
each path(library_file)
each path(fasta_file)

output:
tuple(
path("${mzml_gz_file.baseName}.elib"),
path("${file(mzml_gz_file.baseName).baseName}.dia"),
path("${mzml_gz_file.baseName}.{features,encyclopedia,encyclopedia.decoy}.txt"),
path("logs/${mzml_gz_file.baseName}.local.log"),
)

script:
"""
mkdir logs
gzip -df ${mzml_gz_file}
java -Djava.awt.headless=true ${params.encyclopedia.memory} \\
-jar /code/encyclopedia-\$VERSION-executable.jar \\
-i ${mzml_gz_file.baseName} \\
-f ${fasta_file} \\
-l ${library_file} \\
${params.encyclopedia.local_options} \\
| tee logs/${mzml_gz_file.baseName}.local.log
"""

stub:
"""
mkdir logs
touch ${mzml_gz_file.baseName}.elib
touch ${file(mzml_gz_file.baseName).baseName}.dia
touch ${mzml_gz_file.baseName}.features.txt
touch ${mzml_gz_file.baseName}.encyclopedia.txt
touch ${mzml_gz_file.baseName}.encyclopedia.decoy.txt
touch logs/${mzml_gz_file.baseName}.local.log
"""
}
// Subworkflows:
include { CONVERT_TO_MZML } from "./subworkflows/msconvert"
include {
BUILD_CHROMATOGRAM_LIBRARY;
PERFORM_QUANT;
PERFORM_GLOBAL_QUANT
} from "./subworkflows/encyclopedia"

process run_encyclopedia_global {
echo true
publishDir params.publish_dir, mode: "copy"
storeDir params.store_dir

input:
path local_files
path mzml_gz_files
path library_file
path fasta_file
val output_postfix

output:
tuple(
path("result-${output_postfix}*.elib"),
path("result-${output_postfix}*.{peptides,proteins}.txt"),
path("logs/result-${output_postfix}*.global.log")
)

script:
"""
mkdir logs
find . -name '*.gz' -exec gzip -df {} \\;
java -Djava.awt.headless=true ${params.encyclopedia.memory} \\
-jar /code/encyclopedia-\$VERSION-executable.jar \\
-libexport \\
-o result-${output_postfix}.elib \\
-i ./ \\
-f ${fasta_file} \\
-l ${library_file} \\
${params.encyclopedia.global_options} \\
| tee logs/result-${output_postfix}.global.log
"""

stub:
def stem = "result-${output_postfix}"
"""
mkdir logs
touch ${stem}.elib
touch ${stem}.peptides.txt
touch ${stem}.proteins.txt
touch logs/${stem}.global.log
"""
}

workflow encyclopedia_narrow {
take:
mzml_gz_files
dlib
fasta
main:
run_encyclopedia_local(mzml_gz_files, dlib, fasta)
| flatten
| collect
| set { narrow_local_files }

run_encyclopedia_global(
narrow_local_files,
mzml_gz_files | collect,
dlib,
fasta,
params.encyclopedia.narrow_lib_postfix,
)
| flatten
| filter { it.name =~ /.*elib$/ }
| set { narrow_elib }
emit:
narrow_elib
def replace_missing_elib(elib) {
// Use the DLIB when the ELIB is unavailable.
if (elib == null) {
return file(params.encyclopedia.dlib)
}
return elib
}

workflow encyclopedia_wide {
take:
mzml_gz_files
elib
fasta
main:
// Run encyclopedia for all local files
run_encyclopedia_local(mzml_gz_files, elib, fasta)
.flatten()
.tap { wide_local_files }
| filter { it.name =~ /.*mzML.elib$/ }
| collect
| unique_peptides_proteins

// Use the local .elib's as an input to the global run
run_encyclopedia_global(
wide_local_files | collect,
mzml_gz_files | collect,
elib,
fasta,
params.encyclopedia.wide_lib_postfix
)
| flatten
| filter { it.name =~ /.*(elib|txt)$/ }
| set { output_files }
emit:
output_files
}

workflow {
// Get .fasta and .dlib from metadata-bucket
fasta = Channel.fromPath(params.encyclopedia.fasta, checkIfExists: true)
dlib = Channel.fromPath(params.encyclopedia.dlib, checkIfExists: true)
fasta = Channel.fromPath(params.encyclopedia.fasta, checkIfExists: true).first()
dlib = Channel.fromPath(params.encyclopedia.dlib, checkIfExists: true).first()

// Get the narrow and wide files:
narrow_files = Channel
.fromPath(params.narrow_files, checkIfExists: true)
.splitCsv()
.map { row -> file(row[0]) }

wide_files = Channel
.fromPath(params.wide_files, checkIfExists: true)
.splitCsv()
.map { row -> file(row[0]) }

if ( !narrow_files && !wide_files ) {
error "No raw files were given. Nothing to do."
ms_files = Channel
.fromPath(params.ms_file_csv, checkIfExists: true)
.splitCsv(header: true, strip: true)
.multiMap { it ->
runs: it.file
meta: tuple it.file, it.chrlib.toBoolean(), it.group
}

if ( !ms_files.runs ) {
error "No MS data files were given. Nothing to do."
}

// Convert raw files to gzipped mzML.
narrow_files | msconvert_narrow | set { narrow_mzml_files }
wide_files | msconvert_wide | set { wide_mzml_files }

// Build a chromatogram library with EncyclopeDIA
encyclopedia_narrow(narrow_mzml_files, dlib, fasta)

    // If no narrow files are given, use the dlib instead.
encyclopedia_narrow.out
.ifEmpty(file(params.encyclopedia.dlib))
.set { chr_elib }

// Perform quant runs on wide window files.
encyclopedia_wide(wide_mzml_files, chr_elib, fasta)

if (params.use_msstats) {
encyclopedia_wide.out
| filter { it.name =~ /.*quant.elib.peptides.txt$/ }
| msstats
// Convert raw files to gzipped mzML and group them by experiment.
// The chrlib and quant channels take the following form:
// [[file_ids], [mzml_gz_files], is_chrlib, group]
CONVERT_TO_MZML(ms_files.runs)
| join(ms_files.meta)
| groupTuple(by: [2, 3])
| branch {
chrlib: it[2]
quant: !it[2]
}
| set { mzml_gz_files }

// Build chromatogram libraries with EncyclopeDIA:
// The output is [group, elib]
BUILD_CHROMATOGRAM_LIBRARY(mzml_gz_files.chrlib, dlib, fasta)
| set { chrlib_elib_files }

    // Group quant files with their corresponding library ELIB.
// If none exists, use the DLIB.
// The output is [group, [quant_mzml_gz_files], elib_file]
mzml_gz_files.quant
| map { tuple it[3], it[1] }
| join(chrlib_elib_files, remainder: true)
| map { tuple it[0], it[1], replace_missing_elib(it[2]) }
| set { quant_files }

// Analyze the quantitative runs with EncyclopeDIA.
    // The output has three channels:
// local -> [group, [local_elib_files], [mzml_gz_files]]
// global -> [group, global_elib, peptides, proteins] or null
// msstats -> [group, input_csv, feature_csv]
PERFORM_QUANT(quant_files, dlib, fasta, params.aggregate)
| set { quant_results }

    // Perform a global analysis on all files if needed:
if ( params.aggregate ) {
PERFORM_GLOBAL_QUANT(quant_results.local, dlib, fasta)
}
}

@@ -204,4 +92,5 @@ workflow.onError {
subject: "Error: ${params.experimentName} failed.",
body: "Experiment run ${params.experimentName} using Encyclopedia failed.",
)

}