
Merge pull request #6 from TalusBio/global-rework
Global rework
ricomnl authored Feb 11, 2022
2 parents df6a8b8 + 4693c2d commit 92426ec
Showing 26 changed files with 539 additions and 293 deletions.
11 changes: 6 additions & 5 deletions .gitignore
@@ -43,9 +43,10 @@ testing/
testing*
*.pyc
experiment-bucket/
metadata-bucket/
mzml-bucket/
raw-bucket/
cache-bucket/
metadata/
mzml/
raw/
input_folder/
.pytest_cache/
.pytest_cache/
store
run.config
86 changes: 77 additions & 9 deletions README.md
@@ -1,11 +1,79 @@
# Nextflow - Encyclopedia
This repository contains the nextflow pipeline for Encyclopedia.
# NextFlow - EncyclopeDIA

# Git Flow
This repository contains Talus' NextFlow pipeline for EncyclopeDIA. It connects three open-source tools (msconvert, EncyclopeDIA, and MSstats) to go from raw mass spectrometry data to quantified peptides and proteins that are ready for statistical analysis.

## Dependencies
To run the pipeline locally, you'll need these dependencies:
- [NextFlow](https://www.nextflow.io/) - we typically install it with conda:

``` sh
conda install -c bioconda nextflow
```

- [Docker](https://www.docker.com/)[^1] - on macOS, it can be installed with [Homebrew](https://brew.sh/):

``` sh
brew install docker
```

[^1]: Note that Docker is only required to actually run the pipeline. Testing with NextFlow process stubs can be done without it.

## Usage
Generally, we launch this pipeline from our [pipeline launcher](https://share.streamlit.io/talusbio/talus-pipeline-launcher/main/apps/pipeline_launcher.py). However, it can also be run locally like any other NextFlow pipeline:

``` sh
nextflow run /path/to/nf-encyclopedia --<parameters>
```

Where `<parameters>` are the pipeline parameters. The pipeline has three required parameters (a full example invocation follows the parameter lists below):

- `ms_file_csv` - A comma-separated values (CSV) file listing the raw mass spectrometry data files. It must have three columns: `file`, `chrlib`, and `group`.
* `file` specifies the path of a raw MS data file.
* `chrlib` is either `true` or `false` and specifies whether the file is part of a chromatogram library ("library files") or used for quantitation ("quant files"), respectively.
  * `group` specifies an experiment group. Quant files will be searched only using library files from the same group. Any group with no library files will be searched directly with the DLIB instead. Additionally, the group specifies a subdirectory in which the pipeline results will be written. An example of such a file would be:
```
file, chrlib, group
data/a.raw, true, x
data/b.raw, true, y
data/c.raw, false, x
data/d.raw, false, y
data/e.raw, false, z
```

- `encyclopedia.fasta` - The FASTA file of protein sequences for EncyclopeDIA to use. This must match the provided DLIB.

- `encyclopedia.dlib` - The spectral library for EncyclopeDIA to use, in the DLIB format.

Other important optional parameters are:

- `aggregate` is either `true` or `false` (default: `false`). When set to `true`, the pipeline performs a single global EncyclopeDIA analysis encompassing all of the quant files. When set to `false`, a separate global EncyclopeDIA analysis is conducted for each group.
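
Putting these together, one way to supply the parameters is through a NextFlow params file. This is only a minimal sketch; the file names below (`params.yml`, `ms_files.csv`, `proteins.fasta`, `spectral_library.dlib`) are placeholders.

``` sh
# Write a params file with placeholder values -- adjust to your own files.
cat > params.yml << 'EOF'
ms_file_csv: ms_files.csv
encyclopedia:
  fasta: proteins.fasta
  dlib: spectral_library.dlib
aggregate: false
EOF

# Launch the pipeline with those parameters.
nextflow run /path/to/nf-encyclopedia -params-file params.yml
```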

## Development
### Running Tests
We use the [pytest](https://docs.pytest.org/en/7.0.x/contents.html) Python package to run our tests. It can be installed with either pip:

```sh
pip install pytest
```

or conda:

``` sh
conda install pytest
```

Once installed, tests can be run from the root directory of the workflow. These tests use the process stubs to test the workflow logic, but do not test the commands for the tools themselves. Run them with:

``` sh
pytest
```
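
The workflow logic can also be exercised by hand with NextFlow's stub mode, which runs each process's `stub:` block instead of the real command. A sketch, reusing the placeholder params file from the usage section above:

``` sh
nextflow run /path/to/nf-encyclopedia -params-file params.yml -stub-run
```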


### Git Flow
We use git flow for features, releases, fixes, etc. Here's an [introductory article](https://jeffkreeftmeijer.com/git-flow/) and a [cheatsheet](https://danielkummer.github.io/git-flow-cheatsheet/index.html).
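
For day-to-day feature work, that typically looks like the following sketch (standard git-flow commands; the feature name is a placeholder):

``` sh
git flow init                       # one-time setup of main/develop branches
git flow feature start my-feature   # branch off develop
# ...commit changes...
git flow feature finish my-feature  # merge back into develop and delete the branch
```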

# Release
### Releases
```
# See existing tags
git tag
@@ -17,8 +85,8 @@ git tag -a v0.0.1 -m "First version"
git push origin "v0.0.1"
```

# Issues
## Command `aws` cannot be found
## Known Issues
### Command `aws` cannot be found
Problem:
When creating the custom AMI, make sure to install the aws-cli outside of the `/usr/` directory; otherwise the Docker mount will overwrite the container's `/usr/` contents and render the Docker image unusable.
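
One way to do that with the AWS CLI v2 bundled installer (a sketch, not necessarily the exact commands used for the Talus AMI):

``` sh
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
# Install outside of /usr/ so Docker mounts cannot overwrite it.
./aws/install --install-dir /home/ec2-user/aws-cli --bin-dir /home/ec2-user/bin
```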

@@ -34,7 +102,7 @@ aws --version

Also mentioned here: https://github.com/nextflow-io/nextflow/issues/2322

## Command 'ps' required by nextflow to collect task metrics cannot be found
### Command 'ps' required by nextflow to collect task metrics cannot be found
Problem:
Nextflow needs certain tools installed on the system to collect metrics: https://www.nextflow.io/docs/latest/tracing.html#tasks.
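
The usual fix is to install the `procps` package (which provides `ps`) in the container image. A minimal sketch of the relevant command, assuming a Debian/Ubuntu-based image:

``` sh
apt-get update && apt-get install -y procps
```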

@@ -49,6 +117,6 @@ RUN apt-get update && \
Also mentioned here: https://github.com/replikation/What_the_Phage/issues/89

# Resources
## Resources
- Running Nextflow on AWS - https://t-neumann.github.io/pipelines/AWS-pipeline/
- Getting started with Nextflow - https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html
- Getting started with Nextflow - https://carpentries-incubator.github.io/workflows-nextflow/aio/index.html
237 changes: 63 additions & 174 deletions main.nf
@@ -1,192 +1,80 @@
#!/usr/bin/env nextflow
include { msconvert as msconvert_narrow } from "./modules/msconvert.nf"
include { msconvert as msconvert_wide } from "./modules/msconvert.nf"
include { unique_peptides_proteins } from "./modules/unique_peptides_proteins.nf"
include { msstats } from "./modules/msstats.nf"

nextflow.enable.dsl = 2

process run_encyclopedia_local {
echo true
publishDir params.publish_dir, mode: "copy"
storeDir params.store_dir

input:
path mzml_gz_file
each path(library_file)
each path(fasta_file)

output:
tuple(
path("${mzml_gz_file.baseName}.elib"),
path("${file(mzml_gz_file.baseName).baseName}.dia"),
path("${mzml_gz_file.baseName}.{features,encyclopedia,encyclopedia.decoy}.txt"),
path("logs/${mzml_gz_file.baseName}.local.log"),
)

script:
"""
mkdir logs
gzip -df ${mzml_gz_file}
java -Djava.awt.headless=true ${params.encyclopedia.memory} \\
-jar /code/encyclopedia-\$VERSION-executable.jar \\
-i ${mzml_gz_file.baseName} \\
-f ${fasta_file} \\
-l ${library_file} \\
${params.encyclopedia.local_options} \\
| tee logs/${mzml_gz_file.baseName}.local.log
"""

stub:
"""
mkdir logs
touch ${mzml_gz_file.baseName}.elib
touch ${file(mzml_gz_file.baseName).baseName}.dia
touch ${mzml_gz_file.baseName}.features.txt
touch ${mzml_gz_file.baseName}.encyclopedia.txt
touch ${mzml_gz_file.baseName}.encyclopedia.decoy.txt
touch logs/${mzml_gz_file.baseName}.local.log
"""
}
// Subworkflows:
include { CONVERT_TO_MZML } from "./subworkflows/msconvert"
include {
BUILD_CHROMATOGRAM_LIBRARY;
PERFORM_QUANT;
PERFORM_GLOBAL_QUANT
} from "./subworkflows/encyclopedia"

process run_encyclopedia_global {
echo true
publishDir params.publish_dir, mode: "copy"
storeDir params.store_dir

input:
path local_files
path mzml_gz_files
path library_file
path fasta_file
val output_postfix

output:
tuple(
path("result-${output_postfix}*.elib"),
path("result-${output_postfix}*.{peptides,proteins}.txt"),
path("logs/result-${output_postfix}*.global.log")
)

script:
"""
mkdir logs
find . -name '*.gz' -exec gzip -df {} \\;
java -Djava.awt.headless=true ${params.encyclopedia.memory} \\
-jar /code/encyclopedia-\$VERSION-executable.jar \\
-libexport \\
-o result-${output_postfix}.elib \\
-i ./ \\
-f ${fasta_file} \\
-l ${library_file} \\
${params.encyclopedia.global_options} \\
| tee logs/result-${output_postfix}.global.log
"""

stub:
def stem = "result-${output_postfix}"
"""
mkdir logs
touch ${stem}.elib
touch ${stem}.peptides.txt
touch ${stem}.proteins.txt
touch logs/${stem}.global.log
"""
}

workflow encyclopedia_narrow {
take:
mzml_gz_files
dlib
fasta
main:
run_encyclopedia_local(mzml_gz_files, dlib, fasta)
| flatten
| collect
| set { narrow_local_files }

run_encyclopedia_global(
narrow_local_files,
mzml_gz_files | collect,
dlib,
fasta,
params.encyclopedia.narrow_lib_postfix,
)
| flatten
| filter { it.name =~ /.*elib$/ }
| set { narrow_elib }
emit:
narrow_elib
def replace_missing_elib(elib) {
// Use the DLIB when the ELIB is unavailable.
if (elib == null) {
return file(params.encyclopedia.dlib)
}
return elib
}

workflow encyclopedia_wide {
take:
mzml_gz_files
elib
fasta
main:
// Run encyclopedia for all local files
run_encyclopedia_local(mzml_gz_files, elib, fasta)
.flatten()
.tap { wide_local_files }
| filter { it.name =~ /.*mzML.elib$/ }
| collect
| unique_peptides_proteins

// Use the local .elib's as an input to the global run
run_encyclopedia_global(
wide_local_files | collect,
mzml_gz_files | collect,
elib,
fasta,
params.encyclopedia.wide_lib_postfix
)
| flatten
| filter { it.name =~ /.*(elib|txt)$/ }
| set { output_files }
emit:
output_files
}

workflow {
// Get .fasta and .dlib from metadata-bucket
fasta = Channel.fromPath(params.encyclopedia.fasta, checkIfExists: true)
dlib = Channel.fromPath(params.encyclopedia.dlib, checkIfExists: true)
fasta = Channel.fromPath(params.encyclopedia.fasta, checkIfExists: true).first()
dlib = Channel.fromPath(params.encyclopedia.dlib, checkIfExists: true).first()

// Get the narrow and wide files:
narrow_files = Channel
.fromPath(params.narrow_files, checkIfExists: true)
.splitCsv()
.map { row -> file(row[0]) }

wide_files = Channel
.fromPath(params.wide_files, checkIfExists: true)
.splitCsv()
.map { row -> file(row[0]) }

if ( !narrow_files && !wide_files ) {
error "No raw files were given. Nothing to do."
ms_files = Channel
.fromPath(params.ms_file_csv, checkIfExists: true)
.splitCsv(header: true, strip: true)
.multiMap { it ->
runs: it.file
meta: tuple it.file, it.chrlib.toBoolean(), it.group
}

if ( !ms_files.runs ) {
error "No MS data files were given. Nothing to do."
}

// Convert raw files to gzipped mzML.
narrow_files | msconvert_narrow | set { narrow_mzml_files }
wide_files | msconvert_wide | set { wide_mzml_files }

// Build a chromatogram library with EncyclopeDIA
encyclopedia_narrow(narrow_mzml_files, dlib, fasta)

    // If no narrow files are given, use the dlib instead.
encyclopedia_narrow.out
.ifEmpty(file(params.encyclopedia.dlib))
.set { chr_elib }

// Perform quant runs on wide window files.
encyclopedia_wide(wide_mzml_files, chr_elib, fasta)

if (params.use_msstats) {
encyclopedia_wide.out
| filter { it.name =~ /.*quant.elib.peptides.txt$/ }
| msstats
// Convert raw files to gzipped mzML and group them by experiment.
// The chrlib and quant channels take the following form:
// [[file_ids], [mzml_gz_files], is_chrlib, group]
CONVERT_TO_MZML(ms_files.runs)
| join(ms_files.meta)
| groupTuple(by: [2, 3])
| branch {
chrlib: it[2]
quant: !it[2]
}
| set { mzml_gz_files }

// Build chromatogram libraries with EncyclopeDIA:
// The output is [group, elib]
BUILD_CHROMATOGRAM_LIBRARY(mzml_gz_files.chrlib, dlib, fasta)
| set { chrlib_elib_files }

    // Group quant files with their corresponding library ELIB.
// If none exists, use the DLIB.
// The output is [group, [quant_mzml_gz_files], elib_file]
mzml_gz_files.quant
| map { tuple it[3], it[1] }
| join(chrlib_elib_files, remainder: true)
| map { tuple it[0], it[1], replace_missing_elib(it[2]) }
| set { quant_files }

// Analyze the quantitative runs with EncyclopeDIA.
    // The output has three channels:
// local -> [group, [local_elib_files], [mzml_gz_files]]
// global -> [group, global_elib, peptides, proteins] or null
// msstats -> [group, input_csv, feature_csv]
PERFORM_QUANT(quant_files, dlib, fasta, params.aggregate)
| set { quant_results }

    // Perform a global analysis on all files if needed:
if ( params.aggregate ) {
PERFORM_GLOBAL_QUANT(quant_results.local, dlib, fasta)
}
}

@@ -204,4 +92,5 @@ workflow.onError {
subject: "Error: ${params.experimentName} failed.",
body: "Experiment run ${params.experimentName} using Encyclopedia failed.",
)

}