Add kraken2 phylogenetic assignment subworkflow #47

Open
wants to merge 53 commits into
base: dev

@ctuni ctuni commented Oct 28, 2024

Added a subworkflow for a "phylogenetic QC" that runs kraken2 assignment for each sample and then plots the results on an interactive krona plot. I have not added the kraken2 reports to multiqc yet; that should be the next step.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

github-actions bot commented Oct 28, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 758ce12

  • ✅ 194 tests passed
  • ❔ 1 test was ignored
  • ❗ 22 tests had warnings

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • schema_description - Ungrouped param in schema: save_uncompressed_k2db

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-10-30 15:14:41

@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 2.14.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@ctuni ctuni force-pushed the feature/kraken2 branch 2 times, most recently from 75a8e33 to 3e0d357 Compare October 28, 2024 14:52
@ctuni ctuni linked an issue Oct 28, 2024 that may be closed by this pull request
@ctuni ctuni changed the title from "first commit with kraken2 module" to "Add kraken2 phylogenetic assignment subworkflow" Oct 28, 2024
docs/output.md Outdated (resolved)
@MatthiasZepper
Member

I think you have this well covered here.

So I just wanted to point out that similar functionality was recently added to the rnaseq pipeline. In case you still need some inspiration or have some more "ugh" moments (fancy commit messages ^^), you might already find a suitable solution over there.

@ctuni
Author

ctuni commented Oct 30, 2024

> I think, you have this here well covered.
>
> Therefore, I just wanted to point out, that a similar functionality was recently added to the rnaseq pipeline. So in case you still need some inspiration or have some more ugh moments (fancy commit messages^^), you might already find a suitable solution over there.

Thank you! I took inspiration from the taxprofiler pipeline and tried simplifying it. I'll check how the rnaseq pipeline does it and see if the PR could be improved.

@nggvs nggvs left a comment

Hi! I have made some suggestions that may be of interest, but feel free to apply them or not.

workflows/seqinspector.nf Outdated (resolved)
@@ -22,6 +22,31 @@ process {
ext.args = '--quiet'
}

withName: 'KRAKEN2_KRAKEN2' {
publishDir = [

I think there is a general statement for this in this [file](https://github.com/nf-core/seqinspector/blob/31c1f829d97c4b98d21b68beed4af050fd331a37/conf/modules.config#L15), so I don't think it needs to be added twice, unless that bit is going to be removed later?

Author

There is! I just added it here for two reasons: the first is that I wanted more descriptive names for the folders (kraken2_reports instead of just kraken2) and I wanted the krona plots to be inside the kraken2_reports folder, with a more descriptive name as well.
The second reason I have added this seemingly redundant code is that kraken2 and kronatools can produce more output than what is produced now. I left these lines here looking into the future: they might need to be modified depending on the needs of the pipeline once it reaches a more stable status.
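
For illustration, the per-module overrides described in this reply might look roughly like the sketch below. The kraken2_reports folder name comes from the comment itself; the krona subfolder, the file patterns, and the use of params.outdir and params.publish_dir_mode (the usual nf-core template parameters) are assumptions, not the PR's actual configuration.

```groovy
// conf/modules.config (sketch, not the PR's actual content)
process {

    withName: 'KRAKEN2_KRAKEN2' {
        publishDir = [
            // Descriptive folder name instead of the generic "kraken2"
            path: { "${params.outdir}/kraken2_reports" },
            mode: params.publish_dir_mode,
            // Assumed pattern: publish only the report files
            pattern: '*report.txt'
        ]
    }

    withName: 'KRONA_KTIMPORTTAXONOMY' {
        publishDir = [
            // Assumed subfolder name: keeps the krona plots inside kraken2_reports
            path: { "${params.outdir}/kraken2_reports/krona" },
            mode: params.publish_dir_mode,
            // Assumed pattern: publish only the interactive HTML plots
            pattern: '*.html'
        ]
    }
}
```

Keeping one block per process leaves room to publish additional kraken2 or kronatools outputs later by widening the patterns or adding further publishDir entries.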

}

withName: 'KRONA_KTIMPORTTAXONOMY' {
publishDir = [

same as before

}

withName: 'UNTAR' {
publishDir = [

I'm not sure you want to publish the kraken db, because its size can be huge (depending on the one selected), and the user has already downloaded it, so it is already on the user's device. You may want to use storeDir if you want to store the db and reuse it later without publishing it in the output.

Author

I see what you mean! I could create a patch to the UNTAR module to add the storeDir directive. That would also need some changes to the config, but it can be done.
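
As a rough sketch of the storeDir idea, the directive could also be set from configuration rather than by patching the module. The parameter name k2db_cache_dir and its path are hypothetical, and whether storeDir behaves as wanted with the UNTAR module's outputs has not been verified here.

```groovy
// Sketch only: k2db_cache_dir is a hypothetical parameter, not part of this PR
params {
    // Permanent location for the unpacked database, outside the results folder
    k2db_cache_dir = '/path/to/kraken2_db_cache'
}

process {
    withName: 'UNTAR' {
        // storeDir keeps the process outputs in a permanent cache;
        // when the expected outputs already exist there, the process is skipped
        storeDir = params.k2db_cache_dir
    }
}
```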

In any case, to avoid wasting space on the uncompressed database, the pipeline behaves differently depending on whether the user provides a gzipped database or an uncompressed one. If the database is gzipped, the UNTAR module uncompresses it and uses it, but by default the uncompressed copy is not saved.

Publishing the uncompressed kraken2 db is turned off by default through params.save_uncompressed_k2db, which is set to false. In the modules.config file, this parameter is read by the enabled option of publishDir.

If the database is uncompressed, and the user passes a path to the kraken2_db param, the UNTAR module is not called; the database is simply used and remains in the user's original directory.
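
The toggle described in this reply could be sketched as follows. The parameter name save_uncompressed_k2db is taken from the comment; the output folder name and the use of params.outdir and params.publish_dir_mode are assumptions, so the real modules.config may differ.

```groovy
// nextflow.config (sketch)
params {
    // Off by default: the unpacked database is not copied to the results
    save_uncompressed_k2db = false
}

// conf/modules.config (sketch)
process {
    withName: 'UNTAR' {
        publishDir = [
            // Assumed output folder name
            path: { "${params.outdir}/kraken2_db" },
            mode: params.publish_dir_mode,
            // Publishing only happens when the user opts in
            enabled: params.save_uncompressed_k2db
        ]
    }
}
```

With a setup like this, passing --save_uncompressed_k2db on the command line would switch publishing on, while the default run leaves the unpacked database in the work directory only.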

nextflow.config Outdated
@@ -19,6 +19,12 @@ params {
igenomes_base = 's3://ngi-igenomes/igenomes/'
igenomes_ignore = false

// Kraken2 options
kraken2_db = 'https://github.com/nf-core/test-datasets/raw/taxprofiler/data/database/kraken2/testdb-kraken2.tar.gz'

Which db is this? There are different ones (https://benlangmead.github.io/aws-indexes/k2), which require different resources depending on their size. The one used, even if it's only for testing, should be documented somewhere (maybe it is and I haven't seen it).

Author

I used the smallest possible database for testing purposes, but I agree with you that it should not be the default; the default should just be set to null. I used the taxprofiler test database, which was built like this: https://github.com/nf-core/test-datasets/blob/taxprofiler/README.md#kraken2
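
One way this could be arranged, sketched under the assumption that the test database is supplied by the test profile (conf/test.config in the nf-core template) rather than by the pipeline default:

```groovy
// nextflow.config (sketch): no database by default, the user must provide one
params {
    kraken2_db = null
}

// conf/test.config (sketch): the small taxprofiler test database, for CI runs only
params {
    kraken2_db = 'https://github.com/nf-core/test-datasets/raw/taxprofiler/data/database/kraken2/testdb-kraken2.tar.gz'
}
```

The two params blocks belong to separate files; they are shown together here only for brevity.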

nextflow_schema.json Outdated (resolved)
nextflow.config (resolved)
@ctuni ctuni requested review from Aratz and nggvs October 30, 2024 15:20
Successfully merging this pull request may close these issues.

add kraken2 to seqinspector
6 participants