-
Notifications
You must be signed in to change notification settings - Fork 19
Add samplesheet examples #82
Changes from 5 commits
c6b31f3
1352539
a22a947
8d99ac1
302ec09
cbabb8e
3b16ddb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,79 +5,192 @@ description: Examples of advanced sample sheet creation techniques. | |
|
||
# Sample sheet channel manipulation examples | ||
|
||
## Separate entries based on a condition | ||
## Introduction | ||
|
||
You can use the [`.branch()` operator](https://www.nextflow.io/docs/latest/operator.html#branch) to separate the channel entries based on a condition. This is especially useful when you can get multiple types of input data. | ||
Understanding channel structure and manipulation is critical for getting the most out of Nextflow. nf-validation helps initialise your channels from the text inputs to get you started, but further work might be required to fit your exact use case. In this page we run through some common cases for transforming the output of `.fromSamplesheet`. | ||
|
||
### Glossary | ||
|
||
- A channel is the Nextflow object, referenced in the code | ||
- An item is each thing passing through the channel, equivalent to one row in the samplesheet | ||
- An element is each thing in the item, e.g., the meta value, fastq_1 etc. It may be a file or value | ||
|
||
## Default mode | ||
|
||
Each item in the channel emitted by `.fromSamplesheet()` is a flat tuple, corresponding with each row of the samplesheet. Each item will be composed of a meta value (if present) and any additional elements from columns in the samplesheet, e.g.: | ||
|
||
This example shows a channel which can have entries for WES or WGS data. These analysis are different so we want to separate the WES and WGS entries from each other. We also don't want the `bed` file input for the WGS data, so the resulting channel with WGS data should not have this file in it. | ||
```csv | ||
sample,fastq_1,fastq_2,bed | ||
sample1,fastq1.R1.fq.gz,fastq1.R2.fq.gz,sample1.bed | ||
sample2,fastq2.R1.fq.gz,fastq2.R2.fq.gz, | ||
``` | ||
|
||
Might create a channel where each element consists of 4 items, a map value followed by three files: | ||
|
||
```groovy | ||
// Columns: | ||
[ val([ sample: sample ]), file(fastq1), file(fastq2), file(bed) ] | ||
|
||
// Resulting in: | ||
[ [ id: "sample" ], fastq1.R1.fq.gz, fastq1.R2.fq.gz, sample1.bed] | ||
[ [ id: "sample2" ], fastq2.R1.fq.gz, fastq2.R2.fq.gz, [] ] // A missing value from the samplesheet is an empty list | ||
``` | ||
|
||
This could be used in the Nextflow process as: | ||
adamrtalbot marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
```nextflow | ||
tuple val(meta), path(fastq_1), path(fastq_2), path(bed) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not entirely convinced this is the cleanest and most clear way of representing the channel structure, maybe something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've tried to match an input/output declaration for a process. Perhaps I could include both. What does the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It was to emulate a filename, but yeah this is just a matter of representation. I'm not sure which way is the best or most clear to beginners. (although both should be fine I think) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Take a look now, I've added a clarifying statement about what the channel is and how you would use it. |
||
``` | ||
|
||
It may be necessary to manipulate this channel to fit your process inputs. For more documentation, check out the [Nextflow operator docs](https://www.nextflow.io/docs/latest/operator.html), however here are some common use cases with `.fromSamplesheet()`. | ||
|
||
## Changing the structure of channel items | ||
|
||
Each item in the channel will be a flat tuple, but some processes will use multiple files as a list in their input channel, this is common in nf-core modules. For example, consider the following input declaration in a process, where FASTQ could be > 1 file: | ||
|
||
```nextflow | ||
process ZCAT_FASTQS { | ||
input: | ||
tuple val(meta), path(fastq) | ||
|
||
""" | ||
zcat $fastq | ||
""" | ||
} | ||
``` | ||
|
||
The output of `.fromSamplesheet()` is the following by default: | ||
adamrtalbot marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
```nextflow | ||
val(meta, path(fastq_1), path(fastq_2) | ||
adamrtalbot marked this conversation as resolved.
Show resolved
Hide resolved
|
||
``` | ||
|
||
To manipulate each item within a channel, you should use the [Nextflow `.map()` operator](https://www.nextflow.io/docs/latest/operator.html#map). This will apply a function to each element of the channel in turn. Here, we convert the flat tuple into a tuple composed of a meta and a list of FASTQ files: | ||
|
||
```nextflow | ||
Channel.fromSamplesheet("input") | ||
.map { meta, fastq_1, fastq_2 -> tuple(meta, [ fastq_1, fastq_2 ]) } | ||
.set { input } | ||
|
||
input.view() // Channel has 2 elements: meta, fastqs | ||
``` | ||
|
||
This is now compatible with the process defined above and will not raise a warning about input cardinality: | ||
|
||
```nextflow | ||
ZCAT_FASTQS(input) | ||
``` | ||
|
||
## Removing elements in channel items | ||
|
||
For example, to remove the BED file from the channel created above, we could not return it from the map. Note the absence of the `bed` item in the return of the closure below: | ||
|
||
```nextflow | ||
Channel.fromSamplesheet("input") | ||
.map { meta, fastq_1, fastq_2, bed -> tuple(meta, fastq_1, fastq_2) } | ||
.set { input } | ||
|
||
input.view() // Channel has 3 elements: meta, fastq_1, fastq_2 | ||
``` | ||
|
||
In this way you can drop items from a channel. | ||
|
||
## Separating channel items | ||
|
||
We could perform this twice to create one channel containing the FASTQs and one containing the BED files, however Nextflow has a native operator to separate channels called [`.multiMap()`](https://www.nextflow.io/docs/latest/operator.html#multimap). Here, we separate the FASTQs and BEDs into two separate channels using `multiMap`. Note, the channels are both contained in `input` and accessed as an attribute using dot notation: | ||
|
||
```nextflow | ||
Channel.fromSamplesheet("input") | ||
.multiMap { meta, fastq_1, fastq_2, bed -> | ||
fastq: tuple(meta, fastq_1, fastq_2) | ||
bed: tuple(meta, bed) | ||
} | ||
.set { input } | ||
``` | ||
|
||
The channel has two attributes, `fastq` and `bed`, which can be accessed separately. | ||
|
||
```nextflow | ||
input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2 | ||
input.bed.view() // Channel has 2 elements: meta, bed | ||
``` | ||
|
||
Importantly, `multiMap` apples to every item in the channel and returns an item to both channels for every input, i.e. `input`, `input.fastq` and `input.bed` all contain the same number of items, however each item will be different. | ||
adamrtalbot marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
## Separate items based on a condition | ||
|
||
You can use the [`.branch()` operator](https://www.nextflow.io/docs/latest/operator.html#branch) to separate the channel entries based on a condition. This is especially useful when you can get multiple types of input data. | ||
|
||
This example shows a channel which can have entries for WES or WGS data. WES data includes a BED file denoting the target regions, but WGS data does not. These analysis are different so we want to separate the WES and WGS entries from each other. We can separate the two using `.branch` based on the presence of the BED file: | ||
|
||
```nextflow | ||
// Channel with four elements - see docs for examples | ||
params.input = "samplesheet.csv" | ||
|
||
Channel.fromSamplesheet("input") | ||
.branch { meta, bam, bai, bed -> | ||
WGS: meta.type == "WGS" | ||
return [meta, bam, bai] | ||
WES: meta.type == "WES" | ||
// The original channel structure will be used when no return statement is used. | ||
.branch { meta, fastq_1, fastq_2, bed -> | ||
// If BED does not exist | ||
WGS: !bed | ||
return [meta, fastq_1, fastq_2] | ||
// If BED exists | ||
WES: bed | ||
// The original channel structure will be used when no return statement is used. | ||
} | ||
.set { input } | ||
|
||
input.WGS.view() // Channel has 3 elements: meta, bam, bai | ||
input.WES.view() // Channel has 4 elements: meta, bam, bai, bed | ||
input.WGS.view() // Channel has 3 elements: meta, fastq_1, fastq_2 | ||
input.WES.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed | ||
``` | ||
|
||
## Count entries with a common value | ||
Unlike `multiMap`, the outputs of `.branch()`, the resulting channels will contain a different number of items. | ||
|
||
## Combining a channel | ||
|
||
After splitting the channel, it may be necessary to rejoin the channel. There are many ways to join a channel, but here we will demonstrate the simplest which uses the [Nextflow join operator](https://www.nextflow.io/docs/latest/operator.html#join) to rejoin any of the channels from above based on the first element in each item, the `meta` value. | ||
|
||
```nextflow | ||
input.fastq.view() // Channel has 3 elements: meta, fastq_1, fastq_2 | ||
input.bed.view() // Channel has 2 elements: meta, bed | ||
|
||
input.fastq | ||
.join( input.bed ) | ||
.set { input_joined } | ||
|
||
input_joined.view() // Channel has 4 elements: meta, fastq_1, fastq_2, bed | ||
``` | ||
|
||
## Count items with a common value | ||
|
||
This example is based on this [code](https://github.com/mribeirodantas/NextflowSnippets/blob/main/snippets/countBy.md) from [Marcel Ribeiro-Dantas](https://github.com/mribeirodantas). | ||
|
||
It's useful to determine the count of channel entries with similar values when you want to merge them later on (to prevent pipeline bottlenecks with `.groupTuple()`). | ||
|
||
This example contains a channel where multiple samples can be in the same family. Later on in the pipeline we want to merge the analyzed files so one file gets created for each family. The result will be a channel with an extra meta field containing the count of channel entries with the same family name. | ||
|
||
```groovy | ||
```nextflow | ||
// channel created by fromSamplesheet() previous to modification: | ||
// [[id:example1, family:family1], example1.txt] | ||
// [[id:example2, family:family1], example2.txt] | ||
// [[id:example3, family:family2], example3.txt] | ||
|
||
params.input = "samplesheet.csv" | ||
|
||
Channel.fromSamplesheet("input") | ||
.tap { ch_raw } // Create a copy of the original channel | ||
.tap { ch_raw } // Create a copy of the original channel | ||
.map { meta, txt -> [ meta.family ] } // Isolate the value to count on | ||
.reduce([:]) { counts, family -> | ||
.reduce([:]) { counts, family -> // Creates a map like this: [family1:2, family2:1] | ||
counts[family] = (counts[family] ?: 0) + 1 | ||
counts | ||
} // Creates a map like this: [family1:2, family2:1] | ||
.combine(ch_raw) // Add the count map to the original channel | ||
.map { counts, meta, txt -> | ||
} | ||
.combine(ch_raw) // Add the count map to the original channel | ||
.map { counts, meta, txt -> // Add the counts of the current family to the meta | ||
new_meta = meta + [count:counts[meta.family]] | ||
[ new_meta, txt ] | ||
} // Add the counts of the current family to the meta | ||
} | ||
.set { input } | ||
|
||
input.view() | ||
// [[id:example1, family:family1, count:2], example1.txt] | ||
// [[id:example2, family:family1, count:2], example2.txt] | ||
// [[id:example3, family:family2, count:1], example3.txt] | ||
``` | ||
|
||
## Split into multiple channels | ||
|
||
Sometimes you don't want all inputs to remain in the same channel (e.g. when the files need to be pre-processed separately). | ||
|
||
Following code shows an example where a `cram` file and a `bed` file are given in the samplesheet. The result contains two channels: one with the `cram` file and one with the `bed` file. | ||
|
||
```groovy | ||
// channel created by fromSamplesheet() previous to modification: | ||
// [[id:example], example.cram, example.cram.crai, example.bed] | ||
params.input = "samplesheet.csv" | ||
Channel.fromSamplesheet("input") | ||
.multiMap { meta, cram, crai, bed -> | ||
cram: [meta, cram, crai] | ||
bed: [meta, bed] | ||
} | ||
.set { input } | ||
|
||
input.cram.view() // [[id:example], example.cram, example.cram.crai] | ||
input.bed.view() // [[id:example], example.bed] | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence sounds a bit confusing to me, as I would expect seeing the use of the channel instead of the process input declaration. Could we rephrase with something like "This channel can be used as input of a process where the input declaration is:"