diff --git a/README.md b/README.md
index 938d5e9..8aab699 100644
--- a/README.md
+++ b/README.md
@@ -8,13 +8,25 @@ Other dependencies:
 * bbmap
 * fastqc
+* seqtk
 * figaro
 * R v4+ with dada2, devtools, tidyverse, and cowplot installed
+* MapSeq
 
 For convenience, `gaga2` comes with a Singularity container with all dependencies installed.
 
 ```singularity pull oras://ghcr.io/zellerlab/gaga2:latest```
 
+#### MapSeq
+While most dependencies can be installed via conda, `MapSeq` cannot. If you don't want to, or can't, use the `gaga2` singularity container, you need to make `gaga2` aware of your own MapSeq installation. In this case, please modify the following entries in the `run.config`:
+* `mapseq_bin` should be the path to your `mapseq` binary; if `mapseq` is in your `$PATH`, you can leave this entry unchanged.
+* `mapseq_db_path` is the `dirname` of the path to your `mapseq` database, i.e. for `/path/to/mapseq_db_folder/` it is `/path/to/`.
+* `mapseq_db_name` is the `basename` of the path to your `mapseq` database, i.e. for `/path/to/mapseq_db_folder/` it is `mapseq_db_folder`.
+
+#### Custom MapSeq databases
+
+If you use MapSeq as provided by the gaga2 container, you can modify the `mapseq_db_path` and `mapseq_db_name` parameters to point at a custom database. This custom database will be used in addition to MapSeq's own database and is currently limited to a single taxonomy. Please note that the database files need to be named `.fasta` and `.tax`.
+
 ## Usage instructions
@@ -28,22 +40,22 @@ They should, but don't have to, be arranged in a sample-based directory structur
 ```
 (aka "input_dir")
  |___ 
-|    |____ 
-|    |____ 
+|    |____ _R?1.{fastq,fq,fastq.gz,fq.gz}
+|    |____ _R?2.{fastq,fq,fastq.gz,fq.gz}
 |
 |___ 
 |    |____ 
 |
 |___ 
-     |____ 
-     |____ 
+     |____ _R?1.{fastq,fq,fastq.gz,fq.gz}
+     |____ _R?2.{fastq,fq,fastq.gz,fq.gz}
 ```
 
 A flat directory structure (with all read files in the same directory) or a deeply-branched one (with read files scattered over multiple levels) should also work.
 
 If `gaga2` preprocesses the reads, it will automatically use `_R1/2` endings internally.
 
-* If input reads have already been preprocessed, you can set the `--preprocessed` flag. In this case, `gaga2` will do no preprocessing at all and instruct `dada2` to perform no trimming. Otherwie, `gaga2` will assess the read lengths for uniformity. If read lengths differ within and between samples, preprocessing with `figaro` is not possible and `dada2` will be run without trimming.
+* If input reads have already been preprocessed, you can set the `--preprocessed` flag. In this case, `gaga2` will do no preprocessing at all and instruct `dada2` to perform no trimming. Otherwise, `gaga2` will preprocess the reads with `bbduk`. If primer sequences are given via `--primers [,]`, `bbduk` will attempt to remove them + any up/downstream adapter remains. Otherwise, `bbduk` will attempt to remove adapter remains only. Reads are then cut to a uniform length (all forward reads across all samples and all reverse reads across all samples need to have the same length, but forward and reverse reads may differ in length). The latter treatment is a requirement for `figaro`, which will then take the primer lengths (supplied via `--left_primer` and `--right_primer`) into account when determining the optimal cut sites.
 
 * Samples with fewer than `110` reads after `dada2` preprocessing will be discarded.
 
@@ -61,6 +73,8 @@ before.
 
 In addition, you should obtain a copy of the `run.config` from the `gaga2` github repo and modify it according to your environment.
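+For example, a minimal sketch of the MapSeq-related entries in `run.config` for a local installation (the binary path, database location, and database name below are hypothetical):
+
+```
+params {
+    mapseq_bin = "/opt/software/mapseq/mapseq"  // or leave as "mapseq" if it is in your $PATH
+    mapseq_db_path = "/data/dbs"                // dirname of /data/dbs/my_16s_db
+    mapseq_db_name = "my_16s_db"                // basename; files named my_16s_db.fasta and my_16s_db.tax
+}
+```
+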
+Additional arguments can be set via the command line (e.g. `--input_dir /path/to/input_dir`) or in the params section of the `run.config` file (recommended for documentation and reproducibility purposes).
+
 #### Mandatory arguments
 * `--input_dir` is the project directory mentioned above.
 * `--output_dir` will be created automatically.
@@ -71,6 +85,10 @@ In addition, you should obtain a copy of the `run.config` from the `gaga2` githu
 * `--min_overlap` of read pairs is `20bp` by default
 * `--primers ` or `--left_primer` and `--right_primer`: If primer sequences are provided via `--primers`, `gaga2` will remove primers and upstream sequences (using `bbduk`), such as adapters, based on the primer sequences. If non-zero primer lengths are provided instead (via `--left_primer` and `--right_primer`), `figaro` will take those into account when determining the best trim positions.
 * `--preprocessed` will prevent any further preprocessing by `gaga2` - this flag should only be used if the read data is reliably clean.
+* `--long_reads` should only be set if individual reads (i.e., forward and/or reverse) are known to be of equal length to or longer than the amplicon. This will trigger primer/adapter removal (if enabled) on the 3'-ends in order to get rid of readthroughs.
+* `--qc_minlen` sets the minimum length for reads to be retained after preprocessing.
+* `--dada2_chimera_method` sets the method with which `dada2` will remove chimeras (default: `"consensus"`, alternative: `"pool"`).
+* `--dada2_chimera_min_fold_parent_over_abundance` sets the `minFoldParentOverAbundance` parameter for `removeBimeraDeNovo` (default: `2`).
 
 ### internal beta-testing instructions
diff --git a/assets/adapters.fa b/assets/adapters.fa
index a87d425..9de5e45 100755
--- a/assets/adapters.fa
+++ b/assets/adapters.fa
@@ -1,3 +1,11 @@
+>AmpliconPCR16S_primer_fwd
+TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG
+>AmpliconPCR16S_primer_rev
+GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC
+>Overhang_fwd
+TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
+>Overhang_rev
+GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
 >Reverse_adapter
 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
 >TruSeq_Universal_Adapter
diff --git a/config/run.config b/config/run.config
index 05baa3a..b08752d 100644
--- a/config/run.config
+++ b/config/run.config
@@ -1,29 +1,46 @@
 params {
-    publish_mode = "symlink"
+    publish_mode = "copy"
 
+    // default figaro parameters
+    // overlap between read1 and read2
     min_overlap = 20
+    // length of left primer
     left_primer = 0
+    // length of right primer
     right_primer = 0
 
-    /*
-    bbduk qc parameters
-    s. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
-    qtrim=rl trimq=3 : gentle quality trimming (only discard bases < phred 3; phred 2 = junk marker) on either side (rl) of the read
-    maq=25 : discard reads below average quality of pred 25
-    minlen=45 : discard reads < 45bp
-    ref=??
-    ktrim=r k=23 mink=11 hdist=1 tpe tbo : right-side k-mer based adapter clipping with 1 mismatch allowed, try overlap-detection (tbo), and trim pairs to same length (tpe) upon adapter detection
-    ftm=5 : get rid off (n*5)+1st base (last sequencing cycle illumina garbage)
-    entropy=0.5 entropywindow=50 entropyk=5 : discard low complexity sequences
-    */
-
-    qc_params_primers = "qtrim=rl trimq=3 ktrim=l k=14 mink=1 hdist=1 cu=t"
-    qc_params_adapters = "qtrim=rl trimq=3 ktrim=l k=23 mink=1 hdist=1 tpe tbo cu=t"
+    // only keep reads of at least length = qc_minlen
     qc_minlen = 100
 
+    // If only primer lengths are supplied, figaro/dada2 will take care of primer removal.
+    // Otherwise, if primer sequences are supplied,
+    // primer + adapter removal is a two-step process:
+    // 1. gentle quality trimming (< phred 3) + remove left primer on 5'-R1 and potentially on 3'-R2 (rc)
+    // 2. remove right primer on 5'-R2 and potentially on 3'-R1 (rc)
+
+    // Primer removal is highly dataset-specific; you might have to play with the settings below.
+    // Also refer to: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
+    // cu=t : allow degenerate primer sequences
+    // qtrim=rl trimq=3 : gentle quality trimming (< phred 3) on both sides
+    // ktrim=(r|l) : clip adapters from right xor left end -- DO NOT MODIFY.
+    // restrictleft|restrictright : only take into account the first / last N bases for adapter clipping
+    // k=9 hdist=1 : adapter/primer k-mers of length 9 have to match with at most one mismatch
+    // mink=1 : at the ends of reads, perfect (mismatch=0) adapter/primer k-mer matches of length 1 are allowed (similar to cutadapt)
+    //          -- to allow mismatches, set hdist2 to a positive, non-zero integer
+
+    p5_primer_params = "cu=t qtrim=rl ktrim=l trimq=3 k=9 mink=1 hdist=1 restrictleft=50"
+    p3_primer_params = "cu=t ktrim=r k=9 mink=1 hdist=1 restrictright=50"
 
+    // default settings for mapseq (when served from gaga2-singularity container)
+    // only edit these three if you're using a local mapseq installation
     mapseq_bin = "mapseq"
-    mapseq_db_path = projectDir
-    mapseq_db_name = ""
+    //mapseq_db_path = projectDir  // dirname of /path/to/mapseq_folder (i.e. /path/to)
+    //mapseq_db_name = ""          // basename of /path/to/mapseq_folder (i.e. mapseq_folder)
+    mapseq_db_path = "/g/scb/zeller/schudoma/mapseq_db/zeller_db"
+    mapseq_db_name = "zeller_db"
+
+    // parameters for dada2 chimera removal -- potentially useful for troubleshooting
     dada2_chimera_method = "consensus" // can be "consensus" (default dada2 since 1.4) or "pool"
     dada2_chimera_min_fold_parent_over_abundance = 2
 }
@@ -47,7 +64,7 @@ process {
         executor = "slurm"
         errorStrategy = {task.attempt <= 3 ? "retry" : "ignore"}
         memory = '16.GB'
-        time = '4d'
+        time = '14d'
         maxRetries = 3
         cpus = 8
     }
@@ -55,8 +72,8 @@ process {
         container = "oras://ghcr.io/zellerlab/gaga2:latest"
         executor = "slurm"
         errorStrategy = {task.attempt <= 3 ? "retry" : "ignore"}
-        memory = '16.GB'
-        time = '4d'
+        memory = {16.GB * task.attempt}
+        time = '14d'
         maxRetries = 3
         cpus = 8
     }
@@ -66,13 +83,13 @@
         errorStrategy = {task.attempt <= 3 ? "retry" : "ignore"}
         cpus = 4
         memory = {8.GB * task.attempt}
-        time = '2h'
+        time = '24h'
         maxRetries = 3
     }
     withName: fastqc {
        container = "oras://ghcr.io/zellerlab/gaga2:latest"
        executor = "slurm"
-       errorStrategy = {task.attempt <= 3 ? "retry" : "ignore"}
"retry" : "ignore"} cpus = 2 memory = {4.GB * task.attempt} time = '7d' diff --git a/main.nf b/main.nf index 08024a4..95b5794 100644 --- a/main.nf +++ b/main.nf @@ -2,7 +2,7 @@ nextflow.enable.dsl = 2 -include { qc_bbduk } from "./modules/nevermore/qc/bbduk" +include { qc_bbduk_stepwise_amplicon } from "./modules/nevermore/qc/bbduk" include { fastqc } from "./modules/nevermore/qc/fastqc" include { classify_sample } from "./modules/nevermore/functions" include { mapseq; mapseq_otutable } from "./modules/profilers/mapseq" @@ -48,8 +48,9 @@ process figaro { publishDir "${params.output_dir}", mode: params.publish_mode input: - path input_reads - val is_paired_end + path(input_reads) + val(is_paired_end) + val(read_lengths) output: path("figaro/trimParameters.json"), emit: trim_params @@ -58,9 +59,20 @@ process figaro { script: def paired_params = (is_paired_end == true) ? "-r ${params.right_primer} -m ${params.min_overlap}" : "" - """ - figaro -i . -o figaro/ -a ${params.amplicon_length} -f ${params.left_primer} ${paired_params} - """ + if (params.single_end) { + """ + figaro -i . -o figaro/ -a ${params.amplicon_length} -f ${params.left_primer} + """ + } else { + amplen = read_lengths[0].toInteger() + read_lengths[1].toInteger() - params.min_overlap.toInteger() - 1 + if (amplen >= params.amplicon_length.toInteger()) { + amplen = params.amplicon_length.toInteger() + } + println 'Amplicon length', amplen, params.amplicon_length + """ + figaro -i . -o figaro/ -a ${amplen} -f ${params.left_primer} -r ${params.right_primer} -m ${params.min_overlap} + """ + } } @@ -134,6 +146,28 @@ process dada2_analysis { """ } +process prepare_fastqs { + input: + tuple val(sample), path(fq) + + output: + tuple val(sample), path("fastq/${sample.id}/${sample.id}_R*.fastq.gz"), emit: reads + + script: + if (sample.is_paired) { + """ + mkdir -p fastq/${sample.id} + ln -sf ../../${fq[0]} fastq/${sample.id}/${sample.id}_R1.fastq.gz + ln -sf ../../${fq[1]} fastq/${sample.id}/${sample.id}_R2.fastq.gz + """ + } else { + """ + mkdir -p fastq/${sample.id} + ln -sf ../../${fq[0]} fastq/${sample.id}/${sample.id}_R1.fastq.gz + """ + } +} + process asv2fasta { publishDir "${params.output_dir}", mode: params.publish_mode @@ -150,7 +184,6 @@ process asv2fasta { """ tail -n +2 ${asv_seqs} | sed 's/^/>/' | tr '\t' '\n' > ASVs.fasta """ - } process assess_read_length_distribution { @@ -163,11 +196,10 @@ process assess_read_length_distribution { script: """ - python ${projectDir}/scripts/assess_readlengths.py . > read_length_thresholds.txt + python ${projectDir}/scripts/assess_readlengths.py --amplicon_length ${params.amplicon_length} --min_overlap ${params.min_overlap} . 
     """
 }
 
-
 process homogenise_readlengths {
     label 'bbduk'
@@ -186,14 +218,20 @@
         mkdir -p ${sample.id}
         r1len=\$(head -n 1 ${read_lengths} | cut -f 1)
         r2len=\$(head -n 1 ${read_lengths} | cut -f 4)
-        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t minlength=\$((r1len-1)) ftr=\$((r1len-1)) stats=${sample.id}/${sample.id}.homr_stats_1.txt in=${sample.id}_R1.fastq.gz out=${sample.id}/${sample.id}_R1.fastq.gz
-        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t minlength=\$((r2len-1)) ftr=\$((r2len-1)) stats=${sample.id}/${sample.id}.homr_stats_2.txt in=${sample.id}_R2.fastq.gz out=${sample.id}/${sample.id}_R2.fastq.gz
+        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t trd=t minlength=\$r1len ftr=\$((r1len-1)) stats=${sample.id}/${sample.id}.homr_stats_1.txt in=${sample.id}_R1.fastq.gz out=left.fastq.gz
+        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t trd=t minlength=\$r2len ftr=\$((r2len-1)) stats=${sample.id}/${sample.id}.homr_stats_2.txt in=${sample.id}_R2.fastq.gz out=right.fastq.gz
+        gzip -dc left.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/1//' | sort > left.txt
+        gzip -dc right.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/2//' | sort > right.txt
+        join -1 1 -2 1 left.txt right.txt > both.txt
+        seqtk subseq left.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R1.fastq.gz
+        seqtk subseq right.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R2.fastq.gz
+        rm left.* right.* both.txt
         """
     } else {
         """
         mkdir -p ${sample.id}
         r1len=\$(head -n 1 ${read_lengths} | cut -f 1)
-        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t minlength=\$((r1len-1)) ftr=\$((r1len-1)) stats=${sample.id}/${sample.id}.homr_stats_1.txt in=${sample.id}_R1.fastq.gz out=${sample.id}/${sample.id}_R1.fastq.gz
+        bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t minlength=\$r1len ftr=\$((r1len-1)) stats=${sample.id}/${sample.id}.homr_stats_1.txt in=${sample.id}_R1.fastq.gz out=${sample.id}/${sample.id}_R1.fastq.gz
         """
     }
 }
@@ -203,69 +241,29 @@ workflow raw_reads_figaro {
     take:
         reads
-        run_figaro
 
     main:
-        qc_bbduk(reads, "${projectDir}/assets/adapters.fa", run_figaro)
+        qc_bbduk_stepwise_amplicon(reads, "${projectDir}/assets/adapters.fa")
+
+        fastqc(qc_bbduk_stepwise_amplicon.out.reads)
 
-        fastqc(qc_bbduk.out.reads)
-        fastqc_ch = fastqc.out.reports
-            .map { sample, report -> return report }
-            .collect()
+        read_lengths_ch = qc_bbduk_stepwise_amplicon.out.read_lengths.collect()
+        assess_read_length_distribution(read_lengths_ch)
+        homogenise_readlengths(qc_bbduk_stepwise_amplicon.out.reads, assess_read_length_distribution.out.read_length)
 
-        assess_read_length_distribution(fastqc_ch)
-        homogenise_readlengths(qc_bbduk.out.reads, assess_read_length_distribution.out.read_length)
+        read_lengths_figaro_input_ch = assess_read_length_distribution.out.read_length
+            .splitCsv(header: false, sep: '\t')
+            .first()
+            .map { values -> (params.single_end) ? tuple(values[0], null) : tuple(values[0], values[3]) }
+            .view()
 
         hom_reads = homogenise_readlengths.out.reads.collect()
-        figaro(hom_reads, !params.single_end)
+        figaro(hom_reads, !params.single_end, read_lengths_figaro_input_ch)
         extract_trim_parameters(figaro.out.trim_params)
 
     emit:
         reads = hom_reads
         trim_params = extract_trim_parameters.out.trim_params
-
-}
-
-
-workflow check_for_preprocessing {
-    take:
-        reads
-
-    main:
-        reads.view()
-        fastqc(reads)
-        assess_read_length_distribution(
-            fastqc.out.reports
-                .map { sample, report -> return report }
-                .collect()
-        )
-
-    emit:
-        readlen_dist = assess_read_length_distribution.out.read_length
-        hom_reads_marker = assess_read_length_distribution.out.hom_reads_marker
-}
-
-
-process prepare_fastqs {
-    input:
-    tuple val(sample), path(fq)
-
-    output:
-    tuple val(sample), path("fastq/${sample.id}/${sample.id}_R*.fastq.gz"), emit: reads
-
-    script:
-    if (sample.is_paired) {
-        """
-        mkdir -p fastq/${sample.id}
-        ln -sf ../../${fq[0]} fastq/${sample.id}/${sample.id}_R1.fastq.gz
-        ln -sf ../../${fq[1]} fastq/${sample.id}/${sample.id}_R2.fastq.gz
-        """
-    } else {
-        """
-        mkdir -p fastq/${sample.id}
-        ln -sf ../../${fq[0]} fastq/${sample.id}/${sample.id}_R1.fastq.gz
-        """
-    }
 }
 
@@ -278,7 +276,13 @@ workflow {
             return tuple(sample, file)
         }
         .groupTuple(sort: true)
-        .map { classify_sample(it[0], it[1]) }
+        .map { sample, files ->
+            def meta = [:]
+            meta.is_paired = (files instanceof Collection && files.size() == 2)
+            meta.id = sample
+
+            return tuple(meta, files)
+        }
 
     prepare_fastqs(fastq_ch)
 
@@ -301,34 +305,31 @@
 
     if (!params.preprocessed) {
 
-        /* check if dataset was preprocessed */
-
-        check_for_preprocessing(prepare_fastqs.out.reads)
-
-        raw_reads_figaro(prepare_fastqs.out.reads, check_for_preprocessing.out.hom_reads_marker)
+        raw_reads_figaro(prepare_fastqs.out.reads)
 
         trim_params_ch = trim_params_ch
             .concat(raw_reads_figaro.out.trim_params)
 
-        dada_reads_ch = dada_reads_ch.concat(raw_reads_figaro.out.reads)
+        dada_reads_ch = dada_reads_ch.concat(raw_reads_figaro.out.reads).collect()
 
-    }
+    } else {
 
-    trim_params = file("${workDir}/trim_params.txt")
-    trim_params.text = "-1 -1\n"
+        trim_params = file("${workDir}/trim_params.txt")
+        trim_params.text = "-1 -1\n"
 
-    trim_params_ch = trim_params_ch
-        .concat(Channel.fromPath("${workDir}/trim_params.txt"))
+        trim_params_ch = trim_params_ch
+            .concat(Channel.fromPath("${workDir}/trim_params.txt"))
 
-    dada_reads_ch = dada_reads_ch.concat(
-        prepare_fastqs.out.reads
-            .map { sample, reads -> reads }
-            .collect()
-    )
+        dada_reads_ch = dada_reads_ch.concat(
+            prepare_fastqs.out.reads
+                .map { sample, reads -> reads }
+                .collect()
+        )
+
+    }
 
-    dada2_preprocess(dada_reads_ch.first(), trim_params_ch.first(), dada2_preprocess_script, is_paired_end)
+    dada2_preprocess(dada_reads_ch, trim_params_ch, dada2_preprocess_script, is_paired_end)
     dada2_analysis(dada2_preprocess.out.filtered_reads, dada2_preprocess.out.trim_table, dada2_analysis_script, is_paired_end)
     asv2fasta(dada2_analysis.out.asv_sequences)
     mapseq(asv2fasta.out.asv_fasta, params.mapseq_db_path, params.mapseq_db_name)
-    mapseq_otutable(mapseq.out.bac_ssu)
 }
diff --git a/modules/nevermore/qc/bbduk.nf b/modules/nevermore/qc/bbduk.nf
index 22a91b6..96d9540 100644
--- a/modules/nevermore/qc/bbduk.nf
+++ b/modules/nevermore/qc/bbduk.nf
@@ -1,35 +1,115 @@
-
-process qc_bbduk {
+process qc_bbduk_amplicon {
     label 'bbduk'
-    publishDir path: params.output_dir, mode: params.publish_mode, pattern: "${sample.id}/${sample.id}.bbduk_stats.txt"
+    publishDir path: params.output_dir, mode: params.publish_mode, pattern: "${sample.id}/${sample.id}.*{txt,lhist}"
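+    // NB: main.nf imports qc_bbduk_stepwise_amplicon (below); this process is not currently wired into the workflow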
 
     input:
     tuple val(sample), path(reads)
     path(adapters)
-    path(run_sentinel)
 
     output:
     tuple val(sample), path("${sample.id}/${sample.id}_R*.fastq.gz"), emit: reads
     tuple val(sample), path("${sample.id}/${sample.id}_O.fastq.gz"), emit: orphans, optional: true
-    path("${sample.id}/${sample.id}.bbduk_stats.txt")
+    path("${sample.id}/${sample.id}.*bbduk_stats.txt")
+    path("${sample.id}/${sample.id}*lhist")
 
     script:
     def maxmem = task.memory.toString().replace(/ GB/, "g")
-    //def read1 = "in1=${sample.id}_R1.fastq.gz out1=${sample.id}/${sample.id}_R1.fastq.gz"
-    def read1 = "in1=${sample.id}_R1.fastq.gz out1=${sample.id}/${sample.id}_R1.fastq.gz"
-    //read2 = sample.is_paired ? "in2=${sample.id}_R2.fastq.gz out2=${sample.id}/${sample.id}_R2.fastq.gz outs=${sample.id}/${sample.id}_O.fastq.gz" : ""
-    read2 = sample.is_paired ? "in2=${sample.id}_R2.fastq.gz out2=${sample.id}/${sample.id}_R2.fastq.gz outs=${sample.id}/${sample.id}_O.fastq.gz" : ""
+
+    def read1_p5 = "in1=${sample.id}_R1.fastq.gz out1=${sample.id}_R1.p5.fastq.gz"
+    def read1_p3 = "in1=${sample.id}_R1.p5.fastq.gz out1=${sample.id}/${sample.id}_R1.fastq.gz"
+
+    def read2_p5 = ""
+    def read2_p3 = ""
+    def read2_hist = ""
+
+    if (sample.is_paired) {
+        read2_p5 = "in2=${sample.id}_R2.fastq.gz out2=${sample.id}_R2.p5.fastq.gz"
+        read2_p3 = "in2=${sample.id}_R2.p5.fastq.gz out2=${sample.id}/${sample.id}_R2.fastq.gz"
+        read2_hist = "bbduk.sh -Xmx${maxmem} in1=${sample.id}/${sample.id}_R2.fastq.gz lhist=${sample.id}/${sample.id}_R2.post_lhist"
+    }
 
     if (params.primers) {
-        qc_params = params.qc_params_primers
-        trim_params = "literal=${params.primers} minlen=${params.qc_minlen}"
+        trim_params = "literal=${params.primers} minlength=${params.qc_minlen}"
     } else {
-        qc_params = params.qc_params_adapters
-        trim_params = "ref=${adapters} minlen=${params.qc_minlen}"
+        trim_params = "ref=${adapters} minlength=${params.qc_minlen}"
     }
 
     """
     mkdir -p ${sample.id}
-    bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t ${qc_params} ${trim_params} stats=${sample.id}/${sample.id}.bbduk_stats.txt ${read1} ${read2}
+    bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t ${params.p5_primer_params} ${trim_params} stats=${sample.id}/${sample.id}.fwd_bbduk_stats.txt ${read1_p5} ${read2_p5} lhist=${sample.id}/${sample.id}.p5_lhist
+    bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t ${params.p3_primer_params} ${trim_params} stats=${sample.id}/${sample.id}.rev_bbduk_stats.txt ${read1_p3} ${read2_p3} lhist=${sample.id}/${sample.id}.p3_lhist
+    bbduk.sh -Xmx${maxmem} t=${task.cpus} in1=${sample.id}/${sample.id}_R1.fastq.gz \$(echo ${read2_p3} | cut -f 1,3 -d =) lhist=${sample.id}/${sample.id}.post_lhist
+
+    bbduk.sh -Xmx${maxmem} in1=${sample.id}/${sample.id}_R1.fastq.gz lhist=${sample.id}/${sample.id}_R1.post_lhist
+    ${read2_hist}
+    """
+    // bbduk.sh -Xmx${maxmem} t=${task.cpus} in1=${sample.id}_R1.fastq.gz \$(echo ${read2_p5} | cut -f 1 -d " ") lhist=${sample.id}/${sample.id}.pre_lhist
 }
+
+process qc_bbduk_stepwise_amplicon {
+    label 'bbduk'
+    publishDir path: params.output_dir, mode: params.publish_mode, pattern: "${sample.id}/${sample.id}*txt"
+
+    input:
+    tuple val(sample), path(reads)
+    path(adapters)
+
+    output:
+    tuple val(sample), path("${sample.id}/${sample.id}_R*.fastq.gz"), emit: reads
+    tuple val(sample), path("${sample.id}/${sample.id}_O.fastq.gz"), emit: orphans, optional: true
+    path("${sample.id}/${sample.id}.*bbduk_stats.txt"), optional: true
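+    // bbduk per-length read histograms; collected by raw_reads_figaro and fed to assess_read_length_distribution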
path("${sample.id}/${sample.id}*lhist.txt"), emit: read_lengths, optional: true + + script: + def maxmem = task.memory.toString().replace(/ GB/, "g") + + if (params.primers) { + trim_params = "literal=${params.primers} minlength=${params.qc_minlen}" + } else { + trim_params = "ref=${adapters} minlength=${params.qc_minlen}" + } + + def bbduk_call = "bbduk.sh -Xmx${maxmem} t=${task.cpus} ordered=t trd=t" + + ref_p5_r1 = (params.primers) ? "literal=" + params.primers.split(",")[0] : "ref=${adapters}" + ref_p5_r2 = (params.primers && !params.single_end) ? "literal=" + params.primers.split(",")[1] : "ref=${adapters}" + ref_p3_r1 = ref_p5_r2 + ref_p3_r2 = ref_p5_r1 + + if (params.single_end) { + """ + mkdir -p ${sample.id} + ${bbduk_call} ${trim_params} in1=${sample.id}_R1.fastq.gz out1=${sample.id}/${sample.id}_R1.fastq.gz stats=${sample.id}/${sample.id}.fwd_bbduk_stats.txt lhist=${sample.id}/${sample.id}.p5_lhist.txt + ${bbduk_call} in1=${sample.id}/${sample.id}_R1.fastq.gz lhist=${sample.id}/${sample.id}_R1.post_lhist.txt + """ + } else if (params.long_reads) { + """ + mkdir -p ${sample.id} + ${bbduk_call} ${ref_p5_r1} minlength=${params.qc_minlen} ${params.p5_primer_params} in1=${sample.id}_R1.fastq.gz out1=fwd_p5.fastq.gz + ${bbduk_call} ${ref_p5_r2} minlength=${params.qc_minlen} ${params.p5_primer_params} in1=${sample.id}_R2.fastq.gz out1=rev_p5.fastq.gz + ${bbduk_call} ${ref_p3_r1} minlength=${params.qc_minlen} ${params.p3_primer_params} in1=fwd_p5.fastq.gz out1=fwd.fastq.gz + ${bbduk_call} ${ref_p3_r2} minlength=${params.qc_minlen} ${params.p3_primer_params} in1=rev_p5.fastq.gz out1=rev.fastq.gz + gzip -dc fwd.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/1//' | sort > fwd.txt + gzip -dc rev.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/2//' | sort > rev.txt + join -1 1 -2 1 fwd.txt rev.txt > both.txt + seqtk subseq fwd.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R1.fastq.gz + seqtk subseq rev.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R2.fastq.gz + ${bbduk_call} in1=${sample.id}/${sample.id}_R1.fastq.gz lhist=${sample.id}/${sample.id}_R1.post_lhist.txt + ${bbduk_call} in1=${sample.id}/${sample.id}_R2.fastq.gz lhist=${sample.id}/${sample.id}_R2.post_lhist.txt + """ + } else { + """ + mkdir -p ${sample.id} + ${bbduk_call} ${ref_p5_r1} minlength=${params.qc_minlen} ${params.p5_primer_params} in1=${sample.id}_R1.fastq.gz out1=fwd.fastq.gz + ${bbduk_call} ${ref_p5_r2} minlength=${params.qc_minlen} ${params.p5_primer_params} in1=${sample.id}_R2.fastq.gz out1=rev.fastq.gz + gzip -dc fwd.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/1//' | sort > fwd.txt + gzip -dc rev.fastq.gz | awk 'NR%4==1' | sed 's/^@//' | sed 's/\\/2//' | sort > rev.txt + join -1 1 -2 1 fwd.txt rev.txt > both.txt + seqtk subseq fwd.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R1.fastq.gz + seqtk subseq rev.fastq.gz both.txt | gzip -c - > ${sample.id}/${sample.id}_R2.fastq.gz + ${bbduk_call} in1=${sample.id}/${sample.id}_R1.fastq.gz lhist=${sample.id}/${sample.id}_R1.post_lhist.txt + ${bbduk_call} in1=${sample.id}/${sample.id}_R2.fastq.gz lhist=${sample.id}/${sample.id}_R2.post_lhist.txt + """ + } } diff --git a/modules/profilers/mapseq.nf b/modules/profilers/mapseq.nf index 78338ca..e5cfbee 100644 --- a/modules/profilers/mapseq.nf +++ b/modules/profilers/mapseq.nf @@ -1,4 +1,6 @@ process mapseq { + publishDir params.output_dir, mode: params.publish_mode + input: tuple val(sample), path(seqs) path(db_path) @@ -10,9 +12,13 @@ process mapseq { script: def db = 
(db_name == "default" || db_name == "") ? "" : "${db_path}/${db_name}.fasta ${db_path}/*.tax*" + db_run = (db != "") ? "${params.mapseq_bin} -outfmt simple ${seqs} ${db} > ${sample.id}/${sample.id}.${db_name}_bac_ssu.mseq" : "" + """ mkdir -p ${sample.id} - ${params.mapseq_bin} ${seqs} > ${sample.id}/${sample.id}_bac_ssu.mseq + ${params.mapseq_bin} -outfmt simple ${seqs} > ${sample.id}/${sample.id}_bac_ssu.mseq + + ${db_run} """ } diff --git a/nextflow.config b/nextflow.config index 904d854..d4032a3 100644 --- a/nextflow.config +++ b/nextflow.config @@ -4,5 +4,5 @@ manifest { description = "DADA2/figaro-based 16S amplicon analysis pipeline" name = "gaga2" nextflowVersion = ">=21.0" - version = "0.4" + version = "0.5" } diff --git a/scripts/assess_readlengths.py b/scripts/assess_readlengths.py index 3135796..0032b00 100644 --- a/scripts/assess_readlengths.py +++ b/scripts/assess_readlengths.py @@ -18,17 +18,32 @@ def parse_fastqc_report(f): return dict(item for item in readlengths if item[1] > 0) +def parse_bbduk_hist(f): + return { + int(line.strip().split("\t")[0]): int(line.strip().split("\t")[1]) + for line in open(f) + if line[0] != "#" + } + + def main(): ap = argparse.ArgumentParser() ap.add_argument("input_dir", type=str, default=".") + ap.add_argument("--amplicon_length", type=int, required=True) + ap.add_argument("--min_overlap", type=int, required=True) args = ap.parse_args() + minlength = 0.5 * args.amplicon_length + args.min_overlap + #Ā minlength = 0 + read_yields = {} - for r in (1, 2): + for mate in (1, 2): read_lengths = Counter() - for fastq_report in glob.glob(os.path.join(args.input_dir, f"*{r}_fastqc_data.txt")): - read_lengths.update(parse_fastqc_report(fastq_report)) + #for fastq_report in glob.glob(os.path.join(args.input_dir, f"*{mate}_fastqc_data.txt")): + # read_lengths.update(parse_fastqc_report(fastq_report)) + for hist in glob.glob(os.path.join(args.input_dir, f"*R{mate}.post_lhist.txt")): + read_lengths.update(parse_bbduk_hist(hist)) yields = list() for length, count in read_lengths.items(): @@ -36,7 +51,7 @@ def main(): (length, sum(v for k, v in read_lengths.items() if k >= length), sum(v * length for k, v in read_lengths.items() if k >= length)) ) for length, reads, bases in sorted(yields, key=lambda x:(x[2], x[1]), reverse=True): - read_yields.setdefault(r, list()).append((length, reads, bases)) + read_yields.setdefault(mate, list()).append((length, reads, bases)) r1_lengths = read_yields.get(1, list()) r2_lengths = read_yields.get(2, list()) @@ -45,15 +60,13 @@ def main(): all_lengths = r1_lengths is_hom = len(r1_lengths) == 1 else: - if len(r1_lengths) < len(r2_lengths): - r1_lengths.extend(("NA", "NA", "NA") for i in range(len(r2_lengths) - len(r1_lengths))) - elif len(r2_lengths) < len(r1_lengths): - r2_lengths.extend(("NA", "NA", "NA") for i in range(len(r1_lengths) - len(r2_lengths))) - all_lengths = [x + y for x, y in zip(r1_lengths, r2_lengths)] is_hom = len(r1_lengths) == len(r2_lengths) == 1 + smaller = min(len(r1_lengths), len(r2_lengths)) + all_lengths = [x + y for x, y in zip(r1_lengths[:smaller], r2_lengths[:smaller])] for item in all_lengths: - print(*item, sep="\t") + if len(item) == 3 or item[0] + item[3] > minlength: + print(*item, sep="\t") if is_hom: open("READSET_HOMOGENEOUS", "wt").close()