Unable to go from FASTQ to Salmon Quantification #1333

Closed
SuhasSrinivasan opened this issue Jul 2, 2024 · 1 comment
Labels
bug Something isn't working

SuhasSrinivasan commented Jul 2, 2024

Description of the bug

Dear Researchers and Developers,

Thank you for developing this pipeline.

I am trying to go from FASTQ files to Salmon pseudo-alignment and quantification, as per the flow chart (Phase 1 and 3 only): https://raw.githubusercontent.com/nf-core/rnaseq/3.14.0//docs/images/nf-core-rnaseq_metro_map_grey.png

Specifically, trying to achieve:

  1. Infer strandedness
  2. FastQC
  3. FastP/TrimGalore
  4. FastQC
  5. SortMeRNA
  6. Salmon (pseudo-alignment and quantification)
  7. MultiQC on FastQC output

Issue 1:

Despite supplying a pre-built decoy-aware Salmon index for transcripts, both Genome fasta and GTF files are still needed.
It is not clear why this is needed.

Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.
No GTF or GFF3 annotation specified! The pipeline requires at least one of these files.

Issue 2:

The fq subsample step is run; I am not sure whether this is necessary for Salmon to infer strandedness.

Issue 3:

At some point in the pipeline, there is a failure due to an RSEM error.
It is not clear why RSEM is being called for the Reference Genome, when it is not part of Steps 1 and 3.

process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:MAKE_TRANSCRIPTS_FASTA (rsem/GRCh38.primary_assembly.genome.fa) [  0%] 0 of 1

Issue 4:

The pipeline does not stop at Salmon quantification and tries to continue to unexpected next steps.

[78/cab4d8] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:SALMON_QUANT (ERR2179089)                             [100%] 1 of 1 ✔
[78/c6d1b3] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (gencode.v46.primary_assembly.annotation.gtf) [100%] 1 of 1 ✔
[8d/caed0e] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TXIMPORT                                              [100%] 1 of 1, failed: 1 ✘

It would be very helpful to know which switches need to be toggled to execute only Steps 1–7.
Thank you for your consideration.


Command used and terminal output

nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir ~/bioinformatics/output/salmon/ \
    --fasta ~/bioinformatics/references/salmon_hs/GRCh38.primary_assembly.genome.fa.gz \
    --gtf ~/bioinformatics/references/salmon_hs/gencode.v46.primary_assembly.annotation.gtf.gz \
    --gencode \
    --trimmer fastp \
    --salmon_index ~/bioinformatics/references/salmon_hs/index/ \
    --pseudo_aligner salmon \
    --skip_gtf_filter \
    --skip_gtf_transc \
    --skip_umi_extract \
    --skip_bbsplit \
    --skip_alignment \
    --skip_markduplic \
    --skip_bigwig \
    --skip_stringtie \
    --skip_preseq \
    --skip_dupradar \
    --skip_qualimap \
    --skip_rseqc \
    --skip_biotype_qc \
    --skip_deseq2_qc \
    --skip_multiqc \
    --max_memory 100.GB \
    --max_cpus 24

System information

Nextflow: 24.04.2
Hardware: Desktop
Executor: local
Container: conda
OS: Ubuntu 22.04.4 LTS
nf-core/rnaseq v3.14.0-gb89fac3

SuhasSrinivasan added the bug label Jul 2, 2024

pinin4fjords (Member) commented

As a general point, you need to treat the flow chart as a qualitative guide to what's going on. The workflow doesn't provide you with absolute control over the modules that are run; for that you'll need to make your own workflow (which is definitely an option for you to get exactly what you want).

We have a related feature request to reduce the genome requirements in this context, but haven't got to it yet.

Further:

  1. The fq subsample step (Issue 2) is necessary: we don't need to examine all reads to infer the strandedness, so we down-sample first.
  2. The RSEM call (Issue 3) is just using a utility from the RSEM suite to generate a transcriptome. We may be able to remove that dependency if and when we tackle the issue above.
  3. We use tximport (Issue 4) to construct matrices from the output of Salmon, and we don't have any plans to remove that.
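
To make the down-sampling point concrete, here is a minimal Python sketch of the idea. It is not the pipeline's code (the pipeline uses `fq subsample` followed by Salmon). The function names, the synthetic reads, and the 10% fraction are all assumptions for illustration. It keeps a random fraction of read pairs, with mates kept in sync, and maps a Salmon library-type code such as `ISR` to the `forward` / `reverse` / `unstranded` labels used in the samplesheet.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples (header, sequence, '+', qualities)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_lines, r2_lines, fraction, seed=1):
    """Keep each read pair with probability `fraction`.

    One random draw decides the fate of both mates, so R1 and R2 stay in sync.
    """
    rng = random.Random(seed)
    return [
        (rec1, rec2)
        for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines))
        if rng.random() < fraction
    ]


def strandedness_from_libtype(code):
    """Map a Salmon library-type code to a samplesheet-style strandedness label."""
    if code in ("SF", "ISF"):
        return "forward"
    if code in ("SR", "ISR"):
        return "reverse"
    if code in ("U", "IU"):
        return "unstranded"
    raise ValueError(f"unhandled library type: {code}")


# Synthetic paired-end data: 1000 read pairs with made-up names and sequences.
r1 = [line for i in range(1000) for line in (f"@read{i}/1", "ACGT", "+", "IIII")]
r2 = [line for i in range(1000) for line in (f"@read{i}/2", "TGCA", "+", "IIII")]

subset = subsample_pairs(r1, r2, fraction=0.1, seed=1)
print(f"kept {len(subset)} of 1000 pairs")
print(strandedness_from_libtype("ISR"))  # reverse
```

Library type is a property of the whole library rather than of individual reads, so a random subset is generally enough for Salmon to report it. That is why the subsample step runs even when only pseudo-alignment is requested.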

To summarise:

  • Reducing dependencies when using pseudo-aligners is a valid point we will try to address as priorities allow.
  • But you don't have absolute control of the specific modules used. For that, I would encourage you to build your own workflow using the pre-built nf-core modules and subworkflows that are available.
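
As a rough picture of what a stripped-down FASTQ-to-Salmon chain involves, the Python sketch below only assembles and prints the individual tool invocations for one paired-end sample. It is not an nf-core or Nextflow workflow, which is what the suggestion above actually refers to, and it runs nothing. The function name `plan_commands`, the FASTQ file names, and the output layout are hypothetical. SortMeRNA and the strandedness check are left out for brevity, and the tools themselves are assumed to be installed separately.

```python
import shlex
from pathlib import Path


def plan_commands(sample, r1, r2, salmon_index, outdir, threads=4):
    """Return the tool invocations for one paired-end sample as argument lists.

    Nothing is executed here; output directories would need to exist
    (FastQC in particular does not create its --outdir).
    """
    out = Path(outdir)
    trim_r1 = str(out / "fastp" / f"{sample}_1.trim.fastq.gz")
    trim_r2 = str(out / "fastp" / f"{sample}_2.trim.fastq.gz")
    return [
        # QC on the raw reads
        ["fastqc", "--outdir", str(out / "fastqc_raw"), r1, r2],
        # Adapter and quality trimming
        ["fastp", "-i", r1, "-I", r2, "-o", trim_r1, "-O", trim_r2,
         "--json", str(out / "fastp" / f"{sample}.fastp.json"),
         "--html", str(out / "fastp" / f"{sample}.fastp.html")],
        # QC on the trimmed reads
        ["fastqc", "--outdir", str(out / "fastqc_trimmed"), trim_r1, trim_r2],
        # Quantification against a pre-built index; libType A asks Salmon
        # to detect the library type automatically
        ["salmon", "quant", "--index", salmon_index, "--libType", "A",
         "--mates1", trim_r1, "--mates2", trim_r2,
         "--threads", str(threads), "--output", str(out / "salmon" / sample)],
        # Aggregate the reports found under the output directory
        ["multiqc", str(out), "--outdir", str(out / "multiqc")],
    ]


# Hypothetical inputs: the sample ID appears in the log above, the file names do not.
commands = plan_commands(
    sample="ERR2179089",
    r1="ERR2179089_1.fastq.gz",
    r2="ERR2179089_2.fastq.gz",
    salmon_index="references/salmon_hs/index",
    outdir="output/salmon",
)
for cmd in commands:
    print(shlex.join(cmd))
```

This produces per-sample `quant.sf` files only. Gene-level matrices like those the pipeline builds with tximport would be a separate step.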

I'm closing this as not being a bug, and we're already tracking the feature request elsewhere.
