RNAseq processing with Salmon, extraction of unmapped reads and subsequent de novo transcriptome assembly
This collection of scripts takes raw FASTQ files from RNA-seq, maps them to the transcriptome with Salmon, extracts read pairs that did not map, then preprocesses those reads and assembles them de novo with the Trinity pipeline.
Steps of the workflow are as follows:
Read quality is assessed before and after trimming with the script salmon_run.sh
FastQC and MultiQC are used to assess the quality of the raw reads.
Trim Galore is used for adapter and quality trimming.
FastQC and MultiQC are used to assess read quality after trimming.
Quantify reads with Salmon
Use index_salmon.sh to generate a Salmon index for Ceratopteris richardii.
Run Salmon quantification of the trimmed reads with quantify_salmon.sh. After quantification, unmapped read pairs (pairs in which neither read mapped) are extracted and written to 4_unmapped.
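The extraction of fully unmapped pairs can be sketched in Python as follows. This is a minimal illustration, not necessarily how quantify_salmon.sh does it. It assumes Salmon was run with `--writeUnmappedNames`, so that `aux_info/unmapped_names.txt` lists one read name and one flag per line, with the flag `u` marking pairs in which neither mate mapped. It also assumes uncompressed four-line FASTQ records with mates in the same order in both files. All function names and file paths are hypothetical.

```python
from itertools import islice


def strip_mate_suffix(name):
    """Remove a trailing /1 or /2 so both mates share one pair name."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def read_unmapped_pair_names(names_path):
    """Collect names flagged 'u' (neither mate mapped) from Salmon's list."""
    names = set()
    with open(names_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2 and fields[-1] == "u":
                names.add(strip_mate_suffix(fields[0]))
    return names


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes 4-line records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def header_pair_name(header):
    """Pair name from a FASTQ header: drop '@', comments and the mate suffix."""
    return strip_mate_suffix(header[1:].split()[0])


def extract_unmapped_pairs(names_path, r1_in, r2_in, r1_out, r2_out):
    """Write pairs flagged 'u' to new FASTQ files and return how many were kept."""
    wanted = read_unmapped_pair_names(names_path)
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        # Assumption: both input files list the mates in the same order.
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            if header_pair_name(rec1[0]) in wanted:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

Pairs with other flags (for example orphans, where only one mate mapped) are left out, which matches the "both reads in a pair did not map" criterion above.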
All preprocessing steps for the unmapped reads are in the file preprocess_unmapped.sh. The code roughly follows the pipeline at https://github.com/matevzl533/Noccaea_praecox_transcriptome/tree/main
FastQC and MultiQC are used to assess quality of "raw" unmapped reads
rCorrector is used to tag reads in the FASTQ output as corrected or uncorrectable. rCorrector is a tool designed specifically for k-mer-based error correction of RNA-seq reads.
A Python script from the Harvard Informatics GitHub repository TranscriptomeAssemblyTools is used at this stage. The script has been updated to Python 3.
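As an illustration of what such a filtering step can look like, here is a minimal sketch. It assumes the script's job is to drop read pairs in which either mate carries the `unfixable_error` tag that rCorrector appends to FASTQ headers, and to strip the trailing `cor` tag from corrected reads. It is not the Harvard script itself. The function names are hypothetical and uncompressed four-line FASTQ records are assumed.

```python
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes 4-line records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def strip_cor_tag(header):
    """Remove a trailing ' cor' tag from a FASTQ header line."""
    text = header.rstrip("\n")
    if text.endswith(" cor"):
        text = text[: -len(" cor")]
    return text + "\n"


def filter_unfixable_pairs(r1_in, r2_in, r1_out, r2_out):
    """Drop pairs where either mate is tagged unfixable; return (kept, removed)."""
    kept = removed = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            # Assumption: uncorrectable reads carry 'unfixable_error' in the header.
            if "unfixable_error" in rec1[0] or "unfixable_error" in rec2[0]:
                removed += 1
                continue
            rec1[0] = strip_cor_tag(rec1[0])
            rec2[0] = strip_cor_tag(rec2[0])
            o1.writelines(rec1)
            o2.writelines(rec2)
            kept += 1
    return kept, removed
```

Removing whole pairs rather than single reads keeps the two FASTQ files in step, which paired-end assemblers such as Trinity rely on.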
The SSUParc and LSUParc FASTA files (small- and large-subunit rRNA sequences) were downloaded from SILVA (https://ftp.arb-silva.de/?pk_vid=8352a8ccf0ead1d7168388545541b6c1). Before running bowtie2-build, the two files were concatenated and U was translated to T:
cat *.fasta | awk '/^[^>]/ { gsub(/U/,"T"); print; next }1' > SILVA.db
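For readers who prefer Python, the U-to-T translation can be sketched as below. This is a minimal equivalent of the awk one-liner, not part of the repository. Like the awk version, it leaves header lines (those starting with `>`) untouched, so a description such as "Uncultured bacterium" is not altered, and it replaces only uppercase U. The function names are hypothetical.

```python
def rna_to_dna_fasta(lines):
    """Yield FASTA lines with U replaced by T in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line  # header: keep as-is
        else:
            yield line.replace("U", "T")


def convert_fasta(src_path, dst_path):
    """Translate U to T from src_path into dst_path (must be different files)."""
    if src_path == dst_path:
        # Opening the output for writing would truncate the input before it is read.
        raise ValueError("source and destination must differ")
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(rna_to_dna_fasta(src))
```

The explicit check on the two paths guards against overwriting the input while it is still being read.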
Re-run QC from step 1.
Trinity accepts a tab-delimited text file via --samples_file, so the reads do not need to be looped over individually (see the Trinity documentation). Run make_sample_table.py, providing the directory containing your clean reads, to generate this file.
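To show what such a table looks like, here is a minimal sketch of building one. It is not the repository's make_sample_table.py. Trinity's samples file has four tab-separated columns: condition, replicate name, left reads and right reads. The file-name suffixes (`_1.fq.gz`, `_2.fq.gz`) and the `<condition>_rep<N>` naming scheme are assumptions made for illustration, as are the function names.

```python
import os
import re


def build_sample_rows(read_dir, r1_suffix="_1.fq.gz", r2_suffix="_2.fq.gz"):
    """Return (condition, replicate, left, right) tuples for paired read files."""
    rows = []
    for name in sorted(os.listdir(read_dir)):
        if not name.endswith(r1_suffix):
            continue
        sample = name[: -len(r1_suffix)]
        mate = sample + r2_suffix
        if not os.path.exists(os.path.join(read_dir, mate)):
            continue  # skip left-read files that have no matching right-read file
        # Assumed naming scheme: <condition>_rep<N>; otherwise use the sample name.
        match = re.match(r"(.+)_rep\d+$", sample)
        condition = match.group(1) if match else sample
        rows.append((condition, sample,
                     os.path.join(read_dir, name),
                     os.path.join(read_dir, mate)))
    return rows


def write_sample_table(rows, out_path):
    """Write the rows as a tab-delimited file for Trinity's --samples_file."""
    with open(out_path, "w") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")
```

A typical call would be `write_sample_table(build_sample_rows("clean_reads"), "samples.txt")`, where both names are placeholders.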
Trinity is used for de novo transcriptome assembly with default parameters. The script to run Trinity is trinity.sh. Ensure that the latest Trinity image is downloaded and stored in the same directory as your clean reads.
postprocessing.sh contains all post-processing steps used to process the de novo assembly and assess its quality and completeness:
- Remove redundancy with CD-HIT
- Produce basic statistics with the Trinity script TrinityStats.pl
- Quantify read representation by mapping reads back to the assembly with Bowtie2
- Prepare a new gene-to-transcript map (gene_trans_map) for the non-redundant assembly
- Build gene expression matrices for DEG analysis with kallisto (can be modified to run with Salmon instead)
- Calculate ExN50 for the assembly
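
To make the last statistic concrete, the sketch below computes N50 and a simplified ExN50. N50 is the contig length at which contigs of that length or longer hold at least half of all assembled bases. ExN50 is the N50 restricted to the most highly expressed transcripts that together account for x% of total expression. This transcript-level version is an illustration only. Trinity's own ExN50 utility works from the expression matrix and may differ in details. The function names and the `(length, expression)` input format are assumptions.

```python
def n50(lengths):
    """Contig length at which the longest contigs cover at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def exn50(transcripts, x_percent):
    """N50 of the top-expressed transcripts covering x_percent of total expression.

    transcripts: iterable of (length, expression) pairs (assumed input format).
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    total = sum(expr for _, expr in ranked)
    target = total * x_percent / 100
    selected = []
    running = 0.0
    for length, expr in ranked:
        if running >= target:
            break
        selected.append(length)
        running += expr
    return n50(selected)
```

Comparing `exn50(data, 90)` with `exn50(data, 100)` shows how much lowly expressed, often short, transcripts pull the statistic down; at 100% the function returns the same value as `n50` over all transcript lengths.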