STRT2-NextSeq analysis pipeline

A pipeline for the analysis of STRT2 RNA-sequencing outputs from NextSeq.
Please cite Ezer et al. 2021 when you use this pipeline.

Ezer S, Yoshihara M, Katayama S, DoGA consortium, Daub C, Lohi H, Krjutskov K, Kere J.
Generation of RNA sequencing libraries for transcriptome analysis of globin-rich tissues of the domestic dog.
STAR Protoc. 2021 Dec 12;2(4):100995.

Install

git clone https://github.com/my0916/STRT2.git

Dependencies

For STRT2.sh (the main pipeline)

For STRT-TFE.sh (TFE-based analysis)

StringTie

For fastq-fastQC.sh

The conda environment is provided as condaEnv.yml. The environment can be created with the followings:

conda env create -f condaEnv.yml
conda activate STRT2_env

For UPPMAX, these software are available through the module command in the scripts (STRT2-UPPMAX.sh, STRT2-TFE-UPPMAX.sh, and fastq-fastqc-uppmax.sh).

Requirements

Illumina BaseCalls files (.bcl). The number of lanes is determined based on the number of directories in the basecalls directory. Here is an example of 4 lanes:

  ├── L001
  ├── L002
  ├── L003
  └── L004

HISAT2 index built with a reference genome, (ribosomal DNA), and ERCC spike-ins
- See also How to build HISAT2 index.
- The HISAT2 index directory should include the followings:

    ├── [basename].1.ht2
    ├── [basename].2.ht2
    ├── [basename].3.ht2
    ├── [basename].4.ht2
    ├── [basename].5.ht2
    ├── [basename].6.ht2
    ├── [basename].7.ht2
    ├── [basename].8.ht2
    ├── [basename].fasta
    └── [basename].dict

Source files (in src directory)
- barcode.txt : Barcode sequence with barcode name (1–48). Please modify if you used different (number of) barcodes.
- ERCC.bed : 5'-end 50 nt region of ERCC spike-ins (SRM2374) for annotation and quality check.

Example usage

./STRT2.sh -o STRT2LIB -g canFam3 -a ens -b /path/to/Data/Intensities/BaseCalls/ \
-i /path/to/index/canFam3_reference -c HUDDINGE -r ABCDEFG123

For UPPMAX:

sbatch -A snic2017-7-317 -p core -n 8 -t 24:00:00 ./STRT2-UPPMAX.sh -o STRT2LIB -g canFam3 -a ens \
-b /path/to/Data/Intensities/BaseCalls/ -i /path/to/index/canFam3_reference -c HUDDINGE -r ABCDEFG123

Parameters

Mandatory

Name	Description
`-g, --genome`	Reference genome. Choose from `hg19`/`hg38`/`mm9`/`mm10`/`canFam3`.
`-b, --basecalls`	/PATH/to/the Illumina basecalls directory.
`-i, --index`	/PATH/to/the directory and basename of the HISAT2 index for the reference genome.

Optional

Name	Default value	Description
`-o, --out`	OUTPUT	Output file name.
`-a, --annotation`	ref	Gene annotation for QC and counting. Choose from `ref`(RefSeq)/`ens`(Ensembl)/`kg`(UCSC KnownGenes), or directly input the Gencode annotation file name (eg. `wgEncodeGencodeBasicV28lift37`) for Gencode. Note that some annotations are unavailable in some cases. Please find the details below.
`-c, --center`	CENTER	The name of the sequencing center that produced the reads. Required for the the Picard IlluminaBasecallsToSam program.
`-r, --run`	RUNBARCODE	The barcode of the run. Prefixed to read names. Required for the the Picard IlluminaBasecallsToSam program.
`-s, --structure`	8M3S74T6B	Read structure. Required for the the Picard IlluminaBasecallsToSam program. Details are described here.
`-d, --dta`		Add `-d, --dta` (downstream-transcriptome-assembly) if you plan to perform TFE-based analysis. Please note that this leads to fewer alignments with short-anchors.
`-h, --help`		Show usage.
`-v, --version`		Show version.

-a, --annotation availability as of July 2020:

	RefSeq (ref)	Ensembl (ens)	KnownGenes (kg)	Gencode
hg19 (human)	✔️	✔️	✔️	✔️
hg38 (human)	✔️	NA	✔️	✔️
mm9 (mouse)	✔️	✔️	✔️	NA
mm10 (mouse)	✔️	NA	✔️	✔️
canFam3 (dog)	✔️	✔️	NA	NA

Outputs

Outputs are provided in out directory. Unaligned BAM files generated with Picard IlluminaBasecallsToSam program are found in tmp/Unaligned_bam.

OUTPUT-QC.txt
Quality check report for all samples.

Column	Value
`Barcode`	Sample name. `OUTPUT` with numbers
`Qualified_reads`	Primary aligned read count
`Total_reads`	Read count without redundant (duplicate) reads
`Redundancy`	Qualified reads / Total reads
`Mapped_reads`	Mapped read count (Total reads without unmapped reads)
`Mapped_rate`	Mapped reads / Total reads
`Spikein_reads`	Read count mapped to ERCC spike-ins
`Spikein-5end_reads`	Read count mapped to the 5'-end 50 nt region of ERCC spike-ins
`Spikein-5end_rate`	Spikein-5end reads / Spikein reads
`Coding_reads`	Read count aligned within any exon or the 500 bp upstream of coding genes
`Coding-5end_reads`	Read count aligned the 5′-UTR or 500 bp upstream of coding genes
`Coding-5end_rate`	Coding-5end reads / Coding reads

OUTPUT-QC-plots.pdf
Quality check report by boxplots. Mapped_reads, Mapped_rate, Spikein_reads, Mapped / Spikein, Spikein-5end_rate, and Coding-5end_rate are shown for all samples. Barcode numbers of outlier samples are marked with red characters.
Please consider these outlier samples for the further downstream analysis.
OUTPUT_byGene-counts.txt
Read count table output from featureCounts. http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
OUTPUT_byGene-counts.txt.summary
Filtering summary from featureCounts. http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf
Output_bam
Resulting BAM files including unmapped, non-primary aligned, and duplicated (marked) reads.
Output_bai
Index files (.bai) of the resulting BAM files in the Output_bam directory.
ExtractIlluminaBarcodes_Metrics
Metrics file produced by the Picard ExtractIlluminaBarcodes program. The number of matches/mismatches between the barcode reads and the actual barcodes is shown per lane. https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_illumina_ExtractIlluminaBarcodes.php
HISAT2_Metrics
Alignment summary of samples from each lane produced by the HISAT2 program. https://ccb.jhu.edu/software/hisat2/manual.shtml#alignment-summary
MarkDuplicates_Metrics
Metrics file indicating the numbers of duplicates produced by the Picard MarkDuplicates program. https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.4.0/picard_sam_markduplicates_MarkDuplicates.php

fastq-fastQC-UPPMAX.sh

After running the pipeline above, you can generate fastq files for each sample from the output BAM files in the fastq directory. These fastq files (without duplicated reads) can be submitted to public sequence databases.
FastQC files are also generated for each fastq file in the fastqc directory.
Based on the FastQC results, MultiQC report (MultiQC_report.html) is generated.

How to build HISAT2 index

Here is the case for the dog genome (canFam3).

1. Obtain the genome sequences of reference and ERCC spike-ins.

You may add the ribosomal DNA repetitive unit for human (U13369) and mouse (BK000964).

wget http://hgdownload.cse.ucsc.edu/goldenPath/canFam3/bigZips/canFam3.fa.gz
unpigz -c canFam3.fa.gz | ruby -ne '$ok = $_ !~ /^>chrUn_/ if $_ =~ /^>/; puts $_ if $ok' > canFam3_reference.fasta

wget https://www-s.nist.gov/srmors/certificates/documents/SRM2374_putative_T7_products_NoPolyA_v2.FASTA
cat SRM2374_putative_T7_products_NoPolyA_v2.FASTA >> canFam3_reference.fasta

2. Extract splice sites and exons from a GTF file.

Here Ensembl transcript map (canFam3.transMapEnsemblV4.gtf.gz) was downloaded from the UCSC Table Browser.

unpigz -c canFam3.transMapEnsemblV4.gtf.gz | hisat2_extract_splice_sites.py - | grep -v ^chrUn > canFam3.ss
unpigz -c canFam3.transMapEnsemblV4.gtf.gz | hisat2_extract_exons.py - | grep -v ^chrUn > canFam3.exon

You may additionally perform hisat2_extract_snps_haplotypes_UCSC.py to extract SNPs and haplotypes from a dbSNP file for human and mouse.

3. Build the HISAT2 index.

hisat2-build canFam3_reference.fasta --ss canFam3.ss --exon canFam3.exon canFam3_reference

This outputs a set of files with suffixes. Here, canFam3_reference.1.ht2, canFam3_reference.2.ht2, ..., canFam3_reference.8.ht2 are generated.
In this case, canFam3_reference is the basename used for -i, --index.

4. Prepare the sequence dictionary for the reference sequence.

java -jar picard.jar CreateSequenceDictionary R=canFam3_reference.fasta O=canFam3_reference.dict

This is required for the Picard MergeBamAlignment program. Note that the original FASTA file (canFam3_reference.fasta here) is also required.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

STRT2-NextSeq analysis pipeline

Install

Dependencies

Requirements

Example usage

Parameters

Outputs

fastq-fastQC-UPPMAX.sh

How to build HISAT2 index

1. Obtain the genome sequences of reference and ERCC spike-ins.

2. Extract splice sites and exons from a GTF file.

3. Build the HISAT2 index.

4. Prepare the sequence dictionary for the reference sequence.

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 304 Commits
bin		bin
src		src
LICENSE		LICENSE
README.md		README.md
STRT-TFE.sh		STRT-TFE.sh
STRT2-TFE-UPPMAX.sh		STRT2-TFE-UPPMAX.sh
STRT2-UPPMAX.sh		STRT2-UPPMAX.sh
STRT2.sh		STRT2.sh
TFE-README.md		TFE-README.md
condaEnv.yml		condaEnv.yml
fastq-fastQC-UPPMAX.sh		fastq-fastQC-UPPMAX.sh
fastq-fastQC.sh		fastq-fastQC.sh

License

my0916/STRT2

Folders and files

Latest commit

History

Repository files navigation

STRT2-NextSeq analysis pipeline

Install

Dependencies

Requirements

Example usage

Parameters

Outputs

fastq-fastQC-UPPMAX.sh

How to build HISAT2 index

1. Obtain the genome sequences of reference and ERCC spike-ins.

2. Extract splice sites and exons from a GTF file.

3. Build the HISAT2 index.

4. Prepare the sequence dictionary for the reference sequence.

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages