annotate_my_genomes benchmarking

Benchmarking:

We benchmarked annotate_my_genomes performance against:

BRAKER2 (https://github.com/Gaius-Augustus/BRAKER)
SQANTI3 (https://github.com/ConesaLab/SQANTI3)
Augustus (https://github.com/Gaius-Augustus/Augustus)

Installation of each program:

At January 2022, each program was tested on Ubuntu 16.04.7 LTS (xenial) and installed as follows:

1) annotate_my_genomes:

See here: https://github.com/cfarkas/annotate_my_genomes#ii-installation

2) BRAKER2:

In my machine, ~ equals /home/wslab

### Create conda environment
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n braker2 braker2
conda activate braker2

### Dependencies
conda install -c anaconda perl
conda install -c bioconda perl-app-cpanminus
conda install -c bioconda perl-hash-merge
conda install -c bioconda perl-parallel-forkmanager
conda install -c bioconda perl-scalar-util-numeric
conda install -c bioconda perl-yaml
conda install -c bioconda perl-class-data-inheritable
conda install -c bioconda perl-exception-class
conda install -c bioconda perl-test-pod
conda install -c anaconda biopython
conda install -c bioconda perl-file-which # skip if you are not comparing to reference annotation
conda install -c bioconda perl-mce
conda install -c bioconda perl-threaded
conda install -c bioconda perl-list-util
conda install -c bioconda perl-math-utils
conda install -c bioconda cdbtools

### The following Perl modules are required:
sudo cpanm YAML
sudo cpanm Hash::Merge
sudo cpanm Parallel::ForkManager
sudo cpanm MCE::Mutex
sudo cpanm Thread::Queue
sudo cpanm threads

### clone repository
git clone https://github.com/Gaius-Augustus/BRAKER.git

### Unpack gmes_linux_64_4.tar.gz inside BRAKER folder and:
cd ~/BRAKER/gmes_linux_64_4
rm gm_key
mv gm_key_64 gm_key
cp gm_key ~/.gm_key

# check install as follows:
bash check_install.bash


### Installing AUGUSTUS

cd ~/BRAKER/

conda install -c anaconda cmake

sudo -i 

apt-get update
apt-get install build-essential wget git autoconf

# Install dependencies for AUGUSTUS comparative gene prediction mode (CGP)
apt-get install libgsl-dev libboost-all-dev libsuitesparse-dev liblpsolve55-dev
apt-get install libsqlite3-dev libmysql++-dev

# Install dependencies for the optional support of gzip compressed input files
apt-get install libboost-iostreams-dev zlib1g-dev

# Install dependencies for bam2hints and filterBam
apt-get install libbamtools-dev zlib1g-dev

# Install additional dependencies for bam2wig
apt-get install samtools libhts-dev

# Install additional dependencies for homGeneMapping and utrrnaseq
apt-get install libboost-all-dev

# Install additional dependencies for scripts
apt-get install cdbfasta diamond-aligner libfile-which-perl libparallel-forkmanager-perl libyaml-perl libdbd-mysql-perl
apt-get install --no-install-recommends python3-biopython

# make Augustus
cd ~/BRAKER/Augustus
make

### To run BRAKER2, export the following paths
export AUGUSTUS_CONFIG_PATH=/home/wslab/braker2_install/BRAKER/config
export GENEMARK_PATH=/home/wslab/braker2_install/BRAKER/gmes_linux_64_4/
echo $AUGUSTUS_CONFIG_PATH
echo $GENEMARK_PATH

3) SQANTI3:

In my machine, ~ equals /home/wslab

cd ~

### For general users
wget https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v4.2.tar.gz
tar -xvf v4.2.tar.gz

### Create Environment
cd SQANTI3-4.2/
conda env create -f SQANTI3.conda_env.yml
source activate SQANTI3.env
conda activate SQANTI3.env

### download gtfToGenePred inside SQANTI3-4.2/utilities/
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred -P /home/wslab/SQANTI3-4.2/utilities/
chmod +x /home/wslab/SQANTI3-4.2/utilities/gtfToGenePred 

### Clone and install cDNA_cupcake
git clone https://github.com/Magdoll/cDNA_Cupcake.git

# Requirements:
# pip install Cython
# conda install -c bioconda bcbio-gff

cd /home/wslab/SQANTI3-4.2/cDNA_Cupcake/
python setup.py build
python setup.py install

4) IsoSeq3 (required to assemble transcripts to run SQANTI3)

# See here: https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial%3A-Installing-and-Running-Iso-Seq-3-using-Conda

# Environment and installation:
conda create -n anaCogent5.2 python=3.7 anaconda
conda activate anaCogent5.2
conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python
conda install -n anaCogent5.2 -c bioconda isoseq3
conda install -n anaCogent5.2 -c bioconda pbccs
conda install -n anaCogent5.2 -c bioconda pbcoretools # for manipulating PacBio datasets
conda install -n anaCogent5.2 -c bioconda bamtools    # for converting BAM to fasta
conda install -n anaCogent5.2 -c bioconda pysam       # for making CSV reports
conda install -c bioconda gs-tama

Datasets:

We tested each pipeline on the following datasets:

Gallus gallus

### PacBio data (IsoSeq3)

mkdir PacBio_galGal6 && cd PacBio_galGal6
# HH23 all transcripts 
wget -O m54027_190807_082031.subreads.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887439/m54027_190807_082031.subreads.bam
# HH23 >4kb_transcripts_enrichment
wget -O m54027_191028_202826.subreads.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887440/m54027_191028_202826.subreads.bam
# HH30 all transcripts 
wget -O m54027_190808_044316.subreads.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887441/m54027_190808_044316.subreads.bam
# HH30 >4kb_transcripts_enrichment 
wget -O m54027_191029_165345.subreads.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887442/m54027_191029_165345.subreads.bam

### Illumina data (paired-end)

# HH23 rep1
wget -O 41-A3_S1_R1_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887419/41-A3_S1_R1_001.fastq.gz
wget -O 41-A3_S1_R2_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887419/41-A3_S1_R2_001.fastq.gz

# HH23 rep2
wget -O 41-B3_S2_R1_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887420/41-B3_S2_R1_001.fastq.gz
wget -O 41-B3_S2_R2_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887420/41-B3_S2_R2_001.fastq.gz

# HH30 rep1
wget -O 71-A4_S3_R1_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887421/71-A4_S3_R1_001.fastq.gz
wget -O 71-A4_S3_R2_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887421/71-A4_S3_R2_001.fastq.gz

# HH30 rep2
wget -O 71-B4_S4_R1_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887422/71-B4_S4_R1_001.fastq.gz
wget -O 71-B4_S4_R2_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR388/ERR3887422/71-B4_S4_R2_001.fastq.gz

Mus musculus

mkdir PacBio_Mus_musculus && cd PacBio_Mus_musculus

### PacBio data (IsoSeq3)
# GSM4118746: C57/BL6J oocyte cells pacbio; Mus musculus; RNA-Seq
wget -O m54180_190618_225637.subreads.bam https://sra-pub-src-2.s3.amazonaws.com/SRR10267009/m54180_190618_225637.subreads.bam 

### Illumina (paired-end)
prefetch --max-size 800G -O ./ SRR10266994    # GSM4118734: C57/BL6J oocyte cells rep2; Mus musculus; RNA-Seq
prefetch --max-size 800G -O ./ SRR10266993    # GSM4118733: C57/BL6J oocyte cells rep1; Mus musculus; RNA-Seq
prefetch --max-size 800G -O ./ SRR10266992    # GSM4118733: C57/BL6J oocyte cells rep1; Mus musculus; RNA-Seq
fastq-dump --split-files SRR10266992.sra
fastq-dump --split-files SRR10266993.sra
fastq-dump --split-files SRR10266994.sra

Homo sapiens

mkdir PacBio_Homo_sapiens && cd PacBio_Homo_sapiens

# PacBio data (IsoSeq)
wget -O m54119_170713_231222.subreads.bam https://zenodo.org/record/1581809/files/m54119_170713_231222.subreads.bam?download=1
wget -O m54119_170713_231222.subreads.bam.pbi https://zenodo.org/record/1581809/files/m54119_170713_231222.subreads.bam.pbi?download=1

Danio rerio

mkdir PacBio_Danio_rerio && cd PacBio_Danio_rerio

### PacBio data (fastq files)
prefetch --max-size 800G -O ./ SRR5865381      # GSM2717135: long-reads preZGA; Danio rerio; RNA-Seq
prefetch --max-size 800G -O ./ SRR5865382      # GSM2717136: long-reads postZGA; Danio rerio; RNA-Seq
fastq-dump SRR5865381.sra
fastq-dump SRR5865382.sra
cat SRR5865381.fastq.gz SRR5865382.fastq.gz > long_reads.fastq.gz
gunzip long_reads.fastq.gz

### illumina data (paired-end)
prefetch --max-size 800G -O ./ SRR5865383      # GSM2717137: short-reads preZGA; Danio rerio; RNA-Seq
prefetch --max-size 800G -O ./ SRR5865384      # GSM2717138: short-reads postZGA; Danio rerio; RNA-Seq
fastq-dump --split-files SRR5865383.sra
fastq-dump --split-files SRR5865384.sra

C. elegans

### PacBio data (fastq files)
wget -O L1_rep1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245464/L1_rep1.fastq.gz
wget -O L1_rep2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245465/L1_rep2.fastq.gz
wget -O L2_rep1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245466/L2_rep1.fastq.gz
wget -O L2_rep2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245467/L2_rep2.fastq.gz
wget -O L3_rep1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245468/L3_rep1.fastq.gz
wget -O L3_rep2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245469/L3_rep2.fastq.gz
wget -O L4_rep1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245470/L4_rep1.fastq.gz
wget -O L4_rep2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3245471/L4_rep2.fastq.gz

Alignments, PacBio data processing and GTF obtention:

We processed sequencing datasets per species as follows:

Gallus gallus

# Directory
cd PacBio_galGal6

# Genome (download with genome-download program from annotate_my_genomes)
genome-download galGal6

###########################
### Illumina alignments ###
###########################

# fastp install
conda install -c bioconda fastp

# Trimming of illumina fastq datasets
fastp -w 16 -i 41-A3_S1_R1_001.fastq.gz -I 41-A3_S1_R2_001.fastq.gz -o 41-A3_S1_R1_001.fq.gz -O 41-A3_S1_R2_001.fq.gz
fastp -w 16 -i 41-B3_S2_R1_001.fastq.gz -I 41-B3_S2_R2_001.fastq.gz -o 41-B3_S2_R1_001.fq.gz -O 41-B3_S2_R2_001.fq.gz
fastp -w 16 -i 71-A4_S3_R1_001.fastq.gz -I 71-A4_S3_R2_001.fastq.gz -o 71-A4_S3_R1_001.fq.gz -O 71-A4_S3_R2_001.fq.gz
fastp -w 16 -i 71-B4_S4_R1_001.fastq.gz -I 71-B4_S4_R2_001.fastq.gz -o 71-B4_S4_R1_001.fq.gz -O 71-B4_S4_R2_001.fq.gz

# build hisat2 index with 30 threads
hisat2-build galGal6.fa -p 30 galGal6.fa

# align short illumina reads agains galGal6, using 30 threads
hisat2 -x galGal6.fa -p 50 -1 41-A3_S1_R1_001.fq.gz -2 41-A3_S1_R2_001.fq.gz | samtools view -bS - > 41.bam
hisat2 -x galGal6.fa -p 50 -1 41-B3_S2_R1_001.fq.gz -2 41-B3_S2_R2_001.fq.gz | samtools view -bS - > 42.bam
hisat2 -x galGal6.fa -p 50 -1 71-A4_S3_R1_001.fq.gz -2 71-A4_S3_R2_001.fq.gz | samtools view -bS - > 71.bam
hisat2 -x galGal6.fa -p 50 -1 71-B4_S4_R1_001.fq.gz -2 71-B4_S4_R2_001.fq.gz | samtools view -bS - > 72.bam

# sorting bam files
samtools sort 41.bam -@ 50 > 41.sorted.bam
samtools sort 42.bam -@ 50 > 41.sorted.bam
samtools sort 71.bam -@ 50 > 71.sorted.bam
samtools sort 72.bam -@ 50 > 72.sorted.bam

# index bam files
samtools index 41.sorted.bam -@ 15 && rm 41.bam
samtools index 42.sorted.bam -@ 15 && rm 42.bam
samtools index 71.sorted.bam -@ 15 && rm 71.bam
samtools index 72.sorted.bam -@ 15 && rm 72.bam

# Merge illumina bam files
samtools merge Illumina_merged.bam 41.sorted.bam 42.sorted.bam 71.sorted.bam 72.sorted.bam -@ 30 -f
samtools sort -o Illumina_merged.sorted.bam Illumina_merged.bam -@ 30
samtools index Illumina_merged.sorted.bam -@ 30
rm Illumina_merged.bam

#############################################
### PacBio standard read length alignment ###
#############################################

samtools bam2fq m54027_190807_082031.subreads.bam > reads_HH23_s.fastq
samtools bam2fq m54027_190808_044316.subreads.bam > reads_HH30_s.fastq

minimap2 -ax splice galGal6.fa reads_HH23_s.fastq > aln_HH23_s.sam -t 30
minimap2 -ax splice galGal6.fa reads_HH30_s.fastq > aln_HH30_s.sam -t 30

# SAM to bam 
samtools view -S -b aln_HH23_s.sam > aln_HH23_s.bam --threads 30
samtools sort aln_HH23_s.bam> aln_HH23_s.sorted.bam --threads 30 && samtools index aln_HH23_s.sorted.bam -@ 30
rm aln_HH23_s.sam

samtools view -S -b aln_HH30_s.sam > aln_HH30_s.bam --threads 30
samtools sort aln_HH30_s.bam> aln_HH30_s.sorted.bam --threads 30 && samtools index aln_HH30_s.sorted.bam -@ 30
rm aln_HH30_s.sam

#########################################
### PacBio long read length alignment ###
#########################################

samtools bam2fq m54027_191028_202826.subreads.bam > reads_HH23_l.fastq
samtools bam2fq m54027_191029_165345.subreads.bam > reads_HH30_l.fastq

minimap2 -ax splice galGal6.fa reads_HH23_l.fastq > aln_HH23_l.sam -t 30
minimap2 -ax splice galGal6.fa reads_HH30_l.fastq > aln_HH30_l.sam -t 30

# SAM to bam 
samtools view -S -b aln_HH23_l.sam > aln_HH23_l.bam --threads 30
samtools sort aln_HH23_l.bam > aln_HH23_l.sorted.bam --threads 30 && samtools index aln_HH23_l.sorted.bam -@ 30
rm aln_HH23_l.sam

samtools view -S -b aln_HH30_l.sam > aln_HH30_l.bam --threads 30
samtools sort aln_HH30_l.bam > aln_HH30_l.sorted.bam --threads 30 && samtools index aln_HH30_l.sorted.bam -@ 30
rm aln_HH30_l.sam

############################################
### Merge PacBio and illumina alignments ###
############################################

# PacBio
samtools merge 4d_7d_merged.bam aln_HH23_s.sorted.bam aln_HH30_s.sorted.bam aln_HH23_l.sorted.bam aln_HH30_l.sorted.bam -@ 30 --reference galGal6.fa  
samtools sort -o 4d_7d_merged.sorted.bam 4d_7d_merged.bam -@ 30 && samtools index 4d_7d_merged.sorted.bam -@ 30
rm 4d_7d_merged.bam

# all
samtools merge PacBio_Illumina_merged.bam 4d_7d_merged.sorted.bam 41.sorted.bam 42.sorted.bam 71.sorted.bam 72.sorted.bam -@ 30 -f
samtools sort -o PacBio_Illumina_merged.sorted.bam PacBio_Illumina_merged.bam -@ 30
samtools index PacBio_Illumina_merged.sorted.bam -@ 30
rm PacBio_Illumina_merged.bam

########################
### IsoSeq3 protocol ###
########################

### IsoSeq3 protocol using 30 threads
dataset create --type SubreadSet merged.subreadset.xml m54027_190807_082031.subreads.bam m54027_190808_044316.subreads.bam m54027_191028_202826.subreads.bam m54027_191029_165345.subreads.bam
ccs merged.subreadset.xml merged.ccs.bam --noPolish --minPasses 1 -j 30
mv ccs_report.txt ./merged_ccs_report.txt
isoseq3 cluster merged.ccs.bam merged.unpolished.bam -j 30
isoseq3 polish merged.unpolished.transcriptset.xml merged.subreadset.xml merged.polished.bam -j 30
cat merged.polished.hq.fasta.gz merged.polished.lq.fasta.gz > merged.polished.fasta.gz
gunzip merged.polished.fasta.gz

Mus musculus

# Directory
cd PacBio_Mus_musculus

# Genome (download with genome-download program from annotate_my_genomes)
genome-download mm10

###########################
### Illumina alignments ###
###########################

# fastp install
conda install -c bioconda fastp

# Trimming of illumina fastq datasets
fastp -w 16 -i SRR10266992_1.fastq -I SRR10266992_2.fastq -o SRR10266992_1.fq.gz -O SRR10266992_2.fq.gz
fastp -w 16 -i SRR10266993_1.fastq -I SRR10266993_2.fastq -o SRR10266993_1.fq.gz -O SRR10266993_2.fq.gz
fastp -w 16 -i SRR10266994_1.fastq -I SRR10266994_2.fastq -o SRR10266994_1.fq.gz -O SRR10266994_2.fq.gz

# build hisat2 index with 30 threads
hisat2-build mm10.fa -p 30 mm10.fa

# align short illumina reads agains mm10, using 30 threads
hisat2 -x mm10 -p 30 -1 SRR10266992_1.fq.gz -2 SRR10266992_2.fq.gz | samtools view -bS - > SRR10266992.bam
hisat2 -x mm10 -p 30 -1 SRR10266993_1.fq.gz -2 SRR10266993_2.fq.gz | samtools view -bS - > SRR10266993.bam
hisat2 -x mm10 -p 30 -1 SRR10266994_1.fq.gz -2 SRR10266994_2.fq.gz | samtools view -bS - > SRR10266994.bam

########################
### PacBio alignment ###
########################

samtools bam2fq m54180_190618_225637.subreads.bam > reads.fastq
minimap2 -ax splice mm10.fa reads.fastq -t 30 | samtools view -bS - > PacBio.bam
gzip reads.fastq

############################################
### Merge PacBio and illumina alignments ###
############################################

samtools merge merged.bam SRR10266994.bam SRR10266993.bam SRR10266992.bam PacBio.bam -@ 30 -f
samtools sort -o merged.sorted.bam merged.bam -@ 40
samtools index merged.sorted.bam
rm merged.bam

Homo sapiens

# Directory
cd PacBio_Homo_sapiens

# Genome (download with genome-download program from annotate_my_genomes)
genome-download hg38

########################
### PacBio alignment ###
########################

samtools bam2fq m54119_170713_231222.subreads.bam > reads.fastq
minimap2 -ax splice hg38.fa reads.fastq -t 30 | samtools view -bS - > PacBio.bam
gzip reads.fastq

Danio rerio

# Directory
cd PacBio_Danio_rerio

# Genome (download with genome-download program from annotate_my_genomes)
genome-download danRer11

###########################
### Illumina alignments ###
###########################

# fastp install
conda install -c bioconda fastp

# Trimming of illumina fastq datasets
fastp -w 16 -i SRR5865383_1.fastq -I SRR5865383_2.fastq -o SRR5865383_1.fq.gz -O SRR5865383_2.fq.gz
fastp -w 16 -i SRR5865384_1.fastq -I SRR5865384_2.fastq -o SRR5865384_1.fq.gz -O SRR5865384_2.fq.gz

# align short illumina reads agains danRer11, using 30 threads
hisat2 -x danRer11 -p 50 -1 SRR5865383_1.fq.gz -2 SRR5865383_2.fq.gz | samtools view -bS - > SRR5865383.bam
hisat2 -x danRer11 -p 50 -1 SRR5865384_1.fq.gz -2 SRR5865384_2.fq.gz | samtools view -bS - > SRR5865384.bam

########################
### PacBio alignment ###
########################

minimap2 -ax splice danRer11.fa SRR5865381.fastq.gz -t 30 | samtools view -bS - > SRR5865381.bam
minimap2 -ax splice danRer11.fa SRR5865382.fastq.gz -t 30 | samtools view -bS - > SRR5865382.bam

############################################
### Merge PacBio and illumina alignments ###
############################################

samtools merge merged.bam SRR5865383.bam SRR5865384.bam SRR5865381.bam SRR5865382.bam -@ 30 -f
samtools sort -o merged.sorted.bam merged.bam -@ 40
samtools index merged.sorted.bam
rm merged.bam

C. elegans

# Directory
cd PacBio_C_elegans

# Genome (download with genome-download program from annotate_my_genomes)
genome-download ce11

#########################
### PacBio alignments ###
#########################

minimap2 -ax splice ce11.fa L1_rep1.fastq.gz -t 30 | samtools view -bS - > L1_rep1.bam
minimap2 -ax splice ce11.fa L1_rep2.fastq.gz -t 30 | samtools view -bS - > L1_rep2.bam
minimap2 -ax splice ce11.fa L2_rep1.fastq.gz -t 30 | samtools view -bS - > L2_rep1.bam
minimap2 -ax splice ce11.fa L2_rep2.fastq.gz -t 30 | samtools view -bS - > L2_rep2.bam
minimap2 -ax splice ce11.fa L3_rep1.fastq.gz -t 30 | samtools view -bS - > L3_rep1.bam
minimap2 -ax splice ce11.fa L3_rep2.fastq.gz -t 30 | samtools view -bS - > L3_rep2.bam
minimap2 -ax splice ce11.fa L4_rep1.fastq.gz -t 30 | samtools view -bS - > L4_rep1.bam
minimap2 -ax splice ce11.fa L4_rep2.fastq.gz -t 30 | samtools view -bS - > L4_rep2.bam

###############################
### Merge PacBio alignments ###
###############################

samtools merge merged.bam L1_rep1.bam L1_rep2.bam L2_rep1.bam L2_rep2.bam L3_rep1.bam L3_rep2.bam L4_rep1.bam L4_rep2.bam -@ 30 -f
samtools sort -o merged.sorted.bam merged.bam -@ 30
samtools index merged.sorted.bam
rm merged.bam

Pipeline execution:

We executed the pipelines on each species as follows:

Gallus gallus

# Directory
cd PacBio_galGal6

######################
### GTF generation ###
######################

git clone https://github.com/cfarkas/annotate_my_genomes.git   # clone repository, install if necessary. 
conda activate annotate_my_genomes

mkdir illumina_alone
mkdir PacBio+Illumina

### Illumina GTF
stringtie -p 1 -j 2 -c 2 -v -a 4 -o ./illumina_alone/illumina_transcripts.gtf Illumina_merged.sorted.bam

### Illumina + PacBio GTF
stringtie -p 1 -j 2 -c 2 -v -a 4 -o ./PacBio+Illumina/transcripts.gtf PacBio_Illumina_merged.sorted.bam

### Illumina + PacBio GTF guided
stringtie -p 1 -j 2 -c 2 -v -a 4 -o ./PacBio+Illumina/transcripts-guided.gtf -G galGal6_ncbiRefSeq.gtf PacBio_Illumina_merged.sorted.bam

####################################
### annotate_my_genomes pipeline ###
####################################

#### annotate_my_genomes (illumina + PacBio)

# set gawn_config.sh with 30 threads
mkdir output-add-ncbi-annotation

begin=`date +%s`
add-ncbi-annotation -a transcripts.gtf -n galGal6_ncbiRefSeq.gtf -r galGal6.gtf -g galGal6.fa -c gawn_config.sh -t 30 -o output-add-ncbi-annotation
end=`date +%s`
elapsed=`expr $end - $begin`
echo Time taken: $elapsed

#### annotate_my_genomes (illumina + PacBio, guided)

# set gawn_config.sh with 30 threads

begin=`date +%s`
add-ncbi-annotation -a transcripts-guided.gtf -n galGal6_ncbiRefSeq.gtf -r galGal6.gtf -g galGal6.fa -c gawn_config.sh -t 30 -o output-add-ncbi-annotation
end=`date +%s`
elapsed=`expr $end - $begin`
echo Time taken: $elapsed

#### annotate_my_genomes (illumina alone)

# set gawn_config.sh with 30 threads

begin=`date +%s`
add-ncbi-annotation -a illumina_transcripts.gtf -n galGal6_ncbiRefSeq.gtf -r galGal6.gtf -g galGal6.fa -c gawn_config.sh -t 30 -o output-add-ncbi-annotation
end=`date +%s`
elapsed=`expr $end - $begin`
echo Time taken: $elapsed

########################
### BRAKER2 pipeline ###
########################

#### test1 with: PacBio_Illumina_merged.sorted.bam

conda activate braker2
export AUGUSTUS_CONFIG_PATH=/home/wslab/braker2_install/BRAKER/config
export GENEMARK_PATH=/home/wslab/braker2_install/BRAKER/gmes_linux_64_4/
echo $AUGUSTUS_CONFIG_PATH
echo $GENEMARK_PATH

rm -r -f /home/wslab/braker2_install/BRAKER/config/species/Gallus_gallus
braker.pl --verbosity=4 --AUGUSTUS_ab_initio --cores 30 --genome=galGal6.fa --species=Gallus_gallus --bam=PacBio_Illumina_merged.sorted.bam

#### test2 with: Illumina_merged.sorted.bam

conda activate braker2
export AUGUSTUS_CONFIG_PATH=/home/wslab/braker2_install/BRAKER/config
export GENEMARK_PATH=/home/wslab/braker2_install/BRAKER/gmes_linux_64_4/
echo $AUGUSTUS_CONFIG_PATH
echo $GENEMARK_PATH

rm -r -f /home/wslab/braker2_install/BRAKER/config/species/Gallus_gallus
braker.pl --verbosity=4 --AUGUSTUS_ab_initio --cores 30 --genome=galGal6.fa --species=Gallus_gallus --bam=Illumina_merged.sorted.bam

#### test3 with Illumina_merged.sorted.bam and Gallus gallus uniprot proteome 9031

# Downloading Gallus gallus proteome 9031
conda activate annotate_my_genomes
perl ./annotate_my_genomes/additional_scripts/download_proteome_uniprot.pl 9031

conda activate braker2

# see and fix this issue with diamond: https://github.com/Gaius-Augustus/BRAKER/issues/430
cd /home/wslab/braker2_install/BRAKER/gmes_linux_64_4/ProtHint/dependencies/

# downloading diamond tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.0.13/diamond-linux64.tar.gz
gunzip diamond-linux64.tar.gz
tar -xvf diamond-linux64.tar

export AUGUSTUS_CONFIG_PATH=/home/wslab/braker2_install/BRAKER/config
export GENEMARK_PATH=/home/wslab/braker2_install/BRAKER/gmes_linux_64_4/
echo $AUGUSTUS_CONFIG_PATH
echo $GENEMARK_PATH

rm -r -f /home/wslab/braker2_install/BRAKER/config/species/Gallus_gallus
braker.pl --verbosity=4 --AUGUSTUS_ab_initio --cores 30 --genome=galGal6.fa --species=Gallus_gallus --bam=Illumina_merged.sorted.bam

#### test4 with PacBio_Illumina_merged.sorted.bam and Gallus gallus uniprot proteome 9031

# Downloading Gallus gallus proteome 9031
conda activate annotate_my_genomes
perl ./annotate_my_genomes/additional_scripts/download_proteome_uniprot.pl 9031

conda activate braker2

# see and fix this issue with diamond: https://github.com/Gaius-Augustus/BRAKER/issues/430
cd /home/wslab/braker2_install/BRAKER/gmes_linux_64_4/ProtHint/dependencies/

# downloading diamond tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.0.13/diamond-linux64.tar.gz
gunzip diamond-linux64.tar.gz
tar -xvf diamond-linux64.tar

export AUGUSTUS_CONFIG_PATH=/home/wslab/braker2_install/BRAKER/config
export GENEMARK_PATH=/home/wslab/braker2_install/BRAKER/gmes_linux_64_4/
echo $AUGUSTUS_CONFIG_PATH
echo $GENEMARK_PATH

rm -r -f /home/wslab/braker2_install/BRAKER/config/species/Gallus_gallus
braker.pl --verbosity=4 --AUGUSTUS_ab_initio --cores 30 --genome=galGal6.fa --species=Gallus_gallus --bam=PacBio_Illumina_merged.sorted.bam

########################
### SQANTI3 pipeline ###
########################

# Using merged.polished.fasta, output from IsoSeq3 pipeline

#### SQANTI3 (finished with errors)

cd PacBio_galGal6/

conda activate SQANTI3.env
begin=`date +%s`
export PYTHONPATH=$PYTHONPATH:/home/wslab/SQANTI3-4.2/cDNA_Cupcake/sequence/
export PYTHONPATH=$PYTHONPATH:/home/wslab/SQANTI3-4.2/cDNA_Cupcake/
python /home/wslab/SQANTI3-4.2/sqanti3_qc.py --force_id_ignore --cpus 30 --output SQANTI_output --fasta merged.polished.fasta galGal6_ncbiRefSeq.gtf galGal6.fa
end=`date +%s`
elapsed=`expr $end - $begin`
echo Time taken: $elapsed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

annotate_my_genomes benchmarking

Benchmarking:

Installation of each program:

1) annotate_my_genomes:

2) BRAKER2:

3) SQANTI3:

4) IsoSeq3 (required to assemble transcripts to run SQANTI3)

Datasets:

Gallus gallus

Mus musculus

Homo sapiens

Danio rerio

C. elegans

Alignments, PacBio data processing and GTF obtention:

Gallus gallus

Mus musculus

Homo sapiens

Danio rerio

C. elegans

Pipeline execution:

Gallus gallus

Uh oh!

Uh oh!

Clone this wiki locally