Skip to content

DESeq2 with star

Meg Staton edited this page Nov 15, 2022 · 11 revisions

fix counts

We need counts files with only two columns and to remove the non-gene categories at the top

tail -n +5 EarlyBlommingRep1ReadsPerGene.out.tab | cut -f1,2 > EarlyBlommingRep1ReadsPerGene.out.fixed.tab
tail -n +5 EarlyBloomingRep2ReadsPerGene.out.tab | cut -f1,2 > EarlyBloomingRep2ReadsPerGene.out.fixed.tab
tail -n +5 LateBloomingRep1ReadsPerGene.out.tab | cut -f1,2 > LateBloomingRep1ReadsPerGene.out.fixed.tab
tail -n +5 LateBloomingRep2ReadsPerGene.out.tab | cut -f1,2 > LateBloomingRep2ReadsPerGene.out.fixed.tab

Alternative bash for loop...

for f in *Gene.out.tab
do
  outputfile=$(echo "$f" | sed 's/tab/fixed.tab/')
  echo "tail -n +5 $f | cut -f1,2 > $outputfile"
  tail -n +5 $f | cut -f1,2 > $outputfile
done

scp

Open a terminal on your laptop and navigate to a folder where you can do the differential expression analysis. Get the counts and a gene annotation file

scp [email protected]:/lustre/isaac/proj/UTK0208/rnaseq/analysis/YOURUSERNAME/2_star/\*Gene.out.fixed.tab .
scp [email protected]:/lustre/isaac/proj/UTK0208/rnaseq/raw_data/Ppersica_298_v2.1.annotation_info.txt .
pwd

Fix misspelling

mv EarlyBlommingRep1ReadsPerGene.out.fixed.tab EarlyBloomingRep1ReadsPerGene.out.fixed.tab

R

# folder with counts file
setwd("/Users/margaretstaton/Desktop/PrunusDormancy")

# installs if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("DESeq2")

install.packages("pheatmap")
install.packages("RColorBrewer")
install.packages("ggplot2")

## load libraries

library( "DESeq2" )
library("pheatmap")
library("RColorBrewer")
library("ggplot2")

## set up a dataframe with our important information: filename, sample name, condition (in this case, genotype). Conditions need to be factors, not character type

sampleFiles <- grep("Rep",list.files(),value=TRUE)
sampleGenotype <- sub("(.*ming).*","\\1",sampleFiles)
sampleName <- sub("(.*mingRep[0-9]+).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleName,
                          fileName = sampleFiles,
                          genotype = sampleGenotype)
sampleTable$genotype <- factor(sampleTable$genotype)

## You could alternatively import a spreadsheet, again be sure experimental variables are factors
## sampleTable <- read.table("sampleTable.txt", header=TRUE, colClasses=c("character", "character", "factor"))

## import count data into a DESeqDataSet 
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = '.', design = ~ genotype)
dds

##Prefiltering very lowly expressed reads, which will make everything else run faster

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
# Do the differential expression analysis (we'll learn more about this in the next lab, right now we just want the software to accept all the data so we can explore the data a bit).

dds <- DESeq(dds)
res <- results(dds)
res

# Check dispersion plot

plotDispEsts(dds)

#Now we actually need to transform the data again. From the vignette: "In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data."

# Lets go with rlog.

rld <- rlog(dds, blind=FALSE)

#Sample to sample distances

sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- rld$genotype
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)
#PCA

plotPCA(rld, intgroup=c("genotype"))

# Lets go back to investigating our results

summary(res)

# For easy access to the genes of interest, let's create a dataframe with only the genes with an adjusted p value less than 0.05.

res_sig <- as.data.frame(res[ which(res$padj < 0.05),])
head(res_sig)
dim(res_sig)

#Lets sort the dataframe so we can see the most significant genes:

res_sig <- res_sig[order(res_sig$padj),]

#To see the 10 most significant genes, we can do a head and request 10 lines:

head(res_sig, n=10)

#Add gene annotation
# You can add gene annotation to your data frame, making it easier to see what each gene is. You'll need to download the annotation file

# read annotation info file in
annotat <- read.csv("Ppersica_298_v2.1.annotation_info.txt", header = TRUE, comment.char = '', sep='\t')

# keep first row for each unique gene in locusName, removing transcript isoforms
annotat <- annotat[!duplicated(annotat[,2]),]
head(annotat)

# remove '.v2.1' in the gene name to match the locus name in the annotation file
rownames(res_sig) <- gsub('.v2.1','',rownames(res_sig))

# Merge annotation content to DEGs
res.anno <- merge(res_sig, annotat, by.x = "row.names", by.y = "locusName", all.x = TRUE, all.y = FALSE)
## write annotated results to table
write.table(res.anno, file="DEG_sig.txt", sep = "\t", row.names = F)

DESEQ2 troubleshooting

Notes for ONLY if you have an M1 mac and error relates to gfortran.

On a zch terminal:

curl -O https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz
sudo tar fxz gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /
sudo echo "export PATH=$PATH:/opt/R/arm64/gfortran/bin" >> .zshrc
  • Make sure it installed into /opt/R/arm64/gfortran/bin
  • Open new terminal, check $PATH, restart R, reinstall DESeq2

Notes for ONLY if you have an M1 mac and error relates to no permission to write:

Find out the library paths in R"

.libPaths()

For me, only the first one is writeable, and if I specify it, it works:

BiocManager::install("DESeq2", lib="/Users/margaretstaton/Library/R/4.0/library")

Must add this for all packages that complain (or fix permissions problems)