-
Notifications
You must be signed in to change notification settings - Fork 1
DESeq2 with star
Meg Staton edited this page Nov 15, 2022
·
11 revisions
We need counts files with only two columns and to remove the non-gene categories at the top
tail -n +5 EarlyBlommingRep1ReadsPerGene.out.tab | cut -f1,2 > EarlyBlommingRep1ReadsPerGene.out.fixed.tab
tail -n +5 EarlyBloomingRep2ReadsPerGene.out.tab | cut -f1,2 > EarlyBloomingRep2ReadsPerGene.out.fixed.tab
tail -n +5 LateBloomingRep1ReadsPerGene.out.tab | cut -f1,2 > LateBloomingRep1ReadsPerGene.out.fixed.tab
tail -n +5 LateBloomingRep2ReadsPerGene.out.tab | cut -f1,2 > LateBloomingRep2ReadsPerGene.out.fixed.tab
Alternative bash for loop...
for f in *Gene.out.tab
do
outputfile=$(echo "$f" | sed 's/tab/fixed.tab/')
echo "tail -n +5 $f | cut -f1,2 > $outputfile"
tail -n +5 $f | cut -f1,2 > $outputfile
done
Open a terminal on your laptop and navigate to a folder where you can do the differential expression analysis. Get the counts and a gene annotation file
scp [email protected]:/lustre/isaac/proj/UTK0208/rnaseq/analysis/YOURUSERNAME/2_star/\*Gene.out.fixed.tab .
scp [email protected]:/lustre/isaac/proj/UTK0208/rnaseq/raw_data/Ppersica_298_v2.1.annotation_info.txt .
pwd
Fix misspelling
mv EarlyBlommingRep1ReadsPerGene.out.fixed.tab EarlyBloomingRep1ReadsPerGene.out.fixed.tab
# folder with counts file
setwd("/Users/margaretstaton/Desktop/PrunusDormancy")
# installs if needed
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
install.packages("pheatmap")
install.packages("RColorBrewer")
install.packages("ggplot2")
## load libraries
library( "DESeq2" )
library("pheatmap")
library("RColorBrewer")
library("ggplot2")
## set up a dataframe with our important information: filename, sample name, condition (in this case, genotype). Conditions need to be factors, not character type
sampleFiles <- grep("Rep",list.files(),value=TRUE)
sampleGenotype <- sub("(.*ming).*","\\1",sampleFiles)
sampleName <- sub("(.*mingRep[0-9]+).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleName,
fileName = sampleFiles,
genotype = sampleGenotype)
sampleTable$genotype <- factor(sampleTable$genotype)
## You could alternatively import a spreadsheet, again be sure experimental variables are factors
## sampleTable <- read.table("sampleTable.txt", header=TRUE, colClasses=c("character", "character", "factor"))
## import count data into a DESeqDataSet
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = '.', design = ~ genotype)
dds
##Prefiltering very lowly expressed reads, which will make everything else run faster
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
# Do the differential expression analysis (we'll learn more about this in the next lab, right now we just want the software to accept all the data so we can explore the data a bit).
dds <- DESeq(dds)
res <- results(dds)
res
# Check dispersion plot
plotDispEsts(dds)
#Now we actually need to transform the data again. From the vignette: "In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data."
# Lets go with rlog.
rld <- rlog(dds, blind=FALSE)
#Sample to sample distances
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- rld$genotype
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
#PCA
plotPCA(rld, intgroup=c("genotype"))
# Lets go back to investigating our results
summary(res)
# For easy access to the genes of interest, let's create a dataframe with only the genes with an adjusted p value less than 0.05.
res_sig <- as.data.frame(res[ which(res$padj < 0.05),])
head(res_sig)
dim(res_sig)
#Lets sort the dataframe so we can see the most significant genes:
res_sig <- res_sig[order(res_sig$padj),]
#To see the 10 most significant genes, we can do a head and request 10 lines:
head(res_sig, n=10)
#Add gene annotation
# You can add gene annotation to your data frame, making it easier to see what each gene is. You'll need to download the annotation file
# read annotation info file in
annotat <- read.csv("Ppersica_298_v2.1.annotation_info.txt", header = TRUE, comment.char = '', sep='\t')
# keep first row for each unique gene in locusName, removing transcript isoforms
annotat <- annotat[!duplicated(annotat[,2]),]
head(annotat)
# remove '.v2.1' in the gene name to match the locus name in the annotation file
rownames(res_sig) <- gsub('.v2.1','',rownames(res_sig))
# Merge annotation content to DEGs
res.anno <- merge(res_sig, annotat, by.x = "row.names", by.y = "locusName", all.x = TRUE, all.y = FALSE)
## write annotated results to table
write.table(res.anno, file="DEG_sig.txt", sep = "\t", row.names = F)
On a zch terminal:
curl -O https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz
sudo tar fxz gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /
sudo echo "export PATH=$PATH:/opt/R/arm64/gfortran/bin" >> .zshrc
- Make sure it installed into /opt/R/arm64/gfortran/bin
- Open new terminal, check $PATH, restart R, reinstall DESeq2
Find out the library paths in R"
.libPaths()
For me, only the first one is writeable, and if I specify it, it works:
BiocManager::install("DESeq2", lib="/Users/margaretstaton/Library/R/4.0/library")
Must add this for all packages that complain (or fix permissions problems)