Data analysis for TurboID samples from the Scott Lab. Two AKAP18 mutants were fused to TurboID: P, a nuclear retention mutant that blocks translocation from the nucleus to the cytoplasm, and an NLS mutant that stays in the cytoplasm. The fusion proteins were overexpressed in cultured cells in the presence of biotin for proximity labeling of potential interacting proteins. TurboID alone was also overexpressed as a control.
For biological replicates, there are 3 control samples, 4 NLS samples, and 4 P samples.
library(tidyverse)
library(stringr)
library(ggthemes)
# for joy plot
library(ggridges)
# for labeling geom_point; repels labels when they overlap
library(ggrepel)
# for Venn diagram
library(VennDiagram)
Before analyzing the results, I need to tidy up the data, i.e., select the columns needed for the analysis and filter out contaminants (unwanted proteins).
The table contains 2494 rows and 138 columns. Only the columns containing gene names, protein expression, and contamination flags are required. There are three columns containing contamination information; they can be combined to filter out contaminants and then removed, because they will not be used again.
raw.PG <- read.delim("../../proteinGroups.txt", header = TRUE, stringsAsFactors = FALSE)
clean.PG <- raw.PG %>%
dplyr::select(Protein.names,
Gene.names,
Fasta.headers,
Majority.protein.IDs,
starts_with("LFQ"),
Only.identified.by.site,
Reverse,
Potential.contaminant) %>%
# unite the contamination columns for easier filtering
unite(con,
Only.identified.by.site, Reverse, Potential.contaminant, sep = "") %>%
filter(con == "") %>%
dplyr::select(-con)
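As a quick sanity check, comparing the dimensions before and after filtering shows how many rows were dropped (a minimal sketch; the counts quoted above come from this particular input file):
dim(raw.PG)   # 2494 rows, 138 columns
dim(clean.PG) # fewer rows after dropping contaminants, reverse hits, and site-only IDs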
For the analysis, I need to log-transform the data, check the distribution, and normalize. Normalization is done using z-score transformation.
LFQ <- clean.PG %>% dplyr::select(starts_with("LFQ")) # extract the numeric columns
# log transform
log.LFQ <- apply(LFQ, 2, log2)
# replace the -Inf
log.LFQ <- as.data.frame(apply(log.LFQ, 2, function(x) ifelse(is.infinite(x), NA, x)), stringsAsFactors = FALSE)
# rename the columns
colnames(log.LFQ) <- str_remove(names(log.LFQ), "^LFQ\\.intensity\\.")
# check distribution
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples))
## Picking joint bandwidth of 0.703
## Warning: Removed 19312 rows containing non-finite values
## (stat_density_ridges).
# Normalization: I also compared the peak shapes between z-score and MAD normalization,
# but saw no obvious difference between the two methods, so z-scores are used
log.LFQ <- apply(log.LFQ, 2, scale, center = TRUE, scale = TRUE)
log.LFQ <- data.frame(log.LFQ, stringsAsFactors = FALSE)
# check distribution after transformation
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples)) +
xlab("Normalized LFQ")
## Picking joint bandwidth of 0.24
## Warning: Removed 19312 rows containing non-finite values
## (stat_density_ridges).
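For reference, the MAD-based alternative mentioned in the comment above could look like the sketch below; it would replace the scale() call rather than follow it (not used in this analysis).
# robust scaling with median/MAD instead of mean/sd (sketch, not used)
mad.scale <- function(x) (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
# usage would mirror the z-score step: apply(log.LFQ, 2, mad.scale)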
Based on the following scatterplots, reproducibility is not bad. The controls look clearly different from the fusion proteins, which is expected. Counting the number of quantified proteins per experiment revealed that p.2 has a lower number than the other replicates, so I will exclude it from further analysis.
# check reproducibility between replicates
pairs(log.LFQ, pch = ".")
# check the number of quantified proteins for each sample
apply(log.LFQ, 2, function(x) sum(!is.na(x)))
## ctrl.1 ctrl.3 ctrl.8 nls.1 nls.2 nls.3 nls.8 p.1 p.2 p.3
## 212 520 162 782 503 937 936 702 320 1087
## p.8
## 718
# remove p.2
log.LFQ$p.2 <- NULL
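Pairwise correlations give a complementary, quantitative view of the reproducibility seen in the scatterplots (a sketch using pairwise-complete observations):
# pairwise Pearson correlations between samples (sketch)
round(cor(log.LFQ, use = "pairwise.complete.obs"), 2)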
I count the missing data per gene for each treatment, i.e., control, P and NLS mutants. This will affect the filtering and p-value adjustment downstream. The missing values are imputed using random numbers drawn from a distribution with a mean of -2 and an s.d. of 0.1, which assumes that the missing data are due to the signal falling below the instrument's detection threshold.
# Counting NAs
sample.names <- c("ctrl", "nls", "p")
count.NAs <- function(x) {
# separate the samples here
sep.sample <- log.LFQ[, grepl(pattern = x, x = names(log.LFQ)), drop = FALSE]
# number of replicate columns, for calculating the % missing
len.sample <- length(sep.sample)
# count NAs and calculate the percentage of missing data
sep.sample$NAs <- apply(sep.sample, 1,
function(x) sum(is.na(x)))
sep.sample$percent.NAs <- sep.sample$NAs / len.sample
# rename the column
colnames(sep.sample)[as.integer(length(sep.sample))] <- str_c(x, ".percent.NAs")
# keep the last column only
sep.sample <- sep.sample[, as.integer(length(sep.sample)), drop = FALSE]
return(sep.sample)
}
NA.counts <- lapply(sample.names, count.NAs)
NA.counts <- Reduce(x = NA.counts,
f = function(x,y) cbind(x,y))
# Imputation of missing values
# because the data are z-score transformed, each column has mean 0 and sd 1
data.size <- dim(log.LFQ)
# make the random distribution of values for imputation. Downshift 2 and sd = 0.1
set.seed(576)
impute.values <- rnorm(n = data.size[1] * data.size[2],
mean = -2, sd = 0.1)
# impute
set.seed(576)
log.LFQ.imputed <- data.frame(apply(log.LFQ, 2,
function(x) ifelse(is.na(x), sample(impute.values), x)),
stringsAsFactors = FALSE)
# rename columns
colnames(log.LFQ.imputed) <- str_c("Imputed.", names(log.LFQ.imputed))
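To verify that the imputed values sit below the observed distribution, as the detection-threshold assumption implies, the two can be overlaid (a minimal sketch; check.df is a hypothetical helper name):
# overlay observed vs. imputed values to confirm the downshift (sketch)
check.df <- data.frame(
value = c(unlist(log.LFQ), unlist(log.LFQ.imputed)),
source = rep(c("observed", "imputed"), each = prod(data.size)))
ggplot(check.df) +
geom_density(mapping = aes(x = value, colour = source), na.rm = TRUE)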
A t-test is used to identify the proteins that are significantly enriched by the fusion proteins (P or NLS) compared to the control. The p-values are adjusted using the "fdr" method after removing rows with too many missing values. Results are visualized using volcano plots and a Venn diagram.
# write an automated function for t-test, in case more samples are added in the future
# the sample names
treatment <- c("nls",
"\\.p\\.")
ctrl <- "ctrl"
auto.ttest <- function(x) { # x is one treatment pattern from the vector above
# extract the treatment samples matching the pattern
treatment <- log.LFQ.imputed[, grepl(pattern = x, x = names(log.LFQ.imputed)), drop = FALSE]
# select the control columns using the ctrl pattern
ctrl <- log.LFQ.imputed[, grepl(pattern = ctrl, x = names(log.LFQ.imputed)), drop = FALSE]
# get the length of the sample and ctrl, this will be used for indexing later
len.t <- length(treatment) # number of columns in the treatment
len.c <- length(ctrl) # number of columns in the control
# do the t.test
df.for.ttest <- cbind(ctrl, treatment)
len.df <- as.integer(length(df.for.ttest))
df.for.ttest$Test.results <- NA
df.for.ttest$Sample.mean <- NA
df.for.ttest$Ctrl.mean <- NA
df.for.ttest$Difference <- NA
df.for.ttest$pvalue <- NA
df.for.ttest$Test.results <-
apply(df.for.ttest,
1,
function(x) t.test(x = x[as.integer(len.c+1):as.integer(len.c+len.t)], # x is treatment
y = x[1:as.integer(len.c)], var.equal = TRUE)[c(3,5)])
df.for.ttest$pvalue <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][1]))
df.for.ttest$Sample.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][1]))
df.for.ttest$Ctrl.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][2]))
df.for.ttest$Difference <- df.for.ttest$Sample.mean - df.for.ttest$Ctrl.mean
results <- df.for.ttest[, c("Sample.mean", "Ctrl.mean", "Difference", "pvalue")]
colnames(results) <- str_c(names(results), x)
return(results)
}
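As a spot check, the row-wise result can be compared against a direct t.test() call on a single protein (a sketch; row 1 is arbitrary, and the result should match the row-wise p-value computed below):
# direct t.test on the first row of the p-mutant comparison (sketch)
t.test(as.numeric(log.LFQ.imputed[1, grepl("\\.p\\.", names(log.LFQ.imputed))]),
as.numeric(log.LFQ.imputed[1, grepl("ctrl", names(log.LFQ.imputed))]),
var.equal = TRUE)$p.value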
# apply the function
ttest.result.list <- lapply(treatment, auto.ttest)
ttest.results <- Reduce(x = ttest.result.list, f = function(x,y) cbind(x,y))
ttest.results <- ttest.results[, c(2,5,7,8,1,3,4)]
colnames(ttest.results) <- c("Ctrl.means", "p.means", "p.differences", "p.pvalues",
"nls.means", "nls.differences", "nls.pvalues")
# tidy up the resulting data from the t-test
p.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(2:4)], NA.counts)
p.data <- p.data[p.data$p.percent.NAs < 0.6, ]
p.data$p.adjust.p <- p.adjust(p.data$p.pvalues, method = "fdr")
nls.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(5:7)], NA.counts)
nls.data <- nls.data[nls.data$nls.percent.NAs < 0.6, ]
nls.data$nls.adjust.p <- p.adjust(nls.data$nls.pvalues, method = "fdr")
# find the raw p-value cutoff corresponding to 0.05 FDR for each mutant
p.05 <- p.data[p.data$p.adjust.p > 0.05,]
p.05 <- min(p.05$p.pvalues)
nls.05 <- nls.data[nls.data$nls.adjust.p > 0.05,]
nls.05 <- min(nls.05$nls.pvalues)
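Because the adjusted p-values are monotone in the raw p-values, these cutoffs should sit above every significant raw p; a quick check (sketch):
# every significant raw p should fall below its cutoff (sketch)
max(p.data$p.pvalues[p.data$p.adjust.p <= 0.05]) < p.05
max(nls.data$nls.pvalues[nls.data$nls.adjust.p <= 0.05]) < nls.05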
# plot both mutants in one volcano plot, to see whether both datasets fit in one panel
both.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
#coord_cartesian(xlim = c(0.5, 4.5)) +
theme_classic()
both.plot
# separate volcano plots for each mutant
p.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("P Mutant Difference (Z-score Normalized log2(P - Ctrl)") +
theme_classic()
p.plot
nls.plot <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(-4.6, -0.3), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (-(Z-score Normalized log2(NLS - Ctrl)))") +
theme_classic()
#nls.plot
nls.plot2 <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (-(Z-score Normalized log2(NLS - Ctrl))") +
theme_classic()
nls.plot2
venn <- draw.pairwise.venn(57 + 24, 40 + 24, 24,
fill = c("blue", "red"),
alpha = c(0.5, 0.5),
category = c("NLS", "P"))
venn
## (polygon[GRID.polygon.384], polygon[GRID.polygon.385], polygon[GRID.polygon.386], polygon[GRID.polygon.387], text[GRID.text.388], text[GRID.text.389], text[GRID.text.390], text[GRID.text.391], text[GRID.text.392])
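The areas passed to draw.pairwise.venn() above are hardcoded; they could instead be derived from the significance calls, e.g. (a sketch, assuming an adjusted-p cutoff of 0.05 and unique gene names):
# derive the Venn areas from the data instead of hardcoding them (sketch)
sig.nls <- nls.data$Gene.names[nls.data$nls.adjust.p <= 0.05]
sig.p <- p.data$Gene.names[p.data$p.adjust.p <= 0.05]
length(sig.nls); length(sig.p); length(intersect(sig.nls, sig.p))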
#pdf(file = "fdr_05.pdf")
#grid.draw(venn)
#dev.off()
Finally, I assemble an output table combining the gene information, t-test results, missing-data percentages, and the LFQ values (raw and imputed).
output <- cbind(clean.PG[, 1:3], ttest.results[, 1:4])
output <- left_join(output, p.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, ttest.results[, 5:7])
output <- left_join(output, nls.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, clean.PG[,5:15])
output <- cbind(output, log.LFQ)
output <- cbind(output, log.LFQ.imputed)
output <- data.frame(apply(output, 2, function(x) ifelse(is.na(x), "", x)))
#write.table(output, "AKAP18_p_nls_BioID.txt",
# append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
head(output)