-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathGSE_analysis_microarray.Rmd
341 lines (228 loc) · 15.9 KB
/
GSE_analysis_microarray.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: "GSE analysis for microarray data"
author: "Lian Chee Foong"
date: "2/3/2021"
output:
pdf_document:
df_print: tibble
toc: true
number_sections: true
highlight: tango
fig_width: 3
fig_height: 1.5
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
***
# Introduction
This R script is to demonstrate the steps to download GSE data from NCBI GEO database, and to obtain the differential expressed genes from GSE data.
## A little background of GEO
The Gene Expression Omnibus (GEO) is a data repository hosted by the National Center for Biotechnology Information (NCBI). NCBI contains all publicly available nucleotide and protein sequences. Presently, all records in GenBank NCBI are generated from direct submission to the DNA sequence databases from the original authors, who volunteer their records to make the data publicly available or do so as part of the publication process. The NCBI GEO is intended to house different types of expression data, covering all type of sequencing data in both raw and processed formats.
## Example here
Here, GSE63477, which is a microarray data, will be analysed. It contains an expression profile of prostate cancer cells (LNCaP) after treatment of cabazitaxel or docetaxel for 16 hr. You may read the details in NCBI GEO, under the “overall design” section. Or for better understanding, it’s always good if we read the original paper...link here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi
***
# Using GEOquery to obtain microarray data
First, set the working directory.
The GEOquery package allows you to access the data from GEO. Depending on your needs, you can download only the processed data and metadata provided by the depositor. In some cases, you may want to download the raw data as well, if it was provided by the depositor.
The function to download a GEO dataset is ‘getGEO’ from the ‘GEOquery’ package.Check how many platforms used for the GSE data, usually there will only be one platform. We set the first object in the list and gse now is an expressionSet. You can see that it contains assayData, phenoData, feature etc.
We can have a look at the sample information, gene annotation, and the expression data. This allow us to have a rough idea of the information storing in this expressionSet.
```{R}
getwd()
setwd("C:/Users/Lynn/Documents/R_GEOdata")
###https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Introduction
#----import the data------------------------------------
library(GEOquery)
my_id <- "GSE63477"
gse <- getGEO(my_id)
## check how many platforms used
length(gse)
gse <-gse[[1]]
gse
pData(gse) ## print the sample information
fData(gse) ## print the gene annotation
exprs(gse)[1,] ## print the expression data
```
# Check the normalisation and scales used
We can use this command to check the normalization method, to see if the data has already processed. So this expression data was RMA normalized and filtered to remove low-expressing genes. RMA means Robust Multiarray Average, it is the most common method to determine probeset expression level for Affymetrix arrays.
The ‘summary’ function can then be used to print the distributions of expression levels, if the data has been log transformed, typically in the range of 0 to 16. Hmm...the values are quite big and go beyond 16. It’s quite weird because RMA should already log2 transform the data at the last step, but well, let’s do it on our own and move to the next step. For a more careful analysis, we can try to run the raw data of this dataset again, by applying RMA normalization on our own, to see if there is any difference.
Anyway, here, let’s perform a log2 transformation. We may check the summary of expression level again. And draw a boxplot. We can see that the distributions of each sample are highly similar, which means the data have been normalised.
```{R}
pData(gse)$data_processing[1]
# For visualisation and statistical analysis, we will inspect the data to
# discover what scale the data are presented in. The methods we will use assume
# the data are on a log2 scale; typically in the range of 0 to 16.
## have a look on the expression value
summary(exprs(gse))
# From this output we clearly see that the values go beyond 16,
# so we need to perform a log2 transformation.
exprs(gse) <- log2(exprs(gse))
# check again the summary
summary(exprs(gse))
boxplot(exprs(gse),outline=F)
```
# Inspect the clinical variables
Now we try to look into the pData for the elements that we need for the analysis. We want to know the sample name, whether it is treatment or control...in this dataset, the info is stored in the column of 'characteristics_ch1.1'.
We can use the select function to subset the column of interest. It will be useful also to rename the column to something more shorter and easier.
To make a column of simplified group names for each sample, ‘Stringr’ is helpful. Two new columns are created, named “group” and "serum". The function ‘str_detect’ is to detect the presence of the words, and then fill the row accordingly. It totally depends on your dataset to make these necessary categories in the new columns, just modify these commands for your dataset of interest.
```{R}
library(dplyr)
sampleInfo <- pData(gse)
head(sampleInfo)
table(sampleInfo$characteristics_ch1.1)
#Let's pick just those columns that seem to contain factors we might
#need for the analysis.
sampleInfo <- select(sampleInfo, characteristics_ch1.1)
## Optionally, rename to more convenient column names
sampleInfo <- rename(sampleInfo, sample = characteristics_ch1.1)
head(sampleInfo)
dim(sampleInfo)
sampleInfo$sample
library(stringr)
sampleInfo$group <- ""
for(i in 1:nrow(sampleInfo)){
if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "full"))
{sampleInfo$group[i] <- "Conf"}
if(str_detect(sampleInfo$sample[i], "CTRL") && str_detect(sampleInfo$sample[i], "dextran"))
{sampleInfo$group[i] <- "Cond"}
if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "full"))
{sampleInfo$group[i] <- "cabazitaxelf"}
if(str_detect(sampleInfo$sample[i], "cabazitaxel") && str_detect(sampleInfo$sample[i], "dextran"))
{sampleInfo$group[i] <- "cabazitaxeld"}
if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "full"))
{sampleInfo$group[i] <- "docetaxelf"}
if(str_detect(sampleInfo$sample[i], "docetaxel") && str_detect(sampleInfo$sample[i], "dextran"))
{sampleInfo$group[i] <- "docetaxeld"}
}
sampleInfo
sampleInfo$serum <- ""
for(i in 1:nrow(sampleInfo)){
if(str_detect(sampleInfo$sample[i], "dextran"))
{sampleInfo$serum[i] <- "dextran"}
if(str_detect(sampleInfo$sample[i], "full"))
{sampleInfo$serum[i] <- "full_serum"}
}
sampleInfo <- sampleInfo[,-1]
sampleInfo
```
# Sample clustering and Principal Components Analaysis
We can visualize the correlations between the samples by hierarchical clustering.
The function ‘cor’ can calculate the correlation on the scale of 0 to 1, in a pairwise fashion between all samples, then visualise on a heatmap. There are many ways to create heatmaps in R, here I use ‘pheatmap’, the only argument it requires is a matrix of numeric values.
We can add more sample info onto the plot to get a better pic of the group and clustering. Here, we make use of the 'sampleInfo' file that was created earlier, to match with the columns of the correlation matrix.
```{R}
library(pheatmap)
## argument use="c" stops an error if there are any missing data points
corMatrix <- cor(exprs(gse),use="c")
pheatmap(corMatrix)
## Print the rownames of the sample information and check it matches the correlation matrix
rownames(sampleInfo)
colnames(corMatrix)
## If not, force the rownames to match the columns
#rownames(sampleInfo) <- colnames(corMatrix)
pheatmap(corMatrix, annotation_col= sampleInfo)
```
Another way is to use Principal component analysis (PCA). It has to note that the data has to be transposed, so that the genelist is in the column, while rownames are the samples, so the PCA process will not run out of the memory in the oher way round.
Let’s add labels to plot the results, here, we use the ‘ggplots2’ package, while the ‘ggrepel’ package is used to position the text labels more cleverly so they can be read. Here we can see that the samples are divided into two groups based on the serum treatment types.
```{R}
#make PCA
library(ggplot2)
library(ggrepel)
## MAKE SURE TO TRANSPOSE THE EXPRESSION MATRIX
pca <- prcomp(t(exprs(gse)))
## Join the PCs to the sample information
cbind(sampleInfo, pca$x) %>%
ggplot(aes(x = PC1, y=PC2, col=group, label=paste("",group))) + geom_point() + geom_text_repel()
```
# Differential expression analysis
In this section, we use the limma package to perform differential expressions. Limma stands for “Linear models for microarray”. Here, we need to tell limma what sample groups we want to compare. I choose sampleInfo$group. A design matrix will be created, this is a matrix of 0 and 1, one row for each sample and one column for each sample group.
We can rename the column names so that it is easier to see.
Now, let’s check if the expression data contain any lowly-expressed genes, this will affect the quality of DE analysis. A big problem in doing statistical analysis like limma is the inference of type 1 statistical errors, also called false positive. One simple way to reduce the possibility for type 1 errors is to do fewer comparisons, by filtering the data. For example, we know that not all genes are expressed in all tissues and many genes will not be expressed in any sample. As a result, in DGE analysis, it makes sense to remove the genes that are likely not expressed at all.
It is quite subjective how one defines a gene being expressed, here, I follow the tutorial, to make the cut off at the median of the expression values, which means to consider around 50% of the genes will not be expressed. Keep those expressed genes if they are present in more than 2 samples.
We can see that around half of the genes are not qualified as an “expressed” gene here, which makes sense, bcoz our cut-off is the median value.
```{R}
library(limma)
design <- model.matrix(~0 + sampleInfo$group)
design
## the column names are a bit ugly, so we will rename
colnames(design) <- c("Cabazitaxeld","Cabazitaxelf","Cond","Conf","Docetaxeld","Docetaxelf")
design
## calculate median expression level
cutoff <- median(exprs(gse))
## TRUE or FALSE for whether each gene is "expressed" in each sample
is_expressed <- exprs(gse) > cutoff
## Identify genes expressed in more than 2 samples
keep <- rowSums(is_expressed) > 3
## check how many genes are removed / retained.
table(keep)
## subset to just those expressed genes
gse <- gse[keep,]
```
Here there is a little extra step to find out the outliers. This has to be done carefully so the filtered data won't be too biased. We calculate ‘weights’ to define the reliability of each sample. The ‘arrayweights’ function will assign a score to each sample, with a value of 1 implying equal weight. Samples with score less than 1 are down-weighed, or else up-weighed.
```{R}
# coping with outliers
## calculate relative array weights
aw <- arrayWeights(exprs(gse),design)
aw
```
Now we have a design matrix, we need to estimate the coefficients. For this design, we will essentially average the replicate arrays for each sample level. In addition, we will calculate standard deviations for each gene, and the average intensity for the genes across all microarrays.
We are ready to tell limma which pairwise contrasts that we want to make. For this experiment, we are going to contrast treatment (there are two types of texane drugs) and control in each serum type. So there are 4 contrasts to specify.
To do the statistical comparisons, Limma uses Bayesian statistics to minimize type 1 error. The eBayes function performs the tests. To summarize the results of the statistical test, 'topTable' will adjust the p-values and return the top genes that meet the cutoffs that you supply as arguments; while 'decideTests' will make calls for DEGs by adjusting the p-values and applying a logFC cutoff similar to topTable.
```{R}
## Fitting the coefficients
fit <- lmFit(exprs(gse), design,
weights = aw)
head(fit$coefficients)
## Making comparisons between samples, can define multiple contrasts
contrasts <- makeContrasts(Docetaxeld - Cond, Cabazitaxeld - Cond, Docetaxelf - Conf, Cabazitaxelf - Conf, levels = design)
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2)
topTable(fit2)
topTable1 <- topTable(fit2, coef=1)
topTable2 <- topTable(fit2, coef=2)
topTable3 <- topTable(fit2, coef=3)
topTable4 <- topTable(fit2, coef=4)
#if we want to know how many genes are differentially expressed overall, we can use the decideTest function.
summary(decideTests(fit2))
table(decideTests(fit2))
```
# Further visualization with gene annotation
Now we want to know the gene name associated with the gene ID. The annotation data can be retrieved with the ‘fData’ function. Let’s select the ID, GB_ACC, this is genbank accession ID. Add into fit2 table.
The “Volcano Plot” function is a common way of visualising the results of a DE analysis. The x axis shows the log-fold change and the y axis is some measure of statistical significance, which in this case is the log-odds, or “B” statistic. We can also change the color of those genes with p value cutoff more than 0.05, and fold change cut off more than 1.
```{R}
anno <- fData(gse)
head(anno)
anno <- select(anno,ID,GB_ACC)
fit2$genes <- anno
topTable(fit2)
## Create volcano plot
full_results1 <- topTable(fit2, coef=1, number=Inf)
library(ggplot2)
ggplot(full_results1,aes(x = logFC, y=B)) + geom_point()
## change according to your needs
p_cutoff <- 0.05
fc_cutoff <- 1
full_results1 %>%
mutate(Significant = P.Value < p_cutoff, abs(logFC) > fc_cutoff ) %>%
ggplot(aes(x = logFC, y = B, col=Significant)) + geom_point()
```
# Further visualization of selected gene
I think at this point, we are quite clear about data structure of GSE data. It has an experiment data, pData; the expression data, exprs; and also annotation data, fData. And we have learned how to check the expression data, normalize them, and perform differential expression analysis.
Now, with the differential expression gene tables, there are some downstream analyses that we can continue, such as to export a full table of DE genes, to generate a heatmap for your selected genes, get the gene list for a particular pathway, or survival analysis (but this is only for those clinical data).
Here, I just want to look into the fold change data of a selected gene, whether it is significantly differential expressed or not.
```{R}
## Get the results for particular gene of interest
#GB_ACC for Nkx3-1 is NM_001256339 or NM_006167
##no NM_001256339 in this data
full_results2 <- topTable(fit2, coef=2, number=Inf)
full_results3 <- topTable(fit2, coef=3, number=Inf)
full_results4 <- topTable(fit2, coef=4, number=Inf)
filter(full_results1, GB_ACC == "NM_006167")
filter(full_results2, GB_ACC == "NM_006167")
filter(full_results3, GB_ACC == "NM_006167")
filter(full_results4, GB_ACC == "NM_006167")
```
That’s all for the walk-through, thanks for reading, I hope you have learned something new here.
# Acknowlegdement
Many thanks to the following tutorials made publicly available:
1. Introduction to microarray analysis GSE15947, by Department of Statistics, Purdue Univrsity https://www.stat.purdue.edu/bigtap/online/docs/Introduction_to_Microarray_Analysis_GSE15947.html
2. Mark Dunning, 2020, GEO tutorial, by Sheffield Bioinformatics Core https://sbc.shef.ac.uk/geo_tutorial/tutorial.nb.html#Further_processing_and_visualisation_of_DE_results