forked from bioinformatics-core-shared-training/basicr
-
Notifications
You must be signed in to change notification settings - Fork 40
/
Copy pathSession2.4-programming.Rmd
341 lines (248 loc) · 9.65 KB
/
Session2.4-programming.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
title: "Introduction to Solving Biological Problems Using R - Day 2"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
---
#4. Programming in R
## Motivation
From the previous exercise, you should see how we can easily adapt our markdown scripts:
- e.g. ESR1 versus GATA3
- But what if we want to analyse many genes?
- It would be tedious to create a new markdown document for every gene
- ...and prone to error too
##Introducing loops
- Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping**
- Loops are not used in R so often, because we can usually achieve the same thing using vector calculations
- For example, to add two vectors together, we do not need to add each pair of elements one by one, we can just add the vectors
```{r}
x <- 1:10
y <- 11:20
x+y
```
- But there are some situations where R functions can not take vectors as input. For example, `t.test()` will only test one gene at a time
- What if we wanted to test multiple genes?
For completeness, we can re-run the R code to import the data
```{r}
geneAnnotation <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
normalizedValues <- read.delim("gene.expression.txt")
```
- We could run the following code to perform t-tests on the first two genes
```{r eval=FALSE}
t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
t.test(as.numeric(normalizedValues[2,]) ~ factor(patientMetadata$er))
```
- But for many genes this will be boring to type, difficult to change, and prone to error
- As we are doing the same thing multiple times, but with a different index each time, we can use a **loop** instead
##Loops: Commands and flow control
- R has two basic types of loop
+ a **`for`** loop: run some code on every value in a vector
+ a **`while`** loop: run some code while some condition is true (*hardly ever used!*)
`for`
```{r eval=FALSE}
for(i in 1:10) {
print(i)
}
```
`while`
```{r eval=FALSE}
i <- 1
while(i <= 10 ) {
print(i)
i <- i + 1
}
```
- Here's how we might use a `for` loop to test the first 10 genes
```{r}
for(i in 1:10) {
t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
}
```
- This is *exactly* the same as:
```{r eval=FALSE}
i <- 1
t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
i <- 2
t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
i <- 3
###....etc....####
```
## Storing results
However, this for loop is doing the calculations but not storing the results
- The output of `t.test()` is an object with data placed in different slots
+ the `names()` of the object tells us what data we can retrieve, and what variable name to use
```{r}
testResult <- t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
names(testResult)
testResult$statistic
```
- When using a loop, we often create an empty "dummy" variable
- This is used store the results at each stage of the loop
```{r}
stats <- NULL
for(i in 1:10) {
testResult <- t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
stats[i] <- testResult$statistic
}
stats
```
## Practical application
Previously we have identified probes on chromosome 8
- Lets say that we want to do a t-test for each gene on chromosome 8
```{r}
chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
head(chr8Genes)
chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
head(chr8GenesOrd)
```
- The first step is to extract the expression values for chromosome 8 genes from our expression matrix, which has expression values for all genes
- We can use the `match` function to tell us which rows in the matrix correspond to chromosome 8 genes
```{r}
match(chr8GenesOrd$probe, rownames(normalizedValues))
chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
dim(chr8Expression)
```
We are now ready to write the for loop
## Exercise:
- Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
- Store the ***p-value*** from each individual test
- How many genes have a p-value < 0.05?
- N.B. Our code will be more robust if we store the number of chromosome 8 genes as a variable
+ if the data change, the code should still run
```{r}
chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
### Your Answer Here ###
```
##Conditional branching: Commands and flow control
- Use an `if` statement for any kind of condition testing
- Different outcomes can be selected based on a condition within brackets
```
if (condition) {
... do this ...
} else {
... do something else ...
}
```
- `condition` is any logical value, and can contain multiple conditions.
+ e.g. `(a == 2 & b < 5)`, this is a compound conditional argument
- The condition should return a *single* value of `TRUE` or `FALSE`
## Other conditional tests
- There are various tests that can check the type of data stored in a variable
+ these tend to be called **`is...()`**.
+ try *tab-complete* on `is.`
```{r}
is.numeric(10)
is.numeric("TEN")
is.character(10)
```
- `is.na()` is useful for seeing if an `NA` value is found
+ cannot use `== NA`!
- could be used to check if a gene symbol is found in the data before proceeding with statistical test
```{r}
match("foo", geneAnnotation$HUGO.gene.symbol)
is.na(match("foo", geneAnnotation$HUGO.gene.symbol))
```
- Using the **`for`** loop we wrote before, we could add some code to plot the expression of each gene
+ a boxplot would be ideal
- However, we might only want plots for genes with a "significant" pvalue
- Here's how we can use an `if` statement to test for this
+ for each iteration of the the loop:
1. test if the p-value from the test is below 0.05 or not
2. if the p-value is less than 0.05 make a boxplot
3. if not, do nothing
```{r}
pdf("Chromosome8Genes.pdf")
pvals <- NULL
for (i in 1:18) {
testResult <- t.test(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er))
pvals[i] <- testResult$p.value
if(testResult$p.value < 0.05){
boxplot(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er),
main=chr8Genes$HUGO.gene.symbol[i])
}
}
pvals
dev.off()
```
##Code formatting avoids bugs!
Compare:
```{r eval=FALSE}
f<-26
while(f!=0){
print(letters[f])
f<-f-1}
```
to:
```{r eval=FALSE}
f <- 26
while(f != 0 ){
print(letters[f])
f <- f-1
}
```
- The code between brackets `{}` *always* is *indented*, this clearly separates what is executed once, and what is run multiple times
- Trailing bracket `}` always alone on the line at the same indentation level as the initial bracket `{`
- Use white spaces to divide the horizontal space between units of your code, e.g. around assignments, comparisons
# Making a heatmap
- A heatmap is often used to visualise how the expression level of a set of genes vary between conditions
- Making the plot is actually quite straightforward
+ providing you have processed the data appropriately!
- Let's take a list of "most-variable genes"
+ see below for how we identified such genes
- `heatmap` requires a matrix object rather than a data frame
```{r}
genelist <- c("CLIC6","TFF3","PDZK1","SCUBE2","CYP2B6","HOXB13","NAT1","LY6D","SLC7A2")
probes <- geneAnnotation$probe[match(genelist, geneAnnotation$HUGO.gene.symbol)]
probes
exprows <- match(probes, rownames(normalizedValues))
heatmap(as.matrix(normalizedValues[exprows,]))
```
## Heatmap adjustments
- We can provide a colour legend for the samples
- Adjust colour of cells
- Label the rows according to gene name
- Don't print the sample names as they are too cluttered
```{r}
library(RColorBrewer)
sampcol <- rep("blue", ncol(normalizedValues))
sampcol[patientMetadata$er == 1 ] <- "yellow"
rbPal <- brewer.pal(10, "RdBu")
heatmap(as.matrix(normalizedValues[exprows,]),
ColSideColors = sampcol,
col=rbPal,
labRow = genelist,labCol="")
```
- see also
+ `heatmap.2` from `library(gplots)`; `example(heatmap.2)`
+ `heatmap.plus` from `library(heatmap.plus)`; `example(heatmap.plus)`
## (Supplementary) Choosing the genes for the heatmap
Often when using R you will come across a convenient shortcut function that can save you many hours of coding and frustration.
- the `genefilter` package in Bioconductor contains many useful methods for filtering genomic datasets
- you can install this package with the following commands
```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("genefilter")
```
- the `rowSds` function will calculate the standard deviation for each row in a numeric matrix
- the output will be vector, with each element being the standard deviation for a corresponding gene
```{r}
library(genefilter)
geneVar <- rowSds(normalizedValues)
geneVar[1]
sd(normalizedValues[1,])
```
- we can now `order` this matrix to get the subset with `[]` to get the indices of the most-variable genes (10 in this case).
- the same indices can be used to subset the gene annotation data frame
+ we can do this because the annotation data frame and expression matrix are in the same order
```{r}
topVar <- order(geneVar,decreasing = TRUE)[1:10]
topVar
geneAnnotation[topVar,]
```