Session2.4-programming.Rmd

---
title: "Introduction to Solving Biological Problems Using R - Day 2"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
  Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
---

#4. Programming in R

## Motivation

From the previous exercise, you should see how we can easily adapt our markdown scripts:

- e.g. ESR1 versus GATA3
- But what if we want to analyse many genes?
- It would be tedious to create a new markdown document for every gene
- ...and prone to error too

##Introducing loops

- Many programming languages have ways of doing the same thing many times, perhaps changing some variable each time. This is called **looping**
- Loops are not used in R so often, because we can usually achieve the same thing using vector calculations
- For example, to add two vectors together, we do not need to add each pair of elements one by one, we can just add the vectors

```{r}
x <- 1:10
y <- 11:20
x+y
```

- But there are some situations where R functions can not take vectors as input. For example, `t.test()` will only test one gene at a time
- What if we wanted to test multiple genes?

For completeness, we can re-run the R code to import the data
```{r}
geneAnnotation    <- read.delim("gene.description.txt",stringsAsFactors = FALSE)
patientMetadata <- read.delim("cancer.patients.txt",stringsAsFactors = FALSE)
normalizedValues    <- read.delim("gene.expression.txt")
```


- We could run the following code to perform t-tests on the first two genes

```{r eval=FALSE}
t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
t.test(as.numeric(normalizedValues[2,]) ~ factor(patientMetadata$er))

```

- But for many genes this will be boring to type, difficult to change, and prone to error
- As we are doing the same thing multiple times, but with a different index each time, we can use a **loop** instead

##Loops: Commands and flow control
- R has two basic types of loop
    + a **`for`** loop: run some code on every value in a vector
    + a **`while`** loop: run some code while some condition is true (*hardly ever used!*)
    
`for` 
```{r eval=FALSE}
for(i in 1:10) {
  print(i)
  }

```

`while`

```{r eval=FALSE}
i <- 1
while(i <= 10 ) {
  print(i)
  i <- i + 1
  }
```


- Here's how we might use a `for` loop to test the first 10 genes


```{r}
for(i in 1:10) {
  
  t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
  
  }
```

- This is *exactly* the same as:

```{r eval=FALSE}
i <- 1
t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
i <- 2
t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
i <- 3
###....etc....####
```


## Storing results

However, this for loop is doing the calculations but not storing the results

- The output of `t.test()` is an object with data placed in different slots
    + the `names()` of the object tells us what data we can retrieve, and what variable name to use


```{r}
testResult <- t.test(as.numeric(normalizedValues[1,]) ~ factor(patientMetadata$er))
names(testResult)
testResult$statistic
```


- When using a loop, we often create an empty "dummy" variable
- This is used store the results at each stage of the loop

```{r}
stats <- NULL
for(i in 1:10) {
  testResult <- t.test(as.numeric(normalizedValues[i,]) ~ factor(patientMetadata$er))
  stats[i] <- testResult$statistic
  }
stats
```

## Practical application

Previously we have identified probes on chromosome 8

- Lets say that we want to do a t-test for each gene on chromosome 8
```{r}
chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
head(chr8Genes)
chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
head(chr8GenesOrd)
```

- The first step is to extract the expression values for chromosome 8 genes from our expression matrix, which has expression values for all genes
- We can use the `match` function to tell us which rows in the matrix correspond to chromosome 8 genes

```{r}
match(chr8GenesOrd$probe, rownames(normalizedValues))
chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
dim(chr8Expression)
```

We are now ready to write the for loop

## Exercise:

- Create a for loop to perform to test if the expression level of each gene on chromosome 8 is significantly different between ER positive and negative samples
- Store the ***p-value*** from each individual test
- How many genes have a p-value < 0.05?
- N.B. Our code will be more robust if we store the number of chromosome 8 genes as a variable
    + if the data change, the code should still run

```{r}
chr8Genes <- geneAnnotation[geneAnnotation$Chromosome=="chr8",]
chr8GenesOrd <- chr8Genes[order(chr8Genes$Start),]
chr8Expression <- normalizedValues[match(chr8GenesOrd$probe, rownames(normalizedValues)),]
### Your Answer Here ###

```


##Conditional branching: Commands and flow control

- Use an `if` statement for any kind of condition testing
- Different outcomes can be selected based on a condition within brackets

```
if (condition) {
  ... do this ...
  } else {
    ... do something else ...
    }
```

- `condition` is any logical value, and can contain multiple conditions. 
    + e.g. `(a == 2 & b < 5)`, this is a compound conditional argument
- The condition should return a *single* value of `TRUE` or `FALSE`
    
    
## Other conditional tests

- There are various tests that can check the type of data stored in a variable
    + these tend to be called **`is...()`**. 
        + try *tab-complete* on `is.`

```{r}
is.numeric(10)
is.numeric("TEN")
is.character(10)
```

- `is.na()` is useful for seeing if an `NA` value is found
    + cannot use `== NA`!
- could be used to check if a gene symbol is found in the data before proceeding with statistical test

```{r}
match("foo", geneAnnotation$HUGO.gene.symbol)
is.na(match("foo", geneAnnotation$HUGO.gene.symbol))
```


- Using the **`for`** loop we wrote before, we could add some code to plot the expression of each gene
    + a boxplot would be ideal
- However, we might only want plots for genes with a "significant" pvalue
- Here's how we can use an `if` statement to test for this
    + for each iteration of the the loop:
        1. test if the p-value from the test is below 0.05 or not
        2. if the p-value is less than 0.05 make a boxplot
        3. if not, do nothing
        
```{r}
pdf("Chromosome8Genes.pdf")
pvals <- NULL
for (i in 1:18) {
  testResult <- t.test(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er))
  pvals[i] <- testResult$p.value
  if(testResult$p.value < 0.05){
    boxplot(as.numeric(chr8Expression[i,]) ~ factor(patientMetadata$er),
            main=chr8Genes$HUGO.gene.symbol[i])
  }
} 
pvals
dev.off()

```


##Code formatting avoids bugs!
Compare:
```{r eval=FALSE}
f<-26
while(f!=0){
print(letters[f])
f<-f-1}
```
to:
```{r eval=FALSE}
f <- 26
while(f != 0 ){
  print(letters[f])
  f <- f-1
  }
```
- The code between brackets `{}` *always* is *indented*, this clearly separates what is executed once, and what is run multiple times
- Trailing bracket `}` always alone on the line at the same indentation level as the initial bracket `{`
- Use white spaces to divide the horizontal space between units of your code, e.g. around assignments, comparisons


# Making a heatmap

- A heatmap is often used to visualise how the expression level of a set of genes vary between conditions
- Making the plot is actually quite straightforward
    + providing you have processed the data appropriately!
- Let's take a list of "most-variable genes"
    + see below for how we identified such genes
- `heatmap` requires a matrix object rather than a data frame

```{r}
genelist <- c("CLIC6","TFF3","PDZK1","SCUBE2","CYP2B6","HOXB13","NAT1","LY6D","SLC7A2")
probes   <- geneAnnotation$probe[match(genelist, geneAnnotation$HUGO.gene.symbol)]
probes
exprows  <- match(probes, rownames(normalizedValues))

heatmap(as.matrix(normalizedValues[exprows,]))

```


## Heatmap adjustments

- We can provide a colour legend for the samples
- Adjust colour of cells
- Label the rows according to gene name
- Don't print the sample names as they are too cluttered

```{r}
library(RColorBrewer)

sampcol <- rep("blue", ncol(normalizedValues))

sampcol[patientMetadata$er == 1 ] <- "yellow"

rbPal <- brewer.pal(10, "RdBu")

heatmap(as.matrix(normalizedValues[exprows,]), 
        ColSideColors = sampcol, 
        col=rbPal,
        labRow = genelist,labCol="")
```

- see also
    + `heatmap.2` from `library(gplots)`; `example(heatmap.2)`
    + `heatmap.plus` from `library(heatmap.plus)`; `example(heatmap.plus)`
    
## (Supplementary) Choosing the genes for the heatmap
    
Often when using R you will come across a convenient shortcut function that can save you many hours of coding and frustration.

- the `genefilter` package in Bioconductor contains many useful methods for filtering genomic datasets
- you can install this package with the following commands

```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("genefilter")
```

- the `rowSds` function will calculate the standard deviation for each row in a numeric matrix
- the output will be vector, with each element being the standard deviation for a corresponding gene
    
```{r}
library(genefilter)
geneVar <- rowSds(normalizedValues)
geneVar[1]
sd(normalizedValues[1,])
```

- we can now `order` this matrix to get the subset with `[]` to get the indices of the most-variable genes (10 in this case).
- the same indices can be used to subset the gene annotation data frame
    + we can do this because the annotation data frame and expression matrix are in the same order

```{r}
topVar <- order(geneVar,decreasing = TRUE)[1:10]
topVar
geneAnnotation[topVar,]

```