-
Notifications
You must be signed in to change notification settings - Fork 10
/
r-rnaseq-airway.Rmd
541 lines (361 loc) · 36.8 KB
/
r-rnaseq-airway.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
---
title: "Count-Based Differential Expression Analysis of RNA-seq Data"
---
```{r init, include=F}
library(knitr)
# opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
opts_chunk$set(message=FALSE, warning=FALSE, eval=FALSE, echo=TRUE, cache=FALSE)
.ex <- 1
library(ggplot2)
theme_set(theme_bw(base_size=16) + theme(strip.background = element_blank()))
```
```{r makeData, include=FALSE, eval=FALSE}
# # THIS CHUNK IS USED TO CREATE THE SIMPLE COUNT DATA. SEE KALLISTO FOLDER IN DEV FOR SCALED.
#
# # Create the count and metadata -------------------------------------------
#
# library(dplyr)
# library(readr)
# library(stringr)
# library(airway)
# data(airway)
#
# # Create metadata
# metadata <- colData(airway) %>%
# data.frame(id=rownames(.), ., stringsAsFactors=FALSE) %>%
# as_tibble() %>%
# transmute(id, dex, celltype=cell, geo_id=SampleName) %>%
# mutate(dex=dex %>%
# as.character %>%
# str_replace("untrt", "control") %>%
# str_replace("trt", "treated"))
# metadata
# metadata %>% write_csv("data/airway_metadata.csv")
#
# # Create countdata
# countdata <- assay(airway) %>%
# data.frame(ensgene=rownames(.), ., stringsAsFactors=FALSE) %>%
# as_tibble()
# countdata %>% write_csv("data/airway_rawcounts.csv")
#
# # Test import with DESeq2
# rm(list=ls(all=TRUE))
# DESeq2::DESeqDataSetFromMatrix(
# countData=read.csv("data/airway_rawcounts.csv", row.names=1),
# colData=read.csv("data/airway_metadata.csv", row.names=1),
# design=~dex
# )
```
This is an introduction to RNAseq analysis involving reading in quantitated gene expression data from an RNA-seq experiment, exploring the data using base R functions and then analysis with the DESeq2 package. This lesson assumes a [basic familiarity with R](r-basics.html), [data frames](r-dataframes.html), and [manipulating data with dplyr and `%>%`](r-dplyr-yeast.html). See also the [_Bioconductor_ heading on the setup page](setup.html#bioconductor) -- you'll need a few additional packages that are available through Bioconductor, not CRAN (the installation process is slightly different).
**Recommended reading prior to class:**
1. [Conesa et al. A survey of best practices for RNA-seq data analysis. _Genome Biology_ 17:13 (2016)](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8).
1. [Soneson et al. "Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences." _F1000Research_ 4 (2015)](https://f1000research.com/articles/4-1521/v2).
1. Abstract and introduction sections of [Himes et al. "RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells." _PLoS ONE_ 9.6 (2014): e99625](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099625).
1. Review the [_Introduction_ (10.1)](http://r4ds.had.co.nz/tibbles.html#introduction-4), [_Tibbles vs. data.frame_ (10.3)](http://r4ds.had.co.nz/tibbles.html#tibbles-vs.data.frame), and [_Interacting with Older Code_ (10.4)](http://r4ds.had.co.nz/tibbles.html#interacting-with-older-code) sections of the [**_R for Data Science_ book**](http://r4ds.had.co.nz/tibbles.html). We will initially be using the `read_*` functions from the [**readr** package](http://readr.tidyverse.org/). These functions load data into a _tibble_ instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a little easier. These sections explain the few key small differences between traditional data.frames and tibbles.
**Slides**: [click here](slides/r-rnaseq-airway.pdf).
## Review
### Prerequsite skills
- [R basics](r-basics.html)
- [Data frames](r-dataframes.html)
- [Manipulating data with dplyr and `%>%`](r-dplyr-yeast.html)
- [Tidy data & advanced manipulation](r-tidy.html)
- [Data Visualization with ggplot2](r-viz-gapminder.html)
### Data needed
- Length-scaled count matrix (i.e., `countData`): [airway_scaledcounts.csv](data/airway_scaledcounts.csv)
- Sample metadata (i.e., `colData`): [airway_metadata.csv](data/airway_metadata.csv)
- Gene Annotation data: [annotables_grch38.csv](data/annotables_grch38.csv)
## Background
### The biology
The data for this lesson comes from:
> Himes _et al_. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." _PLoS ONE_. [2014 Jun 13;9(6):e99625](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099625). PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665).
Glucocorticoids are potent inhibitors of inflammatory processes, and are widely used to treat asthma because of their anti-inflammatory effects on airway smooth muscle (ASM) cells. But what's the molecular mechanism? This study used RNA-seq to profile gene expression changes in four different ASM cell lines treated with dexamethasone, a synthetic glucocorticoid molecule. They found a number of differentially expressed genes comparing dexamethasone-treated ASM cells to control cells, but focus much of the discussion on a gene called CRISPLD2. This gene encodes a secreted protein known to be involved in lung development, and SNPs in this gene in previous GWAS studies are associated with inhaled corticosteroid resistance and bronchodilator response in asthma patients. They confirmed the upregulated CRISPLD2 mRNA expression with qPCR and increased protein expression using Western blotting.
They did their analysis using [Tophat and Cufflinks](http://rdcu.be/gk0S). We're taking a different approach and using an R package called [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html). [Click here](help.html#rna-seq-resources) to read more on DESeq2 and other approaches.
### Data pre-processing
Analyzing an RNAseq experiment begins with sequencing reads. There are many ways to begin analyzing this data, and you should check out the three papers below to get a sense of other analysis strategies. In the workflow we'll use here, sequencing reads were pseudoaligned to a reference transcriptome and the abundance of each transcript quantified using **kallisto** ([software](https://pachterlab.github.io/kallisto/about), [paper](http://www.nature.com/nbt/journal/v34/n5/full/nbt.3519.html)). Transcript-level abundance estimates were then summarized to the gene level to produce length-scaled counts using **txImport** ([software](https://bioconductor.org/packages/tximport), [paper](https://f1000research.com/articles/4-1521/v2)), suitable for using in count-based analysis tools like DESeq. This is the starting point - a "count matrix," where each cell indicates the number of reads mapping to a particular gene (in rows) for each sample (in columns). This is one of several potential workflows, and relies on having a well-annotated reference transcriptome. However, there are many well-established alternative analysis paths, and the goal here is to provide a reference point to acquire fundamental skills that will be applicable to other bioinformatics tools and workflows.
1. Conesa, A. et al. "A survey of best practices for RNA-seq data analysis." [_Genome Biology_ 17:13](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8) (2016).
1. Soneson, C., Love, M. I. & Robinson, M. D. "Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences." [_F1000Res._ 4:1521](https://f1000research.com/articles/4-1521/v2) (2016).
1. Griffith, Malachi, et al. "Informatics for RNA sequencing: a web resource for analysis on the cloud." [_PLoS Comput Biol_ 11.8: e1004393](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393) (2015).
This data was downloaded from GEO ([GSE:GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778)). You can read more about how the data was processed by going over the [slides](slides/r-rnaseq-airway.pdf). If you'd like to see the code used for the upstream pre-processing with kallisto and txImport, see [the code](https://github.com/bioconnector/workshops/blob/0785f9ffbab87f451f35d52bf48f2fa228e8327a/_dev/airway_kallisto/airway_kallisto_tximport.R).
### Data structure
We'll come back to this again later, but the data at our starting point looks like this (_note: this is a generic schematic - our genes are not actually `geneA` and `geneB`, and our samples aren't called `ctrl_1`, `ctrl_2`, etc._):
![](img/countdatacoldata.png)
That is, we have **two tables**:
1. The "count matrix" (called the `countData` in DESeq-speak) -- where _genes_ are in _rows_ and _samples_ are in _columns_, and the number in each cell is the number of reads that mapped to exons in that gene for that sample: **[airway_scaledcounts.csv](data/airway_scaledcounts.csv)**.
1. The sample metadata (called the `colData` in DESeq-speak) -- where _samples_ are in _rows_ and _metadata_ about those samples are in _columns_: **[airway_metadata.csv](data/airway_metadata.csv)**. It's called the `colData` because this table supplies metadata/information about the columns of the `countData` matrix. Notice that the first column of `colData` must match the column names of `countData` (except the first, which is the gene ID column).[^tidy]
[^tidy]: This only works when using the argument `tidy=TRUE` when creating the `DESeqDataSetFromMatrix()`.
## Import data
First, let's load the **readr**, **dplyr**, and **ggplot2** packages. Then let's import our data [with **readr**'s `read_csv()` function](r-dataframes.html) (_note_: not `read.csv()`). Let's read in the actual count data and the experimental metadata.
```{r libs_importdata}
library(readr)
library(dplyr)
library(ggplot2)
mycounts <- read_csv("data/airway_scaledcounts.csv")
metadata <- read_csv("data/airway_metadata.csv")
```
Now, take a look at each.
```{r viewImportedData}
mycounts
metadata
```
Notice something here. The sample IDs in the metadata sheet (SRR1039508, SRR1039509, etc.) exactly match the column names of the countdata, except for the first column, which contains the Ensembl gene ID. This is important, and we'll get more strict about it later on.
## Poor man's DGE
Let's look for differential gene expression. **_Note: this analysis is for demonstration only. NEVER do differential expression analysis this way!_**
Let's start with an exercise.
----
### Exercise `r .ex``r .ex=.ex+1`
If we look at our metadata, we see that the control samples are SRR1039508, SRR1039512, SRR1039516, and SRR1039520. This bit of code will take the `mycounts` data, `mutate()` it to add a column called `controlmean`, then `select()` only the gene name and this newly created column, and assigning the result to a new object called `meancounts`. (_Hint_: `mycounts %>% mutate(...) %>% select(...)`)
```{r exMakeMeancounts, echo=TRUE, results="markup"}
meancounts <- mycounts %>%
mutate(controlmean = (SRR1039508+SRR1039512+SRR1039516+SRR1039520)/4) %>%
select(ensgene, controlmean)
meancounts
```
1. Build off of this code, `mutate()` it once more (prior to the `select()`) function, to add another column called `treatedmean` that takes the mean of the expression values of the treated samples. Then `select()` only the `ensgene`, `controlmean` and `treatedmean` columns, assigning it to a new object called meancounts.
```{r exMakeMeancounts2, echo=FALSE, results="markup"}
meancounts <- mycounts %>%
mutate(controlmean=(SRR1039508+SRR1039512+SRR1039516+SRR1039520)/4) %>%
mutate(treatedmean=(SRR1039509+SRR1039513+SRR1039517+SRR1039521)/4) %>%
select(ensgene, controlmean, treatedmean)
meancounts
```
2. Directly comparing the raw counts is going to be problematic if we just happened to sequence one group at a higher depth than another. Later on we'll do this analysis properly, normalizing by sequencing depth per sample using a better approach. But for now, `summarize()` the data to show the `sum` of the mean counts across all genes for each group. Your answer should look like this:
```{r exSummarizeMeancounts, echo=FALSE, results="markup"}
meancounts %>% summarize(sum(controlmean), sum(treatedmean))
```
----
How about another?
----
### Exercise `r .ex``r .ex=.ex+1`
1. Create a scatter plot showing the mean of the treated samples against the mean of the control samples.
```{r simplescatter, echo=F}
ggplot(meancounts, aes(controlmean, treatedmean)) + geom_point()
```
2. Wait a sec. There are 60,000-some rows in this data, but I'm only seeing a few dozen dots at most outside of the big clump around the origin. Try plotting both axes on a log scale (_hint_: `... + scale_..._log10()`)
```{r simplescatterlog, echo=F}
ggplot(meancounts, aes(controlmean, treatedmean)) + geom_point() + scale_x_log10() + scale_y_log10()
```
----
We can find candidate differentially expressed genes by looking for genes with a large change between control and dex-treated samples. We usually look at the $log_2$ of the fold change, because this has better mathematical properties. On the absolute scale, upregulation goes from 1 to infinity, while downregulation is bounded by 0 and 1. On the log scale, upregulation goes from 0 to infinity, and downregulation goes from 0 to negative infinity. So, let's mutate our `meancounts` object to add a `log2foldchange` column. Optionally pipe this to `View()`.
```{r calcLogfc}
meancounts %>% mutate(log2fc=log2(treatedmean/controlmean))
```
There are a couple of "weird" results. Namely, the `NaN` ("not a number") and `-Inf` (negative infinity) results. The `NaN` is returned when you divide by zero and try to take the log. The `-Inf` is returned when you try to take the log of zero. It turns out that there are a lot of genes with zero expression. Let's filter our `meancounts` data, mutate it to add the $log_2(Fold Change)$, and when we're happy with what we see, let's _reassign_ the result of that operation _back to the meancounts object_. (_Note: this is destructive. If you're coding interactively like we're doing now, before you do this it's good practice to see what the result of the operation is prior to making the reassignment_.)
```{r filteredLogFC}
# Try running the code first, prior to reassigning.
meancounts <- meancounts %>%
filter(controlmean>0 & treatedmean>0) %>%
mutate(log2fc=log2(treatedmean/controlmean))
meancounts
```
A common threshold used for calling something differentially expressed is a $log_2(FoldChange)$ of greater than 2 or less than -2. Let's filter the dataset both ways to see how many genes are up or down-regulated.
```{r findFC2}
meancounts %>% filter(log2fc>2)
meancounts %>% filter(log2fc<(-2))
```
<!-- Fixme: turn this back on -->
<!-- In total, we've got `sum(abs(meancounts$log2fc)>2)` differentially expressed genes, in either direction. -->
----
### Exercise `r .ex``r .ex=.ex+1`
Look up help on `?inner_join` or Google around for help for using **dplyr**'s `inner_join()` to join two tables by a common column/key. You downloaded [annotables_grch38.csv](data/annotables_grch38.csv) from [the data downloads page](data.html). Load this data with `read_csv()` into an object called `anno`. Pipe it to `View()` or click on the object in the Environment pane to view the entire dataset. This table links the unambiguous Ensembl gene ID to things like the gene symbol, full gene name, location, Entrez gene ID, etc.
```{r importAnno, echo=TRUE, results="markup"}
anno <- read_csv("data/annotables_grch38.csv")
anno
```
1. Take our newly created `meancounts` object, and `arrange()` it `desc`ending by the absolute value (`abs()`) of the `log2fc` column. The first few rows should look like this:
```{r arrangeMeancounts, echo=FALSE, results="markup"}
meancounts %>% arrange(desc(abs(log2fc))) %>% head(3)
```
2. Continue on that pipeline, and `inner_join()` it to the `anno` data by the `ensgene` column. Either assign it to a temporary object or pipe the whole thing to `View` to take a look. What do you notice? Would you trust these results? Why or why not?
```{r meancountsArrJoin, echo=FALSE, results="markup"}
meancounts %>%
arrange(desc(abs(log2fc))) %>%
inner_join(anno, by="ensgene")
```
----
## DESeq2 analysis
### DESeq2 package
Let's do this the right way. DESeq2 is an R package for analyzing count-based NGS data like RNA-seq. It is available from [Bioconductor](http://www.bioconductor.org/). Bioconductor is a project to provide tools for analysing high-throughput genomic data including RNA-seq, ChIP-seq and arrays. You can explore Bioconductor packages [here](http://www.bioconductor.org/packages/release/BiocViews.html#___Software).
Bioconductor packages usually have great documentation in the form of _vignettes_. For a great example, take a look at the [DESeq2 vignette for analyzing count data](http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf). This 40+ page manual is packed full of examples on using DESeq2, importing data, fitting models, creating visualizations, references, etc.
Just like R packages from CRAN, you only need to install Bioconductor packages once ([instructions here](setup.html#bioconductor)), then load them every time you start a new R session.
```{r loadDESeq2_without_junk, echo=FALSE}
suppressPackageStartupMessages(suppressWarnings(library(DESeq2)))
```
```{r loadDESeq2, eval=FALSE}
library(DESeq2)
citation("DESeq2")
```
Take a second and read through all the stuff that flies by the screen when you load the DESeq2 package. When you first installed DESeq2 it may have taken a while, because DESeq2 _depends_ on a number of other R packages (S4Vectors, BiocGenerics, parallel, IRanges, etc.) Each of these, in turn, may depend on other packages. These are all loaded into your working environment when you load DESeq2. Also notice the lines that start with `The following objects are masked from 'package:...`. One example of this is the `rename()` function from the dplyr package. When the S4Vectors package was loaded, it loaded it's own function called `rename()`. Now, if you wanted to use dplyr's rename function, you'll have to call it explicitly using this kind of syntax: `dplyr::rename()`. [See this Q&A thread for more](http://stackoverflow.com/questions/4879377/r-masked-functions).
### Importing data
DESeq works on a particular type of object called a DESeqDataSet. The DESeqDataSet is a single object that contains input values, intermediate calculations like how things are normalized, and all results of a differential expression analysis. You can construct a DESeqDataSet from a count matrix, a metadata file, and a formula indicating the design of the experiment. See the help for `?DESeqDataSetFromMatrix`. If you read through the [DESeq2 vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf) you'll read about the structure of the data that you need to construct a DESeqDataSet object.
`DESeqDataSetFromMatrix` requires the count matrix (`countData` argument) to be a matrix or numeric data frame. either the row names or the first column of the `countData` must be the identifier you'll use for each gene. The column names of `countData` are the sample IDs, and they must match the row names of `colData` (or the first column when `tidy=TRUE`). `colData` is an additional dataframe describing sample metadata. Both `colData` and `countData` must be regular `data.frame` objects -- they can't have the special `tbl` class wrapper created when importing with `readr::read_*`.
![](img/countdatacoldata.png)
Let's look at our mycounts and metadata again.
```{r reExamineData}
mycounts
metadata
class(mycounts)
class(metadata)
```
Remember, we read in our count data and our metadata using `read_csv()` which read them in as those "special" dplyr data frames / tbls. We'll need to convert them back to regular data frames for them to work well with DESeq2.
```{r, results="hide"}
mycounts <- as.data.frame(mycounts)
metadata <- as.data.frame(metadata)
head(mycounts)
head(metadata)
class(mycounts)
class(metadata)
```
Let's check that the column names of our count data (except the first, which is `ensgene`) are the same as the IDs from our colData.
```{r}
names(mycounts)[-1]
metadata$id
names(mycounts)[-1]==metadata$id
all(names(mycounts)[-1]==metadata$id)
```
Now we can move on to constructing the actual DESeqDataSet object. The last thing we'll need to specify is a _design_ -- a _formula_ which expresses how the counts for each gene depend on the variables in colData. Take a look at `metadata` again. The thing we're interested in is the `dex` column, which tells us which samples are treated with dexamethasone versus which samples are untreated controls. We'll specify the design with a tilde, like this: `design=~dex`. (The tilde is the shifted key to the left of the number 1 key on my keyboard. It looks like a little squiggly line). So let's contruct the object and call it `dds`, short for our DESeqDataSet. If you get a warning about "some variables in design formula are characters, converting to factors" don't worry about it. Take a look at the `dds` object once you create it.
```{r constructDDS_phony, eval=FALSE}
dds <- DESeqDataSetFromMatrix(countData=mycounts,
colData=metadata,
design=~dex,
tidy=TRUE)
dds
```
```{r constructDDS_real, echo=FALSE}
dds <- suppressMessages(suppressWarnings(DESeqDataSetFromMatrix(countData=mycounts, colData=metadata, design=~dex, tidy=TRUE)))
dds
```
### DESeq pipeline
Next, let's run the DESeq pipeline on the dataset, and reassign the resulting object back to the same variable. Before we start, `dds` is a bare-bones DESeqDataSet. The `DESeq()` function takes a DESeqDataSet and returns a DESeqDataSet, but with lots of other information filled in (normalization, dispersion estimates, differential expression results, etc). Notice how if we try to access these objects before running the analysis, nothing exists.
```{r, error=TRUE}
sizeFactors(dds)
dispersions(dds)
results(dds)
```
Here, we're running the DESeq pipeline on the `dds` object, and reassigning the whole thing back to `dds`, which will now be a DESeqDataSet populated with all those values. Get some help on `?DESeq` (notice, no "2" on the end). This function calls a number of other functions within the package to essentially run the entire pipeline (normalizing by library size by estimating the "size factors," estimating dispersion for the negative binomial model, and fitting models and getting statistics for each gene for the design specified when you imported the data).
```{r deseq_pipeline}
dds <- DESeq(dds)
```
### Getting results
Since we've got a fairly simple design (single factor, two groups, treated versus control), we can get results out of the object simply by calling the `results()` function on the DESeqDataSet that has been run through the pipeline. The help page for `?results` and the vignette both have extensive documentation about how to pull out the results for more complicated models (multi-factor experiments, specific contrasts, interaction terms, time courses, etc.).
Note two things:
1. We're passing the `tidy=TRUE` argument, which tells DESeq2 to output the results table with rownames as a first column called 'row.' If we didn't do this, the gene names would be stuck in the row.names, and we'd have a hard time filtering or otherwise using that column.
1. This returns a regular old data frame. Try displaying it to the screen by just typing `res`. You'll see that it doesn't print as nicly as the data we read in with `read_csv`. We can add this "special" attribute to the raw data returned which just tells R to print it nicely.
```{r getResults}
res <- results(dds, tidy=TRUE)
res <- as_tibble(res)
res
```
Either click on the `res` object in the environment pane or pass it to `View()` to bring it up in a data viewer. Why do you think so many of the adjusted p-values are missing (`NA`)? Try looking at the `baseMean` column, which tells you the average overall expression of this gene, and how that relates to whether or not the p-value was missing. Go to the [DESeq2 vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf) and read the section about "Independent filtering and multiple testing."
> The goal of independent filtering is to filter out those tests from the procedure that have no, or little chance of showing significant evidence, without even looking at the statistical result. Genes with very low counts are not likely to see significant differences typically due to high dispersion. This results in increased detection power at the same experiment-wide type I error [_i.e., better FDRs_].
----
### Exercise `r .ex``r .ex=.ex+1`
1. Using a `%>%`, `arrange` the results by the adjusted p-value.
```{r, echo=F, results="markup"}
res %>% arrange(padj)
```
2. Continue piping to `inner_join()`, joining the results to the `anno` object. See the help for `?inner_join`, specifically the `by=` argument. You'll have to do something like `... %>% inner_join(anno, by=c("row"="ensgene"))`. Once you're happy with this result, reassign the result back to `res`. It'll look like this.
```{r, echo=F, results="markup"}
res <- res %>%
arrange(padj) %>%
inner_join(anno, by=c("row"="ensgene"))
head(as.data.frame(res))
```
3. How many are significant with an adjusted p-value <0.05? (Pipe to `filter()`).
```{r, echo=F, results="markup"}
res %>% filter(padj<0.05) %>% nrow()
```
----
Finally, let's write out the significant results. See the help for `?write_csv`, which is part of the **readr** package (note: this is _not_ the same as `write.csv` with a dot.). We can continue that pipe and write out the significant results to a file like so:
```{r, eval=F, echo=TRUE}
res %>%
filter(padj<0.05) %>%
write_csv("sigresults.csv")
```
You can open this file in Excel or any text editor (try it now).
## Data Visualization
### Plotting counts
DESeq2 offers a function called `plotCounts()` that takes a DESeqDataSet that has been run through the pipeline, the name of a gene, and the name of the variable in the `colData` that you're interested in, and plots those values. See the help for `?plotCounts`. Let's first see what the gene ID is for the CRISPLD2 gene using `res %>% filter(symbol=="CRISPLD2")`. Now, let's plot the counts, where our `intgroup`, or "interesting group" variable is the "dex" column.
```{r plotCounts1}
plotCounts(dds, gene="ENSG00000103196", intgroup="dex")
```
That's just okay. Keep looking at the help for `?plotCounts`. Notice that we could have actually returned the data instead of plotting. We could then pipe this to ggplot and make our own figure. Let's make a boxplot.
```{r plotCounts2}
# Return the data
plotCounts(dds, gene="ENSG00000103196", intgroup="dex", returnData=TRUE)
# Plot it
plotCounts(dds, gene="ENSG00000103196", intgroup="dex", returnData=TRUE) %>%
ggplot(aes(dex, count)) + geom_boxplot(aes(fill=dex)) + scale_y_log10() + ggtitle("CRISPLD2")
```
### MA & Volcano plots
Let's make some commonly produced visualizations from this data. First, let's mutate our results object to add a column called `sig` that evaluates to `TRUE` if padj<0.05, and FALSE if not, and NA if padj is also NA.
```{r}
# Create the new column
res <- res %>% mutate(sig=padj<0.05)
# How many of each?
res %>%
group_by(sig) %>%
summarize(n=n())
```
----
### Exercise `r .ex``r .ex=.ex+1`
Look up the Wikipedia articles on [MA plots](https://en.wikipedia.org/wiki/MA_plot) and [volcano plots](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)). An MA plot shows the average expression on the X-axis and the log fold change on the y-axis. A volcano plot shows the log fold change on the X-axis, and the $-log_{10}$ of the p-value on the Y-axis (the more significant the p-value, the larger the $-log_{10}$ of that value will be).
1. Make an MA plot. Use a $log_{10}$-scaled x-axis, color-code by whether the gene is significant, and give your plot a title. It should look like this. What's the deal with the gray points? Why are they missing? Go to the DESeq2 website on Bioconductor and look through the vignette for "Independent Filtering."
```{r maplot, echo=FALSE}
res %>%
filter(!is.na(log2FoldChange)) %>%
ggplot(aes(baseMean, log2FoldChange, col=sig)) +
geom_point() +
scale_x_log10() +
ggtitle("MA plot")
```
2. Make a volcano plot. Similarly, color-code by whether it's significant or not.
```{r volcanoplot, echo=FALSE}
res %>%
filter(!is.na(log2FoldChange) & !is.na(pvalue)) %>%
ggplot(aes(log2FoldChange, -1*log10(pvalue), col=sig)) +
geom_point() +
ggtitle("Volcano plot")
```
----
### Transformation
To test for differential expression we operate on raw counts. But for other downstream analyses like heatmaps, PCA, or clustering, we need to work with transformed versions of the data, because it's not clear how to best compute a distance metric on untransformed counts. The go-to choice might be a log transformation. But because many samples have a zero count (and $log(0)=-\infty$, you might try using pseudocounts, i. e. $y = log(n + 1)$ or more generally, $y = log(n + n_0)$, where $n$ represents the count values and $n_0$ is some positive constant.
But there are other approaches that offer better theoretical justification and a rational way of choosing the parameter equivalent to $n_0$, and they produce transformed data on the log scale that's normalized to library size. One is called a _variance stabilizing transformation_ (VST), and it also removes the dependence of the variance on the mean, particularly the high variance of the log counts when the mean is low.
```{r vst}
vsdata <- vst(dds, blind=FALSE)
```
### PCA
Let's do some exploratory plotting of the data using principal components analysis on the variance stabilized data from above. Let's use the DESeq2-provided `plotPCA` function. See the help for `?plotPCA` and notice that it also has a `returnData` option, just like `plotCounts`.
```{r plotPCA}
plotPCA(vsdata, intgroup="dex")
```
Principal Components Analysis (PCA) is a dimension reduction and visualization technique that is here used to project the multivariate data vector of each sample into a two-dimensional plot, such that the spatial arrangement of the points in the plot reflects the overall data (dis)similarity between the samples. In essence, principal component analysis distills all the global variation between samples down to a few variables called _principal components_. The majority of variation between the samples can be summarized by the first principal component, which is shown on the x-axis. The second principal component summarizes the residual variation that isn't explained by PC1. PC2 is shown on the y-axis. The percentage of the global variation explained by each principal component is given in the axis labels. In a two-condition scenario (e.g., mutant vs WT, or treated vs control), you might expect PC1 to separate the two experimental conditions, so for example, having all the controls on the left and all experimental samples on the right (or vice versa - the units and directionality isn't important). The secondary axis may separate other aspects of the design - cell line, time point, etc. Very often the experimental design is reflected in the PCA plot, and in this case, it is. But this kind of diagnostic can be useful for finding outliers, investigating batch effects, finding sample swaps, and other technical problems with the data. [This YouTube video](https://youtu.be/_UVHneBUBW0) from the Genetics Department at UNC gives a very accessible explanation of what PCA is all about in the context of a gene expression experiment, without the need for an advanced math background. Take a look.
### Bonus: Heatmaps
[Heatmaps are complicated, and are often poorly understood](http://www.opiniomics.org/you-probably-dont-understand-heatmaps/). It's a type of visualization used very often in high-throughput biology where data are clustered on rows and columns, and the actual data is displayed as tiles on a grid, where the values are mapped to some color spectrum. Our R useRs group MeetUp had a session on making heatmaps, which I summarized in [this blog post](http://www.gettinggeneticsdone.com/2015/04/r-user-group-recap-heatmaps-and-using.html). Take a look at [the code from that meetup](https://github.com/UVa-R-Users-Group/meetup/blob/master/2015-02-19-heat-maps/heatmaps.R), and the [documentation for the `aheatmap` function in the NMF package](http://nmf.r-forge.r-project.org/aheatmap.html) to see if you can re-create this image. Here, I'm clustering all samples using the top 25 most differentially regulated genes, labeling the rows with the gene symbol, and putting two annotation color bars across the top of the main heatmap panel showing treatment and cell line annotations from our metadata. Take a look at the [Rmarkdown source for this lesson](https://github.com/bioconnector/bims8382/blame/10667525b2798d4f79af616d936ecac0b58c9622/r-rnaseq-airway.Rmd#L406) for the code.
```{r NMFaheatmap, fig.height=12, echo=F}
NMF::aheatmap(assay(vsdata)[arrange(res, padj, pvalue)$row[1:25], ],
labRow=arrange(res, padj, pvalue)$symbol[1:25],
scale="row", distfun="pearson",
annCol=dplyr::select(metadata, dex, celltype),
col=c("green","black","black","red"),
width=6, height=8)
```
## Record `sessionInfo()`
The `sessionInfo()` prints version information about R and any attached packages. It's a good practice to always run this command at the end of your R session and record it for the sake of reproducibility in the future.
```{r sessionInfo}
sessionInfo()
```
## Pathway Analysis
_Pathway analysis_ or _gene set analysis_ means many different things, general approaches are nicely reviewed in: Khatri, et al. "Ten years of pathway analysis: current approaches and outstanding challenges." [_PLoS Comput Biol_ 8.2 (2012): e1002375](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375).
There are many freely available tools for pathway or over-representation analysis. Bioconductor alone has over [70 packages categorized under _gene set enrichment_](http://bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment) and [over 100 packages categorized under _pathways_](http://bioconductor.org/packages/release/BiocViews.html#___Pathways). I wrote [this tutorial](http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html) in 2015 showing how to use the GAGE (Generally Applicable Gene set Enrichment)[^gage] package to do KEGG pathway enrichment analysis on differential expression results.
[^gage]: Luo, W. et al., 2009. GAGE: generally applicable gene set enrichment for pathway analysis. [_BMC bioinformatics_, 10:161](https://www.ncbi.nlm.nih.gov/pubmed/19473525). Package: [bioconductor.org/packages/gage](http://bioconductor.org/packages/gage).
While there are many freely available tools to do this, and some are truly fantastic, many of them are poorly maintained or rarely updated. The [DAVID](https://david.ncifcrf.gov/) tool that a lot of folks use wasn't updated at all between Jan 2010 and Oct 2016.
UVA has a site license to [Ingenuity Pathway Analysis](https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis/). Statistically, IPA isn't doing anything revolutionary methodologically, but the real value comes in with its (1) ease of use, and (2) highly curated knowledgebase. You can get access to IPA through the Health Sciences Library [at this link](https://www.bioconnector.virginia.edu/content/ingenuity-pathway-analysis-0), and there are also links to UVA support resources for using IPA.
**[This summary report](img/r-rnaseq-airway-ipa-all.pdf)** is the first thing you would get out of IPA after running a core analysis on the results of this analysis. Open it up and take a look.
It shows, among other things, that the endothelial nitric-oxide synthase signaling pathway is highly over-represented among the most differentially expressed genes. For this, or any pathway you're interested in, IPA will give you a [report like this one for eNOS](img/r-rnaseq-airway-ipa-enos-report.pdf) with a very detailed description of the pathway, what kind of diseases it's involved in, which molecules are in the pathway, what drugs might perturb the pathway, and more. If you're logged into IPA, clicking any of the links will take you to IPA's knowledge base where you can learn more about the connection between that molecule, the pathway, and a disease, and further overlay any of your gene expression data on top of the pathway.
![](img/r-rnaseq-airway-ipa-enos-pathway.jpg)
The report also shows us some [upstream regulators](img/r-rnaseq-airway-ipa-upstream.csv), which serves as a great positive control that this stuff actually works, because it's inferring that dexamethasone might be an upstream regulator based on the target molecules that are dysregulated in our data.
You can also start to visualize networks in the context of biology and how your gene expression data looks in those molecules. Here's a network related to "Connective Tissue Disorders, Inflammatory Disease, Inflammatory Response" showing dysregulation of some of the genes in our data.
![](img/r-rnaseq-airway-ipa-inflam-network.jpg)
## Homework
Looking for some extra practice? [Try the homework](r-rnaseq-homework.html).
----