-
Notifications
You must be signed in to change notification settings - Fork 10
/
r-tidy.Rmd
285 lines (196 loc) · 12.6 KB
/
r-tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
title: "Tidy Data and Advanced Data Manipulation"
---
```{r init, include=F}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
.ex <- 1
# library(ggplot2)
# theme_set(theme_bw(base_size=16) + theme(strip.background = element_blank()))
```
**Recommended reading prior to class**: Sections 1-3 of [Wickham, H. "Tidy Data." _Journal of Statistical Software_ 59:10 (2014)](http://www.jstatsoft.org/article/view/v059i10/v59i10.pdf).
## Review
### Prior classes
- [R basics](r-basics.html)
- [Data frames](r-dataframes.html)
- [Manipulating data with dplyr and `%>%`](r-dplyr-yeast.html)
### Data needed
Hit the ["Data"](data.html) link above or use the direct links below to download the following datasets, saving them in a `data` folder relative to your current working RStudio project:
- Heart rate data: [heartrate2dose.csv](data/heartrate2dose.csv)
- _Tidy_ yeast data: [brauer2007_tidy.csv](data/brauer2007_tidy.csv)
- _Original_ (untidy) yeast data: [brauer2007_messy.csv](data/brauer2007_messy.csv)
- Yeast systematic names to GO terms: [brauer2007_sysname2go.csv](data/brauer2007_sysname2go.csv)
## Tidy data
So far we've dealt exclusively with tidy data -- data that's easy to work with, manipulate, and visualize. That's because our dataset has two key properties:
1. Each _column_ is a _variable_.
2. Each _row_ is an _observation_.
You can read a lot more about tidy data [in this paper](http://www.jstatsoft.org/v59/i10/paper). Let's load some untidy data and see if we can see the difference. This is some made-up data for five different patients (Jon, Ann, Bill, Kate, and Joe) given three different drugs (A, B, and C), at two doses (10 and 20), and measuring their heart rate. Download the [heartrate2dose.csv](data/heartrate2dose.csv) file directly from [the data downloads page](data.html). Load **readr** and **dplyr**, and import and display the data.
```{r untidy}
library(readr)
library(dplyr)
hr <- read_csv("data/heartrate2dose.csv")
hr
```
Notice how with the yeast data each variable (symbol, nutrient, rate, expression, etc.) were each in their own column. In this heart rate data, we have four variables: name, drug, dose, and heart rate. _Name_ is in a column, but _drug_ is in the header row. Furthermore the drug and _dose_ are tied together in the same column, and the _heart rate_ is scattered around the entire table. If we wanted to do things like `filter` the dataset where `drug=="a"` or `dose==20` or `heartrate>=80` we couldn't do it because these variables aren't in columns.
## The **tidyr** package
The **tidyr** package helps with this. There are several functions in the tidyr package but the ones we're going to use are `separate()` and `gather()`. The `gather()` function takes multiple columns, and gathers them into key-value pairs: it makes "wide" data longer. The `separate()` function separates one column into multiple columns. So, what we need to do is _gather_ all the drug/dose data into a column with their corresponding heart rate, and then _separate_ that column into two separate columns for the drug and dose.
Before we get started, load the **tidyr** package, and look at the help pages for `?gather` and `?separate`. Notice how each of these functions takes a data frame as input and returns a data frame as output. Thus, we can pipe from one function to the next.
```{r tidydata}
library(tidyr)
```
### `gather()`
The help for `?gather` tells us that we first pass in a data frame (or omit the first argument, and pipe in the data with `%>%`). The next two arguments are the names of the key and value columns to create, and all the relevant arguments that come after that are the columns we want to _gather_ together. Here's one way to do it.
```{r gather1}
hr %>% gather(key=drugdose, value=hr, a_10, a_20, b_10, b_20, c_10, c_20)
```
But that gets cumbersome to type all those names. What if we had 100 drugs and 3 doses of each? There are two other ways of specifying which columns to gather. The help for `?gather` tells you how to do this:
> `...` Specification of columns to gather. Use bare variable names. Select all variables between x and z with x:z, exclude y with -y. For more options, see the `select` documentation.
So, we could accomplish the same thing by doing this:
```{r gather2}
hr %>% gather(key=drugdose, value=hr, a_10:c_20)
```
But what if we didn't know the drug names or doses, but we _did_ know that the only other column in there that we _don't_ want to gather is `name`?
```{r gather3}
hr %>% gather(key=drugdose, value=hr, -name)
```
### `separate()`
Finally, look at the help for `?separate`. We can pipe in data and omit the first argument. The second argument is the column to separate; the `into` argument is a _character vector_ of the new column names, and the `sep` argument is a character used to separate columns, or a number indicating the position to split at.
> **_Side note, and 60-second lesson on vectors_**: We can create arbitrary-length _vectors_, which are simply variables that contain an arbitrary number of values. To create a numeric vector, try this: `c(5, 42, 22908)`. That creates a three element vector. Try `c("cat", "dog")`.
```{r gathersep}
hr %>%
gather(key=drugdose, value=hr, -name) %>%
separate(drugdose, into=c("drug", "dose"), sep="_")
```
### `%>%` it all together
Let's put it all together with `gather %>% separate %>% filter %>% group_by %>% summarize`.
If we create a new data frame that's a tidy version of hr, we can do those kinds of manipulations we talked about before:
```{r hrtidy}
# Create a new data.frame
hrtidy <- hr %>%
gather(key=drugdose, value=hr, -name) %>%
separate(drugdose, into=c("drug", "dose"), sep="_")
# Optionally, view it
# View(hrtidy)
# filter
hrtidy %>% filter(drug=="a")
hrtidy %>% filter(dose==20)
hrtidy %>% filter(hr>=80)
# analyze
hrtidy %>%
filter(name!="joe") %>%
group_by(drug, dose) %>%
summarize(meanhr=mean(hr))
```
## Tidy the yeast data
Now, let's take a look at the yeast data again. The data we've been working with up to this point was already cleaned up to a good degree. All of our variables (symbol, nutrient, rate, expression, GO terms, etc.) were each in their own column. Make sure you have the necessary libraries loaded, and read in the tidy data once more into an object called `ydat`.
```{r}
# Load libraries
library(readr)
library(dplyr)
library(tidyr)
# Import data
ydat <- read_csv("data/brauer2007_tidy.csv")
# Optionally, View
# View(ydat)
# Or just display to the screen
ydat
```
But let's take a look to see what this data originally looked like.
```{r}
yorig <- read_csv("data/brauer2007_messy.csv")
# View(yorig)
yorig
```
There are several issues here.
1. **Multiple variables are stored in one column.** The `NAME` column contains lots of information, split up by `::`'s.
1. **Nutrient and rate variables are stuck in column headers.** That is, the column names contain the values of two variables: nutrient (G, N, P, S, L, U) and growth rate (0.05-0.3). Remember, with tidy data, **each _column_ is a _variable_ and each _row_ is an _observation_.** Here, we have not one observation per row, but 36 (6 nutrients $\times$ 6 rates)! There's no way we could filter this data by a certain nutrient, or try to calculate statistics between rate and expression.
1. **Expression values are scattered throughout the table.** Related to the problem above, and just like our heart rate example, `expression` isn't a single-column variable as in the cleaned tidy data, but it's scattered around these 36 columns.
1. **Other important information is in a separate table.** We're missing all the gene ontology information we had in the tidy data (no information about biological process (`bp`) or molecular function (`mf`)).
Let's tackle these issues one at a time, all on a `%>%` pipeline.
### `separate()` the `NAME`
Let's `separate()` the `NAME` column `into` multiple different variables. The first row looks like this:
> `SFB2::YNL049C::1082129`
That is, it looks like we've got the gene symbol, the systematic name, and some other number (that isn't discussed in the paper). Let's `separate()`!
```{r}
yorig %>%
separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::")
```
Now, let's `select()` out the stuff we don't want.
```{r}
yorig %>%
separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>%
select(-GID, -YORF, -somenumber, -GWEIGHT)
```
### `gather()` the data
Let's gather the data from wide to long format so we get nutrient/rate (key) and expression (value) in their own columns.
```{r}
yorig %>%
separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>%
select(-GID, -YORF, -somenumber, -GWEIGHT) %>%
gather(key=nutrientrate, value=expression, G0.05:U0.3)
```
And while we're at it, let's `separate()` that newly created key column. Take a look at the help for `?separate` again. The `sep` argument could be a delimiter or a number position to split at. Let's split after the first character. While we're at it, let's hold onto this intermediate data frame before we add gene ontology information. Call it `ynogo`.
```{r}
ynogo <- yorig %>%
separate(NAME, into=c("symbol", "systematic_name", "somenumber"), sep="::") %>%
select(-GID, -YORF, -somenumber, -GWEIGHT) %>%
gather(key=nutrientrate, value=expression, G0.05:U0.3) %>%
separate(nutrientrate, into=c("nutrient", "rate"), sep=1)
```
### `inner_join()` to GO
It's rare that a data analysis involves only a single table of data. You normally have many tables that contribute to an analysis, and you need flexible tools to combine them. The **dplyr** package has several tools that let you work with multiple tables at once. Do a [Google image search for "SQL Joins"](https://www.google.com/search?q=SQL+join&tbm=isch), and look at [RStudio's Data Wrangling Cheat Sheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) to learn more.
First, let's import the dataset that links the systematic name to gene ontology information. It's the [brauer2007_sysname2go.csv](data/brauer2007_sysname2go.csv) file available at [the data downloads page](data.html). Let's call the imported data frame `sn2go`.
```{r}
# Import the data
sn2go <- read_csv("data/brauer2007_sysname2go.csv")
# Take a look
# View(sn2go)
head(sn2go)
```
Now, look up some help for `?inner_join`. Inner join will return a table with all rows from the first table where there are matching rows in the second table, and returns all columns from both tables. Let's give this a try.
```{r innerjoin}
yjoined <- inner_join(ynogo, sn2go, by="systematic_name")
# View(yjoined)
yjoined
# The glimpse function makes it possible to see a little bit of everything in your data.
glimpse(yjoined)
```
There are many different kinds of two-table verbs/joins in dplyr. In this example, every systematic name in `ynogo` had a corresponding entry in `sn2go`, but if this weren't the case, those un-annotated genes would have been removed entirely by the `inner_join`. A `left_join` would have returned all the rows in `ynogo`, but would have filled in `bp` and `mf` with missing values (`NA`) when there was no corresponding entry. See also: `right_join`, `semi_join`, and `anti_join`.
### Finishing touches
We're almost there but we have an obvious discrepancy in the number of rows between `yjoined` and `ydat`.
```{r rowcheck}
nrow(yjoined)
nrow(ydat)
```
Before we can figure out what rows are different, we need to make sure all of the columns are the same class and do something more miscellaneous cleanup.
In particular:
1. Convert rate to a numeric column
2. Make sure `NA` values are coded properly
3. Create (and merge) values to convert "G" to "Glucose", "L" to "Leucine", etc.
4. Rename and reorder columns
The code below implements those operations on `yjoined`.
```{r misctidy}
nutrientlookup <-
tibble(nutrient = c("G", "L", "N", "P", "S", "U"), nutrientname = c("Glucose", "Leucine", "Ammonia","Phosphate", "Sulfate","Uracil"))
yjoined <-
yjoined %>%
mutate(rate = as.numeric(rate)) %>%
mutate(symbol = ifelse(symbol == "NA", NA, symbol)) %>%
left_join(nutrientlookup) %>%
select(-nutrient) %>%
select(symbol:systematic_name, nutrient = nutrientname, rate:mf)
```
Now we can determine what rows are different between `yjoined` and `ydat` using `anti_join`, which will return all of the rows that *do not* match.
```{r antijoin}
anti_join(yjoined, ydat)
```
Hmmmm ... so `yjoined` has some rows that have missing (`NA`) expression values. Let's try removing those and then comparing the data frame contents one more time.
```{r rowcheck2}
yjoined <-
yjoined %>%
filter(!is.na(expression))
nrow(yjoined)
nrow(ydat)
all.equal(ydat, yjoined)
```
Looks like that did it!