-
Notifications
You must be signed in to change notification settings - Fork 494
/
93-appendixC.Rmd
executable file
·373 lines (263 loc) · 13.2 KB
/
93-appendixC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
# Tips and Tricks {#appendixC}
### Needed packages {-}
Let's load all the packages needed for this chapter (this assumes you've already installed them). Recall from our discussion in Section \@ref(tidyverse-package) that loading the `tidyverse` package by running `library(tidyverse)` loads the following commonly used data science packages all at once:
* `ggplot2` for data visualization
* `dplyr` for data wrangling
* `tidyr` for converting data to "tidy" format
* `readr` for importing spreadsheet data into R
* As well as the more advanced `purrr`, `tibble`, `stringr`, and `forcats` packages.
If needed, read Section \@ref(packages) for information on how to install and load R packages.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(scales)
library(janitor)
library(dygraphs)
library(nycflights13)
```
## Data wrangling
```{r setup, include=FALSE}
opts_chunk$set(
echo = TRUE, message = FALSE, warning = FALSE, eval = TRUE,
cache = TRUE,
fig.width = 16 / 2, fig.height = 9 / 2
)
# Why 76? https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
In this Section, we address some of the most common data wrangling questions we've encountered in student projects (shout out to [Dr. Jenny Smetzer](https://www.scsparkscience.org/fellow/jennifer-smetzer/) for her work setting this up!):
* \@ref(appendix-missing-values): Dealing with missing values
* \@ref(appendix-reordering-bars): Reordering bars in a barplot
* \@ref(appendix-money-on-axis): Showing money on an axis
* \@ref(appendix-changing-values): Changing values inside cells
* \@ref(appendix-convert-numerical-categorical): Converting a numerical variable to a categorical one
* \@ref(appendix-prop): Computing proportions
* \@ref(appendix-commas): Dealing with %, commas, and \$
![](images/data_ninja1.png){width=30%}
Let's load an example movies dataset, pare down the rows and columns a bit, and then show the first 10 rows using `slice()`.
```{r, eval=FALSE}
movies_ex <- read_csv("https://moderndive.com/data/movies.csv") %>%
filter(type %in% c("action", "comedy", "drama", "animated", "fantasy", "rom comedy")) %>%
select(-over200)
movies_ex %>%
slice(1:10)
```
```{r, echo=FALSE, purl=FALSE}
movies_ex <- read_csv("data/movies.csv") %>%
filter(type %in% c("action", "comedy", "drama", "animated", "fantasy", "rom comedy")) %>%
select(-over200)
movies_ex %>%
slice(1:10)
```
### Dealing with missing values {#appendix-missing-values}
You see the revenue in `million`s value for the movie "2 Fast 2 Furious" is `NA` (missing). So the following occurs when computing the median revenue:
```{r}
movies_ex %>%
summarize(mean_profit = median(millions))
```
You should always think about why a data value might be missing and what that missingness may mean. For example, imagine you are conducting a study on the effects of smoking on lung cancer and a lot of your patients' data is missing because they died of lung cancer. If you just "sweep these patients under the rug" and ignore them, you are clearly biasing the results.
While there are statistical methods to deal with missing data they are beyond the reach of this class. The easiest thing to do is to remove all missing cases, but **you should always at the very least report to the reader if you do so, as by removing the missing values you may be biasing your results.**
You can do this with a `na.rm = TRUE` argument like so:
```{r}
movies_ex %>%
summarize(mean_profit = median(millions, na.rm = TRUE))
```
If you decide you want to remove the row with the missing data, you can use the filter function like so:
```{r}
movies_no_missing <- movies_ex %>%
filter(!is.na(millions))
movies_no_missing %>%
slice(1:10)
```
We see "2 Fast 2 Furious" is now gone.
### Reordering bars in a barplot {#appendix-reordering-bars}
Let's compute the total revenue for each movie type and plot a barplot.
```{r}
revenue_by_type <- movies_ex %>%
group_by(type) %>%
summarize(total_revenue = sum(millions))
revenue_by_type
ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(x = "Movie genre", y = "Total box office revenue (in millions of $)")
```
Say we want to reorder the categorical variable `type` so that the bars show in a different order. We can reorder the bars by manually defining the order of the `levels` in the `factor()` command:
```{r}
type_levels <- c("rom comedy", "action", "drama", "animated", "comedy", "fantasy")
revenue_by_type <- revenue_by_type %>%
mutate(type = factor(type, levels = type_levels))
ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(x = "Movie genre", y = "Total boxoffice revenue (in millions of $)")
```
Or if you want to reorder `type` in ascending order of `total_revenue`, we use `reorder()`:
```{r}
revenue_by_type <- revenue_by_type %>%
mutate(type = reorder(type, total_revenue))
ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(
x = "Movie genre", y = "Total boxoffice revenue (in millions of $)"
)
```
Or if you want to reorder `type` in descending order of `total_revenue`, just put
a `-` sign in front of `-total_revenue` in `reorder()`:
```{r}
revenue_by_type <- revenue_by_type %>%
mutate(type = reorder(type, -total_revenue))
ggplot(revenue_by_type, aes(x = type, y = total_revenue)) +
geom_col() +
labs(
x = "Movie genre", y = "Total boxoffice revenue (in millions of $)"
)
```
For more advanced categorical variable (i.e. factor) manipulations, check out
the [`forcats` package](https://forcats.tidyverse.org/){target="_blank"}. Note: `forcats` is an anagram of `factors`.
![](https://github.com/tidyverse/forcats/blob/main/man/figures/logo.png?raw=true){width=20%}
### Showing money on an axis {#appendix-money-on-axis}
```{r fig.height=3, fig.width=5}
movies_ex <- movies_ex %>%
mutate(revenue = millions * 10^6)
ggplot(data = movies_ex, aes(x = rating, y = revenue)) +
geom_boxplot() +
labs(x = "rating", y = "Revenue in $", title = "Profits for different movie ratings")
```
Google "ggplot2 axis scale dollars" and click on the [first link](http://www.sthda.com/english/wiki/ggplot2-axis-scales-and-transformations) and search for the word "dollars". You'll find:
```{r fig.height=3, fig.width=5}
# Don't forget to load the scales package first!
library(scales)
ggplot(data = movies_ex, aes(x = rating, y = revenue)) +
geom_boxplot() +
labs(x = "rating", y = "Revenue in $", title = "Profits for different movie ratings") +
scale_y_continuous(labels = dollar)
```
### Changing values inside cells {#appendix-changing-values}
The `rename()` function in the `dplyr` package renames column/variable names. To "rename" values inside cells of a particular column, you need to `mutate()` the column using one of the three functions below. There might be other ones too, but these are the three we've seen the most. In these examples, we'll change values in the variable `type`.
1. `if_else()`
1. `recode()`
1. `case_when()`
#### `if_else()` {-}
Switch all instances of `rom comedy` with `romantic comedy` using `if_else()` from the `dplyr` package. If a particular row has `type == "rom comedy"`, then return `"romantic comedy"`, else return whatever was originally in `type`. Save everything in a new variable `type_new`:
```{r}
movies_ex %>%
mutate(type_new = if_else(type == "rom comedy", "romantic comedy", type)) %>%
slice(1:10)
```
Do the same here, but return `"not romantic comedy"` if `type` is not `"rom comedy"` and this time overwrite the original `type` variable:
```{r}
movies_ex %>%
mutate(type = if_else(type == "rom comedy", "romantic comedy", "not romantic comedy")) %>%
slice(1:10)
```
#### `recode()` {-}
`if_else()` is rather limited however. What if we want to "rename" all `type` so that they start with uppercase? Use `recode()`:
```{r}
movies_ex %>%
mutate(type_new = recode(type,
"action" = "Action",
"animated" = "Animated",
"comedy" = "Comedy",
"drama" = "Drama",
"fantasy" = "Fantasy",
"rom comedy" = "Romantic Comedy"
)) %>%
slice(1:10)
```
#### `case_when()` {-}
`case_when()` is a little trickier, but allows you to evaluate boolean operations using `==`, `>`, `>=`, `&`, `|`, etc:
```{r}
movies_ex %>%
mutate(
type_new =
case_when(
type == "action" & millions > 40 ~ "Big budget action",
type == "rom comedy" & millions < 40 ~ "Small budget romcom",
# Need this for everything else that aren't the two cases above:
TRUE ~ "Rest"
)
)
```
### Converting a numerical variable to a categorical one {#appendix-convert-numerical-categorical}
Sometimes we want to turn a numerical, continuous variable into a categorical variable. For instance, what if we wanted to have a variable that tells us if a movie made one hundred million dollars or more. That is to say, we can create a binary variable, which is the same thing as a categorical variable with 2 levels. We can again use the `mutate()` function:
```{r}
movies_ex %>%
mutate(big_budget = millions > 100) %>%
slice(1:10)
```
What if you want to convert a numerical variable into a categorical variable with more than 2 levels? One way is to use the `cut()` command. For instance, below, we `cut()` the `score` variable, to recode it into 4 categories:
1. 0 - 40 = bad
2. 40.1 - 60 = so-so
3. 60.1 - 80 = good
4. 80.1+ = great
We set the breaking points for cutting the numerical variable with the `c(0, 40, 60, 80, 100)` part, and set the labels for each of these bins with the `labels = c("bad", "so-so", "good", "great")` part. All this action happens inside the `mutate()` command, so the new categorical variable `score_categ` is added to the data frame.
```{r}
movies_ex %>%
mutate(score_categ = cut(score,
breaks = c(0, 40, 60, 80, 100),
labels = c("bad", "so-so", "good", "great")
)) %>%
slice(1:10)
```
Other options with the `cut` function:
* By default, if the value is exactly the upper bound of an interval, it's
included in the lessor category (e.g. 60.0 is 'so-so' not 'good'), to
flip this, include the argument `right = FALSE`.
* You could also have R equally divide the variable into a balanced
number of groups. For example, specifying `breaks = 3` would create 3 groups with
approximately the same number of values in each group.
### Computing proportions {#appendix-prop}
By using a `group_by()` followed not by a `summarize()` as is often the case, but rather a `mutate()`. So say we compute the total revenue millions for each movie rating and type:
```{r}
rating_by_type_millions <- movies_ex %>%
group_by(rating, type) %>%
summarize(millions = sum(millions)) %>%
arrange(rating, type)
rating_by_type_millions
```
Say within each movie rating (G, PG, PG-13, R), we want to know the proportion of `total_millions` that made by each movie type (animated, action, comedy, etc). We can:
```{r}
rating_by_type_millions %>%
group_by(rating) %>%
mutate(
# Compute a new column of the sum of millions split by rating:
total_millions = sum(millions),
# Compute the proportion within each rating:
prop = millions / total_millions
)
```
So for example, the 4 proportions corresponding to R rated movies are 0.596 + 0.142 + 0.213 + 0.0491 = 1.
### Dealing with %, commas, and \$ {#appendix-commas}
Say you have numerical data that are recorded as percentages, have commas, or are in dollar form and hence are character strings. How do you convert these to numerical values? Using the `parse_number()` function from the `readr` package inside a `mutate()`! Shout out to [Stack Overflow](https://stackoverflow.com/a/52965808/3674500){target="_blank"}.
```{r}
library(readr)
parse_number("10.5%")
parse_number("145,897")
parse_number("$1,234.5")
```
What about the other way around? Use the `scales` package!
```{r}
library(scales)
percent(0.105)
comma(145897)
dollar(1234.5)
```
***
**Congratulations. You are now an R Ninja!**
![](images/Rninja.png){width=40%}
***
## Interactive graphics
### Interactive linegraphs
Another useful tool for viewing linegraphs such as this is the `dygraph` function in the `dygraphs` package in combination with the `dyRangeSelector` function. This allows us to zoom in on a selected range and get an interactive plot for us to work with:
```{r warning=FALSE, out.width="100%"}
library(dygraphs)
library(nycflights13)
flights_day <- mutate(flights, date = as.Date(time_hour))
flights_summarized <- flights_day %>%
group_by(date) %>%
summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) %>%
as.data.frame()
rownames(flights_summarized) <- flights_summarized$date
flights_summarized <- select(flights_summarized, -date)
dyRangeSelector(dygraph(flights_summarized))
```
<br>
The syntax here is a little different than what we have covered so far. The `dygraph` function is expecting for the dates to be given as the `rownames` of the object. We convert our data to a data frame (since tibbles don't allow for easy row name manipulation), and then remove the `date` variable from the `flights_summarized` data frame since it is accounted for in the `rownames`. Lastly, we run the `dygraph` function on the new data frame that only contains the median arrival delay as a column and then provide the ability to have a selector to zoom in on the interactive plot via `dyRangeSelector`. (Note that this plot will only be interactive in the HTML version of this book.)