-
Notifications
You must be signed in to change notification settings - Fork 2
/
data-visualization.qmd
413 lines (296 loc) · 15.8 KB
/
data-visualization.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
# Data Visualization {#sec-data-visualization}
```{r}
#| echo: false
#| message: false
source("_common.R")
```
> ### Learning Objectives {.unnumbered}
>
> * Create simple scatterplots and histograms with Base R graphics.
> * Learn the basic plotting features of the ggplot2 package.
> * Customize the aesthetics of an existing ggplot figure.
> * Create plots from data in a data frame.
> * Export plots from RStudio to standard graphical file formats.
>
> ### Suggested Readings {.unnumbered}
>
> * [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) of "R for Data Science", by Garrett Grolemund and Hadley Wickham
> * [Data Visualization: A practical introduction](http://socviz.co/), by Kieran Healy
> "The purpose of computing is insight, not numbers"
>
> \- [Richard Hamming](https://en.wikipedia.org/wiki/Richard_Hamming)
...and one of the best ways to develop insights from data is to _visualize_ the data. If you're completely new to data visualization, I recommend watching [this 40-minute video](https://www.youtube.com/watch?v=fSgEeI2Xpdc) on how humans see data, by John Rauser. This is one of the best overviews I've ever seen of how we can exploit our understanding of human psychology to design effective charts:
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/fSgEeI2Xpdc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</center>
## R Setup
Before we get started, let's set up our analysis environment like before:
1) Open up your "data_analysis_tutorial" R Project that you created in the first data analysis lesson - if you didn't do this, [go back and do it now](data-frames.html#r-setup).
2) Create a new `.R` file (File > New File > R Script), and save it as "`data_viz.R`" inside your "data_analysis_tutorial" R Project folder.
3) This time, instead of downloading the data file and saving it in our `data` folder, let's just read it in directly from the web!
```{r}
#| message: false
library(readr)
library(dplyr)
df <- read_csv("https://raw.githubusercontent.com/jhelvy/p4a/main/data/north_america_bear_killings.csv")
```
For this lesson, we are going to use the [North American Bear Killings](https://data.world/ajsanne/north-america-bear-killings) dataset, which was compiled by [Ali Sanne](https://data.world/ajsanne) from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_fatal_bear_attacks_in_North_America) on fatal bear attacks in North America. The dataset contains recorded killings by black, brown, or polar bears from 1900 to 2019 in North America. Each row in the dataset holds information for a single incident with the following columns:
Variable |Class |Description
--------------|-----------|----------------------------------------
name | character | Name of victim.
age | double | Age of victim.
gender | character | Gender of victim.
date | character | Date of incident.
month | double | Month of incident.
year | double | Year of incident.
wildOrCaptive | character | "Wild" or "Captive" bear.
location | character | Location of incident.
description | character | Short description of incident.
bearType | character | "Black", "Brown", or "Polar"
hunter | double | 1 if victim was a hunter, 0 otherwise.
grizzly | double | 1 if bear is a Grizzly, 0 otherwise.
hiker | double | 1 if victim was a hiker, 0 otherwise.
onlyOneKilled | double | 1 if only one victim was killed, 0 otherwise.
**Side node**: One thing I learned looking at this data is that all grizzly bears are brown bears, but [not all brown bears are grizzly bears](https://www.nps.gov/katm/learn/photosmultimedia/brown-bear-frequently-asked-questions.htm#2) (kind of like the squares and rectangles thing).
To confirm that we've correctly loaded the data frame, let's preview the data:
```{r}
glimpse(df)
```
Look's good - let's start making some plots!
## Basic plots in R
R has a number of built-in tools for basic graph types. We will only cover two here because they are so common and convenient: scatterplots and histograms.
### Scatterplots with `plot()`
A scatter plot provides a graphical view of the relationship between two variables. Typically these are used for "continuous" variables, like _time_, _age_, _money_, etc...things that are not categorical in nature (as opposed to "discrete" variables, like _nationality_). Here's a scatterplot of the age of the bear killing victims over time:
```{r}
#| label: 'scatter-basic'
#| fig.height: 5
#| fig.width: 6
plot(x = df$year, y = df$age)
```
The basic inputs to the `plot()` function are `x` and `y`, which must be vectors of the same length. You can customize many features (fonts, colors, axes, shape, titles, etc.) through [graphic options](http://www.statmethods.net/advgraphs/parameters.html). Here's the same plot with a few customizations:
```{r}
#| label: 'scatter-basic-pretty'
#| fig.height: 5
#| fig.width: 6
plot(x = df$year,
y = df$age,
col = 'darkblue', # "col" changes the point color
pch = 19, # "pch" changes the point shape
main = "Age of bear killing victims over time",
xlab = "Year",
ylab = "Age")
```
Looks like bear killings are becoming more frequent over time (hmm, why might that be?), though pretty evenly-distributed across age (I guess bears will kill you regardless of your age).
### Histograms with `hist()`
The [histogram](https://en.wikipedia.org/wiki/Histogram) is one of the most common ways to visualize the _distribution_ of a variable. The `hist()` function takes just one variable: `x`. Here's a histogram of the `month` variable:
```{r}
#| label: 'hist-basic'
#| fig.height: 5
#| fig.width: 6
hist(x = df$month)
```
As you might expect, most bear attacks occur during the summer months, when parks get more visitors. As with the `plot()` function, you can customize a lot of the histogram features. One common customization is to modify the number of "bins" in the histogram by changing the `breaks` argument. Here we'll fix the number of bins to `12` - one for each month:
```{r}
#| label: 'hist-basic-pretty'
#| fig.height: 5
#| fig.width: 6
hist(x = df$month,
breaks = 12,
col = 'darkred',
main = "Distribution of bear killings by month",
xlab = "Month",
ylab = "Count")
```
## Advanced figures with `ggplot2`
![](images/horst_monsters_ggplot2.png){ width=500 }
[Art by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]{.aside}
While Base R plot functions are useful for making simple, quick plots, many R users have adopted the `ggplot2` package as their primary tool for visualizing data.
### The Grammar of Graphics
The `ggplot2` library is built on the ["Grammar of Graphics"](https://www.springer.com/in/book/9780387245447) concept developed by [Leland Wilkinson](https://en.wikipedia.org/wiki/Leland_Wilkinson). A "grammar of graphics" (that's what the "gg" in "ggplot2" stands for) is a framework that uses _layers_ to describe and construct visualizations or graphics in a structured manner. Here's a visual summary of the concept:
<center>
![](images/making_a_ggplot.jpeg){ width=700 }
</center>
We will start using `ggplot2` by re-creating some of the above plots, but using ggplot functions to get a feel for the syntax. But first, install and load the library:
```{r}
#| message: false
library(ggplot2)
```
### A blank slate
The `ggplot()` function is used to initialize the basic graph structure, and then we add layers to it. The basic idea is that you specify different parts of the plot, and add them together using the `+` operator. We will start with a blank plot and will add layers as we go along:
```{r}
#| label: 'ggplot-blank'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df)
```
### Geoms and aesthetics
Geometric objects (called "geoms") are the shapes we put on a plot (e.g. points, bars, etc.). You can have an unlimited number of layers, but at a minimum a plot **must have at least one geom**. Examples include:
- `geom_point()` makes a scatter plot by adding a layer of points.
- `geom_line()` adds a layer of lines connecting data points.
- `geom_col()` adds bars for bar charts.
- `geom_histogram()` makes a histogram.
- `geom_boxplot()` adds boxes for boxplots.
Each type of geom usually has a **required set of aesthetics** to be set, and usually accepts only a subset of all aesthetics. Aesthetic mappings are set with the `aes()` function. Examples include:
- `x` and `y` (the position on the x and y axes)
- `color` ("outside" color, like the line around a bar)
- `fill` ("inside" color, like the color of the bar itself)
- `shape` (the type of point, like a dot, square, triangle, etc.)
- `linetype` (solid, dashed, dotted etc.)
- `size` (of geoms)
### Scatterplots with `geom_point()`
Now that we know what geoms and aesthetics are, let's put them to practice by making a scatterplot. To start, we will add the `geom_point()` geom and we'll set the position for the x- and y-axis inside the `aes()` function:
```{r}
#| label: 'ggplot-scatter'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df) +
geom_point(aes(x = year, y = age))
```
Notice how we've "added" the `geom_point()` layer to the previous blank slate. Also note that the names we used to define the `x` and `y` axes are column names in the data frame, `df`. These must be placed _inside_ the `aes()` function, which tells ggplot to look in `df` for those columns.
If I wanted to change the point color, I could add that inside the `geom_point()` layer:
```{r}
#| label: 'ggplot-scatter-blue'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df) +
geom_point(aes(x = year, y = age), color = "blue")
```
But I could also _map_ one of my variables to the point color by placing the `color` variable inside the `aes()` function:
```{r}
#| label: 'ggplot-scatter-color'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df) +
geom_point(aes(x = year, y = age, color = gender))
```
### Bar charts with `geom_col()`
I recommend using the `geom_col()` layer to create bar charts, which are great for comparing different numerical values across a categorical variable. One of the simplest things to show with bars is the _count_ of how many observations you have. You can compute this by using the `count()` function, and then use the resulting data frame to create bars of those counts:
```{r}
#| label: 'ggplot-bars'
#| fig.height: 4
#| fig.width: 6
# Compute the counts
monthCounts <- df %>%
count(month)
# Create the bar chart
ggplot(data = monthCounts) +
geom_col(aes(x = month, y = n))
```
Alternatively, you could use the `%>%` operator to pipe the results of a summary data frame directly into ggplot:
```{r}
#| label: 'ggplot-bars2'
#| fig.height: 4
#| fig.width: 6
df %>%
count(month) %>% # Compute the counts
ggplot() +
geom_col(aes(x = month, y = n)) # Create the bar chart
```
Just like how we mapped the point color to a variable in scatter plots, you can map the bar color to a variable with bar charts using the `fill` argument in the `aes()` call. For example, here's the same bar chart of the count of observations with the bar colors representing the type of bear.
```{r}
#| label: 'ggplot-bars-fill'
#| fig.height: 4
#| fig.width: 6
df %>%
count(month, bearType) %>% # Compute the counts for month and bear type
ggplot() +
geom_col(aes(x = month, y = n, fill = bearType)) # Change the bar color based on bear type
```
Hmm, looks like brown bears are the most frequent killers, though black bears are a close second.
You can plot variables other than the count. For example, here is a plot of the mean age of the victim in each year:
```{r}
#| label: 'ggplot-bars-summary'
#| fig.height: 4
#| fig.width: 7
df %>%
filter(!is.na(age)) %>%
group_by(year) %>%
summarise(meanAge = mean(age)) %>% # Compute the mean age in each year
ggplot() +
geom_col(aes(x = year, y = meanAge))
```
## Customizing your ggplot
There are lots of ways to tweak your ggplot to make it more aesthetically pleasing and easier for others to understand. We'll cover just two here: labels and themes.
### Labels
You can change the labels of your plot by adding the `labs()` layer:
```{r}
#| label: 'ggplot-scatter-labs'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df) +
geom_point(aes(x = year, y = age, color = gender)) +
labs(x = "Year",
y = "Age",
color = "Gender",
title = "Age of bear killing victims over time",
subtitle = "A subtitle",
caption = "Data source: Wikipedia")
```
The `labs()` layer enables you to modify the labels of any of the variables that you have mapped in your `aes()` call, as well as some other labels like the title, subtitle, and caption.
### Themes
Adding theme layers can change some global aspects of the plot, such as the background color, grid lines, legend appearance, etc. There are [many themes to choose from](https://ggplot2.tidyverse.org/reference/ggtheme.html), but using simple themes like `theme_bw()` or `theme_minimal()` often improves the plot from the default theme settings:
```{r}
#| label: 'ggplot-scatter-minimal'
#| fig.height: 4
#| fig.width: 6
ggplot(data = df) +
geom_point(aes(x = year, y = age)) +
theme_minimal()
```
There are LOTS of other themes from external packages as well. Some of my favorites are `theme_ipsum()` and `theme_ft_rc()` from the [**hrbrthemes**](https://github.com/hrbrmstr/hrbrthemes) package:
```{r}
#| eval: false
library(hrbrthemes)
ggplot(data = df) +
geom_point(aes(x = year, y = age)) +
theme_ipsum()
```
<center>
<img src="images/hrbrthemes_theme_ipsum.png" width=600>
</center>
```{r}
#| eval: false
library(hrbrthemes)
ggplot(data = df) +
geom_point(aes(x = year, y = age)) +
theme_ft_rc()
```
<center>
<img src="images/hrbrthemes_theme_ft_rc.png" width=600>
</center>
Want to make a plot look fancy like those in the Economist magazine? Try `theme_economist()` from the [ggthemes](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) library:
```{r}
#| label: 'ggplot-scatter-economist'
#| warning: false
#| fig.height: 4
#| fig.width: 6
library(ggthemes)
ggplot(data = df) +
geom_point(aes(x = year, y = age)) +
theme_economist()
```
## Saving figures
The first (and easiest) is to export directly from the RStudio 'Plots' panel, by clicking on `Export` when the image is plotted. This will give you the option of `.png` or `.pdf` and selecting the directory to which you wish to save it to. I strongly recommend you save images as `.pdf` types as these won't pixelate when you change the image size.
Another easy way to save a ggplot figure is to use the `ggsave()` function. First, create your plot and save it as an object:
```{r}
#| eval: false
scatterPlot <- ggplot(data = df) +
geom_point(aes(x = year, y = age))
```
Then save the plot using `ggsave()` (make sure you create a folder called "plots" in which to save your plot):
```{r}
#| eval: false
ggsave(filename = here('data', 'scatterPlot.pdf'),
plot = scatterPlot,
width = 6,
height = 4)
```
## Other resources
While the `ggplot2` library offers a wide variety of options for customizing your plots, remembering exactly how to do specific tasks (like changing the color of a line, or changing the position of a legend) can be difficult. Fortunately, there are wonderful resources for looking up all the tricks to make the perfect ggplot. Here are a few:
- [RStudio `ggplot2` Cheatsheet](https://posit.co/resources/cheatsheets/)
- [Tidyverse `ggplot2` reference guide](https://ggplot2.tidyverse.org/reference/)
- [R Cookbook for `ggplot2`](http://www.cookbook-r.com/Graphs/)
- [Top 50 `ggplot2` visualizations](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)
## Page sources {.unnumbered}
Some content on this page has been modified from other courses, including:
- [Data Analysis and Visualization in R _alpha_](https://datacarpentry.org/R-genomics/index.html), by Data Carpentry contributors.