---
title: "Data Analysis 1 - Data Frames"
output: html_document
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
rm(list=objects()) # start with a clean workspace
source("knitr_setup.R")
library(dplyr)
```
This is the first lesson in a sequence on data analysis in R. Before reading this lesson, check out the ["Prelude to Data Analysis"](dataAnalysisPrelude.html).
> ### Learning Objectives
>
> * Describe what a data frame is.
> * Create data frames.
> * Use indexing to subset and modify specific portions of data frames.
> * Load external data from a .csv file into a data frame.
> * Summarize the contents of a data frame.
>
> ### Suggested Readings
>
> * ["Introduction to Data frames in R"](https://www.datacamp.com/community/tutorials/intro-data-frame-r) Data Camp article, by Ryan Sheehy
> * [Chapter 10](https://r4ds.had.co.nz/tibbles.html) of "R for Data Science", by Garrett Grolemund and Hadley Wickham
> * [Chapters 5.8 - 5.11](https://rstudio-education.github.io/hopr/r-objects.html#data-frames) of "Hands-On Programming with R", by Garrett Grolemund
---
# The data frame
## What are data frames?
Data frames are the _de facto_ data structure for most tabular data in R. A data frame can be created by hand, but most commonly it is generated by reading in a data file (typically a `.csv` file).
A data frame is the representation of data in the format of a table where the columns are **vectors** of the same length. Because columns are vectors, each column must contain a single type of data (e.g., numeric, character, integer, logical). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector:
![](./images/data-frame.png)
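To make the "columns are same-length vectors" idea concrete, here is a minimal sketch using a made-up `pets` data frame (the names and values are hypothetical, chosen only for illustration). It checks that a data frame is stored as a list of columns, and that R refuses to build one from vectors whose lengths don't match:

```{r}
# A small data frame built from three same-length vectors (hypothetical data)
pets <- data.frame(
  name   = c("Rex", "Tom", "Polly"),
  legs   = c(4, 4, 2),
  canFly = c(FALSE, FALSE, TRUE)
)
is.list(pets)        # A data frame is stored as a list of columns
sapply(pets, class)  # Each column holds exactly one type of data
lengths(pets)        # Every column has the same length

# Vectors of incompatible lengths (3 and 2) are rejected with an error
result <- tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) "error: columns have different lengths"
)
result
```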
## The `data.frame()` function
You can create a data frame using the `data.frame()` function. Here is an example using the members of the Beatles:
```{r}
beatles <- data.frame(
firstName = c("John", "Paul", "Ringo", "George"),
lastName = c("Lennon", "McCartney", "Starr", "Harrison"),
instrument = c("guitar", "bass", "drums", "guitar"),
yearOfBirth = c(1940, 1942, 1940, 1943),
deceased = c(TRUE, FALSE, FALSE, TRUE)
)
beatles
```
Notice how the data frame is created - you just hand the `data.frame()` function a bunch of vectors! This should hopefully help make it clear that a data frame is indeed a series of same-length vectors structured side-by-side.
## The `tibble()` function
The [tibble](https://r4ds.had.co.nz/tibbles.html) is an improved version of the Base R data frame. It is defined in the **tibble** package, but it also becomes available when you load the **dplyr** library (which we'll get into [next lesson](L11-da2-data-wrangling.html)). If you haven't already, go ahead and install and load the **dplyr** library now:
```{r, eval=FALSE}
install.packages('dplyr')
library(dplyr)
```
A tibble works just like a data frame, but it has a few small features that make it a bit more useful - to the extent that from here on, we will be using tibbles as our default data frame structure. With this in mind, I'll often use the term "data frame" to refer to both tibbles and data frames, since they serve the same purpose as a data structure.
You can create a tibble using the `tibble()` function, which takes a set of vectors just like `data.frame()` does. Here's the same example as before with the Beatles:
```{r}
beatles <- tibble(
firstName = c("John", "Paul", "Ringo", "George"),
lastName = c("Lennon", "McCartney", "Starr", "Harrison"),
instrument = c("guitar", "bass", "drums", "guitar"),
yearOfBirth = c(1940, 1942, 1940, 1943),
deceased = c(TRUE, FALSE, FALSE, TRUE)
)
beatles
```
Here we can see a couple of the differences that make tibbles a bit more intuitive to use:
1. It's easier to see what type of data each column is because tibbles display this in between the `<>` symbols under each column name.
2. A tibble will only print the first few rows of data when you enter the object name. In contrast, data frames will try to print _the entire_ data frame (which is super annoying when you have a data frame with millions of rows of data). Here, we only have 4 rows, so this difference is not apparent.
3. Columns of class `character` are never converted into factors (don't worry about this for now...just know that keeping strings as a `character` class generally makes life easier in R).
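Because a tibble is still a data frame under the hood, functions that expect a data frame should generally accept a tibble too. Here is a small sketch (the `band_df` and `band_tbl` names are just for illustration) that converts a Base R data frame into a tibble with `as_tibble()` and compares their classes:

```{r}
library(dplyr)

# A Base R data frame and its tibble equivalent
band_df  <- data.frame(firstName = c("John", "Paul"), yearOfBirth = c(1940, 1942))
band_tbl <- as_tibble(band_df)

class(band_df)            # Just "data.frame"
class(band_tbl)           # "tbl_df" "tbl" "data.frame"
is.data.frame(band_tbl)   # A tibble still counts as a data frame
```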
Now that we have a data frame (tibble) defined, let's see what we can do with it!
## Dimensions
You can get the dimensions of a data frame using the `ncol()`, `nrow()`, and `dim()` functions:
```{r}
nrow(beatles) # Returns the number of rows
ncol(beatles) # Returns the number of columns
dim(beatles) # Returns a vector of the number of rows and columns
```
## Row and column names
Data frames **must** have column names, but row names are optional (by default, row names are just a sequence of numbers). The `names()` function returns the column names, or you can also be more specific and use the `colnames()` and `rownames()` functions:
```{r}
names(beatles) # Returns a vector of the column names
colnames(beatles) # Also returns a vector of the column names
rownames(beatles) # Returns a vector of the row names
```
## Combining data frames
You can combine data frames using the `bind_cols()` and `bind_rows()` functions:
```{r}
# Combine columns
names <- tibble(
firstName = c("John", "Paul", "Ringo", "George"),
lastName = c("Lennon", "McCartney", "Starr", "Harrison")
)
instruments <- tibble(
instrument = c("guitar", "bass", "drums", "guitar")
)
bind_cols(names, instruments)
```
```{r}
# Combine rows
members1 <- tibble(
firstName = c("John", "Paul"),
lastName = c("Lennon", "McCartney")
)
members2 <- tibble(
firstName = c("Ringo", "George"),
lastName = c("Starr", "Harrison")
)
bind_rows(members1, members2)
```
Note that `bind_rows()` matches columns by name, so the column names must be the same in both data frames. For example, if we change the second column name in `members2` to `"LASTNAME"`, we'll get a data frame with three columns, two of which will have missing values:
```{r}
colnames(members2) <- c("firstName", "LASTNAME")
bind_rows(members1, members2)
```
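One way to fix a mismatch like this is to rename the offending column before binding. The sketch below assumes you want `lastName` as the common name and uses `rename()` from **dplyr**; the `groupA` / `groupB` objects are stand-ins I made up so that the example runs on its own:

```{r}
library(dplyr)

groupA <- tibble(
  firstName = c("John", "Paul"),
  lastName  = c("Lennon", "McCartney")
)
groupB <- tibble(
  firstName = c("Ringo", "George"),
  LASTNAME  = c("Starr", "Harrison")
)

# rename() uses the form new_name = old_name
groupB_fixed <- rename(groupB, lastName = LASTNAME)
combined <- bind_rows(groupA, groupB_fixed)
combined
```

Now the result has just two columns and no missing values.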
# Accessing elements
## Using the `$` operator
You can extract columns from a data frame by name by using the `$` operator plus the column name. For example, the `instrument` column can be accessed using `beatles$instrument`:
```{r}
beatles$instrument
```
## Using integer indices
You can access elements in a data frame using brackets `[]` and indices inside the brackets. The general form is:
```
DF[ROWS, COLUMNS]
```
To index with integers, specify the row numbers and column numbers as vectors.
```{r}
beatles[1, 2] # Select the element in row 1, column 2
beatles[c(1, 2), c(2, 3)] # Select the elements in rows 1 & 2 and columns 2 & 3
beatles[1:2, 2:3] # Same thing, but using the ":" operator
```
If you leave either the row or column index blank, it means "select all":
```{r}
beatles[c(1, 2),] # Leaving the column index blank will select all columns
beatles[,c(1, 2)] # Leaving the row index blank will select all rows
```
You can also use negative integers to specify rows or columns to be excluded:
```{r}
beatles[-1, ] # Select all rows except the first
```
## Using character indices
You can use the column names to select elements in a data frame. If you do not include a `,` to designate which rows to select, R will return all the rows for the selected columns:
```{r}
beatles[c('firstName', 'lastName')] # Select all rows for the "firstName" and "lastName" columns
beatles[1:2, c('firstName', 'lastName')] # Select just the first two rows for the "firstName" and "lastName" columns
```
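It's worth noticing that brackets and `$` give you back different kinds of objects. As a rough sketch (using a trimmed-down copy of the Beatles tibble called `band`, a name chosen just for this example), selecting a column with `[]` keeps the result as a tibble, while `$` pulls out the underlying vector:

```{r}
library(dplyr)

band <- tibble(
  firstName  = c("John", "Paul", "Ringo", "George"),
  instrument = c("guitar", "bass", "drums", "guitar")
)

byBracket <- band["instrument"]  # Still a tibble, with one column
byDollar  <- band$instrument     # A plain character vector

class(byBracket)
class(byDollar)
```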
## Using logical indices
When using a logical vector for indexing, the positions where the logical vector is `TRUE` are returned. This is helpful for filtering data frame rows based on conditions. For example, if you wanted to keep only the rows for Beatles members who are still alive, you could first create a logical vector using the `deceased` column:
```{r}
beatles$deceased == FALSE
```
Then, you could insert this logical vector in the row position of the `[]` brackets to filter only the rows that are `TRUE`:
```{r}
beatles[beatles$deceased == FALSE,]
```
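You can also combine more than one condition with the `&` ("and") or `|` ("or") operators. Here is a small sketch, again on a stand-alone copy of the data (the `members` name is only for this example), that keeps members who are both still alive _and_ born before 1942:

```{r}
library(dplyr)

members <- tibble(
  firstName   = c("John", "Paul", "Ringo", "George"),
  yearOfBirth = c(1940, 1942, 1940, 1943),
  deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

alive          <- members$deceased == FALSE
bornBefore1942 <- members$yearOfBirth < 1942

# Only rows where BOTH logical vectors are TRUE are kept
aliveAndOlder <- members[alive & bornBefore1942, ]
aliveAndOlder
```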
## Modifying data frames
You can use any of the above methods for accessing elements in a data frame to also modify those elements using the assignment operator (`<-`). In addition to using brackets to modify specific elements, you can use the `$` operator to create new columns in a data frame.
For example, let's create the variable `age` by subtracting the `yearOfBirth` variable from the current year:
```{r}
beatles$age <- 2019 - beatles$yearOfBirth
beatles
```
You can also make a new column of all the same value by just providing one value:
```{r}
beatles$hometown <- 'Liverpool'
beatles
```
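The examples above only add new columns, so here is a quick sketch of the other case: using brackets to overwrite specific elements. It works on a separate copy called `roster` (a name I made up so the `beatles` data frame isn't changed), and it also shows one possible way to avoid hard-coding the current year by asking `Sys.Date()` for it:

```{r}
library(dplyr)

roster <- tibble(
  firstName   = c("John", "Paul", "Ringo", "George"),
  instrument  = c("guitar", "bass", "drums", "guitar"),
  yearOfBirth = c(1940, 1942, 1940, 1943)
)

# Overwrite one element: the row is chosen by a logical index, the column by name
roster[roster$firstName == "Ringo", "instrument"] <- "percussion"

# Compute the years since birth using the current year instead of a fixed number
currentYear <- as.integer(format(Sys.Date(), "%Y"))
roster$yearsSinceBirth <- currentYear - roster$yearOfBirth
roster
```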
# Dealing with actual data
Now that we know what a data frame is, let's start working with actual data! We are going to use the `msleep` dataset, which contains data on sleep times and weights of different mammals. The data are taken from [V. M. Savage and G. B. West. "A quantitative, theoretical framework for understanding mammalian sleep." _Proceedings of the National Academy of Sciences_, 104 (3):1051-1056, 2007.](https://www.pnas.org/content/104/3/1051.long).
The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
| Column Name  | Description                            |
|--------------|----------------------------------------|
| name         | Common name                            |
| genus        | The taxonomic genus of animal          |
| vore         | Carnivore, omnivore or herbivore?      |
| order        | The taxonomic order of animal          |
| conservation | The conservation status of the animal  |
| sleep_total  | Total amount of sleep, in hours        |
| sleep_rem    | REM sleep, in hours                    |
| sleep_cycle  | Length of sleep cycle, in hours        |
| awake        | Amount of time spent awake, in hours   |
| brainwt      | Brain weight in kilograms              |
| bodywt       | Body weight in kilograms               |
## R Setup
Before we dig into the data, let's prepare our analysis environment by following these steps:
1) Create a new R Project called "data-analysis-tutorial" and save the folder somewhere on your computer (see the ["RStudio projects"](L1.2-getting-started.html#33_rstudio_projects.html) section from waaaaay back on week 1).
2) Create a new `.R` file (File > New File > R Script), and save it as "`data_frames.R`" inside your "data-analysis-tutorial" R Project folder. From here on, we'll type all code for this lesson inside this `data_frames.R` file.
3) Create another folder in your R Project folder called "data" - we'll put data in this folder real soon.
## Getting the data
### Method 1: Loading data from a package
Many R packages come with pre-loaded datasets. For example, the **ggplot2** library (which we'll [use soon](L12-da3-data-visualization.html) to make plots in R) comes with the `msleep` dataset already loaded. To see this, install **ggplot2** and load the library:
```{r, eval=FALSE, message=FALSE}
install.packages("ggplot2")
library(ggplot2)
head(msleep) # Preview just the first 6 rows of the data frame
```
```{r, echo=FALSE, message=FALSE}
library(ggplot2)
head(msleep)
```
If you want to see all of the different datasets that any particular package contains, you can call the `data()` function after loading a library. For example, here are all the datasets that are contained in the **ggplot2** library:
```{r, eval=FALSE}
data(package = "ggplot2")
```
```
Data sets in package 'ggplot2':
diamonds Prices of 50,000 round cut diamonds
economics US economic time series
economics_long US economic time series
faithfuld 2d density estimate of Old Faithful data
luv_colours 'colors()' in Luv space
midwest Midwest demographics
mpg Fuel economy data from 1999 and 2008 for 38
popular models of car
msleep An updated and expanded version of the mammals
sleep dataset
presidential Terms of 11 presidents from Eisenhower to Obama
seals Vector field of seal movements
txhousing Housing sales in TX
```
### Method 2: Importing data
What do you do when a dataset isn't available from a package? Well, you can "read" the data into R from an external file. One of the most common formats for storing tabular data (i.e. data that is stored as rows and columns) is the comma separated value (CSV) file.
To load the same `msleep` data from an external csv file, first use the `download.file()` function to download the file. The first argument in this function is a character string with the source URL to the data file ("https://github.com/emse6574-gwu/2019-Fall/raw/gh-pages/data/msleep.csv"). The second argument is the destination where you want to locally save the file on your computer.
```{r, eval=FALSE}
download.file(
url = "https://github.com/emse6574-gwu/2019-Fall/raw/gh-pages/data/msleep.csv",
destfile = file.path('data', 'msleep.csv')
)
```
> **Note on making file paths**: Notice the use of the `file.path()` function to generate the path to the "data" folder on your computer. This function joins the pieces of a path with a "`/`" separator, which R understands on every operating system (including Windows, where paths are usually written with "`\`"). Building paths this way means you never have to type the separators yourself, and the same code works on any computer. In the above example, the destination file path used was:
```{r}
file.path('data', 'msleep.csv')
```
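One thing to watch out for: `download.file()` will typically fail if the destination folder doesn't exist yet. If you skipped the step of creating the "data" folder, a check like the one sketched below can create it for you. To avoid touching your project, this example builds the folder inside R's temporary directory (`tempdir()`); the `demo_dir` and `demo_path` names are just placeholders:

```{r}
# Build a path to a "data" folder inside R's temporary directory
demo_dir <- file.path(tempdir(), "data")

# Create the folder only if it isn't already there
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir)
}

# Build the full path to where the csv file would be saved
demo_path <- file.path(demo_dir, "msleep.csv")
dir.exists(demo_dir)   # The folder now exists
basename(demo_path)    # The file name at the end of the path
```

In your own project you would swap `tempdir()` out and just use `'data'`, as in the `download.file()` call above.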
Now you are ready to load the downloaded data! The Base R function for reading in a csv file is called `read.csv()`, but it has some quirky aspects in how it formats the data (in particular, character variables). So instead we are going to use an improved function, `read_csv()`, from the `readr` package.
First, install the readr package if you haven't already:
```{r, eval=FALSE}
install.packages("readr")
```
Now load the data:
```{r}
library(readr)
msleep <- read_csv(file.path('data', 'msleep.csv'))
```
R tells us that we've successfully read in some data and prints a quick summary of the data type for each column in the dataset.
## Previewing the data
You can view the entire dataset in a tabular format (similar to how Excel looks) by using the `View()` function, which opens up another tab to view the data. Note that you cannot modify the data this way - you can just look at it:
```{r, eval = FALSE}
View(msleep)
```
In addition to viewing the whole dataset with `View()`, you can quickly view summaries of the data frame with a few convenient functions. For example, you can look at the first 6 rows by using the `head()` function:
```{r}
head(msleep)
```
Similarly, you can view the _last_ 6 rows by using the `tail()` function:
```{r}
tail(msleep)
```
You can also view an overview of each column and its data type by using the `str()` or `glimpse()` functions (both show similar information, but I prefer the output of `glimpse()`):
```{r}
glimpse(msleep)
```
Finally, you can view summary statistics for each column using the `summary()` function:
```{r}
summary(msleep)
```
In summary, here is a non-exhaustive list of functions to get a sense of the content/structure of a data frame:
* Size:
* `dim(df)` - returns a vector with the number of rows in the first element, and the number of columns as the second element (the **dim**ensions of the object).
* `nrow(df)` - returns the number of rows.
* `ncol(df)` - returns the number of columns.
* Content:
* `head(df)` - shows the first 6 rows.
* `tail(df)` - shows the last 6 rows.
* Names:
* `names(df)` - returns the column names (synonym of `colnames()` for `data.frame` objects).
* `rownames(df)` - returns the row names.
* Summary:
* `glimpse(df)` or `str(df)` - structure of the object and information about the class, length and content of each column.
* `summary(df)` - summary statistics for each column.
Note: most of these functions are "generic", meaning they can also be used on other types of objects besides `data.frame`.
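One more check that often comes in handy when summarizing a data frame is counting the missing values in each column. The `summary()` function reports `NA` counts for numeric columns, but a quick way to get them for every column is to combine `is.na()` with `colSums()`. Here is a sketch using a made-up `toy` data frame (hypothetical values):

```{r}
# A small data frame with some missing values (hypothetical data)
toy <- data.frame(
  name      = c("a", "b", "c"),
  sleep_rem = c(1.5, NA, 2.0),
  brainwt   = c(NA, NA, 0.02)
)

# is.na() marks each missing cell as TRUE; colSums() adds them up per column
missingCounts <- colSums(is.na(toy))
missingCounts
```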
---
# Now what?
Now that you've got some data into R and are up to speed with what a data frame / tibble is, you may be asking, "so what now?" Well, over the next two lessons we will learn more about how to manipulate data frames and explore the underlying information with visualizations.
But just to give you an idea of where we're going, here are a few pieces of information from the `msleep` dataset:
1) It appears that mammalian brain weight and body weight are strongly correlated on a log-log scale - cool!
```{r fig.height=4, fig.width=6, message=FALSE, warning=FALSE}
library(ggplot2)
ggplot(msleep, aes(x=brainwt, y=bodywt)) +
geom_point(alpha=0.6) +
stat_smooth(method='lm', col='red', se=F, size=0.7) +
scale_x_log10() +
scale_y_log10() +
labs(x='Brain weight in kg (log scale)', y='Body weight in kg (log scale)') +
theme_minimal()
```
2) It appears there may also be a negative relationship on a log-log scale (albeit weaker) between the size of mammalian brains and how much they sleep - cool!
```{r fig.height=4, fig.width=6, message=FALSE, warning=FALSE}
ggplot(msleep, aes(x=brainwt, y=sleep_total)) +
geom_point(alpha=0.6) +
scale_x_log10() +
scale_y_log10() +
stat_smooth(method='lm', col='red', se=F, size=0.7) +
labs(x='Brain weight in kg (log scale)', y='Total sleep time in hours (log scale)') +
theme_minimal()
```
3) Wow, there's a lot of variation in how much different mammals sleep - cool!
```{r fig.height=4, fig.width=6, message=FALSE, warning=FALSE}
ggplot(msleep, aes(x=sleep_total)) +
geom_histogram() +
labs(x = 'Total sleep time in hours',
title = 'Histogram of total sleep time') +
theme_minimal()
```
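If you want to put rough numbers on the first two observations, one option is to compute the correlation between the log-transformed variables. The sketch below assumes the copy of `msleep` that ships with **ggplot2** (stored here as `sleep_data`, a name chosen just for this example) and uses `use = "complete.obs"` to skip animals with a missing brain weight:

```{r}
library(ggplot2)

sleep_data <- ggplot2::msleep

# Correlation between log(brain weight) and log(body weight)
corBrainBody <- cor(
  log10(sleep_data$brainwt), log10(sleep_data$bodywt),
  use = "complete.obs"
)

# Correlation between log(brain weight) and log(total sleep time)
corBrainSleep <- cor(
  log10(sleep_data$brainwt), log10(sleep_data$sleep_total),
  use = "complete.obs"
)

corBrainBody   # Positive: bigger bodies go with bigger brains
corBrainSleep  # Negative: bigger brains go with less sleep
```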
---
**Page sources**:
Some content on this page has been modified from other courses, including:
- [Data Analysis and Visualization in R for Ecologists](https://datacarpentry.org/R-ecology-lesson/), by François Michonneau & Auriel Fournier. Zenodo: http://doi.org/10.5281/zenodo.3264888