-
Notifications
You must be signed in to change notification settings - Fork 0
/
wrangle_join.qmd
318 lines (213 loc) · 11.6 KB
/
wrangle_join.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
---
title: "Joining Data"
---
## Library Loading
Begin by loading any needed libraries.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `tidyverse` meta-package.
```{r r-start, warning = F, message = F}
# Load needed library
library(tidyverse)
```
## [{{< fa brands python >}} Python]{.py}
Load the `pandas` and `os` libraries.
```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::
## Combining DataFrames
Sometimes we collected related data and store them in separate files. This necessitates integrating the two datasets later on for statistics and/or visualization. If the two datasets that are sampled at very different frequencies (e.g., annual temperature values and daily insect counts), trying to include both in a single file results in duplicating the less granular data _many_ times. This is not ideal. Fortunately, scripted languages provide several methods for combining data easily and appropriately so that they can be used together despite being stored separately.
To illustrate some of these methods we'll load some new simulated data on lichen coverage to use instead of the vertebrate data we've used in past modules.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the vertebrate data first.
```{r r-data1}
# Load data
vert_r <- read.csv(file = file.path("data", "verts.csv"))
# Check out first few rows
head(vert_r, n = 2)
```
Then load the dataset with lichen community composition on trees.
```{r r-data2}
# Load data
lich <- read.csv(file = file.path("data", "tree_lichen.csv"))
# Check out rows rows
head(lich, n = 2)
```
And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.
```{r r-data3}
# Load data
road <- read.csv(file = file.path("data", "tree_road.csv"))
# Check out rows
head(road, n = 2)
```
## [{{< fa brands python >}} Python]{.py}
Load the vertebrate data first.
```{python py-data1}
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
# Check out first few rows
vert_py.head(2)
```
Then load the dataset with lichen community composition on trees.
```{python py-data2}
# Load data
lich = pd.read_csv(os.path.join("data", "tree_lichen.csv"))
# Check out rows
lich.head(2)
```
And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.
```{python py-data3}
# Load data
road = pd.read_csv(os.path.join("data", "tree_road.csv"))
# Check out rows
road.head(2)
```
:::
## Concatenating Data
The simplest way of combining data in either [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r} is called "concatenation". This involves--essentially--pasting rows or columns of separate data [variables]{.py}/[objects]{.r} together.
We'll need to modify our vertebrate data somewhat in order to demonstrate the two modes (horizontal or vertical) of concatenating [DataFrames]{.py}/[data.frames]{.r}.
Concatenating data (either horizontally or vertically) in this way is _deeply_ risky in that it can easily result in creating rows or columns that appear to relate to one another but in actuality do not. However, it is still worthwhile to cover how this can be done.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Split the first and last two rows of the vertebrate data into separate objects. Note that in each we'll want to retain all columns.
```{r r-concat-prep}
# Split the two datasets
vert_r_top <- vert_r[1:2, ]
vert_r_bottom <- vert_r[(nrow(vert_r) - 1):nrow(vert_r), ]
# Look at one
vert_r_bottom
```
## [{{< fa brands python >}} Python]{.py}
Split the first and last two rows of the vertebrate data into separate variables. Note that in each we'll want to retain all columns.
```{python py-concat-prep}
# Split the two datasets
vert_py_top = vert_py.iloc[[0, 1], :]
vert_py_bottom = vert_py.iloc[[-1, -2], :]
# Look at one
vert_py_bottom
```
:::
### Vertical Concatenation
Vertical concatenation (i.e., concatenating by stacking rows on top of each other) is one option for concatenation. This is much more common than horizontal concatenation.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} can perform this operation with the `rbind` function (from base [{{< fa brands r-project >}} R]{.r}) but the `bind_rows` function (from the `dplyr` package) is preferable because it checks for matching column names and--if necessary--reorders columns to match between the two data objects.
```{r r-rowbind}
# Combine vertically
vert_r_vertical <- dplyr::bind_rows(vert_r_top, vert_r_bottom)
# Check shape before and after to demonstrate it worked
message("There were two rows each before concatenation and ", nrow(vert_r_vertical), " after")
```
## [{{< fa brands python >}} Python]{.py}
Python does horizontal concatenation with the `concat` function from the `pandas` library. This function does horizontal and vertical concatenation and uses the `axis` argument to determine which is done. Setting `axis` to `0` performs vertical concatenation.
```{python py-rowbind}
# Combine vertically
vert_py_vertical = pd.concat([vert_py_top, vert_py_bottom], axis = 0)
# Check shape before and after to demonstrate it worked
print("There were two rows each before concatenation and", len(vert_py_vertical), "after")
```
:::
### Horizontal Concatenation
Horizontal concatenation (i.e., concatenating by putting columns next to one another) is the other option for concatenation. Note that it assumes row orders are consistent and _won't perform any check for improper row combinations_. Also, both languages will create duplicate column [labels]{.py}/[names]{.r} in our example because both data [variables]{.py}/[objects]{.r} have the same column [labels]{.py}/[names]{.r}.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} does horizontal concatenation with the `cbind` function from base [{{< fa brands r-project >}} R]{.r}.
```{r r-colbind}
# Combine horizontally
vert_r_horiz <- cbind(vert_r_top, vert_r_bottom)
# Check columns to show they were added
names(vert_r_horiz)
```
## [{{< fa brands python >}} Python]{.py}
Setting the `axis` argument to `1` is how the `concat` function is switched to horizontal concatenation.
```{python py-colbind}
# Combine horizontally
vert_py_horiz = pd.concat([vert_py_top, vert_py_bottom], axis = 1)
# Check columns to show they were added
vert_py_horiz.columns
```
:::
## Joins
While concatentation is simple and effective in some cases, a less risky mode of combining separate data objects is to use "joins". Joins allow two data [variables]{.py}/[objects]{.r} to be combined in an appropriate order based on a column found in both data [variables]{.py}/[objects]{.r}. This shared column (or column_s_) are known as "join keys".
This module makes use of simulated data on lichen coverage on various trees and the distance of those trees to the nearest road. Lichen wasn't surveyed on all of the trees and not all trees that _did_ have lichen measured had their distance to the nearest road recorded. Vertical concatenation is inappropriate because the column [labels]{.py}/[names]{.r} and horizontal concatenation is incorrect because we want to make sure we only combine data from the same tree across the two datasets.
There are several variants of joins that each function slightly differently so we'll discuss each below. Note that all <u>joins only combine two data [variables]{.py}/[objects]{.r} at a time</u> and they are always referred to as the "left" (first) and "right" (second).
### Left / Right Joins
Left joins combine two data [variables]{.py}/[objects]{.r} by their join key(s) but only keep rows that are found in the "left". A right join performs the same operation but prioritizes rows in the second data [variable]{.py}/[object]{.r}. Note that functionally you can make a left join into a right join by switching which data [variable]{.py}/[object]{.r} you assign to "left" versus "right".
<p align="center">
<img src="images/joins/left-join_table.png" width="75%"/>
</p>
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
The `left_join` function (from the `dplyr` package) performs--as its name suggests--left joins. The `right_join` function does the same for right joins. Both have an `x` and a `y` argument for the left and right data objects respectively.
The `by` argument expects a vector of join keys to by which to join. Note that if the join key column names differ--as is the case with our example data--you can specify that by using the syntax `"left key" = "right key"`.
```{r r-left}
# Do a left join
left_r <- dplyr::left_join(x = lich, y = road, by = c("tree" = "tree_name"))
# Check that out
head(left_r, n = 5)
```
See how "tree_B" has no distance to road listed? That is because it lacks a row in the right data object!
## [{{< fa brands python >}} Python]{.py}
All [{{< fa brands python >}} Python]{.py} joins are performed with the `merge` function (from the `pandas` library) and the type of join is specified by the `how` argument. Which data variable is left and right is specified by arguments of the same name.
The `on` argument can be used when the join key(s) labels match but otherwise a `left_on` and `right_on` argument are provided to specify the join key's label in the left and right data variable respectively.
```{python py-left}
# Do a left join
left_py = pd.merge(left = lich, right = road, how = "left",
left_on = "tree", right_on = "tree_name")
# Check that out
left_py.head(5)
```
See how "tree_B" has no distance to road listed? That is because it lacks a row in the right data variable!
:::
### Inner Joins
<p align="center">
<img src="images/joins/inner-join_table.png" width="75%"/>
</p>
An inner join keeps _only_ the rows that can be found in <u>both</u> join keys. This is very useful in situations where only "complete" data (i.e., no missing values) is required.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r}, we can do an inner join with `dplyr`'s `inner_join` function. It has the same arguments as the `left_join` function.
```{r r-inner}
# Inner join the two datasets
in_r <- dplyr::inner_join(x = lich, y = road, by = c("tree" = "tree_name"))
# Check that out
head(in_r, n = 5)
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py}, an inner join is actually the default behavior of the `merge` function. Therefore we can exclude the `how` argument entirely!
```{python py-inner}
# Inner join the two datasets
in_py = pd.merge(left = lich, right = road,
left_on = "tree", right_on = "tree_name")
# Check that out
in_py.head(5)
```
:::
### Full Joins
A full join (sometimes known as an "outer join" or a "full outer join") combines all rows in both left and right data [variables]{.py}/[objects]{.r} regardless of whether the join key entries are found in both.
<p align="center">
<img src="images/joins/outer-join_table.png" width="75%"/>
</p>
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} uses the `full_join` function (from `dplyr`) and the same arguments as the other types of join that we've already discussed.
```{r r-outer}
# Do a full join
full_r <- dplyr::full_join(x = lich, y = road, by = c("tree" = "tree_name"))
# Check that out
head(full_r, n = 5)
```
## [{{< fa brands python >}} Python]{.py}
[{{< fa brands python >}} Python]{.py} refers to full joins as "outer" joins so the `how` argument of the `merge` function needs to be set to "outer".
```{python py-outer}
# Do a full join
full_py = pd.merge(left = lich, right = road, how = "outer",
left_on = "tree", right_on = "tree_name")
# Check that out
full_py.head(5)
```
:::