wrangle_join.qmd

---
title: "Joining Data"
---

## Library Loading

Begin by loading any needed libraries.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Load the `tidyverse` meta-package.

```{r r-start, warning = F, message = F}
# Load needed library
library(tidyverse)
```

## [{{< fa brands python >}} Python]{.py}

Load the `pandas` and `os` libraries.

```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::

## Combining DataFrames

Sometimes we collected related data and store them in separate files. This necessitates integrating the two datasets later on for statistics and/or visualization. If the two datasets that are sampled at very different frequencies (e.g., annual temperature values and daily insect counts), trying to include both in a single file results in duplicating the less granular data _many_ times. This is not ideal. Fortunately, scripted languages provide several methods for combining data easily and appropriately so that they can be used together despite being stored separately.

To illustrate some of these methods we'll load some new simulated data on lichen coverage to use instead of the vertebrate data we've used in past modules.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Load the vertebrate data first.

```{r r-data1}
# Load data
vert_r <- read.csv(file = file.path("data", "verts.csv"))

# Check out first few rows
head(vert_r, n = 2)
```

Then load the dataset with lichen community composition on trees.

```{r r-data2}
# Load data
lich <- read.csv(file = file.path("data", "tree_lichen.csv"))

# Check out rows rows
head(lich, n = 2)
```

And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.

```{r r-data3}
# Load data
road <- read.csv(file = file.path("data", "tree_road.csv"))

# Check out rows
head(road, n = 2)
```

## [{{< fa brands python >}} Python]{.py}

Load the vertebrate data first.

```{python py-data1}
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))

# Check out first few rows
vert_py.head(2)
```

Then load the dataset with lichen community composition on trees.

```{python py-data2}
# Load data
lich = pd.read_csv(os.path.join("data", "tree_lichen.csv"))

# Check out rows
lich.head(2)
```

And finally load the data that includes distance from the nearest road for some of the same trees for which we have lichen data.

```{python py-data3}
# Load data
road = pd.read_csv(os.path.join("data", "tree_road.csv"))

# Check out rows
road.head(2)
```

:::

## Concatenating Data

The simplest way of combining data in either [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r} is called "concatenation". This involves--essentially--pasting rows or columns of separate data [variables]{.py}/[objects]{.r} together.

We'll need to modify our vertebrate data somewhat in order to demonstrate the two modes (horizontal or vertical) of concatenating [DataFrames]{.py}/[data.frames]{.r}.

Concatenating data (either horizontally or vertically) in this way is _deeply_ risky in that it can easily result in creating rows or columns that appear to relate to one another but in actuality do not. However, it is still worthwhile to cover how this can be done.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Split the first and last two rows of the vertebrate data into separate objects. Note that in each we'll want to retain all columns.

```{r r-concat-prep}
# Split the two datasets
vert_r_top <- vert_r[1:2, ]
vert_r_bottom <- vert_r[(nrow(vert_r) - 1):nrow(vert_r), ]

# Look at one
vert_r_bottom
```

## [{{< fa brands python >}} Python]{.py}

Split the first and last two rows of the vertebrate data into separate variables. Note that in each we'll want to retain all columns.

```{python py-concat-prep}
# Split the two datasets
vert_py_top = vert_py.iloc[[0, 1], :]
vert_py_bottom = vert_py.iloc[[-1, -2], :]

# Look at one
vert_py_bottom
```
:::

### Vertical Concatenation

Vertical concatenation (i.e., concatenating by stacking rows on top of each other) is one option for concatenation. This is much more common than horizontal concatenation.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} can perform this operation with the `rbind` function (from base [{{< fa brands r-project >}} R]{.r}) but the `bind_rows` function (from the `dplyr` package) is preferable because it checks for matching column names and--if necessary--reorders columns to match between the two data objects.

```{r r-rowbind}
# Combine vertically
vert_r_vertical <- dplyr::bind_rows(vert_r_top, vert_r_bottom)

# Check shape before and after to demonstrate it worked
message("There were two rows each before concatenation and ", nrow(vert_r_vertical), " after")
```

## [{{< fa brands python >}} Python]{.py}

Python does horizontal concatenation with the `concat` function from the `pandas` library. This function does horizontal and vertical concatenation and uses the `axis` argument to determine which is done. Setting `axis` to `0` performs vertical concatenation.

```{python py-rowbind}
# Combine vertically
vert_py_vertical = pd.concat([vert_py_top, vert_py_bottom], axis = 0)

# Check shape before and after to demonstrate it worked
print("There were two rows each before concatenation and", len(vert_py_vertical), "after")
```
:::

### Horizontal Concatenation

Horizontal concatenation (i.e., concatenating by putting columns next to one another) is the other option for concatenation. Note that it assumes row orders are consistent and _won't perform any check for improper row combinations_. Also, both languages will create duplicate column [labels]{.py}/[names]{.r} in our example because both data [variables]{.py}/[objects]{.r} have the same column [labels]{.py}/[names]{.r}.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} does horizontal concatenation with the `cbind` function from base [{{< fa brands r-project >}} R]{.r}. 

```{r r-colbind}
# Combine horizontally
vert_r_horiz <- cbind(vert_r_top, vert_r_bottom)

# Check columns to show they were added
names(vert_r_horiz)
```

## [{{< fa brands python >}} Python]{.py}

Setting the `axis` argument to `1` is how the `concat` function is switched to horizontal concatenation.

```{python py-colbind}
# Combine horizontally
vert_py_horiz = pd.concat([vert_py_top, vert_py_bottom], axis = 1)

# Check columns to show they were added
vert_py_horiz.columns
```
:::

## Joins

While concatentation is simple and effective in some cases, a less risky mode of combining separate data objects is to use "joins". Joins allow two data [variables]{.py}/[objects]{.r} to be combined in an appropriate order based on a column found in both data [variables]{.py}/[objects]{.r}. This shared column (or column_s_) are known as "join keys".

This module makes use of simulated data on lichen coverage on various trees and the distance of those trees to the nearest road. Lichen wasn't surveyed on all of the trees and not all trees that _did_ have lichen measured had their distance to the nearest road recorded. Vertical concatenation is inappropriate because the column [labels]{.py}/[names]{.r} and horizontal concatenation is incorrect because we want to make sure we only combine data from the same tree across the two datasets.

There are several variants of joins that each function slightly differently so we'll discuss each below. Note that all <u>joins only combine two data [variables]{.py}/[objects]{.r} at a time</u> and they are always referred to as the "left" (first) and "right" (second).

### Left / Right Joins

Left joins combine two data [variables]{.py}/[objects]{.r} by their join key(s) but only keep rows that are found in the "left". A right join performs the same operation but prioritizes rows in the second data [variable]{.py}/[object]{.r}. Note that functionally you can make a left join into a right join by switching which data [variable]{.py}/[object]{.r} you assign to "left" versus "right".

<p align="center">
<img src="images/joins/left-join_table.png" width="75%"/>
</p>

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

The `left_join` function (from the `dplyr` package) performs--as its name suggests--left joins. The `right_join` function does the same for right joins. Both have an `x` and a `y` argument for the left and right data objects respectively.

The `by` argument expects a vector of join keys to by which to join. Note that if the join key column names differ--as is the case with our example data--you can specify that by using the syntax `"left key" = "right key"`.

```{r r-left}
# Do a left join
left_r <- dplyr::left_join(x = lich, y = road, by = c("tree" = "tree_name"))

# Check that out
head(left_r, n = 5)
```

See how "tree_B" has no distance to road listed? That is because it lacks a row in the right data object!

## [{{< fa brands python >}} Python]{.py}

All [{{< fa brands python >}} Python]{.py} joins are performed with the `merge` function (from the `pandas` library) and the type of join is specified by the `how` argument. Which data variable is left and right is specified by arguments of the same name.

The `on` argument can be used when the join key(s) labels match but otherwise a `left_on` and `right_on` argument are provided to specify the join key's label in the left and right data variable respectively.

```{python py-left}
# Do a left join
left_py = pd.merge(left = lich, right = road, how = "left", 
                   left_on = "tree", right_on = "tree_name")

# Check that out
left_py.head(5)
```

See how "tree_B" has no distance to road listed? That is because it lacks a row in the right data variable!
:::

### Inner Joins

<p align="center">
<img src="images/joins/inner-join_table.png" width="75%"/>
</p>

An inner join keeps _only_ the rows that can be found in <u>both</u> join keys. This is very useful in situations where only "complete" data (i.e., no missing values) is required.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

In [{{< fa brands r-project >}} R]{.r}, we can do an inner join with `dplyr`'s `inner_join` function. It has the same arguments as the `left_join` function.

```{r r-inner}
# Inner join the two datasets
in_r <- dplyr::inner_join(x = lich, y = road, by = c("tree" = "tree_name"))

# Check that out
head(in_r, n = 5)
```

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py}, an inner join is actually the default behavior of the `merge` function. Therefore we can exclude the `how` argument entirely!

```{python py-inner}
# Inner join the two datasets
in_py = pd.merge(left = lich, right = road,
                 left_on = "tree", right_on = "tree_name")

# Check that out
in_py.head(5)
```
:::

### Full Joins

A full join (sometimes known as an "outer join" or a "full outer join") combines all rows in both left and right data [variables]{.py}/[objects]{.r} regardless of whether the join key entries are found in both.

<p align="center">
<img src="images/joins/outer-join_table.png" width="75%"/>
</p>

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

[{{< fa brands r-project >}} R]{.r} uses the `full_join` function (from `dplyr`) and the same arguments as the other types of join that we've already discussed.

```{r r-outer}
# Do a full join
full_r <- dplyr::full_join(x = lich, y = road, by = c("tree" = "tree_name"))

# Check that out
head(full_r, n = 5)
```

## [{{< fa brands python >}} Python]{.py}

[{{< fa brands python >}} Python]{.py} refers to full joins as "outer" joins so the `how` argument of the `merge` function needs to be set to "outer".

```{python py-outer}
# Do a full join
full_py = pd.merge(left = lich, right = road, how = "outer",
                   left_on = "tree", right_on = "tree_name")

# Check that out
full_py.head(5)
```
:::