chapter_07.Rmd

---
title: "7.3.4 Exercises"
author: "Matt Wood"
date: '2022-04-13'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(ggstance)
library(ggbeeswarm)

```

## 7.3.4 Exercises

Exercises from [Variation: Unusual Values](https://r4ds.had.co.nz/exploratory-data-analysis.html#exercises-15).

1. Explore the distribution of each of the `x`, `y`, and `z` variables in `diamonds`. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

ANSWER. It looks like the values in most of x & y are very similar, while the values of z are a bit smaller. Since most diamonds are cut to be round and relatively flat, I would assume `z` is the depth dimension when staring at the diamonds face, while `x` and `y` are the length and width, respectively.

```{r p_7.3.4.1}

df_x_z <- diamonds %>% 
  dplyr::select(x:z) %>% 
  tidyr::pivot_longer(cols = dplyr::everything(), names_to = "var", values_to = "value")

ggplot2::ggplot(df_x_z, ggplot2::aes(x = value)) +
  ggplot2::geom_histogram(binwidth = .5) +
  ggplot2::facet_wrap(facets = ggplot2::vars(var))
  
ggplot2::ggplot(df_x_z, ggplot2::aes(x = value, color = var)) +
  geom_freqpoly(binwidth = .5)


```


2. Explore the distribution of `price`. Do you discover anything unusual or surprising? (Hint: Carefully think about the `binwidth` and make sure you try a wide range of values.)

ANSWER. There is a gap in price right around $1500, and the distribution afterwards is much flatter than at lower values. Could have something to do with the markets that are being catered to with diamonds of each price, or something about how less expensive stones are used in jewelery (e.g., as accent pieces).

```{r p_7.3.4.2}

v_max_price <- dplyr::select(diamonds, price) %>% 
  dplyr::summarise(max = max(price, na.rm = TRUE)) %>% 
  purrr::as_vector()

dplyr::select(diamonds, price) %>% 
  ggplot2::ggplot(ggplot2::aes(x = price)) +
  ggplot2::geom_histogram(binwidth = 100) +
  ggplot2::scale_x_continuous(labels = seq(0, v_max_price, 1000), breaks = seq(0, v_max_price, 1000))

```


3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

ANSWER. This one is not clear. Might be some very small differences in distribution of cut or clarity. Hard to tell since there's only 20-some cases of 0.99 carat.

```{r p_7.3.4.3}

diamonds %>% 
  dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  dplyr::count(carat)

# diamonds %>% 
#   dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
#   tidyr::pivot_longer(depth:z, names_to = "var", values_to = "value") %>% 
#   ggplot2::ggplot(ggplot2::aes(x = value)) + 
#   ggplot2::geom_histogram() +
#   ggplot2::facet_grid(rows = ggplot2::vars(var), cols = ggplot2::vars(carat))

diamonds %>% 
  dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = carat, fill = cut)) +
  ggplot2::geom_bar(position = "dodge")

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
   ggplot2::ggplot(ggplot2::aes(x = carat, fill = color)) +
  ggplot2::geom_bar(position = "dodge")

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
   ggplot2::ggplot(ggplot2::aes(x = carat, fill = clarity)) +
  ggplot2::geom_bar(position = "dodge")

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = depth)) + 
  ggplot2::geom_histogram(binwidth = .5) +
  ggplot2::facet_grid(rows = ggplot2::vars(carat))

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = table)) + 
  ggplot2::geom_histogram(binwidth = 1) +
  ggplot2::facet_grid(rows = ggplot2::vars(carat))

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = price)) + 
  ggplot2::geom_histogram(binwidth = 100) +
  ggplot2::facet_grid(rows = ggplot2::vars(carat))

diamonds %>% 
    dplyr::filter(dplyr::between(carat, 0.99, 1)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = x)) + 
  ggplot2::geom_histogram(binwidth = .1) +
  ggplot2::facet_grid(rows = ggplot2::vars(carat))

```


4. Compare and contrast `coord_cartesian()` vs `xlim()` or `ylim()` when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

```{r p_7.3.4.4}

```


## 7.5.1.1 Exercises

Exercises from [Covariation: A categorical and continuous variable](https://r4ds.had.co.nz/exploratory-data-analysis.html#exercises-17).


1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```{r p_7.5.1.1.1}

```


2. What variable in the `diamonds` dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

```{r p_7.5.1.1.2}

```


3. Install the `ggstance` package, and create a horizontal boxplot. How does this compare to using `coord_flip()`?

```{r p_7.5.1.1.3}

```


4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using `geom_lv()` to display the distribution of price vs cut. What do you learn? How do you interpret the plots?

```{r p_7.5.1.1.4}

```


5. Compare and contrast geom_violin() with a facetted `geom_histogram()`, or a coloured `geom_freqpoly()`. What are the pros and cons of each method?

```{r p_7.5.1.1.5}

```


6. If you have a small dataset, it’s sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable. The `ggbeeswarm` package provides a number of methods similar to `geom_jitter()`. List them and briefly describe what each one does.

```{r p_7.5.1.1.6}

```


1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```{r p_7.5.1.1.1}

```


## 7.5.1.2 Exercises

Exercises from [Covariation: Two categorical values](https://r4ds.had.co.nz/exploratory-data-analysis.html#exercises-18).


1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?

```{r p_7.5.1.2.1}

```


2. Use `geom_tile()` together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

```{r p_7.5.1.2.2}

```


3. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.

```{r p_7.5.1.2.3}

```


## 7.5.1.3 Exercises

Exercises from [Covariation: Two continuous values](https://r4ds.had.co.nz/exploratory-data-analysis.html#exercises-19).


1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using `cut_width()` vs `cut_number()`? How does that impact a visualisation of the 2d distribution of `carat` and `price`?


```{r p_7.5.1.3.1}

```


2. Visualise the distribution of carat, partitioned by price.

```{r p_7.5.1.3.2}

```


3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?


```{r p_7.5.1.3.3}

```


4. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.

```{r p_7.5.1.3.4}

```


5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of `x` and `y` values, which makes the points outliers even though their `x` and `y` values appear normal when examined separately.

```{r p_7.5.1.3.5}

```