diff --git a/conditionals.qmd b/conditionals.qmd index 265570d..823b0a7 100644 --- a/conditionals.qmd +++ b/conditionals.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -48,6 +49,7 @@ f <- function(x) { cat("D") } ``` + ```{r} f(0) f(1) @@ -65,6 +67,7 @@ absValue <- function(x) { return(x) } ``` + ```{r} absValue(7) # Returns 7 absValue(-7) # Also returns 7 @@ -112,6 +115,7 @@ f <- function(x) { cat("G") } ``` + ```{r} f(0) f(1) @@ -138,6 +142,7 @@ getLetterGrade <- function(score) { return(grade) } ``` + ```{r} cat("103 -->", getLetterGrade(103)) cat(" 88 -->", getLetterGrade(88)) diff --git a/creating-functions.qmd b/creating-functions.qmd index b233223..f1a69e8 100644 --- a/creating-functions.qmd +++ b/creating-functions.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -55,7 +56,10 @@ For example, here's the function `mySqrt(n)`, which returns the square root of ` |`mySqrt` | `<-` | `function` | `(n)` | `{ return(n^0.5) }` | And here's `mySqrt(n)` written in the typical format: -```{r, eval=FALSE} + +```{r} +#| eval: false + mySqrt <- function(n) { return(n^0.5) } @@ -71,6 +75,7 @@ square <- function(x) { return(y) } ``` + ```{r} square(2) square(8) @@ -84,6 +89,7 @@ sumTwoValues <- function(x, y) { return(value) } ``` + ```{r} sumTwoValues(2, 3) sumTwoValues(3, 4) @@ -96,6 +102,7 @@ doSomething <- function() { cat("Carpe diem!") # The cat() function prints whatever's inside it to the console } ``` + ```{r} doSomething() ``` @@ -109,6 +116,7 @@ f <- function(x, y=10) { return(x + y) } ``` + ```{r} f(5) # 15 f(5, 1) # 6 @@ -123,6 +131,7 @@ isPositive <- function(x) { return (x > 0) } ``` + ```{r} isPositive(5) # TRUE isPositive(-5) # FALSE @@ -138,6 +147,7 @@ isPositive <- function(x) { cat("Goodbye!") # Does not run ("dead code") } ``` + ```{r} x <- isPositive(5) # Prints Hello, then assigns TRUE to x x @@ -152,15 +162,18 @@ f <- function(x) { x + 42 } ``` + ```{r} f(5) ``` + ```{r} f <- function(x) { x + 42 x + 7 } ``` + ```{r} f(5) ``` @@ -174,6 +187,7 @@ printX <- function(x) { cat("The value of x provided is", x) } ``` + ```{r} printX(7) printX(42) @@ -186,6 +200,7 @@ cubed <- function(x) { cat(x^3) } ``` + ```{r} cubed(2) # Seems to work 2*cubed(2) # Expected 16...didn't work @@ -198,6 +213,7 @@ cubed <- function(x) { return(x^3) # That's better! } ``` + ```{r} cubed(2) # Works! 2*cubed(2) # Works! @@ -235,6 +251,7 @@ minSquared <- function(x, y) { return(smaller^2) } ``` + ```{r} minSquared(3, 4) minSquared(4, 3) @@ -242,7 +259,9 @@ minSquared(4, 3) If you try to call a local variable in the global environment, you'll get an error: -```{r error=TRUE} +```{r} +#| error: true + square <- function(x) { y <- x^2 return(y) @@ -252,15 +271,21 @@ y _"Global"_ variables are those in the global environment. These will show up in the "Environment" pane in RStudio. You can call these inside functions, but this is **BAD** practice. Here's an example (**Don't do this!**): -```{r include=FALSE} +```{r} +#| include: false + n <- NULL ``` -```{r error=TRUE} + +```{r} +#| error: true + printN <- function() { cat(n) # n is not local -- so it is global (bad idea!!!) } printN() # Nothing happens because n isn't defined ``` + ```{r} n = 5 # Define n in the global environment printN() diff --git a/data-analysis.qmd b/data-analysis.qmd index 001a1a6..ded1f48 100644 --- a/data-analysis.qmd +++ b/data-analysis.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -35,10 +36,10 @@ head(orings) We can see that the dataset contains observations about the temperatures of launches and O-ring damage, but we don't yet have _information_. One step forward towards _information_ is to simply plot the data to _see_ if there might be a relationship between temperature and O-ring damage: ```{r} +#| label: "challenger-temps" #| message: false #| fig.width: 8 #| fig.height: 3 -#| label: "challenger-temps" library(ggplot2) diff --git a/data-frames.qmd b/data-frames.qmd index 5238379..1fbd68e 100644 --- a/data-frames.qmd +++ b/data-frames.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -53,6 +54,7 @@ The [tibble](https://r4ds.had.co.nz/tibbles.html) is an improved version of the ```{r} #| message: false + library(dplyr) ``` @@ -192,6 +194,7 @@ beatles$deceased == FALSE ``` Then, you could insert this logical vector in the row position of the `[]` brackets to filter only the rows that are `TRUE`: + ```{r} beatles[beatles$deceased == FALSE,] ``` @@ -250,21 +253,33 @@ There are generally two ways to load external data. Many R packages come with pre-loaded datasets. For example, the **ggplot2** library (which we'll [use soon](data-visualization.html) to make plots in R) comes with the `msleep` dataset already loaded. To see this, install **ggplot2** and load the library: -```{r, eval=FALSE, message=FALSE} -install.packages("ggplot2") +```{r} +#| eval: false +#| message: false + +# install.packages("ggplot2") # Do this only once! library(ggplot2) + head(msleep) # Preview just the first 6 rows of the data frame ``` -```{r, echo=FALSE, message=FALSE} + +```{r} +#| echo: false +#| message: false + library(ggplot2) + head(msleep) ``` If you want to see all of the different datasets that any particular package contains, you can call the `data()` function after loading a library. For example, here are all the dataset that are contained in the **ggplot2** library: -```{r, eval=FALSE} +```{r} +#| eval: false + data(package = "ggplot2") ``` + ``` Data sets in package 'ggplot2': @@ -307,8 +322,11 @@ pathToData <- here('data', 'data.csv') 2. Import the data -```{r, eval=FALSE} +```{r} +#| eval: false + library(readr) + df <- read_csv(pathToData) ``` @@ -361,6 +379,7 @@ Now load the data: ```{r} library(readr) + msleep <- read_csv(here('data', 'msleep.csv')) ``` @@ -431,50 +450,62 @@ But just to give you an idea of where we're going, here are a few pieces of info 1) It appears that mammalian brain and body weight are logarithmically correlated - cool! ```{r} +#| label: 'msleep-scatter1' #| fig.height: 4 #| fig.width: 6 #| message: false #| warning: false library(ggplot2) -ggplot(msleep, aes(x=brainwt, y=bodywt)) + + +ggplot(msleep, aes(x = brainwt, y = bodywt)) + geom_point(alpha=0.6) + - stat_smooth(method='lm', col='red', se=F, size=0.7) + + stat_smooth(method = 'lm', col = 'red', se = FALSE, size = 0.7) + scale_x_log10() + scale_y_log10() + - labs(x='log(brain weight) in g', y='log(body weight) in kg') + + labs( + x = 'log(brain weight) in g', + y = 'log(body weight) in kg' + ) + theme_minimal() ``` 2) It appears there may also be a negative, logarithmic relationship (albeit weaker) between the size of mammalian brains and how much they sleep - cool! ```{r} +#| label: 'msleep-scatter2' #| fig.height: 4 #| fig.width: 6 #| message: false #| warning: false -ggplot(msleep, aes(x=brainwt, y=sleep_total)) + - geom_point(alpha=0.6) + +ggplot(msleep, aes(x = brainwt, y = sleep_total)) + + geom_point(alpha = 0.6) + scale_x_log10() + scale_y_log10() + - stat_smooth(method='lm', col='red', se=F, size=0.7) + - labs(x='log(brain weight) in g', y='log(total sleep time) in hours') + + stat_smooth(method = 'lm', col = 'red', se = FALSE, size = 0.7) + + labs( + x = 'log(brain weight) in g', + y = 'log(total sleep time) in hours' + ) + theme_minimal() ``` 3) Wow, there's a lot of variation in how much different mammals sleep - cool! ```{r} +#| label: 'msleep-bars' #| fig.height: 4 #| fig.width: 6 #| message: false #| warning: false -ggplot(msleep, aes(x=sleep_total)) + +ggplot(msleep, aes(x = sleep_total)) + geom_histogram() + - labs(x = 'Total sleep time in hours', - title = 'Histogram of total sleep time') + + labs( + x = 'Total sleep time in hours', + title = 'Histogram of total sleep time' + ) + theme_minimal() ``` diff --git a/data-visualization.qmd b/data-visualization.qmd index 6887ec8..c9a8cc7 100644 --- a/data-visualization.qmd +++ b/data-visualization.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -25,7 +26,9 @@ source("_common.R") ...and one of the best ways to develop insights from data is to _visualize_ the data. If you're completely new to data visualization, I recommend watching [this 40-minute video](https://www.youtube.com/watch?v=fSgEeI2Xpdc) on how humans see data, by John Rauser. This is one of the best overviews I've ever seen of how we can exploit our understanding of human psychology to design effective charts: - +
+ +
## R Setup @@ -35,9 +38,12 @@ Before we get started, let's set up our analysis environment like before: 2) Create a new `.R` file (File > New File > R Script), and save it as "`data_viz.R`" inside your "data_analysis_tutorial" R Project folder. 3) This time, instead of downloading the data file and saving it in our `data` folder, let's just read it in directly from the web! -```{r, message=FALSE} +```{r} +#| message: false + library(readr) library(dplyr) + df <- read_csv("https://raw.githubusercontent.com/jhelvy/p4a/main/data/north_america_bear_killings.csv") ``` @@ -79,6 +85,7 @@ R has a number of built-in tools for basic graph types. We will only cover two h A scatter plot provides a graphical view of the relationship between two variables. Typically these are used for "continuous" variables, like _time_, _age_, _money_, etc...things that are not categorical in nature (as opposed to "discrete" variables, like _nationality_). Here's a scatterplot of the age of the bear killing victims over time: ```{r} +#| label: 'scatter-basic' #| fig.height: 5 #| fig.width: 6 @@ -88,6 +95,7 @@ plot(x = df$year, y = df$age) The basic inputs to the `plot()` function are `x` and `y`, which must be vectors of the same length. You can customize many features (fonts, colors, axes, shape, titles, etc.) through [graphic options](http://www.statmethods.net/advgraphs/parameters.html). Here's the same plot with a few customizations: ```{r} +#| label: 'scatter-basic-pretty' #| fig.height: 5 #| fig.width: 6 @@ -107,6 +115,7 @@ Looks like bear killings are becoming more frequent over time (hmm, why might th The [histogram](https://en.wikipedia.org/wiki/Histogram) is one of the most common ways to visualize the _distribution_ of a variable. The `hist()` function takes just one variable: `x`. Here's a histogram of the `month` variable: ```{r} +#| label: 'hist-basic' #| fig.height: 5 #| fig.width: 6 @@ -116,6 +125,7 @@ hist(x = df$month) As you might expect, most bear attacks occur during the summer months, when parks get more visitors. As with the `plot()` function, you can customize a lot of the histogram features. One common customization is to modify the number of "bins" in the histogram by changing the `breaks` argument. Here we'll fix the number of bins to `12` - one for each month: ```{r} +#| label: 'hist-basic-pretty' #| fig.height: 5 #| fig.width: 6 @@ -155,6 +165,7 @@ library(ggplot2) The `ggplot()` function is used to initialize the basic graph structure, and then we add layers to it. The basic idea is that you specify different parts of the plot, and add them together using the `+` operator. We will start with a blank plot and will add layers as we go along: ```{r} +#| label: 'ggplot-blank' #| fig.height: 4 #| fig.width: 6 @@ -185,6 +196,7 @@ Each type of geom usually has a **required set of aesthetics** to be set, and us Now that we know what geoms and aesthetics are, let's put them to practice by making a scatterplot. To start, we will add the `geom_point()` geom and we'll set the position for the x- and y-axis inside the `aes()` function: ```{r} +#| label: 'ggplot-scatter' #| fig.height: 4 #| fig.width: 6 @@ -197,6 +209,7 @@ Notice how we've "added" the `geom_point()` layer to the previous blank slate. A If I wanted to change the point color, I could add that inside the `geom_point()` layer: ```{r} +#| label: 'ggplot-scatter-blue' #| fig.height: 4 #| fig.width: 6 @@ -207,6 +220,7 @@ ggplot(data = df) + But I could also _map_ one of my variables to the point color by placing the `color` variable inside the `aes()` function: ```{r} +#| label: 'ggplot-scatter-color' #| fig.height: 4 #| fig.width: 6 @@ -219,6 +233,7 @@ ggplot(data = df) + I recommend using the `geom_col()` layer to create bar charts, which are great for comparing different numerical values across a categorical variable. One of the simplest things to show with bars is the _count_ of how many observations you have. You can compute this by using the `count()` function, and then use the resulting data frame to create bars of those counts: ```{r} +#| label: 'ggplot-bars' #| fig.height: 4 #| fig.width: 6 @@ -234,6 +249,7 @@ ggplot(data = monthCounts) + Alternatively, you could use the `%>%` operator to pipe the results of a summary data frame directly into ggplot: ```{r} +#| label: 'ggplot-bars2' #| fig.height: 4 #| fig.width: 6 @@ -246,6 +262,7 @@ df %>% Just like how we mapped the point color to a variable in scatter plots, you can map the bar color to a variable with bar charts using the `fill` argument in the `aes()` call. For example, here's the same bar chart of the count of observations with the bar colors representing the type of bear. ```{r} +#| label: 'ggplot-bars-fill' #| fig.height: 4 #| fig.width: 6 @@ -260,6 +277,7 @@ Hmm, looks like brown bears are the most frequent killers, though black bears ar You can plot variables other than the count. For example, here is a plot of the mean age of the victim in each year: ```{r} +#| label: 'ggplot-bars-summary' #| fig.height: 4 #| fig.width: 7 @@ -280,6 +298,7 @@ There are lots of ways to tweak your ggplot to make it more aesthetically pleasi You can change the labels of your plot by adding the `labs()` layer: ```{r} +#| label: 'ggplot-scatter-labs' #| fig.height: 4 #| fig.width: 6 @@ -300,6 +319,7 @@ The `labs()` layer enables you to modify the labels of any of the variables that Adding theme layers can change some global aspects of the plot, such as the background color, grid lines, legend appearance, etc. There are [many themes to choose from](https://ggplot2.tidyverse.org/reference/ggtheme.html), but using simple themes like `theme_bw()` or `theme_minimal()` often improves the plot from the default theme settings: ```{r} +#| label: 'ggplot-scatter-minimal' #| fig.height: 4 #| fig.width: 6 @@ -341,6 +361,8 @@ ggplot(data = df) + Want to make a plot look fancy like those in the Economist magazine? Try `theme_economist()` from the [ggthemes](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) library: ```{r} +#| label: 'ggplot-scatter-economist' +#| warning: false #| fig.height: 4 #| fig.width: 6 diff --git a/data-wrangling.qmd b/data-wrangling.qmd index b91fb72..209e450 100644 --- a/data-wrangling.qmd +++ b/data-wrangling.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -31,7 +32,9 @@ Before we get started, let's set up our analysis environment: 2) Create a new `.R` file (File > New File > R Script), and save it as "`data_wrangling.R`" inside your "data-analysis-tutorial" R Project folder. 3) Use the `download.file()` function to download the `wildlife_impacts.csv` dataset, and save it in the `data` folder in your R Project: -```{r, eval=FALSE} +```{r} +#| eval: false + download.file( url = "https://github.com/jhelvy/p4a/raw/main/data/wildlife_impacts.csv", destfile = file.path('data', 'wildlife_impacts.csv') @@ -66,7 +69,9 @@ cost_repairs_infl_adj |double | Cost of repairs adjusted for inflation Let's load our libraries and read in the data: -```{r, message=FALSE} +```{r} +#| message: false + library(readr) library(dplyr) df <- read_csv(file.path('data', 'wildlife_impacts.csv')) @@ -106,7 +111,9 @@ As we saw in the last section, we can use brackets (`[]`) to access elements of The `dplyr` package was designed to make tabular data wrangling easier to perform and read. It pairs nicely with other libraries, such as **`ggplot2`** for visualizing data (which we'll cover next week). Together, `dplyr`, `ggplot2`, and a handful of other packages make up what is known as the ["Tidyverse"](https://www.tidyverse.org/) - an opinionated collection of R packages designed for data science. You can load all of the tidyverse packages at once using the `library(tidyverse)` command, but for now we're just going to install and use each package one at a time - starting with `dplyr`: -```{r, eval=FALSE} +```{r} +#| eval: false + install.packages("dplyr") ``` @@ -199,12 +206,18 @@ Consider reading the `%>%` operator as the words "...and then...". For instance, Here's another analogy: **Without Pipes**: -```{r, eval=FALSE} + +```{r} +#| eval: false + leave_house(get_dressed(get_out_of_bed(wake_up(me)))) ``` **With Pipes**: -```{r, eval=FALSE} + +```{r} +#| eval: false + me %>% wake_up %>% get_out_of_bed %>% @@ -395,7 +408,9 @@ heightSummary <- df %>% Save the the new `heightSummary` data frame as a CSV file in your "data_output" folder: -```{r, eval=FALSE} +```{r} +#| eval: false + write_csv(heightSummary, path = file.path('data_output', 'heightSummary.csv') ``` @@ -403,7 +418,9 @@ write_csv(heightSummary, path = file.path('data_output', 'heightSummary.csv') You will often need to create new variables based on a condition. To do this, you can use the `if_else()` function. Here's the general syntax: -```{r, eval=FALSE} +```{r} +#| eval: false + if_else(, , ) ``` @@ -411,7 +428,9 @@ The first argument is a condition. If the condition is TRUE, then the value give Here's an example of creating a variable to determine which months in the wildlife impacts data are in the summer: -```{r, eval=FALSE} +```{r} +#| eval: false + df %>% mutate( summer_month = if_else(incident_month %in% c(6, 7, 8), TRUE, FALSE)) @@ -419,15 +438,15 @@ df %>% Of course, in this particular case the `if_else()` function isn't even needed because the condition returns `TRUE` and `FALSE` values. However, if you wanted to extend this example to determine all four seasons, you could use a series of nested `if_else()` functions: -```{r, eval=FALSE} +```{r} +#| eval: false + df %>% - mutate(season = - if_else( - incident_month %in% c(3, 4, 5), 'Spring', - if_else( - incident_month %in% c(6, 7, 8), 'Summer', - if_else( - incident_month %in% c(9, 10, 11), 'Fall', 'Winter')))) + mutate(season = if_else( + incident_month %in% c(3, 4, 5), 'Spring', if_else( + incident_month %in% c(6, 7, 8), 'Summer', if_else( + incident_month %in% c(9, 10, 11), 'Fall', 'Winter') + ))) ``` Note: The Base R version of this function is `ifelse()`, but I recommend using the `dplyr` version, `if_else()`, as it is a [stricter function](https://stackoverflow.com/questions/50646133/dplyr-if-else-vs-base-r-ifelse). diff --git a/extensions.qmd b/extensions.qmd index 75610ea..f6ed733 100644 --- a/extensions.qmd +++ b/extensions.qmd @@ -1,3 +1,6 @@ # Extensions {#sec-extensions .unnumbered} -Coming soon! +In this section I'll show how to do a few specific things in R that build on all the tools we've seen so far. This section is a constant work in progress, so come back later for more concepts. For now, it includes two sub-sections: + +- [Monte Carlo Methods](monte-carlo-methods.qmd) +- [Using Python in R](python-in-r.qmd) diff --git a/figs/ggplot-bars-1.png b/figs/ggplot-bars-1.png new file mode 100644 index 0000000..f1fda7f Binary files /dev/null and b/figs/ggplot-bars-1.png differ diff --git a/figs/ggplot-bars-fill-1.png b/figs/ggplot-bars-fill-1.png new file mode 100644 index 0000000..2839e76 Binary files /dev/null and b/figs/ggplot-bars-fill-1.png differ diff --git a/figs/ggplot-bars-summary-1.png b/figs/ggplot-bars-summary-1.png new file mode 100644 index 0000000..bf33b12 Binary files /dev/null and b/figs/ggplot-bars-summary-1.png differ diff --git a/figs/ggplot-bars2-1.png b/figs/ggplot-bars2-1.png new file mode 100644 index 0000000..f1fda7f Binary files /dev/null and b/figs/ggplot-bars2-1.png differ diff --git a/figs/ggplot-blank-1.png b/figs/ggplot-blank-1.png new file mode 100644 index 0000000..3863f84 Binary files /dev/null and b/figs/ggplot-blank-1.png differ diff --git a/figs/ggplot-scatter-1.png b/figs/ggplot-scatter-1.png new file mode 100644 index 0000000..90ec0d9 Binary files /dev/null and b/figs/ggplot-scatter-1.png differ diff --git a/figs/ggplot-scatter-blue-1.png b/figs/ggplot-scatter-blue-1.png new file mode 100644 index 0000000..4609dd1 Binary files /dev/null and b/figs/ggplot-scatter-blue-1.png differ diff --git a/figs/ggplot-scatter-color-1.png b/figs/ggplot-scatter-color-1.png new file mode 100644 index 0000000..5bbd52f Binary files /dev/null and b/figs/ggplot-scatter-color-1.png differ diff --git a/figs/ggplot-scatter-economist-1.png b/figs/ggplot-scatter-economist-1.png new file mode 100644 index 0000000..c7dd8ec Binary files /dev/null and b/figs/ggplot-scatter-economist-1.png differ diff --git a/figs/ggplot-scatter-labs-1.png b/figs/ggplot-scatter-labs-1.png new file mode 100644 index 0000000..9886ea4 Binary files /dev/null and b/figs/ggplot-scatter-labs-1.png differ diff --git a/figs/ggplot-scatter-minimal-1.png b/figs/ggplot-scatter-minimal-1.png new file mode 100644 index 0000000..c40a586 Binary files /dev/null and b/figs/ggplot-scatter-minimal-1.png differ diff --git a/figs/hist-basic-1.png b/figs/hist-basic-1.png new file mode 100644 index 0000000..7b969a6 Binary files /dev/null and b/figs/hist-basic-1.png differ diff --git a/figs/hist-basic-pretty-1.png b/figs/hist-basic-pretty-1.png new file mode 100644 index 0000000..43a7aa3 Binary files /dev/null and b/figs/hist-basic-pretty-1.png differ diff --git a/figs/integration-1.png b/figs/integration-1.png new file mode 100644 index 0000000..c085e36 Binary files /dev/null and b/figs/integration-1.png differ diff --git a/figs/integration-box-1.png b/figs/integration-box-1.png new file mode 100644 index 0000000..65de0d3 Binary files /dev/null and b/figs/integration-box-1.png differ diff --git a/figs/monte-carlo-pie-1.png b/figs/monte-carlo-pie-1.png new file mode 100644 index 0000000..aeed0ab Binary files /dev/null and b/figs/monte-carlo-pie-1.png differ diff --git a/figs/msleep-bars-1.png b/figs/msleep-bars-1.png new file mode 100644 index 0000000..20be329 Binary files /dev/null and b/figs/msleep-bars-1.png differ diff --git a/figs/msleep-scatter-1.png b/figs/msleep-scatter-1.png new file mode 100644 index 0000000..59aaf9e Binary files /dev/null and b/figs/msleep-scatter-1.png differ diff --git a/figs/msleep-scatter1-1.png b/figs/msleep-scatter1-1.png new file mode 100644 index 0000000..59aaf9e Binary files /dev/null and b/figs/msleep-scatter1-1.png differ diff --git a/figs/msleep-scatter2-1.png b/figs/msleep-scatter2-1.png new file mode 100644 index 0000000..f9e00f6 Binary files /dev/null and b/figs/msleep-scatter2-1.png differ diff --git a/figs/scatter-basic-1.png b/figs/scatter-basic-1.png new file mode 100644 index 0000000..b55e2bc Binary files /dev/null and b/figs/scatter-basic-1.png differ diff --git a/figs/scatter-basic-pretty-1.png b/figs/scatter-basic-pretty-1.png new file mode 100644 index 0000000..77f8654 Binary files /dev/null and b/figs/scatter-basic-pretty-1.png differ diff --git a/figs/unnamed-chunk-9-1.png b/figs/unnamed-chunk-9-1.png index 3863f84..2f21dc1 100644 Binary files a/figs/unnamed-chunk-9-1.png and b/figs/unnamed-chunk-9-1.png differ diff --git a/functions-packages.qmd b/functions-packages.qmd index 36fa4fe..cae9b09 100644 --- a/functions-packages.qmd +++ b/functions-packages.qmd @@ -276,14 +276,18 @@ library(TurtleGraphics) Here's the idea. You have a turtle, and she lives in a nice warm terrarium. The terrarium is 100 x 100 units in size, where the lower-left corner is at the `(x, y)` position of `(0, 0)`. When you call `turtle_init()`, the turtle is initially positioned in the center of the terrarium at `(50, 50)`: -```{r, eval=FALSE} +```{r} +#| eval: false + turtle_init() ``` ![](images/turtle_init.png){ width=456 } You can move the turtle using a variety of movement functions (see `?turtle_move()`), and she will leave a trail where ever she goes. For example, you can move her 10 units forward from her starting position: -```{r eval=FALSE} +```{r} +#| eval: false + turtle_init() turtle_forward(distance = 10) ``` @@ -291,32 +295,41 @@ turtle_forward(distance = 10) You can also make the turtle jump to a new position (without drawing a line) by using the `turtle_setpos(x, y)`, where `(x, y)` is a coordinate within the 100 x 100 terrarium: -```{r eval=FALSE} +```{r} +#| eval: false + turtle_init() turtle_setpos(x=10, y=10) ``` + ![](images/turtle_setpos.png){ width=456 } ### Turtle loops -Simple enough, right? But what if I want my turtle to draw a more complicated shape? Let's say I want her to draw a hexagon. There are six sides to the hexagon, so the most natural way to write code for this is to write a `for` loop that loops over the sides (don't worry if this doesn't make sense yet - we'll get to [loops in week 5](L5-loops.html)!). At each iteration within the loop, I'll have the turtle walk forwards, and then turn 60 degrees to the left. Here's what happens: +Simple enough, right? But what if I want my turtle to draw a more complicated shape? Let's say I want her to draw a hexagon. There are six sides to the hexagon, so the most natural way to write code for this is to write a `for` loop that loops over the sides (if this doesn't make sense yet, go read ahead to the chapter on [iteration](iteration.qmd)!). At each iteration within the loop, I'll have the turtle walk forwards, and then turn 60 degrees to the left. Here's what happens: -```{r eval=FALSE} + +```{r} +#| eval: false + turtle_init() for (side in 1:6) { turtle_forward(distance = 10) turtle_left(angle = 60) } ``` + Cool! As you draw more complex shapes, you can speed up the process by wrapping your turtle commands inside the `turtle_do({})` function. This will skip the animations of the turtle moving and will jump straight to the final position. For example, here's the hexagon again without animations: -```{r eval=FALSE} +```{r} +#| eval: false + turtle_init() turtle_do({ for (side in 1:6) { @@ -325,6 +338,7 @@ turtle_do({ } }) ``` + ![](images/turtle_hexagon.png){ width=456 } ## Page sources {.unnumbered} diff --git a/getting-started.qmd b/getting-started.qmd index a41289e..ab92969 100644 --- a/getting-started.qmd +++ b/getting-started.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` diff --git a/intro.qmd b/intro.qmd index 5592f51..26e4753 100644 --- a/intro.qmd +++ b/intro.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` diff --git a/iteration.qmd b/iteration.qmd index 9d3b8b0..74ca3a1 100644 --- a/iteration.qmd +++ b/iteration.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -133,6 +134,7 @@ powersOfTwo <- function(upperLimit) { } } ``` + ```{r} powersOfTwo(5) powersOfTwo(100) diff --git a/monte-carlo-methods.qmd b/monte-carlo-methods.qmd index ba864c2..3cada0e 100644 --- a/monte-carlo-methods.qmd +++ b/monte-carlo-methods.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -70,6 +71,7 @@ That is, we want to compute the area under the curve of $x^2$ between $3 < x < 7 ```{r} +#| label: 'integration' #| echo: false #| message: false #| fig.width: 6 @@ -90,6 +92,7 @@ p One way to estimate the shaded area is to draw a bunch of random points inside a rectangle in the x-y plane that contains the shaded area and then count how many fall below the function line. So we're going to randomly draw points inside this box: ```{r} +#| label: 'integration-box' #| echo: false #| message: false #| fig.width: 6 @@ -185,7 +188,7 @@ So to compute $\pi$, all we need to do is multiply 4 times the ratio of the area First, generate lots of random points in a square. For this example, we'll use a square with side length of 1 centered at `(x, y) = (0, 0)`, so we need to draw random points between `x = (-0.5, 0.5)` and `y = (-0.5, 0.5)`: ```{r} -numTrials <- 1000 +numTrials <- 10000 points <- data.frame( x = runif(numTrials, -0.5, 0.5), y = runif(numTrials, -0.5, 0.5) @@ -210,7 +213,7 @@ points <- points %>% Just to make sure we correctly labeled the points, let's plot them, coloring them based on the `pointInCircle` variable we just created: ```{r} -#| echo: false +#| label: 'monte-carlo-pie' #| message: false #| fig.width: 6 #| fig.height: 4.5 @@ -218,7 +221,10 @@ Just to make sure we correctly labeled the points, let's plot them, coloring the library(ggplot2) ggplot(points) + - geom_point(aes(x = x, y = y, color = pointInCircle), size = 0.7) + + geom_point( + aes(x = x, y = y, color = pointInCircle), + size = 0.5 + ) + theme_minimal() ``` diff --git a/operators-data-types.qmd b/operators-data-types.qmd index 37a123d..fc5cfea 100644 --- a/operators-data-types.qmd +++ b/operators-data-types.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -26,7 +27,10 @@ R handles simple arithmetic using the following **arithmetic** operators:
-```{r arithmetic, echo=FALSE} +```{r} +#| label: 'arithmetic' +#| echo: false + knitr::kable(rbind( c("addition", "`+`", "`10 + 2`", "`12`"), c("subtraction", "`-`", "`9 - 3`", "`6`"), @@ -65,7 +69,10 @@ There are two other operators that are not typically as well-known as the first
-```{r mod, echo=FALSE} +```{r} +#| label: 'mod' +#| echo: false + knitr::kable( rbind( c("integer division", "`%/%`", "`4 %/% 3`", "`1`"), @@ -273,6 +280,7 @@ A logical expression `x & y` is `TRUE` only if *both* `x` and `y` are `TRUE`. ```{r} (2 == 2) & (2 == 3) # FALSE because the second comparison if not TRUE ``` + ```{r} (2 == 2) & (3 == 3) # TRUE because both comparisons are TRUE ``` @@ -410,7 +418,9 @@ typeof('3') Notice that even though the value _looks_ like a number, because it is inside quotes R interprets it as a character. If you mistakenly thought it was a a number, R will gladly return an error when you try to do a numerical operation with it: -```{r error=TRUE} +```{r} +#| error: true + '3' + 7 ``` diff --git a/python-in-r.qmd b/python-in-r.qmd index 02a6292..173f9cc 100644 --- a/python-in-r.qmd +++ b/python-in-r.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -31,19 +32,25 @@ While you can work with Python in a number of ways, we will use the [**reticulat To get started, install the package (remember, you only need to do this once on your computer): -```{r, eval=FALSE} +```{r} +#| eval: false + install.package('reticulate') ``` Once installed, load the package: -```{r, eval=FALSE} +```{r} +#| eval: false + library(reticulate) ``` If you already have Python installed on your computer, you should be okay, but you may see the following message pop up in the console: -```{r, eval=FALSE} +```{r} +#| eval: false + Would you like to install Miniconda? [Y/n]: ``` @@ -53,7 +60,9 @@ If so, I recommend you go ahead and install Miniconda by typing `y` and pressing Once you've loaded the **reticulate** library, use the following command to open up a Python REPL (which stands for "**R**ead–**E**val–**P**rint-**L**oop"): -```{r, eval=FALSE} +```{r} +#| eval: false + repl_python() ``` @@ -67,7 +76,9 @@ Above the `>>>` symbols, you should see a message indicating which version of Py If you want to get back to good 'ol R, just type the command `exit` into the Python console: -```{r, eval=FALSE} +```{r} +#| eval: false + exit ``` @@ -439,6 +450,7 @@ String manipulation is one area where more substantial differences emerge betwee ```{r} #| message: false + library(stringr) ``` @@ -460,7 +472,9 @@ In Python, many of the basic string manipulations are actually done with basic a **String concatenation**:
+ In R, we use the function `paste()` to combine strings: + ```{r} paste("foo", "bar", sep = "") ``` @@ -726,6 +740,7 @@ s[0:3] Note that we had to use a different starting index here to get the same sub-string in each language. That's because **indexing starts at 0 in Python**. If this seems strange, just imagine "fence posts". In Python, the elements in a sequence are like items sitting _between_ fence posts. So the index of each character in the string `"Apple"` look like this: + ```{r, eval=FALSE} index: 0 1 2 3 4 5 | | | | | | @@ -740,19 +755,25 @@ Negative indices are also handled differently.
**R**: Negative indices start from the end of the string _inclusively_: + ```{r} str_sub(s, -1) str_sub(s, -3) ``` +
+ **Python**: Negative indices start from the end of the string, but only return the _character at that index_: + ```{python, eval=FALSE} s[-1] ``` + ``` ## 'e' ``` + ```{python, eval=FALSE} s[-3] ``` @@ -780,12 +801,16 @@ You can get the index of a character or sub-string in Python using the `.index()
**R**: Returns the starting and ending indices of the sub-string + ```{r} str_locate(s, "pp") ``` +
+ **Python**: Returns only the starting index of the sub-string + ```{python, eval=FALSE} s.index("pp") ``` @@ -802,35 +827,44 @@ Like in R, splitting a string returns a list of strings. Python lists are simila
**R**: + ```{r} s <- "Apple" str_split(s, "pp") ``` +
+ **Python**: + ```{python, eval=FALSE} s = "Apple" s.split("pp") ``` + ``` + ## ['A', 'le'] ```
In both languages, the returned list contains the remaining characters after splitting the string (in this case, `"A"` and `"le"`). One main difference though is that R returns a list of _vectors_, so to access the returned vector containing `"A"` and `"le"` you have to access the first element in the list, like this: + ```{r} str_split(s, "pp")[[1]] ``` This is because in R the `str_split()` function is _vectorized_, meaning that the function can also be performed on a _vector_ of strings, like this: + ```{r} s <- c("Apple", "Snapple") str_split(s, "pp") ``` In this example, it's easier to see that R is returning a list of vectors. In contrast, Python cannot perform a split on multiple strings: + ```{python, eval=FALSE} s = ["Apple", "Snapple"] s.split("pp") @@ -868,26 +902,39 @@ n2 = isOdd(3) ``` Now that you have this code stored in your `foo.py` file, you can source the file from inside R, like this: + ```{r, eval=FALSE} reticulate::source_python('foo.py') ``` Magically, the function `isOdd()` and the objects we created (`n1` and `n2`) are now accessible from R! -```{r, eval=FALSE} + +```{r} +#| eval: false + isOdd(7) ``` + ``` ## [1] TRUE ``` -```{r, eval=FALSE} + +```{r} +#| eval: false + n1 ``` + ``` ## [1] FALSE ``` -```{r, eval=FALSE} + +```{r} +#| eval: false + n2 ``` + ``` ## [1] TRUE ``` diff --git a/strings.qmd b/strings.qmd index d5cddf9..0dc3b78 100644 --- a/strings.qmd +++ b/strings.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -23,7 +24,9 @@ A "string" is the generic word for character type variables. Base R has many bui Before going any further, make sure you install the `stringr` package and load it before trying to use any of the functions in this lesson: -```{r eval=FALSE} +```{r} +#| eval: false + install.packages("stringr") library(stringr) ``` @@ -39,12 +42,14 @@ cat('This is a string') If you have a string that contains a `'` symbol, use double quotes: Use them where it makes sense, e.g.: + ```{r} cat("It's a boy!") ``` Likewise, if you have a string that contains a `"` symbol, use single quotes: Use them where it makes sense, e.g.: + ```{r} cat('I said, "Hi!"') ``` @@ -52,12 +57,14 @@ cat('I said, "Hi!"') But what if you have a string that has both single and double quotes, like this: `It's nice to say, "Hi!"` In this case, you have to "escape" the quotes by using the `\` symbol: + ```{r} cat("It's nice to say, \"Hi!\"") # Double quotes escaped cat('It\'s nice to say, "Hi!"') # Single quote escaped ``` Escaping can be used for a lot of different string literals, such as starting a new line, adding a tab space, and even entering the `\` symbol itself: + ```{r} cat('New line:', 'This\nthat') cat('Tab space:', 'This\tthat') @@ -90,6 +97,7 @@ In addition to the Base R constants, the `stringr` library also comes with three ```{r} #| message: false + library(stringr) head(words) @@ -130,6 +138,7 @@ You can convert whole strings to lower-case, upper-case, and title-case using so ```{r} x <- "Want to hear a joke about paper? Never mind, it's tearable." ``` + ```{r} str_to_lower(x) str_to_upper(x) @@ -146,21 +155,25 @@ toTitleCase(x) ### Get the number of characters in a string If you want to find how long a string is (i.e. how many characters it contains), the `length()` function won't work: + ```{r} length("hello world") ``` That's be `length()` returns how many elements are in a _vector_ (in the above case, there's just one element). Instead, you should use `str_length()`: + ```{r} str_length("hello world") ``` Note that the space character has a length: + ```{r} str_length(" ") ``` Also note that the "empty" string (`""`) has no length: + ```{r} str_length("") ``` @@ -195,17 +208,20 @@ x If you want to know the start and end indices of a particular substring, use `str_locate()`. This is a helpful function to use in combination with `str_sub()` so you don't have to count the characters to find a substring. For example, let's say I want to extract the substring `"Good"` from the following string: + ```{r} x <- 'thisIsGoodPractice' ``` I could first use `str_locate()` to get the start and end indices: + ```{r} indices <- str_locate(x, 'Good') indices ``` Now that I have the start and end locations, I can use them within `str_sub()`: + ```{r} str_sub(x, indices[1], indices[2]) ``` @@ -213,11 +229,13 @@ str_sub(x, indices[1], indices[2]) ### Repeat a string To duplicate strings, use `str_dup()`: + ```{r} str_dup("hola", 3) ``` Note the difference with `rep()` (which returns a vector): + ```{r} rep("hola", 3) ``` @@ -232,6 +250,7 @@ str_trim(x) ``` By default, `str_trim()` removes whitespace on both sides, but you can specify a single side: + ```{r} str_trim(x, side = "left") # Only trim left side str_trim(x, side = "right") # Only trim right side @@ -241,6 +260,7 @@ str_trim(x, side = "right") # Only trim right side `str_pad()` pads a string to a fixed length by adding extra whitespace on the left, right, or both sides. Note that the `width` argument is the length of the _final_ string (not the length of the added padding): + ```{r} x <- "hello" x @@ -249,11 +269,13 @@ str_pad(x, width = 10, side = "both") # Pad both sides ``` You can pad with other characters by using the `pad` argument: + ```{r} str_pad(x, 10, side="both", pad='-') ``` Also, `str_pad()` will never make a string shorter: + ```{r} str_pad(x, 4) ``` @@ -295,6 +317,7 @@ printGreeting <- function(name, timeOfDay, isBirthday) { cat(greeting) } ``` + ```{r} printGreeting('John', 'morning', isBirthday = FALSE) printGreeting('John', 'morning', isBirthday = TRUE) @@ -307,26 +330,31 @@ Use `str_split()` to split a string up into pieces along a particular delimiter. ```{r} string <- 'This string has spaces-and-dashes' ``` + ```{r} str_split(string, " ") # Split on the spaces ``` + ```{r} str_split(string, "-") # Split on the dashes ``` By default, `str_split()` returns a `list` (another R data structure) of vectors. Each item in the list is a vector of strings. In the above cases, we gave `str_split()` a single string, so there is only one item in the returned list. In these cases, the easiest way to access the resulting vector of split strings is to use the double bracket `[[]]` operator to access the first list item: + ```{r} str_split(string, " ") # Returns a list of vectors str_split(string, " ")[[1]] # Returns the first vector in the list ``` If you give `str_split()` a vector of strings, it will return a list of length equal to the number of elements in the vector: + ```{r} x <- c('babble', 'scrabblebabble') str_split(x, 'bb') # Returns a list with two elements (each a vector) ``` A particularly useful string split is to split on the empty string (`""`), which breaks a string up into its individual characters: + ```{r} str_split(string, "")[[1]] ``` @@ -338,6 +366,7 @@ The `word()` function that another way to split up a longer string. It is design ```{r} sentence <- c("Be the change you want to be") ``` + ```{r} # Extract first word word(sentence, 1) @@ -359,6 +388,7 @@ You can sort a vector of strings alphabetically using `str_sort()` and `str_orde ```{r} x <- c('Y', 'M', 'C', 'A') ``` + ```{r} str_sort(x) str_sort(x, decreasing = TRUE) @@ -416,6 +446,7 @@ To force a match to a complete string, anchor it with both `^` and `$`: ```{r} x <- c("apple pie", "apple", "apple cake") ``` + ```{r} str_detect(x, "apple") str_detect(x, "^apple$") @@ -430,6 +461,7 @@ In the second example above, 1 & 3 are `FALSE` because there's a space after `ap ```{r} x <- c("apple", "pear", "banana") ``` + ```{r} str_replace(x, "a", "-") str_replace_all(x, "a", "-") @@ -438,27 +470,32 @@ str_replace_all(x, "a", "-") ## `stringr` functions work on vectors In many of the above examples, we used a single string, but most `stringr` functions are designed to work on vectors of strings. For example, consider a vector of two "fruit": + ```{r} x <- c("apples", "oranges") x ``` Get the first 3 letters in each string in `x`: + ```{r} str_sub(x, 1, 3) ``` Duplicate each string in `x` twice: + ```{r} str_dup(x, 2) ``` Convert all strings in `x` to upper case: + ```{r} str_to_upper(x) ``` Replace all `"a"` characters with a `"-"` character: + ```{r} str_replace_all(x, "a", "-") ``` @@ -468,6 +505,7 @@ str_replace_all(x, "a", "-") ### Breaking a string into characters Often times you'll want to break a string into it's individual character components. To do that, use `str_split()` with the empty string `""` as the delimiter: + ```{r} chars <- str_split("apples", "")[[1]] chars @@ -476,6 +514,7 @@ chars ### Breaking a sentence into words Similarly, if you have a single string that contains words separated by spaces, splitting on `" "` will break it into words: + ```{r} x <- "If you want to view paradise, simply look around and view it" str_split(x, " ")[[1]] @@ -484,6 +523,7 @@ str_split(x, " ")[[1]] ### Comparing strings If you want to compare whether two strings are the same, you must also consider their cases. For example: + ```{r} a <- "Apples" b <- "apples" @@ -491,6 +531,7 @@ a == b ``` The above returns `FALSE` because the cases are different on the `"a"` characters. If you want to ignore case, then a common strategy is to first convert the strings to a common case before comparing. For example: + ```{r} str_to_lower(a) == str_to_lower(b) ``` diff --git a/testing-debugging.qmd b/testing-debugging.qmd index 985402c..2994c12 100644 --- a/testing-debugging.qmd +++ b/testing-debugging.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -80,7 +81,9 @@ The two test cases we used for `isEvenNumber()` are "normal" cases because they One particular common error is when a user inputs the wrong data type to a function: -```{r error=TRUE} +```{r} +#| error: true + isEvenNumber('42') ``` @@ -111,7 +114,9 @@ testIsEvenNumber() Another approach to checking input types is to explicitly provide a better error message so the user can know what went wrong. For example, rather than return `FALSE` when we input a string to `isEvenNumber()`, we can use `stop()` to halt the function and send an error message: -```{r error=TRUE} +```{r} +#| error: true + isEvenNumber <- function(n) { if (! is.numeric(n)) { stop('Oops! This function requires numeric inputs!') @@ -158,7 +163,9 @@ immediately after the error has occurred. -```{r, eval = FALSE} +```{r} +#| eval: false + f <- function(x) { return(x + 1) } @@ -167,12 +174,15 @@ g <- function(x) { } g("a") ``` + ``` #> Error in x + 1 : non-numeric argument to binary operator ``` + ```r traceback() ``` + ``` #> 2: f(x) at #1 #> 1: g("a") @@ -181,7 +191,9 @@ traceback() Or by using `traceback()` as an error handler, which will call it immediately on any error. (You could even put this in your [`.Rprofile`](https://stackoverflow.com/questions/46819684/how-to-access-and-edit-rprofile)) -```{r, eval = FALSE} +```{r} +#| eval: false + options(error = traceback) g("a") ``` diff --git a/vectors.qmd b/vectors.qmd index 4c35705..b8ab8a2 100644 --- a/vectors.qmd +++ b/vectors.qmd @@ -3,6 +3,7 @@ ```{r} #| echo: false #| message: false + source("_common.R") ``` @@ -54,6 +55,7 @@ You can also create a vector by using the `rep()` function, which replicates the y <- rep(5, 10) # The number 5 ten times z <- rep(10, 5) # The number 10 five times ``` + ```{r} y z @@ -257,13 +259,17 @@ x[-c(2, 7)] # Returns everything except the 2nd and 7th elements But you cannot mix positive and negative integers while indexing: -```{r error=TRUE} +```{r} +#| error: true + x[c(-2, 7)] ``` If you try to use a float as an index, it gets rounded **down** to the nearest integer: -```{r error=TRUE} +```{r} +#| error: true + x[3.1415] # Returns the 3rd element x[3.9999] # Still returns the 3rd element ``` @@ -272,7 +278,9 @@ x[3.9999] # Still returns the 3rd element You can name the elements in a vector and then use those names to access elements. To create a named vector, use the `names()` function: -```{r error=TRUE} +```{r} +#| error: true + x <- seq(5) names(x) <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j') x @@ -280,14 +288,18 @@ x You can also create a named vector by putting the names directly in the `c()` function: -```{r error=TRUE} +```{r} +#| error: true + x <- c('a' = 1, 'b' = 2, 'c' = 3, 'd' = 4, 'e' = 5) x ``` Once your vector has names, you can then use those names as indices: -```{r error=TRUE} +```{r} +#| error: true + x['a'] # Returns the first element x[c('a', 'c')] # Returns the 1st and 3rd elements ``` @@ -321,6 +333,7 @@ When you perform arithmetic operations on vectors, they are executed on an eleme x1 <- c(1, 2, 3) x2 <- c(4, 5, 6) ``` + ```{r} # Addition x1 + x2 # Returns (1+4, 2+5, 3+6) @@ -337,7 +350,9 @@ x1 / x2 # Returns (1/4, 2/5, 3/6) When performing vectorized operations, the vectors need to have the same dimensions, or one of the vectors needs to be a single-value vector: -```{r error=TRUE} +```{r} +#| error: true + # Careful! Mis-matched dimensions will only give you a warning, but will still return a value: x1 <- c(1, 2, 3) x2 <- c(4, 5) @@ -347,6 +362,7 @@ x1 + x2 What R does in these cases is _repeat_ the shorter vector, so in the above case the last value is `3 + 4`. If you have a single value vector, R will add it element-wise: + ```{r} x1 <- c(1, 2, 3) x2 <- c(4)