-
Notifications
You must be signed in to change notification settings - Fork 7
/
fallWorkshop.Rmd
545 lines (368 loc) · 23.5 KB
/
fallWorkshop.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
---
title: "Fall 2021 Workshop"
author: "Meredith Brown"
output:
html_document:
theme: cerulean
toc: true
toc_depth: 3
toc_float:
collapsed: true
---
```{r setup, include=FALSE}
library(tidyverse)
```
## Introduction
This workshop is for students new to R and RStudio.
Your introductory StatSci course this semester will march you through a thorough introduction to using R for statistical science and data science purposes, as well as for many other practical applications.
You should already be familiar with the environment you are going to use for your course.
This most likely involves accessing RStudio in your browser, via a Docker container provided through your course.
Your instructor should have given you that information and you should be ready to use it.
The materials here are a compilation of advice and instructions from your professors, TAs, and many helpful resources.
Here are some key ideas:
- Everyone can learn R!
It is very user-friendly, so do not be intimidated!
- R is an open-source programming language for statistical computing (<http://cran-r-project.org>).
- RStudio is the IDE (Interactive Development Environment) we use to work with R.
## Getting started with Github
### Creating a GitHub account
If you're new to GitHub, you'll first need to go in and create an account.
You will use GitHub to obtain the R Markdown (`.Rmd`) file for this workshop.
If you already have a GitHub account, great!
If not, take a minute to create one.
But first, some username advice:
- Try to incorporate your real name!
- Remember that horrible Snapchat username you made in middle school?
Don't do that again.
Make your username something you will feel comfortable sharing with future employers!
- Make it short and sweet!
After creating your account, **write down your username and password**.
Or better yet, get in the habit of using an online password manager (like [LastPass](https://oit.duke.edu/what-we-do/applications/lastpass)) to save it.
### Authenticating with SSH
Git is used to integrate GitHub and version control into your RStudio IDE.
To connect to Github, do the following:
1. In the console of the docker container, type `credentials::ssh_setup_github()`.
R will ask "No SSH key found. Generate one now?" You should **click 1 for yes**.
2. You will generate a key.
It will begin with "ssh-rsa..." R will then ask "Would you like to open a browser now?" You should **click 1 for yes**.
3. In Github in the browser, give your key a name and **paste the generated key**.
Hit save and refresh Github.
```{r echo = FALSE, out.width = "50%", fig.alt = "Screenshot of the GitHub page for saving a a new SSH key. On top is the Title box and at the bottom is the Key box. At the very bottom of the document there's a button that says 'Add SSH key'."}
knitr::include_graphics("images/SSH_key.png")
```
This process will only need to be done one time for your container.
### Introduce yourself to Git
You will use the `use_git_config()` function from the **usethis** package.
Type the following lines of code in the console in RStudio filling in your name and email address.
The email address should be the one tied to your GitHub account so that your commits can be tied to you on GitHub.
```{r get your github congfigured, echo=TRUE, eval=FALSE}
library(usethis)
use_git_config(user.name = "GitHub username", user.email = "your email")
```
For example, Meredith's would be:
```{r github-usethis-ex, echo=TRUE, eval = FALSE}
library(usethis)
use_git_config(user.name = "meredithb3", user.email = "[email protected]")
```
Take a minute to introduce yourself to Git by placing your own information in the `use_git_config()` function above.
### Fork and clone "Fall 2021 R Workshop"
Go to the following URL: <https://github.com/DukeStatSci/Fall_2021_R_Workshop>.
First, **Fork** the repository by clicking Fork in the upper right-hand portion of the screen.
Forking allows you to create a copy of the repository that you own, and hence can make changes to.
```{r echo = FALSE, out.width = "90%", fig.alt = "GitHub repo with the Fork button highlighted."}
knitr::include_graphics("images/new_fork.png")
```
Click on the green **Code** button and then click on the clipboard icon next to the URL to copy the repository link.
Make sure to copy the URL corresponding to the **SSH** tab.
```{r echo = FALSE, out.width = "90%", fig.alt = "GitHub repo with the Code button and the URL of the repo highlighted."}
knitr::include_graphics("images/new_clone-repo.png")
```
Now open your RStudio window.
Click **File** $\rightarrow$ **New Project** $\rightarrow$ **Version Control** $\rightarrow$ **Git** and hit **paste** in the **Repository URL** field.
- You will need to select a location to store this project. Good coding practice suggests creating a folder for your class, and then sub-folders for labs, homeworks, etc.
Now click **Create Project**.
To view the output that I am currently sharing with you, click on the file `fallWorkshop.Rmd` and click **Knit**.
## R overview
### RStudio
Your RStudio window generally has four panels.
This is customizable so you might have changed it already.
You can also drag the borders to resize each portion.
- *Source*: Where R Markdown files are edited and saved
- *Console*: Where R code is run interactive; cannot be saved easily
- *Environment*: Where R objects live and are stored
- *Utility*: Where a files, plots, list of packages, and help documentation can be viewed
```{r echo = FALSE, out.width = "90%", fig.alt = "RStudio window with four panes. Top left: Source. Top right: Environment. Bottom right: Utility. Bottom left: Console."}
knitr::include_graphics("images/rstudio.jpeg")
```
Your R Markdown file is in the upper left panel.
If you want to view the the output of this `Rmd` file in a nicer format, **click the Knit button** above.
This will process your `.Rmd` file into a HTML (or PDF or a Word) file.
The `.Rmd` and `.html` files are separate but directly related; the `.Rmd` file contains the source code, the `.html` file is the output.
When you make any changes in the `.Rmd` file, click Knit to update the `.html` file.
The panel in the upper right displays information on any objects available to you to work with (i.e., any objects available in your environment).
This can include any data you have loaded.
On a separate tab in this panel you can also view the history of the commands you've previously ran in the Console (more on that below).
And in a different tab you can use the Git interface to version control the work you're doing in RStudio.
Any plots that you generate will show up in the panel in the lower right corner.
This is also where you can browse your files, access help documentation, manage packages you have installed and loaded, etc.
Your console, in the bottom left panel, can be used to run code.
We generally recommend not relying on code ran in the console for any serious work (read: any work you want to save, like your homework!).
Instead, we'll introduce you to how to write your code (and save it) in an R Markdown file.
### R Markdown
The console is a great place for playing around with code, however it is not a good place for documenting your work.
Working in the console exclusively makes it difficult to document your work as you go, and reproduce it later.
R Markdown is a great solution for this problem.
And, you already have worked with an R Markdown document -- this workshop!
Type code in the code chunks provided in the R Markdown `(.Rmd)` document, and **Knit** the document to see the results.
This should produce an HTML file for you that lets you view the output of your code.
### R packages
R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free.
For this workshop, and many others in the future, we will use the following R packages:
- **tidyverse**: A group of packages for data manipulation, exploration and visualization that share a common coding syntax and style.
Some of the packages in the tidyverse include:
- **dplyr**: A package in the tidyverse for data wrangling
- **ggplot2**: A package in the tidyverse for data visualization
- **usethis**: A package for interfacing between RStudio and Github
These packages and most packages you will use for your courses have been pre-installed for you in the containers you're using to access RStudio.
In the future, if you download RStudio locally or wish to install new packages for your personal use, you can install packages using commands like `install.packages()` and `install_github()`.
Find out more about the tidyverse [here](https://www.tidyverse.org/).
Next, you need to load the packages in your working environment.
We do this with the `library()` function.
Note that you only need to *install* packages once, but you need to *load* them each time you start a new R session.
```{r load-packages}
library(tidyverse)
library(usethis)
```
To do so, you can use any **one** of the following options:
- Click on the green arrow at the top of the code chunk in the R Markdown (`.Rmd`) file
- Highlight these lines, and hit the **Run** button on the upper right corner of the pane
- Hit `CTRL` + `ENTER` to run the current line of code
- Hit `CTRL` + `SHIFT` + `ENTER` to run the current chunk of code
- Type the code in the console.
- Note: Running code in the console is useful for testing code and seeing the current status of variables, but any code that you want in your final `.Rmd` file should be put in a code chunk in the `.Rmd`.
Going forward you will be asked to load any relevant packages at the beginning of each lab or R exercise you do in class.
## Diving into the data
We are going to use a dataset called `starwars` from the **dplyr** package (which is part of the tidyverse).
Simply run `library(tidyverse)` if you haven't already and then run the following commands to inspect the first ten rows of the file.
```{r display-starwars-data, echo=TRUE}
# load starwars data
data(starwars)
# display starwars data
starwars
```
To do so, once again, you can:
- Click on the green arrow at the top of the code chunk in the R Markdown (`.Rmd`) file, or
- Put your cursor on the line with the code, and hit the **Run** button on the upper right corner of the pane, or
- Type the code in the Console.
This command instructs R to show you the data, which is stored as a `tibble`.
By default it only prints out the top 10 rows, but it also tells you that it's a `87 x 14` table, meaning 87 rows (observations) and 14 columns (variables).
As you interact with R, you will create a series of objects.
Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
One advantage of RStudio is that it comes with a built-in data viewer.
Click on the name `starwars` in the *Environment* pane (upper right window) that lists the objects in your workspace.
This will bring up an alternative display of the dataset in the *Data Viewer* (upper left window).
You can close the data viewer by clicking on the *x* in the upper left hand corner.
What you should see are 87 rows and 14 columns of data.
Each row represents a different character from Star Wars: the first entry in each row is the name of the character, the second is height in centimeters, and the third is the mass in kilograms.
Use the scroll bar on the right side of the *Data Viewer* window to examine the complete dataset.
Note that the row numbers in the first column are not part of the `starwars` data.
R adds them as part of its printout to help you identify the rows.
You can think of them as the index that you see on the left side of a spreadsheet.
In fact, the comparison to a spreadsheet will generally be helpful.
R has stored the `starwars` data in a kind of spreadsheet or table called a `data frame`.
You can see also the dimensions -- the number of observations and variables -- of this data frame by typing:
```{r dim-data, echo = TRUE}
dim(starwars)
```
This command should output `[1] 87 14`, indicating that there are 87 rows (observations) and 14 columns (variables).
You can see the names of these columns (variables) by typing:
```{r names-data, echo=TRUE}
names(starwars)
```
**Exercise:** What are some of the values for `homeworld` included in this dataset?
**Hint:** Take a look at the `homeworld` variable in the *Data Viewer* to answer this question.
Let's start to examine the data a little more closely.
To find out more about this dataset, you could use some help.
Remove the `#` from the line of code below and then run the code chunk.
```{r help-starwars}
#?starwars
```
You should see the documentation on the dataset in the *Help* window.
Any time you need more information about a dataset, function, or package of interest, typing `?<function_name>` in the console will bring up its help page!
Try out the other tabs there briefly.
## Data exploration
At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments.
The `dim()` and `names()` commands, for example, each took a single argument, the name of a data frame.
The **dplyr** package aims to provide a function for each basic verb of data manipulation.
These verbs can be organised into three categories based on the component of the dataset that they work with:
- Rows:
- `filter()` chooses rows based on column values
- `slice()` chooses rows based on location
- `arrange()` changes the order of the rows
- Columns:
- `select()` changes whether or not a column is included
- `rename()` changes the name of columns
- `mutate()` changes the values of columns and/or creates new columns
- `relocate()` changes the order of the columns
- Groups of rows:
- `summarise()` collapses the data (or groups in the data) into a single row containing a summary statistic
The **dplyr** package also uses the `%>%` operator, called the *pipe operator*.
Basically, it takes the output of the current line and tells R to use that output as the first argument in the function in the next line.
### Select and slice
Putting the functions with the pipe operator, we can do something like this:
```{r species-top-15}
starwars %>%
select(species) %>%
slice(1:15)
```
Reading the code in plain English: From the `starwars` data frame, we select the column `species`.
We want the first 15 rows of the column `species`.
**Exercise:** Try selecting a different column to observe and only view the first 5 rows.
```{r new-col-top-5}
# Write your code here
```
### Mutate
We can create new columns in our data frame through the `mutate()` function.
```{r new-columns}
starwars <- starwars %>%
mutate(is_human = if_else(species == "Human", TRUE, FALSE))
starwars %>%
select(species, is_human)
```
See how we used the `select()` function to compare these two columns side-by-side!
The `mutate()` function added a new variable to the `starwars` data frame containing the values of either `TRUE` if the species was Human, or `FALSE` if it was not.
Note that we put `"Human"` in quotes since it's a character string.
Here, we've asked R to create *logical* data, data where the values are either `TRUE` or `FALSE`.
Other data types we've used so far are *numeric* and *character*.
### Filtering
We can reduce the dataset to a subset of rows by filtering.
Below, we use the `filter()` function on a particular value of `skin_color`.
```{r filter-skin-color}
starwars %>%
filter(skin_color == "fair")
```
We can also filter with multiple conditions.
Below, we filter for character with fair `skin_color` *and* blue `eye_color`.
```{r filter-skin-eye-color}
starwars %>%
filter(skin_color == "fair", eye_color =="blue")
```
**Exercise**: Try filtering for characters with brown hair color.
Don't forget to spell the name of the variable as it appears in the data frame and put brown in quotes since it's a character string.
```{r new-filter}
# Write your code here
```
### Booleans
In addition to simple mathematical operators like subtraction and division, you can ask R to make logical comparisons like greater than, `>`, less than, `<`, and equality, `==`.
For example, above we asked R to compare the species variable and the value `"Human"` using the equality boolean operator, `==`.
We can ask if the mass of the character is greater than 50 kg:
```{r x-greater-than-y}
starwars %>%
filter(mass > 50)
```
You may have noticed that the syntax and function of all these verbs are very similar:
- The first argument is a data frame
- The subsequent arguments describe what to do with the data frame
- The result is a new data frame
Together these properties make it easy to do complex things with a few lines of R code!
## Data visualization
R has some powerful functions for making graphics.
We will utilize the **ggplot2** package for these visualizations (which is part of the tidyverse).
To build a ggplot, we will use the following basic template that can be used for different types of plots:
```{r basic-template, eval = FALSE}
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>() +
labs(title = "YOUR TITLE", x = "X-label", y = "Y-label")
```
In English, the above code uses the following:
- `ggplot()` is the function we are using
- `data` is the dataset we are plotting from
- `mapping` specifies the variables of interest
- `<GEOM_FUNCTION>()` will be the specific type of plot to make, i.e. `geom_point`, `geom_bar`, etc.
- `labs()` specifies a title and labels for the axes
Let's visualize the distribution of the character heights v. masses:
```{r plot-height-vs-mass, echo=TRUE, warning=FALSE}
ggplot(data = starwars, aes(x = height, y = mass)) +
geom_point() +
labs(
title = "Mass and height of StarWars characters",
x = "Height (cm)",
y = "Mass (kg)"
)
```
If you run the plotting code, you should see the plot appear under the *Plots* tab of the lower right panel of RStudio.
Notice that the command above again looks like a function, this time with arguments separated by commas.
If we want to graph a **categorical variable** such as `gender`, we can use boxplots.
```{r better-plots-2, echo=TRUE, warning=FALSE}
ggplot(data = starwars, aes(x = gender, y = height, fill = gender)) +
geom_boxplot() +
labs(
title = "Height and gender of StarWars characters",
x = "Gender",
y = "Height (cm)",
fill = "Gender"
)
```
Here, we used the `fill` option to color the plot by the different categories of our variable.
The **ggplot2** package has tons of other cool features!
To learn more about the `ggplot()` function, check out the help page by running the code chunk below:
```{r plot-help, echo=TRUE}
?ggplot
```
```{r echo = FALSE, out.width = "90%", fig.alt = "RStudio IDE with the Help pane highlighted. Help pane shows the documentation for the ggplot function."}
knitr::include_graphics("images/help.jpeg")
```
Notice that the help file replaces the plot in the lower right panel.
You can toggle between plots and help files using the tabs at the top of that panel.
**Exercise:** Play with the `ggplot()` function to create a plot of interest to you.
```{r plot-of-something-interesting, error = TRUE}
ggplot(data = "____", aes(x = "____", y = "____")) +
"____"
```
For help with plotting with the **ggplot2** package, check out [this cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization.pdf)!
The best (and easiest) way to learn the syntax is to take a look at the sample plots provided on that page, and modify the code to fit your needs.
## Submitting your work
Now that you've gotten some practice coding - let's go over how to submit your work for classes.
Your professors will most likely ask for your final **PDF** file.
We will briefly go over how to create these outputs from an `.Rmd` file.
Turning your assignment can be accomplished from the following steps.
1. Once you are finished with the assignment, **Knit** the file.
2. Select all updated files in the **Git** tab of the *Environment* pane.
3. **Commit** the files with an informative title, and **push** the files to Github.
4. Go to your class's Gradescope page. **Submit the PDF** of your assignment.
5. **Select** the page each exercise is associated with.
6. Done!
```{r echo = FALSE, out.width = "50%", fig.alt = "Gradescope assignment submission screen."}
knitr::include_graphics("images/gradescope.png")
```
## Troubleshooting
Sometimes, your code will work in your R Markdown document but will not compile.
This is because when you **knit**, your machine is running the code line by line, from the top.
If one of your lines of code was entered in the console instead of in the file, uses a variable not created until later in your document, repeats a code chunk name, or possesses any other fault, you will run into an error.
R provides thorough error messages letting you know the issue at hand.
Very often, it is because the data or a package is not loaded into the file.
There are a lot of great resources online for troubleshooting error messages.
Sites like [**Stack Overflow**](https://stackoverflow.com/) and [**RStudio Community**](https://community.rstudio.com/) provide lots questions and answers.
Likewise, your TA and professors are here to help!
If you cannot find a solution online, post your code (privately) and error message on your course's **Ed Discussion board**, or visit your professor/TA's office hours.
## Resources for learning R
That was a short introduction to R and RStudio, but your instructors will provide you with more functions and a more complete sense of the language as your course progresses.
You might find the following tips and resources helpful.
- Many courses use the tidyverse while other courses will expect you to learn to work in base R, rather than in the RStudio IDE. In Tidyverse-flavored courses, you will use **dplyr** (for data wrangling) and **ggplot2** (for data visualization) extensively.
If you are googling for R code, make sure to also include these package names in your search query.
For example, instead of googling "scatterplot in R", google "scatterplot in R with ggplot2".
- Duke offers Coursera courses for free to Duke students through this link: <https://online.duke.edu/coursera-for-duke/>.
- R for Data Science is the definitive source for learning data science with R and tidyverse, and it's available for free online: <https://r4ds.had.co.nz/>.
- The following are great collections for free resources for learning R:
- <https://www.learnr4free.com/en/index.html>
- <https://www.bigbookofr.com/>
- The RStudio cheatsheets are great resources.
Note that some of the code on these cheatsheets may be too advanced for your needs now, however the majority of it will become useful as you progress through your statistical science courses.
Get all of them here: <https://www.rstudio.com/resources/cheatsheets/>.
- The R Weekly blog is a great for catching up with what's happening in the R space, on a weekly basis: <https://rweekly.org/>.
- More and more folks are streaming data science content on Twitch, many using R.
See a list of them here: <https://www.jessemaegan.com/blog/2021-05-28-data-science-twitch-streamers-round-up/>.
- There is a terrific resource on Medium called Towards Data Science, and you can find good information there, although you may have a limited number of free articles to read: <https://towardsdatascience.com/data-science/home>.
## Acknowledgements
This is derived from the `Learning R Tutorials` website.
Thanks to Joan Durso, Maria Tackett, George Lindner, Mine Çetinkaya-Rundel for all their help!