Commit 5f236dc

source commit: afa980f
0 parents  commit 5f236dc

File tree

88 files changed

+19088
-0
lines changed

01-rstudio-intro.md

Lines changed: 893 additions & 0 deletions
Large diffs are not rendered by default.

02-project-intro.md

Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
1+
---
2+
title: Project Management With RStudio
3+
teaching: 20
4+
exercises: 10
5+
source: Rmd
6+
---
7+
8+
::::::::::::::::::::::::::::::::::::::: objectives
9+
10+
- Create self-contained projects in RStudio
11+
12+
::::::::::::::::::::::::::::::::::::::::::::::::::
13+
14+
:::::::::::::::::::::::::::::::::::::::: questions
15+
16+
- How can I manage my projects in R?
17+
18+
::::::::::::::::::::::::::::::::::::::::::::::::::
19+
20+
21+
22+
## Introduction
23+
24+
The scientific process is naturally incremental, and many projects
25+
start life as random notes, some code, then a manuscript, and
26+
eventually everything is a bit mixed together.
27+
28+
<blockquote class="twitter-tweet"><p>Managing your projects in a reproducible fashion doesn't just make your science reproducible, it makes your life easier.</p>— Vince Buffalo (@vsbuffalo) <a href="https://twitter.com/vsbuffalo/status/323638476153167872">April 15, 2013</a></blockquote>
29+
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
30+
31+
Most people tend to organize their projects like this:
32+
33+
![](fig/bad_layout.png){alt='Screenshot of file manager demonstrating bad project organisation'}
34+
35+
There are many reasons why we should *ALWAYS* avoid this:
36+
37+
1. It is really hard to tell which version of your data is
38+
the original and which has been modified;
39+
2. It gets really messy because it mixes files with various
40+
extensions together;
41+
3. It probably takes you a lot of time to actually find
42+
things, and relate the correct figures to the exact code
43+
that has been used to generate them.
44+
45+
A good project layout will ultimately make your life easier:
46+
47+
- It will help ensure the integrity of your data;
48+
- It makes it simpler to share your code with someone else
49+
(a lab-mate, collaborator, or supervisor);
50+
- It allows you to easily upload your code with your manuscript submission;
51+
- It makes it easier to pick the project back up after a break.
52+
53+
## A possible solution
54+
55+
Fortunately, there are tools and packages which can help you manage your work effectively.
56+
57+
One of the most powerful and useful aspects of RStudio is its project management
58+
functionality. We'll be using this today to create a self-contained, reproducible
59+
project.
60+
61+
::::::::::::::::::::::::::::::::::::::: challenge
62+
63+
## Challenge 1: Creating a self-contained project
64+
65+
We're going to create a new project in RStudio:
66+
67+
1. Click the "File" menu button, then "New Project".
68+
2. Click "New Directory".
69+
3. Click "New Project".
70+
4. Type in the name of the directory to store your project, e.g. "my\_project".
71+
5. If available, select the checkbox for "Create a git repository."
72+
6. Click the "Create Project" button.
73+
74+
::::::::::::::::::::::::::::::::::::::::::::::::::
75+
76+
The simplest way to open an RStudio project once it has been created is to click
77+
through your file system to get to the directory where it was saved and double
78+
click on the `.Rproj` file. This will open RStudio and start your R session in the
79+
same directory as the `.Rproj` file. All your data, plots and scripts will now be
80+
relative to the project directory. RStudio projects have the added benefit of
81+
allowing you to open multiple projects at the same time, each open to its own
82+
project directory. This allows you to keep multiple projects open without them
83+
interfering with each other.
84+
85+
::::::::::::::::::::::::::::::::::::::: challenge
86+
87+
## Challenge 2: Opening an RStudio project through the file system
88+
89+
1. Exit RStudio.
90+
2. Navigate to the directory where you created a project in Challenge 1.
91+
3. Double click on the `.Rproj` file in that directory.
92+
93+
::::::::::::::::::::::::::::::::::::::::::::::::::
94+
95+
## Best practices for project organization
96+
97+
Although there is no "best" way to lay out a project, there are some general
98+
principles to adhere to that will make project management easier:
99+
100+
### Treat data as read only
101+
102+
This is probably the most important goal of setting up a project. Data is
103+
typically time-consuming and/or expensive to collect. Working with it
104+
interactively (e.g., in Excel) where they can be modified means you are never
105+
sure of where the data came from, or how it has been modified since collection.
106+
It is therefore a good idea to treat your data as "read-only".
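
One way to back this up at the file-system level, sketched here for a Unix-like shell, is to remove write permission from the raw data files. The scratch directory and the header-only sample file are stand-ins chosen for illustration, not part of the lesson's data.

```sh
# Work in a throwaway directory so the sketch does not touch a real project.
project_dir="$(mktemp -d)/my_project"
mkdir -p "$project_dir/data"

# Stand-in for a raw data file (header line only).
printf 'country,year,pop,continent,lifeExp,gdpPercap\n' \
  > "$project_dir/data/gapminder_data.csv"

# Remove write permission for everyone: the file stays readable,
# but ordinary (non-root) users can no longer modify it in place.
chmod a-w "$project_dir/data/gapminder_data.csv"

# Show the resulting permissions.
ls -l "$project_dir/data/gapminder_data.csv"
```

The owner can restore write access with `chmod u+w` if the raw file ever needs to be replaced, so this is a guard against accidents rather than a lock.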
107+
108+
### Data cleaning
109+
110+
In many cases your data will be "dirty": it will need significant preprocessing
111+
to get into a format R (or any other programming language) will find useful.
112+
This task is sometimes called "data munging". Storing the cleaning scripts in a
113+
separate folder, and creating a second "read-only" data folder to hold the
114+
"cleaned" data sets can prevent confusion between the two sets.
115+
116+
### Treat generated output as disposable
117+
118+
Anything generated by your scripts should be treated as disposable: it should
119+
all be able to be regenerated from your scripts.
120+
121+
There are lots of different ways to manage this output. Having an output folder
122+
with different sub-directories for each separate analysis makes it easier later,
123+
since many analyses are exploratory and don't end up being used in the final
124+
project, and some of the analyses get shared between projects.
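
As a rough illustration of "disposable" output, the sketch below deletes a results folder and rebuilds it from the data and the code. The `run_analysis` function, the `row_counts` sub-directory, and the sample file are hypothetical; in a real project the analysis step would usually be an R script.

```sh
# Scratch project with a tiny stand-in data file.
project_dir="$(mktemp -d)/my_project"
mkdir -p "$project_dir/data"
printf 'country,year\nAfghanistan,1952\nAfghanistan,1957\n' \
  > "$project_dir/data/sample.csv"

# Hypothetical analysis step: count the data rows (skipping the header)
# and write the count into its own sub-directory of results/.
run_analysis() {
  mkdir -p "$project_dir/results/row_counts"
  tail -n +2 "$project_dir/data/sample.csv" | wc -l | tr -d ' ' \
    > "$project_dir/results/row_counts/sample.txt"
}

run_analysis

# The output can be rebuilt from data plus code, so deleting it loses nothing.
rm -r "$project_dir/results"
run_analysis
cat "$project_dir/results/row_counts/sample.txt"
```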
125+
126+
::::::::::::::::::::::::::::::::::::::::: callout
127+
128+
## Tip: Good Enough Practices for Scientific Computing
129+
130+
[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization:
131+
132+
1. Put each project in its own directory, which is named after the project.
133+
2. Put text documents associated with the project in the `doc` directory.
134+
3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory.
135+
4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory.
136+
5. Name all files to reflect their content or function.
137+
138+
::::::::::::::::::::::::::::::::::::::::::::::::::
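
The directory names recommended above can be created in one step from the shell. This is a minimal sketch that builds the layout in a scratch location; the project name `my_project` follows Challenge 1, and the temporary parent directory is an assumption made so the commands are safe to run anywhere.

```sh
# Scratch location for the example project.
project_dir="$(mktemp -d)/my_project"

# One directory per purpose: documents, raw data, generated results,
# the project's own source code, and external or compiled programs.
mkdir -p "$project_dir/doc" "$project_dir/data" "$project_dir/results" \
         "$project_dir/src" "$project_dir/bin"

# List the layout that was created.
ls "$project_dir"
```

Inside an existing RStudio project, the same `mkdir -p` call can be run from the project directory with the `"$project_dir/"` prefix dropped.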
139+
140+
### Separate function definition and application
141+
142+
One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console.
143+
144+
When your project is in its early stages, the initial .R script file usually contains many lines
145+
of directly executed code. As it matures, reusable chunks get pulled into their
146+
own functions. It's a good idea to keep this code in two separate folders: one
147+
to store useful functions that you'll reuse across analyses and projects, and
148+
one to store the analysis scripts.
149+
150+
### Save the data in the data directory
151+
152+
Now that we have a good directory structure, we will save the data file in the `data/` directory.
153+
154+
::::::::::::::::::::::::::::::::::::::: challenge
155+
156+
## Challenge 3
157+
158+
Download the gapminder data from [this link to a csv file](data/gapminder_data.csv).
159+
160+
1. Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press <kbd>Ctrl</kbd>\+<kbd>S</kbd> or choose File -> "Save page as")
161+
2. Make sure it's saved under the name `gapminder_data.csv`
162+
3. Save the file in the `data/` folder within your project.
163+
164+
We will load and inspect these data later.
165+
166+
::::::::::::::::::::::::::::::::::::::::::::::::::
167+
168+
::::::::::::::::::::::::::::::::::::::: challenge
169+
170+
## Challenge 4
171+
172+
It is useful to get some general idea about the dataset, directly from the
173+
command line, before loading it into R. Understanding the dataset better
174+
will come in handy when making decisions on how to load it in R. Use the command-line
175+
shell to answer the following questions:
176+
177+
1. What is the size of the file?
178+
2. How many rows of data does it contain?
179+
3. What kinds of values are stored in this file?
180+
181+
::::::::::::::: solution
182+
183+
## Solution to Challenge 4
184+
185+
By running these commands in the shell:
186+
187+
188+
``` sh
189+
ls -lh data/gapminder_data.csv
190+
```
191+
192+
``` output
193+
-rw-r--r-- 1 runner docker 80K Aug 5 17:54 data/gapminder_data.csv
194+
```
195+
196+
The file size is 80K.
197+
198+
199+
``` sh
200+
wc -l data/gapminder_data.csv
201+
```
202+
203+
``` output
204+
1705 data/gapminder_data.csv
205+
```
206+
207+
There are 1705 lines: one header line followed by 1704 rows of data. The data looks like:
208+
209+
210+
``` sh
211+
head data/gapminder_data.csv
212+
```
213+
214+
``` output
215+
country,year,pop,continent,lifeExp,gdpPercap
216+
Afghanistan,1952,8425333,Asia,28.801,779.4453145
217+
Afghanistan,1957,9240934,Asia,30.332,820.8530296
218+
Afghanistan,1962,10267083,Asia,31.997,853.10071
219+
Afghanistan,1967,11537966,Asia,34.02,836.1971382
220+
Afghanistan,1972,13079460,Asia,36.088,739.9811058
221+
Afghanistan,1977,14880372,Asia,38.438,786.11336
222+
Afghanistan,1982,12881816,Asia,39.854,978.0114388
223+
Afghanistan,1987,13867957,Asia,40.822,852.3959448
224+
Afghanistan,1992,16317921,Asia,41.674,649.3413952
225+
```
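
Because `wc -l` counts the header line too, it can help to separate the header from the data rows. The sketch below does this on a three-line sample built from the first lines shown above, so that it runs on its own; the temporary file is an assumption, and the same commands can be pointed at `data/gapminder_data.csv` instead.

```sh
# Small sample: the header plus the first two rows shown above.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296
EOF

# Data rows only: skip the first (header) line before counting.
n_rows="$(tail -n +2 "$sample" | wc -l | tr -d ' ')"
echo "data rows: $n_rows"

# Column names, one per line, taken from the header.
head -n 1 "$sample" | tr ',' '\n'
```

The header names the columns, and the rows below it show which columns hold text (such as `country` and `continent`) and which hold numbers.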
226+
227+
:::::::::::::::::::::::::
228+
229+
::::::::::::::::::::::::::::::::::::::::::::::::::
230+
231+
::::::::::::::::::::::::::::::::::::::::: callout
232+
233+
## Tip: command line in RStudio
234+
235+
The Terminal tab in the console pane provides a convenient place directly
236+
within RStudio to interact with the command line.
237+
238+
::::::::::::::::::::::::::::::::::::::::::::::::::
239+
240+
### Working directory
241+
242+
Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.
243+
244+
Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in.
245+
246+
::::::::::::::::::::::::::::::::::::::: challenge
247+
248+
## Challenge 5
249+
250+
You can check the current working directory with the `getwd()` command, or by using the menus in RStudio.
251+
252+
1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter.
253+
2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory".
254+
255+
You can change the working directory with `setwd()`, or by using RStudio menus.
256+
257+
1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory.
258+
2. In the menus at the top of the RStudio window, click the "Session" menu button, and then select "Set Working Directory" and then "Choose Directory". Next, in the file browser window that opens, navigate back to the project directory, and click "Open". Note that a `setwd` command will automatically appear in the console.
259+
260+
::::::::::::::::::::::::::::::::::::::::::::::::::
261+
262+
::::::::::::::::::::::::::::::::::::::::: callout
263+
264+
## Tip: File does not exist errors
265+
266+
When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory.
267+
You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path.
268+
269+
::::::::::::::::::::::::::::::::::::::::::::::::::
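
The shell has the same notion of a working directory (`pwd` and `cd` play the roles of `getwd()` and `setwd()`), so the difference between relative and absolute paths can be tried out there. This is a sketch in a scratch project; the directory and the header-only file are stand-ins for illustration.

```sh
# Scratch project with one data file.
project_dir="$(mktemp -d)/my_project"
mkdir -p "$project_dir/data"
printf 'country,year\n' > "$project_dir/data/gapminder_data.csv"

# From the project directory, the relative path resolves.
cd "$project_dir"
pwd
[ -f data/gapminder_data.csv ] && echo "found from project directory"

# After moving into data/, the same relative path no longer resolves...
cd data
[ -f data/gapminder_data.csv ] || echo "not found from data/"

# ...but an absolute path works from any working directory.
[ -f "$project_dir/data/gapminder_data.csv" ] && echo "found via absolute path"
```

The second check fails for the same reason a "file does not exist" error appears in R after `setwd("data")`: the relative path is now looked up inside `data/`.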
270+
271+
### Version control
272+
273+
It is important to use version control with projects. See [this lesson on using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html) for a good introduction.
274+
275+
:::::::::::::::::::::::::::::::::::::::: keypoints
276+
277+
- Use RStudio to create and manage projects with a consistent layout.
278+
- Treat raw data as read-only.
279+
- Treat generated output as disposable.
280+
- Separate function definition and application.
281+
282+
::::::::::::::::::::::::::::::::::::::::::::::::::
283+
284+
