-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathterm_project.Rmd
260 lines (155 loc) · 12.1 KB
/
term_project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
---
title: "Term Project"
output:
html_document:
toc: yes
toc_depth: 2
toc_float:
collapsed: false
smooth_scroll: false
df_print: kable
number_sections: yes
---
<style>
h1{font-weight: 400;}
</style>
```{r, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE, message=FALSE, warning = FALSE, eval = TRUE, fig.align='center'
)
library(knitr)
# Don't repeat yourself!
hint_1 <- "This is the phase of the project that is the least straightforward. Thus, we recommend you start early and get help during office hours early and often."
hint_2 <- "This phase may require revisions to your original choice of data and visits to the Spinelli tutoring center for help with data wrangling."
only_group_leader <- "Only the group leader will make a single submission on behalf of the whole group. They will submit:"
submission_will_differ <- "(Note that your data proposal will likely differ slightly from this one)"
```
## Note to Instructors {-}
This is an example of instructions for the term project. You see the source R Markdown file for this page [here](https://github.com/moderndive/moderndive_labs/blob/master/term_project.Rmd){target="_blank"}.
***
# Instructions {-}
Everything in this course builds up to the term group project, where there is
only one learning goal: *Engage in the data/science research pipeline in as
faithful a manner as possible while maintaining a level suitable for novices.*
```{r, echo=FALSE, out.width="80%"}
include_graphics("static/images/data_science_pipeline.png")
```
In order to break down the task and minimize end-of-semester stress, you'll be working on the project in five phases:
1. **Project groups:** Form groups.
1. **Project data:** Propose a data set for your project. `r hint_1`
1. **Project proposal:** Ensure you can work with your data in R by performing an exploratory data analysis. `r hint_2`
1. **Project submission:** Make an initial submission of your project. You will skip some of the sections for now and only complete them after we have covered inference for regression in class. After you submit your work, you will get instructor feedback.
1. **Project resubmission:** Incorporate your instructor feedback from the project submission phase, complete the remaining sections, and resubmit your project. You will only be graded on your project resubmission.
We would like to thank Smith College students Alexis Cohen, Andrianne Dao, Isabel Gomez for allowing us to use their term project as an example.
***
# Project Groups {#groups}
1. Form groups of 2-3 students.
+ You must all be registered for the same lab section, because certain labs will be devoted to working on the project:
+ All groups members are expected to contribute and you will all be held accountable for your contributions in peer evaluations.
1. Choose a group name.
1. Assign a "team/group leader" who will have a few extra administrative duties at each phase. Ex: making submissions, filling out Google Forms, etc.
1. Only the team leader will then:
+ Create a Slack message that includes all (1) your group members, (2) your instructor, and (3) your lab instructor. To ensure everyone is on the same page, please ask all project related questions in this message.
+ Submit your group information using the "Project groups Google Form" on Moodle.
1. If you are looking for a group to join, please fill out the "Project groups Google Form" on Moodle.
***
# Project Data {#data}
**`r hint_1`**
* Get a sense for the requirements of this project phase by reading a possible [example data proposal](static/term_project/data_example.html){target="_blank"} `r submission_will_differ`:
* Download the following <a href="static/term_project/data.Rmd" download>`data.Rmd`</a> template R Markdown file and fill it in.
## Find data
### Specifications {-}
Find a data set that fits these specifications. Note your data may need a little wrangling from its original form.
1. (If available) An [identification variable](https://moderndive.com/1-getting-started.html#identification-vs-measurement-variables){target="_blank"} that uniquely identifies each observation in each row.
1. A numerical outcome variable $y$. Note: binary outcomes variables with 0/1 values are not technically numerical.
1. Two explanatory variables:
+ A numerical explanatory variable $x_1$. Note: this can be some notion of time.
+ A categorical explanatory variable $x_2$ that has between 3-5 levels. Note: If your data has more than 5 levels, they can be collapsed into 5 using data wrangling later.
1. At least 50 observations/rows.
### Possible sources {-}
Here are some possible data sources:
* Best option: data from your own research or other courses! The more connected you feel with your data, the more motivated you will be for this project.
* Next best options: Online data repositories such as (but not limited to):
+ [IPUMS survey & census data](https://ipums.org/){target="_blank"}
+ [data.world](https://data.world/){target="_blank"}
+ [Tableau](https://public.tableau.com/en-us/s/resources){target="_blank"}
+ [Google Dataset Search](https://toolbox.google.com/datasetsearch){target="_blank"}
+ [StatLib Datasets Archive](http://lib.stat.cmu.edu/datasets/)
<!--
* You may not use the following data:
+ Any datasets used in this class, either in ModernDive or in any of the examples.
+ Any data from the data journalism website FiveThirtyEight.com
-->
### Note on data confidentiallity {-}
If your data is not confidential or sensitive in nature, then publish your data as a CSV file on Google Sheets. That way your group can all access a single copy of your data on the web. If your data is confidential or sensitive in nature, **do not** publish it on the web, but rather submit the Excel or CSV file as well.
You can publish your data as a CSV file on Google Sheets by following the 6 steps in this Twitter thread:
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">2. Go to File -> Publish to the web ... <a href="https://t.co/CeM3IIemFU">pic.twitter.com/CeM3IIemFU</a></p>— Albert Y. Kim (@rudeboybert) <a href="https://twitter.com/rudeboybert/status/1055821837551235073?ref_src=twsrc%5Etfw">October 26, 2018</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
## What to submit
`r only_group_leader`
1. The `data.Rmd` R Markdown file
1. The `data.html` HTML report file
1. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.
## Hints
* **Where is this heading?**: For the next project phase (project proposal), you will be making a visualization like this [one](static/term_project/proposal_example.html#colored-scatterplot){target="_blank"}. If you can make a visualization like this one, then your data is set for the rest of the project.
* When you have questions:
+ If possible, please ask questions that you think the entire class would like to know the answer to in the `#project` channel in Slack.
+ Ask more individual questions in the group DM that includes all group members + your instructor + your lab instructor.
* Only minimal data wrangling using the `dplyr` package is expected at this time; you will be doing more for the "project proposal" phase coming up. That being said, feel free to experiment!
<!--
## General feedback for all groups {#phase_I_feedback}
1. Disclaimer: Just because things seem good now doesn't mean your project data is set for semester. Unforeseen problems may crop up during the next phase on data wrangling, at which point you may need to revise your data. This is a reality of data collection in the real world!
1. Do not include any `View()` statements in your `.Rmd` files as this may cause an error.
1. Avoid "data dumps". For example, showing the contents of all 1000 rows in a data frame. This will make your output document really large and unreadable.
-->
***
# Project Proposal {#proposal}
**`r hint_2`**
* Get a sense for the requirements of this project phase by reading a possible [example project proposal](static/term_project/proposal_example.html){target="_blank"} `r submission_will_differ`:
* Download the following <a href="static/term_project/proposal.Rmd" download>`proposal.Rmd`</a> template R Markdown file and start filling it in.
## Work on your proposal
Your data may require some wrangling to get it in the appropriate format. Given that this is not a class where data wrangling is a central focus, we suggest you check out the following resources:
1. The Spinelli Center evening tutoring hours on Sun-Thurs 7-9pm. All student tutors there have taken SDS 192 Data Science and can thus help you get the data into the appropriate format.
1. Be sure to look at our [data wrangling "Tips & Tricks"](https://moderndive.com/C-appendixC.html#data-wrangling){target="_blank"} ModernDive page written by Jenny Smetzer. It's based on the seven most common data wrangling questions we've encountered from students while they were working on their term projects:
```{r, echo=FALSE, out.width="70%"}
include_graphics("static/images/data_wrangling.png")
```
## What to submit
`r only_group_leader`
1. The `proposal.Rmd` R Markdown file
1. The `proposal.html` HTML report file
1. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.
***
# Project Submission {#submission}
* Get a sense for the requirements of this project phase by reading only the following sections of this possible [project resubmission](static/term_project/resubmission_example.html){target="_blank"} `r submission_will_differ`:
+ Section 1: Introduction
+ Section 2: Exploratory data analysis
+ Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression: Methods, Model Results, Interpreting the regression table.
* Download the following <a href="static/term_project/submission_example.Rmd" download>`project_submission_example.Rmd`</a> template R Markdown file and start filling it in.
## Complete your initial submission
* Complete the following sections of `project_submission.Rmd`:
+ Section 1: Introduction
+ Section 2: Exploratory data analysis
+ Section 3 subsections 3.1, 3.2, and 3.3: Multiple linear regression: Methods, Model Results, Interpreting the regression table.
* Do not complete the following sections (you'll be doing this at the resubmission phase):
+ Section 3 subsections 3.4, 3.5: Inference for multiple regression
+ Section 4: Discussion. You will write this conclusion based on the results of sections 3.4 and 3.5.
## What to submit
`r only_group_leader`
1. The `project_submission.Rmd` R Markdown file with sections 1, 2, 3.1, 3.2, and 3.3 completed
1. The `project_submission.html` HTML report file.
1. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.
***
# Project Resubmission {#resubmission}
Get a sense for the requirements of this project phase by re-reading all the sections of the possible [project resubmission](static/term_project/resubmission_example.html){target="_blank"} from the previous submission phase. In particular, read the following new sections:
* Sections 3.4, 3.5: Inference for multiple regression
* Section 4: Discussion.
## Revise your initial submission
Using the same `project_submission.Rmd` file you submitted for the project submission phase:
* Incorporate any feedback given to you from the project submission phase.
* Complete Sections 3.4 and 3.5: Inference for multiple regression
* Complete Section 4: Discussion. You will write this conclusion based on the results of sections 3.4 and 3.5.
## What to submit
`r only_group_leader`
1. The updated `project_submission.Rmd` R Markdown file.
1. The updated `project_submission.html` HTML report file.
1. Only if your data is confidential or sensitive in nature, submit your Excel or CSV file as well. Otherwise you should publish your data as a CSV file on Google Sheets as described above.