-
Notifications
You must be signed in to change notification settings - Fork 1
/
hw6-data-visualization.Rmd
111 lines (75 loc) · 6.81 KB
/
hw6-data-visualization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
---
title: "Homework 6 - Data Visualization & Communication"
output:
html_document:
number_sections: false
toc: no
---
**Due**: Sunday, 01-Dec. at 8pm
**Rules**: This entire assignment is **SOLO**. You may not work with other classmates, though you may consult instructors for help.
**Overview**: For this assignment, you will need to find a dataset of your choosing and create a short report that contains a description of the dataset as well as a few key insights about the data, including at least two summary visualizations. You will publish your report online at http://rpubs.com/. Before beginning the assignment, be sure to have reviewed the lessons on
[Data Frames](L10-da1-data-frames.html), [Data Wrangling](L11-da2-data-wrangling.html), [Data Visualization](L12-da3-data-visualization.Rmd) and [Communicating Information](L13-da4-communicating-information.Rmd).
---
# Mini data analysis project
### 1) Choose a dataset [5 pts]
You will need to select a dataset for your project. To keep things manageable, you must choose one of the following datasets from the following libraries. **Note that to load any of these data frames, all you need to do is load the library**:
**dplyr**:
- `storms`
- `starwars`
**ggplot2**:
- `diamonds`
- `economics`
- `midwest`
- `mpg`
- `msleep`
- `txhousing`
[**dslabs**](https://github.com/rafalab/dslabs):
- `brexit_polls`
- `gapminder`
- `greenhouse_gases`
- `historic_co2`
- `movielens`
- `murders`
- `nyc_regents_scores`
- `olive`
- `polls_us_election_2016`
- `reported_heights`
- `research_funding_rates`
- `results_us_election_2016`
- `stars`
- `temp_carbon`
- `trump_tweets`
- `us_contagious_diseases`
### 2) Setup your analysis environment [10 pts]
Once you've chosen a dataset, start setting up your analysis environment by following these steps:
1. Open RStudio and create a new project called "hw6-lastName", replacing "lastName" with your last name.
2. Download the [hw6.Rmd template script](https://github.com/emse6574-gwu/2019-Fall/raw/gh-pages/hw_templates/hw6.Rmd) and place it in RStudio project folder you just created.
3. In the "yaml header" at the top of this file, fill out your name, and give your project a title. Your title must include the dataset name and the package from which the dataset is sourced. For example, if you are using the `admissions` dataset in the `dslabs` package, an example title might be: "Summary of the `admissions` dataset from the `dslabs` package"
4. Create a `.R` file and save it in your project folder as `explore.R` - we'll use this file to explore your dataset later.
### 3) Inspect your data [10 pts]
Now that your environment is set up, open up your `explore.R` file and begin exploring your dataset. Be sure to load the library that contains the dataset.
Write some code to preview and summarize the dataset using some of the methods we've seen in class and in the lessons on [data frames](L10-da1-data-frames.html) and [data wrangling](L11-da2-data-wrangling.html). You should be able to quickly get an understanding of what variables are included in the data frame and their nature. Consider the following questions in your exploration:
- What type of data is each variable?
- What is the total size of the data frame?
- What are the boundaries of each period of observation (e.g. what time period do the observations in the data frame span)?
- Do any variables have missing values, and if so which ones have more NAs than others? Why might that be the case?
**Do not brush this step off** - the more thoroughly you inspect your dataset, the easier (and better) you data exploration will be. This will be important for step 4, and absolutely critical for step 5. Make sure you take the time to develop an understanding of the variables in your dataset as it is nearly impossible to imagine what summary plots might be worth creating otherwise.
### 4) Document and summarize your dataset [15 pts]
Open up your "hw6.Rmd" file and write a paragraph in the section labeled `# Data description` describing of your dataset. Include the following:
1. A brief overview of the subject matter of the dataset. What is this dataset about?
2. A description of the data _source_. Read the package help files to identify as much information as possible about how the data were collected (when, where, by whom, etc.) and whether the data have already been pre-processed or not. If the data come from a published paper, include a citation to that paper.
3. A summary table of each variable in the dataset. Each row should be a variable, and the columns should be the variable name and a short description of the variable (the "hw6.Rmd" template already has this table started for you).
### 5) Make plots [30 pts (15 per plot)]
Now that you have a basic understanding of the dataset, make some plots to explore the variables in the data and their potential relationships. You may use base R plotting functions or the **ggplot2** library to make your figures, but you must make at least two figures, including:
1. A scatterplot of involving at least two variables.
2. A histogram or barplot involving at least one variable.
You can choose to plot whichever variables you wish, but you must be able to interpret the results of your plot. I recommend that you first make your plots in your `explore.R` file to iterate on your code until the plots are in their final form you desire. Then copy the code for your final plots over to the appropriate code chunks in your "hw6.Rmd" file.
### 6) Interpret your plots [15 pts]
Below each figure, write a description and interpretation of your plot. Make sure you address at least the following questions:
1. Describe what variables you are plotting and why.
2. Describe the primary relationship / trend / information you hope the reader will gain from your visualization.
### 7) Publish your writeup [10 pts]
Once you have completed your analysis, compile the "hw6.Rmd" file by opening it in RStudio and clicking the "knit" button. In the upper-right of the window that opens showing your compiled hw6.html file, click on the "Publish" button and publish your file to http://rpubs.com/. You should create an account with rpubs if you have not already. Once published, you can update it by making edits to your "hw6.Rmd" file, clicking the "knit" button again, and then clicking the "Republish" button in the upper-right of the RStudio preview window.
For a visual overview of the publishing process, refer to the instructions in the [youtube video](https://emse6574-gwu.github.io/2019-Fall/L13-da4-communicating-information.html#34_publish_to_the_web) in the lesson on Communicating Information.
### 8) Submit your files on Blackboard [5 pts]
After publishing your "hw6.Rmd" file, create a zip file of all files in your R project folder for this assignment and submit the zip file on Blackboard by the due deadline. **Include a link to your published report on rpubs in your Blackboard submission**