forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
178 lines (121 loc) · 6.22 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
Activity Monitoring
========================================================
The data from a personal activity monitoring device is collected at 5 minute intervals through the day. The data consists of two months of data from an anonymous individual collected during the months of October and November, 2012, and include the number of steps taken in 5 minute intervals each day.
The data is stored in the file 'activity.csv', and includes the variables:
* **steps**: Number of steps taken in a 5-minute interval (missing values are coded as NA)
* **date**: Date on which the measurement was taken in YYYY-MM-DD format
* **interval**: Identifier for the 5 minute interval in which measurement was taken
### Loading and preprocessing the data
The data is read in to the dataframe activity as
```{r}
activity = read.csv('activity.csv')
```
The structure of the data shows that the dataset contains 17,568 observations.
```{r}
str(activity)
```
The date column, which is a factor variaable, is converted to the date class using the as.Date() function.
```{r}
activity$date = as.Date(activity$date, format = '%Y-%m-%d')
```
### The mean total number of steps taken per day
A new dataframe daily activity, containing the total number of steps taken each day is created using the plyr package.
```{r}
library(plyr)
daily_activity = ddply(activity, 'date', summarize, total.steps = sum(steps))
head(daily_activity)
```
A histogram of the total number of steps taken each day shows the following distribution.
```{r fig.width = 6, fig.height = 5}
library(lattice)
histogram(~daily_activity$total.steps,
main = 'Total number of steps taken each day',
xlab = 'Total daily steps', ylab = 'Frequency')
```
Ignoring the missing values, the mean and median total number of steps taken per day are
```{r}
mean(daily_activity$total.steps, na.rm = TRUE)
```
and
```{r}
median(daily_activity$total.steps, na.rm = TRUE)
```
respectively.
### The average daily activity pattern
A dataframe average_activity, containing the average number of steps taken, averaged across all days, for each 5 minute interval can be created, again using the plyr package. The missing values are ignored.
```{r}
average_activity = ddply(activity, 'interval', summarize,
average.steps = mean(steps, na.rm = TRUE))
```
A time series plot shows the the average number of steps taken for each 5 minute interval
```{r fig.width = 6, fig.height = 5}
xyplot(average_activity$average.steps ~ average_activity$interval, type = 'l',
main = 'Average number of steps taken for each 5 minute interval',
xlab = '5 minute interval', ylab = 'average number of steps')
```
On average across all the days in the dataset, the 5 minute interval that contains the maximum number of steps is
```{r}
average_activity$interval[which(average_activity$average.steps == max(average_activity$average.steps))]
```
and it contains
```{r}
max(average_activity$average.steps)
```
number of steps.
### Inputing missing values
There are a number of days/intervals where there are missing values denoted as NA. These missing values may introduce bias into some calculations or summaries of the data.
The total number of missing values in the dataset are
```{r}
length(which(is.na(activity$steps)))
```
The missing values can be filled in with the mean number of steps for that 5 minute interval. First, a new dataframe activity_f is created with a variable ave.steps that contains the average number of steps for each 5 minute interval.
```{r}
activity_f = ddply(activity, 'interval', transform, ave.steps = mean(steps, na.rm = TRUE))
```
The missing values in the steps column is then replaced with the value of the ave.steps column, the ave.steps column is removed, and the new dataframe is reordered according to the date as in the original activity dataframe.
```{r}
missing = which(is.na(activity_f$steps))
activity_f$steps[missing] = activity_f$ave.steps[missing]
activity_f = activity_f[1:dim(activity_f)[2]-1]
activity_f = ddply(activity_f, 'date')
str(activity_f)
head(activity_f)
```
The total number of steps taken each day now has the distribution shown in the figure.
```{r}
daily_activity_f = ddply(activity_f, 'date', summarize, total.steps = sum(steps))
```
```{r fig.width = 6, fig.height = 5}
histogram(~ daily_activity_f$total.steps,
main = 'Total number of steps taken each day with missing values filled in',
xlab = 'Total daily steps', ylab = 'Frequency')
```
The mean and median total number of steps taken per day are
```{r}
mean(daily_activity_f$total.steps)
```
and
```{r}
median(daily_activity_f$total.steps)
```
respectively.
Note that the mean and median total number of steps taken per day is virtually identical to those when the missing values were simply ignored. This is since the missing values were filled in with the average values over the dataset. Imputing missing data with the mean values over the dataset does not affect the mean.
### Difference in activity pattern bewteen weekdays and weekends
In order to compare the activity pattern between weekdays and weekends, a new factor variable is created in the dataset with two levels - 'weekday' and 'weekend' - indicating whether a given date is a weekday or weekend day.
```{r}
activity_f$day <- factor(weekdays(activity_f$date) %in% c('Saturday', 'Sunday'),
labels = c('weekday', 'weekend'))
head(activity_f)
```
A dataframe, average_activity_f, containing the average number of steps taken for each 5 minute interval across all weekdays or weekend days is created using the ddply() function. The dataset with the filled-in missing values are used.
```{r}
average_activity_f = ddply(activity_f, c('interval', 'day'), summarize,
average.steps = mean(steps, na.rm = TRUE))
head(average_activity_f)
```
A time series plot shows the the average number of steps taken for each 5 minute interval for weekdays and weekend days, showing that the distribution of activity is more even on weekend days than on weekdays.
```{r fig.width = 6, fig.height = 6}
xyplot(average_activity_f$average.steps ~ average_activity_f$interval | average_activity_f$day,
xlab = '5 minute interval', ylab = 'average number of steps', type = 'l',
layout = c(1, 2))
```