---
title: "Summarising, Grouping and Joining Exercise"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: html_document
---
### Summarising
The first part of the exercise uses the patients dataset we've been using in previous sections of the course. After reading this into R, answer the following questions using the `summarise`, `summarise_at`, `summarise_if` and `mutate_all` functions.
```{r message = FALSE}
library(dplyr)
# as_tibble() replaces tbl_df(), which is deprecated in dplyr 1.0.0 and later
patients <- read.delim("patient-data-cleaned.txt", stringsAsFactors = FALSE) %>% as_tibble()
patients %>% head(5)
```
Compute the mean age, height and weight of patients in the `patients` dataset.

- First compute the means using `summarise`, then try to do the same using `summarise_at`.
```{r}
summarise(patients, mean(Age), mean(Height), mean(Weight))
```
```{r}
summarise_at(patients, vars(Age, Height, Weight), mean)
```
- Modify the output by adding a step to round to 1 decimal place
```{r}
patients %>%
  summarise_at(vars(Age, Height, Weight), mean) %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
  mutate_all(~ round(., digits = 1))
```
Compute the means of all numeric columns.
```{r}
summarise_if(patients, is.numeric, mean)
```
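
In dplyr 1.0.0 and later, the scoped verbs (`summarise_at`, `summarise_if`, `mutate_all`) are superseded by `across()`. The sketch below shows what the equivalent calls might look like. It uses a small made-up data frame (`toy`, with hypothetical values) rather than the real `patients` table, and assumes dplyr 1.0.0 or newer is installed.

```{r}
library(dplyr)

# Hypothetical toy data standing in for the patients table
toy <- data.frame(
  Name   = c("A", "B", "C"),
  Age    = c(40, 50, 60),
  Height = c(160, 170, 180),
  stringsAsFactors = FALSE
)

# Counterpart of summarise_at(toy, vars(Age, Height), mean)
means_at <- toy %>% summarise(across(c(Age, Height), mean))

# Counterpart of summarise_if(toy, is.numeric, mean)
means_if <- toy %>% summarise(across(where(is.numeric), mean))

means_at
means_if
```

Both calls return a single row with the columns `Age` and `Height`; the non-numeric `Name` column is dropped by `where(is.numeric)`.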
See what happens if you try to compute the mean of a logical (boolean) variable
- What proportion of our patient cohort has died?
```{r}
patients %>% summarize(mean(Died))
```
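
The mean of a logical column is a proportion because R coerces `TRUE` to 1 and `FALSE` to 0 before doing arithmetic. A minimal illustration, using a made-up vector `died` (hypothetical values, not taken from the patients data):

```{r}
# Hypothetical outcomes for five patients
died <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

as.numeric(died)  # TRUE becomes 1, FALSE becomes 0
sum(died)         # number of TRUE values
mean(died)        # proportion of TRUE values
```

If the column contained missing values, `mean()` would return `NA` unless `na.rm = TRUE` is supplied.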
### Grouping
The following questions require grouping of patients based on one or more attributes using the `group_by` function.
Compare the average height of males and females in this patient cohort.
Are smokers heavier or lighter on average than non-smokers in this dataset?
```{r}
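# Average height by sex
patients %>%
  group_by(Sex) %>%
  summarize(`Average height` = mean(Height))

# Average weight of smokers versus non-smokers
patients %>%
  group_by(Smokes) %>%
  summarize(`Average weight` = mean(Weight))

# Average weight broken down by both sex and smoking status
patients %>%
  group_by(Sex, Smokes) %>%
  summarize(`Average weight` = mean(Weight))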
```
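
When grouping by more than one variable, `summarise` returns one row per combination of group values. By default it removes only the last grouping level, so the result is still grouped by the first variable. The sketch below uses a made-up data frame (`toy`, hypothetical values) and assumes dplyr 1.0.0 or newer, where the `.groups` argument is available. It also adds a group size with `n()`, which helps when judging how much weight to give each average.

```{r}
library(dplyr)

# Hypothetical toy data standing in for the patients table
toy <- data.frame(
  Sex    = c("Male", "Male", "Female", "Female", "Female"),
  Smokes = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  Weight = c(80, 90, 60, 70, 74),
  stringsAsFactors = FALSE
)

by_sex_smokes <- toy %>%
  group_by(Sex, Smokes) %>%
  summarise(
    `Average weight` = mean(Weight),
    n = n(),            # number of rows in each group
    .groups = "drop"    # return an ungrouped result
  )

by_sex_smokes
```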
### Joining
The patients are all part of a diabetes study and have had their blood glucose concentration and diastolic blood pressure measured on several dates.
This part of the exercise combines grouping, summarisation and joining operations to connect the diabetes study data to the patients table we've already been working with.
```{r}
diabetes <- read.delim("diabetes.txt", stringsAsFactors = FALSE)
diabetes %>% head(5)
```
The goal is to compare the blood pressure of smokers and non-smokers, similar to the comparison of the average weight we made in the previous part of the exercise.
First, calculate the average blood pressure for each individual in the `diabetes` data frame.
```{r}
bp <- diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP))
bp
```
Now use one of the join functions to combine these average blood pressure measurements with the `patients` data frame containing information on whether the patient is a smoker.
```{r}
combined <- left_join(bp, patients, by = "ID")
combined
```
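
The choice of join function matters when the two tables do not contain exactly the same IDs. The sketch below contrasts `left_join`, `inner_join` and `anti_join` on two made-up data frames (`bp_toy` and `patients_toy`, with hypothetical IDs and values); it is an illustration only and says nothing about whether the real tables have unmatched IDs.

```{r}
library(dplyr)

# Hypothetical toy tables: P3 has measurements but no patient record
bp_toy <- data.frame(
  ID = c("P1", "P2", "P3"),
  BP = c(80, 90, 100),
  stringsAsFactors = FALSE
)
patients_toy <- data.frame(
  ID     = c("P1", "P2", "P4"),
  Smokes = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keeps every row of bp_toy; Smokes is NA where no patient matches
left <- left_join(bp_toy, patients_toy, by = "ID")

# Keeps only the IDs present in both tables
inner <- inner_join(bp_toy, patients_toy, by = "ID")

# Lists the rows of bp_toy that have no match in patients_toy
unmatched <- anti_join(bp_toy, patients_toy, by = "ID")

left
inner
unmatched
```

Running `anti_join` before a real join is a quick way to check whether any measurements would end up with a missing `Smokes` value.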
Finally, calculate the average blood pressure for smokers and non-smokers using the combined data frame.
```{r}
combined %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))
```
Can you write these three steps as a single dplyr chain?
```{r}
diabetes %>%
  group_by(ID) %>%
  summarize(BP = mean(BP)) %>%
  left_join(patients, by = "ID") %>%
  group_by(Smokes) %>%
  summarize(`Average blood pressure` = mean(BP))
```