Visualising Data Practice
library(readr)
df <- read_csv(
'../code/dsbc_lectures/analysis/course_exemplar1.csv',
skip=1 # column names are on second line of table
)
# rename the columns using the clunky base R syntax.
# sadly can't rename all columns by name
# with plyr as there are duplicate column names
names(df)[1:19] <- c(
"num", "age", "sex", "group",
"pain_pre", "pain_3", "pain_12", "pain_24", "pain_2wk",
"mov_12", "mov_24", "mov_2wk",
"par", "cod", "mor", "other",
"days", "rat", "satisfaction"
)
# all text to lower case
# then factor columns where appropriate
library(dplyr)
text_cols <- c(
"sex",
"group",
"par",
"cod",
"mor",
"other",
"rat",
"satisfaction"
)
fac_cols <- c(
"sex",
"group"
)
# across() replaces the deprecated mutate_each_() and funs()
df <- df %>%
mutate(across(all_of(text_cols), tolower)) %>%
mutate(across(all_of(fac_cols), factor))
# define satisfaction factor levels
df$satisfaction <- factor(
df$satisfaction,
levels = c(
'poor',
'satisfactory',
'good',
'excellent'
),
ordered=TRUE
)
# get rid of units in the drug columns and
# convert data type to numeric
# sub() strips a trailing unit; unlike strsplit() it returns a plain
# character vector, so as.numeric() cannot fail on a list
df <- df %>%
mutate(
par = as.numeric(sub("\\s*g\\s*$", "", par)),
cod = as.numeric(sub("\\s*mg\\s*$", "", cod)),
mor = as.numeric(sub("\\s*mg\\s*$", "", mor))
)
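To see what the unit-stripping step does, here is a minimal sketch on a made-up vector of dose strings (the values are illustrative, not taken from the course data). It uses base R's sub() to drop a trailing unit, which is one possible approach:

```r
# Toy dose strings for illustration only; not the course data
doses <- c("4g", "2.5g", NA, "0g")

# Remove a trailing "g", then convert what is left to numeric.
# Missing values stay missing.
dose_num <- as.numeric(sub("g$", "", doses))
dose_num
```

If a value doesn't end in the expected unit, as.numeric() returns NA with a warning, which is a handy signal that the column needs a closer look.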
I haven't tackled the 'other' column yet, but using strsplit() within mutate(), then chaining to unnest(), is probably the way to start.
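As a rough sketch of that idea, the code below splits a multi-valued text column into one row per entry. Everything about the toy data is an assumption: the ";" delimiter, the placeholder drug names, and the doses are invented for illustration, and the real 'other' column may well be laid out differently. It also assumes the tidyr package is installed.

```r
library(dplyr)
library(tidyr)

# Toy data: delimiter and entries are hypothetical
toy <- tibble(
  num = 1:3,
  other = c("drug_a 10mg; drug_b 5mg", "drug_a 10mg", NA)
)

toy_long <- toy %>%
  # strsplit() turns each cell into a character vector (a list-column)
  mutate(other = strsplit(other, ";\\s*")) %>%
  # unnest() expands the list-column to one row per entry
  unnest(other)
toy_long
```

Patients with several entries now occupy several rows, so the 'num' column is needed to tie the rows back to each patient.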
We can make plots just using 'base' R (i.e. without using any extra libraries), but it's easier to make prettier, more complex plots using the ggplot2 package. If you haven't installed ggplot2 already, let's do it now:
install.packages("ggplot2")
We'll also install ggthemes, which makes it easy to customise the look of our visualisations.
install.packages("ggthemes")
As with any R packages, ggplot2 and ggthemes need to be loaded before we can start using them, like so:
library(ggplot2)
library(ggthemes)
Note that, unlike when using install.packages(), you don't need to use quotation marks around the package name when using library().
There are two ways to make plots with ggplot2: using the qplot() function (which is simpler and quicker), and using the ggplot() function (which is more flexible and fully-featured). Let's explore some associations between variables by making some simple scatterplots with qplot().
I wonder if there's an association between patient age and days spent in hospital? We might expect that older people would, on average, need to stay in a bit longer.
This code simply says: plot data from the column called 'days' on the x axis, and data from the column called 'age' on the y axis. Look for those columns in the data frame called 'df'.
qplot(days, age, data=df)
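For comparison, the same kind of scatterplot can be written with ggplot(). This is worth knowing because qplot() has been marked as deprecated since ggplot2 3.4.0. The sketch below uses a small made-up data frame, toy, in place of df so that it runs on its own:

```r
library(ggplot2)

# Toy data standing in for df; the values are made up
toy <- data.frame(
  days = c(2, 3, 5, 4),
  age = c(34, 61, 47, 72)
)

# aes() maps columns to axes; geom_point() draws one dot per row
p <- ggplot(toy, aes(x = days, y = age)) +
  geom_point()
p
```

The mapping (which column goes on which axis) and the geometry (points) are stated separately here, which is what makes ggplot() more flexible than qplot().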
Surprisingly, there doesn't seem to be an obvious association there. Maybe demonstrate lack of linear association using stat_smooth(method="lm") here? Though perhaps this opens up a can of worms that we want to avoid?
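A minimal sketch of the stat_smooth(method="lm") idea floated above, again on made-up data rather than df. It overlays a straight-line fit and its confidence band on the scatterplot; a roughly flat line with a wide band suggests little linear association.

```r
library(ggplot2)

# Toy data for illustration only
toy <- data.frame(
  days = c(2, 3, 5, 4, 6, 3),
  age = c(34, 61, 47, 72, 55, 40)
)

p_lm <- ggplot(toy, aes(x = days, y = age)) +
  geom_point() +
  # Fit age ~ days by ordinary least squares and draw the fitted line
  stat_smooth(method = "lm", formula = y ~ x)
p_lm
```

This only checks for a straight-line relationship; a curved or clustered pattern could still be present.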
Let's try something else: the relationship between pain score at 24-28 hours post-surgery, and cumulative post-surgery paracetamol dose.
This time, we'll store our plot as a variable. You'll see why this is useful in a second.
p1 <- qplot(pain_24, par, data=df)
p1
OK, this makes sense. For people with no pain at 24 hours, there's lots of variability in the cumulative paracetamol dose administered (probably some of them didn't have much pain in the first place, and some of them had their pain well controlled with decent doses of analgesia). For people with lots of pain at 24 hours, there's far less variability (all of them would have required maximum doses of paracetamol as background analgesia).
There's a limitation of our scatterplot though: we can't see how many patients are represented by each dot. We can fix this by using the stat_sum() function, making the size of each dot proportional to the number of patients it represents.
Note how we can simply 'add on' stat_sum() to our existing plot here.
p2 <- p1 + stat_sum()
p2
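To get a feel for what stat_sum() is counting, we can tabulate the overlapping points ourselves. The sketch below uses dplyr's count() on a made-up data frame (the values are illustrative only):

```r
library(dplyr)

# Toy data: several patients share the same (pain_24, par) pair
toy <- data.frame(
  pain_24 = c(0, 0, 0, 2, 2, 5),
  par = c(4, 4, 8, 8, 8, 12)
)

# One row per unique combination, with n = number of patients
counts <- count(toy, pain_24, par)
counts
```

Each row of counts corresponds to one dot in the plot, and n is the quantity that the dot's size represents.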
Let's increase the maximum size of the dots, to make the differences in size easier to see.
p3 <- p2 + scale_size_area(max_size=15)
p3
You'll see that R automatically labels the axes using the names of the columns in the data frame, but we can customise those labels if we like.
p4 <- p3 +
xlab("Pain score at 24-28 hours post-surgery") +
ylab("Cumulative post-surgery paracetamol (g)")
p4
Let's remake our plot from scratch, but this time we'll get it to display patient satisfaction as well. We can encode this information using colour.
p5 <- qplot(pain_24, par, data=df, color=satisfaction) +
stat_sum() +
scale_size_area(max_size=15) +
xlab("Pain score at 24-28 hours post-surgery") +
ylab("Cumulative post-surgery paracetamol (g)")
p5
OK, but R has used a default colour scheme which isn't very intuitive here. It would probably be easier to understand the plot if we used a 'traffic light' diverging palette, where increasing satisfaction is indicated by red through to yellow then green.
We can use the ColorBrewer website to pick out a colour scheme that's just to our liking. Note the name of the palette you choose (here "RdYlGn", which runs from red through yellow to green) and pass it to scale_colour_brewer() like so:
p6 <- ggplot(df, aes(x=pain_24, y=par, color=satisfaction)) +
stat_sum() +
scale_size_area(max_size=15) +
xlab("Pain score at 24-28 hours post-surgery") +
ylab("Cumulative post-surgery paracetamol (g)") +
scale_colour_brewer(palette="RdYlGn", na.value="grey70")
p6
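If you'd like to see the actual colours behind a palette name, the RColorBrewer package can list them. This sketch assumes RColorBrewer is installed (it is normally pulled in alongside ggplot2) and asks for four colours, one per satisfaction level:

```r
library(RColorBrewer)

# Four colours from the red-yellow-green palette, as hex codes
sat_cols <- brewer.pal(4, "RdYlGn")

# Label them with the satisfaction levels, lowest to highest
names(sat_cols) <- c("poor", "satisfactory", "good", "excellent")
sat_cols
```

Running display.brewer.pal(4, "RdYlGn") draws the same colours as swatches, which is a quick way to preview a palette before using it in a plot.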
That's quite visually busy now, so let's simplify things by applying a minimal theme to the plot.
p7 <- p6 + theme_few()
p7
And let's save a high-resolution PDF version of our plot.
ggsave("plot.pdf", plot=p7, path="my_folder/")
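A slightly fuller sketch of saving a plot, under a few assumptions: it builds a throwaway plot from made-up data, writes into a temporary directory so that it is safe to run anywhere, and picks an arbitrary 8 x 6 inch page size. Swap in your own plot object and folder.

```r
library(ggplot2)

# Throwaway plot from toy data, standing in for the real plot
toy <- data.frame(x = 1:3, y = c(2, 4, 3))
p <- ggplot(toy, aes(x, y)) + geom_point()

# Make sure the output folder exists before saving into it
out_dir <- file.path(tempdir(), "my_folder")
dir.create(out_dir, showWarnings = FALSE)

# Name the plot explicitly rather than relying on the last one displayed
ggsave("plot.pdf", plot = p, path = out_dir,
       width = 8, height = 6, units = "in")
```

Passing the plot explicitly avoids saving the wrong figure if something else was displayed in between.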
Please contact Steve Harris if you have any questions.