Commit 0b42a70

committed
start adding notes from consultations
1 parent 7b5d72a commit 0b42a70

12 files changed: +11078 -0 lines changed

complex_reshaping_example.html

Lines changed: 3412 additions & 0 deletions
Large diffs are not rendered by default.

complex_reshaping_example.qmd

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
---
title: "Complicated reshaping example"
author: "Clay Ford"
date: "February 15, 2024"
format:
  html:
    embed-resources: true
---

## The task

The data set in question had six different sets of variables that needed to be reshaped into "long" format. Each set was collected over five waves.

```{r echo=FALSE}
library(haven)
d <- read_sav("SupportandERI.sav")
# reorder columns
vars <- c("MEI_affirm", "w2MEI_affirm", "w3MEI_affirm", "w4MEI_affirm",
          "w5MEI_affirm", "MEI_explor", "w2MEI_explor", "w3MEI_explor",
          "w4MEI_explor", "w5MEI_explor")
d <- d[,c(names(d)[1:6], vars, names(d)[17:36])]
# give the two wave-1 columns a "w1" prefix so every name follows the same pattern
names(d)[c(7, 12)] <- c("w1MEI_affirm", "w1MEI_explor")
```

For example, here's one set:

```{r}
d[1:2,7:11]
```

There are five more sets like that in the data. For each set, I needed to reshape the data so that there was one column for time and one column, named for the measure, holding its values. Something like this...

```
time  affirm
1     3
2     2
3     2.33
4     3
5     3
1     4.33
2     4
...
```

## The original approach

My first attempt was pretty verbose but it worked. You can tell my mastery of `pivot_longer()` did not extend much beyond the basics. Yes, I knew how to use the `names_transform` argument to extract the numbers from the column labels, but that was about it.

```{r}
library(tidyr)

# reshape affirm
d1 <- pivot_longer(d[,1:11], -c(1:6), names_to = "time",
                   values_to = "affirm",
                   names_transform = readr::parse_number)

# reshape explor
d2 <- pivot_longer(d[,c(1, 12:16)], -1, names_to = "time",
                   values_to = "explor",
                   names_transform = readr::parse_number)

# reshape appSUM
d3 <- pivot_longer(d[,c(1, 17:21)], -1, names_to = "time",
                   values_to = "appSUM",
                   names_transform = readr::parse_number)

# reshape emoSUM
d4 <- pivot_longer(d[,c(1, 22:26)], -1, names_to = "time",
                   values_to = "emoSUM",
                   names_transform = readr::parse_number)

# reshape infoSUM
d5 <- pivot_longer(d[,c(1, 27:31)], -1, names_to = "time",
                   values_to = "infoSUM",
                   names_transform = readr::parse_number)

# reshape racematch
d6 <- pivot_longer(d[,c(1, 32:36)], -1, names_to = "time",
                   values_to = "racematch",
                   names_transform = readr::parse_number)

# merge reshaped dataframes into one
d_long1 <- merge(d1, d2, by = c("P_ID", "time")) |>
  merge(d3, by = c("P_ID", "time")) |>
  merge(d4, by = c("P_ID", "time")) |>
  merge(d5, by = c("P_ID", "time")) |>
  merge(d6, by = c("P_ID", "time"))

head(d_long1, n = 5)
```

This did the trick and I was able to move on, but it didn't sit right with me. I thought I remembered Jacob demonstrating in [his data wrangling workshop](https://virginia.box.com/s/u3ojzf0c0xuxe13yohk8noi4di8zb4ri) how to reshape datasets similar to this using just _one call_ to `pivot_longer()`. So today I decided to revisit this code and see what I could do with it.
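
For reference, the per-set pattern from the first attempt can be tried on a small made-up data frame. This is only a sketch: the `toy` data frame, its `id` column, and its values are invented for illustration and are not the consultation data. Only `pivot_longer()`, `readr::parse_number()`, and `merge()` are taken from the code above.

```r
library(tidyr)

# Toy wide data (hypothetical): two measures collected over three waves
toy <- data.frame(
  id = 1:2,
  w1MEI_affirm = c(3, 4.33), w2MEI_affirm = c(2, 4),   w3MEI_affirm = c(2.33, 4),
  w1MEI_explor = c(2, 3),    w2MEI_explor = c(2.5, 3), w3MEI_explor = c(3, 3.5)
)

# Reshape each set on its own; parse_number() turns "w1MEI_affirm" into 1
a <- pivot_longer(toy[, c(1, 2:4)], -1, names_to = "time",
                  values_to = "affirm",
                  names_transform = readr::parse_number)
e <- pivot_longer(toy[, c(1, 5:7)], -1, names_to = "time",
                  values_to = "explor",
                  names_transform = readr::parse_number)

# Join the long pieces on the id and time keys
toy_long1 <- merge(a, e, by = c("id", "time"))
toy_long1
```

Each extra set of columns costs one more `pivot_longer()` call and one more `merge()`, which is where the verbosity comes from.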

## The elegant approach

After some fiddling around with the `pivot_longer()` function and reading the {tidyr} "Pivoting" vignette, I was finally able to implement this in a much more elegant fashion. (I should have referred to Jacob's workshop materials, but I didn't have them handy and decided to see if I could figure it out on my own. Of course, once I did it, I then had to go download his materials and check if this approach is what he demonstrated, and it is!)

```{r}
d_long2 <- pivot_longer(d, cols = !c(1:6),
                        names_to = c("time", ".value"),
                        names_pattern = "(w[1-5])(.+)",
                        names_transform = list(time = readr::parse_number))
head(d_long2, n = 5)
```

This is for my benefit so I remember how this works:

- `cols = !c(1:6)` says to reshape all columns but the first six.
- `names_to = c("time", ".value")` uses the keyword `.value`, which "indicates that the corresponding component of the column name defines the name of the output column containing the cell values" (from the help page).
- `names_pattern = "(w[1-5])(.+)"` defines the pattern of the column names: the letter w followed by a number 1-5, `(w[1-5])`, and then everything else, `(.+)`.
- `names_transform = list(time = readr::parse_number)` says to convert the time column to a number using the `parse_number` function.
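
The four arguments above can be seen working together on a small made-up data frame. This is a sketch under assumptions: the `toy` data frame, its `id` column, and its values are invented for illustration. The base R `regexec()` call is there only to show how the pattern splits one column name; it is not a claim about how {tidyr} does the matching internally.

```r
library(tidyr)

# Toy wide data (hypothetical names and values): two measures, two waves
toy <- data.frame(
  id = 1:2,
  w1MEI_affirm = c(3, 4.33), w2MEI_affirm = c(2, 4),
  w1MEI_explor = c(2, 3),    w2MEI_explor = c(2.5, 3)
)

# Illustration only: the full match, then capture group 1 (the wave),
# then capture group 2 (the rest of the name)
parts <- regmatches("w2MEI_explor", regexec("(w[1-5])(.+)", "w2MEI_explor"))[[1]]
parts

# One call: group 1 feeds the "time" column, group 2 names the value columns
toy_long2 <- pivot_longer(toy, cols = !id,
                          names_to = c("time", ".value"),
                          names_pattern = "(w[1-5])(.+)",
                          names_transform = list(time = readr::parse_number))
toy_long2
```

Note that with this pattern the value columns keep the whole remainder of the original name, so they come out as `MEI_affirm` and `MEI_explor` rather than `affirm` and `explor`.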

In the grand scheme, I'm not sure this makes a difference. Most people care about the analysis rather than the data wrangling. Although my first attempt was inefficient, it was easy for me to verify that each set of columns was successfully reshaped. I don't know if I'll ever get to a place where reshaping data like this with a single call to `pivot_longer()` becomes routine for me. But I thought it might be of interest to see what `pivot_longer()` is capable of.
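
One way to make the single-call version about as easy to verify as the first attempt is to check the long result against the wide data it came from. The following is a minimal sketch on made-up data: the `toy` data frame, its `id` column, and the specific checks are assumptions for illustration, not something done in the original consultation.

```r
library(tidyr)

# Toy wide data (hypothetical names and values): two measures, two waves
toy <- data.frame(
  id = 1:2,
  w1MEI_affirm = c(3, 4.33), w2MEI_affirm = c(2, 4),
  w1MEI_explor = c(2, 3),    w2MEI_explor = c(2.5, 3)
)

toy_long <- pivot_longer(toy, cols = !id,
                         names_to = c("time", ".value"),
                         names_pattern = "(w[1-5])(.+)",
                         names_transform = list(time = readr::parse_number))

# Check 1: one row per id per wave (two waves in this toy data)
stopifnot(nrow(toy_long) == nrow(toy) * 2)

# Check 2: every long affirm value equals the wide cell it came from
ok <- mapply(function(i, t, v) {
  toy[toy$id == i, paste0("w", t, "MEI_affirm")] == v
}, toy_long$id, toy_long$time, toy_long$MEI_affirm)
all(ok)
```

The same cell-by-cell check can be repeated for each measure by swapping the column name inside `paste0()`.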
