|
| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: "Reshaping Data from Long to Wide (or Wide to Long)" |
| 4 | +nav_order: 6 |
| 5 | +parent: MCS |
| 6 | +format: docusaurus-md |
| 7 | +--- |
| 8 | + |
| 9 | +# Introduction |
| 10 | + |
| 11 | +In this section, we show how to reshape data from long to wide (and vice versa). We do this for both raw and cleaned data. To demonstrate, we use data on cohort member's height and weight collected in Sweeps 3-7. |
| 12 | + |
| 13 | +The packages we use are: |
| 14 | + |
| 15 | +```{r} |
| 16 | +#| warning: false |
| 17 | +# Load Packages |
| 18 | +library(tidyverse) # For data manipulation |
| 19 | +library(haven) # For importing .dta files |
| 20 | +library(glue) # For creating strings |
| 21 | +``` |
| 22 | + |
| 23 | +```{r} |
| 24 | +#| include: false |
| 25 | +# setwd(Sys.getenv("mcs_fld")) |
| 26 | +``` |
| 27 | + |
| 28 | +# Reshaping Raw Data from Wide to Long |
| 29 | + |
| 30 | +We begin by loading the data from each sweep and merging these together into a single wide format data frame; see [Combining Data Across Sweeps](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html) for further explanation on how this is achieved. Note, the names of the height and weight variables in Sweep 5 (`ECHTCMA0` and `ECWTCMAO`) diverge slightly from the convention used for other sweeps (`[C-G]CHTCM00` and `[C-G]CWTCM00` where `[C-G]` denotes sweep), hence the need for the complex regular expression in `read_dta(col_select = ...)` function call.[^1] To make the names of the columns in the wide dataset consistent (useful preparation for reshaping data), we rename the Sweep 5 variables so they follow the convention for the other sweeps. |
| 31 | + |
| 32 | +[^1]: Regular expressions are extremely useful and compact ways of working with text. See [Chapter 15 in the R for Data Science textbook](https://r4ds.hadley.nz/regexps.html) for more information. ChatGPT and similar services are very useful for writing and interpreting regular expressions. |
| 33 | + |
| 34 | +```{r} |
| 35 | +fups <- c(0, 3, 5, 7, 11, 14, 17) |
| 36 | +
|
| 37 | +load_height_wide <- function(sweep){ |
| 38 | + fup <- fups[sweep] |
| 39 | + prefix <- LETTERS[sweep] |
| 40 | + |
| 41 | + glue("{fup}y/mcs{sweep}_cm_interview.dta") %>% |
| 42 | + read_dta(col_select = c("MCSID", matches("^.(CNUM00|C(H|W)TCM(A|0)0)"))) %>% |
| 43 | + rename(CNUM00 = matches("CNUM00")) |
| 44 | +} |
| 45 | +
|
| 46 | +df_wide <- map(3:7, load_height_wide) %>% |
| 47 | + reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00"))) %>% |
| 48 | + rename(ECHTCM00 = ECHTCMA0, ECWTCMA00 = ECWTCMA0) |
| 49 | +
|
| 50 | +df_wide |
| 51 | +``` |
| 52 | + |
| 53 | +`df_wide` has 12 columns. Besides, the identifiers, `MCSID` and `cnum`, there are 10 columns for height and weight measurements at each sweep. Each of these 10 columns is prefixed by a single letter indicating the sweep. We can reshape the dataset into long format (one row per person x sweep combination) using the `pivot_longer()` function so that the resulting data frame has five columns: two person identifiers, a variable for sweep, and variables for height and weight. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and the pattern the existing column names take using the `names_pattern` argument. For `names_pattern` we specify `"(.)(.*)"`, which breaks the column name into two pieces: the first character (`"(.)"`) and the rest of the name (`"(.*)"`). `names_pattern` uses regular expressions. `.` matches single characters, and `.*` modifies this to make zero or more characters. As noted, the first character holds information on sweep; in the reshaped data frame the character is stored as a value in a new column `sweep`. `.value` is a placeholder for the new columns in the reshaped data frame that store the values from the columns selected by `cols`; these new columns are named using the second piece from `names_pattern` - in this case `CHTCM00` (height) and `CWTCM00` (weight). |
| 54 | + |
| 55 | +```{r} |
| 56 | +#| warning: false |
| 57 | +df_long <- df_wide %>% |
| 58 | + pivot_longer(cols = matches("C(H|W)TCM00"), |
| 59 | + names_to = c("sweep", ".value"), |
| 60 | + names_pattern = "(.)(.*)") |
| 61 | +
|
| 62 | +df_long |
| 63 | +``` |
| 64 | + |
| 65 | +# Reshaping Raw Data from Long to Wide |
| 66 | + |
| 67 | +We can also reshape the data from long to wide format using the `pivot_wider()` function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns to be reshaped using the `values_from` argument, provide the old column names in the `names_from` argument, and use the `names_glue` argument to specify the convention to follow for the new column names. The `names_glue` argument uses curly braces (`{}`) to reference the values from the `names_from` and `.value` arguments. As we are specifying multiple columns in `values_from`, `.value` is a placeholder for the names of the variables selected in `values_from`. |
| 68 | + |
| 69 | +```{r} |
| 70 | +df_long %>% |
| 71 | + pivot_wider(names_from = sweep, |
| 72 | + values_from = matches("C(W|H)T"), |
| 73 | + names_glue = "{sweep}{.value}") |
| 74 | +``` |
| 75 | + |
| 76 | +# Reshaping Cleaned Data from Long to Wide |
| 77 | + |
| 78 | +It is likely that you will not just need to reshape raw data, but cleaned data too. In the next two sections we offer advice on naming variables so that they are easy to select and reshape in long or wide formats. First, we clean the long dataset by converting the `cnum` and `sweep` columns to integers, creating a new column for follow-up time, and creating new `height` and `weight` variables that replace negative values in the raw height and weight data with `NA` (as well as giving these variables more easy-to-understand names). |
| 79 | + |
| 80 | +```{r} |
| 81 | +df_long_clean <- df_long %>% |
| 82 | + mutate(cnum = as.integer(CNUM00), |
| 83 | + sweep = match(sweep, LETTERS), |
| 84 | + fup = fups[sweep], |
| 85 | + height = ifelse(CHTCM00 > 0, CHTCM00, NA), |
| 86 | + weight = ifelse(CWTCM00 > 0, CWTCM00, NA)) %>% |
| 87 | + select(MCSID, cnum, fup, height, weight) |
| 88 | +``` |
| 89 | + |
| 90 | +To reshape the clean data from long to wide format, we can use the `pivot_wider()` function as before. This time, we specify the columns to be reshaped using the `names_from` argument, provide the new column names in the `values_from` argument, and use the `names_glue` argument to specify the new column names. The `names_glue` argument uses curly braces (`{}`) to reference the values from the `names_from` and `.value` arguments. As we are specifying multiple columns in `values_from`, `.value` is a placeholder for the variable name. |
| 91 | + |
| 92 | +```{r} |
| 93 | +df_wide_clean <- df_long_clean %>% |
| 94 | + mutate(fup = ifelse(fup < 10, glue("0{fup}"), as.character(fup))) %>% |
| 95 | + pivot_wider(names_from = fup, |
| 96 | + values_from = c(height, weight), |
| 97 | + names_glue = "{.value}_{fup}") |
| 98 | +
|
| 99 | +df_wide_clean |
| 100 | +``` |
| 101 | + |
| 102 | +Notice that prior to reshaping, we convert the `fup` variable to a string and ensure it has two characters (`5` becomes `05`). The reason for including this step is to make the names of similar variables the same length. This consistency makes it simpler to subset variables either by name (e.g., `select(matches("^height_\d\d$"))`) or by numerical range (e.g., `select(matches("^(h|w)eight_1[1-4]$"))`).[^2] |
| 103 | + |
| 104 | +[^2]: In regular expressions, `^` and `$` are special characters that match the beginning and end of a string, respectively. `"^height_\\d\\d$"` matches any string that begins "height\_", immediately followed by two digits (0, 1, ..., 9) that end the string. `"^(h|w)eight_1[1-4]"` matches any string that begins height or weight, immediately followed by 11, 12, 13, or 14 (`[1-4]` is a compact way of matching the integer range 1, 2, 3 or 4). |
| 105 | + |
| 106 | +# Reshaping Cleaned Data from Wide to Long |
| 107 | + |
| 108 | +Finally, we can reshape the clean wide dataset back to long format using the `pivot_longer()` function. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and the pattern the existing column names take using the `names_pattern` argument. For `names_pattern` we specify `"(.*)_(.*)y"`, which breaks the column name into two pieces: the variable name (`"(.*)"`), and the follow-up time (`"(.*)y"`). We also use the `names_transform` argument to convert the follow-up time to an integer. |
| 109 | + |
| 110 | +```{r} |
| 111 | +df_wide_clean %>% |
| 112 | + pivot_longer(cols = matches("_\\d\\d$"), |
| 113 | + names_to = c(".value", "fup"), |
| 114 | + names_pattern = "(.*)_(\\d\\d)$", |
| 115 | + names_transform = list(fup = as.integer)) |
| 116 | +``` |
| 117 | + |
| 118 | +# Footnotes |
0 commit comments