|  | 
|  | 1 | +--- | 
|  | 2 | +layout: default | 
|  | 3 | +title: Reshaping Data from Long to Wide (or Wide to Long) | 
|  | 4 | +nav_order: 4 | 
|  | 5 | +parent: NCDS | 
|  | 6 | +format: docusaurus-md | 
|  | 7 | +--- | 
|  | 8 | + | 
|  | 9 | + | 
|  | 10 | + | 
|  | 11 | + | 
|  | 12 | +# Introduction | 
|  | 13 | + | 
|  | 14 | +In this section, we show how to reshape data from long to wide (and vice | 
|  | 15 | +versa). To demonstrate, we use data from Sweeps 4 (23y) and 8 (50y) on | 
|  | 16 | +cohort members’ height and weight. | 
|  | 17 | + | 
|  | 18 | +The packages we use are: | 
|  | 19 | + | 
|  | 20 | +```r | 
|  | 21 | +# Load Packages | 
|  | 22 | +library(tidyverse) # For data manipulation | 
|  | 23 | +library(haven) # For importing .dta files | 
|  | 24 | +``` | 
|  | 25 | + | 
|  | 26 | +# Reshaping Raw Data from Wide to Long | 
|  | 27 | + | 
|  | 28 | +We begin by loading the data from each sweep and merging these together | 
|  | 29 | +into a single wide format data frame; see [Combining Data Across | 
|  | 30 | +Sweeps](https://cls-data.github.io/docs/ncds-merging_across_sweeps.html) | 
|  | 31 | +for further explanation on how this is achieved. Note that the names of the | 
|  | 32 | +height and weight variables in Sweep 4 and Sweep 8 follow a similar | 
|  | 33 | +convention, which is the exception rather than the rule in NCDS data. | 
|  | 34 | +Below, we convert the variable names in the Sweep 4 data frame to upper | 
|  | 35 | +case so that they closely match those in the Sweep 8 data frame. This | 
|  | 36 | +will make reshaping easier. | 
|  | 37 | + | 
|  | 38 | +```r | 
|  | 39 | +df_23y <- read_dta("23y/ncds4.dta", | 
|  | 40 | +                   col_select = c("ncdsid", "dvwt23", "dvht23")) %>% | 
|  | 41 | +  rename_with(str_to_upper) | 
|  | 42 | + | 
|  | 43 | +df_50y <- read_dta("50y/ncds_2008_followup.dta", | 
|  | 44 | +                   col_select = c("NCDSID", "DVWT50", "DVHT50")) | 
|  | 45 | + | 
|  | 46 | +df_wide <- df_23y %>% | 
|  | 47 | +  full_join(df_50y, by = "NCDSID") | 
|  | 48 | +``` | 
|  | 49 | + | 
|  | 50 | +`df_wide` has 5 columns. Besides the identifier, `NCDSID`, there are 4 | 
|  | 51 | +columns for height and weight measurements at each sweep. Each of these | 
|  | 52 | +4 columns is suffixed by two numbers indicating the age at assessment. | 
|  | 53 | +We can reshape the dataset into long format (one row per person x sweep | 
|  | 54 | +combination) using the `pivot_longer()` function so that the resulting | 
|  | 55 | +data frame has four columns: one person identifier, a variable for age | 
|  | 56 | +of assessment (`fup`), and variables for height and weight. We specify | 
|  | 57 | +the columns to be reshaped using the `cols` argument, provide the new | 
|  | 58 | +variable names in the `names_to` argument, and the pattern the existing | 
|  | 59 | +column names take using the `names_pattern` argument. For | 
|  | 60 | +`names_pattern` we specify `"^(.*)(\\d\\d)$"`, which breaks the column | 
|  | 61 | +name into two pieces: the first characters (`"(.*)"`) and two digits at | 
|  | 62 | +the end of the name (`"(\\d\\d)$"`). `names_pattern` uses regular | 
|  | 63 | +expressions: `.` matches any single character, and `.*` extends this to | 
|  | 64 | +match zero or more characters; `\\d` is a character class denoting a | 
|  | 65 | +digit. As noted, the final two digits hold information on age at | 
|  | 66 | +assessment; in the reshaped data frame they are stored as values | 
|  | 67 | +in a new column `fup`. `.value` is a placeholder for the new | 
|  | 68 | +columns in the reshaped data frame that store the values from the | 
|  | 69 | +columns selected by `cols`; these new columns are named using the first | 
|  | 70 | +piece from `names_pattern` - in this case `DVHT` (height) and `DVWT` | 
|  | 71 | +(weight). | 
|  | 72 | + | 
|  | 73 | +```r | 
|  | 74 | +df_long <- df_wide %>% | 
|  | 75 | +  pivot_longer(cols = matches("DV(HT|WT)\\d\\d"), | 
|  | 76 | +               names_to = c(".value", "fup"), | 
|  | 77 | +               names_pattern = "^(.*)(\\d\\d)$") | 
|  | 78 | + | 
|  | 79 | +df_long | 
|  | 80 | +``` | 
|  | 81 | + | 
|  | 82 | +``` text | 
|  | 83 | +# A tibble: 28,028 × 4 | 
|  | 84 | +   NCDSID  fup   DVHT      DVWT      | 
|  | 85 | +   <chr>   <chr> <dbl+lbl> <dbl+lbl> | 
|  | 86 | + 1 N10001N 23     1.63     59.4      | 
|  | 87 | + 2 N10001N 50    NA        66.7      | 
|  | 88 | + 3 N10002P 23     1.90     73.5      | 
|  | 89 | + 4 N10002P 50    NA        79.4      | 
|  | 90 | + 5 N10004R 23     1.65     76.2      | 
|  | 91 | + 6 N10004R 50    NA        NA        | 
|  | 92 | + 7 N10007U 23     1.63     52.2      | 
|  | 93 | + 8 N10007U 50    NA        72.1      | 
|  | 94 | + 9 N10009W 23     1.73     66.7      | 
|  | 95 | +10 N10009W 50     1.7      78        | 
|  | 96 | +# ℹ 28,018 more rows | 
|  | 97 | +``` | 
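|  |  | + | 
|  |  | +As a quick check of how `names_pattern` splits the column names, the | 
|  |  | +sketch below applies the same regular expression with | 
|  |  | +`stringr::str_match()`. The `cols` and `parts` objects are illustrative | 
|  |  | +names and not part of the workflow above. | 
|  |  | + | 
|  |  | +```r | 
|  |  | +library(stringr) # Also attached by library(tidyverse) | 
|  |  | + | 
|  |  | +# Illustrative vector of the four column names being reshaped | 
|  |  | +cols <- c("DVHT23", "DVWT23", "DVHT50", "DVWT50") | 
|  |  | + | 
|  |  | +# Column 1 holds the full match; columns 2 and 3 hold the captured groups | 
|  |  | +parts <- str_match(cols, "^(.*)(\\d\\d)$") | 
|  |  | + | 
|  |  | +parts[, 2] # First piece: supplies the new column names via `.value` | 
|  |  | +parts[, 3] # Second piece: supplies the values stored in `fup` | 
|  |  | +``` | 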
|  | 98 | + | 
|  | 99 | +# Reshaping Raw Data from Long to Wide | 
|  | 100 | + | 
|  | 101 | +We can also reshape the data from long to wide format using the | 
|  | 102 | +`pivot_wider()` function. In this case, we want to create two new | 
|  | 103 | +columns for each sweep: one for height and one for weight. We specify | 
|  | 104 | +the columns holding the values using the `values_from` argument, give the | 
|  | 105 | +column whose values label each sweep (`fup`) in the `names_from` argument, | 
|  | 106 | +and use the `names_glue` argument to set the convention for the new column | 
|  | 107 | +names. `names_glue` uses curly braces (`{}`) to reference the columns given | 
|  | 108 | +in `names_from` and the special `.value` placeholder. As we are | 
|  | 109 | +specifying multiple columns in `values_from`, `.value` stands in | 
|  | 110 | +for the names of the variables selected in `values_from`. | 
|  | 111 | + | 
|  | 112 | +```r | 
|  | 113 | +df_long %>% | 
|  | 114 | +  pivot_wider(names_from = fup, | 
|  | 115 | +              values_from = matches("DV(HT|WT)"), | 
|  | 116 | +              names_glue = "{.value}{fup}") | 
|  | 117 | +``` | 
|  | 118 | + | 
|  | 119 | +``` text | 
|  | 120 | +# A tibble: 14,014 × 5 | 
|  | 121 | +   NCDSID  DVHT23    DVHT50    DVWT23    DVWT50    | 
|  | 122 | +   <chr>   <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> | 
|  | 123 | + 1 N10001N 1.63      NA         59.4      66.7     | 
|  | 124 | + 2 N10002P 1.90      NA         73.5      79.4     | 
|  | 125 | + 3 N10004R 1.65      NA         76.2      NA       | 
|  | 126 | + 4 N10007U 1.63      NA         52.2      72.1     | 
|  | 127 | + 5 N10009W 1.73       1.7       66.7      78       | 
|  | 128 | + 6 N10011Q 1.68       1.7       63.5      95       | 
|  | 129 | + 7 N10012R 1.96      NA        114.      133.      | 
|  | 130 | + 8 N10013S 1.78      NA         83.5      95.2     | 
|  | 131 | + 9 N10014T 1.55      NA         57.2      63.5     | 
|  | 132 | +10 N10015U 1.80      NA         73.0      78       | 
|  | 133 | +# ℹ 14,004 more rows | 
|  | 134 | +``` | 
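|  |  | + | 
|  |  | +The two reshapes can also be tried without access to the NCDS files. The | 
|  |  | +sketch below is a minimal round trip on a small, made-up data frame; the | 
|  |  | +identifiers and measurements in `toy_wide` are hypothetical and only the | 
|  |  | +column names follow the convention used above. | 
|  |  | + | 
|  |  | +```r | 
|  |  | +library(tidyverse) | 
|  |  | + | 
|  |  | +# Hypothetical toy data: IDs and values are invented for illustration | 
|  |  | +toy_wide <- tribble( | 
|  |  | +  ~NCDSID, ~DVHT23, ~DVWT23, ~DVHT50, ~DVWT50, | 
|  |  | +  "A1",    1.60,    55,      1.59,    60, | 
|  |  | +  "A2",    1.80,    80,      NA,      85 | 
|  |  | +) | 
|  |  | + | 
|  |  | +# Wide to long: one row per person x sweep combination | 
|  |  | +toy_long <- toy_wide %>% | 
|  |  | +  pivot_longer(cols = matches("DV(HT|WT)\\d\\d"), | 
|  |  | +               names_to = c(".value", "fup"), | 
|  |  | +               names_pattern = "^(.*)(\\d\\d)$") | 
|  |  | + | 
|  |  | +# Long to wide: rebuild one column per measure x sweep combination | 
|  |  | +toy_back <- toy_long %>% | 
|  |  | +  pivot_wider(names_from = fup, | 
|  |  | +              values_from = matches("DV(HT|WT)"), | 
|  |  | +              names_glue = "{.value}{fup}") | 
|  |  | + | 
|  |  | +# Compare after putting the columns back in their original order | 
|  |  | +all.equal(as.data.frame(toy_back[names(toy_wide)]), | 
|  |  | +          as.data.frame(toy_wide)) | 
|  |  | +``` | 
|  |  | + | 
|  |  | +The columns are reordered before the comparison because `pivot_wider()` | 
|  |  | +groups the new columns by measure, as in the output above, so the round | 
|  |  | +trip does not necessarily preserve the original column order. | 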
|  | 135 | + | 
|  | 136 | +Note, in the original `df_wide` tibble, `DVHT23` and `DVWT23` were | 
|  | 137 | +labelled numeric vectors - this class allows users to add metadata to | 
|  | 138 | +variables (value labels, etc.). `DVHT50` and `DVWT50`, on the other | 
|  | 139 | +hand, were standard numeric vectors. When reshaping to long format, | 
|  | 140 | +multiple variables are effectively appended together. The final reshaped | 
|  | 141 | +variables can only have one set of properties. `pivot_longer()` merges | 
|  | 142 | +variables together so as to preserve variable attributes, but in some cases | 
|  | 143 | +will throw an error (where variables are of inconsistent types) or print | 
|  | 144 | +a warning (where value labels are inconsistent). Note that above, where we | 
|  | 145 | +reshaped `df_long` back to wide format, all weight and height variables | 
|  | 146 | +now have labelled numeric type. | 
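|  |  | + | 
|  |  | +One way to see which columns carry labels, and to drop them if plain | 
|  |  | +numeric vectors are preferred, is sketched below using `is.labelled()` and | 
|  |  | +`zap_labels()` from `haven`. The `toy` data frame and its "Not known" | 
|  |  | +label are hypothetical stand-ins, not the NCDS variables or their actual | 
|  |  | +value labels. | 
|  |  | + | 
|  |  | +```r | 
|  |  | +library(tidyverse) | 
|  |  | +library(haven) | 
|  |  | + | 
|  |  | +# Hypothetical toy columns: one labelled numeric, one standard numeric | 
|  |  | +toy <- tibble( | 
|  |  | +  DVHT23 = labelled(c(1.63, -1), labels = c("Not known" = -1)), | 
|  |  | +  DVHT50 = c(1.70, 1.65) | 
|  |  | +) | 
|  |  | + | 
|  |  | +# Which columns are labelled vectors? | 
|  |  | +map_lgl(toy, is.labelled) | 
|  |  | + | 
|  |  | +# Strip value labels so every column is a standard vector | 
|  |  | +toy_plain <- zap_labels(toy) | 
|  |  | +map_lgl(toy_plain, is.labelled) | 
|  |  | +``` | 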