
Commit 4df5e78

committed
add test stata page
1 parent 29e0b04 commit 4df5e78

3 files changed

+178
-0
lines changed

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "code",
5+
"execution_count": null,
6+
"id": "fa5b3317-2e4d-40cb-a2df-5ee34de3cb04",
7+
"metadata": {},
8+
"outputs": [],
9+
"source": []
10+
}
11+
],
12+
"metadata": {
13+
"kernelspec": {
14+
"display_name": "Stata",
15+
"language": "stata",
16+
"name": "stata"
17+
},
18+
"language_info": {
19+
"codemirror_mode": "stata",
20+
"file_extension": ".do",
21+
"mimetype": "text/x-stata",
22+
"name": "stata",
23+
"version": "15.1"
24+
}
25+
},
26+
"nbformat": 4,
27+
"nbformat_minor": 5
28+
}

docs/Untitled.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
---
2+
layout: default
3+
title: "Test Stata Page"
4+
nav_order: 6
5+
parent: MCS
6+
format: docusaurus-md
7+
---
8+
9+
```stata
10+
sysuse auto, clear
11+
```
12+
13+
(1978 automobile data)
14+
15+
16+
17+
```stata
18+
sum mpg
19+
```
20+
21+
22+
Variable | Obs Mean Std. dev. Min Max
23+
-------------+---------------------------------------------------------
24+
mpg | 74 21.2973 5.785503 12 41
25+
26+
27+
Hopefully this works
28+
29+
30+
```stata
31+
32+
```
Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
---
2+
layout: default
3+
title: "Reshaping Data from Long to Wide (or Wide to Long)"
4+
nav_order: 6
5+
parent: MCS
6+
format: docusaurus-md
7+
---
8+
9+
# Introduction
10+
11+
In this section, we show how to reshape data from long to wide (and vice versa). We do this for both raw and cleaned data. To demonstrate, we use data on cohort members' height and weight collected in Sweeps 3-7.
12+
13+
The packages we use are:
14+
15+
```{r}
16+
#| warning: false
17+
# Load Packages
18+
library(tidyverse) # For data manipulation
19+
library(haven) # For importing .dta files
20+
library(glue) # For creating strings
21+
```
22+
23+
```{r}
24+
#| include: false
25+
# setwd(Sys.getenv("mcs_fld"))
26+
```
27+
28+
# Reshaping Raw Data from Wide to Long
29+
30+
We begin by loading the data from each sweep and merging these together into a single wide format data frame; see [Combining Data Across Sweeps](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html) for further explanation on how this is achieved. Note that the names of the height and weight variables in Sweep 5 (`ECHTCMA0` and `ECWTCMA0`) diverge slightly from the convention used for other sweeps (`[C-G]CHTCM00` and `[C-G]CWTCM00`, where `[C-G]` denotes sweep), hence the need for the complex regular expression in the `read_dta(col_select = ...)` function call.[^1] To make the column names in the wide dataset consistent (useful preparation for reshaping data), we rename the Sweep 5 variables so that they follow the convention used for the other sweeps.
31+
32+
[^1]: Regular expressions are extremely useful and compact ways of working with text. See [Chapter 15 in the R for Data Science textbook](https://r4ds.hadley.nz/regexps.html) for more information. ChatGPT and similar services are very useful for writing and interpreting regular expressions.
33+
34+
```{r}
35+
fups <- c(0, 3, 5, 7, 11, 14, 17)
36+
37+
load_height_wide <- function(sweep){
38+
fup <- fups[sweep]
39+
prefix <- LETTERS[sweep]
40+
41+
glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
42+
read_dta(col_select = c("MCSID", matches("^.(CNUM00|C(H|W)TCM(A|0)0)"))) %>%
43+
rename(CNUM00 = matches("CNUM00"))
44+
}
45+
46+
df_wide <- map(3:7, load_height_wide) %>%
47+
reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00"))) %>%
48+
rename(ECHTCM00 = ECHTCMA0, ECWTCM00 = ECWTCMA0)
49+
50+
df_wide
51+
```
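The column selection and merge steps above can be tried on toy inputs without the MCS files. The sketch below is illustrative only: the column names in `toy_cols` (including the made-up `DCXXXX00`) and the values in `toy_sweeps` are assumptions, not taken from the real data.

```{r}
library(tidyverse)

# Hypothetical column names mimicking Sweeps 4 and 5 (not read from real files)
toy_cols <- c("MCSID", "DCNUM00", "DCHTCM00", "DCWTCM00", "DCXXXX00",
              "ECNUM00", "ECHTCMA0", "ECWTCMA0")

# Same regular expression as in load_height_wide(): any first character,
# then either CNUM00 or a height/weight name ending in "00" or "A0"
pattern <- "^.(CNUM00|C(H|W)TCM(A|0)0)"
toy_matched <- str_subset(toy_cols, pattern)
toy_matched  # keeps the six CNUM/height/weight names; MCSID is selected separately

# Toy per-sweep data frames, merged pairwise with reduce() and full_join()
toy_sweeps <- list(
  tibble(MCSID = c("M1", "M2"), CNUM00 = c(1, 1), CCHTCM00 = c(95, 97)),
  tibble(MCSID = c("M1", "M3"), CNUM00 = c(1, 1), DCHTCM00 = c(110, 108))
)

toy_merged <- reduce(toy_sweeps, ~ full_join(.x, .y, by = c("MCSID", "CNUM00")))
toy_merged  # one row per cohort member seen in either sweep, NA where absent
```

Because `full_join()` keeps unmatched rows from both sides, a cohort member who appears in only one sweep still gets a row, with `NA` in the columns from the sweep they missed.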
52+
53+
`df_wide` has 12 columns. Besides the identifiers, `MCSID` and `CNUM00`, there are 10 columns holding the height and weight measurements at each sweep. Each of these 10 columns is prefixed by a single letter indicating the sweep. We can reshape the dataset into long format (one row per person x sweep combination) using the `pivot_longer()` function so that the resulting data frame has five columns: two person identifiers, a variable for sweep, and variables for height and weight. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and describe the pattern the existing column names take using the `names_pattern` argument. For `names_pattern` we specify `"(.)(.*)"`, which breaks the column name into two pieces: the first character (`"(.)"`) and the rest of the name (`"(.*)"`). `names_pattern` uses regular expressions: `.` matches any single character, and `.*` extends this to match zero or more characters. As noted, the first character holds information on sweep; in the reshaped data frame this character is stored as a value in a new column, `sweep`. `.value` is a placeholder for the new columns in the reshaped data frame that store the values from the columns selected by `cols`; these new columns are named using the second piece from `names_pattern`, in this case `CHTCM00` (height) and `CWTCM00` (weight).
54+
55+
```{r}
56+
#| warning: false
57+
df_long <- df_wide %>%
58+
pivot_longer(cols = matches("C(H|W)TCM00"),
59+
names_to = c("sweep", ".value"),
60+
names_pattern = "(.)(.*)")
61+
62+
df_long
63+
```
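To see what `names_pattern = "(.)(.*)"` does without loading the MCS files, here is a minimal sketch on a toy data frame. The object `toy_wide` and its values are made up for illustration; only the column naming convention follows the text above.

```{r}
library(tidyverse)

# Toy wide data: two cohort members, two sweeps (C and D); values are invented
toy_wide <- tibble(
  MCSID = c("M1", "M2"),
  CNUM00 = c(1, 1),
  CCHTCM00 = c(95, 97), CCWTCM00 = c(14, 15),
  DCHTCM00 = c(110, 112), DCWTCM00 = c(19, 20)
)

toy_long <- toy_wide %>%
  pivot_longer(cols = matches("C(H|W)TCM00"),
               names_to = c("sweep", ".value"),
               names_pattern = "(.)(.*)")

toy_long
# Columns: MCSID, CNUM00, sweep, CHTCM00, CWTCM00
# Rows: one per person x sweep combination (4 in total)
```

The first capture group fills `sweep` with `"C"` or `"D"`, while the second group (`CHTCM00` or `CWTCM00`) becomes the name of a value column.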
64+
65+
# Reshaping Raw Data from Long to Wide
66+
67+
We can also reshape the data from long to wide format using the `pivot_wider()` function. In this case, we want to create two new columns for each sweep: one for height and one for weight. We specify the columns holding the values to be spread out using the `values_from` argument, name the column whose values will label the new columns in the `names_from` argument, and use the `names_glue` argument to specify the convention to follow for the new column names. The `names_glue` argument uses curly braces (`{}`) to reference the columns given in `names_from` and the special `.value` placeholder. As we are specifying multiple columns in `values_from`, `.value` stands for the name of each variable selected in `values_from`.
68+
69+
```{r}
70+
df_long %>%
71+
pivot_wider(names_from = sweep,
72+
values_from = matches("C(W|H)T"),
73+
names_glue = "{sweep}{.value}")
74+
```
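As a rough illustration of how `names_glue` builds the new column names, the sketch below applies the same call to a small, made-up long data frame (`toy_long_raw` and its values are assumptions for demonstration only).

```{r}
library(tidyverse)

# Toy long data: one row per person x sweep; values are invented
toy_long_raw <- tibble(
  MCSID = c("M1", "M1", "M2", "M2"),
  CNUM00 = c(1, 1, 1, 1),
  sweep = c("C", "D", "C", "D"),
  CHTCM00 = c(95, 110, 97, 112),
  CWTCM00 = c(14, 19, 15, 20)
)

toy_wide_raw <- toy_long_raw %>%
  pivot_wider(names_from = sweep,
              values_from = c(CHTCM00, CWTCM00),
              names_glue = "{sweep}{.value}")

# New columns paste the sweep letter in front of each value column name,
# e.g. sweep "C" and CHTCM00 give CCHTCM00
names(toy_wide_raw)
```

Each person now occupies a single row, and the sweep letter has moved from a column value back into the column names.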
75+
76+
# Reshaping Cleaned Data from Long to Wide
77+
78+
It is likely that you will need to reshape not just raw data, but cleaned data too. In the next two sections we offer advice on naming variables so that they are easy to select and reshape in long or wide formats. First, we clean the long dataset by converting the `CNUM00` and `sweep` columns to integers (storing the former as `cnum`), creating a new column for follow-up time (`fup`), and creating new `height` and `weight` variables that replace negative values in the raw height and weight data with `NA` (as well as giving these variables names that are easier to understand).
79+
80+
```{r}
81+
df_long_clean <- df_long %>%
82+
mutate(cnum = as.integer(CNUM00),
83+
sweep = match(sweep, LETTERS),
84+
fup = fups[sweep],
85+
height = ifelse(CHTCM00 > 0, CHTCM00, NA),
86+
weight = ifelse(CWTCM00 > 0, CWTCM00, NA)) %>%
87+
select(MCSID, cnum, fup, height, weight)
88+
```
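The individual cleaning steps can be checked on a few made-up rows. In this sketch, `toy_raw` and its values are assumptions (with `-1` standing in for a negative missing-value code); `fups` is the vector defined earlier in the page.

```{r}
library(tidyverse)

fups <- c(0, 3, 5, 7, 11, 14, 17)  # follow-up time for Sweeps 1-7, as defined above

# Toy rows; -1 stands in for a negative missing-value code
toy_raw <- tibble(sweep = c("C", "D", "E"),
                  CHTCM00 = c(95, -1, 140))

toy_clean <- toy_raw %>%
  mutate(sweep = match(sweep, LETTERS),               # "C" -> 3, "D" -> 4, "E" -> 5
         fup = fups[sweep],                           # sweep 3 -> 5, 4 -> 7, 5 -> 11
         height = ifelse(CHTCM00 > 0, CHTCM00, NA))   # negative code -> NA

toy_clean
```

`match(sweep, LETTERS)` returns the position of each letter in the alphabet, which is why the sweep letter doubles as an index into `fups`.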
89+
90+
To reshape the clean data from long to wide format, we can use the `pivot_wider()` function as before. This time, we take the labels for the new columns from `fup` using the `names_from` argument, specify the columns holding the values (`height` and `weight`) in the `values_from` argument, and use the `names_glue` argument to specify the new column names. The `names_glue` argument uses curly braces (`{}`) to reference the column given in `names_from` and the special `.value` placeholder. As we are specifying multiple columns in `values_from`, `.value` stands for the name of each of these variables.
91+
92+
```{r}
93+
df_wide_clean <- df_long_clean %>%
94+
mutate(fup = ifelse(fup < 10, glue("0{fup}"), as.character(fup))) %>%
95+
pivot_wider(names_from = fup,
96+
values_from = c(height, weight),
97+
names_glue = "{.value}_{fup}")
98+
99+
df_wide_clean
100+
```
101+
102+
Notice that prior to reshaping, we convert the `fup` variable to a string and ensure it has two characters (`5` becomes `05`). The reason for including this step is to make the names of similar variables the same length. This consistency makes it simpler to subset variables either by name (e.g., `select(matches("^height_\\d\\d$"))`) or by numerical range (e.g., `select(matches("^(h|w)eight_1[1-4]$"))`).[^2]
103+
104+
[^2]: In regular expressions, `^` and `$` are special characters that match the beginning and end of a string, respectively. `"^height_\\d\\d$"` matches any string that begins "height\_", immediately followed by two digits (0, 1, ..., 9) that end the string. `"^(h|w)eight_1[1-4]$"` matches any string that begins "height\_" or "weight\_", immediately followed by 11, 12, 13, or 14, which ends the string (`[1-4]` is a compact way of matching any one of the digits 1, 2, 3, or 4).
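The two selection patterns can be tried on a one-row toy data frame. This is only a sketch: `toy_named` and its values are made up, with column names following the zero-padded `{.value}_{fup}` convention described above.

```{r}
library(tidyverse)

# Toy wide data with zero-padded follow-up suffixes; values are invented
toy_named <- tibble(
  MCSID = "M1", cnum = 1L,
  height_05 = 110, height_07 = 120, height_11 = 140, height_14 = 160, height_17 = 170,
  weight_05 = 19, weight_07 = 23, weight_11 = 35, weight_14 = 50, weight_17 = 60
)

# Select by name: every height column
toy_heights <- toy_named %>% select(matches("^height_\\d\\d$"))
names(toy_heights)

# Select by numerical range: height and weight at follow-ups 11 to 14
toy_teens <- toy_named %>% select(matches("^(h|w)eight_1[1-4]$"))
names(toy_teens)
```

Without zero padding, the height columns would need a pattern that allows either one or two digits, and a range spanning single- and double-digit follow-ups would be more awkward to express.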
105+
106+
# Reshaping Cleaned Data from Wide to Long
107+
108+
Finally, we can reshape the clean wide dataset back to long format using the `pivot_longer()` function. We specify the columns to be reshaped using the `cols` argument, provide the new variable names in the `names_to` argument, and describe the pattern the existing column names take using the `names_pattern` argument. For `names_pattern` we specify `"(.*)_(\\d\\d)$"`, which breaks the column name into two pieces: the variable name (`"(.*)"`) and the two-digit follow-up time (`"(\\d\\d)"`). We also use the `names_transform` argument to convert the follow-up time to an integer.
109+
110+
```{r}
111+
df_wide_clean %>%
112+
pivot_longer(cols = matches("_\\d\\d$"),
113+
names_to = c(".value", "fup"),
114+
names_pattern = "(.*)_(\\d\\d)$",
115+
names_transform = list(fup = as.integer))
116+
```
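The effect of `names_transform` is easiest to see on a small example. In the sketch below, `toy_wide_clean` and its values are assumptions for illustration; the `pivot_longer()` call mirrors the one above.

```{r}
library(tidyverse)

# Toy cleaned wide data; values are invented
toy_wide_clean <- tibble(
  MCSID = c("M1", "M2"),
  cnum = c(1L, 1L),
  height_05 = c(110, 112), height_11 = c(140, NA),
  weight_05 = c(19, 20), weight_11 = c(35, 36)
)

toy_long_clean <- toy_wide_clean %>%
  pivot_longer(cols = matches("_\\d\\d$"),
               names_to = c(".value", "fup"),
               names_pattern = "(.*)_(\\d\\d)$",
               names_transform = list(fup = as.integer))

toy_long_clean
# fup is stored as an integer (5, 11) rather than the strings "05" and "11"
```

Without `names_transform`, `fup` would remain a character column holding the zero-padded suffixes.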
117+
118+
# Footnotes
