|  | 1 | +--- | 
|  | 2 | +layout: default | 
|  | 3 | +title: Data Discovery | 
|  | 4 | +nav_order: 1 | 
|  | 5 | +parent: NCDS | 
|  | 6 | +format: docusaurus-md | 
|  | 7 | +--- | 
|  | 8 | + | 
|  | 9 | + | 
|  | 10 | + | 
|  | 11 | + | 
|  | 12 | +# Introduction | 
|  | 13 | + | 
|  | 14 | +In this section, we show a few `R` functions for exploring NCDS data; as | 
|  | 15 | +noted, historical sweeps of the NCDS did not use modern metadata | 
|  | 16 | +standards, so finding a specific variable can be challenging. Variables | 
|  | 17 | +do not always have names that are descriptive or follow a consistent | 
|  | 18 | +naming convention across sweeps. (The variable for cohort member sex is | 
|  | 19 | +`N622`, for example.) In what follows, we will use the `R` functions to | 
|  | 20 | +find variables for cohort members’ height, which has been collected in | 
|  | 21 | +many of the sweeps. | 
|  | 22 | + | 
|  | 23 | +The packages we will use are: | 
|  | 24 | + | 
|  | 25 | +```r | 
|  | 26 | +# Load Packages | 
|  | 27 | +library(tidyverse) # For data manipulation | 
|  | 28 | +library(haven) # For importing .dta files | 
|  | 29 | +library(labelled) # For searching imported datasets | 
|  | 30 | +library(codebookr) # For creating .docx codebooks | 
|  | 31 | +``` | 
|  | 32 | + | 
|  | 33 | +# `labelled::lookfor()` | 
|  | 34 | + | 
|  | 35 | +The `labelled` package contains functionality for attaching and | 
|  | 36 | +examining metadata in dataframes (for instance, adding labels to | 
|  | 37 | +variables \[columns\]). Beyond this, it also contains the `lookfor()` | 
|  | 38 | +function, which mirrors the `lookfor` command in `Stata`. `lookfor()` | 
|  | 39 | +allows one to search for variables in a dataframe by keyword (regular | 
|  | 40 | +expression); the function searches variable names as well as associated | 
|  | 41 | +metadata. It returns an object containing the matching variables, their | 
|  | 42 | +labels, their types, and so on. Below, we read in the NCDS 55-year sweep | 
|  | 43 | +dataset which contains derived variables (`55y/ncds_2013_derived.dta`) | 
|  | 44 | +and use `lookfor()` to search for variables related to `"height"`. | 
|  | 45 | + | 
|  | 46 | +```r | 
|  | 47 | +ncds_55y <- read_dta("55y/ncds_2013_derived.dta") | 
|  | 48 | + | 
|  | 49 | +lookfor(ncds_55y, "height") | 
|  | 50 | +``` | 
|  | 51 | + | 
|  | 52 | +``` text | 
|  | 53 | + pos variable label                      col_type missing values              | 
|  | 54 | + 46  ND9HGHTM (Derived) Height in metres dbl+lbl  0       [-8] No information | 
|  | 55 | +``` | 
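As a minimal sketch of why searching the metadata matters, the toy data frame below (hypothetical, not NCDS data) has undescriptive column names, and only the label mentions height. The names `toy`, `v001`, `v002` and `toy_hits`, and the labels themselves, are invented for illustration.

```r
library(labelled)

# Toy data (hypothetical, not NCDS): the names alone say nothing about content
toy <- data.frame(v001 = c(1, 2), v002 = c(1.21, 1.18))
var_label(toy) <- list(v001 = "Sex of cohort member",
                       v002 = "Height in metres")

# The keyword is absent from the names, so any match must come from the label
toy_hits <- lookfor(toy, "height")
toy_hits$variable
```

A search restricted to column names would miss `v002` here, which is the situation described above for variables such as `N622`.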
|  | 56 | + | 
|  | 57 | +Users may consider it easier to create a tibble of the `lookfor()` | 
|  | 58 | +output, which can be searched and filtered using `dplyr` functions. | 
|  | 59 | +Below, we create a `tibble` (a type of `data.frame` with good printing | 
|  | 60 | +defaults) of the `lookfor()` output and use `filter()` to find variables | 
|  | 61 | +with `"height"` in their labels. Note, we convert both the variable | 
|  | 62 | +names and labels to lower case to make the search case insensitive. | 
|  | 63 | + | 
|  | 64 | +```r | 
|  | 65 | +ncds_55y_lookfor <- lookfor(ncds_55y) %>% | 
|  | 66 | +  as_tibble() %>% | 
|  | 67 | +  mutate(variable_low = str_to_lower(variable), | 
|  | 68 | +         label_low = str_to_lower(label)) | 
|  | 69 | + | 
|  | 70 | +ncds_55y_lookfor %>% | 
|  | 71 | +  filter(str_detect(label_low, "height")) | 
|  | 72 | +``` | 
|  | 73 | + | 
|  | 74 | +``` text | 
|  | 75 | +# A tibble: 1 × 9 | 
|  | 76 | +    pos variable label         col_type missing levels value_labels variable_low | 
|  | 77 | +  <int> <chr>    <chr>         <chr>      <int> <name> <named list> <chr>        | 
|  | 78 | +1    46 ND9HGHTM (Derived) He… dbl+lbl        0 <NULL> <dbl [1]>    nd9hghtm     | 
|  | 79 | +# ℹ 1 more variable: label_low <chr> | 
|  | 80 | +``` | 
|  | 81 | + | 
|  | 82 | +# `codebookr::codebook()` | 
|  | 83 | + | 
|  | 84 | +The NCDS datasets that are downloadable from the UK Data Service come | 
|  | 85 | +bundled with data dictionaries within the `mrdoc` subfolder. However, | 
|  | 86 | +these are limited in some ways. The `codebookr` package enables the | 
|  | 87 | +creation of data dictionaries that are more customisable and, in our | 
|  | 88 | +opinion, easier to read. Below we create a codebook for the NCDS 55-year | 
|  | 89 | +sweep derived variable dataset. These codebooks are intended to be saved | 
|  | 90 | +and viewed in Microsoft Word. | 
|  | 91 | + | 
|  | 92 | +```r | 
|  | 93 | +cdb <- codebook(ncds_55y) | 
|  | 94 | +print(cdb, "ncds_55y_codebook.docx") # Saves as .docx (Word) file | 
|  | 95 | +``` | 
|  | 96 | + | 
|  | 97 | +A screenshot of the codebook is shown below. | 
|  | 98 | + | 
|  | 99 | +<figure> | 
|  | 100 | +<img src="../images/ncds-data_discovery.png" | 
|  | 101 | +alt="Codebook created by codebookr::codebook()" /> | 
|  | 102 | +<figcaption aria-hidden="true">Codebook created by | 
|  | 103 | +codebookr::codebook()</figcaption> | 
|  | 104 | +</figure> | 
|  | 105 | + | 
|  | 106 | +# Create a Lookup Table Across All Datasets | 
|  | 107 | + | 
|  | 108 | +Running `lookfor()` and `codebook()` one dataset at a time does not | 
|  | 109 | +give a quick overview of the variables available across the NCDS, or of | 
|  | 110 | +the sweeps in which repeatedly measured characteristics are | 
|  | 111 | +available. Below we create a `tibble`, `df_lookfor`, that contains | 
|  | 112 | +`lookfor()` results for all the `.dta` files in the NCDS folder. | 
|  | 113 | + | 
|  | 114 | +To do this, we create a function, `create_lookfor()`, that takes a file | 
|  | 115 | +path to a `.dta` file, reads in the first row of the dataset (faster | 
|  | 116 | +than reading the full dataset), and applies `lookfor()` to it. We call | 
|  | 117 | +this function inside a `mutate()` call to create a set of lookups | 
|  | 118 | +for every `.dta` file we can find in the NCDS folder. `map()` loops over | 
|  | 119 | +every value in the `file_path` column, creating a corresponding lookup | 
|  | 120 | +table for that file, stored as a | 
|  | 121 | +[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns). | 
|  | 122 | +`unnest()` expands the results out, so rather than have one row per | 
|  | 123 | +`file_path`, we have one row per variable. | 
|  | 124 | + | 
|  | 125 | +```r | 
|  | 126 | +create_lookfor <- function(file_path){ | 
|  | 127 | +  read_dta(file_path, n_max = 1) %>% | 
|  | 128 | +    lookfor() %>% | 
|  | 129 | +    as_tibble() | 
|  | 130 | +} | 
|  | 131 | + | 
|  | 132 | +df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>% | 
|  | 133 | +  filter(!str_detect(file_path, "^UKDS")) %>% | 
|  | 134 | +  mutate(lookfor = map(file_path, create_lookfor)) %>% | 
|  | 135 | +  unnest(lookfor) %>% | 
|  | 136 | +  mutate(variable_low = str_to_lower(variable), | 
|  | 137 | +         label_low = str_to_lower(label)) %>% | 
|  | 138 | +  separate(file_path,  | 
|  | 139 | +           into = c("sweep", "file"),  | 
|  | 140 | +           sep = "/",  | 
|  | 141 | +           remove = FALSE) %>%  | 
|  | 142 | +  relocate(file_path, pos, .after = last_col()) | 
|  | 143 | +``` | 
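The list-column and `unnest()` steps can be sketched on made-up inputs without reading any files. This is an illustration only: `fake_lookfor()`, the two file paths, and the variable names are hypothetical stand-ins, not part of the NCDS or of the code above.

```r
library(tidyverse)

# Hypothetical stand-in for create_lookfor(): no files are read here
fake_lookfor <- function(file_path) {
  tibble(variable = c("id", "ht"),
         label = c("Identifier", "Height in metres"))
}

# One row per (made-up) file, with a tibble stored in each cell of `lookfor`
nested <- tibble(file_path = c("7y/a.dta", "11y/b.dta")) %>%
  mutate(lookfor = map(file_path, fake_lookfor))

# unnest() expands each stored tibble: 2 files x 2 variables = 4 rows
long <- nested %>%
  unnest(lookfor) %>%
  separate(file_path, into = c("sweep", "file"), sep = "/", remove = FALSE)
```

Here `nested` has one row per file, while `long` has one row per variable, with the sweep folder split out into its own column.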
|  | 144 | + | 
|  | 145 | +We can use the resulting object to search for variables with `"height"` | 
|  | 146 | +in their labels. | 
|  | 147 | + | 
|  | 148 | +```r | 
|  | 149 | +df_lookfor %>% | 
|  | 150 | +  filter(str_detect(label_low, "height")) %>% | 
|  | 151 | +  select(file, variable, label) | 
|  | 152 | +``` | 
|  | 153 | + | 
|  | 154 | +``` text | 
|  | 155 | +# A tibble: 77 × 3 | 
|  | 156 | +   file         variable label                                    | 
|  | 157 | +   <chr>        <chr>    <chr>                                    | 
|  | 158 | + 1 ncds0123.dta n510     0 Height of mum in inches at chlds brth  | 
|  | 159 | + 2 ncds0123.dta n332     1M Childs height, no shoes-nearest inch  | 
|  | 160 | + 3 ncds0123.dta n334     1M Childs height,no shoes-to centimeter  | 
|  | 161 | + 4 ncds0123.dta n1199    2P Father's height in inches             | 
|  | 162 | + 5 ncds0123.dta n1205    2P Mothers height in inches              | 
|  | 163 | + 6 ncds0123.dta n1510    2M Childs height no shoes,socks- inches  | 
|  | 164 | + 7 ncds0123.dta n1511    2M Fractions of an inch in childs height | 
|  | 165 | + 8 ncds0123.dta n1949    3M Child's height,in bare feet,in cms    | 
|  | 166 | + 9 ncds0123.dta dvht07   1D Height in metres at 7 years           | 
|  | 167 | +10 ncds0123.dta dvht11   2D Height in metres at 11 years          | 
|  | 168 | +# ℹ 67 more rows | 
|  | 169 | +``` | 