---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
options(warn=-1)
options(readr.show_col_types = FALSE)
```
# interlacer <img src="man/figures/logo.svg" align="right" height="140" />
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
[![R-CMD-check](https://github.com/khusmann/interlacer/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/khusmann/interlacer/actions/workflows/check-standard.yaml)
[![codecov](https://codecov.io/gh/khusmann/interlacer/graph/badge.svg?token=R4WNWH5NXU)](https://codecov.io/gh/khusmann/interlacer)
When a value is missing in your data, sometimes you want to know *why* it is
missing. Many textual tabular data sources will encode missing reasons as
special values *interlaced* with the regular values in a column (e.g. `N/A`,
`REFUSED`, `-99`, etc.). Unfortunately, the missing reasons are
lost when these values are all converted into a single `NA` type. Working
with missing reasons in R traditionally requires loading variables
as character vectors and doing a bunch of string
comparisons and type conversions to make sense of them.
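As a rough illustration of that traditional workflow, here is a minimal base R
sketch. The `age_raw` vector and the set of missing reasons are made-up data
for illustration, not taken from any real file:

```{r}
# Hypothetical raw column, as it would arrive when loaded as character
age_raw <- c("43", "REFUSED", "27", "N/A", "OMITTED", "35")
missing_reasons <- c("REFUSED", "OMITTED", "N/A")

# String comparison to find which entries are missing reasons
is_missing <- age_raw %in% missing_reasons

# Type conversion: keep the values as numbers, and the reasons as strings
age_value <- as.numeric(ifelse(is_missing, NA, age_raw))
age_reason <- ifelse(is_missing, age_raw, NA)

mean(age_value, na.rm = TRUE)
table(age_reason)
```

Every later computation then has to carry `age_value` and `age_reason` around
as a pair and keep them in sync by hand.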
interlacer provides functions that load variables from interlaced data sources
into a special `interlaced` column type that holds values and `NA`
reasons in separate *channels* of the same variable. In most contexts, you
can treat `interlaced` columns as if they were regular values: if you take
the `mean` of an interlaced column, for example,
you get the mean of its values, without its missing reasons interfering in
the computation.
Unlike a regular column, however, the missing reasons are still
available. This means you can still filter data frames on variables
by specific missing reasons, or generate
summary statistics with
breakdowns by missing reason. In other words, you no longer have to
manually include or exclude missing reasons in every computation by filtering
them with awkward string comparisons or type conversions... everything just works!
In addition to the introduction in `vignette("interlacer")`,
be sure to also check out:
- `vignette("extended-column-types")` to see how to handle variable-level
missing reasons
- `vignette("coded-data")` for some recipes for working with coded data (e.g.
data produced by SPSS, SAS or Stata)
- `vignette("other-approaches")` for a deep dive into how interlacer's approach
compares to other approaches for representing and manipulating missing reasons
alongside data values
### ⚠️ ⚠️ ⚠️ WARNING ⚠️ ⚠️ ⚠️
This library is currently in its experimental stages, so be aware
that its
interface is quite likely to change in the future. In the meantime, please try
it out and
[let me know what you think](mailto:[email protected])!
## Installation
The easiest way to get interlacer is to install via devtools:
```{r, eval = FALSE}
install.packages("devtools") # If devtools is not already installed
devtools::install_github("khusmann/interlacer")
```
## Usage
To use interlacer, load it into your current R session:
```{r}
library(interlacer, warn.conflicts = FALSE)
```
interlacer supports the following file formats with these `read_interlaced_*()`
functions, which extend the `readr::read_*()` family of functions:
* `read_interlaced_csv()`
* `read_interlaced_tsv()`
* `read_interlaced_csv2()`
* `read_interlaced_delim()`
As a quick demo, consider the following example file bundled with interlacer:
```{r}
library(dplyr, warn.conflicts = FALSE)
library(readr)
read_file(interlacer_example("colors.csv")) |>
cat()
```
In this CSV file, values are interlaced with three possible missing reasons:
`REFUSED`, `OMITTED`, and `N/A`.
With `readr`, loading these data would result in a data frame where all missing
reasons are replaced with `NA`:
```{r}
read_csv(
interlacer_example("colors.csv"),
na = c("REFUSED", "OMITTED", "N/A")
)
```
With interlacer, missing reasons are preserved:
```{r}
(ex <- read_interlaced_csv(
interlacer_example("colors.csv"),
na = c("REFUSED", "OMITTED", "N/A")
))
```
As you can see, in the printout above each column is defined by *two* types: a type
for values, and a type for missing reasons. The `age` column, for example, has type
`double` for its values, and type `factor` for its missing reasons:
```{r}
ex$age
```
Computations automatically operate on values:
```{r}
mean(ex$age, na.rm = TRUE)
```
But the missing reasons are still there! To indicate a value should be treated
as a missing reason instead of a regular value, you can use the `na()` function.
The following, for example,
will filter the data set for all individuals that `REFUSED` to give their
favorite color:
```{r}
ex |>
filter(favorite_color == na("REFUSED"))
```
And here's a pipeline that will compute a breakdown of the mean age of
respondents for each favorite color, with separate categories for each missing
reason:
```{r}
ex |>
summarize(
mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
arrange(favorite_color)
```
But this just scratches the surface of what can be done with interlacer...
check out `vignette("interlacer")` for a more complete overview!
## Known Issues
1. Some base functions, like `base::ifelse()`, drop the missing reason channel
on interlaced types, converting them into regular vectors.
For example:
```{r}
ex |>
mutate(
favorite_color = ifelse(age < 18, na("REDACTED"), favorite_color)
)
```
This is due to a [limitation of R](https://vctrs.r-lib.org/#motivation).
If you run into this, use the tidyverse equivalent of the function. Tidyverse
functions are designed to more correctly handle type conversions.
In this example, we can use `dplyr::if_else()`:
```{r}
ex |>
mutate(
favorite_color = if_else(
age < 18,
na("REDACTED_UNDERAGE"),
favorite_color,
missing = na("REDACTED_MISSING_AGE")
)
)
```
2. Performance with large data sets
You may notice that on large datasets `interlacer` runs significantly slower
than `readr` / `vroom`. Although `interlacer` uses `vroom` under the hood to load
delimited data, it is not able to take advantage of many of its optimizations
because `vroom`
[does not currently support](https://github.com/tidyverse/vroom/issues/532)
column-level missing values. As soon as `vroom` supports column-level
missing values, I will be able to remedy this!
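The class-dropping behavior of `base::ifelse()` described in the first issue is
not specific to interlacer. Here is a minimal base R sketch, using `Date` as a
stand-in for any classed vector (the `dates` values are made up for
illustration):

```{r}
# Base R only: ifelse() keeps the underlying values but drops the class
today <- as.Date("2024-01-15")
dates <- c(today, today + 1)

dropped <- ifelse(c(TRUE, FALSE), dates, dates)

class(dates)    # "Date"
class(dropped)  # "numeric"
```

`ifelse()` builds its result from the logical `test` vector, so the attributes
of `yes` and `no` (including the class) are not carried over.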
## Related work
interlacer was inspired by the [`haven`](https://haven.tidyverse.org/),
[`labelled`](https://larmarange.github.io/labelled/), and
[`declared`](https://dusadrian.github.io/declared/) packages. These packages
provide similar functionality to interlacer, but are more focused on
providing compatibility with missing reason data imported from SPSS, SAS, and
Stata. interlacer has slightly different aims:
1. Be fully generic: Add a missing value channel to *any* vector type.
2. Provide functions for reading / writing interlaced CSV files (not just SPSS
/ SAS / Stata files)
3. Provide a functional API that integrates well into tidy pipelines
Future versions of interlacer will provide functions to convert to and from
these other packages' types.
For a more detailed discussion, see `vignette("other-approaches")`.
## Acknowledgements
The development of this software was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A170047 to The Pennsylvania State University. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.