The main limitation of the package is around column-level missing values. It forces me to load files as character vectors and then use `type_convert()`, which doesn't have complete parity with the vroom `read_*` functions.
So even though this package works as a sort of stop-gap, I still think this kind of functionality would be better built into readr / vroom.
In many datasets, missing values are interlaced with data as codes or strings. `read_delim()` et al. presently have an option to replace these values with `NA`, but do not have an easy way to do anything else with these values. For example:

To load these data we'd run `read_csv(na_strings_csv, na=c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"))`.
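To make the call above concrete, here is a minimal sketch. The contents of `na_strings_csv` are an assumption: the column names and rows below are invented for illustration and are not the data from the original example.

```r
library(readr)

# Hypothetical data: column names and rows are invented for illustration.
na_strings_csv <- "person_id,age,favorite_color
1,20,Blue
2,_DECLINED_ANSWER_,Green
3,30,_TECHNICAL_ERROR_
4,40,_TECHNICAL_ERROR_
"

# I() marks the string as literal data rather than a file path (readr >= 2.0).
# Both missing-reason codes are collapsed to NA, so the reasons are lost.
values <- read_csv(
  I(na_strings_csv),
  na = c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_"),
  show_col_types = FALSE
)
```

With the codes declared in `na`, `age` can be parsed as a number, but nothing records which reason applied to which cell.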
But if we want to load the missing reasons as their own data frame, or load the missing reasons in separate columns with a suffix, it requires loading the entire data frame as text, filtering for the missing reasons, and joining to the original data frame. Given the extensive number of datasets that interlace values and missing reasons, it would be really nice to have an extra option on `read_*` to make this common task more ergonomic. I propose a `channels` arg with the following functionality:

`channels="values"` gives the default behavior, but `channels="missing"` loads the missing reasons.

`channels=c("values", "missing")` loads both in separate columns.

`channels` controls the suffixes of the columns.

Loading values and missing reasons in separate columns like this greatly facilitates manipulation with tidyverse aggregation & filtering functions. For example, to find the average age of the individuals that had technical errors reporting their favorite color:
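As a rough sketch of that query using only what readr and dplyr offer today, the manual workaround described above could look like the following. The proposed `channels` argument does not exist in readr, so it is not used here. The data, the column names, and the `_missing` suffix are assumptions made for illustration, and `bind_cols()` stands in for the join because both reads return rows in the same order.

```r
library(readr)
library(dplyr)

# Hypothetical data: column names and rows are invented for illustration.
na_strings_csv <- "person_id,age,favorite_color
1,20,Blue
2,_DECLINED_ANSWER_,Green
3,30,_TECHNICAL_ERROR_
4,40,_TECHNICAL_ERROR_
"
missing_codes <- c("_DECLINED_ANSWER_", "_TECHNICAL_ERROR_")

# "values" channel: typed columns, with the codes replaced by NA.
values <- read_csv(I(na_strings_csv), na = missing_codes, show_col_types = FALSE)

# "missing" channel: read everything as text, keep only the codes,
# and add an (assumed) "_missing" suffix to every column name.
missing <- read_csv(
  I(na_strings_csv),
  col_types = cols(.default = col_character()),
  na = character()
) %>%
  mutate(across(everything(), ~ if_else(.x %in% missing_codes, .x, NA_character_))) %>%
  rename_with(~ paste0(.x, "_missing"))

# Both channels side by side: age, favorite_color, age_missing, ...
both <- bind_cols(values, missing)

# Average age of respondents whose favorite color hit a technical error.
avg_age <- both %>%
  filter(favorite_color_missing == "_TECHNICAL_ERROR_") %>%
  summarise(mean_age = mean(age, na.rm = TRUE))
```

For this invented data, the two rows with a technical error on `favorite_color` have ages 30 and 40, so `mean_age` is 35. The two separate reads and the column bookkeeping are the overhead a built-in option would remove.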
I provide some more examples here (I also include an example / naive implementation of the above API).
Please let me know if this idea is of interest, and I'd be happy to work on a PR. Cheers!