
missing "id" option in read_table #1540

Open
janxkoci opened this issue May 9, 2024 · 1 comment
Open

missing "id" option in read_table #1540

janxkoci opened this issue May 9, 2024 · 1 comment

Comments

@janxkoci
Copy link

janxkoci commented May 9, 2024

The read_table function lacks some of the options that the other read_delim-based functions have, most importantly (for me) the id option. I have many whitespace-delimited files (the kind read_table is meant to handle) and I want to load them all with a map call followed by bind_rows, but without an id option it is hard to keep track of which data comes from which input file.

Moreover, this makes read_table inconsistent with the other read_* functions, for no obvious reason.
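
For concreteness, this is the call I would like to be able to write. It is hypothetical, since read_table currently has no id argument:

library(tidyverse)

# hypothetical API: read_table() does not accept `id` today
my_df <- list.files("my_data", pattern = "\\.txt$", full.names = TRUE) %>% 
  map(read_table, id = "filename", comment = "#") %>% 
  bind_rows()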

janxkoci commented May 29, 2024

workaround(s)

Recently I came up with a decent workaround for getting the result I want, so I'm documenting it here and contrasting it with the other read_* functions.

csv, tsv, fwf, and other "nice" formats (expected behaviour)

What I want can be done directly with these formats:

library(tidyverse)

# full.names = TRUE keeps the "my_data/" prefix so read_tsv can find the files;
# note the pattern is a regex, not a glob
my_files <- list.files("my_data", pattern = "\\.tsv$", full.names = TRUE)

my_df <- my_files %>% 
  map(read_tsv, id = "filename", comment = "#") %>% 
  bind_rows()
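
As an aside, assuming readr >= 2.0, the map step isn't even needed for these formats: the read_* functions accept a vector of paths and fill the id column themselves:

# readr >= 2.0 reads multiple files in one call
my_df <- read_tsv(my_files, id = "filename", comment = "#")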

format pretty-printed with multiple spaces between columns (actual hoops necessary to get there)

On the other hand, with read_table one needs to do something like this:

my_files <- list.files("my_data", pattern = "\\.txt$", full.names = TRUE)

my_df <- my_files %>% 
  map(read_table, comment = "#") %>% 
  bind_rows(.id = "filename") %>% # fills a column with the list position per file, but as character
  mutate(filename = my_files[as.numeric(filename)]) # map positions back to file names
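
A slightly tidier spelling of the same trick (a sketch, equivalent to the above): purrr::set_names names the list elements by their own paths, so bind_rows can fill the file names in directly:

my_df <- my_files %>% 
  set_names() %>% # name each element by its own path
  map(read_table, comment = "#") %>% 
  bind_rows(.id = "filename") # .id now picks up the names, not positions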

another option - miller

The same can also be achieved from the command line with miller. Miller is like a tidyverse for the Unix terminal, giving a 21st-century facelift to old rusty tools like cut, sort, or awk.

One of miller's strengths is format-awareness and seamless conversion. It can read the format discussed here directly with --ipprint and convert it to e.g. csv with --ocsv, or, as I do below, with the keystroke-saver --p2c:

mlr --p2c --skip-comments --mfrom my_data/*.txt -- cat --filename > all_my_data.csv
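
For reference, the same command spelled with the long-form flags that --p2c abbreviates:

mlr --ipprint --ocsv --skip-comments --mfrom my_data/*.txt -- cat --filename > all_my_data.csv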

The created csv can obviously be loaded into R directly and I can continue with my work.
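
Loading it back is then a one-liner (no comment argument needed, since --skip-comments already stripped them):

my_df <- read_csv("all_my_data.csv")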

Miller works fine, unless I need to rename many columns in the individual files before concatenating them into one table. Miller has facilities comparable to dplyr::rename, but anything more complex is better served by magrittr::set_colnames, where I can do things like:

map(my_files, read_tsv, comment = "#") %>% 
  map(magrittr::set_colnames, c("model", "estimate", paste0("bootstrap", 1:50))) %>% 
  bind_rows()
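
And to keep the filename tracking from the workaround above while renaming, the two snippets combine naturally (a sketch, assuming each file has the same 52 columns):

my_files %>% 
  set_names() %>% 
  map(read_table, comment = "#") %>% 
  map(magrittr::set_colnames, c("model", "estimate", paste0("bootstrap", 1:50))) %>% 
  bind_rows(.id = "filename")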
