missing "id" option in read_table #1540

janxkoci · 2024-05-09T13:59:45Z

The read_table function lacks some options that the other read_delim-based functions have, most importantly for me the id option. I have many whitespace-delimited files (the kind read_table is meant to handle) and I want to load them all with a map call followed by bind_rows, but the lack of an id option makes it hard to keep track of which data comes from which input file.

Moreover, it makes the read_table function inconsistent with other read functions, for no obvious reason.


janxkoci · 2024-05-29T07:31:39Z

workaround(s)

Recently, I came up with a decent workaround for getting the result I want, so I'm documenting it here, and contrasting it with other read_* functions.

csv, tsv, fwf, and other "nice" formats (expected behaviour)

What I want can be done directly with these formats:

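# pattern is a regular expression, not a glob; full.names = TRUE keeps the directory in each path
my_files <- list.files("my_data", pattern = "\\.tsv$", full.names = TRUE)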

my_df <- my_files %>% 
  map(read_tsv, id = "filename", comment = "#") %>% 
  bind_rows()
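For comparison, here is a minimal sketch of an even shorter route, assuming readr 2.0 or later, where the delimited readers accept a vector of paths and the id argument directly. The temporary directory and the sample data are hypothetical and exist only to make the example self-contained.

```r
library(readr)

# Hypothetical sample data: two small tab-separated files in a temporary directory
data_dir <- file.path(tempdir(), "my_data_tsv")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("model\testimate", "m1\t0.5"), file.path(data_dir, "a.tsv"))
writeLines(c("model\testimate", "m2\t0.7"), file.path(data_dir, "b.tsv"))

my_files <- list.files(data_dir, pattern = "\\.tsv$", full.names = TRUE)

# A vector of paths is read and row-bound in one call;
# id = "filename" adds a column holding each row's source path
my_df <- read_tsv(my_files, id = "filename", comment = "#", show_col_types = FALSE)
```

This only works when all files share the same columns. When they do not, the map call followed by bind_rows is still the more flexible choice.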

format pretty-printed with multiple spaces between columns (actual hoops necessary to get there)

On the other hand, with read_table one needs to do something like this:

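# pattern is a regular expression, not a glob; full.names = TRUE keeps the directory in each path
my_files <- list.files("my_data", pattern = "\\.txt$", full.names = TRUE)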

my_df <- my_files %>% 
  map(read_table, comment = "#") %>% 
  bind_rows(.id = "filename") %>% # this fills a column with numbers per file, but the type is char
  mutate(filename = my_files[as.numeric(filename)])
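A slightly tidier variant of the same workaround is sketched below, assuming purrr's set_names() is available: naming the list elements after their paths lets bind_rows(.id = ...) fill the column with file names directly, so the index lookup is not needed. The temporary directory and the sample data are hypothetical, and the sample files contain no comment lines, so comment = "#" is kept only to mirror the call above.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical sample data: two small whitespace-aligned files in a temporary directory
data_dir <- file.path(tempdir(), "my_data_txt")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("model   estimate", "m1      0.5"), file.path(data_dir, "a.txt"))
writeLines(c("model   estimate", "m2      0.7"), file.path(data_dir, "b.txt"))

my_files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

my_df <- my_files %>%
  set_names() %>%                                        # name each element after its own path
  map(read_table, comment = "#", show_col_types = FALSE) %>%
  bind_rows(.id = "filename")                            # list names become the filename column
```

The filename column then holds the full paths; wrapping it in basename() would reduce it to bare file names.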

another option - miller

The same can also be achieved from the command line with miller. Miller is like a tidyverse for the Unix terminal, giving a 21st-century facelift to old, rusty tools like cut, sort, or awk.

One of miller's strengths is format-awareness and seamless conversions. It can directly read the format discussed here with --ipprint and convert to e.g. csv with --ocsv, or, as I do below, with a keystroke-saver --p2c:

mlr --p2c --skip-comments --mfrom my_data/*.txt -- cat --filename > all_my_data.csv

The created csv can obviously be loaded into R directly and I can continue with my work.

Miller works fine, unless I need to rename many columns in the individual files before concatenating them into one table. Miller has facilities comparable to dplyr::rename, but anything more complex is better served by magrittr::set_colnames, where I can do things like:

map(my_files, read_tsv, comment = "#") %>%
  map(magrittr::set_colnames, c("model", "estimate", paste0("bootstrap", 1:50))) %>%
  bind_rows()
