read_delim skips specified columns if first line of input does not match #1496

Henning-Lenz · 2023-05-31T09:46:46Z

I have to parse a rather nasty file in which two log formats are intertwined, so the number of columns varies from line to line. Both formats start with the same columns, but the later columns differ in content and number. I therefore built a solution using read_delim based on the longer of the two formats (the one I also want to use further), ignoring the warnings the function emits about missing columns. Filtering for correctly read data is done later on the always-present columns. This seemed to work fine (except for the warnings).

At some point it got weird: some files were parsed correctly, others were not. I finally found out that, even with a given number of col_names, the number of "expected" columns changes once parsing starts. Let's look at an example:

library(readr)

readTestlog <- function(filename)
{
  readdata <- read_delim(
    filename,
    delim = " - ",
    col_names = c("datetime", "service", "loglevel", "remote", "target", "response", "requesttime", "content"),
    col_types = "cccccccc",
    na = c("", "NA"),
    trim_ws = TRUE,
    lazy = FALSE,
    show_col_types = TRUE,
    name_repair = "minimal"
  )
  problems(readdata)
}

tmpfilename <- tempfile()

testlogtext <- c(
  '2023-05-31 08:35:00,042 - Tool2 - DEBUG - Remote 127.0.0.1 - <Request "http://127.0.0.1:1234/another/url" [GET]> - <Response 3 bytes [200 OK]> - Request time 0.0 s - Content {}',
  '2023-05-31 08:35:00,026 - Tool1 - INFO - 1.2.3.4 - - [31/May/2023 08:35:00] "GET /this/is/an/URL HTTP/1.1" 200 -'
)

# Write and read in normal order
write_lines(testlogtext, file = tmpfilename)
readTestlog(tmpfilename)

# Write and read in reversed order
write_lines(rev(testlogtext), file = tmpfilename)
readTestlog(tmpfilename)

unlink(tmpfilename)

The first run parses fine, only complaining that row 2 has 6 of the 8 expected columns:

Rows: 2 Columns: 8
── Column specification ────
Delimiter: " - "
chr (8): datetime, service, loglevel, remote, target, response, requesttime, content

# A tibble: 1 × 5
    row   col expected  actual    file                                                          
  <int> <int> <chr>     <chr>     <chr>                                                         
1     2     6 8 columns 6 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f

If the order is reversed (which happens by chance in the wild whenever Tool1 writes first), read_delim uses only 6 of my 8 defined col_names:

Rows: 2 Columns: 6
── Column specification ────
Delimiter: " - "
chr (6): datetime, service, loglevel, remote, target, response

# A tibble: 1 × 5
    row   col expected  actual    file                                                          
  <int> <int> <chr>     <chr>     <chr>                                                         
1     2     8 6 columns 8 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f

I could not convince readr to use the number of defined columns by changing options; it keeps "guessing" the number from the very first line of data, which IMHO should be ignored when col_names (and col_types) are given explicitly.

hadley · 2023-07-31T21:39:36Z

I'd suggest parsing the file another way — readr will always guess the number of columns from the data, and while I can see that it would be useful to override if you specify both names and type, that code is already rather complicated and unfortunately I don't think the effort will be worth the payoff.
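
For readers wondering what "parsing the file another way" could look like, here is a minimal sketch in base R. It is not part of the reply above and not a readr feature: `read_padded_log` is a hypothetical helper name, and the approach (read raw lines, split on the fixed delimiter, pad every row to the number of given column names) is only one possible workaround. It assumes the file has at least one line.

```r
# Hypothetical helper (not part of readr): split each line on a fixed
# delimiter and force every row to have one value per column name.
# Assumes the file contains at least one line.
read_padded_log <- function(filename, col_names, delim = " - ") {
  lines <- readLines(filename)

  # fixed = TRUE treats the delimiter as a literal string, not a regex
  fields <- strsplit(lines, delim, fixed = TRUE)

  n <- length(col_names)
  padded <- lapply(fields, function(x) {
    x <- trimws(x)
    # Assigning the length pads short rows with NA. Note that it also
    # truncates rows with more than n fields, so extra fields are dropped.
    length(x) <- n
    x
  })

  # Stack the equal-length rows into a character matrix, then a data frame
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}
```

Called with the eight col_names from the example above, this should return eight character columns regardless of which log format appears on the first line; rows from the shorter format carry NA in the trailing columns, so filtering on the always-present columns works as before. Field boundaries may differ from what read_delim reports for lines that contain consecutive delimiters.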

hadley closed this as completed Jul 31, 2023
