read_delim skips specified columns if first line of input does not match #1496

Closed
Henning-Lenz opened this issue May 31, 2023 · 1 comment

@Henning-Lenz

I have to parse a rather nasty file in which two log formats are interleaved, so the number of columns varies from line to line. Both formats start with the same columns, but the later columns differ in content and count. I therefore built a solution using read_delim based on the longer of the two formats (the one I also want to use further), ignoring the warnings the function emits about missing columns. Filtering for correctly read rows is done later on the columns that are always present. This seemed to work fine (apart from the warnings).

At some point it got weird: some files were parsed correctly, others were not. I finally found out that, even with an explicit set of col_names, the number of "expected" columns is decided when parsing starts. Let's look at an example:

library(readr)

readTestlog <- function(filename)
{
  readdata <- read_delim(
    filename,
    delim = " - ",
    col_names = c("datetime", "service", "loglevel", "remote", "target", "response", "requesttime", "content"),
    col_types = "cccccccc",
    na = c("", "NA"),
    trim_ws = TRUE,
    lazy = FALSE,
    show_col_types = TRUE,
    name_repair = "minimal"
  )
  problems(readdata)
}

tmpfilename <- tempfile()

testlogtext <- c(
  '2023-05-31 08:35:00,042 - Tool2 - DEBUG - Remote 127.0.0.1 - <Request "http://127.0.0.1:1234/another/url" [GET]> - <Response 3 bytes [200 OK]> - Request time 0.0 s - Content {}',
  '2023-05-31 08:35:00,026 - Tool1 - INFO - 1.2.3.4 - - [31/May/2023 08:35:00] "GET /this/is/an/URL HTTP/1.1" 200 -'
)

#Write and read in normal order
write_lines(testlogtext, file = tmpfilename)
readTestlog(tmpfilename)

#Write and read in reversed order
write_lines(rev(testlogtext), file = tmpfilename)
readTestlog(tmpfilename)

unlink(tmpfilename)

The first call parses well, only complaining that line 2 has just 6 of the 8 columns:

Rows: 2 Columns: 8                                                                                                                                                                        
── Column specification ────
Delimiter: " - "
chr (8): datetime, service, loglevel, remote, target, response, requesttime, content

# A tibble: 1 × 5
    row   col expected  actual    file                                                          
  <int> <int> <chr>     <chr>     <chr>                                                         
1     2     6 8 columns 6 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f

If the order is reversed (which happens by chance in the wild when Tool1 writes first), read_delim uses only 6 of my 8 defined col_names:

Rows: 2 Columns: 6                                                                                                                                                                        
── Column specification ────
Delimiter: " - "
chr (6): datetime, service, loglevel, remote, target, response

# A tibble: 1 × 5
    row   col expected  actual    file                                                          
  <int> <int> <chr>     <chr>     <chr>                                                         
1     2     8 6 columns 8 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f

I could not convince readr to actually use the number of defined columns by changing options; it keeps "guessing" the number from the very first line of data. IMHO that guess should be ignored when col_names (and col_types) are given explicitly.

@hadley
Member

hadley commented Jul 31, 2023

I'd suggest parsing the file another way — readr will always guess the number of columns from the data, and while I can see that it would be useful to override if you specify both names and type, that code is already rather complicated and unfortunately I don't think the effort will be worth the payoff.
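
For readers looking for one possible reading of "parsing the file another way", here is a minimal sketch that avoids read_delim's column guessing altogether. It is not a readr feature: `read_padded_log()` is a hypothetical helper written in base R. It splits each line on the literal delimiter and pads short rows with NA, so the column count always equals the number of names supplied. Field boundaries come from a plain `strsplit()` and may not match readr's tokenization exactly, for example around the ` - - ` sequence in the Tool1 line.

```r
# Hypothetical helper (not part of readr): always returns length(col_names)
# columns, regardless of how many fields the first line contains.
read_padded_log <- function(filename, col_names, delim = " - ") {
  n <- length(col_names)
  lines <- readLines(filename, warn = FALSE)
  pieces <- strsplit(lines, delim, fixed = TRUE)

  padded <- lapply(pieces, function(p) {
    if (length(p) > n) {
      # Fold surplus fields back into the last column instead of dropping them
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = delim))
    }
    length(p) <- n  # pads short rows with NA
    trimws(p)
  })

  out <- as.data.frame(
    matrix(as.character(unlist(padded)), ncol = n, byrow = TRUE),
    stringsAsFactors = FALSE
  )
  names(out) <- col_names
  out
}

# Usage with the two log lines from the report, shorter format first
log_cols <- c("datetime", "service", "loglevel", "remote",
              "target", "response", "requesttime", "content")
log_lines <- c(
  '2023-05-31 08:35:00,026 - Tool1 - INFO - 1.2.3.4 - - [31/May/2023 08:35:00] "GET /this/is/an/URL HTTP/1.1" 200 -',
  '2023-05-31 08:35:00,042 - Tool2 - DEBUG - Remote 127.0.0.1 - <Request "http://127.0.0.1:1234/another/url" [GET]> - <Response 3 bytes [200 OK]> - Request time 0.0 s - Content {}'
)
tmp <- tempfile()
writeLines(log_lines, tmp)
parsed <- read_padded_log(tmp, log_cols)
unlink(tmp)

ncol(parsed)  # 8, even though the first line has fewer fields
```

Rows from the shorter format can then be filtered on the always-present leading columns, as in the original approach; their trailing columns are simply NA.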

hadley closed this as completed Jul 31, 2023