I have to parse a rather nasty file in which two log formats are intertwined, so the number of columns varies from line to line. Both formats start with the same columns, but the later columns differ in content and in number. I therefore built a solution using read_delim based on the longer of the two formats (the one I want to use further on) and ignored the warnings the function emits about missing columns. Filtering for correctly read data is done later on the columns that are always present. This seemed to work fine (apart from the warnings).
At some point it got weird, because some files were parsed correctly and others were not. I finally found out that even with a given number of col_names, the number of "expected" columns changes once parsing starts. Let's have a look at the example:
library(readr)

# Read the mixed log with the 8-column layout and return the parsing problems
readTestlog <- function(filename) {
  readdata <- read_delim(
    filename,
    delim = " - ",
    col_names = c("datetime", "service", "loglevel", "remote", "target", "response", "requesttime", "content"),
    col_types = "cccccccc",
    na = c("", "NA"),
    trim_ws = TRUE,
    lazy = FALSE,
    show_col_types = TRUE,
    name_repair = "minimal"
  )
  problems(readdata)
}

tmpfilename <- tempfile()
testlogtext <- c(
  '2023-05-31 08:35:00,042 - Tool2 - DEBUG - Remote 127.0.0.1 - <Request "http://127.0.0.1:1234/another/url" [GET]> - <Response 3 bytes [200 OK]> - Request time 0.0 s - Content {}',
  '2023-05-31 08:35:00,026 - Tool1 - INFO - 1.2.3.4 - - [31/May/2023 08:35:00] "GET /this/is/an/URL HTTP/1.1" 200 -'
)

# Write and read in normal order
write_lines(testlogtext, file = tmpfilename)
readTestlog(tmpfilename)

# Write and read in reversed order
write_lines(rev(testlogtext), file = tmpfilename)
readTestlog(tmpfilename)

unlink(tmpfilename)
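The first call parses well, only complaining that line 2 has 6 of the 8 expected columns. The second call, with the lines reversed, keeps only 6 of the 8 defined columns and flags the 8-field line instead.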
I could not convince readr to really use the number of defined columns by changing options, but it keeps "guessing" the number from the very first line of data which should be ignored IMHO if a given number of col_names (and col_types) is defined.
I'd suggest parsing the file another way. readr will always guess the number of columns from the data, and while I can see that it would be useful to override that when you specify both names and types, that code is already rather complicated and unfortunately I don't think the effort would be worth the payoff.
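One possible way to "parse the file another way" is to read the raw lines and split them manually, so that the column count comes from col_names rather than from the first line of data. The following is only a minimal sketch under that assumption, not something the maintainers proposed: read_mixed_log and fit_fields are hypothetical helper names, and splitting with base R strsplit may cut a short line into a different number of pieces than read_delim reports. Short lines are padded with NA, which fits the plan of filtering on the always-present columns afterwards.

```r
library(readr)
library(tibble)

col_names <- c("datetime", "service", "loglevel", "remote",
               "target", "response", "requesttime", "content")

# Hypothetical helper: force one split line to exactly n fields
fit_fields <- function(pieces, n, delim) {
  if (length(pieces) > n) {
    # Fold surplus pieces back into the last field instead of dropping them
    pieces <- c(pieces[seq_len(n - 1)],
                paste(pieces[n:length(pieces)], collapse = delim))
  }
  length(pieces) <- n  # pads short lines with NA
  trimws(pieces)
}

# Hypothetical helper: the number of columns is fixed by col_names,
# independent of which log format happens to come first in the file
read_mixed_log <- function(filename, col_names, delim = " - ") {
  lines <- read_lines(filename)
  pieces <- strsplit(lines, delim, fixed = TRUE)
  rows <- lapply(pieces, fit_fields, n = length(col_names), delim = delim)
  fields <- do.call(rbind, rows)
  colnames(fields) <- col_names
  as_tibble(fields)
}

# Usage with the two sample lines from the report, in both orders
sample_lines <- c(
  '2023-05-31 08:35:00,042 - Tool2 - DEBUG - Remote 127.0.0.1 - <Request "http://127.0.0.1:1234/another/url" [GET]> - <Response 3 bytes [200 OK]> - Request time 0.0 s - Content {}',
  '2023-05-31 08:35:00,026 - Tool1 - INFO - 1.2.3.4 - - [31/May/2023 08:35:00] "GET /this/is/an/URL HTTP/1.1" 200 -'
)
tmp <- tempfile()
write_lines(sample_lines, tmp)
normal <- read_mixed_log(tmp, col_names)
write_lines(rev(sample_lines), tmp)
reversed <- read_mixed_log(tmp, col_names)
unlink(tmp)
```

Both results have all 8 named columns regardless of line order; rows from the shorter format can then be dropped or handled separately by checking for NA in the trailing columns.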
Output of the first call (normal order), which only reports that line 2 has 6 of 8 columns:
Rows: 2 Columns: 8
── Column specification ────
Delimiter: " - "
chr (8): datetime, service, loglevel, remote, target, response, requesttime, content

# A tibble: 1 × 5
    row   col expected  actual    file
  <int> <int> <chr>     <chr>     <chr>
1     2     6 8 columns 6 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f
If the order is reversed (which happens purely by chance in the wild when Tool1 writes first), read_delim uses only 6 of my 8 defined col_names:
Rows: 2 Columns: 6
── Column specification ────
Delimiter: " - "
chr (6): datetime, service, loglevel, remote, target, response

# A tibble: 1 × 5
    row   col expected  actual    file
  <int> <int> <chr>     <chr>     <chr>
1     2     8 6 columns 8 columns C:/Users/user/AppData/Local/Temp/Rtmp21wxrO/file76fc4672112f