Lines read and skip lines use different evaluation in read_lines #1500
Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.
Here's the reprex created with the reprex package. Hopefully this is more helpful...
library(readr)
#> Warning: package 'readr' was built under R version 4.1.3
dummy_text <-
data.frame(
a = sprintf("%03i", 1:255), ## add as identifier at the beginning of the line
b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
c = "\n" ## add line end
)
## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")
## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)
chunk_size <- 5
lines_read <- 0
result <- character(0)
repeat {
lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
## print(problems(lines)) ## commented out for brevity
if (length(lines) == 0) break
lines_read <- lines_read + length(lines)
result <- c(result, lines)
}
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
## Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line
result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
table(duplicated(result_check[result_check != ""]))
#>
#> FALSE TRUE
#> 127 1
## It turns out that the line starting with '014' is read twice.
Created on 2023-08-01 with reprex v2.0.2
I can't replicate it:
library(readr)
lines <- paste0(
sprintf("%03i", 1:255), ## add as identifier at the beginning of the line
unlist(lapply(as.raw(1:255), rawToChar)) ## generate some nasty UTF8 characters
)
path <- tempfile()
writeLines(lines, path)
chunk_size <- 5
skips <- c(0, seq_len(length(lines) %/% chunk_size) * chunk_size)
chunks <- lapply(skips, \(skip) read_lines(path, skip = skip, n_max = chunk_size))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
id <- stringr::str_sub(unlist(chunks), 1, 3)
id[duplicated(id)]
#> character(0)
Created on 2023-08-01 with reprex v2.0.2
But I'm suspicious that your example is just tripping up on character 13 ("\r"), which is the carriage return, and I see you are on Windows.
You are right, after your comment I tried running my reprex on a Linux machine:
There I don't get the issue described here. So it seems to be a Windows-specific issue. For my specific case I have created a work-around. But I was hoping that
We can look into it, but would you mind having a go at a simpler reprex? I'm pretty sure the problem is related to having one line that uses
Sure, I will have a look at whether I can pinpoint further what causes the issue. This might take some time...
It was easier to simplify the issue than I thought. You are right that \r is triggering funny behaviour on Windows:
library(readr)
#> Warning: package 'readr' was built under R version 4.1.3
text <- "001\n002\n\r003\n004"
read_lines(text, skip = 0)
#> [1] "001" "002" "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002" "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"
Created on 2023-08-01 with reprex v2.0.2
On Windows reading data from the text comes to a halt when using Preferably, the same text is interpreted the same on each platform, or the user should be able to indicate which characters should be interpreted as a line feed.
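Until the line-ending handling is settled, it may help to check a file for carriage returns that are not part of a "\r\n" pair before reading it in chunks. The following is only a rough sketch in base R: find_stray_cr() is a hypothetical helper, not something readr provides, and the sample bytes are made up to mirror the reprex.

```r
# Hypothetical helper (not part of readr): return the 1-based byte offsets of
# carriage returns that are not immediately followed by a line feed.
find_stray_cr <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  cr <- which(bytes == as.raw(0x0d))
  if (length(cr) == 0) return(integer(0))
  # Index of the byte after each CR, clamped so a trailing CR stays in range
  nxt <- pmin(cr + 1L, length(bytes))
  has_lf <- cr < length(bytes) & bytes[nxt] == as.raw(0x0a)
  cr[!has_lf]
}

# Usage on a small file: one lone "\r" and one regular "\r\n" pair.
# writeBin() is used so that no newline translation takes place.
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("001\n002\n\r003\r\n004"), path)
stray <- find_stray_cr(path)
print(stray)
```

This reads the whole file as raw bytes, so for a file that does not fit in memory it would have to be applied block by block.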
I get the same behaviour on my Mac, so it's great that we have a platform-independent reprex (possibly because you're no longer saving the string to disk, which can do weird things to newlines).
library(readr)
text <- "001\n002\n\r003\n004"
read_lines(text, skip = 0)
#> [1] "001" "002" "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002" "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"
Created on 2023-08-01 with reprex v2.0.2
Is there any progress to report on this bug? It still seems to be present in the latest release...
@pepijn-devries if there was progress, you can assume it would be reported here...
Thanks for your work on readr! It's most helpful, but I did come across the following problem.
I have a very large ASCII file which is too large to load entirely into memory. Therefore, I use read_lines to read it in chunks using the skip and n_max arguments, process the chunks and write the results to a file. It turned out that a specific line in the file was read twice. First I assumed that this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.
It turns out that the skip argument uses a different way of evaluating the number of lines (to be skipped) than the actual reading algorithm. I've prepared the following reprex by simplifying my case:
First prepare a text file with some nasty UTF8 characters:
Next, let's read from the file, 5 lines at a time:
Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:
It turns out that the line starting with 014 is read twice. I suspect that "\U000d" is treated as a line break while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (then this should be documented), or not (can this be fixed)?
This is my sessionInfo()
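For the chunked-reading use case described above, one possible workaround (a sketch, not a fix for read_lines) is to keep a single connection open and pull successive chunks with base R's readLines(), so that no line count has to be recomputed between chunks. The file contents and chunk_size below are made up for illustration. Note that readLines() also accepts a lone CR as a line ending, so the stray "\r" may show up as an extra empty line.

```r
# Write a small test file containing a stray carriage return.
# Binary mode avoids newline translation on Windows.
path <- tempfile(fileext = ".txt")
con_out <- file(path, open = "wb")
writeLines(c("001", "002", "\r003", "004"), con_out, sep = "\n")
close(con_out)

chunk_size <- 2
con <- file(path, open = "r")   # one connection, kept open across chunks
result <- character(0)
repeat {
  # Each call continues where the previous one stopped, so nothing is skipped
  # or re-read based on a separately computed line count.
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  result <- c(result, lines)    # in practice: process and write out the chunk
}
close(con)

# Check that no non-empty line was read twice (0 means no duplicates)
print(anyDuplicated(result[nzchar(result)]))
```

Because the position is held by the connection itself, the reader and the "skipper" cannot disagree about where a line ends.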