Lines read and skip lines use different evaluation in read_lines #1500

pepijn-devries · 2023-06-21T07:39:15Z

Thanks for your work on readr! It's most helpful, but I did come across the following problem.

I have a very large ASCII file which is too large to load entirely into memory. Therefore, I use read_lines to read it in chunks using the skip and n_max arguments, process the chunks and write the results to a file. It turned out that a specific line in the file was read twice. First I assumed that this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.

It turns out that the skip argument uses a different way of counting lines (to be skipped) than the actual reading algorithm. I've prepared the following reprex by simplifying my case:

First prepare a text file with some nasty UTF8 characters:

library(readr)

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

Next, let's read from the file, 5 lines at a time:

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  print(problems(lines))
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}

Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:

result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
duplicated(result_check[result_check != ""])

It turns out that the line starting with 014 is read twice. I suspect that "\U000d" (the carriage return) is treated as a line ending while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (then this should be documented), or not (can this be fixed)?
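To make the suspicion concrete, here is a small base R sketch (not part of the original reprex) showing the bytes of the line with identifier 013 as the reprex constructs it. Byte 13 is the carriage return, so this line already ends in "\r\n" before anything is written to disk, whereas most other lines end in a lone "\n".

```r
# Rebuild line 13 of the reprex: identifier, raw byte 13, line end.
line_013 <- paste0(sprintf("%03i", 13), rawToChar(as.raw(13)), "\n")

# Byte 13 (0x0d) is the carriage return, i.e. the same as "\r".
identical(rawToChar(as.raw(13)), "\r")
#> [1] TRUE

# The line therefore ends in CR LF (0d 0a) rather than a lone LF (0a).
charToRaw(line_013)
#> [1] 30 31 33 0d 0a
```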

This is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.9.0     dbplyr_2.3.2     cellranger_1.1.0 compiler_4.1.1   tools_4.1.1      digest_0.6.31   
 [8] bit_4.0.4        tibble_3.2.1     jsonlite_1.8.4   evaluate_0.21    RSQLite_2.2.8    memoise_2.0.1    lifecycle_1.0.3 
[15] lattice_0.20-44  pkgconfig_2.0.3  rlang_1.1.0      DBI_1.1.3        cli_3.4.1        rstudioapi_0.13  parallel_4.1.1  
[22] fastmap_1.1.0    withr_2.5.0      dplyr_1.1.2      httr_1.4.6       stringr_1.5.0    xml2_1.3.2       hms_1.1.2       
[29] generics_0.1.3   vctrs_0.6.2      rappdirs_0.3.3   tidyselect_1.2.0 bit64_4.0.5      grid_4.1.1       glue_1.6.2      
[36] R6_2.5.1         fansi_1.0.3      readxl_1.3.1     vroom_1.6.0      tzdb_0.1.2       blob_1.2.3       magrittr_2.0.3  
[43] ellipsis_0.3.2   leaps_3.1        rvest_1.0.1      utf8_1.2.2       stringi_1.7.6    cachem_1.0.6     crayon_1.5.2


hadley · 2023-07-31T21:32:22Z

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

pepijn-devries · 2023-08-01T14:05:35Z

Here's the reprex created with the reprex package. Hopefully this is more helpful...

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  ## print(problems(lines)) ## commented out for brevity
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

## Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line
result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
table(duplicated(result_check[result_check != ""]))
#> 
#> FALSE  TRUE 
#>   127     1

## It turns out that the line starting with '014' is read twice.

Created on 2023-08-01 with reprex v2.0.2

hadley · 2023-08-01T17:07:40Z

I can't replicate it:

library(readr)

lines <- paste0(
  sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
  unlist(lapply(as.raw(1:255), rawToChar))  ## generate some nasty UTF8 characters
)
path <- tempfile()
writeLines(lines, path)

chunk_size <- 5
skips <- c(0, seq_len(length(lines) %/% chunk_size) * chunk_size)
chunks <- lapply(skips, \(skip) read_lines(path, skip = skip, n_max = chunk_size))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

id <- stringr::str_sub(unlist(chunks), 1, 3)
id[duplicated(id)]
#> character(0)

Created on 2023-08-01 with reprex v2.0.2

But I'm suspicious that your example is just tripping up on "\015" (byte 13, i.e. "\r"), which is the carriage return, and I see you are on Windows.

pepijn-devries · 2023-08-01T20:09:31Z

You are right, after your comment I tried running my reprex on a Linux machine:

R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

There I don't get the issue described here. So it seems to be a Windows-specific issue. For my specific case I have created a work-around. But I was hoping that readr would provide a platform-independent solution for reading files, i.e. produce the same result on each platform for the same file. If this is not possible then c'est la vie, but maybe this issue can be documented or produce a warning?
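The work-around itself is not shown in this thread. One possible approach, sketched purely as an assumption, sidesteps skip entirely: keep a single connection open and let base R's readLines() continue from the current position on each call. The names read_in_chunks() and process are hypothetical and only for illustration. Note that readLines() accepts LF, CRLF, and a lone CR as line endings, so a stray "\r" still starts a new line, but no line is read twice because lines are never re-counted.

```r
# Hypothetical helper: read a file chunk by chunk through one open connection,
# so no `skip` argument (and no separate line counting) is needed.
read_in_chunks <- function(path, chunk_size, process) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  repeat {
    # readLines() resumes where the previous call stopped.
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    process(lines)
  }
  invisible(NULL)
}

# Usage: collect the identifiers of a small test file, 5 lines at a time.
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%03i", 1:12), path)
seen <- character(0)
read_in_chunks(path, chunk_size = 5, process = function(x) seen <<- c(seen, x))
any(duplicated(seen))
#> [1] FALSE
```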

hadley · 2023-08-01T20:11:24Z

We can look into it, but would you mind having a go at a simpler reprex? I'm pretty sure the problem is related to having one line that uses \r\n where all the other lines use \n.

pepijn-devries · 2023-08-01T20:38:34Z

Sure, I will have a look whether I can pinpoint further what causes the issue. This might take some time...

pepijn-devries · 2023-08-01T21:04:30Z

It was easier to simplify the issue than I thought. You are right that \r is triggering odd behaviour on Windows:

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2

On Windows, reading from the text comes to a halt when using skip = 2 in the reprex, and just returns an empty string. On Linux the code above behaves as expected. On Linux \r is read as a separate line, whereas on Windows it is treated as part of the same line as 003.

Preferably, the same text is interpreted the same on each platform, or the user should be able to indicate which characters should be interpreted as a line feed.
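As a point of reference, here is a base R sketch of the platform-independent behaviour asked for above, assuming that only "\n" counts as a line terminator and that skipping and reading use the same split. split_lf() is a hypothetical helper for illustration, not a readr function.

```r
# Hypothetical helper: split on "\n" only, then drop the first `skip` lines,
# so skipping and reading always agree on what a line is.
split_lf <- function(text, skip = 0) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  if (skip > 0) lines <- lines[-seq_len(skip)]
  lines
}

text <- "001\n002\n\r003\n004"

split_lf(text, skip = 2)
#> [1] "\r003" "004"
split_lf(text, skip = 3)
#> [1] "004"
```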

hadley · 2023-08-01T21:46:41Z

I get the same behaviour on my mac, so it's great that we have a platform independent reprex (possibly because you're no longer saving the string to disk, which can do weird things to newlines).

library(readr)

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2
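On the point about saving to disk: when writeLines() is given a file name it writes in text mode, which on Windows typically turns "\n" into "\r\n" and so changes the bytes under test. A sketch (an illustration, not something done in this thread) of writing the string byte for byte with base R's writeBin() instead, so that the file content is the same on every platform:

```r
text <- "001\n002\n\r003\n004"
path <- tempfile(fileext = ".txt")

# Write the raw bytes of the string; no newline translation takes place.
writeBin(charToRaw(text), path)

# Read the bytes back and confirm the file holds exactly the original string.
bytes <- readBin(path, what = "raw", n = file.size(path))
identical(bytes, charToRaw(text))
#> [1] TRUE
```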

pepijn-devries · 2024-11-06T22:46:39Z

Is there any progress to report on this bug? It seems to be still present in the latest release...

hadley · 2024-11-07T13:45:37Z

@pepijn-devries if there was progress, you can assume it would be reported here...

pepijn-devries mentioned this issue on Jun 21, 2023: "Duplicated line read due to bug in read_lines", pepijn-devries/ECOTOXr#23 (open)

hadley added the "reprex" label (needs a minimal reproducible example) on Jul 31, 2023

hadley added the "bug" (an unexpected problem or unintended behavior) and "read 📖" labels and removed the "reprex" label on Aug 1, 2023
