Lines read and skip lines use different evaluation in read_lines #1500

Open
pepijn-devries opened this issue Jun 21, 2023 · 8 comments
Labels
bug an unexpected problem or unintended behavior read 📖

@pepijn-devries

Thanks for your work on readr! It's most helpful, but I did come across the following problem.

I have a very large ASCII file which is too large to load entirely into memory. Therefore, I use read_lines to read it in chunks using the skip and n_max arguments, process the chunks and write the results to a file. It turned out that a specific line in the file was read twice. First I assumed that this was an error in the ASCII file, but after some testing it turned out that read_lines had read the same line twice.

It turns out that the skip argument uses a different way of counting lines (to be skipped) than the actual reading algorithm. I've prepared the following reprex by simplifying my case:

First prepare a text file with some nasty UTF8 characters:

library(readr)

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

Next, let's read from the file, 5 lines at a time:

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  print(problems(lines))
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}

Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line:

result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
duplicated(result_check[result_check != ""])

It turns out that the line starting with 014 is read twice. I suspect that "\U000d" (carriage return) is treated as a line break while reading the file, but not when counting the number of lines to be skipped. This causes the same line to be read twice. Is this intended (then this should be documented), or not (can this be fixed)?
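
One way to sidestep the skip count altogether is to keep a connection open and let base R's readLines() continue from where the previous chunk ended. The following is only a minimal sketch under that assumption, not the workaround used for the real file; `read_in_chunks()` is a hypothetical helper name. Note that readLines() accepts LF, CRLF and a lone CR as line ends, so a stray "\r" still splits a line, but reading and positioning then agree with each other.

```r
# Sketch: read a file in chunks through an open connection, so that no
# `skip` argument is needed (the connection remembers its position).
read_in_chunks <- function(path, chunk_size = 5) {  # hypothetical helper
  con <- file(path, open = "r")
  on.exit(close(con))
  result <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break
    result <- c(result, lines)  # replace with per-chunk processing
  }
  result
}

# Small self-contained check with 12 identifier lines
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%03i", 1:12), path)
ids <- read_in_chunks(path, chunk_size = 5)
anyDuplicated(ids)  # 0 means no line was returned twice
```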

This is my sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.9.0     dbplyr_2.3.2     cellranger_1.1.0 compiler_4.1.1   tools_4.1.1      digest_0.6.31   
 [8] bit_4.0.4        tibble_3.2.1     jsonlite_1.8.4   evaluate_0.21    RSQLite_2.2.8    memoise_2.0.1    lifecycle_1.0.3 
[15] lattice_0.20-44  pkgconfig_2.0.3  rlang_1.1.0      DBI_1.1.3        cli_3.4.1        rstudioapi_0.13  parallel_4.1.1  
[22] fastmap_1.1.0    withr_2.5.0      dplyr_1.1.2      httr_1.4.6       stringr_1.5.0    xml2_1.3.2       hms_1.1.2       
[29] generics_0.1.3   vctrs_0.6.2      rappdirs_0.3.3   tidyselect_1.2.0 bit64_4.0.5      grid_4.1.1       glue_1.6.2      
[36] R6_2.5.1         fansi_1.0.3      readxl_1.3.1     vroom_1.6.0      tzdb_0.1.2       blob_1.2.3       magrittr_2.0.3  
[43] ellipsis_0.3.2   leaps_3.1        rvest_1.0.1      utf8_1.2.2       stringi_1.7.6    cachem_1.0.6     crayon_1.5.2    
@hadley
Member

hadley commented Jul 31, 2023

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run in a local session.

@hadley hadley added the reprex needs a minimal reproducible example label Jul 31, 2023
@pepijn-devries
Author

Here's the reprex created with the reprex package. Hopefully this is more helpful...

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

dummy_text <-
  data.frame(
    a = sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
    b = unlist(lapply(as.raw(1:255), rawToChar)), ## generate some nasty UTF8 characters
    c = "\n"                                      ## add line end
  )

## paste all lines together:
dummy_text <- paste(apply(as.matrix(dummy_text), 1, paste, collapse = ""), collapse = "")

## save as a temp file
dummy_file <- tempfile(fileext = ".txt")
writeLines(dummy_text, dummy_file)

chunk_size <- 5
lines_read <- 0
result <- character(0)

repeat {
  lines <- read_lines(dummy_file, skip = lines_read, n_max = chunk_size)
  ## print(problems(lines)) ## commented out for brevity
  if (length(lines) == 0) break
  lines_read <- lines_read + length(lines)
  result <- c(result, lines)
}
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

## Next, try to extract the first three characters from each line, which I had added above as a unique identifier for each line
result_check <- unlist(lapply(result, function(x) tryCatch({substr(x, 1, 3)}, error = function(e) NULL)))
table(duplicated(result_check[result_check != ""]))
#> 
#> FALSE  TRUE 
#>   127     1

## It turns out that the line starting with '014' is read twice.

Created on 2023-08-01 with reprex v2.0.2

@hadley
Member

hadley commented Aug 1, 2023

I can't replicate it:

library(readr)

lines <- paste0(
  sprintf("%03i", 1:255),                   ## add as identifier at the beginning of the line
  unlist(lapply(as.raw(1:255), rawToChar))  ## generate some nasty UTF8 characters
)
path <- tempfile()
writeLines(lines, path)

chunk_size <- 5
skips <- c(0, seq_len(length(lines) %/% chunk_size) * chunk_size)
chunks <- lapply(skips, \(skip) read_lines(path, skip = skip, n_max = chunk_size))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

id <- stringr::str_sub(unlist(chunks), 1, 3)
id[duplicated(id)]
#> character(0)

Created on 2023-08-01 with reprex v2.0.2

But I'm suspicious that your example is just tripping up on "\r" (byte 13, written "\015" in octal), which is the carriage return, and I see you are on Windows.

@pepijn-devries
Author

You are right, after your comment I tried running my reprex on a Linux machine:

R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

There I don't get the issue described here, so it seems to be a Windows-specific issue. For my specific case I have created a work-around. But I was hoping that readr would provide a platform-independent solution for reading files, i.e. produce the same result on each platform for the same file. If this is not possible then c'est la vie, but maybe this issue can be documented or produce a warning?

@hadley
Member

hadley commented Aug 1, 2023

We can look into it, but would you mind having a go at a simpler reprex? I'm pretty sure the problem is related to having one line that uses \r\n where all the other lines use \n.
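
One way to check that suspicion without involving readr at all is to count the line endings in the raw bytes of the file. This is a base R sketch only; `count_line_endings()` is a hypothetical helper, not part of any package, and it reads the whole file into memory, so for a very large file it would need to be run on a sample.

```r
# Sketch: count CRLF, lone CR and lone LF line endings in a file's raw bytes.
count_line_endings <- function(path) {  # hypothetical helper
  bytes <- readBin(path, what = "raw", n = file.size(path))
  cr <- which(bytes == as.raw(0x0d))  # positions of "\r"
  lf <- which(bytes == as.raw(0x0a))  # positions of "\n"
  crlf <- sum((cr + 1) %in% lf)       # "\r" immediately followed by "\n"
  c(crlf = crlf, lone_cr = length(cr) - crlf, lone_lf = length(lf) - crlf)
}

# Example file written in binary mode so the bytes are stored exactly as given
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("001\n002\r\n003\n\r004\n"), path)
endings <- count_line_endings(path)
endings  # one CRLF, one lone CR, three lone LF
```

A non-zero `lone_cr`, or a mix of `crlf` and `lone_lf`, would point to the kind of inconsistent line endings suspected here.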

@pepijn-devries
Author

pepijn-devries commented Aug 1, 2023

Sure, I will have a look whether I can pinpoint further what causes the issue. This might take some time...

@pepijn-devries
Author

pepijn-devries commented Aug 1, 2023

It was easier to simplify the issue than I thought. You are right that \r is triggering funny behaviour on Windows:

library(readr)
#> Warning: package 'readr' was built under R version 4.1.3

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2

On Windows, reading from the text comes to a halt when using skip = 2 in the reprex and just returns an empty string. On Linux the code above behaves as expected: there \r is read as a separate line, whereas on Windows it is considered part of the line containing 003.

Preferably, the same text is interpreted the same on each platform, or the user should be able to indicate which characters should be interpreted as a line feed.
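
A minimal sketch of the behaviour asked for here, written in base R rather than readr: split on "\n" only and skip by index, so that reading and skipping share one definition of a line. The helper name `split_lines()` and its `sep` argument are hypothetical, and the sketch only covers strings that are already in memory.

```r
# Hypothetical helper: the caller chooses the line separator, and `skip`
# is applied to the lines produced by that same split.
split_lines <- function(text, skip = 0, sep = "\n") {
  lines <- strsplit(text, sep, fixed = TRUE)[[1]]
  lines[seq_along(lines) > skip]  # also safe for skip = 0
}

text <- "001\n002\n\r003\n004"
skip2 <- split_lines(text, skip = 2)  # "\r003" "004"
skip3 <- split_lines(text, skip = 3)  # "004"
```

With this definition the "\r" stays attached to the line it precedes for every value of `skip`, on every platform.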

@hadley
Member

hadley commented Aug 1, 2023

I get the same behaviour on my mac, so it's great that we have a platform independent reprex (possibly because you're no longer saving the string to disk, which can do weird things to newlines).

library(readr)

text <- "001\n002\n\r003\n004"

read_lines(text, skip = 0)
#> [1] "001"   "002"   "\r003" "004"
read_lines(text, skip = 1)
#> [1] "002"   "\r003" "004"
read_lines(text, skip = 2)
#> [1] ""
read_lines(text, skip = 3)
#> [1] "003" "004"
read_lines(text, skip = 4)
#> [1] "004"

Created on 2023-08-01 with reprex v2.0.2

@hadley hadley added bug an unexpected problem or unintended behavior read 📖 and removed reprex needs a minimal reproducible example labels Aug 1, 2023