Inexplicable Piecewise File Reading Behavior using read_lines() #1444

Closed
hhabra opened this issue Oct 22, 2022 · 1 comment

hhabra commented Oct 22, 2022

Hello,

I have a file containing a very large number of lines (>150 million). For my purposes, I have tried to read the file a piece at a time (e.g. 10 million lines) using read_lines(), extracting the needed information before moving on to the next piece, but I'm getting incorrect results. To illustrate, I have created a small dummy file (attached) with about 30 lines. When I read this file in three pieces of 10 lines each and concatenate them, the result differs from reading all 30 lines at once.

library(readr)

conn <- file("readr_error.txt", open = "rb")
r1 <- read_lines(conn, n_max = 10)  # attempting to read lines 1-10
r2 <- read_lines(conn, n_max = 10)  # attempting to read lines 11-20
r3 <- read_lines(conn, n_max = 10)  # attempting to read lines 21-30
r4 <- c(r1, r2, r3)
close(conn)

conn <- file("readr_error.txt", open = "rb")
r5 <- read_lines(conn, n_max = 30)  # reading lines 1-30
close(conn)

# all lines from 11 to 30 are not equal
r4 == r5

On the other hand, if I use readLines() from base R, it works as intended:

conn <- file("readr_error.txt", open = "rb")
r1 <- readLines(conn, n = 10)
r2 <- readLines(conn, n = 10)
r3 <- readLines(conn, n = 10)
r4 <- c(r1, r2, r3)
close(conn)

conn <- file("readr_error.txt", open = "rb")
r5 <- readLines(conn, n = 30)
close(conn)

#all are equal
r4 == r5

Can you please explain this?
readr_error.txt

hadley (Member) commented Jul 31, 2023

Duplicate of #1494; unfortunately, R doesn't expose a connection API, so we can't do streaming reads from a connection.
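For anyone landing here with the same problem, a possible workaround is to hand readr the file path instead of an open connection, so that nothing depends on the connection's read position between calls. The following is only a minimal sketch under stated assumptions: the temporary file and its `line NN` contents are a made-up stand-in for the attached readr_error.txt, and the chunk size of 10 mirrors the example above. It uses the `skip` and `n_max` arguments of read_lines(), and alternatively read_lines_chunked() with a ListCallback.

```r
library(readr)

# Hypothetical stand-in for the attached file: 30 numbered lines in a temp file.
path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %02d", 1:30), path)

chunk_size <- 10

# Option 1: pass the path (not a connection) and move the window with `skip`.
starts <- seq(0, 20, by = chunk_size)
pieces <- lapply(starts, function(s) {
  read_lines(path, skip = s, n_max = chunk_size)
})
by_skip <- unlist(pieces)

# Option 2: let readr drive the chunking; ListCallback collects each
# callback result into a list, one element per chunk.
chunks <- read_lines_chunked(
  path,
  callback = ListCallback$new(function(x, pos) x),
  chunk_size = chunk_size
)
by_chunk <- unlist(chunks)

# Reference: the whole file in a single call.
whole <- read_lines(path)

# Both piecewise reads can be compared against the single read.
identical(by_skip, whole)
identical(by_chunk, whole)
```

Note that with Option 1 every call has to scan past the skipped lines again, so for a file with more than 150 million lines the chunked reader (or base readLines() on a connection, as in the original report) is probably the more practical choice.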

hadley closed this as completed Jul 31, 2023