
Does lazy reading actually work? #1499

Closed
abalter opened this issue Jun 16, 2023 · 2 comments

abalter commented Jun 16, 2023

The readr docs link to a blog post that says:

readr 2.0 introduced ‘lazy’ reading by default. The idea of lazy reading is that instead of reading all the data in a CSV file up front you instead read it only on-demand.

Following the docs, I made certain that I am using "edition" 2 and that lazy reading is turned on:

> readr::edition_get()
[1] 2
> options(readr.read_lazy = TRUE)
> readr::should_read_lazy()
[1] TRUE

I created a large file by stacking 1000 iterations of shuffled mtcars using this code:

library(dplyr)
library(readr)

for (i in 1:1000) {
  q <- mtcars %>% sample_frac(1)            # shuffle the rows of mtcars
  write_csv(q, "mtcars.csv", append = TRUE)  # append each shuffled copy
}

It's approximately 1.2 MB on disk:

(base) balter@expiyes:~/winhome/OneDrive/Documents$ ls -lsh mtcars.csv
1.2M -rwxrwxrwx 1 balter balter 1.2M Jun 15 22:57 mtcars.csv
(base) balter@expiyes:~/winhome/OneDrive/Documents$ ls -ls mtcars.csv
1212 -rwxrwxrwx 1 balter balter 1237281 Jun 15 22:57 mtcars.csv

I then cleared my R environment, ran gc(), and restarted R, repeating that a few times for good measure. This is the output of my last call to gc():

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 505764 27.1    1122997   60   644085 34.4
Vcells 767716  5.9    8388608   64  1650082 12.6

And this is the memory usage report from RStudio:

[screenshot: RStudio memory usage report before reading the file]

Next, I read in the mtcars x 1000 file with df <- readr::read_csv("mtcars.csv", lazy = TRUE). R tells me that the 1.2 MB file I read in "lazily" is taking up 2.8 MB in memory:

> object.size(df)
2827576 bytes
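(For scale: mtcars has 32 rows of 11 numeric columns, so 1,000 stacked copies amount to roughly 32,000 × 11 × 8 bytes ≈ 2.8 MB of plain doubles, which lines up with the figure above.)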

This is my new memory usage report:

[screenshot: RStudio memory usage report after reading the file]

What I'm seeing is that I read a file that is 1.2 MB on disk "lazily", and:

  1. The object has a size of 2.8 MB in memory according to R.
  2. The object added 19 MB to the memory RStudio reports as used by R objects.
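For reference, a minimal sketch of the side-by-side comparison I have in mind (assuming the same mtcars.csv generated above is in the working directory; I'm not asserting what the outputs would be):

lazy_df  <- readr::read_csv("mtcars.csv", lazy = TRUE,  show_col_types = FALSE)
eager_df <- readr::read_csv("mtcars.csv", lazy = FALSE, show_col_types = FALSE)

object.size(lazy_df)   # reported size with lazy reading
object.size(eager_df)  # reported size with eager reading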

abalter commented Jun 17, 2023

@jimhester

@hadley

hadley commented Jul 31, 2023

Lazy reading mostly impacts string columns, which mtcars lacks, so it's not a good example. I'd also suggest using lobstr::obj_size() to measure the size of R objects precisely.

library(readr)
path <- tempfile()

iris1k <- vctrs::vec_rbind(!!!rep(list(iris), 1000))
dim(iris1k)
#> [1] 150000      5
write_csv(iris1k, path)

lazy <- read_csv(path, lazy = TRUE, col_types = list())
eager <- read_csv(path, lazy = FALSE, col_types = list())

lobstr::obj_size(lazy)
#> 5.38 kB
lobstr::obj_size(eager)
#> 6.00 MB

Created on 2023-07-31 with reprex v2.0.2
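As a follow-up sketch (not part of the reprex above, and assuming the lazy object from that session is still available): fully accessing the string column should force the lazy ALTREP-backed vector to materialize, after which lobstr::obj_size() should report a correspondingly larger size. I haven't verified the exact numbers here.

invisible(unique(lazy$Species))  # touch every value of the character column
lobstr::obj_size(lazy)           # expected to grow once Species is materialized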
