Implementing sample_n and sample_frac #8

ColinFay · 2018-05-22T21:04:03Z

We could implement a chunk-wise sample_n / sample_frac with:

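library(tidyverse)
# Build a larger test file: 1000 stacked copies of iris (150,000 rows)
big <- rerun(1000, iris) %>% bind_rows()
path <- tempfile()
write_csv(big, path)

library(chunked)
# Proposed method: record the sample_n() call so it is applied to each chunk.
# Note: chunked:::record is accessed with `:::`, i.e. it is not exported.
sample_n.chunkwise <- function(.data, size){
  cmd <- lazyeval::lazy(sample_n(.data, size))
  chunked:::record(.data, cmd)
}

# Draw 1 row from every chunk, then gather the results
read_csv_chunkwise(path) %>%
  sample_n(1) %>%
  collect()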

That way, the sampling would be done independently within each chunk.

What do you think about that?
If it sounds like a good idea, let me know and I'll send you a PR.

edwindj · 2018-05-23T19:54:08Z

I like the idea!
A minor problem with sample_n is that it would not have the same semantics as on an in-memory data frame: it would return (number of chunks) * n rows instead of n. If we document that, I can live with it :-)
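
To make the difference in semantics concrete, here is a minimal base-R sketch of per-chunk sampling. It does not use chunked at all; it only mimics "sample inside each chunk" with split() and lapply(). The names chunk_size, n_per_chunk and frac are hypothetical and chosen for illustration only.

```r
# Illustration only: plain base R, not the chunked implementation.
set.seed(42)
big <- do.call(rbind, replicate(10, iris, simplify = FALSE))  # 1500 rows

chunk_size  <- 500   # hypothetical chunk size -> 3 chunks
n_per_chunk <- 1     # plays the role of `size` in sample_n()
frac        <- 0.1   # plays the role of `size` in sample_frac()

# Assign consecutive rows to chunks and split the data accordingly.
chunk_id <- ceiling(seq_len(nrow(big)) / chunk_size)
chunks   <- split(big, chunk_id)

# sample_n-like: draw n_per_chunk rows independently from every chunk.
sampled_n <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk[sample(nrow(chunk), n_per_chunk), , drop = FALSE]
}))

# sample_frac-like: draw a fixed fraction of rows from every chunk.
sampled_frac <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk[sample(nrow(chunk), round(frac * nrow(chunk))), , drop = FALSE]
}))

nrow(sampled_n)     # 3   = number of chunks * n_per_chunk, not 1
nrow(sampled_frac)  # 150 = frac * nrow(big), same size as a global sample_frac
```

In this sketch the sample_n-like result grows with the number of chunks, while the sample_frac-like result keeps the expected overall size, which is why only sample_n would need the documentation note.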

xiaodaigh · 2019-08-24T04:21:10Z

disk.frame has implemented sample_frac; sample_n is still pending.
