Implementing sample_n and sample_frac #8

Open
ColinFay opened this issue May 22, 2018 · 2 comments

Comments

@ColinFay

We could implement a chunk-wise sample_n / sample_frac with:

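library(tidyverse)
# Build a large test file: 1000 copies of iris (150,000 rows)
big <- rerun(1000, iris) %>% bind_rows()
path <- tempfile()
write_csv(big, path)

library(chunked)
# S3 method for chunkwise objects: record the sample_n call lazily
# so that it is applied to each chunk as it is read
sample_n.chunkwise <- function(.data, size){
  cmd <- lazyeval::lazy(sample_n(.data, size))
  # record() is an internal (unexported) chunked function, hence the :::
  chunked:::record(.data, cmd)
}

# Sample one row from every chunk of the file
read_csv_chunkwise(path) %>%
  sample_n(1) %>%
  collect()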

That way, the sampling would be done within each chunk.

What do you think about that?
If it sounds like a good idea, let me know and I'll send you a PR.

@edwindj
Owner

edwindj commented May 23, 2018

I like the idea!
A minor problem with sample_n is that it would not have the same semantics: it would return (number of chunks × n) rows instead of n. But if we document that, I can live with it :-)
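
To make the size difference concrete, here is a minimal base R sketch that simulates chunk-wise sampling without using chunked itself. The chunk size of 500 and the names `chunk_size`, `chunks`, `per_chunk_n` and `per_chunk_frac` are made up for illustration; this is not how chunked splits a file internally.

```r
# Simulate chunk-wise sampling in base R: split a data frame into chunks,
# sample within each chunk, then bind the per-chunk samples together.
set.seed(42)

big <- iris[rep(seq_len(nrow(iris)), times = 10), ]  # 1,500 rows
chunk_size <- 500                                    # hypothetical chunk size
n <- 1
frac <- 0.1

# Assign each row to a chunk: rows 1-500 -> 1, 501-1000 -> 2, ...
chunk_id <- ceiling(seq_len(nrow(big)) / chunk_size)
chunks <- split(big, chunk_id)
n_chunks <- length(chunks)

# sample_n-like behaviour: draw n rows from every chunk independently
per_chunk_n <- do.call(rbind, lapply(chunks, function(chunk) {
  chunk[sample.int(nrow(chunk), size = n), , drop = FALSE]
}))

# sample_frac-like behaviour: draw a fraction of every chunk independently
per_chunk_frac <- do.call(rbind, lapply(chunks, function(chunk) {
  k <- round(frac * nrow(chunk))
  chunk[sample.int(nrow(chunk), size = k), , drop = FALSE]
}))

nrow(per_chunk_n)     # n_chunks * n rows, not n
nrow(per_chunk_frac)  # frac * nrow(big) rows when chunks divide evenly
```

With equally sized chunks, the per-chunk sample_frac still returns the expected total number of rows, so only sample_n changes in size. In both cases the result is a sample stratified by chunk rather than a simple random sample of the whole file.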

@xiaodaigh

disk.frame has implemented sample_frac; sample_n is pending.
