memoise and jobs #95

Closed
dmi3kno opened this issue Aug 11, 2019 · 2 comments

Comments

@dmi3kno

dmi3kno commented Aug 11, 2019

I am using memoise for long and painful scraping, so I put the task into a job. Here are the contents of my job script:

my_sum <- function(x){
  cat("Adding", paste(x, collapse = ", "), "and result is ")
  sum(x)
}
mem_sum <- memoise::memoise(my_sum, cache=memoise::cache_filesystem("~/.rcache"))

x<-list(1:3, 2:6)
lapply(x,mem_sum)

If I run the job, two cache files are produced (as expected). If I rerun the same job, no new caches are created (as expected).

Now I remember that I have one more address to scrape, so I go into my job script and modify it:

# everything above is unchanged
x<-list(1:3, 2:6, 3:7)
lapply(x,mem_sum)

Three new files are created! Wait a second! memoise disregarded the previously created cache because the environment has changed.

I realize I've done something wrong, so I go back and restore things as they were, hoping memoise will pick up the old cache objects:

# everything above is unchanged
x<-list(1:3, 2:6)
lapply(x,mem_sum)

Alas, memoise creates yet another two objects, disregarding everything that happened above. Now I have at least three duplicate calls to the same function with the same arguments.
The funny part is that my local environment is, of course, completely unaware of any of that, and there's no way to get to those cached versions.

memoise::has_cache(mem_sum)(1:3)
#> FALSE

Not only that, but as far as I understand there is now also no way to "forget" or "clean" those caches, since the "Jobs" environment is impossible to reproduce.
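
Editorial sketch: one way to reach cache entries that `memoise::has_cache()` cannot see is to treat the cache directory as a folder of files. The helpers below are hypothetical (`inspect_cache` and `clear_cache` are not part of memoise), and they assume that every file in the directory is a plain RDS file, which the `.rds` listing later in this thread suggests but which is not confirmed here for every memoise version. The demo runs on a throwaway directory with two stand-in entries rather than on `~/.rcache`.

```r
# Hypothetical helpers; not part of memoise.
# Assumption: every file in `cache_dir` is a plain RDS file.

# Read every cache entry into a list named by file name (the hash).
inspect_cache <- function(cache_dir) {
  files <- list.files(cache_dir, full.names = TRUE)
  names(files) <- basename(files)
  lapply(files, readRDS)
}

# Delete every cache entry and return how many files were listed for removal.
clear_cache <- function(cache_dir) {
  files <- list.files(cache_dir, full.names = TRUE)
  unlink(files)
  length(files)
}

# Demo on a throwaway directory with two stand-in entries.
demo_dir <- file.path(tempdir(), "rcache-demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(6L, file.path(demo_dir, "entry-a"))
saveRDS(20L, file.path(demo_dir, "entry-b"))

entries <- inspect_cache(demo_dir)
removed <- clear_cache(demo_dir)
```

Clearing by hand removes every entry in the directory, so this is only appropriate when the directory is dedicated to one cache.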

Questions:

  1. Do we have to memoise with the environment signature, given that we assume the function is "pure"? (ref: "Clarify that memoise should only apply to pure function?" #57)
  2. How do I recover caches that relate to the same function but were created in a different environment? Shouldn't there be a way to decode the contents of the cache from my global environment even though the cache was created in the "job"? Maybe this is for {memoisetools} by @coolbutuseless?
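
Editorial sketch for question 1: if the function really is pure, a cache key can be built from an explicit name plus the arguments, so the environment the code runs in plays no part. This is a hand-rolled workaround and not how memoise builds its keys; `cached_call`, its `name` argument, and the demo directory are all hypothetical, and the key is computed with `digest::digest()` from the digest package.

```r
# Hypothetical helper; not part of memoise. Requires the digest package.
# The key depends only on `name` and the arguments, never on the environment.
cached_call <- function(name, f, ..., cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  key <- digest::digest(list(name, list(...)))
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the computation
  }
  result <- f(...)
  saveRDS(result, path)    # cache miss: compute, then store
  result
}

# Demo: count how often the underlying function actually runs.
n_calls <- 0
my_sum <- function(x) {
  n_calls <<- n_calls + 1
  sum(x)
}

demo_dir <- file.path(tempdir(), "explicit-key-cache")
first  <- cached_call("my_sum", my_sum, 1:3, cache_dir = demo_dir)
second <- cached_call("my_sum", my_sum, 1:3, cache_dir = demo_dir)
```

The trade-off is that nothing invalidates the cache when the function body changes; the caller has to change `name` or clear the directory by hand.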
@jimhester
Member

I assume you are talking about RStudio background jobs here, but you didn't specify what a job was.

If you put mem_sum in a package, this should still work, as the parent environments would then be the same. Alternatively, you could probably save the functions to an RDS file and load them from there in a script, and it would work that way as well.
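
Editorial sketch of the RDS suggestion above: create the memoised function once, save it with `saveRDS()`, and have both the job script and the interactive session load that same object with `readRDS()` instead of redefining it. The file and directory names are hypothetical and point at `tempdir()` for the demo. Whether the cache keys then match across separate R sessions is the expectation stated in the comment above ("probably"), not something this sketch verifies; it only shows the save and load steps within one session.

```r
library(memoise)

# Hypothetical locations for the demo; a real setup would use stable paths.
cache_dir <- file.path(tempdir(), "rcache")
fn_file   <- file.path(tempdir(), "mem_sum.rds")

# One-time setup: define and memoise the function, then save it to disk.
my_sum <- function(x) sum(x)
mem_sum <- memoise(my_sum, cache = cache_filesystem(cache_dir))
saveRDS(mem_sum, fn_file)

# In the job script and in the interactive session alike:
# load the saved function instead of redefining it.
mem_sum_loaded <- readRDS(fn_file)
result <- mem_sum_loaded(1:3)
```

The loaded object carries the cache location with it, so both sides read and write the same directory.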

I think these issues are largely tangential to the memoise package and just fall out from the way R environment inheritance works.

@ShixiangWang

Same idea and same issue. I wanted to cache the data in a background job with the future package when starting a Shiny app, but it does not work: the second run does not reuse the cache, and a different cache file is generated instead.

-rw-r--r--    1 wsx  staff   6.2M 11  1 10:01 2ad6156fe3b31593abe4cdf4f2fa6b66.rds
-rw-r--r--    1 wsx  staff   6.2M 11  1 09:59 9ffe20404ea52f5deeb50a0cb2df32e8.rds

# Preload
library(future)
plan(multisession, workers = 2)
f %<-% {
  # The first run
  dataset_load("EGAD00001008549_circRNA_Ensemble", verbose = TRUE)
}

# Wait a few seconds for the future to finish
system.time(dataset_load("EGAD00001008549_circRNA_Ensemble", verbose = TRUE))
system.time(dataset_load("EGAD00001008549_circRNA_Ensemble", verbose = TRUE))

# Start app
coco::run_app() # add parameters here (if any)
> system.time(dataset_load("EGAD00001008549_circRNA_Ensemble", verbose = TRUE)) # The cache written by the future job is not reused; the data is regenerated as a new run
   user  system elapsed 
  4.496   0.617   5.552 
> system.time(dataset_load("EGAD00001008549_circRNA_Ensemble", verbose = TRUE)) # Uses the cache written by the previous call (the second run)
   user  system elapsed 
      0       0       0 
