Description
The discussion in #3270 (@leventov) highlighted that the current cache model is sensitive to non-obvious side effects (e.g. a cache miss in one case not invalidating another).
Here, `result` will always be the same, because it uses the execution path hash to resolve `dependent`. At the very least, `make_dependent` should invalidate `use_dependent`, because there will be a hash miss.
Let's have `persistent_cache` create a `side_effect` entry associated with a cell. ExecutionPath hashes will then also consume the side-effect data created during the execution of the relevant block.
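As a rough illustration of what "consuming side-effect data" could mean, here is a minimal sketch that mixes side-effect hashes into a path hash. The function name `execution_path_hash` and its inputs are hypothetical and are not marimo's actual hashing code; the only point is that a changed side effect changes the resulting key.

```python
import hashlib
from typing import Iterable


def _feed(hasher, chunk: bytes) -> None:
    # Length-prefix each chunk so that b"ab" and (b"a", b"b") hash differently.
    hasher.update(len(chunk).to_bytes(8, "big"))
    hasher.update(chunk)


def execution_path_hash(
    code_hashes: Iterable[bytes],
    side_effect_hashes: Iterable[bytes] = (),
) -> str:
    """Hypothetical path hash: code hashes in order, then side effects."""
    hasher = hashlib.sha256()
    for code_hash in code_hashes:
        _feed(hasher, code_hash)
    hasher.update(b"|side-effects|")
    # Side effects are sorted so registration order does not matter.
    for state_hash in sorted(side_effect_hashes):
        _feed(hasher, state_hash)
    return hasher.hexdigest()
```

With this shape, a downstream block whose ancestors registered a new side effect gets a different key and therefore misses the cache, even if no code changed.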
Suggested solution
There are a couple of other places a side effect registry could come into play:
`marimo.random`: could ensure that random calls are consistent with the notebook, so that running the same cell gives you the same random number (the PRNG increment depends on the total number of calls up to that point in the DAG, and a cache skip would correspond to a jump ahead). As a side note, I think JAX has a pretty nice model: https://docs.jax.dev/en/latest/random-numbers.html
`marimo.cache_timeout`: the reason I thought of this again. A user on Discord reported that they would like their SQL queries to be cached until some timeout. This could be done through some form of cache cleanup, but issuing a side effect is a neat way to achieve this as well. Alternatively, `mo.sql` calls could issue a side effect behind the scenes.
`marimo.fetch`: get a network value. Could also have a time expiry.
`marimo.file`: a file change watcher issues a side effect.
`marimo.env`: environment variable access.
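To make a few of these concrete, here is a hedged sketch of how a state hash might be derived for the timeout, environment, and file cases. None of these helpers exist in marimo; `timeout_state`, `env_state`, and `file_state` are hypothetical names, and the timeout bucket follows the `(now() - original_time) // time_interval` idea from this proposal.

```python
import hashlib
import os
import time
from typing import Optional


def _digest(label: str, value: object) -> bytes:
    # repr() keeps None, "" and other values distinguishable.
    return hashlib.sha256(f"{label}:{value!r}".encode()).digest()


def timeout_state(
    original_time: float, interval: float, now: Optional[float] = None
) -> bytes:
    """State changes once per elapsed `interval` seconds."""
    current = time.time() if now is None else now
    bucket = int((current - original_time) // interval)
    return _digest("timeout", bucket)


def env_state(name: str) -> bytes:
    """State changes when the environment variable changes (or is unset)."""
    return _digest(f"env:{name}", os.environ.get(name))


def file_state(path: str) -> bytes:
    """State changes when the file's modification time or size changes."""
    stat = os.stat(path)
    return _digest(f"file:{path}", (stat.st_mtime_ns, stat.st_size))
```

A timeout expressed this way needs no background cleanup: once the bucket number rolls over, the state hash differs and the old cache entry is no longer matched.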
These could also all be namespaced under sideeffects:
class SideEffectRegistry:
    def __init__(self) -> None:
        # Maps each cell to the set of side-effect state hashes it has issued.
        self.namespaces: dict[CellId_t, set[bytes]] = {}

    def register(self, cell_id: CellId_t, state_hash: bytes) -> None:
        # state_hash is somehow tied to the "side effect".
        # For instance, in random, it might be the random state;
        # timeout might be state = (now() - original_time) // time_interval
        if cell_id not in self.namespaces:
            self.namespaces[cell_id] = set()
        self.namespaces[cell_id].add(state_hash)

    def delete(self, cell_id: CellId_t) -> None:
        """Called when cells get cleaned up in CellRunnerKernel"""
        if cell_id in self.namespaces:
            del self.namespaces[cell_id]
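For a sense of the lifecycle, the following is a self-contained usage sketch. `MiniSideEffectRegistry` is a hypothetical compact stand-in that uses plain `str` cell ids instead of `CellId_t`, and `states` is an added helper (not part of the proposal) that returns the hashes in a stable order so they could be fed into a cache key.

```python
from collections import defaultdict


class MiniSideEffectRegistry:
    """Hypothetical stand-in for the registry, for illustration only."""

    def __init__(self) -> None:
        self.namespaces: dict[str, set[bytes]] = defaultdict(set)

    def register(self, cell_id: str, state_hash: bytes) -> None:
        self.namespaces[cell_id].add(state_hash)

    def states(self, cell_id: str) -> list[bytes]:
        # .get avoids creating an empty entry for unknown cells.
        return sorted(self.namespaces.get(cell_id, set()))

    def delete(self, cell_id: str) -> None:
        self.namespaces.pop(cell_id, None)


registry = MiniSideEffectRegistry()
registry.register("cell-a", b"state-2")
registry.register("cell-a", b"state-1")
registry.register("cell-a", b"state-1")  # duplicates collapse in the set
states_before = registry.states("cell-a")

registry.delete("cell-a")  # e.g. when the cell is cleaned up
states_after = registry.states("cell-a")
```

One design question this leaves open is whether stale states (for example, old timeout buckets) should be replaced rather than accumulated in the set.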
Alternative
The side effect registry works just behind the scenes, but exposing some of the functionality through our API should allow for more reliable usage of cache invalidation.
Additional context
I have a branch from a few weeks back where I started casually hacking on this.
For the caching timeline I see:
base registry + maybe some of these "extras"
data "adapters" (i.e. reading from Redis or buckets as opposed to just vanilla file systems)
async/sync executor + dataflow.Runner refactor
cell level caching executor
as completing the first bit of this story. A full white paper should go on to examine:
mandala
diskcache
joblib.Memory
functools.cache
comparisons where applicable