Side effect apis + registry #3876

Open
dmadisetti opened this issue Feb 21, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@dmadisetti
Collaborator

Description

The discussion in #3270 (@leventov) highlighted that the current cache model is sensitive to non-obvious side effects (e.g. a cache miss in one block not invalidating another).

A simple case follows:

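import marimo as mo
from random import random

i = random()
with mo.persistent_cache("make_dependent"):
    # UnserializableObj stands in for any object that cannot be serialized
    dependent = UnserializableObj(i)

with mo.persistent_cache("use_dependent"):
    result = dependent.action()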

Here, result will always be the same, because the cache uses the execution path hash to resolve dependent. At the very least, a hash miss in make_dependent should invalidate use_dependent.

Let's have persistent_cache create a side_effect entry associated with a cell. ExecutionPath hashes will then also consume the side-effect data created during the execution of the relevant block.

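class _cache_context:
    # ...
    def __exit__(self, exc_type, exc_value, traceback):
        # ...
        # Record the cache hash as a side effect of the cell that ran the block.
        context.side_effect_registry.register(cell_id, cache.hash)


def hash_and_dequeue_execution_refs(self, ...):
    # ...
    side_effects = set()
    for ancestor_id in to_hash:
        side_effects |= context.side_effect_registry.get(ancestor_id)
    # ...
    # Hash objects only accept bytes (this assumes registry entries are bytes),
    # so feed the entries one at a time in a stable order.
    for effect in sorted(side_effects):
        self.hash_alg.update(effect)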
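
To make the intended behaviour concrete, here is a minimal standalone sketch of an execution-path hash that also consumes side effects. It does not use marimo internals: ToyRegistry and execution_path_hash are hypothetical names, cell ids are plain strings, and the "path" is just the source text of each ancestor cell.

```python
import hashlib


class ToyRegistry:
    """Hypothetical stand-in for context.side_effect_registry."""

    def __init__(self) -> None:
        self._effects: dict[str, set[bytes]] = {}

    def register(self, cell_id: str, state_hash: bytes) -> None:
        self._effects.setdefault(cell_id, set()).add(state_hash)

    def get(self, cell_id: str) -> set[bytes]:
        return self._effects.get(cell_id, set())


def execution_path_hash(ancestor_sources: dict[str, str], registry: ToyRegistry) -> str:
    """Hash ancestor cell sources plus any side effects those cells issued."""
    hasher = hashlib.sha256()
    side_effects: set[bytes] = set()
    # Sort cell ids so the result does not depend on dict ordering.
    for cell_id in sorted(ancestor_sources):
        hasher.update(ancestor_sources[cell_id].encode())
        side_effects |= registry.get(cell_id)
    # Sets are unordered, so sort before hashing; the length prefix keeps
    # (b"ab", b"c") distinct from (b"a", b"bc").
    for effect in sorted(side_effects):
        hasher.update(len(effect).to_bytes(8, "big"))
        hasher.update(effect)
    return hasher.hexdigest()
```

With this shape, a new side effect registered by an ancestor changes the hash of every downstream block, which is the invalidation that the make_dependent / use_dependent case is missing.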

Suggested solution

There are a couple of other places where a side effect registry could come into play:

marimo.random: could ensure that random calls are consistent with the notebook, so that running the same cell gives you the same random number. The PRNG increment would depend on the total number of calls up to that point in the DAG, and a cache skip would be associated with a jump ahead. As a side note, I think JAX has a pretty nice model: https://docs.jax.dev/en/latest/random-numbers.html
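
A rough sketch of the call-counting idea, using only the standard library. CountedRandom is a hypothetical name and not a proposed API; it assumes each draw is derived from (seed, call index), loosely in the spirit of JAX's explicit keys, so that a cache hit can "jump ahead" by the number of draws the skipped block would have made.

```python
import hashlib
import random


class CountedRandom:
    """Hypothetical counter-based random source: draw k depends only on (seed, k)."""

    def __init__(self, seed: int) -> None:
        self.seed = seed
        self.calls = 0  # total draws consumed so far, including skipped ones

    def _rng_for(self, index: int) -> random.Random:
        # Derive an independent generator for this call index.
        digest = hashlib.sha256(f"{self.seed}:{index}".encode()).digest()
        return random.Random(int.from_bytes(digest[:8], "big"))

    def random(self) -> float:
        value = self._rng_for(self.calls).random()
        self.calls += 1
        return value

    def jump_ahead(self, n: int) -> None:
        """Account for n draws that a cache hit skipped."""
        if n < 0:
            raise ValueError("n must be non-negative")
        self.calls += n

    def state_hash(self) -> bytes:
        """State that could be registered as this cell's side effect."""
        return hashlib.sha256(f"{self.seed}:{self.calls}".encode()).digest()
```

Because the counter advances the same way on a cache hit as on a real run, later cells see the same numbers either way.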

marimo.cache_timeout: the reason I thought of this again. A user on Discord reported that they would like their SQL queries to be cached until some timeout. This could be done through some form of cache cleanup, but issuing a side effect is a neat way to achieve it as well. Alternatively, mo.sql calls could issue a side effect behind the scenes.

marimo.fetch: get a network value. Could also have a time-based expiry.

marimo.file: a file change watcher issues a side effect.

marimo.env: environment variable access.

These could also all be namespaced under sideeffects:

from marimo.sideeffects import random, fetch, file, env, timeout

cat marimo/_runtime/side_effects.py

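from __future__ import annotations

# Import path is an assumption; adjust to wherever CellId_t lives.
from marimo._ast.cell import CellId_t


class SideEffectRegistry:
    def __init__(self) -> None:
        # Maps each cell to the state hashes of the side effects it issued.
        self.namespaces: dict[CellId_t, set[bytes]] = {}

    def register(self, cell_id: CellId_t, state_hash: bytes) -> None:
        # state_hash is somehow tied to the "side effect".
        # For instance, in random, it might be the random state;
        # for timeout, it might be (now() - original_time) // time_interval.
        self.namespaces.setdefault(cell_id, set()).add(state_hash)

    def get(self, cell_id: CellId_t) -> set[bytes]:
        # Lookup used when hashing ancestors; cells without side effects
        # yield an empty set.
        return self.namespaces.get(cell_id, set())

    def delete(self, cell_id: CellId_t) -> None:
        """Called when cells get cleaned up in CellRunnerKernel."""
        self.namespaces.pop(cell_id, None)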
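
The timeout comment in register can be made concrete with a small helper. This is only a sketch: timeout_state_hash is a hypothetical name, and times are assumed to be plain seconds (e.g. from time.time()).

```python
import hashlib


def timeout_state_hash(now: float, original_time: float, time_interval: float) -> bytes:
    """State hash that stays constant within one interval and changes at each boundary."""
    if time_interval <= 0:
        raise ValueError("time_interval must be positive")
    # Same formula as the comment: (now() - original_time) // time_interval
    bucket = int((now - original_time) // time_interval)
    return hashlib.sha256(str(bucket).encode()).digest()
```

Registering this value on every run means the side effect set is unchanged until the interval elapses, after which the new bucket yields a different hash and dependent caches miss.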

Alternative

The side effect registry could stay entirely behind the scenes, but exposing some of the functionality through our API should allow for more reliable use of cache invalidation.

Additional context

I have a branch from a few weeks back where I started casually hacking on this.

For the caching timeline, I see the following as completing the first bit of this story:

  • base registry + maybe some of these "extras"
  • data "adapters" (i.e. reading from Redis or buckets as opposed to just vanilla file systems)
  • async/sync executor + dataflow.Runner refactor
  • cell-level caching executor

A full white paper should go on to examine comparisons, where applicable, with:

  • mandala
  • diskcache
  • joblib.Memory
  • functools.cache

@dmadisetti dmadisetti added the enhancement New feature or request label Feb 21, 2025