Persistent cache with a polars dataframe #2661

Closed
AdrienDart opened this issue Oct 18, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@AdrienDart

Describe the bug

Hi,

I'm trying to save a polars dataframe with the persistent cache using the following code:

import marimo as mo
import polars as pl
from vega_datasets import data

df = data.iris().pipe(pl.from_pandas)
with mo.persistent_cache('my_cache'):
    df1 = df

I get TypeError("Cannot change data-type for object array.") (sorry, I can't post the whole traceback; the error is raised at line 217 in data_to_buffer in hash.py).
Is that expected?

A workaround that works is:

df = df.lazy()  # converting to a LazyFrame sidesteps the error
with mo.persistent_cache('my_cache'):
    df1 = df.collect()  # collect back to an eager DataFrame inside the cached block

Thanks,

Adrien

Environment

Marimo 0.9.10

Code to reproduce

See above.

@AdrienDart AdrienDart added the bug Something isn't working label Oct 18, 2024
@dmadisetti dmadisetti self-assigned this Oct 18, 2024
@dmadisetti
Collaborator

No, this is not expected; it looks like a bug. marimo should detect whether the object is serializable in the way it expects, and this exception is thrown when there's a discrepancy. There's a bit of dataframe-checking logic under the hood, so this might be solved by moving that logic to narwhals.

Thanks for the easily reproducible code. In the meantime, you may be able to work around this by defining df in a separate cell, as sketched below.
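For example, something like this (a minimal sketch of the suggestion above; cell boundaries are shown as comments, since marimo cells are separate blocks in the editor, and whether this avoids the failing hashing path depends on marimo internals):

# Cell 1: build the dataframe outside the cached block
import polars as pl
from vega_datasets import data

df = data.iris().pipe(pl.from_pandas)

# Cell 2: only the cached assignment lives here
import marimo as mo

with mo.persistent_cache('my_cache'):
    df1 = df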

@AdrienDart
Author

Also, a quick question: I notice the cached dataframe is saved as a pickle. Could it be saved as parquet for better performance/memory usage? Thanks for your help!

@dmadisetti
Collaborator

Sure, I don't think any given file format should replace pickle, but maybe we'll expose a setting to choose a "loader" type.

Here's the pickle loader for your reference; I don't think it'd be too tricky to implement for any given storage type:

https://github.com/marimo-team/marimo/blob/main/marimo/_save/loaders/pickle.py

A couple of other thoughts were npz, dill, and remote cache.

If you did want to play with this, the undocumented keyword arg _loader would let you inject a loader instance. You can see how we do this in testing:

name="one", _loader=MockLoader(data={"X": 7, "Y": 8})

akshayka pushed a commit that referenced this issue Jan 22, 2025
This has fixes for:

 - [x] Shadowed arguments
 - [x] Formatting causing issues with context block: #2633
 - [x] Improved df "object detection": #2661

Follow-up PR changes:

- Detect when execution hash relies on another hash object (cache breaking) (#3270)
- Allow for pickle hash as fallback for "unhashable" variables (#3270)
- Expand `@persistent_cache` API (this shouldn't cache bust, so I might just follow up) (#2653)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@dmadisetti
Collaborator

Closed by #3480

Thanks for reporting these!
