Persistent cache with a polars dataframe #2661

Closed
AdrienDart opened this issue Oct 18, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@AdrienDart

Describe the bug

Hi,

I'm trying to save a polars dataframe with the persistent cache using the following code:

import marimo as mo
import polars as pl
from vega_datasets import data

df = data.iris().pipe(pl.from_pandas)
with mo.persistent_cache('my_cache'):
    df1 = df

I get TypeError("Cannot change data-type for object array.") (sorry, I can't post the whole traceback; the error is raised at line 217 in data_to_buffer in hash.py).
Is that expected?

A workaround that works is:

df = df.lazy()  # converting to a LazyFrame sidesteps the error
with mo.persistent_cache('my_cache'):
    df1 = df.collect()  # collect back to an eager DataFrame inside the cached block

Thanks,

Adrien

Environment

Marimo 0.9.10

Code to reproduce

See above.

@AdrienDart AdrienDart added the bug Something isn't working label Oct 18, 2024
@dmadisetti dmadisetti self-assigned this Oct 18, 2024
@dmadisetti
Collaborator

No, this is not expected; it looks like a bug. marimo should detect whether the object is serializable in the way it expects, and this exception is thrown when there's a discrepancy. There's a bit of dataframe-checking logic under the hood, so this might be solved by moving that logic to narwhals.

Thanks for the easily reproducible code. In the meantime, you may be able to work around this by defining df in a separate cell, as sketched below.
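For example, something like this (a minimal sketch of the suggestion above; cell boundaries are shown as comments, since marimo cells are separate blocks in the editor, and whether this avoids the failing hashing path depends on marimo internals):

# Cell 1: build the dataframe outside the cached block
import polars as pl
from vega_datasets import data

df = data.iris().pipe(pl.from_pandas)

# Cell 2: only the cached assignment lives here
import marimo as mo

with mo.persistent_cache('my_cache'):
    df1 = df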

@AdrienDart
Author

Also, a quick question: I notice the cached dataframe is saved as a pickle. Could it be saved as parquet for better performance/memory usage? Thanks for your help!

@dmadisetti
Collaborator

Sure, I don't think any given file format should replace pickle, but maybe we'll expose a setting to choose a "loader" type.

Here's the pickle loader for your reference; I don't think it'd be too tricky to implement for any given storage type:

https://github.com/marimo-team/marimo/blob/main/marimo/_save/loaders/pickle.py

A couple of other thoughts were npz, dill, and remote cache.

If you did want to play with this, the undocumented keyword arg _loader would let you inject a loader instance. You can see how we do this in testing:

name="one", _loader=MockLoader(data={"X": 7, "Y": 8})

akshayka pushed a commit that referenced this issue Jan 22, 2025
This has fixes for:

 - [x] Shadowed arguments
 - [x] Formatting causing issues with context block: #2633
 - [x] Improved df "object detection": #2661

Follow-up PR changes:

- Detect when execution hash relies on another hash object (cache breaking) (#3270)
- Allow for pickle hash as fallback for "unhashable" variables (#3270)
- Expand `@persistent_cache` API (this shouldn't cache bust, so I might just follow up) (#2653)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@dmadisetti
Collaborator

Closed by #3480

Thanks for reporting these!
