Skip to content

Conversation

@jsignell
Copy link
Member

@jsignell jsignell commented Dec 30, 2025

Proposal

A new chunks option that is only allowed to use encoded chunks or multiples of them. No chunk splitting allowed.

Demo

Current behavior when chunks="auto"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="auto")
image

This PR introduces a new option: chunks="preserve"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="preserve")
image

Context

I originally set out to update the auto_chunks function in dask, but it felt like my goals were actually quite different. The goal of the dask auto_chunks function is to guarantee that the chunksize will be under a configurable limit while preserving the aspect ratio of previous_chunks (previous_chunks == encoding). This PR instead guarantees that encoded chunks are never split but it will multiply them by some factor to try to get the chunksize close to a targetsize. It doesn't try to preserve the aspect ratio of the chunks. Instead it goes after the dim where there is the greatest number of chunks and it tries to take those in bigger bites.

Also:

  • I'm not quite sure how the interface should work and I am definitely not attached to the word "preserve".
  • I originally put this in the DaskManager, but moved it to the base ChunkManagerEntrypoint class once I realized there was nothing dasky about it. I'm not sure if there is really supposed to be logic in methods on that class though.

@github-actions github-actions bot added topic-backends io topic-NamedArray Lightweight version of Variable labels Dec 30, 2025
* Move ``preserve_chunks`` to base ChunkManager class
* Get target size from dask config options for DaskManager
* Add test for open_zarr
@github-actions github-actions bot added topic-testing topic-hypothesis Strategies or tests using the hypothesis library labels Dec 31, 2025
({"x": "preserve", "y": -1}, (160, 500)),
],
)
def test_open_dataset_chunking_zarr_with_preserve(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are kind of slow.

@jsignell jsignell marked this pull request as ready for review December 31, 2025 18:46
@jsignell jsignell linked an issue Dec 31, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

io topic-backends topic-hypothesis Strategies or tests using the hypothesis library topic-NamedArray Lightweight version of Variable topic-testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opening a zarr dataset taking so much time with dask

1 participant