
obstore-based Store implementation #1661

Merged
merged 125 commits into zarr-developers:main
Mar 24, 2025

Conversation

kylebarron
Contributor

@kylebarron kylebarron commented Feb 8, 2024

A Zarr store based on obstore, a Python library that uses the Rust object_store crate under the hood.

object-store is a Rust crate for interoperating with remote object stores like S3, GCS, and Azure. See the highlights section of its docs.

obstore maps async Rust functions to async Python functions and can stream GET and LIST requests, all of which makes it a good candidate for the Zarr v3 Store protocol.

You should be able to test this branch with the latest version of obstore:

pip install --upgrade obstore

TODO:

  • Examples
  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jhamman
Member

jhamman commented Feb 8, 2024

Amazing @kylebarron! I'll spend some time playing with this today.

@kylebarron
Contributor Author

With roeap/object-store-python#9 it should be possible to fetch multiple ranges within a file concurrently with range coalescing (using get_ranges_async). Note that this object-store API accepts multiple ranges within one object, which is still not 100% aligned with the Zarr get_partial_values because that allows fetches across multiple objects.

That PR also adds a get_opts function which now supports "offset" and "suffix" ranges, of the sort Range:N- and Range:-N, which would allow removing the raise NotImplementedError on line 37.
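For illustration, the coalescing that a get_ranges-style API performs can be sketched in pure Python (the real merging happens inside the Rust object-store crate; the function name and gap threshold here are made up):

```python
def coalesce_ranges(ranges, max_gap=1024 * 1024):
    """Merge byte ranges whose gap is at most ``max_gap`` so they can
    be fetched in fewer requests. Ranges are (start, end) half-open
    pairs. Illustrative only; object-store implements this in Rust."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Three requested ranges collapse into two fetches with a 100-byte gap limit.
print(coalesce_ranges([(0, 10), (50, 60), (5000, 6000)], max_gap=100))
# [(0, 60), (5000, 6000)]
```

Coalescing trades a little over-read (the bytes in the gaps) for far fewer round trips, which is usually a good deal against high-latency object stores.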

@martindurant
Member

martindurant/rfsspec#3

@normanrz
Member

Great work @kylebarron!
What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

@martindurant
Member

What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I suggest we see whether it makes any improvements first, so it's the author's choice for now.

@kylebarron
Contributor Author

While @rabernat has seen some impressive perf improvements in some settings when making many requests with Rust's tokio runtime, which would possibly also trickle down to a Python binding, the biggest advantage I see is improved ease of use in installation.

A common hurdle I've seen is dependency management, especially around boto3, aioboto3, and similar packages. Their versions need to be compatible at runtime with every other library in the user's environment, and Python doesn't allow multiple versions of the same dependency in one environment. With a Python library wrapping a statically-linked Rust binary, you can drop all Python dependencies, and with them this whole class of problems.

The underlying Rust object-store crate is stable and under open governance via the Apache Arrow project. We'll just have to wait on some discussion in object-store-python for exactly where that should live.

I don't have an opinion myself on where this should live, but it should be on the order of 100 lines of code wherever it is (unless the v3 store api changes dramatically)

@jhamman
Member

jhamman commented Feb 12, 2024

I suggest we see whether it makes any improvements first, so it's the author's choice for now.

👍

What are everybody's thoughts on having this in zarr-python vs. spinning it out as a separate package?

I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd party package. That said, I like having a few additional stores in the mix as we develop the store interface since it helps us think about the design more broadly.

@martindurant
Member

A common hurdle I've seen is handling dependency management, especially around boto3, aioboto3, etc dependencies.

This is no longer an issue, s3fs has much more relaxed deps than it used to. Furthermore, it's very likely to be already part of an installation environment.

@normanrz
Member

I want to keep an open mind about what the core stores provided by Zarr-Python are. My current thinking is that we should just do a MemoryStore and a LocalFilesystemStore. Everything else can be opt-in by installing a 3rd party package.

I agree with that. I think it is beneficial to keep the number of dependencies of core zarr-python small. But I am open to discussion.

That said, I like having a few additional stores in the mix as we develop the store interface since it helps us think about the design more broadly.

Sure! That is certainly useful.

@jhamman jhamman added the V3 label Feb 13, 2024
@itsgifnotjiff

This is awesome work, thank you all!!!

@kylebarron
Contributor Author

The object-store-python package is not very well maintained (roeap/object-store-python#24), so I took a few days to implement my own wrapper around the Rust object_store crate: https://github.com/developmentseed/object-store-rs

I'd like to update this PR soonish to use that library instead.

@martindurant
Member

If the zarr group prefers object-store-rs, we can move it into the zarr-developers org, if you like. I would like to be involved in developing it, particularly if it can grow more explicit fsspec-compatible functionality.

@kylebarron
Contributor Author

kylebarron commented Oct 22, 2024

I have a few questions because the Store API has changed a bit since the spring.

  • There's a new BufferPrototype object. Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol, but doesn't need to copy the buffer into Python memory.
  • Similarly for puts: is Buffer guaranteed to implement the buffer protocol? Unlike fetching, we can't do zero-copy puts right now with object-store.

I like that list now returns an AsyncGenerator. That aligns well with the underlying object-store Rust API, but for technical reasons we can't expose that as an async iterable to Python yet (apache/arrow-rs#6587), even though we do expose the readable stream to Python as an async iterable.
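The zero-copy point above rests on the Python buffer protocol: any object exposing it can be wrapped in a memoryview without copying, which is what a RustBuffer would do for Rust-owned memory. A pure-Python illustration of the borrowing semantics:

```python
# A memoryview over a bytearray shares the underlying memory, so no
# bytes are copied; mutating the source is visible through the view.
# The same mechanism would let a RustBuffer hand Rust-owned memory to
# Python without a copy.
data = bytearray(b"zarr chunk bytes")
view = memoryview(data)

assert view.obj is data             # the view borrows, it does not copy
data[0:4] = b"ZARR"                 # mutate the source...
assert bytes(view[0:4]) == b"ZARR"  # ...and the view sees the change

# Slicing a memoryview is also zero-copy:
sub = view[5:10]
assert bytes(sub) == b"chunk"
```

Whether the store can actually return such a view depends on who picks the buffer class, which is the BufferPrototype question above.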

@TomAugspurger
Contributor

Is the BufferPrototype chosen by the store implementation or the caller? It would be very nice if this prototype could be chosen by the store implementation, because then we could return a RustBuffer object that implements the Python buffer protocol, but doesn't need to copy the buffer into Python memory.

This came up in the discussion at https://github.com/zarr-developers/zarr-python/pull/2426/files/5e0ffe80d039d9261517d96ce87220ce8d48e4f2#diff-bb6bb03f87fe9491ef78156256160d798369749b4b35c06d4f275425bdb6c4ad. By default, it's passed as default_buffer_prototype though I think the user can override at the call site or globally.

Does it look compatible with what you need?
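The dispatch being discussed can be sketched with stand-in classes (all names here are hypothetical, not the real zarr-python API): the caller passes a prototype, and the store wraps fetched bytes in whatever buffer class that prototype names, so a Rust-backed store could plug in a zero-copy buffer type without callers changing.

```python
import asyncio
from dataclasses import dataclass


class PlainBuffer:
    """Stand-in for a buffer class a caller might request (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def from_bytes(cls, data: bytes) -> "PlainBuffer":
        return cls(data)


@dataclass
class BufferPrototypeSketch:
    """Bundles the buffer class stores should use for returned data."""

    buffer: type


async def store_get(raw: bytes, prototype: BufferPrototypeSketch):
    # The store fetches raw bytes, then wraps them in the caller-chosen
    # buffer class; the store never hard-codes a concrete buffer type.
    return prototype.buffer.from_bytes(raw)


proto = BufferPrototypeSketch(buffer=PlainBuffer)
buf = asyncio.run(store_get(b"chunk", proto))
print(type(buf).__name__, buf.data)  # PlainBuffer b'chunk'
```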

Contributor

@TomAugspurger TomAugspurger left a comment


Thanks. One last question about the dependency in the pyproject.toml.

@kylebarron
Contributor Author

Now I'm just trying to get the tests to pass (re #1661 (comment)) and we should be good. (I can't get the tests to pass locally anyway; I get botocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) on all the fsspec tests.)

@kylebarron
Contributor Author

In a3afa44 (#1661) I added intersphinx support, which allows automatic interlinking with the obstore docs. But reStructuredText drives me insane, so I hope those docs are good enough.

@TomAugspurger
Contributor

Planning to merge this tomorrow if there aren't any objections.

@TomAugspurger TomAugspurger merged commit 9e8b50a into zarr-developers:main Mar 24, 2025
30 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Zarr-Python - 3.0 Mar 24, 2025
@TomAugspurger
Contributor

Thanks for the great work everyone!

@kylebarron kylebarron deleted the kyle/object-store branch March 24, 2025 17:10
@kylebarron
Contributor Author

kylebarron commented Mar 24, 2025

Thanks all! Just published obstore 0.6, which adds easier, automatic-token-refreshing integration with Planetary Computer. And I was able to get their zarr example working with this latest main!

import matplotlib.pyplot as plt
import pystac_client
import xarray as xr
from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider
from obstore.store import AzureStore
from zarr.storage import ObjectStore

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/"
)
collection = catalog.get_collection("daymet-daily-hi")
asset = collection.assets["zarr-abfs"]

# The PlanetaryComputerCredentialProvider automatically fetches Planetary
# Computer SAS tokens as necessary and refreshes them before they expire
credential_provider = PlanetaryComputerCredentialProvider.from_asset(asset)
azure_store = AzureStore(credential_provider=credential_provider)
zarr_store = ObjectStore(azure_store, read_only=True)
ds = xr.open_dataset(zarr_store, consolidated=True, engine="zarr")

fig, ax = plt.subplots(figsize=(12, 12))
ds.sel(time="2009")["tmax"].mean(dim="time").plot.imshow(ax=ax, cmap="inferno")
fig
The accompanying pyproject.toml (managed with uv):
[project]
name = "zarr-obstore-pc"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "matplotlib>=3.10.1",
    "obstore>=0.6.0",
    "pystac-client>=0.8.6",
    "xarray>=2025.3.0",
    "zarr",
]

[tool.uv.sources]
zarr = { git = "https://github.com/zarr-developers/zarr-python" }

[dependency-groups]
dev = [
    "ipykernel>=6.29.5",
]

@jhamman
Member

jhamman commented Mar 24, 2025

Huge props to @kylebarron and @maxrjones for sticking with this PR and getting it in! We'll get this out as part of Zarr 3.1.

👏 👏 👏 👏 👏

@ilan-gold
Contributor

ilan-gold commented Mar 28, 2025

Hi, this PR is very exciting! I'm curious: is the performance expected to be better than fsspec for similar operations (fetching from remote, etc.)? It would be good to highlight why to choose this store specifically. I would love to understand!

EDIT: I see #1661 (comment) - it could be great to highlight this work in the docs!

@kylebarron
Contributor Author

Yes, I expect it to be significantly faster, but we don't have rigorous benchmarks yet. I'd love to see some Zarr benchmarks, and then maybe we can update the docs to reflect those.

@itsgifnotjiff

I am not sure if this is within the scope of your benchmarking, but if you can test single-point query times and performance for Zarr stores in the 100 TB range, that would be great. Zarr v2 had problems with both the number of inodes required and performance, in my experience.

The groups/tree addition, along with the explosion of large-scale data, means Zarr stores are either already performant enough or not performant at all, depending on the use case (geospatial in mine).

@kylebarron
Contributor Author

I don't personally use Zarr much, so ideally I want to enable other people to do benchmarking. But happy to pair or support in any way I can.

@itsgifnotjiff

Makes perfect sense. I hope I get to benchmark it later this year. I will link my potential findings here as well 😊. Thank you so much for your work.

@maxrjones
Member

I am not sure if this is within the scope of your benchmarking, but if you can test single-point query times and performance for Zarr stores in the 100 TB range, that would be great. Zarr v2 had problems with both the number of inodes required and performance, in my experience.

The groups/tree addition, along with the explosion of large-scale data, means Zarr stores are either already performant enough or not performant at all, depending on the use case (geospatial in mine).

Hey @itsgifnotjiff, Davis Bennett wrote a great blog post for Earthmover that explains the general improvements in Zarr V3 when opening datasets in the 100 TB range (opening accounts for much of the time of single-point queries) - you can read that here. The obstore store offers further improvements, as shown below. Full details are available in https://github.com/maxrjones/zarr-obstore-performance.

[Benchmark figures: zarr load performance comparison; xarray query performance comparison; xarray open performance comparison]

@itsgifnotjiff

Thank you very much for this. I can't wait to see if these kinds of performance improvements also apply to pseudo-zarrs (zarrs backed by our binary files).

@TomNicholas
Member

@itsgifnotjiff what are "pseudo-zarrs"? Is it similar to a virtual Zarr? https://github.com/zarr-developers/VirtualiZarr

@itsgifnotjiff

Yes, I am trying to see if I can create Icechunk arrays and/or Zarr stores for petabytes of binary-format data.

I work with Environment and Climate Change Canada, where we have a wonderful binary format for NWP model outputs. Like all the organisations I've talked to, we cannot abandon it, but if we can build on top of it... (a bit like gribjump from ECMWF, or even some slides from Icechunk).

@TomNicholas
Member

like all the organisations I've talked to we can not abandon it but if we can build on top of it

Yes that's exactly the problem VirtualiZarr was built to solve.

even some slides from Icechunk

Those slides are referring to VirtualiZarr, which has facility for writing "virtual" zarr chunks into Icechunk (see virtualizarr docs or icechunk docs).

gribjump from ECMWF

Interesting - I hadn't heard of this.

But this issue is closed - @itsgifnotjiff let's continue this discussion on the VirtualiZarr repo - perhaps on this issue (or feel free to open a new one).
