Zarr Python v3 and Zarr v3 #355

Open
Jan-Willem opened this issue Feb 4, 2025 · 3 comments

Comments

@Jan-Willem
Member

Currently, XRADIO uses Python Zarr v2 and Zarr format v2. Python Zarr v3 was recently released (https://zarr.dev/blog/zarr-python-3-release/) and offers full support for Zarr format v3, but it requires Python >= 3.11. This issue looks into what changes would be needed to support Python Zarr v3 and Zarr format v3. Note that zarr_version=3 is required for all the to_zarr calls.

@Jan-Willem Jan-Willem added the enhancement New feature or request label Feb 4, 2025
@FedeMPouzols
Collaborator

Here is a summary of the current situation with zarr-python (my best understanding; I think it is ~90% correct, but subject to uncertainty and to changes throughout the last/next few weeks).

Short version:

  1. It seems we can safely update 'zarr>=2,<3' => 'zarr>=3,<4' (in practice 'zarr>=2,<4' for as long as we support Python 3.10) without too many or too complicated changes.
  2. zarr-python 3.0.x with zarr_format=2 works well after a few changes in xradio: MSv4s can be read and written without issues and all tests pass. With zarr_format=3, however, there are backwards compatibility issues in the specs, especially around data types (strings).
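
While both zarr-python 2.x (Python 3.10) and 3.x (Python >= 3.11) are allowed by 'zarr>=2,<4', the code has to branch on the installed major version. A minimal sketch of such a check, using only the standard library; the function name `zarr_major_version` is hypothetical and not part of xradio or zarr:

```python
from importlib.metadata import PackageNotFoundError, version


def zarr_major_version(version_string=None):
    """Return the major version of zarr-python as an int, or None if not installed.

    Hypothetical helper (not part of xradio or zarr). If `version_string` is
    given it is parsed directly, which keeps the function easy to test.
    """
    if version_string is None:
        try:
            version_string = version("zarr")
        except PackageNotFoundError:
            return None
    # "3.0.2" -> 3, "2.18.3" -> 2
    return int(version_string.split(".")[0])
```

A caller could then pick the encoding style with something like `if zarr_major_version() == 3: ...`, keeping the v2 code path for Python 3.10.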

Longer version:

1.1 The API changes in zarr-python 3 seem to have affected only the format of the encoding dict passed to the xarray Zarr backend, although in a somewhat flaky way. Also, the codecs are now imported from zarr.codecs rather than from numcodecs. All other changes seem to be handled properly at the xarray Zarr backend layer (requires xarray>=2025.1.2). With that fixed, we can write/read using zarr-python 3, with zarr_format=2.

1.2 With zarr_format=3, strings are not currently supported in the Zarr specs ("not yet ported").
From https://zarr.readthedocs.io/en/latest/user-guide/v3_migration.html:

The following features that were supported by Zarr-Python 2 have not been ported to Zarr-Python 3 yet:
...
- Fixed-length string dtypes (#2347)

Variable and fixed length strings (Unicode or not) are not supported and produce warnings like the following, depending on the type of string:

 .../venv_xradio_python312/lib/python3.12/site-packages/zarr/core/array.py:3985: UserWarning: The dtype 
`<U155` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr 
implementations and may change in the future.
...
.../venv_xradio_python312/lib/python3.12/site-packages/zarr/core/array.py:3985: UserWarning: The dtype `|S2` is 
currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and 
may change in the future.
...
.../venv_xradio_python312/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:99: UserWarning: The codec 
`vlen-bytes` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr 
implementations and may change in the future.
  return cls(**configuration_parsed)

There is an open issue/proposal to support fixed-length strings (https://zarr.readthedocs.io/en/latest/user-guide/v3_migration.html); with it, we would have support for fixed-width (Unicode) strings.

Looking at the Zarr v3 data types page: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types:

This core specification defines a limited set of data types to represent boolean values, integers, and floating 
point numbers. Extensions may define additional data types. All of the data types defined here have a fixed size, 
in the sense that all values require the same number of bytes. However, extensions may define variable sized 
data types.

Then the data types page: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types says "Under construction".

The current behavior in zarr-python is that strings not supported in zarr_format=3 are turned into object dtype when saved, and warnings like those in the examples above are issued about the lack of support in the Zarr v3 specs.
At present, even if we make all strings fixed-width and UTF-8, zarr_format=3 does not support that data type. This will probably be fixed or alleviated in the near future, but the roadmap and time scales are not clear to me.

Regardless of what the final type(s) for strings turn out to be (#294), I think we are affected anyway by this issue with strings ("not yet ported to v3").
My current understanding is that there is no way around it, other than for example: a) fiddling with the raw bits type and doing custom pre-/post-conversions when writing and reading, or b) accepting "string or object" in string arrays in the MSv4 schema (temporarily), or using a parallel schema that accepts that. Both options sound highly untidy to me.
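
To make option a) more concrete, here is a minimal sketch of what such pre-/post-conversions could look like, assuming NumPy and NUL-padded UTF-8 in a 2-D uint8 array. The function names `strings_to_uint8` and `uint8_to_strings` are hypothetical; nothing like this exists in xradio or zarr-python:

```python
import numpy as np


def strings_to_uint8(names, width):
    """Encode strings as an (n, width) uint8 array of NUL-padded UTF-8 bytes.

    Hypothetical pre-write conversion: uint8 is a core Zarr v3 data type,
    so the result avoids the unsupported string dtypes.
    """
    encoded = [name.encode("utf-8") for name in names]
    if any(len(item) > width for item in encoded):
        raise ValueError(f"some strings need more than {width} bytes")
    fixed = np.array(encoded, dtype=f"S{width}")
    # Reinterpret each fixed-width item as `width` single bytes.
    return fixed.view(np.uint8).reshape(len(encoded), width)


def uint8_to_strings(raw):
    """Inverse post-read conversion: (n, width) uint8 array -> list of str."""
    n_items, width = raw.shape
    fixed = np.ascontiguousarray(raw).view(f"S{width}").reshape(n_items)
    # NumPy strips the trailing NUL padding when items are extracted.
    return [item.decode("utf-8") for item in fixed.tolist()]
```

The untidy part is visible here: the schema would have to describe an extra byte dimension, and strings that legitimately end in NUL characters would not survive the round trip.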


Example of what happens with fixed-width strings and zarr_format=3:

>>> import numpy as np
>>> import xarray as xr
>>> xds = xr.Dataset(data_vars={"ant": np.array(["a1", "b1", "c2"]).astype("S2")})
>>> xds.to_zarr("foo_ant.zarr", mode="w", zarr_format=3)
>>> xds_reloaded = xr.open_zarr("foo_ant.zarr")

# .../venv_xradio_python312/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:99: UserWarning: The codec 
# `vlen-bytes` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr 
# implementations and may change in the future.
#   return cls(**configuration_parsed)

>>> xds_reloaded

Out[23]: 
<xarray.Dataset> Size: 24B
Dimensions:  (ant: 3)
Coordinates:
  * ant      (ant) object 24B b'a1' b'b1' b'c2'
Data variables:
    *empty*

Example with UTF-8 encoded fixed-width strings written with zarr_format=3:

>>> ant_names = [ant.encode("utf-8") for ant in ["a1", "b3", "c2"]]
>>> xds = xr.Dataset(data_vars={"ant": np.array(ant_names)})
>>> xds.to_zarr("foo_ant_utf8.zarr", mode="w", zarr_format=3)

# .../venv_xradio_python312/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:99: UserWarning: The codec 
# `vlen-bytes` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr 
# implementations and may change in the future.
#   return cls(**configuration_parsed)
# 
# .../venv_xradio_python312/lib/python3.12/site-packages/zarr/core/array.py:3985: UserWarning: The dtype `|S2` 
# is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations 
# and may change in the future.
#  meta = AsyncArray._create_metadata_v3(
# 
# .../venv_xradio_python312/lib/python3.12/site-packages/zarr/api/asynchronous.py:197: UserWarning: 
# Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by 
# other zarr implementations and may change in the future.

>>> xds_reloaded = xr.open_zarr("foo_ant_utf8.zarr")

# .../venv_xradio_python312/lib/python3.12/site-packages/zarr/codecs/vlen_utf8.py:99: UserWarning: The codec
# `vlen-bytes` is currently not part in the Zarr format 3 specification. It may not be supported by other zarr 
# implementations and may change in the future.
#   return cls(**configuration_parsed)
>>> xds_reloaded
Out[12]: 
<xarray.Dataset> Size: 24B
Dimensions:  (ant: 3)
Coordinates:
  * ant      (ant) object 24B b'a1' b'b3' b'c2'
Data variables:
    *empty*
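
In both examples the strings come back with object dtype instead of `S2`. As a stopgap, the values could be converted back after reading. A minimal sketch, assuming NumPy and that the object array holds only bytes values; the helper name `restore_fixed_width_bytes` is hypothetical:

```python
import numpy as np


def restore_fixed_width_bytes(values):
    """Convert an object array of bytes back to a fixed-width `S<n>` array.

    Hypothetical post-read fix-up. Arrays that are not object dtype are
    returned unchanged. The width is taken from the longest item.
    """
    arr = np.asarray(values)
    if arr.dtype != object:
        return arr
    flat = [bytes(item) for item in arr.ravel()]
    width = max((len(item) for item in flat), default=1)
    return np.array(flat, dtype=f"S{width}").reshape(arr.shape)
```

For the datasets above this would be applied to `xds_reloaded["ant"].values`. Note that the original width is not recoverable if all stored strings are shorter than the declared dtype.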

There is a nice quick summary of how similar strings are supported in zarr_format=2 here: zarr-developers/zarr-python#2347 (comment)


And finally, for reference, more details about the changes in the encoding dict of xds data vars and how encoders are specified:
Where zarr_format=2 uses "compressor" and "filters", zarr_format=3 uses "codecs". Example with the Zstd compressor commonly used in xradio:

zarr_format=2, example VISIBILITY/.zarray

  "compressor": {
    "id": "zstd",
    "level": 2
  },  

zarr_format=3, example VISIBILITY/zarr.json:

  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 2,
        "checksum": false
      }
    }
  ],
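
The relation between the two metadata fragments above can be sketched as a small pure-Python mapping. This is only an illustration of the layout change, not zarr-python's actual conversion logic; the function name is hypothetical, filters are deliberately not handled, and defaults that zarr-python fills in (such as "checksum": false) are not added:

```python
def v2_compressor_to_v3_codecs(zarray, endian="little"):
    """Build a v3-style "codecs" list from v2-style .zarray metadata.

    Illustrative sketch only (hypothetical helper). The v3 chain starts
    with the array-to-bytes "bytes" codec, followed by the compressor.
    """
    if zarray.get("filters"):
        raise NotImplementedError("filters are not handled in this sketch")
    codecs = [{"name": "bytes", "configuration": {"endian": endian}}]
    compressor = zarray.get("compressor")
    if compressor is not None:
        # v2 uses "id" plus flat options; v3 uses "name" plus "configuration".
        configuration = {k: v for k, v in compressor.items() if k != "id"}
        codecs.append({"name": compressor["id"], "configuration": configuration})
    return codecs
```

Calling it with `{"compressor": {"id": "zstd", "level": 2}}` gives the same two-entry chain as the zarr.json fragment, minus the "checksum" default.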

From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#array-metadata,
"the separate filters and compressor fields been combined into the single codecs field."

In addition to this, in the compressor(s) value one needs to pass codecs from the new zarr.codecs list, instead of numcodecs. There are some examples in the xarray tests: https://github.com/pydata/xarray/blob/03c1014d6e7e413598603b73eac538137583db8d/xarray/tests/test_backends.py#L2703-L2726

This doesn't seem to be stable or well documented yet. The xarray doc examples produce exceptions similar to the ones we initially got in xradio when trying to use zarr-python 3, https://docs.xarray.dev/en/latest/user-guide/io.html#zarr-compressors-and-filters:

... TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.

Also, some xarray tests need to silence zarr warnings (about types mostly I think): https://github.com/pydata/xarray/pull/9920/files
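
If we needed to do the same in xradio tests, a minimal sketch using only the standard library could look like this. The context manager name is hypothetical, and the pattern is taken from the warning text quoted earlier in this comment, so it would need updating if zarr-python rewords the message:

```python
import warnings
from contextlib import contextmanager

# Matches the "not part in the Zarr format 3 specification" UserWarnings.
ZARR_V3_SPEC_PATTERN = r".*is currently not part in the Zarr format 3 specification"


@contextmanager
def quiet_zarr_v3_spec_warnings():
    """Suppress only the Zarr v3 spec UserWarnings inside the `with` block."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message=ZARR_V3_SPEC_PATTERN, category=UserWarning
        )
        yield
```

Wrapping a `to_zarr` or `open_zarr` call in `with quiet_zarr_v3_spec_warnings():` would hide these messages while leaving any other warnings visible.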

@FedeMPouzols
Collaborator

Based on the comment above, unless we find something different/new, what I'd in principle include in this branch is:

  • a) Add the necessary changes to get xarray working with zarr-python 3.0.x, allowing the update to zarr-python 3.0.x for Python >=3.11. Python 3.10 support stays for now and would still use zarr-python 2.x and the current v2 API.

  • b) Pin zarr_format=2 for now (for all versions of Python, for all I/O in measurement_set, and I'd assume also image/), with an easy, one-line way to switch from zarr_format 2 to 3 for experimentation (say xradio.measurement_set._utils._zarr.config.zarr_format = 3).
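
A minimal sketch of what the one-line switch in b) could look like. The class name `ZarrIOConfig`, the `to_zarr_kwargs` method and the module-level `config` object are hypothetical; only the attribute name zarr_format comes from the proposal above:

```python
from dataclasses import dataclass

SUPPORTED_ZARR_FORMATS = (2, 3)


@dataclass
class ZarrIOConfig:
    """Hypothetical module-level settings object for xradio Zarr I/O."""

    zarr_format: int = 2  # pinned default; set to 3 to experiment

    def __setattr__(self, name, value):
        # Reject anything other than the two known formats early.
        if name == "zarr_format" and value not in SUPPORTED_ZARR_FORMATS:
            raise ValueError(f"zarr_format must be one of {SUPPORTED_ZARR_FORMATS}")
        super().__setattr__(name, value)

    def to_zarr_kwargs(self):
        """Keyword arguments to forward to every to_zarr call."""
        return {"zarr_format": self.zarr_format}


# Single shared instance that the I/O functions would consult.
config = ZarrIOConfig()
```

Writers would then call something like `xds.to_zarr(path, **config.to_zarr_kwargs())`, so that `config.zarr_format = 3` changes the format everywhere at once.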

@Jan-Willem
Member Author

After discussing this in the XRADIO meeting, we have decided on holding off upgrading to zarr-python 3 until DataTree offers full support:

pydata/xarray#9984
pydata/xarray#10020
