Fill value lost with Icechunk #343

Open
rabernat opened this issue Dec 12, 2024 · 1 comment

@rabernat
Collaborator

This issue is most likely related to the different treatment of missing data in Xarray with Zarr V3 vs. V2 (see pydata/xarray#5475). I don't think it's specific to Icechunk; it looks like a Zarr V3 issue. However, as far as I know it can only be reproduced with Icechunk, because only Icechunk uses V3 in VirtualiZarr.

Note: this requires pip install git+https://github.com/mpiannucci/kerchunk@v3

To reproduce, first create test data

import xarray as xr
import numpy as np
import virtualizarr
from virtualizarr import open_virtual_dataset
import icechunk

ds = xr.DataArray([np.nan, 2, 3], name="foo").to_dataset()
ds.foo.encoding.update({"_FillValue": np.float32(-9999.9), "dtype": "f4"})

ds.to_netcdf("test.nc")

ds_nc = xr.open_dataset("test.nc").load()
print(ds_nc)
print(ds_nc.foo.encoding)
<xarray.Dataset> Size: 12B
Dimensions:  (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    foo      (dim_0) float32 12B nan 2.0 3.0
{'dtype': dtype('float32'), 'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': True, 'chunksizes': None, 'source': '/home/jovyan/GPM_IMERG/test.nc', 'original_shape': (3,), '_FillValue': np.float32(-9999.9)}

Next, create an Icechunk virtual dataset

dsv = open_virtual_dataset("test.nc", indexes={})
storage_config = icechunk.StorageConfig.filesystem("./icechunk-local")
store = icechunk.IcechunkStore.create(storage_config)
dsv.virtualize.to_icechunk(store)
store.commit("wrote store")

Now read it back

ds_ic = xr.open_dataset(store, engine="zarr", zarr_format=3, consolidated=False).load()
print(ds_ic)
print(ds_ic.foo.encoding)
<xarray.Dataset> Size: 12B
Dimensions:  (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    foo      (dim_0) float32 12B -1e+04 2.0 3.0
{'chunks': (3,), 'preferred_chunks': {'dim_0': 3}, 'codecs': [{'name': 'bytes', 'configuration': {'endian': 'little'}}], 'dtype': dtype('float32')}

As you can see, the first value comes back as the raw fill value (displayed as -1e+04) rather than NaN, and _FillValue is missing from the encoding.
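
To illustrate what I think is going on, here is a minimal pure-Python sketch of fill-value masking. It is not Xarray's actual code, and `to_float32` and `decode_missing` are hypothetical helper names made up for this example. The point is only that masking is driven by a `_FillValue` attribute: when the attribute is absent, the stored value is passed through unchanged.

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest float32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def decode_missing(values, attrs):
    """Replace values equal to attrs['_FillValue'] with NaN.

    Hypothetical stand-in for CF-style masking: if the attribute is
    missing, the values are returned unchanged.
    """
    fill = attrs.get("_FillValue")
    if fill is None:
        return list(values)
    return [math.nan if v == fill else v for v in values]


fill = to_float32(-9999.9)
raw = [fill, 2.0, 3.0]  # what is stored on disk for [nan, 2, 3]

with_attr = decode_missing(raw, {"_FillValue": fill})
without_attr = decode_missing(raw, {})

print(with_attr)     # first element is masked
print(without_attr)  # first element is still the raw fill value
```

The second call mirrors the Icechunk read above, where the encoding carries no `_FillValue` and the first element is left as the stored number.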

I'm not really sure how encoding works in VirtualiZarr, but I'd be happy to explain the relevant changes to Xarray's handling of fill values in the Zarr V3 format (see e.g. https://github.com/pydata/xarray/blob/49502fcde4db6ea3da1f60ead589580cfdad5c98/xarray/backends/zarr.py#L787-L796).

cc @mpiannucci

@rabernat
Collaborator Author

rabernat commented Dec 14, 2024

I was able to work around this in the following way:

from xarray.backends.zarr import FillValueCoder
import numpy as np
import zarr

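coder = FillValueCoder()
group = zarr.open_group(store)
# Write the encoded fill value into the array's attributes so Xarray can find it on read
group['foo'].attrs['_FillValue'] = coder.encode(ds.foo.encoding['_FillValue'], np.dtype('f4'))

# Persist the attribute change as a new Icechunk commit
store.commit("fixed fill value")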
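
For anyone curious what ends up in the attribute: my reading of the linked Xarray code is that floating-point fill values are stored as a base64 string of a little-endian double. The sketch below reproduces that with the standard library only. Treat the encoding scheme as an assumption on my part rather than a documented contract; `encode_float_fill` and `decode_float_fill` are hypothetical names, not Xarray functions.

```python
import base64
import struct


def encode_float_fill(value):
    """Encode a float fill value as base64 of a little-endian double.

    Assumption: this mirrors what FillValueCoder.encode does for float dtypes.
    """
    return base64.standard_b64encode(struct.pack("<d", float(value))).decode()


def decode_float_fill(text):
    """Inverse of encode_float_fill: base64 string back to a Python float."""
    return struct.unpack("<d", base64.standard_b64decode(text))[0]


# The float32 fill value from the example, widened to a Python float
fill = struct.unpack("<f", struct.pack("<f", -9999.9))[0]

encoded = encode_float_fill(fill)
decoded = decode_float_fill(encoded)

print(encoded)
print(decoded == fill)
```

Because a float32 value is exactly representable as a double, the round trip is lossless, so the decoded fill value compares equal to the stored data.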
