Nan values when saving parq files with virtualize.to_kerchunk() #339

Open

QuentinMaz opened this issue Dec 9, 2024 · 2 comments

@QuentinMaz commented Dec 9, 2024

Hi,

I have used virtualizarr to concatenate several .nc files into a single parquet reference store.
I noticed that when I then open the saved dataset, the first value of its index is replaced with nan.
I therefore suspect that virtualize.to_kerchunk() might have a bug.

Here is how to replicate the issue:

filename = "test"
# synthetic xarray.DataSet inspired by xarray's documentation
temperature = 15 + 8 * np.random.randn(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
depths = np.arange(150, step=50)
da = xr.DataArray(
    data=temperature,
    dims=["x", "y", "depth"],
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        depth=depths
    ),
    attrs=dict(
        description="Ambient temperature.",
        units="degC",
    ),
)
ds = da.to_dataset(name="temperature")
ds.to_netcdf(f"{filename}.nc")
vds = open_virtual_dataset(
    f"{filename}.nc",
    indexes={}, 
    decode_times=True, 
    loadable_variables=["lon", "lat", "depth"]
)
print("depth index of vds:\t\t", vds.depth.to_numpy())
# depth index of vds:		 [  0  50 100]

# saves as parq/ folder
vds.virtualize.to_kerchunk(f"{filename}.parq", format="parquet")
loaded_ds = xr.open_dataset(f"{filename}.parq", engine="kerchunk", chunks={})
print("depth index of the loaded vds:\t", loaded_ds.depth.to_numpy())
# depth index of the loaded vds:	 [ nan  50. 100.]

# temporary fix
loaded_ds.coords["depth"].values[0] = 0.
print("index after fix:\t\t", loaded_ds.depth.to_numpy())
# index after fix:		 [  0.  50. 100.]

I am a beginner and therefore have no idea of the cause...

@norlandrhagen (Collaborator) commented Dec 9, 2024

Hey there @QuentinMaz! Thanks for trying out VirtualiZarr and opening up a clear MRE.

Definitely seems like an issue. From some initial digging, it looks like:

  • VirtualiZarr and Kerchunk's SingleHDF5ToZarr both point to the same on-disk reference for depth:
# Virtualizarr Depth manifest 
{'0': {'path': 'file:///<...>/test.nc','offset': 8256, 'length': 24}}
# kerchunk Depth reference
#   'depth/0': ['test.nc', 8256, 24],
  • Writing to .json or .parquet produces the same missing first value in depth.
  • When using Kerchunk to write the reference .json, the fill value for depth is null, whereas when writing with .virtualize.to_kerchunk() the fill_value for depth is 0. It looks like we're getting a fill_value coercion from None to 0 based on the float dtype (see the sketch after this list):
    if self.fill_value is None:
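
A minimal sketch (plain xarray, no VirtualiZarr or Kerchunk involved, and a made-up variable name) of why a fill_value of 0 would produce exactly this symptom once xarray applies its CF decoding:

import numpy as np
import xarray as xr

# depth values as they sit on disk; the first one is a legitimate 0
raw = xr.Dataset({"depth_on_disk": ("z", np.array([0.0, 50.0, 100.0]))})
# a fill_value coerced from None to 0 behaves like a CF _FillValue of 0
raw["depth_on_disk"].attrs["_FillValue"] = 0.0

decoded = xr.decode_cf(raw)
print(decoded["depth_on_disk"].to_numpy())
# [ nan  50. 100.]  <- the genuine 0 gets masked, same as in the MRE above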

I'll try to dig into this further. In the meantime, if you're open to working a bit on the bleeding edge, you could try writing the references to Icechunk. It might take a bit of environment-fu since Kerchunk doesn't yet support Zarr V3. To keep a single environment, you could use the new Zarr-V3-compliant HDF5 reader, then write to Icechunk (rough sketch below the reader snippet).

from virtualizarr.readers.hdf import HDFVirtualBackend

vds = open_virtual_dataset('file.nc', backend=HDFVirtualBackend)
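
Very roughly, the Icechunk side could then look something like this. Treat it as a sketch only: the icechunk API is still moving quickly, so the store-creation calls below are assumptions based on an early icechunk release and should be checked against the current icechunk docs.

import icechunk

# create a local Icechunk store on disk (path is made up; StorageConfig.filesystem
# and IcechunkStore.create reflect the early pre-1.0 icechunk API and may have changed)
storage = icechunk.StorageConfig.filesystem("./test_icechunk")
store = icechunk.IcechunkStore.create(storage)

# write the virtual references into the store and commit a snapshot
vds.virtualize.to_icechunk(store)
store.commit("add virtual references for test.nc")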

@QuentinMaz (Author)

Thanks for the comments and the answer, @norlandrhagen!
I did indeed suspect that the nan value might be related to the initial value being 0.
I am still somewhat of a beginner on the project I recently joined, so I think I will stick to my temporary fix for now ;) But I will certainly try your suggestions later!

Even though I am pretty sure I am not skilful enough to help with your investigation, I will keep an eye on the status of the issue to (try to) follow your progress :) Good luck!
