
Datetime coordinate missing single timestep (NaT) when data is loaded #352

Open
ayushnag opened this issue Dec 16, 2024 · 1 comment
ayushnag commented Dec 16, 2024

I was trying to create virtual references for, and then load, some NASA MERRA2 data hosted at GES DISC. I noticed that when the dataset is loaded using the references produced by virtualizarr, the initial timestep is NaT, whereas the exact same dataset loaded using references produced by kerchunk is correct. Here is the data file used: https://data.gesdisc.earthdata.nasa.gov/data/MERRA2/M2T1NXSLV.5.12.4/1980/01/MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4

Diving deeper, the only difference I could find between the two JSONs is that kerchunk writes the fill_value as null, whereas virtualizarr sets it to 0:

Kerchunk time zarray: "time/.zarray": "{\"chunks\":[1],\"compressor\":null,\"dtype\":\"<i4\",\"fill_value\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"order\":\"C\",\"shape\":[24],\"zarr_format\":2}",

Virtualizarr --> kerchunk time zarray: "time/.zarray": "{\"shape\":[24],\"chunks\":[1],\"dtype\":\"<i4\",\"fill_value\":0,\"order\":\"C\",\"compressor\":null,\"filters\":[{\"elementsize\":4,\"id\":\"shuffle\"},{\"id\":\"zlib\",\"level\":2}],\"zarr_format\":2}",

I tried manually changing the fill_value in the virtualizarr-produced JSON to null, and that fixed the issue. So, TL;DR: there seems to be a bug in the default fill_value (possibly datetime-specific?). cc @TomNicholas @mpiannucci
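For anyone wondering why fill_value=0 in particular produces NaT in the first timestep: if the raw time values are integer offsets from a reference time and the first value is 0, it collides with the fill value. A minimal sketch of that mechanism, without the actual file (the units string and raw values here are my assumptions, not read from the MERRA2 file):

```python
import numpy as np
import xarray as xr

# Hypothetical reconstruction: hourly offsets stored as int32, first value 0.
minutes = np.arange(0, 24 * 60, 60, dtype="int32")
raw = xr.Dataset({"time": ("time", minutes)})
raw["time"].attrs = {
    "units": "minutes since 1980-01-03 00:30:00",  # assumed units
    "_FillValue": 0,  # mirrors fill_value=0 written by virtualizarr
}

# CF decoding masks values equal to _FillValue before decoding times,
# so the leading 0 becomes NaN and then NaT.
decoded = xr.decode_cf(raw)
print(decoded.time.values[:2])
```

With fill_value=null (as kerchunk writes it) no masking happens and the 0 decodes to the first valid timestamp.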

Code used to produce the two JSONs and demonstrate the issue:

import ujson
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Generate kerchunk references (no inlined chunks)
m = SingleHdf5ToZarr('MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4', inline_threshold=0)
m2 = m.translate()
with open("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", "w") as f:
    f.write(ujson.dumps(m2))

ds = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_NO_INLINE.nc4.json", engine="kerchunk")
print(ds.time)
<xarray.DataArray 'time' (time: 24)> Size: 192B
array(['1980-01-03T00:30:00.000000000', '1980-01-03T01:30:00.000000000',
       '1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
       '1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
       '1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
       '1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
       '1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
       '1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
       '1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
       '1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
       '1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
       '1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
       '1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 192B 1980-01-03T00:30:00 ... 1980-01-03T23...
Attributes:
    begin_date:      19800103
    begin_time:      3000
    long_name:       time
    time_increment:  10000
    valid_range:     [-999999986991104.0, 999999986991104.0]
    vmax:            999999986991104.0
    vmin:            -999999986991104.0
from virtualizarr import open_virtual_dataset
vds = open_virtual_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103.nc4", indexes={})
vds.virtualize.to_kerchunk("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", format="json")
ds2 = xr.open_dataset("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json", engine="kerchunk")
print(ds2.time)
<xarray.DataArray 'time' (time: 24)> Size: 192B
array([                          'NaT', '1980-01-03T01:30:00.000000000',
       '1980-01-03T02:30:00.000000000', '1980-01-03T03:30:00.000000000',
       '1980-01-03T04:30:00.000000000', '1980-01-03T05:30:00.000000000',
       '1980-01-03T06:30:00.000000000', '1980-01-03T07:30:00.000000000',
       '1980-01-03T08:30:00.000000000', '1980-01-03T09:30:00.000000000',
       '1980-01-03T10:30:00.000000000', '1980-01-03T11:30:00.000000000',
       '1980-01-03T12:30:00.000000000', '1980-01-03T13:30:00.000000000',
       '1980-01-03T14:30:00.000000000', '1980-01-03T15:30:00.000000000',
       '1980-01-03T16:30:00.000000000', '1980-01-03T17:30:00.000000000',
       '1980-01-03T18:30:00.000000000', '1980-01-03T19:30:00.000000000',
       '1980-01-03T20:30:00.000000000', '1980-01-03T21:30:00.000000000',
       '1980-01-03T22:30:00.000000000', '1980-01-03T23:30:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 192B NaT ... 1980-01-03T23:30:00
Attributes:
    begin_date:      19800103
    begin_time:      3000
    long_name:       time
    time_increment:  10000
    valid_range:     [-999999986991104.0, 999999986991104.0]
    vmax:            999999986991104.0
    vmin:            -999999986991104.0
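In the meantime, the manual fix described above can be scripted: rewrite fill_value to null in the virtualizarr-generated references before opening them. A sketch of that workaround (the helper name is mine; it assumes the standard kerchunk v1 layout with a top-level "refs" mapping of key -> JSON string):

```python
import json

def null_fill_value(refs: dict, key: str = "time/.zarray") -> dict:
    """Set fill_value to None (null in JSON) in one array's .zarray metadata."""
    zarray = json.loads(refs["refs"][key])
    zarray["fill_value"] = None
    refs["refs"][key] = json.dumps(zarray)
    return refs

# Usage against the file produced above:
# with open("MERRA2_100.tavg1_2d_slv_Nx.19800103_VIRTUALIZARR.nc4.json") as f:
#     refs = json.load(f)
# refs = null_fill_value(refs)
# with open("patched.json", "w") as f:
#     json.dump(refs, f)
```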
@ayushnag

I think this is actually the same error as #339 (comment). Feel free to close this issue or keep it open.
