intake_xarray does not lazy read metadata from files #137

Open
kadykov opened this issue Oct 9, 2023 · 3 comments

@kadykov

kadykov commented Oct 9, 2023

Entries powered by the intake_xarray drivers do not lazily read metadata from the files.

# %%
import intake
import xarray as xr

ds = xr.Dataset(
    {
        "test_var": [0],
    },
    attrs={"xarray_metadata": "The metadata in the xarray file"},
)
ds.to_netcdf("test_metadata.nc")
ds.to_zarr("test_metadata.zarr", mode="w")
# %%
catalog_content = """sources:
  netcdf:
    driver: netcdf
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.nc'
      metadata:
        catalog_metadata: The metadata in the catalog entry
  zarr_intake_xarray:
    description: zarr archive read by intake_xarray
    driver: zarr
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
      metadata:
        catalog_metadata: The metadata in the catalog entry
  zarr_intake:
    description: zarr archive read by intake
    driver: zarr_cat
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
      metadata:
        catalog_metadata: The metadata in the catalog entry
"""

with open("catalog.yml", "w") as f:
    f.write(catalog_content)

cat = intake.open_catalog("catalog.yml")
print(f"{cat.netcdf.metadata = }")
print(f"{cat.zarr_intake_xarray.metadata = }")
print(f"{cat.zarr_intake.metadata = }")

As the output shows, only the entry powered by the intake driver (zarr_cat) includes the field from the zarr file:

cat.netcdf.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake_xarray.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake.metadata = {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}

However, after reading the files, the metadata is complete:

cat.netcdf.read()
cat.zarr_intake_xarray.read()

print(f"Netcdf metadata after reading: {cat.netcdf.metadata}")
print(f"Zarr metadata after reading: {cat.zarr_intake_xarray.metadata}")

Output:

Netcdf metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}
Zarr metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}

OS: Windows 10
python 3.11.5
intake 0.7.0
intake_xarray 0.7.0
xarray 2023.8.0
zarr 2.16.1

@martindurant
Member

What do you think the right behaviour should be? Catalog entries are special in Intake (<2.0) in that they get their subentries eagerly, so they have access to the file metadata immediately. Is this what you are getting at?

@kadykov
Author

kadykov commented Oct 9, 2023

I expected cat.netcdf.metadata to also include the metadata from the file, like this: {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}.
Currently, the xarray_metadata key appears only after reading the whole file by executing cat.netcdf.read().

I think it would be better to read metadata from the files "lazily", because the files may also contain useful information. What do you think?

@martindurant
Member

The .discover() method is meant exactly for this purpose: to get information from the file with a minimum of reads. Its usefulness varies by file type.

Actually, xarray is lazy by default, so even if you do a .read(), you do not load all the data into memory, only enough for xarray to be able to understand the file's layout (typically the attributes and coordinate arrays).
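To illustrate the point about laziness with xarray alone, here is a small sketch in which the file attributes and dimension sizes are taken from an opened dataset without asking for any variable values. The file name lazy_demo.nc and the variable temperature are made up for this example.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Write a small file with one data variable and one global attribute.
path = os.path.join(tempfile.mkdtemp(), "lazy_demo.nc")
xr.Dataset(
    {"temperature": ("x", np.arange(5.0))},
    attrs={"xarray_metadata": "The metadata in the xarray file"},
).to_netcdf(path)

# open_dataset reads the file layout; data variable values are only
# fetched when they are actually accessed.
with xr.open_dataset(path) as lazy:
    file_attrs = dict(lazy.attrs)
    dims = dict(lazy.sizes)

print(file_attrs)
print(dims)
```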
