From 36d31acf67c18a90068358a3f4bfcae86fe0193e Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Tue, 22 Jul 2025 00:07:52 +0100
Subject: [PATCH 1/6] update API

---
 docs/docs/virtual.md | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index 6757870e0..7f29a0bff 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -16,13 +16,14 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo
 
 We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.
 
-Before we get started, we need to install `virtualizarr`, and `icechunk`. We also need to install `fsspec` and `s3fs` for working with data on s3.
+Before we get started, we need to install `virtualizarr` (this notebook uses VirtualiZarr v2.0.0), and `icechunk`. We also need to install `fsspec`, `s3fs`, and `obstore` for working with data on s3.
 
 ```shell
-pip install virtualizarr icechunk fsspec s3fs
+pip install virtualizarr icechunk fsspec s3fs obstore
 ```
 
-First, we need to find all of the files we are interested in, we will do this with fsspec using a `glob` expression to find every netcdf file in the August 2024 folder in the bucket:
+First, we need to find all of the files we are interested in.
+We can do this with fsspec using a `glob` expression to find every netCDF file in the August 2024 folder in the bucket:
 
 ```python
 import fsspec
@@ -40,13 +41,27 @@ oisst_files = sorted(['s3://'+f for f in oisst_files])
 #]
 ```
 
+These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`.
+
+We also need to give the parser a way to access our files. We do this by creating a `ObjectStoreRegistry` containing an obstore `S3Store` for that bucket.
+
 Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
 
 ```python
 from virtualizarr import open_virtual_dataset
+from virtualizarr.parsers import HDFParser
+from virtualizarr.registry import ObjectStoreRegistry
+
+bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"
+store = S3Store(
+    bucket=bucket,
+    region="us-west-2",
+    skip_signature=True
+)
+registry = ObjectStoreRegistry({f"s3://{bucket}": store})
 
 virtual_datasets =[
-    open_virtual_dataset(url, indexes={})
+    open_virtual_dataset(url, registry=registry, parser=HDFParser())
     for url in oisst_files
 ]
@@ -101,7 +116,7 @@ With the repo created, and the virtual chunk container added, lets write our vir
 
 ```python
 session = repo.writable_session("main")
-virtual_ds.virtualize.to_icechunk(session.store)
+virtual_ds.vz.to_icechunk(session.store)
 ```
 
 The refs are written so lets save our progress by committing to the store.
@@ -234,7 +249,9 @@ No extra configuration is necessary for local filesystem references.
 
 ### Virtual Reference File Format Support
 
-Currently, Icechunk supports `HDF5`, `netcdf4`, and `netcdf3` files for use in virtual references with `VirtualiZarr`. Support for other filetypes is under development in the VirtualiZarr project. Below are some relevant issues:
+Icechunk supports storing virtual references to any format that VirtualiZarr can parse. VirtualiZarr ships with parsers for a range of formats, including `HDF5`, `netcdf4`, and `netcdf3`. You can also write your own [custom parser](https://virtualizarr.readthedocs.io/en/latest/custom_parsers.html) for virtualizing other file formats.
+
+Support for other common filetypes is under development within the VirtualiZarr project. Below are some relevant issues:
 
 - [meta issue for file format support](https://github.com/zarr-developers/VirtualiZarr/issues/218)
 - [Support for GRIB2 files](https://github.com/zarr-developers/VirtualiZarr/issues/312)

From 67149d68e23a1c031a3eee52ff18ac3966939103 Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Wed, 23 Jul 2025 14:32:15 +0100
Subject: [PATCH 2/6] update to correct region and add link to aws registry page

---
 docs/docs/virtual.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index 7f29a0bff..bbaaea4db 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -14,7 +14,7 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo
 
 ## Creating a virtual dataset with VirtualiZarr
 
-We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.
+We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as [netCDF files on AWS S3](https://registry.opendata.aws/noaa-cdr-oceanic/), with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.
 
 Before we get started, we need to install `virtualizarr` (this notebook uses VirtualiZarr v2.0.0), and `icechunk`. We also need to install `fsspec`, `s3fs`, and `obstore` for working with data on s3.
 
@@ -55,7 +55,7 @@ from virtualizarr.registry import ObjectStoreRegistry
 bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"
 store = S3Store(
     bucket=bucket,
-    region="us-west-2",
+    region="us-east-1",
     skip_signature=True
 )
 registry = ObjectStoreRegistry({f"s3://{bucket}": store})

From 94ed284e715f3a9863eb81586a67720074973d7e Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Wed, 23 Jul 2025 14:42:09 +0100
Subject: [PATCH 3/6] add obstore import

---
 docs/docs/virtual.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index bbaaea4db..2d3a6aceb 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -48,6 +48,8 @@ We also need to give the parser a way to access our files. We do this by creatin
 Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
 
 ```python
+from obstore.store import S3Store
+
 from virtualizarr import open_virtual_dataset
 from virtualizarr.parsers import HDFParser
 from virtualizarr.registry import ObjectStoreRegistry

From a9fb738298200501a451cdc77c755caf583019e5 Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Wed, 23 Jul 2025 14:44:10 +0100
Subject: [PATCH 4/6] split the creation of the registry from actually parsing the files

---
 docs/docs/virtual.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index 2d3a6aceb..4928b1eb3 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -43,15 +43,10 @@ oisst_files = sorted(['s3://'+f for f in oisst_files])
 
 These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`.
 
-We also need to give the parser a way to access our files. We do this by creating a `ObjectStoreRegistry` containing an obstore `S3Store` for that bucket.
-
-Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
+We also need to give the parser a way to access our files. We do this by creating an `ObjectStoreRegistry` containing an obstore `S3Store` for that bucket.
 
 ```python
 from obstore.store import S3Store
-
-from virtualizarr import open_virtual_dataset
-from virtualizarr.parsers import HDFParser
 from virtualizarr.registry import ObjectStoreRegistry
 
 bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"
@@ -61,6 +56,13 @@ store = S3Store(
     skip_signature=True
 )
 registry = ObjectStoreRegistry({f"s3://{bucket}": store})
+```
+
+Now that we have the filenames of the data we need, and a way to access them, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
+
+```python
+from virtualizarr import open_virtual_dataset
+from virtualizarr.parsers import HDFParser
 
 virtual_datasets =[
     open_virtual_dataset(url, registry=registry, parser=HDFParser())
     for url in oisst_files
 ]

From 1c6437eb5a0a40c59435cab2445aa4f3ff8ccb36 Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Wed, 23 Jul 2025 14:53:59 +0100
Subject: [PATCH 5/6] explain why it might take a while to parse the files

---
 docs/docs/virtual.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index 4928b1eb3..e8252b494 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -41,9 +41,7 @@ oisst_files = sorted(['s3://'+f for f in oisst_files])
 #]
 ```
 
-These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`.
-
-We also need to give the parser a way to access our files. We do this by creating an `ObjectStoreRegistry` containing an obstore `S3Store` for that bucket.
+VirtualiZarr uses [`obstore`](https://developmentseed.org/obstore/latest/) to access remote files, and we need to create an `ObjectStoreRegistry` containing an obstore `S3Store` for this bucket.
 
 ```python
 from obstore.store import S3Store
@@ -58,11 +56,16 @@ store = S3Store(
     skip_signature=True
 )
 registry = ObjectStoreRegistry({f"s3://{bucket}": store})
 ```
 
-Now that we have the filenames of the data we need, and a way to access them, we can create virtual datasets with `VirtualiZarr`. This may take a minute.
+These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`.
 
 ```python
-from virtualizarr import open_virtual_dataset
 from virtualizarr.parsers import HDFParser
+```
+
+Now that we have the filenames of the data we need, a way to access them, and a way to parse their contents, we can create virtual datasets with `VirtualiZarr`. This may take a minute, as it needs to fetch all the metadata from all the files.
+
+```python
+from virtualizarr import open_virtual_dataset
 
 virtual_datasets =[
     open_virtual_dataset(url, registry=registry, parser=HDFParser())
     for url in oisst_files
 ]

From 6770e4767dd7e928f2f7c3016fd371cb9cb665ca Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Tue, 29 Jul 2025 13:17:48 +0100
Subject: [PATCH 6/6] add note about repo.save_config

---
 docs/docs/virtual.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/docs/virtual.md b/docs/docs/virtual.md
index e8252b494..44a5cf801 100644
--- a/docs/docs/virtual.md
+++ b/docs/docs/virtual.md
@@ -119,6 +119,11 @@ credentials = icechunk.containers_credentials({"s3://mybucket/my/data/": icechun
 repo = icechunk.Repository.create(storage, config, credentials)
 ```
 
+!!! note
+
+    The updated configuration will only be persisted across sessions if you call [`repo.save_config`][icechunk.Repository.save_config].
+    This is therefore recommended, so that users reading the virtual chunks in later sessions do not also have to configure the virtual chunk containers themselves.
+
 With the repo created, and the virtual chunk container added, lets write our virtual dataset to Icechunk with VirtualiZarr!
 
 ```python
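
For reviewers who want to sanity-check the updated API end to end, here is a minimal sketch of the workflow as the docs page reads after this series. It assumes the `oisst_files` list and the Icechunk `repo` (with its virtual chunk container and credentials) already exist as constructed on the docs page; the `xr.concat` step and the commit message are illustrative assumptions, not part of this diff.

```python
# Minimal sketch of the post-series workflow; `oisst_files` and `repo` are assumed
# to exist exactly as constructed earlier on the docs page.
import xarray as xr
from obstore.store import S3Store
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

# Anonymous access to the public OISST bucket (bucket name written here without a trailing slash).
bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds"
store = S3Store(bucket=bucket, region="us-east-1", skip_signature=True)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})

# Parse each daily netCDF4/HDF5 file into a virtual, reference-only dataset.
parser = HDFParser()
virtual_datasets = [
    open_virtual_dataset(url, registry=registry, parser=parser)
    for url in oisst_files
]

# Illustrative: concatenate the daily virtual datasets along time before writing.
virtual_ds = xr.concat(virtual_datasets, dim="time", coords="minimal", compat="override")

# Write the virtual references into Icechunk using the new accessor name and commit.
session = repo.writable_session("main")
virtual_ds.vz.to_icechunk(session.store)
session.commit("Add virtual OISST references for August 2024")
```

If the sketch runs, the commit should return a new snapshot id, which is a quick way to confirm the virtual references were written to the repository.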