
## Creating a virtual dataset with VirtualiZarr

We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as [netCDF files on AWS S3](https://registry.opendata.aws/noaa-cdr-oceanic/), with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.

Before we get started, we need to install `virtualizarr` (this notebook uses VirtualiZarr v2.0.0) and `icechunk`. We also need to install `fsspec`, `s3fs`, and `obstore` for working with data on S3.

```shell
pip install virtualizarr icechunk fsspec s3fs obstore
```

First, we need to find all of the files we are interested in.
We can do this with fsspec, using a `glob` expression to find every netCDF file in the August 2024 folder in the bucket:

```python
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # anonymous access to the public bucket

# glob every netCDF file in the August 2024 folder of the bucket
oisst_files = fs.glob("s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc")
oisst_files = sorted(['s3://' + f for f in oisst_files])
# ['s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240801.nc', ...]
```

VirtualiZarr uses [`obstore`](https://developmentseed.org/obstore/latest/) to access remote files, and we need to create an `ObjectStoreRegistry` containing an obstore `S3Store` for this bucket.

```python
from obstore.store import S3Store
from virtualizarr.registry import ObjectStoreRegistry

bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"
store = S3Store(
    bucket=bucket,
    region="us-east-1",
    skip_signature=True,
)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})
```

These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`.

```python
from virtualizarr.parsers import HDFParser
```

Now that we have the filenames of the data we need, a way to access them, and a way to parse their contents, we can create virtual datasets with `VirtualiZarr`. This may take a minute, as it needs to fetch all the metadata from all the files.

```python
from virtualizarr import open_virtual_dataset

virtual_datasets = [
    open_virtual_dataset(url, registry=registry, parser=HDFParser())
    for url in oisst_files
]
```
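
These per-file virtual datasets then need to be combined into the single month-long dataset, `virtual_ds`, that we write to Icechunk below. A minimal sketch of that combination, assuming we concatenate along the `time` dimension with xarray:

```python
import xarray as xr

# Concatenate the daily virtual datasets along the time dimension
# into one virtual dataset covering the whole month.
virtual_ds = xr.concat(
    virtual_datasets,
    dim="time",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)
```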
With `storage`, a `config` that registers a virtual chunk container for the bucket the references point to, and credentials for that container (here anonymous, since the referenced bucket is public), we can create the repository:

```python
# Map the virtual chunk container prefix to anonymous S3 credentials
credentials = icechunk.containers_credentials(
    {"s3://mybucket/my/data/": icechunk.s3_credentials(anonymous=True)}
)
repo = icechunk.Repository.create(storage, config, credentials)
```

!!! note

    The updated configuration will only be persisted across sessions if you call [`repo.save_config`][icechunk.Repository.save_config].
    This is therefore recommended, so that users reading the virtual chunks in later sessions do not also have to set the virtual containers.
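
    For example (a minimal sketch, assuming `repo` is the repository created above):

    ```python
    # Persist the repository configuration, including the virtual chunk
    # container, so later sessions do not have to re-register it.
    repo.save_config()
    ```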

With the repo created and the virtual chunk container added, let's write our virtual dataset to Icechunk with VirtualiZarr!

```python
session = repo.writable_session("main")
virtual_ds.vz.to_icechunk(session.store)
```

The refs are written, so let's save our progress by committing to the store.
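
A minimal sketch of that commit, assuming `session` is the writable session from above and using a commit message of your choice:

```python
# Commit the newly written virtual references to the main branch
snapshot_id = session.commit("Add virtual OISST references for August 2024")
print(snapshot_id)
```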
No extra configuration is necessary for local filesystem references.

### Virtual Reference File Format Support

Icechunk supports storing virtual references to any format that VirtualiZarr can parse. VirtualiZarr ships with parsers for a range of formats, including `HDF5`, `netcdf4`, and `netcdf3`. You can also write your own [custom parser](https://virtualizarr.readthedocs.io/en/latest/custom_parsers.html) for virtualizing other file formats.

Support for other common filetypes is under development within the VirtualiZarr project. Below are some relevant issues:

- [meta issue for file format support](https://github.com/zarr-developers/VirtualiZarr/issues/218)
- [Support for GRIB2 files](https://github.com/zarr-developers/VirtualiZarr/issues/312)