Update docs on virtual references to match VirtualiZarr v2.0's updated API #1099
Open
TomNicholas wants to merge 6 commits into earth-mover:main from TomNicholas:update_virtualizarr_docs_for_v2
36d31ac update API (TomNicholas)
67149d6 update to correct region and add link to aws registry page (TomNicholas)
94ed284 add obstore import (TomNicholas)
a9fb738 split the creation of the registry from actually parsing the files (TomNicholas)
1c6437e explain why it might take a while to parse the files (TomNicholas)
6770e47 add note about repo.save_config (TomNicholas)
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -14,15 +14,16 @@ While Icechunk works wonderfully with native chunks managed by Zarr, there is lo | |||||
|
|
||||||
| ## Creating a virtual dataset with VirtualiZarr | ||||||
|
|
||||||
| We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. | ||||||
| We are going to create a virtual dataset pointing to all of the [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) data for August 2024. This data is distributed publicly as [netCDF files on AWS S3](https://registry.opendata.aws/noaa-cdr-oceanic/), with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use `VirtualiZarr` to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis. | ||||||
|
|
||||||
| Before we get started, we need to install `virtualizarr`, and `icechunk`. We also need to install `fsspec` and `s3fs` for working with data on s3. | ||||||
| Before we get started, we need to install `virtualizarr` (this notebook uses VirtualiZarr v2.0.0) and `icechunk`. We also need to install `fsspec`, `s3fs`, and `obstore` for working with data on S3. | ||||||
|
|
||||||
| ```shell | ||||||
| pip install virtualizarr icechunk fsspec s3fs | ||||||
| pip install virtualizarr icechunk fsspec s3fs obstore | ||||||
| ``` | ||||||
|
|
||||||
| First, we need to find all of the files we are interested in, we will do this with fsspec using a `glob` expression to find every netcdf file in the August 2024 folder in the bucket: | ||||||
| First, we need to find all of the files we are interested in. | ||||||
| We can do this with fsspec using a `glob` expression to find every netcdf file in the August 2024 folder in the bucket: | ||||||
|
|
||||||
| ```python | ||||||
| import fsspec | ||||||
|
|
@@ -40,13 +41,34 @@ oisst_files = sorted(['s3://'+f for f in oisst_files]) | |||||
| #] | ||||||
| ``` | ||||||
|
|
||||||
| Now that we have the filenames of the data we need, we can create virtual datasets with `VirtualiZarr`. This may take a minute. | ||||||
| VirtualiZarr uses [`obstore`](https://developmentseed.org/obstore/latest/) to access remote files, and we need to create an `ObjectStoreRegistry` containing an obstore `S3Store` for this bucket. | ||||||
|
|
||||||
| ```python | ||||||
|
Collaborator
Suggested change
like this, and adding the same on other code blocks. They share variables and state so long as they have the same "session" |
||||||
| from obstore.store import S3Store | ||||||
| from virtualizarr.registry import ObjectStoreRegistry | ||||||
|
|
||||||
| bucket = "noaa-cdr-sea-surface-temp-optimum-interpolation-pds/" | ||||||
| store = S3Store( | ||||||
| bucket=bucket, | ||||||
| region="us-east-1", | ||||||
| skip_signature=True | ||||||
| ) | ||||||
| registry = ObjectStoreRegistry({f"s3://{bucket}": store}) | ||||||
| ``` | ||||||
|
|
||||||
| These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's `HDFParser`. | ||||||
|
|
||||||
| ```python | ||||||
| from virtualizarr.parsers import HDFParser | ||||||
| ``` | ||||||
|
|
||||||
| Now that we have the filenames of the data we need, a way to access them, and a way to parse their contents, we can create virtual datasets with `VirtualiZarr`. This may take a minute, as it needs to fetch all the metadata from all the files. | ||||||
|
|
||||||
| ```python | ||||||
| from virtualizarr import open_virtual_dataset | ||||||
|
|
||||||
| virtual_datasets = [ | ||||||
| open_virtual_dataset(url, indexes={}) | ||||||
| open_virtual_dataset(url, registry=registry, parser=HDFParser()) | ||||||
| for url in oisst_files | ||||||
| ] | ||||||
| ``` | ||||||
|
|
@@ -97,11 +119,16 @@ credentials = icechunk.containers_credentials({"s3://mybucket/my/data/": icechun | |||||
| repo = icechunk.Repository.create(storage, config, credentials) | ||||||
| ``` | ||||||
|
|
||||||
| !!! note | ||||||
|
|
||||||
| The updated configuration will only be persisted across sessions if you call [`repo.save_config`][icechunk.Repository.save_config]. | ||||||
| This is therefore recommended, so that users reading the virtual chunks in later sessions do not also have to set the virtual containers. | ||||||
|
|
||||||
| With the repo created, and the virtual chunk container added, let's write our virtual dataset to Icechunk with VirtualiZarr! | ||||||
|
|
||||||
| ```python | ||||||
| session = repo.writable_session("main") | ||||||
| virtual_ds.virtualize.to_icechunk(session.store) | ||||||
| virtual_ds.vz.to_icechunk(session.store) | ||||||
| ``` | ||||||
|
|
||||||
| The refs are written, so let's save our progress by committing to the store. | ||||||
|
|
@@ -234,7 +261,9 @@ No extra configuration is necessary for local filesystem references. | |||||
|
|
||||||
| ### Virtual Reference File Format Support | ||||||
|
|
||||||
| Currently, Icechunk supports `HDF5`, `netcdf4`, and `netcdf3` files for use in virtual references with `VirtualiZarr`. Support for other filetypes is under development in the VirtualiZarr project. Below are some relevant issues: | ||||||
| Icechunk supports storing virtual references to any format that VirtualiZarr can parse. VirtualiZarr ships with parsers for a range of formats, including `HDF5`, `netcdf4`, and `netcdf3`. You can also write your own [custom parser](https://virtualizarr.readthedocs.io/en/latest/custom_parsers.html) for virtualizing other file formats. | ||||||
|
|
||||||
| Support for other common filetypes is under development within the VirtualiZarr project. Below are some relevant issues: | ||||||
|
|
||||||
| - [meta issue for file format support](https://github.com/zarr-developers/VirtualiZarr/issues/218) | ||||||
| - [Support for GRIB2 files](https://github.com/zarr-developers/VirtualiZarr/issues/312) | ||||||
|
|
||||||
The only reason we are using `fsspec` and `s3fs` here is literally just to glob the files in the bucket. If we had a globbing function in `obspec` then we wouldn't need `fsspec` at all.
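
For illustration only, globbing can be approximated by listing object keys under a prefix and filtering them with the standard library's `fnmatch`. The sketch below is not an `obstore` or `obspec` API: `glob_keys` is a hypothetical helper, and the keys and bucket name are made up. In real use the keys would come from whatever listing call the object store client provides.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def glob_keys(keys: Iterable[str], pattern: str, url_prefix: str = "") -> list[str]:
    """Return the sorted keys that match a shell-style pattern.

    Hypothetical helper for illustration. Note that fnmatch's `*` also
    matches `/`, so this is looser than a true path-aware glob.
    """
    matched = [key for key in keys if fnmatchcase(key, pattern)]
    # Prepend the scheme and bucket so the results can be used as URLs.
    return sorted(f"{url_prefix}{key}" for key in matched)


# Made-up keys standing in for the result of an object-store listing call.
example_keys = [
    "data/202408/file-20240802.nc",
    "data/202408/file-20240801.nc",
    "data/202408/README.txt",
    "data/202407/file-20240731.nc",
]

urls = glob_keys(example_keys, "data/202408/*.nc", "s3://example-bucket/")
print(urls)
```

Filtering on the client side like this keeps the dependency surface small, at the cost of listing every key under the prefix before matching.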