STAC Catalog architecture #21

Open
huard opened this issue Sep 27, 2023 · 0 comments

There are multiple ways to organize catalogs, collections, items and assets. I think we need to agree on something to move forward.

Requirements:

  • Handle files containing one data variable and files with multiple data variables
  • Handle datasets where the same variable is split across multiple files (e.g. 10-year slices)
  • Handle netCDF files and OPeNDAP links
  • Handle Zarr objects (future-proof)
  • Simplify typical ensemble creation operations (all simulations that have x,y,z variables for a,b,c experiments)
  • Ensure search queries return meaningful results, so that users are not drowned in results.
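
To make the ensemble requirement concrete, here is a minimal sketch of the kind of selection a user would want to run, written against plain Python records rather than a real STAC client. The facet names (`model`, `member`, `experiment`, `variable`) and the `ModelB` entry are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical flattened records, one per file. The facet names are assumptions.
records = [
    {"model": "CanESM5", "member": "r1i1p2", "experiment": "ssp370", "variable": "tas"},
    {"model": "CanESM5", "member": "r1i1p2", "experiment": "ssp370", "variable": "pr"},
    {"model": "ModelB", "member": "r1i1p1", "experiment": "ssp370", "variable": "tas"},
]


def complete_simulations(records, variables, experiments):
    """Return (model, member) pairs that provide every requested
    variable for every requested experiment."""
    available = defaultdict(set)
    for rec in records:
        key = (rec["model"], rec["member"])
        available[key].add((rec["experiment"], rec["variable"]))
    required = {(e, v) for e in experiments for v in variables}
    # Keep only simulations whose available pairs cover all required pairs.
    return sorted(key for key, have in available.items() if required <= have)


ensemble = complete_simulations(records, ["tas", "pr"], ["ssp370"])
```

With these records, only the CanESM5 member qualifies, because the hypothetical `ModelB` has no `pr` file. Whichever layout we pick should make this kind of query cheap to express.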

Options:

CMIP6 catalog / File item / Asset

All properties are at the File Item level, meaning Assets are just the various access endpoints. One simulation for a given model and experiment would be composed of multiple items (variables, periods). A typical search would return a large number of results.
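
As a rough illustration of this option, the sketch below builds one File Item as a plain dictionary shaped like a STAC Item, with two assets pointing at the same file through different endpoints. The `cmip6:` property prefix, the asset keys, and the URLs are all assumptions, not an agreed schema.

```python
def make_file_item(item_id, facets, start, end, http_url, opendap_url):
    """Build a STAC-Item-like dict where all metadata lives in the item
    properties and the assets only differ by access endpoint."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,
        "properties": {
            # A null datetime is paired with an explicit start/end range.
            "datetime": None,
            "start_datetime": start,
            "end_datetime": end,
            # Hypothetical prefix for the CMIP6 facets.
            **{f"cmip6:{name}": value for name, value in facets.items()},
        },
        "links": [],
        "assets": {
            "netcdf": {"href": http_url, "roles": ["data"]},
            "opendap": {"href": opendap_url, "roles": ["data"]},
        },
    }


item = make_file_item(
    "cmip6_ssp370_canesm5_r1i1p2_tas_2015-2024",
    {"experiment_id": "ssp370", "source_id": "CanESM5", "variable_id": "tas"},
    "2015-01-01T00:00:00Z",
    "2024-12-31T23:59:59Z",
    "https://example.org/fileServer/tas_2015-2024.nc",  # placeholder URL
    "https://example.org/dodsC/tas_2015-2024.nc",  # placeholder URL
)
```

Every file becomes one such item, which is why a search across a whole simulation returns one hit per variable and period.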

CMIP6 catalog / Experiment collection / Model collection / Member collection / Variable collection / File item / Asset

Here we subdivide the catalog into multiple hierarchical collections. If we limit search results to collections, we'd be able to go down the hierarchy without being flooded by results (I assume). Aggregating the Items split by time periods would generate continuous time series.
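
Aggregating time-split Items only yields a continuous series if the periods line up. A minimal check could look like the following, assuming each Item's period is represented as a half-open `(start, end)` pair of dates; that convention is an assumption, not something the files or STAC impose.

```python
from datetime import date


def is_continuous(periods):
    """Return True if the sorted half-open (start, end) periods
    follow each other without gaps or overlaps."""
    ordered = sorted(periods)
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


periods = [
    (date(2025, 1, 1), date(2035, 1, 1)),
    (date(2015, 1, 1), date(2025, 1, 1)),
]
continuous = is_continuous(periods)
```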

Note that Collection IDs should be globally unique, meaning that the variable collection cannot simply be named tas, but would have to look something like cmip6_ssp370_canesm5_r1i1p2_tas. It is not clear how to deal with files that store multiple variables in this scheme, but the collection ID could be cmip6_ssp370_canesm5_r1i1p2_multi in those cases.
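
The naming scheme could be generated with a small helper like this one. The function name and argument order are hypothetical; the `multi` suffix follows the suggestion above.

```python
def collection_id(activity, experiment, model, member, variables):
    """Join the facets into a lowercase, globally unique collection ID.
    Files holding several variables get the 'multi' suffix."""
    variables = list(variables)
    suffix = variables[0] if len(variables) == 1 else "multi"
    parts = [activity, experiment, model, member, suffix]
    return "_".join(part.lower() for part in parts)


single = collection_id("CMIP6", "ssp370", "CanESM5", "r1i1p2", ["tas"])
multi = collection_id("CMIP6", "ssp370", "CanESM5", "r1i1p2", ["tas", "pr"])
```

Here `single` is `cmip6_ssp370_canesm5_r1i1p2_tas` and `multi` is `cmip6_ssp370_canesm5_r1i1p2_multi`. One drawback is that two different multi-variable files for the same member would collide on the same ID.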

It is unclear how search would work, since collection search is still at the proposal stage.

CMIP6 catalog / Experiment collection / Model collection / Member collection / Variable item / Asset

The only difference from the previous option is that, for multiple time periods, we'd have only one item with multiple assets. The start and end dates would then be asset properties, which could interfere with search functionality:

As detailed above, Items contain properties, which are the main source of metadata for searching across Items. Many content extensions can add further property fields as well. Any property that can be specified for an Item can also be specified for a specific asset. This can be used to override a property defined in the Item, or to specify fields for which there is no single value for all assets.

It is important to note that the STAC API does not facilitate searching across Asset properties in this way, and this should be used sparingly. It is primarily used to define properties at the Asset level that may be used during use of the data instead of for searching.
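
One possible mitigation, sketched below, is to keep the per-file dates on the assets but also publish their overall envelope as item-level `start_datetime` and `end_datetime`, so that a temporal search still matches the item. The asset keys are made up, and the sketch assumes all timestamps are UTC strings in the same ISO 8601 format, so that string comparison matches chronological order.

```python
def item_time_envelope(assets):
    """Return the (earliest start, latest end) across all assets.
    Assumes same-format UTC ISO 8601 strings, which sort chronologically."""
    starts = [asset["start_datetime"] for asset in assets.values()]
    ends = [asset["end_datetime"] for asset in assets.values()]
    return min(starts), max(ends)


# Hypothetical assets of a single Variable item, one per time slice.
assets = {
    "tas_2015-2024": {
        "href": "https://example.org/tas_2015-2024.nc",  # placeholder URL
        "start_datetime": "2015-01-01T00:00:00Z",
        "end_datetime": "2024-12-31T23:59:59Z",
    },
    "tas_2025-2034": {
        "href": "https://example.org/tas_2025-2034.nc",  # placeholder URL
        "start_datetime": "2025-01-01T00:00:00Z",
        "end_datetime": "2034-12-31T23:59:59Z",
    },
}

start, end = item_time_envelope(assets)
item_properties = {"datetime": None, "start_datetime": start, "end_datetime": end}
```

A search would then find the item for any overlapping period, but the client would still have to pick the right assets itself.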

CMIP6 catalog / Simulation collection / File item / Asset

Here there is only one collection level, which would indicate which files can be aggregated. The criteria would be for files to share the same spatial grid and calendar, and to originate from the same climate model. A Simulation collection would include all experiments and realizations. Variables on the same grid would also be part of the same collection, but that would mean the same model would typically need at least two collections (atmos and ocean, which are on different grids).
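
The grouping criteria can be sketched as a key made of model, grid and calendar. The record fields, grid labels, calendar value and file names below are placeholders for illustration.

```python
from collections import defaultdict


def group_into_simulations(files):
    """Group file paths by (model, grid, calendar); each group would
    become one Simulation collection."""
    groups = defaultdict(list)
    for f in files:
        key = (f["model"], f["grid"], f["calendar"])
        groups[key].append(f["path"])
    return dict(groups)


# Hypothetical file records for a single model.
files = [
    {"path": "tas_ssp370_r1i1p2.nc", "model": "CanESM5", "grid": "atmos_grid", "calendar": "noleap"},
    {"path": "pr_ssp585_r2i1p2.nc", "model": "CanESM5", "grid": "atmos_grid", "calendar": "noleap"},
    {"path": "tos_ssp370_r1i1p2.nc", "model": "CanESM5", "grid": "ocean_grid", "calendar": "noleap"},
]

collections = group_into_simulations(files)
```

The single model ends up in two collections, one per grid, and the atmosphere collection mixes experiments and realizations as described above.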

I'm sure I missed a lot of potential issues, and I haven't yet done a review of other implementations to understand their organization. Most of the reading I've done on this had to do with Zarr datasets, and how to describe them within STAC.
