Cordex extension and trying to build on higher level abstractions #64

huard · 2024-10-11T18:22:53Z

I added a CMIP6-CORDEX extension and implementation, trying to create base classes that would simplify the addition of other extensions.

This simplifies the implementation part a bit, but you'll see that there is still some boilerplate code we could do without on the implementation side.

The main change is that I created a generic THREDDSCatalogDataModel. Extensions then only have to define the data model for their properties, and how to construct a unique ID. If a jsonschema is provided, then it will be used to validate the incoming data. I've disabled the validation done at the STAC extension level for now (see below).

I've struggled a bit with the role of the jsonschema here. In climate science, this is not a very popular tool. Even if scientific schemas appeared, we'd have to embed them into a STAC-specific schema. You'll see that I've created a schema directory with the CORDEX schema for global attributes, but this is not a STAC schema per se. A STAC schema would embed those attributes into a properties object, accompanied by a type. I didn't know how to embed one schema into another, which is why I disabled the extension schema validation.

To try it:

stac-populator run Ouranos_CMIP6-CORDEX http://localhost:8880/stac https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/disk2/ouranos/CORDEX/CMIP6/DD/NAM-12/OURANOS/MPI-ESM1-2-LR/ssp370/r1i1p1f1/CRCM5/v1-r1/day/tas/v20231208/catalog.html

If you think this kind of abstraction is useful, I could port those changes to the CMIP6 case in another PR.

fmigneault

About the "embedded schema", I think an extension could be defined as follows (shown in YAML for brevity; convert it to JSON before applying it):

type: object
required:
  - type
  - properties
properties:
  type: 
    const: Feature
  properties:
    $ref: "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"

Then, you can reuse the JSON schema on its own or as STAC extension definition.
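
For illustration, the wrapper above can also be written as a plain Python dict and serialized to JSON. This is only a sketch: the variable names are hypothetical, and how the $ref path gets resolved depends on the base URI the validator is configured with, which the PR does not settle.

```python
import json

# Path of the standalone CORDEX global-attributes schema, as quoted in the YAML above.
CORDEX_ATTRS_SCHEMA = "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"

# Hypothetical wrapper: a STAC Feature whose "properties" object is
# delegated to the standalone schema through $ref.
stac_extension_schema = {
    "type": "object",
    "required": ["type", "properties"],
    "properties": {
        "type": {"const": "Feature"},
        "properties": {"$ref": CORDEX_ATTRS_SCHEMA},
    },
}

# JSON form, ready to be written next to the other schemas.
schema_json = json.dumps(stac_extension_schema, indent=2)
```

The standalone attributes schema stays untouched, so it can still be used on its own outside of STAC.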

As for the PR itself, I have a strong impression that THREDDSCatalogDataModel is essentially trying to accomplish what the "helpers" were trying to do (but missing some interface to connect the dots).

It is a bit hard to analyze the code path with all the abstractions involved. So, if I misinterpreted something in my comments, please let me know.

other todos

Need to add Ouranos_CMIP6-CORDEX to the table in the README.
Update changelog

.gitignore

STACpopulator/extensions/base.py

fmigneault · 2024-10-11T20:52:27Z

STACpopulator/extensions/base.py

+        uri =  cls._schema_uri.default
+        if uri is not None:
+            schema = json.load(open(uri))


This could be improved with a requests file handler, allowing either a local or a remote URI, but this is not "blocking" for the PR.

I was unsure how to deal with references within a schema if it was not local.
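
A minimal sketch of loading the schema from either kind of URI. It uses the standard library's urllib rather than the requests file handler mentioned above, load_schema is a hypothetical helper name, and it does not attempt to resolve $ref entries inside a remote schema, which is the open question here.

```python
import json
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def load_schema(uri: str) -> dict:
    """Load a JSON schema from a local path or an http(s) URL (hypothetical helper)."""
    if urlparse(uri).scheme in ("http", "https"):
        # Remote schema: fetch and parse the response body.
        with urlopen(uri, timeout=10) as response:
            return json.load(response)
    # Anything else is treated as a local file path.
    with Path(uri).open(encoding="utf-8") as f:
        return json.load(f)
```

Using context managers also closes the file handle, which the bare json.load(open(uri)) does not do explicitly.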

STACpopulator/extensions/base.py

fmigneault · 2024-10-11T21:02:29Z

STACpopulator/extensions/base.py

+    # List of properties not meant to be validated by json schema.
+    _schema_exclude: list[str] = PrivateAttr([])


Can't the model_config be used for that?

class Model(DataModel):
    model_config = ConfigDict(
        populate_by_name=True,
        extra="ignore",
        fields={"field-to-exclude": {"exclude": True}},
    )

Otherwise, reuse the same PrivateAttr approach, and filter by annotation/field-type?

I don't think so, because those were fields I wanted to exclude from the schema validation, but not from the model dump. I was thinking of a case where the schema strictly prohibits extra attributes, but I realize this might be a rare corner case.
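
To make the distinction concrete, here is a small sketch under assumed names: split_for_validation and the internal_note field are hypothetical, and the real _schema_exclude handling in the PR may differ.

```python
from typing import Any


def split_for_validation(properties: dict[str, Any], schema_exclude: list[str]) -> dict[str, Any]:
    """Return only the properties that should be passed to JSON schema validation."""
    return {k: v for k, v in properties.items() if k not in schema_exclude}


# Hypothetical item properties: "internal_note" is unknown to the schema,
# which would reject it if additional properties were prohibited.
properties = {"domain_id": "NAM-12", "internal_note": "added by populator"}

to_validate = split_for_validation(properties, schema_exclude=["internal_note"])
to_dump = dict(properties)  # the dumped item still carries every field
```

Excluding a field through the model configuration would remove it from the dump as well, which is not what is wanted here.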

STACpopulator/populator_base.py

fmigneault · 2024-10-11T21:46:10Z

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py

+    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
+        dm = self.data_model.from_data(item_data)
+        return dm.stac_item()


Looks like this could be directly in STACpopulatorBase since it only refers to data_model overridden by the class. Especially if extensions are generalized, this might become redundant across implementations.

However, I'm noticing here that we are still limited to a single extension. If I want to define a dataset that uses both datacube and Cordex6DataModel properties, I have to create yet another populator and define create_stac_item with my own custom set of operations.

What we might need instead is a list of helper extensions that are applied to the given data.
The pattern is very consistent.

For example, CMIP6populator and CORDEX_STAC_Populator could have:

class CMIP6populator(STACpopulatorBase):
    item_helpers = [CMIP6Helper, DatacubeHelper, THREDDSHelper]

class CORDEX_STAC_Populator(STACpopulatorBase):
    item_helpers = [Cordex6Helper]

And then, we would have:

class STACpopulatorBase:
    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        item = pystac.Item(...)
        for helper_cls in self.item_helpers:
            helper = helper_cls(item_data)
            item = helper.apply(item)
        return item

Where each helper has something along the lines of:

def apply(item: pystac.Item) -> pystac.Item:
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
    dc_ext.apply(dimensions=dc_helper.dimensions, variables=dc_helper.variables)
    return dc_ext.item

# or

def apply(item: pystac.Item) -> pystac.Item:
    valid_data = Cordex6DataModel(self.item_data)
    valid_json = json.loads(valid_data.model_dump_json(by_alias=True))
    item.properties.update(valid_json)
    return item

Using this "helper" approach, you wouldn't need to define all the boiler-plate code for a typical "stac extension classes". What apply() does is up to the helper.
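
A runnable sketch of the pattern, kept free of dependencies: the item is a plain dict standing in for pystac.Item, and PropertiesHelper, AccessUrlHelper, PopulatorSketch and the sample data are all hypothetical names, not classes from the repository.

```python
from typing import Any


class PropertiesHelper:
    """Hypothetical helper that merges the dataset attributes into item properties."""

    def __init__(self, item_data: dict[str, Any]) -> None:
        self.attributes = item_data.get("attributes", {})

    def apply(self, item: dict[str, Any]) -> dict[str, Any]:
        item["properties"].update(self.attributes)
        return item


class AccessUrlHelper:
    """Hypothetical helper that turns access URLs into item assets."""

    def __init__(self, item_data: dict[str, Any]) -> None:
        self.urls = item_data.get("access_urls", {})

    def apply(self, item: dict[str, Any]) -> dict[str, Any]:
        for name, href in self.urls.items():
            item["assets"][name] = {"href": href}
        return item


class PopulatorSketch:
    # Each populator only declares which helpers it wants, in order.
    item_helpers: list[type] = [PropertiesHelper, AccessUrlHelper]

    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        # A plain dict stands in for pystac.Item in this sketch.
        item = {"type": "Feature", "id": item_name, "properties": {}, "assets": {}}
        for helper_cls in self.item_helpers:
            item = helper_cls(item_data).apply(item)
        return item


item_data = {
    "attributes": {"domain_id": "NAM-12"},
    "access_urls": {"HTTPServer": "https://example.com/file.nc"},
}
item = PopulatorSketch().create_stac_item("demo", item_data)
```

Adding another extension to a dataset then means appending one helper to item_helpers instead of writing a new create_stac_item.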

I like this idea, will look into it and come back with questions.

Note that THREDDSCatalogDataModel automatically applies the datacube and thredds extensions.

One issue I see with this is that the extension helpers have different __init__ requirements. So either the helpers know how to parse the input data, or the object instantiating them provides that logic.

We could have an in-between solution.
The item_helpers list could define instances rather than type references:

item_helpers = [
    HelperWithoutArg(),
    THREDDSHelper(["<url>"]),
]

Anything that can be supplied at init would be created right away, and the STAC item objects would be obtained during the apply(item) call.

I don't think there are any cases where the helpers would be missing references that would limit this approach, but this remains to be investigated...

Not sure I follow. You need the data to create instances of the helpers.

What we could do is something like this:

    @classmethod
    def from_data(cls, data):
        """Instantiate class from data provided by THREDDS Loader."""
        # This is where we match the Loader's output to the STAC item and extensions inputs. If we had multiple
        # loaders, that's probably the only thing that would be different between them.
        return cls(
            data=data,
            start_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_start"],
            end_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_end"],
            geometry=ncattrs_to_geometry(data),
            bbox=ncattrs_to_bbox(data),
            properties=data["attributes"],
        )

    @model_validator(mode="before")
    @classmethod
    def datacube_helper(cls, data):
        """Validate the DataCubeHelper."""
        data["datacube"] = DataCubeHelper(data["data"])
        return data

    @model_validator(mode="before")
    @classmethod
    def thredds_helper(cls, data):
        """Validate the THREDDSHelper."""
        data["thredds"] = THREDDSHelper(data["data"]["access_urls"])
        return data

fmigneault · 2024-10-11T22:01:09Z

STACpopulator/implementations/Ouranos_CMIP6-CORDEX/add_CORDEX6.py

+    data_model = Cordex6DataModel
+    item_geometry_model = None  # Unnecessary, but kept for consistency


This is defined for the CMIP6populator:

class CMIP6populator(STACpopulatorBase):
    item_properties_model = CMIP6Properties
    item_geometry_model = GeoJSONPolygon

And data_model = Cordex6DataModel basically offers:

Cordex6DataModel.properties == CordexCmip6  # -> just like CMIP6Properties
Cordex6DataModel -> THREDDSCatalogDataModel.geometry  # -> just like item_geometry_model

I'm wondering if there's any duplication of the intended use of these properties?

Yes, because I didn't want to break the CMIP extension and implementation just yet. My idea was to try to generalize the CORDEX example, get a sense of where this is going, and once we're happy, then bring the changes to CMIP6.

STACpopulator/extensions/cordex6.py

fmigneault · 2024-10-11T22:04:41Z

STACpopulator/extensions/cordex6.py

+# This is generated using datamodel-codegen + manual edits
+class CordexCmip6(DataModel):


Since the model is generated from the schema, why is the @model_validator needed to load and validate the JSON schema?

I'm not seeing the subtlety from static code analysis.

It's a slimmed-down version of the schema without the actual CV validation. The schema includes enums with the CVs, while the pydantic DataModel does not.

This is a question I struggled with. I felt it didn't make a lot of sense to duplicate the jsonschema validation in pydantic. On the other hand, relying only on the schema and not even seeing the attributes in the code felt obscure and not admin friendly. So I thought it would be useful to have a pydantic DataModel layer where you can add attributes to the data model, and exclude some that are in the schema but you don't want in the STAC item.

I wonder if that is an issue with datamodel-codegen, or an option to provide?
Normally, the enums should be possible using Literal type with pydantic.

I think it makes sense to have the DataModel auto-generated from schema to provide the attributes. It's easier to manipulate by users used to Python but not so much JSON schema.

Yes, it's definitely possible. I just didn't include the Literals in the python code.
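
As a rough illustration of carrying a CV into the Python layer with Literal: the sketch below uses only the standard library, and the Frequency values and AttrsSketch class are hypothetical. pydantic checks Literal-annotated fields on its own, so the explicit check here exists only to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical controlled vocabulary; the real values live in the JSON schema enums.
Frequency = Literal["day", "mon"]


@dataclass
class AttrsSketch:
    frequency: Frequency

    def __post_init__(self) -> None:
        # Spelled out by hand here; a pydantic model would reject the value itself.
        allowed = get_args(Frequency)
        if self.frequency not in allowed:
            raise ValueError(f"frequency must be one of {allowed}, got {self.frequency!r}")
```

With the Literals in place, the allowed values are visible in the code and not only in the schema file.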

huard added 9 commits October 3, 2024 11:36

typos

3c85233

work on extension abstraction and cordex example

f1c5683

removed json from .gitignore. Add CORDEX6 json schema

c0bdeb2

merge

3a4b6ef

embedding datacube and thredds extension in the base logic

b7d9b75

got it to work

98c6420

cordex implementation

33d78b9

add missing item_geometry_model

0212008

get the cordex extension to work with the stac-populator cli.

04d94df

huard requested a review from fmigneault as a code owner October 11, 2024 18:22

huard added 3 commits October 11, 2024 14:34

add some notes and comments

ac0cec2

clean-up

03e5738

remove break

8cceeab

fmigneault reviewed Oct 11, 2024


huard added 5 commits October 15, 2024 11:09

suggestions from review

8b583bb

added apply method to extension helpers

8db0847

include schemas in installation source

df35431

Put generic STAC item logic into BaseSTAC class

c6e68e8

docstring

d76fdd8
