Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cordex extension and trying to build on higher level abstractions #64

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

huard
Copy link
Collaborator

@huard huard commented Oct 11, 2024

I added a CMIP6-CORDEX extension and implementation, trying to create base classes that would simplify the addition of other extensions.

This simplifies a bit the implementation part, but you'll see that there is still some boilerplate code we could do without on the implementation side.

The main change is that I created a generic THREDDSCatalogDataModel. Extensions then only have to define the data model for their properties, and how to construct a unique ID. If a jsonschema is provided, then it will be used to validate the incoming data. I've disabled the validation done at the STAC extension level for now (see below).

I've struggled a bit with the role of the jsonschema here. In climate science, this is not a very popular tool. Even if scientific schemas appeared, we'd have to embed them into a STAC specific schema. You'll see that I've created a schema directory with the CORDEX schema for global attributes, but this is not a STAC schema per say. A STAC schema would embed those attributes into a property object, accompanied by a type. I didn't know how to embed a schema into another, that's why I disabled the extension schema validation.

To try it:

stac-populator run Ouranos_CMIP6-CORDEX http://localhost:8880/stac https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/disk2/ouranos/CORDEX/CMIP6/DD/NAM-12/OURANOS/MPI-ESM1-2-LR/ssp370/r1i1p1f1/CRCM5/v1-r1/day/tas/v20231208/catalog.html

If you think this kind of abstraction is useful, I could port those changes to the CMIP6 case in another PR.

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the "embedded schema", I think an extension could be defined as such (using YAML for short, but convert to JSON for applying it):

type: object
required:
  - type
  - properties
properties:
  type: 
    const: Feature
  properties:
    $ref: "STACpopulator/extensions/schemas/cordex6/cmip6-cordex-global-attrs-schema.json"

Then, you can reuse the JSON schema on its own or as STAC extension definition.

As for the PR itself, I have a strong sensation that THREDDSCatalogDataModel is essentially trying to accomplish what the "helpers" were trying to do (but missing some interface to connect the dots).

It is a bit hard to analyze the code path with all the abstractions involved. So, if I misinterpreted something in my comments, please let me know.

other todos

  • Need to add Ouranos_CMIP6-CORDEX to the table in the README.
  • Update changelog

.gitignore Show resolved Hide resolved
STACpopulator/extensions/base.py Outdated Show resolved Hide resolved
Comment on lines +78 to +80
uri = cls._schema_uri.default
if uri is not None:
schema = json.load(open(uri))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be improved with requests file-handler, allowing either local or remote URI, but not "blocking" for the PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure how to deal with references within a schema if it was not local.

STACpopulator/extensions/base.py Outdated Show resolved Hide resolved
Comment on lines 68 to 69
# List of properties not meant to be validated by json schema.
_schema_exclude: list[str] = PrivateAttr([])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the model_config be used for that?

class Model(DataModel):
    model_config = ConfigDict(
      populate_by_name=True,
      extra="ignore",
      fields={"field-to-exclude": {"exclude":True}, 
    )

Otherwise, reuse the same PrivateAttr approach, and filter by annotation/field-type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because those were fields I wanted to exclude from the schema validation, but not from the model dump. I was thinking of a case where the schema is strictly prohibiting extra attributes, but I realize this might be a very edgy corner case.

STACpopulator/populator_base.py Outdated Show resolved Hide resolved
Comment on lines +31 to +33
def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
dm = self.data_model.from_data(item_data)
return dm.stac_item()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this could be directly in STACpopulatorBase since it only refers to data_model overridden by the class. Especially if extensions are generalized, this might become redundant across implementations.

However, I'm noticing here that we are still limited by a single extension. If I want to define a dataset that uses datacube and Cordex6DataModel properties, I have to create yet another populator and define the create_stac_item with by custom set of operations.

What we might need instead a list of helper-exntenions that apply onto the given data.
The pattern is very consistent.

For example, CMIP6populator and CORDEX_STAC_Populator could have:

class CMIP6populator(STACpopulatorBase):
    item_helpers = [CMIP6Helper, DatacubeHelper, THREDDSHelper]

class CORDEX_STAC_Populator(STACpopulatorBase):
    item_helpers = [Cordex6Helper]

And then, we would have:

class STACpopulatorBase:
    def create_stac_item(self, item_name: str, item_data: dict[str, Any]) -> dict[str, Any]:
        item = pystac.Item(...)
        for helper in self.item_helpers:
            helper = SomeHelper(item_data)
            item = helper.apply(item)
        return item

Where each helper has something along the lines of:

def apply(item: pystac.Item) -> pystac.Item:
    dc_ext = DatacubeExtension.ext(item, add_if_missing=True)
    dc_ext.apply(dimensions=dc_helper.dimensions, variables=dc_helper.variables)
    return dc_ext.item

# or 

def apply(item: pystac.Item) -> pystac.Item:
    valid_data = Cordex6DataModel(self.item_data)
    valid_json = json.loads(valid_data.model_dump_json(by_alias=True))
    item.properties.update(valid_json)
    return item

Using this "helper" approach, you wouldn't need to define all the boiler-plate code for a typical "stac extension classes". What apply() does is up to the helper.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea, will look into it and come back with questions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that that THREDDSCatalogDataModel automatically applies the datacube and thredds extension.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue I see with this is that the extension helpers have different __init__ requirements. So either the helpers know how to parse the input data, or the object instantiating them provides that logic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an in-between solution.
The item_helpers list could define instances rather than type references:

item_helpers = [
  HelperWithoutArg(),
  THREDDSHelper(["<url>"]),
]

Anything that can be supplied at init would be created right away, and the STAC item objects would be obtained during the apply(item) call.

I don't think there are any cases where the helpers would be missing references limiting this approach, but to investigate...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow. You need the data to create instances of the helpers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do is something like this:

    @classmethod
    def from_data(cls, data):
        """Instantiate class from data provided by THREDDS Loader.
        """
        # This is where we match the Loader's output to the STAC item and extensions inputs. If we had multiple
        # loaders, that's probably the only thing that would be different between them.
        return cls(data=data,
                   start_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_start"],
                   end_datetime=data["groups"]["CFMetadata"]["attributes"]["time_coverage_end"],
                   geometry=ncattrs_to_geometry(data),
                   bbox=ncattrs_to_bbox(data),
                   properties=data["attributes"],
                   )

    @model_validator(mode="before")
    @classmethod
    def datacube_helper(cls, data):
        """Validate the DataCubeHelper."""
        data["datacube"] = DataCubeHelper(data['data'])
        return data

    @model_validator(mode="before")
    @classmethod
    def thredds_helper(cls, data):
        """Validate the DataCubeHelper."""
        data["thredds"] = THREDDSHelper(data['data']["access_urls"])
        return data

Comment on lines +28 to +29
data_model = Cordex6DataModel
item_geometry_model = None # Unnecessary, but kept for consistency
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined for the CMIP6populator:

class CMIP6populator(STACpopulatorBase):
    item_properties_model = CMIP6Properties
    item_geometry_model = GeoJSONPolygon

And data_model = Cordex6DataModel basically offers:

Cordex6DataModel.properties == CordexCmip6  #  -> just like CMIP6Properties
Cordex6DataModel -> THREDDSCatalogDataModel.geometry  # -> just like item_geometry_model

I'm wondering if there's any duplication of the intended use of these properties?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because I didn't want to break the CMIP extension and implementation just yet. My idea was to try to generalize the CORDEX example, get a sense of where this is going, and once we're happy, then bring the changes to CMIP6.

STACpopulator/extensions/cordex6.py Outdated Show resolved Hide resolved
Comment on lines 13 to 14
# This is generated using datamodel-codegen + manual edits
class CordexCmip6(DataModel):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the model is generated from the schema, why is the @model_validator needed to load and validate the JSON schema?

I'm not seeing the subtlety from static code analysis.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slimmed down version of the schema without the actual CV validation. The schema includes enums with the CVs, while the pydantic.DataModel does not.

This is a question I struggled with. I felt it didn't make a lot of sense to duplicate the jsonschema validation in pydantic. On the other hand, relying only on the schema and not even seeing the attributes in the code felt obscure and not admin friendly. So I thought it would be useful to have a pydantic DataModel layer where you can add attributes to the data model, and exclude some that are in the schema but you don't want in the STAC item.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if that is an issue with datamodel-codegen, or an option to provide?
Normally, the enums should be possible using Literal type with pydantic.

I think it makes sense to have the DataModel auto-generated from schema to provide the attributes. It's easier to manipulate by users used to Python but not so much JSON schema.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's definitely possible. I just didn't include the Literals in the python code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants