Clarify mechanism to get asset links #34

Open
huard opened this issue Nov 15, 2023 · 16 comments · May be fixed by #35

Comments

@huard
Collaborator

huard commented Nov 15, 2023

STAC Assets creation depends on an attribute called access_urls, which holds the various endpoints served by THREDDS. At the moment, we get these endpoints by

  1. Sending a request to the NcML service -> xml
  2. Converting the xml response to a dict using xncml.Dataset.to_cf_dict -> attrs
  3. Updating attrs["access_urls"] with siphon.catalog.Dataset.access_urls

These look like this:

{'HTTPServer': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'OPENDAP': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'NCML': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncml/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'UDDC': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/uddc/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'ISO': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/iso/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'WCS': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wcs/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'WMS': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wms/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc',
 'NetcdfSubset': 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc'}

This is done by THREDDSLoader.extract_metadata.
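To make the shape of access_urls concrete, here is a minimal sketch that reproduces the dictionary above from a THREDDS root URL and a dataset path. It is not how siphon computes these links; it only mirrors the pattern visible in the listed URLs. The helper name build_access_urls and the SERVICE_PATHS table are assumptions made for illustration.

```python
# Path segment used by each service, read off the URLs listed above.
# This table is an assumption for illustration, not siphon's logic.
SERVICE_PATHS = {
    "HTTPServer": "fileServer",
    "OPENDAP": "dodsC",
    "NCML": "ncml",
    "UDDC": "uddc",
    "ISO": "iso",
    "WCS": "wcs",
    "WMS": "wms",
    "NetcdfSubset": "ncss",
}


def build_access_urls(thredds_url: str, dataset_path: str) -> dict:
    """Return {service name: endpoint URL} for one dataset (hypothetical helper)."""
    base = thredds_url.rstrip("/")
    path = dataset_path.lstrip("/")
    return {name: f"{base}/{segment}/{path}" for name, segment in SERVICE_PATHS.items()}


urls = build_access_urls(
    "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds",
    "birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc",
)
```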

I think a cleaner solution would be to rely on the THREDDS response itself for those access urls instead of the siphon implementation.

We can get the THREDDS access points by sending a get request to the same NcML service, but with parameters:
requests.get(url, params={"catalog": catalog, "dataset": dataset}) with

    catalog : str
      Link to catalog storing the dataset.
    dataset : str
      Relative link to the dataset.
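As a rough sketch of what that request looks like on the wire, the query string can be built with the standard library alone. The helper name ncml_request_url is hypothetical, and the example values are the ones used elsewhere in this thread; requests may percent-encode the parameters slightly differently, but the decoded values are the same.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def ncml_request_url(ncml_url: str, catalog: str, dataset: str) -> str:
    """Append the catalog and dataset parameters to an NcML service URL."""
    query = urlencode({"catalog": catalog, "dataset": dataset})
    return f"{ncml_url}?{query}"


thredds_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds"
thredds_path = "birdhouse/testdata/xclim/cmip6"
thredds_nc = "sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"

catalog = f"{thredds_url}/catalog/{thredds_path}/catalog.html"
dataset = f"{thredds_path}/{thredds_nc}"
url = ncml_request_url(f"{thredds_url}/ncml/{thredds_path}/{thredds_nc}", catalog, dataset)

# Decode the query string again to check that both parameters round-trip.
decoded = parse_qs(urlsplit(url).query)
```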

With this modified request url, the response includes the following additional group THREDDSMetadata:

OrderedDict([('attributes',
              OrderedDict([('id',
                            'birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc'),
                           ('full_name',
                            'cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc')])),
             ('groups',
              OrderedDict([('services',
                            OrderedDict([('attributes',
                                          OrderedDict([('httpserver_service',
                                                        'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc'),
                                                       ('opendap_service',
                                                        'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc'),
                                                       ('wcs_service',
                                                        'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wcs/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc?service=WCS&version=1.0.0&request=GetCapabilities'),
                                                       ('wms_service',
                                                        'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/wms/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc?service=WMS&version=1.3.0&request=GetCapabilities'),
                                                       ('nccs_service',
                                                        'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/ncss/birdhouse/testdata/xclim/cmip6/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc/dataset.html')]))])),
                           ('dates',
                            OrderedDict([('attributes', OrderedDict())]))]))])

This yields an id, and a list of services (note the keys are not the same as above, underlining the fact that the siphon implementation may be arbitrarily assigning names).
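If STAC item creation were to consume this group, the keys would need to be reconciled with the names used today. The sketch below takes a THREDDSMetadata group shaped like the one above and returns a dictionary with siphon-style names. The KEY_MAP correspondence is an assumption inferred by comparing the two listings, and services_to_access_urls is a hypothetical helper. Note that the URLs themselves also differ (for instance the WCS and WMS links carry a GetCapabilities query string here), so renaming keys alone does not make the two outputs identical.

```python
# Assumed correspondence between NcML-service keys and the names in access_urls.
KEY_MAP = {
    "httpserver_service": "HTTPServer",
    "opendap_service": "OPENDAP",
    "wcs_service": "WCS",
    "wms_service": "WMS",
    "nccs_service": "NetcdfSubset",
}


def services_to_access_urls(thredds_metadata: dict) -> dict:
    """Extract the services of a THREDDSMetadata group and rename the keys.

    Unknown keys are passed through unchanged.
    """
    services = (
        thredds_metadata.get("groups", {})
        .get("services", {})
        .get("attributes", {})
    )
    return {KEY_MAP.get(key, key): link for key, link in services.items()}


# Shortened stand-in for the group shown above (example.org is a placeholder).
sample = {
    "attributes": {"id": "testdata/sic.nc"},
    "groups": {
        "services": {
            "attributes": {
                "opendap_service": "https://example.org/thredds/dodsC/testdata/sic.nc",
                "nccs_service": "https://example.org/thredds/ncss/testdata/sic.nc/dataset.html",
            }
        },
        "dates": {"attributes": {}},
    },
}
access_urls = services_to_access_urls(sample)
```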

My feeling is that the function STAC_item_from_metadata should rely on the latter instead of the former, so it doesn't depend on custom logic hidden in the THREDDSLoader.extract_metadata.

The other bit of additional logic that we should get rid of is attrs["attributes"] = numpy_to_python_datatypes(attrs["attributes"]). I think this conversion could be implemented in to_cf_dict itself, and I'm willing to make an xncml release with this in case you agree with the changes proposed here.
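For reference, a conversion of this kind can be sketched without importing numpy at all, by relying on the tolist() method that numpy scalars and arrays both expose. This is not the project's numpy_to_python_datatypes nor anything in xncml; to_python_types is a hypothetical name, and array.array is used below only as a standard-library stand-in for a numpy array.

```python
from array import array


def to_python_types(value):
    """Recursively replace array-like values by plain Python objects."""
    if isinstance(value, dict):
        return {key: to_python_types(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples come back as lists, which is what JSON serialization expects.
        return [to_python_types(item) for item in value]
    if hasattr(value, "tolist"):
        # numpy scalars and arrays (and array.array) convert themselves.
        return value.tolist()
    return value


converted = to_python_types(
    {"attributes": {"scale": array("d", [1.0, 2.0]), "title": "x"}, "dims": (1, 2)}
)
```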

@huard
Collaborator Author

huard commented Nov 15, 2023

@dchandan @fmigneault

@fmigneault
Collaborator

@huard
Collaborator Author

huard commented Nov 16, 2023

Interesting! I didn't realize this was there.
I agree the formatting done by the NcML service now seems suspicious.

@dchandan
Collaborator

dchandan commented Nov 16, 2023

I also found that requests.get(url, params={"catalog": catalog, "dataset": dataset}) might not work with all THREDDS servers. This could come down to the version of the THREDDS service, as suggested by @fmigneault. But the siphon-based access_urls works in all cases, so I recommend sticking with that. This way, a single implementation should work (as far as I can tell) regardless of the specifics of the THREDDS server whose catalog is being crawled.

@fmigneault
Collaborator

The only issue I have with the siphon approach is that it is way too greedy.
The siphon.catalog.TDSCatalog implementation is poorly made. It starts crawling all of THREDDS as soon as the class is created (on __init__) without leaving us the chance to tweak parameters. It should wait at least until __iter__ is called to list datasets. I had to do some really dirty hack workarounds to try limiting it in https://github.com/crim-ca/ncml2stac/blob/update-stac-populator-refactor/notebooks/ncml2stac.ipynb
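To illustrate the behaviour being asked for, here is a minimal sketch of a catalog object that defers all fetching until it is iterated. It is not siphon's API: LazyCatalog and the injected fetch callable are hypothetical, and a fake fetch function stands in for the network call.

```python
class LazyCatalog:
    """Catalog wrapper that does no work in __init__ (hypothetical class)."""

    def __init__(self, url, fetch):
        self.url = url
        self._fetch = fetch      # callable(url) -> iterable of dataset names
        self._datasets = None    # filled in on first iteration only

    def __iter__(self):
        if self._datasets is None:
            self._datasets = list(self._fetch(self.url))
        return iter(self._datasets)


calls = []


def fake_fetch(url):
    # Stand-in for the request that would list the datasets of a catalog.
    calls.append(url)
    return ["a.nc", "b.nc"]


catalog = LazyCatalog("https://example.org/thredds/catalog.xml", fake_fetch)
calls_after_init = len(calls)  # creating the object fetched nothing
names = list(catalog)          # the fetch happens here
```

With this shape, parameters can still be adjusted between construction and the first iteration.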

@dchandan
Collaborator

I remember you had mentioned something about that before. I think you found startup to be too slow when TDSCatalog was pointed to a large dataset such as the example NOAA THREDDS catalog. I subsequently tested it on my end and found that the initialization was done in under 2-3 seconds, so I didn't experience much of a delay.

@dchandan
Collaborator

But, I do agree, it would be nicer if the siphon implementation wasn't greedy.

@huard
Collaborator Author

huard commented Nov 16, 2023

What I could propose is a stand-alone function that takes a URL and returns the same thing as our extract_metadata, but without siphon.

@huard
Collaborator Author

huard commented Nov 16, 2023

Another option is to submit an issue to siphon and see where that takes us.

@dchandan
Collaborator

What I could propose is a stand-alone function that takes a URL and returns the same thing as our extract_metadata, but without siphon.

Do you mean a catalog URL or a catalog URL with query parameters?

@huard
Collaborator Author

huard commented Nov 16, 2023

Straight URL with no parameters, to avoid the issue you mentioned about not all servers supporting this.

  1. Parse the catalog URL
  2. Get the catalog XML with the services (as shown above by Francis)
  3. Construct the NcML service link
  4. Get the NcML
  5. Assemble everything together
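Steps 1 to 3 can be sketched with the standard library only. The example below parses a catalog XML document, collects the non-compound services, and joins each service base with the dataset's urlPath, which yields the NcML link among others. The sample catalog, the example.org host, and the function name access_urls_from_catalog are assumptions made for illustration; steps 4 and 5 (fetching the NcML and merging it with these links) are left out because they need network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Namespace of THREDDS catalog documents.
NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hand-written sample, much smaller than a real catalog.
SAMPLE_CATALOG = f"""
<catalog xmlns="{NS}">
  <service name="all" serviceType="Compound" base="">
    <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/>
    <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
    <service name="ncml" serviceType="NCML" base="/thredds/ncml/"/>
  </service>
  <dataset name="cmip6" ID="testdata/cmip6">
    <dataset name="sic.nc" ID="testdata/cmip6/sic.nc" urlPath="testdata/cmip6/sic.nc"/>
  </dataset>
</catalog>
"""


def access_urls_from_catalog(catalog_url: str, catalog_xml: str) -> dict:
    """Map each dataset urlPath to its {service type: URL} dictionary."""
    root = ET.fromstring(catalog_xml)
    # Compound services only group other services, so skip them.
    services = {
        node.get("serviceType"): node.get("base")
        for node in root.iter(f"{{{NS}}}service")
        if node.get("serviceType") != "Compound"
    }
    result = {}
    for node in root.iter(f"{{{NS}}}dataset"):
        url_path = node.get("urlPath")
        if url_path is None:  # container dataset, nothing to access
            continue
        result[url_path] = {
            stype: urljoin(catalog_url, base) + url_path
            for stype, base in services.items()
        }
    return result


result = access_urls_from_catalog(
    "https://example.org/thredds/catalog/testdata/cmip6/catalog.xml", SAMPLE_CATALOG
)
ncml_link = result["testdata/cmip6/sic.nc"]["NCML"]
```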

@dchandan
Collaborator

dchandan commented Nov 16, 2023

I might be getting confused with all the ncml/thredds stuff, but I think that's what we are currently doing...

@huard
Collaborator Author

huard commented Nov 16, 2023

Yes, the only difference is just that I'd be bypassing siphon and putting all the logic in the same function.

This would let us test STAC item creation independently from the THREDDSLoader.

@dchandan
Collaborator

Okay, but I don't think it's possible to bypass siphon for the reason that if the THREDDS server is old (https://psl.noaa.gov/thredds/catalog/Datasets/catalog.html) then without siphon, we can't interrogate the access URLs of the item.

@huard
Collaborator Author

huard commented Nov 16, 2023

Replace the html with xml in the catalog URL (catalog.html becomes catalog.xml).
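A minimal sketch of that substitution, applied to the NOAA catalog URL mentioned above (the helper name catalog_xml_url is hypothetical):

```python
def catalog_xml_url(catalog_url: str) -> str:
    """Return the XML variant of a THREDDS catalog URL ending in .html."""
    suffix = ".html"
    if catalog_url.endswith(suffix):
        return catalog_url[: -len(suffix)] + ".xml"
    return catalog_url  # already XML, or not a catalog page


xml_url = catalog_xml_url("https://psl.noaa.gov/thredds/catalog/Datasets/catalog.html")
```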

@huard
Collaborator Author

huard commented Nov 16, 2023

In test_standalone_stac_item, we would be able to replace

    thredds_url = "https://pavics.ouranos.ca/twitcher/ows/proxy/thredds"
    thredds_path = "birdhouse/testdata/xclim/cmip6"
    thredds_nc = "sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
    thredds_catalog = f"{thredds_url}/catalog/{thredds_path}/catalog.html"
    thredds_ds = f"{thredds_path}/{thredds_nc}"
    thredds_ncml_url = (
        f"{thredds_url}/ncml/{thredds_path}/{thredds_nc}"
        f"?catalog={quote_none_safe(thredds_catalog)}&dataset={quote_none_safe(thredds_ds)}"
    )

    # FIXME: avoid hackish workarounds
    data = requests.get(thredds_ncml_url).text
    attrs = xncml.Dataset.from_text(data).to_cf_dict()
    attrs["access_urls"] = {  # FIXME: all following should be automatically added, but they are not!
        "HTTPServer": f"{thredds_url}/fileServer/{thredds_path}/{thredds_nc}",
        "OPENDAP": f"{thredds_url}/dodsC/{thredds_path}/{thredds_nc}",
        "WCS": f"{thredds_url}/wcs/{thredds_path}/{thredds_nc}?service=WCS&version=1.0.0&request=GetCapabilities",
        "WMS": f"{thredds_url}/wms/{thredds_path}/{thredds_nc}?service=WMS&version=1.3.0&request=GetCapabilities",
        "NetcdfSubset": f"{thredds_url}/ncss/{thredds_path}/{thredds_nc}/dataset.html",
    }

by
    attrs = ncattrs(url)
