[Roadmap]: Gaia Catalog #365

Open
7 tasks
tibbben opened this issue Nov 15, 2024 · 11 comments
Assignees
Labels
Roadmap Project component roadmaps

Comments

@tibbben
Collaborator

tibbben commented Nov 15, 2024

Metadata federation and catalog interoperability:

  • Create tooling that uses science-on-schema.org JSON-LD metadata representations to ingest metadata into the GAIA catalog from various sources.
  • Define workflow for assigning concept IDs and OHDSI controlled vocabulary terms to datasets in the GAIA catalog (variable level assignment within datasets as appropriate).
    • Collaborate with Gaia and Vocabulary working groups
  • Develop workflow for creating attr_spec and geom_spec metadata elements.
    • Collaborate with Gaia working group

Intuitive interface to create catalog metadata entries:

Intuitive interface to explore the catalog and select data for use:

  • Create visual data exploration tools.
  • Implement shopping cart style data selection -> then ETL on confirmation.
  • Leverage variable and dataset level concept ids and OHDSI controlled vocabulary terms for cohort definition.
@tibbben tibbben added the Roadmap Project component roadmaps label Nov 15, 2024
@fils

fils commented Nov 21, 2024

  • define workflow for assigning concept IDs and other OHDSI vocabulary terms to catalog items

Is this entity resolution? Are these IDs in a controlled vocabulary or ontology? If so we could
look at things like entity resolution using Relik or Gliner

  • evaluate json-ld model for cross walking data between catalogs

By "cross walking between catalogs" do you mean linking? If so, might that just be using PIDs and
shared instances of types to allow that to be discovered in query space?

  • incorporate metadata element for external ids for federation and sharing with other catalogs

More PIDs? Or is this trying to assign dereferenceable IRIs for resources in the catalog?

  • develop workflow for creating attr_spec and geom_spec metadata elements

Not sure what this is, need more.

  • develop tool to facilitate metadata entry

Tooling UI stuff like this is hard. An easy entry point is to provide spreadsheets with
drop-down lists (where possible) and then use those to generate the serialization. One could try
forms too. If there were a large set of available coding resources and the desire, one could look
at things like https://github.com/mlcommons/croissant/tree/main/editor for inspiration.
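
As a rough illustration of the spreadsheet route, here is a minimal sketch that reads rows exported as CSV and emits a schema.org Dataset with one PropertyValue per row. The column names, the sample rows, and the `rows_to_dataset` helper are all assumptions made up for this example, not an agreed template.

```python
import csv
import io
import json

# Hypothetical spreadsheet export: one row per variable.
# Column names and values are invented for illustration.
SHEET = """name,description,unit
pm25_mean,Mean annual PM2.5 concentration,ug/m3
svi_overall,Overall social vulnerability index,
"""


def rows_to_dataset(sheet_text, dataset_name):
    """Turn spreadsheet rows into a schema.org Dataset with variableMeasured entries."""
    variables = []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        variable = {
            "@type": "PropertyValue",
            "name": row["name"],
            "description": row["description"],
        }
        if row["unit"]:
            # Leave the unit out rather than emit an empty string.
            variable["unitText"] = row["unit"]
        variables.append(variable)
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset_name,
        "variableMeasured": variables,
    }


entry = rows_to_dataset(SHEET, "Example dataset")
print(json.dumps(entry, indent=2))
```

Drop-down lists in the spreadsheet would constrain what lands in columns like `unit`, so the script itself can stay this simple.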

  • visual data exploration tools

Again, this is a big project. I'd recommend looking first to things like
https://github.com/gleanerio/archetype/blob/master/docs/tooling.md#network-visualization
and generating products from OHDSI-GIS that work with those, for exploring the graph as a network.
Note, tools like QLever also have
support for GeoSPARQL and can display maps leveraging OpenStreetMap. So you could combine graph queries with
a GIS connection there.

  • leverage variable and layer level concept ids for cohort definition

Out of my wheelhouse ...

  • shopping cart style data selection -> then ETL on confirmation

Shopping carts are hard too. Could you provide file serialization options as a parameter to
the download URLs and let people collect and use those themselves?
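
A minimal sketch of what "serialization as a URL parameter" could look like. The base URL, the `id` and `format` parameter names, and the `download_url` helper are hypothetical; no such endpoint exists in the catalog today.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names, used only for illustration.
BASE_URL = "https://example.org/catalog/download"


def download_url(dataset_id, fmt):
    """Build a download link that requests one serialization of a dataset."""
    return f"{BASE_URL}?{urlencode({'id': dataset_id, 'format': fmt})}"


# One link per serialization; a user can collect these instead of filling a cart.
links = [
    download_url("global_pm25_concentration_1998_2016", fmt)
    for fmt in ("csv", "geojson")
]
for link in links:
    print(link)
```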

@fils

fils commented Nov 21, 2024

Made an image for this discussion at https://github.com/fils/OHDSI-GIS-Metadata-Mapping/tree/master/docs/workflow

image

@rtmill rtmill moved this to 🏃‍♀ In Progress in GIS Project Management Dec 6, 2024
@jaygee-on-github
Collaborator

jaygee-on-github commented Jan 17, 2025

For the JSON-LD representation in Doug's workflow we agreed to follow the ESIPFed science-on-schema.org Dataset guide.

It supports describing the variables with machine-understandable vocabularies. Depending on the variable type, it supports other properties and their vocabularies, like "unitCode".

We could have settled on a different schema.org Dataset profile like the Croissant extension. Repositories like HuggingFace and Kaggle support discovery using Croissant metadata that has been married to a set of datasets in one of these repositories, creating a data zone of so-called "Croissant datasets". However, when it comes to discovery with GaiaCatalog, this would not have been fit for purpose.

Recall that the GaiaCatalog doesn't contain any data, just metadata. Data and metadata are only married downstream in a staging database called the Gaia "backbone". This backbone does NOT serve as a repository in the Gaia workflow. Instead its purpose is to create a product -- records in a new OMOP CDM table called external_exposure. There is an ML framework in the OHDSI ecosystem but it runs on top of the OMOP CDM, not "Croissant datasets".

Image

@jaygee-on-github
Collaborator

@kzollove , @rtmill , @fils, @tibbben, @diatomsRcool, @AEW0330 would this be a more complete way to describe dataset access in a catalog entry?

https://github.com/ESIPFed/science-on-schema.org/blob/main/guides/Dataset.md#distributions

This is another feature of the science-on-schema.org profile of a dataset that we are recommending.
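
To make the question concrete, here is a small sketch of a catalog entry whose `distribution` lists one `DataDownload` per available file, which is my reading of the guide's distributions section. The dataset name and the example.org URLs are placeholders, and `data_download` is a helper invented for this example.

```python
import json


def data_download(content_url, encoding_format):
    """Describe one downloadable file as a schema.org DataDownload."""
    return {
        "@type": "DataDownload",
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
    }


# Placeholder name and URLs; only the structure is the point here.
entry = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "distribution": [
        data_download("https://example.org/data/example.csv", "text/csv"),
        data_download("https://example.org/data/example.geojson", "application/geo+json"),
    ],
}

print(json.dumps(entry, indent=2))
```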

When it comes to variables, following the ESIPFed profile, if the variable data type is numeric, we may want to include schema.measurementTechnique. In addition to text and a url, schema.measurementTechnique can take a StatisticalVariable. A StatisticalVariable can capture the nuances of an observation like the ones that Charles and Lara capture in Amadeus. @kzollove it may be something we can capture downstream in the OMOP external_exposure table in your exposure_relationship_concept_id which was intended to capture "the spatiotemporal join between exposure and location in a little more detail".

@kzollove
Collaborator

@jaygee-on-github As of now, there are not really any OMOP tables that attempt to capture provenance of a measurement. Would you be thinking of this information being captured in an upstream GaiaCatalog and then "translated" to a vocabulary term downstream?

There are examples from the measurement table and domain of preserving information about measurement techniques. They often rely on LOINC or CPT4 codes for standard concepts. You can see some examples in these search results.

This is sort of how I imagined an exposure_relationship_concept working. I was thinking mostly about the spatiotemporal join information like you mentioned, but I could see it (or another concept) containing measurement technique information from the source data.

@jaygee-on-github
Collaborator

@kzollove, @diatomsRcool it would be great if Charles and Lara already had a vocabulary for measurement techniques. It would be for aggregates what CDEs are for observations. I think Charles had an interest in a vocabulary of StatisticalVariables for this reason. In Amadeus I think Lara has such a field, but it may not be structured. StatisticalVariable can provide structured metadata for a measurement technique.

@jaygee-on-github
Collaborator

@kzollove @tibbben @fils @AEW0330 @diatomsRcool: Following the science-on-schema.org Dataset profile, Doug created a catalog entry from one of Tim's GDSC catalog entry examples. This example is a set of land parcels and their characteristics. The entry is a little sketchy. Also, at the variable level it includes two ways of describing a variable. In one, the variable description takes a "PropertyValue". In the other, it takes a "StatisticalVariable". Each way provides a "definition of the calculated geospatial exposure metric". The "PropertyValue" path supports simple definitions. The "StatisticalVariable" path supports complex definitions. It is possible to mix and match. Doug calls this example the "0th order draft". Think of Doug's example as a template we can continue to work on using Tim's example. Alternatively, we can use it as a guide which we can fill in with other examples.

Tim has provided a second example from the GDSC catalog. It is a PM2.5 example.

We propose to have a couple of graduate students fill in the GaiaCatalog entry "template" Doug has proposed first with the PM2.5 example and then with one or more other examples including an SDoH one.

Something that Tim covers in a GDSC catalog entry that is not supported in Doug's template at the moment is the provenance of the dataset that is the subject of the catalog entry. We would like to defer support for provenance in the template right now.

Additionally, we are contemplating writing a paper about the GaiaCatalog as we use it more with geospatial datasets.

@jaygee-on-github
Collaborator

@fils Doug, would you agree with the following:

A variableMeasured in a science-on-schema.org Dataset can take a StatisticalVariable when it is an aggregate. Alternatively, an aggregate can take a PropertyValue just like an observation or measurement does.
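
A sketch of the two shapes side by side may help frame the question. The variable names, descriptions, and the `statType` value below are illustrative assumptions, not taken from Doug's draft entry; the property names are schema.org terms.

```python
import json

# Simple path: a variable described with a PropertyValue.
simple_variable = {
    "@type": "PropertyValue",
    "name": "pm25",
    "description": "PM2.5 concentration",
    "unitText": "micrograms per cubic metre",
}

# Aggregate path: the same quantity described as a StatisticalVariable.
# All values are placeholders chosen for illustration.
aggregate_variable = {
    "@type": "StatisticalVariable",
    "name": "Mean annual PM2.5 concentration",
    "measuredProperty": {"@type": "Property", "name": "PM2.5 concentration"},
    "statType": "mean",
    "measurementTechnique": "Placeholder description of how the aggregate was derived",
}

# Mixing and matching: both shapes can sit in one variableMeasured list.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "variableMeasured": [simple_variable, aggregate_variable],
}

print(json.dumps(dataset, indent=2))
```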

The Croissant Dataset profile is an alternative to the science-on-schema.org Dataset profile when the datapoints in a Dataset are not aggregates. Recall that in the Croissant profile a Dataset distribution takes a cr:FileObject or a cr:FileSet, and a cr:FileObject can be the source of cr:RecordSet which contains one or more cr:Field. Arguably a cr:Field in Croissant does not host aggregates like a variable in the science-on-schema.org Dataset profile can.

@tibbben
Collaborator Author

tibbben commented Feb 10, 2025

Perhaps useful as examples:

The GDSC catalog on Box (for this, focus on the "attributes" and "table_name" columns):
https://miami.box.com/s/cpe136whxprafac9ssvkig74ju4o2x7m

The SOLR endpoint (global pm 2.5 as a fairly complete example):

https://gdsc.idsc.miami.edu/solr/solr/dcat/select?fl=gdsc_attributes&indent=true&q.op=OR&q=gdsc_tablename%3Aglobal_pm25_concentration_1998_2016&wt=python

note: the gdsc_tablename can be edited with any table_name from the box file, for example the following is for table name "fl_2020_svi_county" (a less complete example):

https://gdsc.idsc.miami.edu/solr/solr/dcat/select?fl=gdsc_attributes&indent=true&q.op=OR&q=gdsc_tablename%3Afl_2020_svi_county&wt=python
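
For anyone scripting this, a minimal sketch that rebuilds the same query URL for any table name. The parameters are copied from the URLs above; the `attributes_query_url` helper name is made up for this example, and the sketch only builds the URL without calling the endpoint.

```python
from urllib.parse import urlencode

SOLR_SELECT = "https://gdsc.idsc.miami.edu/solr/solr/dcat/select"


def attributes_query_url(table_name):
    """Build the Solr select URL that returns gdsc_attributes for one table."""
    params = {
        "fl": "gdsc_attributes",
        "indent": "true",
        "q.op": "OR",
        "q": f"gdsc_tablename:{table_name}",  # urlencode escapes ':' as %3A
        "wt": "python",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = attributes_query_url("fl_2020_svi_county")
print(url)
```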

NOTE currently the attribute metadata is serialized as (this is a draft):
name;description;source;type;unit;unit_concept_id;start_date;end_date;concept_id;external_id
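
A minimal sketch of reading one attribute in that draft serialization into a dictionary. The sample string is invented, and the sketch assumes that no field value contains a semicolon and that empty fields mean "not provided".

```python
# Field order taken from the draft serialization above.
FIELDS = (
    "name;description;source;type;unit;unit_concept_id;"
    "start_date;end_date;concept_id;external_id"
).split(";")


def parse_attribute(serialized):
    """Split one serialized attribute into a field -> value mapping."""
    values = serialized.split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # Assumption: an empty field means the value was not provided.
    return {field: (value or None) for field, value in zip(FIELDS, values)}


# Made-up sample value, not copied from the catalog.
sample = "pm25;Mean PM2.5 concentration;;float;ug/m3;;1998-01-01;2016-12-31;;"
attribute = parse_attribute(sample)
print(attribute)
```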

And you can explore the metadata and some of the geographies at:
https://gdsc.idsc.miami.edu/detail/global_pm25_concentration_1998_2016?collection=all&active=None&query=pm2.5.

or simply:
https://gdsc.idsc.miami.edu/

@jaygee-on-github
Collaborator

Using Tim's examples, this week we launched an activity staffed by a couple of APHRC graduate students that will use the template Doug developed. Recall it is based on the science-on-schema.org Dataset profile. However, the GDSC and other catalogs support provenance. With this in mind, in the future we will be considering another Dataset profile Doug has contributed to: Croissant. We are also mindful of the RO-Crate work that Kyle is exploring.

@jaygee-on-github
Collaborator

@tibbben, @fils, @p-talapova, @kzollove: Have we thought about how we might select concepts for the variables we include while fulfilling the variable part of our catalog entry template?

I am thinking we can't use ATHENA because the new vocabularies and their concepts are in a delta potentially filled with so-called community vocabularies.

@fils: What would be the level of effort to throw these vocabularies into a graph?

And a different question: @fils: could we perform any validation on the structures of these vocabularies if we did graph them?
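
To make the validation question less abstract, here is a sketch of two structural checks one could run once a vocabulary is loaded as a graph: parents that are referenced but never defined, and cycles in the "is a" hierarchy. It uses a plain Python dictionary as a stand-in for a real graph store, and the concept names are invented, not OHDSI or community vocabulary terms.

```python
# Hypothetical vocabulary: each concept maps to its parent concepts ("is a" links).
PARENTS = {
    "exposure": [],
    "air_pollutant": ["exposure"],
    "pm25": ["air_pollutant"],
    "ozone": ["air_polutant"],  # deliberate typo to show a dangling reference
}


def dangling_parents(parents):
    """Return parent names that are referenced but never defined as concepts."""
    return sorted({p for ps in parents.values() for p in ps if p not in parents})


def has_cycle(parents):
    """Depth-first search that reports whether the hierarchy loops back on itself."""
    state = {}

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "active":
            return True  # reached a node that is still on the current path
        state[node] = "active"
        for parent in parents.get(node, []):
            if visit(parent):
                return True
        state[node] = "done"
        return False

    return any(visit(node) for node in parents)


print(dangling_parents(PARENTS))
print(has_cycle(PARENTS))
```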

Finally, @fils, tell me if these questions are too sketchy to answer?

Projects
Status: 🏃‍♀ In Progress

4 participants