initial WIS 2.0 metadata/search brainstorming/ideas #1

Open
tomkralidis opened this issue Feb 6, 2021 · 14 comments
Labels: ideas

Comments

@tomkralidis (Collaborator)

@wmo-im/tt-wismd / @wmo-im/tt-wigosmd: in relation to WIS 2.0 and the metadata search demonstration project, notes from an initial discussion with @6a6d74 (2020-12-15).

Note that these are initial ideas only for discussion with ET-Metadata. Please review and provide your thoughts and perspectives here, thanks.

Drivers

  • lower the barrier to entry
  • FAIR data principles
  • Web architecture/hypermedia
  • webby/of the web
  • search engine friendly

Metadata Standards

  • WIS and WIGOS metadata
    • linkage between a dataset and the platform that generated/collected the data
    • a discovery metadata record should be able to reference a WIGOS metadata record (in OSCAR)
  • DCAT2: dataset+multiple realizations
    • unique identifiers are first class
    • consider community standards

Harvesting

  • suppliers provide URLs to metadata
  • harvest a set of metadata terms out of that record, from a set of known formats (adapter pattern; see the sketch after this list)
  • core tooling for data providers for converting their bespoke metadata into recognized formats if needed
    • data providers can contribute their converter to tooling (core+extension/plugin)
  • what is the machinery to harvest/push/pull records to a GISC destination
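
A minimal sketch of the adapter-pattern idea above: the parser functions and the common terms they return are hypothetical placeholders for discussion, not an agreed profile.

# Hypothetical adapter-pattern sketch: one parser per known metadata format,
# each returning the same small set of common discovery terms.
import urllib.request

def parse_iso19139(text):
    # a real adapter would parse the ISO/WCMP XML here
    return {"identifier": None, "title": None, "keywords": [], "links": []}

def parse_dcat(text):
    # a real adapter would parse the DCAT2 JSON-LD here
    return {"identifier": None, "title": None, "keywords": [], "links": []}

ADAPTERS = {"iso19139": parse_iso19139, "dcat": parse_dcat}

def harvest(url, fmt):
    """Fetch a provider-supplied metadata URL and extract the common terms."""
    raw = urllib.request.urlopen(url).read().decode("utf-8")
    return ADAPTERS[fmt](raw)

A data provider could then contribute a new parse_* function as a plugin without touching the core harvesting machinery.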

Catalogue options

The browser as the catalogue

  • i.e. the search engine in the browser acts as the catalogue
  • WIS catalogue is NOT a primary search endpoint
  • probably doesn't need to be duplicated in each GISC
  • harvest from closest point to authoritative source
  • Structured data
    • e.g. Google Dataset search
  • schema.org annotations

Definitive WIS catalogue

  • People don't trust search engines
  • provide a vanilla search experience, without the "value add" from search engines that prioritizes or promotes various things
  • to assert the definitive list [authoritative data] as recognized by WMO
    • approved by PRs
    • quality statement
    • use this 'quality statement' to identify the quality/authority of datasets and enable search engines to see what is official
  • Searching from applications (e.g. GIS Desktop, QGIS, ArcGIS)
    • sensible for WIS Catalogue to provide an API
    • need to consider performance/availability
  • Metadata in the WIS Catalogue
    • WIS Catalogue only holds the smallest amount of metadata needed
    • refer back to the original metadata for the full description
    • meta-metadata, with link back to full metadata record
    • example (see the sketch after this list):
      • identifier
      • type
      • title
      • abstract
      • keywords
      • extents
      • links
      • license
      • provenance
      • schema.org annotations
  • availability/uptime considerations
    • operational? 24x7?
    • number of instances? Synchronization? Or harvest metadata direct from source
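
As an illustration of the 'meta-metadata' idea above, a record along these lines might be enough; all identifiers, URLs and field values below are made up for the example, and the "describedby" link stands for the link back to the full metadata record at the source.

# Illustrative only: a minimal catalogue record with just enough for discovery
# and a link back to the full metadata record at the authoritative source.
record = {
    "identifier": "urn:x-wmo:md:example::dataset-123",   # hypothetical identifier
    "type": "dataset",
    "title": "Example surface observations",
    "abstract": "Short description; the full detail lives in the source record.",
    "keywords": ["surface", "observations"],
    "extents": {"spatial": [-180, -90, 180, 90],
                "temporal": ["2021-01-01", None]},
    "links": [
        {"rel": "describedby",                            # link back to full metadata
         "href": "https://example.org/metadata/dataset-123.xml"},
        {"rel": "data", "href": "https://example.org/data/dataset-123/"},
    ],
    "license": "https://example.org/licence",
    "provenance": "harvested from example.org, 2021-02-06",
}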

Guidance and support to members

  • guidance is needed for NCs and DCPCs to make their data searchable on Google, for example
    • e.g. publishing a schema.org record (see the sketch after this list)
    • tools for transformation/migration from WCMP
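
For instance, a schema.org Dataset annotation could be embedded in a dataset landing page as JSON-LD so that Google Dataset Search and other engines can index it. A rough sketch with placeholder values (not a recommended WMO profile):

import json

# Placeholder values only; the point is the basic schema.org/Dataset shape.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example surface observations",
    "description": "Hourly surface observations published by an example NC/DCPC.",
    "url": "https://example.org/datasets/surface-obs",
    "keywords": ["weather", "surface observations"],
    "distribution": [{"@type": "DataDownload",
                      "contentUrl": "https://example.org/data/surface-obs.csv",
                      "encodingFormat": "text/csv"}],
}

# Embed this in the landing page HTML so search engines can pick it up.
print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print('</script>')
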
tomkralidis added the ideas label Feb 8, 2021
@tomkralidis (Collaborator Author) commented Feb 15, 2021

Further discussion with @efucile (2021-02-15) (cc @petersilva)

@petersilva commented Feb 16, 2021

https://github.com/wmo-im/GTStoWIS2#conventions (better to use the shared repo than my personal repo).

As discussed with @tomkralidis: the tables from WMO 386 Attachment II-5 are in the GTStoWIS2 folder in JSON format, and are chained together. Somebody should be able to string the tables together to produce one big table of all possible topics. I remember @antje-s doing something akin to that, but it resulted in impractically large tables. I think it would have to be done with a keen appreciation for how all the tables link together; perhaps it is not so large then.

@petersilva

I just went onto my server behind my experimental prototype ( https://hpfx.collab.science.gc.ca/~pas037/WMO_Sketch )
and did:


pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$ find WIS -type d  | wc -l
8929
pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$

for most countries, the hierarchy in a given hour is relatively simple...
Here is what the topic hierarchy for Italy at 8Z looks like on my prototype (using the GTStoWIS2 module from the repo):



WIS/it
WIS/it/roma_met_com_centre
WIS/it/roma_met_com_centre/surface
WIS/it/roma_met_com_centre/surface/aviation
WIS/it/roma_met_com_centre/surface/aviation/metar
WIS/it/roma_met_com_centre/surface/aviation/metar/it
WIS/it/roma_met_com_centre/surface/aviation/speci
WIS/it/roma_met_com_centre/surface/aviation/speci/it
WIS/it/roma_met_com_centre/observation
WIS/it/roma_met_com_centre/observation/surface
WIS/it/roma_met_com_centre/observation/surface/land
WIS/it/roma_met_com_centre/observation/surface/land/fixed
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard/0-90n
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard/0-90n/90e-0
WIS/it/roma_met_com_centre/forecast
WIS/it/roma_met_com_centre/forecast/aviation
WIS/it/roma_met_com_centre/forecast/aviation/taf
WIS/it/roma_met_com_centre/forecast/aviation/taf/under12hours
WIS/it/roma_met_com_centre/forecast/aviation/taf/under12hours/it

My prototype feed is from UNIDATA, so it is heavily biased towards US data. When I look at the tree:

pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$ find WIS -type d  | grep WIS/us | wc -l
5967

There is a lot of:


WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/06h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/12h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/24h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/03h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/27h

WIS/us/KWEC/model-regional/wave/0-90n/0-90w
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/24h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/27h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/09h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/12h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/18h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/21h

One of the things we were debating is whether it makes sense to have the prediction hour in the topic hierarchy.
How often will people really want only the 21h forecast from a given product? We were thinking that we could eliminate a lot of topics if we just removed the prediction hour from the topic tree (wmo-im/GTStoWIS2#5). It would shrink the tree by putting all forecast hours under the geographical topic, leaving the hour that differentiates them to the file name.

I think this is more practical, but it runs counter to the idea of metadata being "very granular"... I think the temporal information is too granular for inclusion in the topic tree, but I would appreciate other views.
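
To make the trade-off concrete, a small sketch (a hypothetical helper, not part of GTStoWIS2) that strips a trailing forecast-hour level such as 06h from a topic; the hour would then only appear in the file name:

import re

topics = [
    "WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/06h",
    "WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/12h",
    "WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/24h",
    "WIS/us/KWEC/model-regional/wave/0-90n/0-90w/24h",
    "WIS/us/KWEC/model-regional/wave/0-90n/0-90w/27h",
]

def drop_hour(topic):
    # remove a trailing level like "06h"; the hour moves into the file name instead
    return re.sub(r"/\d{2}h$", "", topic)

print(len(set(topics)), "topics ->", len({drop_hour(t) for t in topics}))   # 5 topics -> 2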

@josusky commented Feb 17, 2021

I agree with removing the forecast hour from the tree. It seems more practical to leave any forecast-hour filtering, if needed, to the client's discretion. In fact, NWP model data are a good candidate for distribution through a service. Once that happens, instead of sending 100s (1000s) of notifications for each model run (roughly 100s/1000s of new files), your system will send one notification saying that the service has a new run available. For now, those who want just a subset of forecast hours will discard 100s/1000s of small notifications a few times per day.

@petersilva

Note... the number of notifications is not changed... it is just that all of the outputs will be under the same topic, with different file names. You would subscribe to KWEC/model-regional/0-90n/0-90w (aka the Atlantic Ocean above the equator-ish...) and there would be a file for each hour published under the same topic. For what it is worth, in operational forecasting the 6 hour forecast is available before the 12 hour, then the 18 hour, etc., so one announcement for the entire run would be unsuitable for real-time use, as it could delay transmission by up to an hour or so (I don't know about other countries, but in Canada the "regional" run (adaptive grid over North America) takes about 45 minutes, and the global run (analogous to ECMWF guidance) around 90 minutes). There are also more localized grids (e.g. HRDPS) with performance profiles similar to the regional.

@josusky commented Feb 19, 2021

Sorry, I did not check how this particular model is distributed. If it is one file per forecast hour, then it is fine. My point was that some models produce hundreds of files per run. It is not a problem to filter out such a number of notifications on the client, but it would still be nicer if the service just sent a notification when some new logical set of data becomes available. But we have diverged.
The case of the forecast hour can be declared closed.

@josusky commented Feb 19, 2021

Concerning the granularity mentioned by @tomkralidis: in WIS 1.0 we have one metadata record per GTS bulletin, but many of them logically belong to the same category, e.g. surface observations from a certain part of the world. So in the proposed topic hierarchy they will naturally form sub-trees and subscribing will be easier.

@petersilva

I think this is intimately related to: wmo-im/GTStoWIS2#9

@tomkralidis (Collaborator Author)

Thanks @josusky. IMO we want granularity at a higher level, so the WIS 2.0 catalogue does not become a bulletin search API but rather a 'yellow pages' from which one can find/bind accordingly.

@tomkralidis (Collaborator Author)

(Quoting @petersilva above:) As discussed with @tomkralidis: the tables from WMO 386 Attachment II-5 are in the GTStoWIS2 folder in JSON format, and are chained together. Somebody should be able to string the tables together to produce one big table of all possible topics. I remember @antje-s doing something akin to that, but it resulted in impractically large tables. I think it would have to be done with a keen appreciation for how all the tables link together; perhaps it is not so large then.

If a generated 'supertable' is too large, can we describe the tables in question (C1, C2, C3, C6, C7, etc.) and their relationships? Perhaps this is described at https://github.com/wmo-im/GTStoWIS2#conventions?

@petersilva

Summary of the table linkages from WMO 386 Volume I Attachment II-5 (a toy decode sketch follows this list):

Table A  : Data type designator T1 (matrix table for T2A1A2ii definitions)
Table B1 : Data type designator T2 (when T1 = A, C, F, N, S, T, U or W)
Table B2 : Data type designator T2 (when T1 = D, G, H, X or Y)
Table B3 : Data type designator T2 (when T1 = I or J)
Table B4 : Data type designator T2 (when T1 = O)
Table B5 : Data type designator T2 (when T1 = E)
Table B6 : Data type designator T2 (when T1 = P or Q)
Table C1 : Geographical designators A1A2 for use in abbreviated headings T1T2A1A2ii CCCC YYGGgg for bulletins containing meteorological information, excluding ships' weather reports and oceanographic data
Table C2 : Geographical designators A1A2 for use in abbreviated headings T1T2A1A2ii CCCC YYGGgg for bulletins containing ships' weather reports and oceanographic data, including reports from automatic marine stations
Table C3 : Geographical area designator A1 (when T1 = D, G, H, O, P, Q, T, X or Y) and geographical area designator A2 (when T1 = I or J)
Table C4 : Reference time designator A2 (when T1 = D, G, H, J, O, P or T)
Table C5 : Reference time designator A2 (when T1 = Q, X or Y)
Table C6 : Data type designator A1 (when T1 = I or J)
Table C7 : Data type designator T2 and A1 (when T1 = K)
Table D1 : Level designator ii (when T1 = O)
Table D2 : Level designator ii (when T1 = D, G, H, J, P, Q, X or Y)
Table D3 : Level designator ii (when T1T2 = FA or UA)
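
To illustrate how the chaining works, here is a toy decode of one abbreviated heading using a tiny hand-coded subset of these tables; the subset and the topic wording are illustrative only, not the actual GTStoWIS2 tables.

# Toy subset of the designator tables, for illustration only.
TABLE_A  = {"S": "surface"}                   # T1 -> data category
TABLE_B1 = {("S", "A"): "aviation/metar"}     # (T1, T2) -> data type
TABLE_C1 = {"IT": "it"}                       # A1A2 -> geographical designator

def decode(ttaaii):
    t1, t2, a1a2 = ttaaii[0], ttaaii[1], ttaaii[2:4]
    return "/".join([TABLE_A[t1], TABLE_B1[(t1, t2)], TABLE_C1[a1a2]])

# The WIS/it/roma_met_com_centre/ prefix in the Italy example above would come
# from the CCCC origin code, not from these tables.
print(decode("SAIT31"))   # surface/aviation/metar/it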

If we drop hours, then Tables C4 and C5 disappear. How big is a supertable? In the GTStoWIS2 module, @antje-s has already merged all of the B tables into one TableB that is about 400 lines or so.
TableA could be merged into TableB for about 4*26=104 entries... so about 504 for a hypothetical TableAB.
C1 shows up 11 times in TableA, and has around 300 entries. ... so the table would add 33000 lines if C1 were included.
We don't currently use C2,... weird... might be a gap.
C3 has 28 entries and is present 11 times, so 308 entries.
C6 has 121 entries and shows up only twice, so 242 entries.
C7 has 91 entries and shows up only once.

So the total for a single recursive JSON array merging all the tables into one big one is: 504+33000+308+242+91 = 34135.
Round it off to 35000. A bit much for humans to understand, but you could just read all the existing table data into one big in-memory structure... TableTTAAii.json if you like.

Then there are 6000 known origin codes (CCCC), out of 15K known airports, that could originate such products in theory.
It ends up in the millions if you go there, so I guess we stop with just TTAAii in one table, and a second table for CCCC.
The origin code maps to the first two levels of the hierarchy (country/centre) and the TTAAii stuff is the rest of it.

@tomkralidis (Collaborator Author)

35K is tractable (this is the size of NASA GCMD, for example). Can we have a workflow that auto-generates the supertable from the smaller tables (I suppose it is easier to manage that way as well)?
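
A hedged sketch of what such an autogen step could look like, assuming each per-table JSON file is a flat object keyed by designator; the file names and layout here are assumptions, and the real GTStoWIS2 tables may be structured differently.

import glob
import json

supertable = {}
for path in sorted(glob.glob("Table*.json")):    # e.g. TableA.json, TableB.json, TableC1.json
    with open(path) as f:
        table = json.load(f)
    name = path.replace(".json", "")
    for key, value in table.items():
        supertable[f"{name}:{key}"] = value      # keep the source table in the key

with open("TableTTAAii.json", "w") as f:         # the merged 'supertable'
    json.dump(supertable, f, indent=2, sort_keys=True)

print(len(supertable), "entries written to TableTTAAii.json")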

@petersilva commented Feb 20, 2021

I made the code to do this in the issue009 branch of GTStoWIS2. You can clone and reproduce it... it's around 277 KB (only 17000 entries in the end... some math might have been wrong) with the tables in their current state. I had to add the D1 and D2 tables, which were missing. Also there are some cases where there is a comparison to do (ii < 49, for example) and only the threshold is included... so it might be wrong for those cases. It is unclear to me how it can be used for now.

@antje-s commented Feb 22, 2021

Some comments on the ideas from above...

Metadata Standards
- WIS and WIGOS metadata
  linkage between a dataset and the platform that generated/collected the data
--> a transfer link should be enough (the background architectures differ greatly)
- a discovery metadata record should be able to reference a WIGOS metadata record (in OSCAR)
--> WIGOS-IDs are included in the metadata as a place keyword, for example:
    gmd:descriptiveKeywords uuidref="place" /gmd:MD_Keywords/gmd:keyword/gco:CharacterString
    0-20000-0-44203, Rinchinlhumbe [http://oscar.wmo.int/OSCAR/wigosid=0-20000-0-44203]
    ...
--> the topic value could be part of the metadata record, e.g. under transfer options

Catalogue options
The browser as the catalogue
- WIS catalogue is NOT a primary search endpoint
--> if you do without your own search function, you also give up the possibility of your own search filters, search result displays and search variants; this should be considered
- schema.org annotations
--> good; if you connect to the commercial search engines, you can also search there limited to your own website, e.g. a www.google.de search for "meteogramm site:gisc.dwd.de"

Definitive WIS catalogue
- Metadata in the WIS Catalogue
-- WIS Catalogue only holds the smallest amount of metadata needed
-- refer back to the original metadata for the full description
-- meta-metadata, with link back to full metadata record

--> a new concept; it could be a solution to reduce the size of a single metadata record, but it does not solve the problem of metadata granularity. A grouping of the product metadata would be implemented by metadata on services, but there could also be different metadata for similar services, so an additional grouping of the metadata might be helpful (e.g. WCS services, messaging services, data retrieval, ...)
--> linking metadata back to the NC/DCPC would mean each NC (so e.g. also the weather service of Burkina Faso) would have to operate a catalogue
