Multi-word and multi-line keywords #100

Open
josusky opened this issue Apr 1, 2021 · 3 comments

@josusky
Contributor

josusky commented Apr 1, 2021

Summary and Purpose
The current version of KPI-6 verifies several aspects of keyword elements but not the actual keyword values. Preliminary tests with pywcmp show that there are records with keywords that are whole sentences or even span multiple lines.

Proposal

  1. Keywords containing a line break shall be considered invalid. The examples found look like lists of keywords that were incorrectly divided when the XML representation was created.
  2. Multi-word keywords (sic!) cannot be systematically banned, as it is common practice to put an observing station name as a geographic keyword. Further discussion is needed to find out whether single-word keywords could be a rule for other types.
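To make rule 1 concrete, here is a minimal sketch of what such a check could look like. The function name `keyword_has_line_break` is hypothetical and is not part of pywcmp; the sketch only assumes that the keyword text has already been extracted as a string.

```python
def keyword_has_line_break(value: str) -> bool:
    """Return True if the keyword text spans more than one line.

    Hypothetical helper for illustration; not an existing pywcmp function.
    """
    # Strip first, so that pretty-printed XML with a newline before the
    # closing tag is not flagged as a multi-line keyword.
    return len(value.strip().splitlines()) > 1


# A multi-line value is rejected, a multi-word value is still accepted.
assert keyword_has_line_break("Wind Velocity\nGeopotential Height")
assert not keyword_has_line_break("Some Observing Station Name")
```

Multi-word keywords deliberately pass this check, in line with rule 2.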

Reason
Keywords are key elements for data discoverability; therefore, we should make them reliable.

@amilan17 amilan17 added this to the INFCOM-2 milestone Apr 6, 2022
@amilan17 amilan17 added the kpi label Apr 6, 2022
@amilan17
Member

amilan17 commented Apr 6, 2022

@josusky - is this still an open issue?

@josusky
Contributor Author

josusky commented Apr 12, 2022

Yes, this is still an issue. For example, "urn:x-wmo:md:ca.gc.ec::1.1.1.3" contains the following definition of keywords:

              <gmd:descriptiveKeywords>
                <gmd:MD_Keywords>
                  <gmd:keyword>
                    <gco:CharacterString>Wind Velocity
Geopotential Height
Pressure Levels</gco:CharacterString>
                  </gmd:keyword>
                </gmd:MD_Keywords>
              </gmd:descriptiveKeywords>

Apparently, this should have been three keywords.
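As an illustration, the following sketch parses a snippet like the one above with the standard library and lists the individual keywords that the single multi-line value appears to contain. The namespace declarations are added here so the fragment parses on its own, and `split_candidates` is a hypothetical helper name, not something taken from pywcmp.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces used by gmd: and gco: prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# The fragment from the record, with namespace declarations added
# (assumption: in the real record they are declared on the root element).
SNIPPET = """<gmd:descriptiveKeywords
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>Wind Velocity
Geopotential Height
Pressure Levels</gco:CharacterString>
    </gmd:keyword>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>"""


def split_candidates(root):
    """For each keyword, return the non-empty lines found in its text."""
    result = []
    for node in root.findall(".//gmd:keyword/gco:CharacterString", NS):
        text = node.text or ""
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        result.append(lines)
    return result


candidates = split_candidates(ET.fromstring(SNIPPET))
for lines in candidates:
    if len(lines) > 1:
        print(f"multi-line keyword, {len(lines)} candidate keywords: {lines}")
```

The sketch only reports the candidates; whether a tool should split such values automatically or merely flag them is a separate decision.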
On the other hand, urn:x-wmo:md:int.wmo.wis::CUJD01OJAM (urn_x-wmo_md_int.wmo.wis__CUJD01OJAM.xml), for example, contains a definition of keywords that starts with an empty keyword (that has a thesaurus):

            <gmd:descriptiveKeywords>
                <gmd:MD_Keywords id="WMOCodeListKeywords">
                    <gmd:keyword gco:nilReason="unknown" />
                    <gmd:type>
                        <gmd:MD_KeywordTypeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/Codelist/gmxCodelists.xml#MD_KeywordTypeCode" codeListValue="theme">theme</gmd:MD_KeywordTypeCode>
                    </gmd:type>
                    <gmd:thesaurusName>
                        <gmd:CI_Citation id="WMOCodelist">
                            <gmd:title>
<gco:CharacterString>WMO_CodeList dictionary [http://wis.wmo.int/2010/metadata/version_1-2/WMOCodelists.xml#WMO_CategoryCode]</gco:CharacterString>
                            </gmd:title>
                            <gmd:date>

I have run the check on 49612 records (I am struggling to get it to run over the full catalogue) and found 274 problematic keyword entries. That is not a lot, but it is still an issue.
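For reference, a tally of this kind could be sketched roughly as follows. This is not the actual pywcmp check; `classify_keyword`, `count_problems` and the sample record are hypothetical, and the sketch simply treats a keyword with no text (such as one carrying only `gco:nilReason`) as empty.

```python
import xml.etree.ElementTree as ET
from collections import Counter

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"


def classify_keyword(keyword_elem):
    """Classify one gmd:keyword element as 'empty', 'multi-line' or 'ok'."""
    # itertext() collects text from any child (CharacterString, Anchor, ...),
    # so a keyword with only a nilReason attribute yields an empty string.
    text = "".join(keyword_elem.itertext()).strip()
    if not text:
        return "empty"
    if len(text.splitlines()) > 1:
        return "multi-line"
    return "ok"


def count_problems(xml_documents):
    """Tally keyword classifications over an iterable of XML strings."""
    counts = Counter()
    for doc in xml_documents:
        root = ET.fromstring(doc)
        for keyword in root.iter(f"{{{GMD}}}keyword"):
            counts[classify_keyword(keyword)] += 1
    return counts


# Made-up sample record combining both kinds of problem described above.
SAMPLE = f"""<gmd:MD_Keywords xmlns:gmd="{GMD}" xmlns:gco="{GCO}">
  <gmd:keyword gco:nilReason="unknown" />
  <gmd:keyword><gco:CharacterString>Wind Velocity
Geopotential Height</gco:CharacterString></gmd:keyword>
  <gmd:keyword><gco:CharacterString>Pressure Levels</gco:CharacterString></gmd:keyword>
</gmd:MD_Keywords>"""

counts = count_problems([SAMPLE])
print(dict(counts))
```

Reading the records lazily (for example from a generator over files) rather than loading the whole catalogue at once may help when running over a large number of records.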

@amilan17
Member

@josusky @tomkralidis -- I don't think we added these tests to the KPIs. Shall we close this issue and mark it as not done?
