
[verification] Need to review all IRIs #70

Closed
fcbr opened this issue Aug 21, 2015 · 20 comments
@fcbr
Member

fcbr commented Aug 21, 2015

We should probably keep them plain ASCII to avoid possible encoding issues between different systems and file formats.

@fcbr fcbr changed the title Need to review all IRIs [verification] Need to review all IRIs Aug 21, 2015
@fcbr fcbr added the data label Sep 18, 2015
@arademaker
Member

We now use + for words with spaces.

@arademaker
Member

arademaker commented Aug 20, 2021

In the LMF generation, we had to deal with the ID datatype of XML. See https://en.wikipedia.org/wiki/Document_type_definition and https://www.w3.org/TR/REC-xml/#id

So far, we haven't paid attention to that, and the URIs of words in the RDF files can contain / and accented characters. This must be made more robust.

@gdemelo had some links about how he implemented it in lexvo:

The Java API described at http://lexvo.org/linkeddata/tutorial.html includes a function to create such term URIs. The .jar file can be opened and includes source code in Java. Or you could use it in Clojure.

In Python, we could presumably use the standard library for URL encoding.

We expect words to end up encoded as below, for RDF and XML:

<https://w3id.org/own-pt/wn30-pt/instances/word-TCP%2FIP-n>
<LexicalEntry id="word-TCP%2FIP-n">
 <Lemma partOfSpeech="n" writtenForm="TCP/IP"/>
 <Sense id="own-pt-wordsense-06666486-n-3" synset="own-pt-synset-06666486-n"/>
</LexicalEntry>
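The standard library can indeed produce this percent-encoded form; a minimal sketch using urllib.parse.quote (note that safe="" is needed, since quote leaves "/" unencoded by default):

```python
from urllib.parse import quote

# Percent-encode a lemma for use in a word IRI.
# safe="" ensures "/" is encoded too (quote keeps it by default).
lemma = "TCP/IP"
print(f"word-{quote(lemma, safe='')}-n")  # word-TCP%2FIP-n
```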

related to #168

@arademaker arademaker added this to the pre release 1.0 milestone Aug 20, 2021
@arademaker
Member

Note that words may contain -; how would the encoding deal with that? One possible solution, to avoid confusion with the POS tag, is to transform the pattern word-ELS-pos into word-pos-ELS, where ELS is the "encoded lexical form".

@fredsonaguiar

Note that words may contain -; how would the encoding deal with that? One possible solution, to avoid confusion with the POS tag, is to transform the pattern word-ELS-pos into word-pos-ELS, where ELS is the "encoded lexical form".

I'd say that makes sense, but it might be confusing, for us and others, relative to the other URIs: Synsets and Senses follow the pattern class-name-pos, or similar. For now, the encoder/quoter keeps the '-' character in Word URIs.

@fredsonaguiar

About LMF LexicalEntry identifiers: since the character % isn't allowed in a DTD ID, we replace it with ::. So, for instance, instead of <LexicalEntry id="word-TCP%2FIP-n"> we get <LexicalEntry id="word-TCP::2FIP-n">.

About the valid DTD IDs, see https://www.w3.org/TR/REC-xml/#id or https://www.liquid-technologies.com/DTD/Datatypes/ID.aspx
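A minimal sketch of this replacement, taking the percent-encoded form from urllib.parse.quote as input (dtd_id is a hypothetical helper name, not the actual py-ownpt function):

```python
from urllib.parse import quote

def dtd_id(lemma, pos):
    # Percent-encode the lemma, then replace "%" (not allowed in a
    # DTD ID) with "::".
    return f"word-{quote(lemma, safe='')}-{pos}".replace("%", "::")

print(dtd_id("TCP/IP", "n"))  # word-TCP::2FIP-n
```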

@arademaker
Member

globalwordnet/schemas#55

Let us wait 1-2 days for some feedback.

@fredsonaguiar

fredsonaguiar commented Sep 15, 2021

As discussed in the issue referenced above, one of the more readable and natural ways to solve our problem is to quote the characters/forms that are not valid in xml:ID, e.g. a comma replaced with -cm- and an apostrophe replaced with -ap-. Whenever a name mapping is impossible, the final option is to fall back to hexadecimal encoding.

A first approach is to map characters to HTML entity names, which is a well-documented mapping. I found a first, smaller table of HTML entity names (Latin-1), manually scraped from https://cs.stanford.edu/people/miles/iso8859.html, removing the delimiters & and ; and considering only the first name. The rule was:

  1. replace spaces with underscores
  2. if CHAR is valid in a DTD ID, keep it
  3. if CHAR is invalid in a DTD ID, try to replace it with -NAME-
  4. if CHAR is invalid in a DTD ID and there is no name, replace it with -HEX-

This way we got some reasonable, expected IDs:

  • "(militar) ocupação": own-pt-word--lpar-militar-rpar-_ocupação-n
  • "Território do Norte": own-pt-word-Território_do_Norte-n
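The four-step rule above can be sketched as follows; NAMES is a small hypothetical subset of the scraped entity-name table, and the VALID character class is a simplification of the XML NameChar production:

```python
import re

# Hypothetical subset of the scraped entity-name table
NAMES = {"(": "lpar", ")": "rpar", ",": "cm", "'": "ap"}

# Simplified approximation of characters valid in a DTD ID
VALID = re.compile(r"[A-Za-z0-9._\u00C0-\u017F-]")

def quote_id(lemma):
    out = []
    for ch in lemma.replace(" ", "_"):   # 1. spaces -> underscores
        if VALID.match(ch):              # 2. valid chars pass through
            out.append(ch)
        elif ch in NAMES:                # 3. named replacement
            out.append(f"-{NAMES[ch]}-")
        else:                            # 4. hexadecimal (UTF-8) fallback
            out.append(f"-{ch.encode('utf-8').hex()}-")
    return "".join(out)

print(quote_id("(militar) ocupação"))  # -lpar-militar-rpar-_ocupação
```

With the small NAMES table above, the apostrophe in "vapor d’água" (a curly quote, not the straight apostrophe) falls through to the hex fallback, matching the e28099 form shown below.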

Only two Lexical Entries had characters replaced by Hexadecimal Forms:

    <LexicalEntry id="own-pt-word-vapor_d-e28099-água">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>
    <LexicalEntry id="own-pt-word-Sacro_Império_Romano-e28093-Germânico">
      <Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
      <Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
    </LexicalEntry>

@fredsonaguiar

fredsonaguiar commented Sep 15, 2021

Another possible approach, which avoids this scraping and uses a more robust table, is to use the Unicode entity names from HTML5, already implemented in https://docs.python.org/3/library/html.entities.html and based on https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references.

Note that many characters have multiple names; for instance, ℤ can be written as integers; or Zopf;. This is intentional, and a decision has to be made for our purposes: take the first name in lexicographical order, for instance.

Currently working on that.
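A sketch of that decision, inverting Python's html.entities.html5 table and keeping the lexicographically first name per character (names in the table may carry a trailing semicolon, which we strip for use inside IDs):

```python
from html.entities import html5

# Invert the name -> character table, keeping the lexicographically
# first name for each character.
char2name = {}
for name, char in sorted(html5.items()):
    key = name.rstrip(";")
    if char not in char2name:
        char2name[char] = key

print(char2name["\u2019"])  # CloseCurlyQuote (for the curly quote ’)
print(char2name["\u2013"])  # ndash (for the en dash –)
```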

@fredsonaguiar

In own-pt/py-ownpt@0cbe46b, we implemented the solution described above, obtaining the desired results, for instance:

    <LexicalEntry id="own-pt-word--lpar-de-rpar-_tempo_parcial-a">
      <Lemma partOfSpeech="a" writtenForm="(de) tempo parcial"/>
      <Sense id="own-pt-wordsense-01089369-a-1" synset="own-pt-synset-01089369-a"/>
    </LexicalEntry>
    ....
    <LexicalEntry id="own-pt-word-Sacro_Império_Romano-ndash-Germânico-n">
      <Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
      <Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
    </LexicalEntry>
    ...
    <LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

The resulting LMF was validated against the DTD.

@fredsonaguiar

In edfef02 we went back to the more readable Word URIs, without quoting. As a result, we got URIs such as:

<https://w3id.org/own-pt/wn30-pt/instances/word-&-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-(militar)+ocupação-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-Solda+de+estanho-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-digno+de+confiança-a> a wn30:Word ;

These were obtained after running the script https://github.com/own-pt/py-ownpt/blob/master/release.sh.

@arademaker
Member

Not sure I understood the last comment. Why use an underscore for spaces in the XML but a plus in the RDF? I would prefer uniform IDs for XML and RDF... What are the URIs in the RDF for own-pt-word-vapor_d-CloseCurlyQuote-água-n and own-pt-word--lpar-de-rpar-_tempo_parcial-a?

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

Maybe there was some misunderstanding. I've already changed the code to generate the ID from the URI instead of the Lemma. So, for instance:

<https://w3id.org/own-pt/wn30-pt/instances/word-vapor+d'água-n> a wn30:Word ;
    wn30:lemma "vapor d'água"@pt ;
    wn30:pos "n" .

This gives us the following XML ID:

    <LexicalEntry id="own-pt-word-vapor-plus-d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d'água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

Notice that the difference from before is that spaces were replaced by + in the URI, so the new IDs have -plus- instead of _.

@arademaker
Member

I was expecting

own-pt-word-vapor_d-CloseCurlyQuote-água-n

for both the URI in the RDF and the XML ID.

@arademaker
Member

Both should be based on the lemma, not the ID on the URI.

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

We did so in 225eba5 by running the updated scripts again. Then we got, for instance:

<https://w3id.org/own-pt/wn30-pt/instances/word-vapor_d-CloseCurlyQuote-água-n> a wn30:Word ;
    wn30:lemma "vapor d’água"@pt ;
    wn30:pos "n" .

This is translated to the XML element as follows:

    <LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

The only difference is that in the XML the LMF lexicon ID (in this case, own-pt) is prefixed to the IDs, replacing the RDF namespace, following the suggestion from globalwordnet/schemas#55 (comment):

To construct an ID, you can then:

  1. Replace disallowed ID characters with the dash-escape-dash patterns
  2. Prefix own-pt- (or some other lexicon ID followed by a dash)
  3. Append a dash and the part-of-speech
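The three steps amount to simple string assembly; a sketch with a hypothetical make_id helper, assuming the lemma has already been dash-escaped:

```python
def make_id(lexicon, escaped_lemma, pos):
    # escaped_lemma already has disallowed characters replaced by the
    # dash-escape-dash patterns; prefix the lexicon ID plus a dash and
    # append a dash plus the part of speech.
    return f"{lexicon}-word-{escaped_lemma}-{pos}"

print(make_id("own-pt", "vapor_d-CloseCurlyQuote-água", "n"))
# own-pt-word-vapor_d-CloseCurlyQuote-água-n
```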

@arademaker
Member

@prefix own-en: <https://w3id.org/own/own-en/instances/> .
@prefix own-pt: <https://w3id.org/own/own-pt/instances/> .
@prefix owns: <https://w3id.org/own/schema/> .

Words:

own-pt:word-Afloramento-n a wn30:Word ;
wn30:lemma "Afloramento"@pt ;
wn30:pos "n" .

own-en:word-run-v a wn30:Word ;
wn30:lemma "run"@en ;
wn30:otherForm "ran"@en,
"running"@en ;
wn30:pos "v" .

For wordsenses:

own-LL:wordsense-00001740-a-1

For synsets:

@prefix own-en: <https://w3id.org/own/own-en/instances/> .
@prefix own-pt: <https://w3id.org/own/own-pt/instances/> .
@prefix owns: <https://w3id.org/own/schema/> .

own-pt:synset-00001740-a a wn30:AdjectiveSynset ;
skos:inScheme <https://w3id.org/own/own-pt/instances/> ;
wn30:offset "00001740" ;
wn30:synsetId "00001740-a" .

own-en:synset-00001740-n a wn30:BaseConcept,
wn30:NounSynset ;
skos:inScheme <https://w3id.org/own/own-en/instances/> ;
wn30:gloss "that which is perceived or known or inferred to have its own distinct existence (living or nonliving)"@en ;
wn30:lexicographerFile "noun.Tops" ;
wn30:offset "00001740" ;
wn30:synsetId "00001740-n" .

https://w3id.org/own/own-pt/instances/word-abafamento-n
own-pt-word-abafamento-n

https://w3id.org/own/own-pt/instances/word-vapor_d-CloseCurlyQuote-água-n
own-pt-word-vapor_d-CloseCurlyQuote-água-n

https://w3id.org/own/own-en/instances/word-emergent-a
own-en-word-emergent-a

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

What has to be done:

@arademaker
Member

arademaker commented Sep 16, 2021

For nomlex instances, we can use the respective https://w3id.org/own/own-en/instances/ or https://w3id.org/own/own-pt/instances/. For the schema, it is fine to use https://w3id.org/own/schema/nomlex/, but this creates another problem! We need to rethink https://w3id.org/own/schema/, since that would be a prefix of the nomlex schema URI.

Two options:

  1. make the classes and properties from nomlex part of the general schema
  2. change the more general schema to https://w3id.org/own/schema/wn/

I prefer (1), that is, to merge the nomlex schema into the general schema: we can incorporate the nomlex classes and properties into our general schema for encoding WN in RDF.

@fredsonaguiar

fredsonaguiar commented Sep 18, 2021

In a479a97 and f9248c4 we updated the schemas after running schema.sh. After replacing the URIs, we ran the splitting to produce the new serialization, including ordering and prefixes. Please take a look.

The resulting files became considerably smaller.

@fredsonaguiar

For now, I think we can close this issue. If any new issues related to this discussion come up, feel free to reopen it.
