
[verification] Need to review all IRIs #70

Closed
fcbr opened this issue Aug 21, 2015 · 20 comments
@fcbr
Member

fcbr commented Aug 21, 2015

We should probably keep them plain ASCII to avoid possible encoding issues between different systems and file formats.

@fcbr fcbr changed the title Need to review all IRIs [verification] Need to review all IRIs Aug 21, 2015
@fcbr fcbr added the data label Sep 18, 2015
@arademaker
Member

We now use + for words with spaces.

@arademaker
Member

arademaker commented Aug 20, 2021

In the LMF generation, we had to deal with the ID datatype of XML. See https://en.wikipedia.org/wiki/Document_type_definition and https://www.w3.org/TR/REC-xml/#id

So far, we haven't paid attention to that, and the URIs of words in the RDF files can contain / and accented characters. This must be made more robust.

@gdemelo had some links about how he implemented it in lexvo:

The Java API described at http://lexvo.org/linkeddata/tutorial.html includes a function to create such term URIs. The .jar file can be opened and includes source code in Java. Or you could use it in Clojure.

In Python, we could presumably use the standard library for URL encoding.

We expect words to end up encoded as below, for RDF and XML:

<https://w3id.org/own-pt/wn30-pt/instances/word-TCP%2FIP-n>
<LexicalEntry id="word-TCP%2FIP-n">
 <Lemma partOfSpeech="n" writtenForm="TCP/IP"/>
 <Sense id="own-pt-wordsense-06666486-n-3" synset="own-pt-synset-06666486-n"/>
</LexicalEntry>
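The standard library can indeed produce this percent-encoded form; a minimal sketch using urllib.parse.quote (note that safe="" is needed, since quote leaves "/" unencoded by default):

```python
from urllib.parse import quote

# Percent-encode a lemma for use in a word IRI.
# safe="" ensures "/" is encoded too (quote keeps it by default).
lemma = "TCP/IP"
print(f"word-{quote(lemma, safe='')}-n")  # word-TCP%2FIP-n
```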

related to #168

@arademaker arademaker added this to the pre release 1.0 milestone Aug 20, 2021
@arademaker
Member

Note that words may contain -; how would the encoding deal with that? One possible solution, to avoid confusion with the POS tag, is to transform the pattern word-ELS-pos into word-pos-ELS, where ELS is the "encoded lexical form".

@fredsonaguiar

Note that words may contain -; how would the encoding deal with that? One possible solution, to avoid confusion with the POS tag, is to transform the pattern word-ELS-pos into word-pos-ELS, where ELS is the "encoded lexical form".

I'd say that makes sense, but it might be confusing, for us and others, relative to the other URIs: Synsets and Senses follow the pattern class-name-pos, or similar. For now, the encoder/quoter keeps the '-' character in Word URIs.

@fredsonaguiar

About LMF LexicalEntry identifiers: since the character % isn't allowed in a DTD ID, we replace it with ::. So, for instance, instead of <LexicalEntry id="word-TCP%2FIP-n"> we get <LexicalEntry id="word-TCP::2FIP-n">.

About the valid DTD IDs, see https://www.w3.org/TR/REC-xml/#id or https://www.liquid-technologies.com/DTD/Datatypes/ID.aspx
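A minimal sketch of this replacement, taking the percent-encoded form from urllib.parse.quote as input (dtd_id is a hypothetical helper name, not the actual py-ownpt function):

```python
from urllib.parse import quote

def dtd_id(lemma, pos):
    # Percent-encode the lemma, then replace "%" (not allowed in a
    # DTD ID) with "::".
    return f"word-{quote(lemma, safe='')}-{pos}".replace("%", "::")

print(dtd_id("TCP/IP", "n"))  # word-TCP::2FIP-n
```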

@arademaker
Member

globalwordnet/schemas#55

Let us wait 1-2 days for some feedback.

@fredsonaguiar

fredsonaguiar commented Sep 15, 2021

As discussed in the issue referenced above, one of the more readable and natural ways to solve our problem is to quote the characters/forms that are not valid in xml:ID, e.g. a comma replaced with -cm- and an apostrophe replaced with -ap-. Whenever a name mapping is impossible, the final option is to fall back to hexadecimal encoding.

A first approach is to map characters to HTML entity names, which is a well-documented mapping. I found a first, smaller table of HTML entity names (Latin-1), manually scraped from https://cs.stanford.edu/people/miles/iso8859.html, removing the delimiters & and ; and considering only the first name. The rule was:

  1. replace spaces with underscores
  2. if CHAR is valid in a DTD ID, keep it
  3. if CHAR is invalid in a DTD ID, try to replace it with -NAME-
  4. if CHAR is invalid in a DTD ID and there is no name, replace it with -HEX-

This way we got some reasonable, expected IDs:

  • "(militar) ocupação": own-pt-word--lpar-militar-rpar-_ocupação-n
  • "Território do Norte": own-pt-word-Território_do_Norte-n
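The four-step rule above can be sketched as follows; NAMES is a small hypothetical subset of the scraped entity-name table, and the VALID character class is a simplification of the XML NameChar production:

```python
import re

# Hypothetical subset of the scraped entity-name table
NAMES = {"(": "lpar", ")": "rpar", ",": "cm", "'": "ap"}

# Simplified approximation of characters valid in a DTD ID
VALID = re.compile(r"[A-Za-z0-9._\u00C0-\u017F-]")

def quote_id(lemma):
    out = []
    for ch in lemma.replace(" ", "_"):   # 1. spaces -> underscores
        if VALID.match(ch):              # 2. valid chars pass through
            out.append(ch)
        elif ch in NAMES:                # 3. named replacement
            out.append(f"-{NAMES[ch]}-")
        else:                            # 4. hexadecimal (UTF-8) fallback
            out.append(f"-{ch.encode('utf-8').hex()}-")
    return "".join(out)

print(quote_id("(militar) ocupação"))  # -lpar-militar-rpar-_ocupação
```

With the small NAMES table above, the apostrophe in "vapor d’água" (a curly quote, not the straight apostrophe) falls through to the hex fallback, matching the e28099 form shown below.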

Only two Lexical Entries had characters replaced by Hexadecimal Forms:

    <LexicalEntry id="own-pt-word-vapor_d-e28099-água">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>
    <LexicalEntry id="own-pt-word-Sacro_Império_Romano-e28093-Germânico">
      <Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
      <Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
    </LexicalEntry>

@fredsonaguiar

fredsonaguiar commented Sep 15, 2021

Another possible approach, which avoids this scraping and uses a more robust table, is to use the Unicode entity names from HTML5, already implemented in https://docs.python.org/3/library/html.entities.html and based on https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references.

Note that many characters have multiple names; for instance, ℤ can be written as integers; or Zopf;. This is intentional, and a decision has to be made for our purposes: take the first name in lexicographical order, for instance.

Currently working on that.
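A sketch of that decision, inverting Python's html.entities.html5 table and keeping the lexicographically first name per character (names in the table may carry a trailing semicolon, which we strip for use inside IDs):

```python
from html.entities import html5

# Invert the name -> character table, keeping the lexicographically
# first name for each character.
char2name = {}
for name, char in sorted(html5.items()):
    key = name.rstrip(";")
    if char not in char2name:
        char2name[char] = key

print(char2name["\u2019"])  # CloseCurlyQuote (for the curly quote ’)
print(char2name["\u2013"])  # ndash (for the en dash –)
```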

@fredsonaguiar

In own-pt/py-ownpt@0cbe46b, we implemented the solution described above, obtaining the desired results, for instance:

    <LexicalEntry id="own-pt-word--lpar-de-rpar-_tempo_parcial-a">
      <Lemma partOfSpeech="a" writtenForm="(de) tempo parcial"/>
      <Sense id="own-pt-wordsense-01089369-a-1" synset="own-pt-synset-01089369-a"/>
    </LexicalEntry>
    ....
    <LexicalEntry id="own-pt-word-Sacro_Império_Romano-ndash-Germânico-n">
      <Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
      <Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
    </LexicalEntry>
    ...
    <LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

The resulting LMF was validated against the DTD.

@fredsonaguiar

In edfef02 we went back to the more readable Word URIs, without quoting. As a result, we got URIs such as:

<https://w3id.org/own-pt/wn30-pt/instances/word-&-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-(militar)+ocupação-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-Solda+de+estanho-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-digno+de+confiança-a> a wn30:Word ;

These were obtained after running the script https://github.com/own-pt/py-ownpt/blob/master/release.sh.

@arademaker
Member

Not sure I understood the last comment. Why use an underscore for spaces in the XML but a plus in the RDF? I would prefer uniform IDs for XML and RDF... What are the URIs in the RDF for own-pt-word-vapor_d-CloseCurlyQuote-água-n and own-pt-word--lpar-de-rpar-_tempo_parcial-a?

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

Maybe there was some misunderstanding. I've already changed the code to generate the ID from the URI instead of the Lemma. So, for instance:

<https://w3id.org/own-pt/wn30-pt/instances/word-vapor+d'água-n> a wn30:Word ;
    wn30:lemma "vapor d'água"@pt ;
    wn30:pos "n" .

This gives us the following XML ID:

    <LexicalEntry id="own-pt-word-vapor-plus-d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d'água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

Notice that the difference from before is that spaces were replaced by + in the URI, so the new IDs have -plus- instead of _.

@arademaker
Member

I was expecting

own-pt-word-vapor_d-CloseCurlyQuote-água-n

for both the URI in the RDF and the XML ID.

@arademaker
Member

Both should be based on the lemma, not the ID on the URI.

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

We did so in 225eba5 by running the updated scripts again. Then we got, for instance:

<https://w3id.org/own-pt/wn30-pt/instances/word-vapor_d-CloseCurlyQuote-água-n> a wn30:Word ;
    wn30:lemma "vapor d’água"@pt ;
    wn30:pos "n" .

This is translated to the XML element as follows:

    <LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
      <Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
      <Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
    </LexicalEntry>

The only difference is that in the XML the LMF lexicon ID (in this case, own-pt) is prefixed to the IDs, replacing the RDF namespace, following the suggestion from globalwordnet/schemas#55 (comment):

To construct an ID, you can then:

  1. Replace disallowed ID characters with the dash-escape-dash patterns
  2. Prefix own-pt- (or some other lexicon ID followed by a dash)
  3. Append a dash and the part-of-speech
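The three steps amount to simple string assembly; a sketch with a hypothetical make_id helper, assuming the lemma has already been dash-escaped:

```python
def make_id(lexicon, escaped_lemma, pos):
    # escaped_lemma already has disallowed characters replaced by the
    # dash-escape-dash patterns; prefix the lexicon ID plus a dash and
    # append a dash plus the part of speech.
    return f"{lexicon}-word-{escaped_lemma}-{pos}"

print(make_id("own-pt", "vapor_d-CloseCurlyQuote-água", "n"))
# own-pt-word-vapor_d-CloseCurlyQuote-água-n
```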

@arademaker
Member

@prefix own-en: <https://w3id.org/own/own-en/instances/> .
@prefix own-pt: <https://w3id.org/own/own-pt/instances/> .
@prefix owns: <https://w3id.org/own/schema/> .

Words:

own-pt:word-Afloramento-n a wn30:Word ;
wn30:lemma "Afloramento"@pt ;
wn30:pos "n" .

own-en:word-run-v a wn30:Word ;
wn30:lemma "run"@en ;
wn30:otherForm "ran"@en,
"running"@en ;
wn30:pos "v" .

For wordsenses:

own-LL:wordsense-00001740-a-1

For synsets:

@prefix own-en: <https://w3id.org/own/own-en/instances/> .
@prefix own-pt: <https://w3id.org/own/own-pt/instances/> .
@prefix owns: <https://w3id.org/own/schema/> .

own-pt:synset-00001740-a a wn30:AdjectiveSynset ;
skos:inScheme <https://w3id.org/own/own-pt/instances/> ;
wn30:offset "00001740" ;
wn30:synsetId "00001740-a" .

own-en:synset-00001740-n a wn30:BaseConcept,
wn30:NounSynset ;
skos:inScheme <https://w3id.org/own/own-en/instances/> ;
wn30:gloss "that which is perceived or known or inferred to have its own distinct existence (living or nonliving)"@en ;
wn30:lexicographerFile "noun.Tops" ;
wn30:offset "00001740" ;
wn30:synsetId "00001740-n" .

https://w3id.org/own/own-pt/instances/word-abafamento-n
own-pt-word-abafamento-n

https://w3id.org/own/own-pt/instances/word-vapor_d-CloseCurlyQuote-água-n
own-pt-word-vapor_d-CloseCurlyQuote-água-n

https://w3id.org/own/own-en/instances/word-emergent-a
own-en-word-emergent-a

@fredsonaguiar

fredsonaguiar commented Sep 16, 2021

What has to be done:

@arademaker
Member

arademaker commented Sep 16, 2021

For nomlex instances, we can use the respective https://w3id.org/own/own-en/instances/ or https://w3id.org/own/own-pt/instances/. For the schema, it is fine to use https://w3id.org/own/schema/nomlex/, but this creates another problem! We need to rethink https://w3id.org/own/schema/, since that would be a prefix of the nomlex schema URI.

Two options:

  1. make the classes and properties from nomlex part of the general schema
  2. change the more general schema to https://w3id.org/own/schema/wn/

I prefer (1), that is, to merge the nomlex schema into the general schema: we can incorporate the nomlex classes and properties into our general schema for encoding WN in RDF.

@fredsonaguiar

fredsonaguiar commented Sep 18, 2021

In a479a97 and f9248c4 we updated the schemas after running schema.sh. After replacing the URIs, we ran the splitting to produce the new serialization, including ordering and prefixes. Please take a look.

The resulting files became considerably smaller.

@fredsonaguiar

For now, I think we can close this issue. If any new issues related to this discussion come up, feel free to reopen it.
