[verification] Need to review all IRIs #70
Comments
We now use |
In the LMF generation, we had to deal with the ID datatype of XML. See https://en.wikipedia.org/wiki/Document_type_definition and https://www.w3.org/TR/REC-xml/#id. So far, we haven't paid attention to that, and URIs of words in the RDF files can contain @gdemelo had some links about how he implemented it in lexvo:
In Python, could we use a standard library for URL encoding? We expect words to have a final encoding like the one below for RDF and XML:
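On the standard-library question: a minimal sketch, assuming `urllib.parse.quote` is the kind of URL encoding meant here. The helper name `percent_encode_lemma` is hypothetical and not part of py-ownpt.

```python
from urllib.parse import quote, unquote


def percent_encode_lemma(lemma: str) -> str:
    """Percent-encode every character outside the RFC 3986 unreserved set."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return quote(lemma, safe="")


encoded = percent_encode_lemma("vapor d’água")
print(encoded)
# Decoding restores the original lemma.
print(unquote(encoded))
```

Note that "%" is not a valid XML name character, so percent-encoding alone would make a safe URI but not a valid XML ID.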
related to #168 |
Note that words may contain |
I'd say that makes sense, but this might be confusing, for us and others, when compared with the other URIs: synsets and senses follow the rule |
About LMF LexicalEntry identifiers, since character About the valid DTD IDs, see https://www.w3.org/TR/REC-xml/#id or https://www.liquid-technologies.com/DTD/Datatypes/ID.aspx |
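To check candidate identifiers against the rule in the XML recommendation linked above (values of type ID must match the Name production), something like the following could be used. It is only a sketch: the character ranges are transcribed from the XML 1.0 (Fifth Edition) Name production, and `is_valid_xml_id` is a hypothetical helper, not a function from py-ownpt.

```python
import re

# NameStartChar ranges from the XML 1.0 (Fifth Edition) Name production.
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, the middle dot and combining marks.
_NAME_CHAR = _NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

_XML_NAME = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")


def is_valid_xml_id(candidate: str) -> bool:
    """Return True if the whole string matches the XML Name production."""
    return _XML_NAME.fullmatch(candidate) is not None


print(is_valid_xml_id("own-pt-word-água-n"))
print(is_valid_xml_id("own-pt-word-vapor d’água-n"))
```

Accented letters such as "á" are accepted, while spaces, parentheses, "+" and the curly apostrophe are rejected.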
Let us wait 1-2 days for some feedback. |
As discussed in the issue referenced before, one of the more readable and natural ways to solve our problem is to quote the characters/forms that are not valid in an xml:ID, such as A first approach is to map characters to HTML entity names, which is a well-documented mapping. I found a first, smaller table of HTML entity names (Latin-1), manually scraped from https://cs.stanford.edu/people/miles/iso8859.html, removing delimiters
This way we got some reasonable, expected IDs:
Only two lexical entries had characters replaced by hexadecimal forms:
<LexicalEntry id="own-pt-word-vapor_d-e28099-água">
<Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
<Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
</LexicalEntry>
<LexicalEntry id="own-pt-word-Sacro_Império_Romano-e28093-Germânico">
<Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
<Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
</LexicalEntry> |
Another possible approach, which avoids this scraping and uses a more robust table, is to use the Unicode entity names from HTML5, already implemented in https://docs.python.org/3/library/html.entities.html, based on https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references. One may notice that many characters have multiple names, such as ℤ ( Currently working on that. |
In own-pt/py-ownpt@0cbe46b, we implemented the solution described above, obtaining the desired results, such as the following:
<LexicalEntry id="own-pt-word--lpar-de-rpar-_tempo_parcial-a">
<Lemma partOfSpeech="a" writtenForm="(de) tempo parcial"/>
<Sense id="own-pt-wordsense-01089369-a-1" synset="own-pt-synset-01089369-a"/>
</LexicalEntry>
...
<LexicalEntry id="own-pt-word-Sacro_Império_Romano-ndash-Germânico-n">
<Lemma partOfSpeech="n" writtenForm="Sacro Império Romano–Germânico"/>
<Sense id="own-pt-wordsense-08169677-n-2" synset="own-pt-synset-08169677-n"/>
</LexicalEntry>
...
<LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
<Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
<Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
</LexicalEntry>
The resulting LMF was validated by the DTD |
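For readers who want to reproduce the lemma part of these IDs, here is a rough sketch built on `html.entities.html5`. It is not the code from the commit. Three things are assumptions: the `_keep` test is a simplification of the XML name rules, picking the alphabetically first name with `min` is a guess that happens to reproduce the IDs shown above, and the hexadecimal UTF-8 fallback mirrors the earlier `e28099` form.

```python
from html.entities import html5

# Invert the HTML5 table: one character -> every entity name that yields it.
_NAMES = {}
for name, value in html5.items():
    # Skip legacy names without ";" and entities that expand to two code points.
    if name.endswith(";") and len(value) == 1:
        _NAMES.setdefault(value, []).append(name[:-1])


def _keep(ch):
    # Assumption: letters, digits, "_", "-" and "." pass through unchanged.
    return ch.isalnum() or ch in "_-."


def escape_lemma(lemma):
    """Rewrite a lemma so that it only contains ID-friendly characters."""
    parts = []
    for ch in lemma:
        if ch == " ":
            parts.append("_")
        elif _keep(ch):
            parts.append(ch)
        elif ch in _NAMES:
            # Assumption: choose the alphabetically first of several names.
            parts.append("-" + min(_NAMES[ch]) + "-")
        else:
            # No entity name available: fall back to the UTF-8 bytes in hex.
            parts.append("-" + ch.encode("utf-8").hex() + "-")
    return "".join(parts)


print(escape_lemma("(de) tempo parcial"))
print(escape_lemma("Sacro Império Romano–Germânico"))
print(escape_lemma("vapor d’água"))
```

The `own-pt-word-` prefix and the part-of-speech suffix are added separately and are not handled here.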
In edfef02 we went back to the more readable Word URIs, without quoting. As a result, we got URIs such as:
<https://w3id.org/own-pt/wn30-pt/instances/word-&-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-(militar)+ocupação-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-Solda+de+estanho-n> a wn30:Word ;
<https://w3id.org/own-pt/wn30-pt/instances/word-digno+de+confiança-a> a wn30:Word ;
After running the script https://github.com/own-pt/py-ownpt/blob/master/release.sh |
Not sure if I understood the last comment. Why use underscores for spaces in the XML but plus signs in the RDF? I would prefer uniform IDs for XML and RDF... What is the URI in the RDF for |
Maybe there is some misunderstanding. I've already changed the code to generate the ID based on the URI instead of the lemma. So, for instance:
<https://w3id.org/own-pt/wn30-pt/instances/word-vapor+d'água-n> a wn30:Word ;
wn30:lemma "vapor d'água"@pt ;
wn30:pos "n" .
Gives us the XML ID as follows:
<LexicalEntry id="own-pt-word-vapor-plus-d-CloseCurlyQuote-água-n">
<Lemma partOfSpeech="n" writtenForm="vapor d'água"/>
<Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
</LexicalEntry>
Notice that the difference from before is that spaces were replaced by |
I was expecting own-pt-word-vapor_d-CloseCurlyQuote-água-n for both the URI in the RDF and the XML ID. |
Both should be based on the lemma, rather than the ID being based on the URI. |
We did so in 225eba5 again by running the updated scripts. Then we got, for instance:
<https://w3id.org/own-pt/wn30-pt/instances/word-vapor_d-CloseCurlyQuote-água-n> a wn30:Word ;
wn30:lemma "vapor d’água"@pt ;
wn30:pos "n" .
This is translated to the XML element as follows:
<LexicalEntry id="own-pt-word-vapor_d-CloseCurlyQuote-água-n">
<Lemma partOfSpeech="n" writtenForm="vapor d’água"/>
<Sense id="own-pt-wordsense-15055442-n-3" synset="own-pt-synset-15055442-n"/>
</LexicalEntry>
The only difference is that in the XML the LMF lexicon ID (in this case,
|
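The relation between the two identifiers can be sketched as below. This reads the comment above as saying that the XML ID is the same key with the lexicon ID in front, which is an assumption based on the two examples. `INSTANCES` is the namespace seen in the RDF, `LEXICON_ID` is inferred from the `own-pt-` prefix of the XML IDs, and all function names are hypothetical.

```python
# Namespace taken from the RDF examples in this thread.
INSTANCES = "https://w3id.org/own-pt/wn30-pt/instances/"
# Assumed LMF lexicon ID, inferred from the "own-pt-" prefix of the XML IDs.
LEXICON_ID = "own-pt"


def word_key(escaped_lemma, pos):
    """Shared key built from the already escaped lemma and the POS tag."""
    return f"word-{escaped_lemma}-{pos}"


def rdf_iri(key):
    """IRI of the word in the RDF files."""
    return INSTANCES + key


def lmf_id(key):
    """ID of the LexicalEntry in the LMF XML."""
    return f"{LEXICON_ID}-{key}"


key = word_key("vapor_d-CloseCurlyQuote-água", "n")
print(rdf_iri(key))
print(lmf_id(key))
```

Deriving both from one key keeps the RDF and the XML in step, which is the uniformity asked for earlier in the thread.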
@prefix own-en : https://w3id.org/own/own-pt/instances/ .
Words:
own-pt:word-Afloramento-n a wn30:Word ;
own-en:word-run-v a wn30:Word ;
For wordsenses:
own-LL:wordsense-00001740-a-1
For synsets:
@prefix own-en : https://w3id.org/own/own-en/instances/ .
own-pt:synset-00001740-a a wn30:AdjectiveSynset ;
own-en:synset-00001740-n a wn30:BaseConcept,
https://w3id.org/own/own-pt/instances/word-abafamento-n
https://w3id.org/own/own-pt/instances/word-vapor_-CloseCurlyQuote-água-n
https://w3id.org/own/own-en/instances/word-emergent-a |
What has to be done:
|
For nomlex instances, we can use the respective https://w3id.org/own/own-en/instances/ or https://w3id.org/own/own-pt/instances/. For the schema, it is OK to use https://w3id.org/own/schema/nomlex/, but this creates another problem: we need to rethink https://w3id.org/own/schema/, since that would be a prefix of the nomlex schema URI. Two options:
I prefer (1): merging the nomlex schema into the general schema, i.e., incorporating the nomlex classes and properties into our general schema for encoding WN in RDF. |
For now, I think we're closing this issue. If there are any new issues related to this discussion, feel free to reopen it. |
We probably should keep them plain ASCII to avoid any possible encoding issues between different systems and file formats.