WN-LMF feedback from Francis Bond #146
Hi @fcbond, thank you for this excellent feedback (forwarded to me from Bolette). I have looked into every one of your points and there is indeed a lot that we can improve. Unfortunately, most of the serious issues are down to how DanNet is put together lexicographically, which I am not able to fix entirely by myself.
One way to remove the blocking redundancies entirely could be to introduce new senses (with new IDs) that contain explicit links back to the original DDO sources. This would also require some method of synthesizing new labels, since the labels currently contain important DDO metadata (entry numbers).
Thanks @simongray for the detailed analysis. I will look at it properly in a few days, as I am slow to understand the Danish bits :-). One thing I can make clear:
This only happens when you say You only do this for these four synsets (I am not sure why only):
If you link to an existing ILI, or are not proposing a new one, then you don't need the definition.
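A minimal sketch of how that rule could be checked, assuming the usual WN-LMF convention that a synset proposing a new ILI is marked with `ili="in"`. The sample data and the function name `missing_ili_definitions` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment; IDs and definitions are invented.
SAMPLE = """<Lexicon id="example">
  <Synset id="syn-1" ili="i71549" partOfSpeech="n">
    <Definition>linked to an existing ILI</Definition>
  </Synset>
  <Synset id="syn-2" ili="in" partOfSpeech="n">
    <Definition>proposes a new ILI</Definition>
    <ILIDefinition>proposes a new ILI, with its own definition</ILIDefinition>
  </Synset>
  <Synset id="syn-3" ili="in" partOfSpeech="n">
    <Definition>proposes a new ILI, but lacks an ILI definition</Definition>
  </Synset>
</Lexicon>"""


def missing_ili_definitions(root):
    """Return IDs of synsets that propose a new ILI but have no <ILIDefinition>."""
    return [
        synset.get("id")
        for synset in root.iter("Synset")
        # Assumption: ili="in" is the marker for a proposed new ILI.
        if synset.get("ili") == "in" and synset.find("ILIDefinition") is None
    ]


flagged = missing_ili_definitions(ET.fromstring(SAMPLE))
print(flagged)
```

Synsets linked to an existing ILI (like `syn-1` in the sample) are never flagged, which matches the point that they do not need the element.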
Thanks, @fcbond. How odd. I'll look into it.
Certain synsets in DanNet seem to have multiple senses with the same lemma, e.g. "stendynge": https://wordnet.dk/dannet/data/synset-65080 or https://wordnet.dk/dannet/data/synset-3440
This causes the duplication in the dataset. Since it is a structural issue owing to how DanNet was constructed from existing entries in DDO, it's not something I alone have the power to change (it requires major lexicographical decisions).
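For anyone who wants to list the affected synsets, here is a rough sketch that groups senses by synset and lemma in a WN-LMF file. The sample entries and IDs are invented, and `duplicate_lemmas` is a hypothetical helper, not part of DanNet's tooling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical WN-LMF fragment: two entries with the same lemma share one synset.
SAMPLE = """<Lexicon id="example">
  <LexicalEntry id="w1">
    <Lemma writtenForm="stendynge" partOfSpeech="n"/>
    <Sense id="s1" synset="syn-1"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="stendynge" partOfSpeech="n"/>
    <Sense id="s2" synset="syn-1"/>
  </LexicalEntry>
  <LexicalEntry id="w3">
    <Lemma writtenForm="sten" partOfSpeech="n"/>
    <Sense id="s3" synset="syn-2"/>
  </LexicalEntry>
</Lexicon>"""


def duplicate_lemmas(root):
    """Map (synset ID, lemma) to sense IDs wherever a lemma occurs more than once."""
    groups = defaultdict(list)
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma").get("writtenForm")
        for sense in entry.findall("Sense"):
            groups[(sense.get("synset"), lemma)].append(sense.get("id"))
    # Keep only the groups that actually contain duplicates.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


duplicates = duplicate_lemmas(ET.fromstring(SAMPLE))
print(duplicates)
```

This only reports the duplicates; deciding which sense to keep is the lexicographical question described above.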
This is a mirror of the former issue, but seemingly with the added issue of certain entries in DanNet having two written representations (not ideal), e.g. https://wordnet.dk/dannet/data/word-11039302-18
This latter, smaller issue should be possible for me to fix.
This has to do with our manual linking to the Princeton WordNet (these links are now ILI links).
There was apparently no QA in the process, which has resulted in multiple synsets linking to the same ILI, e.g. https://wordnet.dk/dannet/external/ili/i71549
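A small sketch of a QA pass that could catch this, grouping synsets by their `ili` attribute. The synset IDs are invented and `shared_ili_links` is a hypothetical name; treating `ili="in"` and an empty value as "no existing ILI" is an assumption about the file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical WN-LMF fragment: two synsets point at the same ILI.
SAMPLE = """<Lexicon id="example">
  <Synset id="syn-1" ili="i71549" partOfSpeech="n"/>
  <Synset id="syn-2" ili="i71549" partOfSpeech="n"/>
  <Synset id="syn-3" ili="i100" partOfSpeech="n"/>
  <Synset id="syn-4" ili="in" partOfSpeech="n"/>
  <Synset id="syn-5" ili="" partOfSpeech="n"/>
</Lexicon>"""


def shared_ili_links(root):
    """Map each ILI to its synset IDs wherever more than one synset links to it."""
    by_ili = defaultdict(list)
    for synset in root.iter("Synset"):
        ili = synset.get("ili")
        # Assumption: "" and "in" mean no link to an existing ILI, so skip them.
        if ili and ili != "in":
            by_ili[ili].append(synset.get("id"))
    return {ili: ids for ili, ids in by_ili.items() if len(ids) > 1}


shared = shared_ili_links(ET.fromstring(SAMPLE))
print(shared)
```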
This refers to the fact that the dataset currently doesn't contain any `<ILIDefinition>` elements following the `<Definition>` element.
This should be trivial to fix, but I'm not sure why it's a blocking issue? Why should DanNet contain the ILI definitions? Should every WN-LMF contain these same definitions? Why isn't this a redundancy?
It has so far been our norm to publish the curated links and not the inferred/generated ones such as `hyponym` (we publish `hypernym`).
Changing this would mean deviating from this norm, and it will also increase the time/resources required to generate the WN-LMF dataset, since it will need to query the virtual inference graph rather than its constituent parts.
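As a possible alternative to querying the inference graph, the inverse links could be derived in a post-processing step over the exported relations. The sketch below works on plain `(source, relation, target)` triples; the synset IDs, the function name `add_inverse_relations`, and the one-entry inverse table are illustrative assumptions, not how DanNet's export currently works.

```python
def add_inverse_relations(relations, inverse_of=None):
    """Return the relations plus any missing inverse edges.

    `relations` is a list of (source, relation type, target) triples.
    `inverse_of` maps a relation type to its inverse; only the
    hypernym/hyponym pair is assumed by default.
    """
    inverse_of = inverse_of or {"hypernym": "hyponym"}
    result = list(relations)
    seen = set(relations)
    for source, rel_type, target in relations:
        inverse = inverse_of.get(rel_type)
        if inverse is None:
            continue  # no known inverse for this relation type
        candidate = (target, inverse, source)
        if candidate not in seen:  # do not duplicate an edge that already exists
            seen.add(candidate)
            result.append(candidate)
    return result


# Hypothetical curated links: only the hypernym direction is published.
curated = [
    ("syn-dog", "hypernym", "syn-animal"),
    ("syn-cat", "hypernym", "syn-animal"),
]
expanded = add_inverse_relations(curated)
for triple in expanded:
    print(triple)
```

Keeping the curated triples first and appending the generated ones makes it easy to tell the two apart if the norm of publishing only curated links should remain visible.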
This is a general feature of DanNet. An adjective like "fredelig" is a child of "egenskab" (noun). There is no changing this unless the lexicographers redo the general structure of DanNet.
Most of these self-references are `wn:domain_topic`, since the domain of a domain is itself.
The second-largest group of self-references are `wn:similar`, which seems to stem from the more recent addition of adjectives to DanNet. These should be trivial to remove entirely from DanNet, after which the new WN-LMF can be generated.
The remainder seems to be part-whole curiosities such as "rice" being a part of "rice": https://wordnet.dk/dannet/data/synset-1424
These can probably just be removed manually?
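If manual removal turns out to be tedious, a filter over the generated WN-LMF could drop every relation whose target is its own synset. This is only a sketch: the sample IDs are invented, `remove_self_references` is a hypothetical helper, and it removes self-references of all relation types without distinguishing between them.

```python
import xml.etree.ElementTree as ET

# Hypothetical WN-LMF fragment with two self-referencing relations.
SAMPLE = """<Lexicon id="example">
  <Synset id="syn-1" partOfSpeech="n">
    <SynsetRelation relType="mero_part" target="syn-1"/>
    <SynsetRelation relType="hypernym" target="syn-2"/>
  </Synset>
  <Synset id="syn-2" partOfSpeech="n">
    <SynsetRelation relType="domain_topic" target="syn-2"/>
  </Synset>
</Lexicon>"""


def remove_self_references(root):
    """Delete relations that point back at their own synset; return the count removed."""
    removed = 0
    for synset in root.iter("Synset"):
        # Copy the list first so removing children does not disturb iteration.
        for relation in list(synset.findall("SynsetRelation")):
            if relation.get("target") == synset.get("id"):
                synset.remove(relation)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_self_references(root)
remaining = [(s.get("id"), r.get("relType")) for s in root.iter("Synset")
             for r in s.findall("SynsetRelation")]
print(removed_count, remaining)
```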