WN-LMF feedback from Francis Bond #146

Open
simongray opened this issue Nov 4, 2024 · 4 comments
Comments

@simongray
Member

simongray commented Nov 4, 2024

Blocking issues

  • redundant sense between lexical entry and synset

Certain synsets in DanNet seem to have multiple senses with the same lemma, e.g. "stendynge": https://wordnet.dk/dannet/data/synset-65080 or https://wordnet.dk/dannet/data/synset-3440

This causes the duplication in the dataset. Since it is a structural issue owing to how DanNet was constructed from existing entries in DDO, it's not something I alone have the power to change (it requires major lexicographical decisions).

  • redundant lexical entry with the same lemma and synset

This mirrors the former issue, but seemingly with the added complication that certain entries in DanNet have two written representations (not ideal), e.g. https://wordnet.dk/dannet/data/word-11039302-18

This latter, smaller issue should be possible for me to fix.
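As a rough illustration of how both redundancy warnings could be reproduced locally, the sketch below groups senses by lemma, part of speech, and synset and reports any group with more than one sense. It assumes the standard WN-LMF element names (LexicalEntry, Lemma, Sense); the sample data and IDs are made up, and the validator's actual definition of "redundant" may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, hand-written WN-LMF fragment; entry and sense IDs are made up.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo" label="Demo" language="da" email="demo@example.org" license="demo" version="1">
    <LexicalEntry id="word-1">
      <Lemma writtenForm="stendynge" partOfSpeech="n"/>
      <Sense id="sense-1" synset="synset-1"/>
      <Sense id="sense-2" synset="synset-1"/>
    </LexicalEntry>
    <LexicalEntry id="word-2">
      <Lemma writtenForm="stendynge" partOfSpeech="n"/>
      <Sense id="sense-3" synset="synset-2"/>
    </LexicalEntry>
    <Synset id="synset-1" ili="" partOfSpeech="n"/>
    <Synset id="synset-2" ili="" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>
"""


def redundant_senses(xml_text):
    """Map (lemma, pos, synset) to sense IDs wherever more than one sense shares that key."""
    groups = defaultdict(list)
    for entry in ET.fromstring(xml_text).iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        prefix = (lemma.get("writtenForm"), lemma.get("partOfSpeech"))
        for sense in entry.iter("Sense"):
            groups[prefix + (sense.get("synset"),)].append(sense.get("id"))
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


print(redundant_senses(SAMPLE))
```

Grouping across entries rather than within a single entry means the same check catches both a duplicated sense inside one entry and two entries that share a lemma and a synset.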

  • ILI is repeated across synsets

This has to do with our manual links to the Princeton WordNet (which are now ILI links).

There was apparently no QA in the linking process, which has resulted in multiple synsets linking to the same ILI, e.g. https://wordnet.dk/dannet/external/ili/i71549
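A quick way to list the affected synsets might look like the sketch below. It assumes the ILI is stored in the `ili` attribute of each Synset element, as in WN-LMF; the synset IDs in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, hand-written WN-LMF fragment; the synset IDs are made up.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo" label="Demo" language="da" email="demo@example.org" license="demo" version="1">
    <Synset id="synset-1" ili="i71549" partOfSpeech="n"/>
    <Synset id="synset-2" ili="i71549" partOfSpeech="n"/>
    <Synset id="synset-3" ili="i10" partOfSpeech="n"/>
    <Synset id="synset-4" ili="" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>
"""


def repeated_ilis(xml_text):
    """Map each ILI used by more than one synset to the IDs of those synsets."""
    by_ili = defaultdict(list)
    for synset in ET.fromstring(xml_text).iter("Synset"):
        ili = synset.get("ili", "")
        # An empty value means "no ILI" and "in" means "proposing a new ILI";
        # neither is a link to an existing ILI, so both are skipped.
        if ili and ili != "in":
            by_ili[ili].append(synset.get("id"))
    return {ili: ids for ili, ids in by_ili.items() if len(ids) > 1}


print(repeated_ilis(SAMPLE))
```

The resulting groups could then be reviewed by hand, or compared against the hypernym relations between the synsets in each group.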

  • proposed ILI is missing a definition

This refers to the fact that the dataset currently doesn't contain any <ILIDefinition> elements following the <Definition> element.

This should be trivial to fix, but I'm not sure why it's a blocking issue? Why should DanNet contain the ILI definitions? Should every WN-LMF contain these same definitions? Why isn't this a redundancy?

Things it would be nice to fix (or confirm that they are intended)

  • reverse relation is missing

It has so far been our norm to publish only the curated links and not the inferred/generated ones, such as hyponym (we publish hypernym).

Changing this would mean deviating from that norm. It would also increase the time/resources required to generate the WN-LMF dataset, since the export would need to query the virtual inference graph rather than its constituent parts.
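One alternative to querying the inference graph could be to derive the reverse relations as a post-processing step over the exported triples. The sketch below is only an illustration of that idea: the `INVERSE` table is a partial, hand-picked subset of WN-LMF relation types, and the function name and triple layout are assumptions rather than anything in the DanNet codebase.

```python
# Partial inverse table (an assumption for this sketch); the names follow the
# WN-LMF relType vocabulary, but the table is far from complete.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "mero_part": "holo_part",
    "holo_part": "mero_part",
    "similar": "similar",  # symmetric relation
}


def missing_reverse(relations):
    """Return the inverse (source, relType, target) triples that are not yet present."""
    present = set(relations)
    missing = []
    for source, rel, target in relations:
        inverse = INVERSE.get(rel)
        if inverse is None:
            continue  # no known inverse for this relation type
        candidate = (target, inverse, source)
        if candidate not in present and candidate not in missing:
            missing.append(candidate)
    return missing


# Made-up synset IDs for illustration.
curated = [
    ("synset-2", "hypernym", "synset-1"),
    ("synset-3", "similar", "synset-4"),
    ("synset-4", "similar", "synset-3"),
]
print(missing_reverse(curated))
```

Here only the hyponym link is reported, since the two `similar` triples already mirror each other.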

  • synset's part-of-speech is different from its hypernym's

This is a general feature of DanNet. An adjective like "fredelig" is a child of "egenskab" (noun). There is no changing this unless the lexicographers redo the general structure of DanNet.
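To get a list of the affected pairs for review, something along these lines might work. It assumes hypernym links are encoded as SynsetRelation elements with relType="hypernym"; the synset IDs in the sample are invented, with only the fredelig/egenskab pairing taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written WN-LMF fragment; the synset IDs are made up.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo" label="Demo" language="da" email="demo@example.org" license="demo" version="1">
    <Synset id="synset-egenskab" ili="" partOfSpeech="n"/>
    <Synset id="synset-fredelig" ili="" partOfSpeech="a">
      <SynsetRelation relType="hypernym" target="synset-egenskab"/>
    </Synset>
    <Synset id="synset-noun-child" ili="" partOfSpeech="n">
      <SynsetRelation relType="hypernym" target="synset-egenskab"/>
    </Synset>
  </Lexicon>
</LexicalResource>
"""


def pos_mismatches(xml_text):
    """List (synset, pos, hypernym, hypernym_pos) where the two parts of speech differ."""
    root = ET.fromstring(xml_text)
    pos_of = {s.get("id"): s.get("partOfSpeech") for s in root.iter("Synset")}
    mismatches = []
    for synset in root.iter("Synset"):
        for rel in synset.iter("SynsetRelation"):
            if rel.get("relType") != "hypernym":
                continue
            target_pos = pos_of.get(rel.get("target"))
            # Targets outside this file have no known POS and are skipped.
            if target_pos is not None and target_pos != synset.get("partOfSpeech"):
                mismatches.append(
                    (synset.get("id"), synset.get("partOfSpeech"), rel.get("target"), target_pos)
                )
    return mismatches


print(pos_mismatches(SAMPLE))
```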

  • relation is a self-loop

Most of these self-references are wn:domain_topic since the domain of a domain is itself.

The second-largest group of self-references is wn:similar, which seems to stem from the more recent addition of adjectives to DanNet. These should be trivial to remove entirely from DanNet, after which the new WN-LMF can be generated.

The remainder seems to be part-whole curiosities such as "rice" being a part of "rice": https://wordnet.dk/dannet/data/synset-1424

These can probably just be removed manually?
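A breakdown of self-loops by relation type, like the one described above, could be produced with a sketch such as the following. It assumes relations are stored as SynsetRelation children of each Synset; the sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written WN-LMF fragment; the synset IDs are made up.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo" label="Demo" language="da" email="demo@example.org" license="demo" version="1">
    <Synset id="synset-a" ili="" partOfSpeech="n">
      <SynsetRelation relType="domain_topic" target="synset-a"/>
      <SynsetRelation relType="hypernym" target="synset-b"/>
    </Synset>
    <Synset id="synset-b" ili="" partOfSpeech="n">
      <SynsetRelation relType="mero_part" target="synset-b"/>
    </Synset>
    <Synset id="synset-c" ili="" partOfSpeech="a">
      <SynsetRelation relType="similar" target="synset-c"/>
    </Synset>
  </Lexicon>
</LexicalResource>
"""


def self_loops(xml_text):
    """Return (synset_id, relType) for every relation whose target is its own source."""
    loops = []
    for synset in ET.fromstring(xml_text).iter("Synset"):
        for rel in synset.iter("SynsetRelation"):
            if rel.get("target") == synset.get("id"):
                loops.append((synset.get("id"), rel.get("relType")))
    return loops


loops = self_loops(SAMPLE)
counts = Counter(rel_type for _, rel_type in loops)
print(loops)
print(counts.most_common())
```

The per-type counts make it easier to decide which groups can be dropped in bulk and which few need manual inspection.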

I think the redundant sense/lexical entry is probably just an error in the script that builds the LMF. The ILI repeating across synsets is maybe more serious, but could possibly also be fixed programmatically: trivially by unlinking these synsets, but more interestingly by looking at the hypernym/hyponym relations between them, if any. There are very few proposed ILIs without definitions; in these cases, either the ILIs could simply not be proposed or definitions could be written.

We can easily add the missing reverse relations. A synset POS that differs from its hypernym's is not normally so common, so I would like to check that these are OK. Self-loop relations are probably bugs, but again this would need to be checked.

@simongray
Member Author

simongray commented Nov 4, 2024

Hi @fcbond, thank you for this excellent feedback (forwarded to me from Bolette). I have looked into every one of your points and there is indeed a lot that we can improve.

Unfortunately, most of the serious issues are down to how DanNet is put together lexicographically, which I am not able to fix entirely by myself.

@simongray
Member Author

One way the blocking redundancies might be removed entirely could be to introduce new senses (with new IDs) that contain explicit links back to the original DDO sources.

This would also require some method of synthesizing new labels since the labels currently contain important DDO metadata (entry numbers).

@fcbond

fcbond commented Nov 5, 2024

Thanks @simongray for the detailed analysis. I will look at it properly in a few days, as I am slow to understand the Danish bits :-).

One thing I can make clear:

proposed ILI is missing a definition

This refers to the fact that the dataset currently doesn't contain any <ILIDefinition> elements following the <Definition> element.

This should be trivial to fix, but I'm not sure why it's a blocking issue? Why should DanNet contain the ILI definitions?
Should every WN-LMF contain these same definitions? Why isn't this a redundancy?

This only happens when you say ili="in", which means that you want to add a new concept to ILI. In this case, you must add an English definition for the new ILI, so that speakers of other languages can coordinate more easily. You can easily solve this by instead saying ili="", which means that there is no ILI for this synset and you are not proposing one (at the moment; you can always do so later).

You only do this for these four synsets (I am not sure why only these):

    <Synset id="synset-60737" members="sense-28000738" lexfile="noun.artifact" ili="in" partOfSpeech="n">
    <Synset id="synset-53283" members="sense-28124956" lexfile="noun.artifact" ili="in" partOfSpeech="n">
    <Synset id="synset-40476" members="sense-21032192" lexfile="noun.artifact" ili="in" partOfSpeech="n">
    <Synset id="synset-58042" members="sense-21009570 sense-21060649" lexfile="noun.act" ili="in" partOfSpeech="n">

If you link to an existing ILI, or are not proposing a new one, then you don't need the definition.
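The rule above can be checked mechanically. The following is a minimal sketch, assuming ILIDefinition appears as a direct child of Synset; the sample IDs and definition text are made up and do not come from DanNet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written WN-LMF fragment; IDs and definitions are made up.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo" label="Demo" language="da" email="demo@example.org" license="demo" version="1">
    <Synset id="synset-1" ili="in" partOfSpeech="n">
      <Definition>en opdigtet definition</Definition>
    </Synset>
    <Synset id="synset-2" ili="in" partOfSpeech="n">
      <Definition>en anden opdigtet definition</Definition>
      <ILIDefinition>a made-up English definition for the proposed concept</ILIDefinition>
    </Synset>
    <Synset id="synset-3" ili="" partOfSpeech="n"/>
    <Synset id="synset-4" ili="i10" partOfSpeech="n"/>
  </Lexicon>
</LexicalResource>
"""


def proposals_without_definition(xml_text):
    """Return IDs of synsets that propose a new ILI (ili="in") but lack an ILIDefinition."""
    return [
        synset.get("id")
        for synset in ET.fromstring(xml_text).iter("Synset")
        if synset.get("ili") == "in" and synset.find("ILIDefinition") is None
    ]


print(proposals_without_definition(SAMPLE))
```

Each reported synset could then either receive an English ILIDefinition or have its attribute changed to ili="".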

@simongray
Member Author

Thanks, @fcbond. How odd. I'll look into it.


2 participants