2021 Release Candidate #747
The RDF version works for me. I could owl:import it into VocBench 3 (with GraphDB Free as backend and with RDF4J) without a problem, and so far everything looks fine. One very minor thing: the wn ontology (https://globalwordnet.github.io/schemas/wn#) had to be owl:imported separately (I could not find an owl:imports statement in the Turtle file). |
The first WNDB export is useless because: Also, no frames (+) are present in data.verb. |
The RDF version includes examples. FWIW, here are the statistics.
|
Thanks, I have updated the candidates:
|
In the GWA XML file, the new Sense id format is impractical to parse because the underscore is overloaded as both a key separator and a space in multi-word expressions, for ex. id="ewn-a_la_carte__3_00_00__". XML Relation statistics:
|
@ekaf sense keys are parsed and verified at every commit. The actual implementation is here: |
All synsets in the data.* files still have lex_filenum 45 (which is not a valid number) instead of their correct lex_filenum. In the second WNDB export, index.sense is not in the proper PWN format, because the sense keys use __ instead of the specified %. Also, in index.sense, the sense keys should not be prefixed with "ewn-", since almost all of them are PWN 3.1 keys. WNDB has fewer relations than the XML file: 384509 - 383876 = 633. This difference is fully explained by the absent relations. WNDB relation statistics:
|
Concerning the new XML Sense id format, I still find it impractical to extract the individual components of an id like "ewn-a_la_carte__3_00_00__", where there is no fixed number of components, since underscore is used both inside and outside lemmas, and __ may appear once or twice. Isn't the ID format too important a change to make at the last minute before release? It deserves a (rather big) issue... |
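To make the parsing difficulty concrete, here is a minimal sketch; the `split_sense_id` helper and its splitting rule are my own assumptions, not part of any release script. Splitting on the double underscore only works if no lemma ever contains a literal double underscore, and the remaining fields still have to be counted rather than named:

```python
# Hypothetical parser for IDs like "ewn-a_la_carte__3_00_00__".
# Assumption: a single "_" encodes a space inside the lemma, and the first
# "__" separates the lemma from the remaining key fields.
def split_sense_id(sense_id):
    body = sense_id[len("ewn-"):] if sense_id.startswith("ewn-") else sense_id
    lemma_part, _, key_part = body.partition("__")
    lemma = lemma_part.replace("_", " ")
    fields = key_part.split("_")  # a trailing "__" yields empty fields
    return lemma, fields

print(split_sense_id("ewn-a_la_carte__3_00_00__"))
# ('a la carte', ['3', '00', '00', '', ''])
```

Note how the empty trailing fields have to be interpreted positionally; nothing in the string itself says which component is which.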
Yes, any separator that allows uncomplicated extraction of the individual parts would be fine. Actually, ekaf/ski#1 (comment) already pointed out the problem of using colons (:) in XML IDs back in January 2020, since the original XML specification discourages it. |
Not really "a few months ago" but July 15, |
Maybe reopen #662, so that the issue can be discussed? The introduction of single underscores was even less announced than the colons. |
I have reopened, but I hope that the use of periods is the most acceptable solution and that we can close this quickly to proceed with the release. |
Wouldn't it have been simpler to keep the previous ID and also preserve the sense key as it was (parseable, with colons as field separators), as a simple (legacy) key that doesn't have to meet the xsd:ID requirements? |
As proposed by @1313ou, keeping the original verbatim sense keys would be ideal, if feasible |
The reason for the change was a bug report by @1313ou (globalwordnet/schemas#5 and globalwordnet/schemas#25). Please note that we also intend to drop the XML as a working format (see #664), and the YAML version has no issues with using the Princeton sense keys as is. Please also note that the previous XML ID scheme is highly problematic to calculate and involves a lot of 'encoded' information. I am happy to reintroduce the verbatim sense keys as another attribute, but it seems pointless when they can easily be obtained from the IDs. |
Can you elaborate on that? If it was used for the 2020 release, why shouldn't it work now? globalwordnet/schemas#5 was not in favour of using sense keys as IDs; it was intended to clarify their status as inner (generated) keys or foreign legacy keys (referring to PWN). |
The old sense IDs encoded the ordering of members of a synset. This is now captured by the |
So now that the ordering is readily available, I take it that it should be easy to use when generating IDs. |
I updated the Release Candidate (WNDB Export)
|
Yeah... sort of... but it is easier to remove the hack entirely. Part of the issue is that incoming links would get broken by rearrangements of the order of members in the synset, as this changed the sense IDs that were used as targets. |
I tried loading the XML file with Wn but got this:

```python
>>> import wn
>>> wn.add('english-wordnet-2021.xml')
Skipping ewn:2020 (Open English WordNet); already added
```

This is because the |
In wndb, everything is ok now: there are no ambiguous sense keys, no transitive loops, every symmetric relation is symmetric, and the asymmetric ones are as expected. There is only one minor annoyance left in data.noun: the gloss at offset 15539647 has a few spurious backslashes: "We, the intellectually curious, may soon find ourselves trapped in a pen,\ \ fenced in by rule-bound sticklerism and overzealous concern for our personal\ \ safety, unless we exercise our civil liberties and our curiosity," he declaims. |
@jmccrae wrote "I am happy to reintroduce the verbatim sense keys as another attribute but it seems pointless when they can easily be obtained from the IDs." Yes, if sense keys can be obtained from the IDs by a simple translation of the separators, i.e. without a lookup in an extensive mapping, then there is no need to have two different attributes for the same information. |
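If that holds, the reverse translation really is trivial. A sketch, assuming (per the discussion above, not any documented spec) that the only changes in index.sense keys are the "ewn-" prefix and "__" in place of "%":

```python
# Recover a PWN-style sense key from an OEWN index.sense key.
# Assumption: stripping "ewn-" and restoring the first "__" to "%" is the
# whole mapping; anything more would require a lookup table.
def pwn_sense_key(oewn_key):
    body = oewn_key[len("ewn-"):] if oewn_key.startswith("ewn-") else oewn_key
    return body.replace("__", "%", 1)

print(pwn_sense_key("ewn-abacus__1:06:01::"))
# abacus%1:06:01::
```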
Updates on Release Candidates
|
@jmccrae thanks, the XML version loads without issue now. Sorry I'm a bit late to the discussion on XML IDs. I don't want to delay things, but I'm concerned about the replacement method for punctuation, as noted in globalwordnet/schemas#55. Specifically, the use of dashes both to delimit escape sequences and as literal dash characters makes it ripe for collisions. Here's an example with both: <LexicalEntry id="ewn-bird-ap-s-eye-s"> The relevant code is in english-wordnet/scripts/wordnet.py, lines 774 to 775, at 82b7103
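A minimal demonstration of why such a scheme is not invertible (the `encode` function and the lemma "can-ap-t" are hypothetical; only the "-ap-" escape for an apostrophe is taken from the example above):

```python
# Encode an apostrophe as "-ap-" while also allowing literal dashes.
def encode(lemma):
    return lemma.replace("'", "-ap-")

# Two different inputs collide on the same encoded form, so the original
# lemma cannot be recovered from the ID alone.
assert encode("can't") == encode("can-ap-t") == "can-ap-t"
```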
I don't think there are any issues now, but considering that, for example, |
Even after sorting index.sense, NLTK's lemma_from_key still fails on 36938 sense keys, which reveals a very severe bug in the WNDB 2021 data.* files! Consider for ex. the OEWN 2021 sense key 'abacus%1:06:01::', which NLTK expects to be 'abacus%1:06:00::', based on the fact that its lex_id is 0 in data.noun. The actual problem is not in index.sense but in data.noun, which has two different synsets for abacus, both with lex_id 0 in lexfile 06, which is not allowed. One of these lex_ids should be 1 instead, and then index.sense would (hopefully) be correct. |
Good catch @ekaf. The WNDB export script did not understand the changed IDs. I have updated now |
It's much better now, but there is still a problem with 2562 adjective satellites, where NLTK expects the head to be the first lemma in the head synset. The head adjective is the one that has antonyms, and according to the senseidx manual:
So the order of these lemmas is wrong in data.adj. Below are the first 10 examples, where the sense key is correct but the order of the head_word within its synset in data.adj is not. WNDB 2021 sense keys vs. what NLTK expects: |
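The constraint can be illustrated with a small sketch (a hypothetical helper; the field layout follows the senseidx manual's description of satellite sense keys):

```python
# Build a satellite (ss_type 5) sense key. Per the manual, head_word must be
# the FIRST lemma of the head synset in the adjective cluster.
def satellite_sense_key(lemma, lex_filenum, lex_id, head_members, head_id):
    head_word = head_members[0]   # first member by spec, not "any member"
    return "%s%%5:%02d:%02d:%s:%02d" % (
        lemma, lex_filenum, lex_id, head_word, head_id)

print(satellite_sense_key("a_priori", 0, 0, ["theoretical", "theoretic"], 0))
# a_priori%5:00:00:theoretical:00
```

If the head synset's members are stored in the wrong order, the key built from the first member no longer matches the key in index.sense, which is exactly the mismatch NLTK trips over.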
Currently, most OEWN synsets don't respect the sense ordering specified in the wndb manual:
As a consequence, 2591 sense keys don't correspond to the lemma ordering in their synsets. Out of these, 23 are not adjective satellites, for ex.:
Additionally, 86130 (73.12%) Synset wordnames from Wn 3.1 changed in Wn 2021, for ex.:
And 4522 (3.84%) Synset wordnumbers from Wn 3.1 changed in Wn 2021, for ex.:
NLTK's synset identifiers are not expected to be completely stable across versions, but the number of changes is much lower between Princeton versions, where name changes are particularly rare:
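The intended ordering can be sketched as follows (hypothetical code; the `tag_counts` mapping from sense keys to cntlist tag counts is an assumed input, not an existing API):

```python
def order_synset_members(members, tag_counts):
    """Sort (lemma, sense_key) pairs by decreasing cntlist frequency, as
    the wndb manual specifies; Python's sort is stable, so untagged
    senses keep their original relative order."""
    return sorted(members, key=lambda m: -tag_counts.get(m[1], 0))

members = [("frame", "frame%1:06:00::"), ("abacus", "abacus%1:06:01::")]
tag_counts = {"abacus%1:06:01::": 5, "frame%1:06:00::": 1}
print(order_synset_members(members, tag_counts))
# abacus comes first: it has the higher tag count
```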
|
This refers to globalwordnet/schemas#17 |
There was a mistake in the extraction script: it was using the position of the sense in the synset, not the entry. I have updated now. Note there are some slight divergences from Princeton here, as we now have a real idea of what a lexical entry is. So, for example, 'bass' is now distinguished between the two pronunciations, and we don't rank the senses across the two entries. To ensure backwards compatibility I randomly list all senses of one entry before or after the other. |
Thanks @jmccrae, when and where will the updated WNDB be available? It doesn't seem to be at the usual place. |
Should be here. I see the update at this link: http://john.mccr.ae/oewn2021/english-wordnet-2021.zip |
@jmccrae, that's not an update: it is still the version from last Thursday, with all the problems mentioned above. |
@ekaf This file has definitely been updated
|
@jmccrae, what you write is true; however, 'ls -l dict' shows that the data files are from "Oct 21", and comparing them with the previous release yields no difference. |
@ekaf The only file updated was |
But the problem is the order of the words within their synsets in the data.* files... |
Updating index.sense has changed nothing wrt. the problem in WNDB (#747 (comment)). The same 2591 sense keys as before cannot be converted to NLTK lemmas. Here are the first 10 in sorted order:

Bad key in Wn 2021: a_priori%5:00:00:theoretical:00 (Wn31:Lemma('a_priori.s.02.a_priori'))

But the problem is bigger than that, because most sense keys (77%, actually) point to a different sense number than in PWN 3.1, because Wn 2021 does not order the lemmas within their synsets according to cntlist as specified. |
Maybe the following output better illustrates the problem:

Bad key in Wn 2021 for Lemma('a_priori.s.02.a_priori'), index.sense:a_priori%5:00:00:theoretical:00 != datafile:a_priori%5:00:00:theoretic:00 |
A small number of WNDB data.* synsets have definitions with more than one "|", leading to parse failures.
02020442 00 a 01 rural 0 011 ! 02022522 a 0101 + 04961506 n 0102 & 02020981 a 0000 & 02021158 a 0000 & 02021320 a 0000 & 02021613 a 0000 & 02021727 a 0000 & 02021895 a 0000 & 02022057 a 0000 & 02022225 a 0000 & 02022388 a 0000 | of or relating to the countryside as opposed to the city; living in or characteristic of farming or country life| living in or characteristic of farming or country life; rural people; large rural households; unpaved rural roads; an economy that is basically rural; rural electrification; rural free delivery

These explain all 23 key errors which do not concern adjective satellites.
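Parsers can be made robust against this by taking everything after the first "|" as the gloss. A defensive sketch (not the NLTK implementation):

```python
def split_data_line(line):
    # Partition on the FIRST "|" only, so any extra pipes end up inside
    # the gloss instead of breaking field parsing.
    fields, _, gloss = line.partition("|")
    return fields.split(), gloss.strip()

fields, gloss = split_data_line(
    "02020442 00 a 01 rural 0 011 | first gloss| second gloss")
print(gloss)
# first gloss| second gloss
```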
I updated the WNDB files again. @ekaf I can't seem to reproduce these errors. Perhaps you could share with me the code that you are testing? |
Of course @jmccrae, here is the code. But to support PWN 3.1, it requires the latest NLTK 'develop' branch, plus the latest nltk_data, and additionally you need to copy the EWN 2021 wndb files to a directory called wordnet2021 under nltk_data/corpora, and uncomment the line that mentions 'wordnet2021' in nltk/corpus/__init__.py, because NLTK does not yet support EWN.
|
Congratulations @jmccrae, there are no more bad keys in the newest WNDB update, now that the ordering of words within their synsets is fixed. 15 (0.01%) lemmas from Wn 3.1 changed in Wn 2021
2124 (1.8%) Synset id. wordnames from Wn 3.1 changed in Wn 2021
7277 (6.18%) Synset id. wordnumbers from Wn 3.1 changed in Wn 2021
|
Thanks for redirecting. I'm getting the error after updating.

```python
from nltk.corpus import wordnet2021

synsets = wordnet2021.all_synsets()
output = []
for synset in synsets:
    try:
        lemmas = wordnet2021.synset(synset._name).lemmas()
        for lemma in lemmas:
            t = lemma.synset()._name.split(".")[1]
            key = lemma.key()
            r = str(wordnet2021.ss2of(lemma.synset()))
            offset = r[:r.index("-")] + t
            output.append(key + " " + "wn:" + offset)
    except:
        print(synset, "name Not found==================================================")

with open("sensekeys2offsets.txt", "w") as writer:
    for line in output:
        writer.write(line + "\n")
```

Here is the error.

```
Traceback (most recent call last):
  File "/home/irfan/PycharmProjects/StreusleSpacy/sensekeys2offsets.py", line 16, in <module>
    for synset in synsets:
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1721, in all_synsets
    synset = from_pos_and_line(pos_tag, line)
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1513, in _synset_from_pos_and_line
    frame_string_fmt = VERB_FRAME_STRINGS[frame_number]
IndexError: tuple index out of range
```
|
The code above by @mirfan899 would run smoothly with the updated WNDB if the following out-of-range frame numbers were fixed. Isn't it surprising that such trivial bugs were not caught earlier?
|
OEWN 2021 has 39 subcategorization frames instead of WordNet's 35. It seems the proper fix is for

edit: I was looking at the XML release of OEWN 2021 and not the WNDB one, until just now. It looks like the

Also, side note, the module name |
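One defensive fix on the consumer side (a sketch; `VERB_FRAME_STRINGS` here is a stand-in list with abbreviated strings, not NLTK's actual tuple) is to tolerate unknown frame numbers instead of indexing blindly:

```python
# Stand-in table: index 0 unused, frames 1..35 defined (strings abbreviated).
VERB_FRAME_STRINGS = [None] + ["frame %d" % i for i in range(1, 36)]

def frame_string(frame_number):
    # Return None for frame numbers outside the table (e.g. OEWN's 36-39)
    # rather than raising IndexError as a bare table lookup would.
    if 0 < frame_number < len(VERB_FRAME_STRINGS):
        return VERB_FRAME_STRINGS[frame_number]
    return None

print(frame_string(35))   # frame 35
print(frame_string(38))   # None
```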
All this fuss (and the need for extra code) for 4 verb frames that each occur only ONCE (hardly a subcategorization then, is it?):
They join Prince William, Duke of Cumberland, Capital: Critique of Political Economy and Yahoo! in the list of oddities. Is this worth breaking legacy code? NB: The grinder I have written produces a wndb_compat preview version of oewn2021 that works as a seamless stand-in replacement for PWN and passes @ekaf's test with the current NLTK lib (which uses hardwired verb frames). |
Interesting dict, @1313ou! I confirm that it doesn't have the frame problem. It also passes the "Bad key" test, just as OEWN's latest WNDB release does. However, running the https://github.com/globalwordnet/english-wordnet/files/7459290/synchanged.py.txt script produces more changed wordnumbers, due to the index.POS* files not respecting the sense ranking frequencies: where WNDB has 7277, yours has 7938. In my opinion, this is a problem worth some "fuss":
NB: NLTK perceives your version as "3.1" because it extracts the version number from the 1740-byte license text in data.adj |
I removed the links to frames > 35 in the WNDB version. |
There has been a lot of discussion around issue #747 since it was closed two weeks ago. Maybe it deserves reopening temporarily? Concerning @goodmami's judicious remarks (#747 (comment)): hardcoding the verb frames as NLTK does is an NLTK bug. An issue should be raised there, but the newer verb frames should also be added to the verb.Framestext file, as in @1313ou's wndb distribution (https://github.com/x-englishwordnet/wndb), which is right now superior to the official OEWN export. I have been thinking along the same lines as @goodmami, weighing advantages against drawbacks of calling the NLTK package 'wordnet2021'. There is no legal issue, since Princeton no longer holds a trademark on the WordNet name. Perhaps @jmccrae has a preference concerning the NLTK package name? |
That would be a better fix, but note that the NLTK distribution of WordNet 3.0 does not include |
Hi all, I am at this point satisfied that the remaining issues with the release are only minor. The XML and RDF versions appear to be completely stable; the WNDB is still a little problematic, but it is not possible to accommodate all modifications made by this project within this legacy format. It would IMHO be better if NLTK and other tools moved to the XML format, which is more stable. Can I ask you all to please stop posting to this issue? Instead, please create new issues (using the "Release format issue" template). I will post an announcement of the release to Corpora List, Linguist List, WN Users and some social media on Sunday. I would like to thank you all for your contributions; it has all been incredibly useful, and I think we can all be proud of how much we have achieved with this release. |
All issues are now closed and the 2021 candidate is ready to be released.
I have the candidate files here:
Please let me know if there are any issues