
2021 Release Candidate #747

Closed · jmccrae opened this issue Sep 3, 2021 · 76 comments
Labels: help wanted (Extra attention is needed)
Milestone: 2021 Release

@jmccrae (Member) commented Sep 3, 2021

All issues are now closed and the 2021 candidate is ready to be released.

I have the candidate files here:

Please let me know if there are any issues.

@jmccrae added the help wanted label Sep 3, 2021
@jmccrae added this to the 2021 Release milestone Sep 3, 2021
@rwingerter55 commented:

The RDF version works for me. I could owl:import it into VocBench 3 (with GraphDB Free as backend and with RDF4J) without a problem and so far everything looks fine. One very minor thing: the wn ontology (https://globalwordnet.github.io/schemas/wn#) had to be owl:imported separately (I could not find an owl:imports statement in the turtle file).
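
For illustration, the missing statement could be added with rdflib; a minimal sketch, where the ontology IRI and file names are hypothetical:

import rdflib
from rdflib.namespace import OWL

# Parse the release Turtle file and add the owl:imports triple for the GWA schema
g = rdflib.Graph().parse("english-wordnet-2021.ttl", format="turtle")
ontology = rdflib.URIRef("https://en-word.net/")  # hypothetical ontology IRI
g.add((ontology, OWL.imports,
       rdflib.URIRef("https://globalwordnet.github.io/schemas/wn")))
g.serialize("english-wordnet-2021-imports.ttl", format="turtle")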

@ekaf (Contributor) commented Sep 7, 2021

The first WNDB export is useless because:

- All synsets in the data.* files have lex_filenum 45, which is not a valid number.
- index.sense contains 0 bytes; many further tests require the sense keys.

Also, no frames (+) are present in data.verb.

@rwingerter55 commented:

The RDF version includes examples. FWIW, here are the statistics.

partOfSpeech          noOfSynsets  withExamples  pct
adjective                   10726          7588  71%
adjective_satellite         16329         12569  77%
adverb                       4609          4169  90%
noun                        87119         12776  15%
verb                        16696         12546  75%
Total                      135479         49648  37%

@jmccrae (Member, Author) commented Sep 9, 2021

Thanks, I have updated the candidates:

  • owl:imports statements added
  • Sense IDs no longer contain :
  • index.sense is no longer empty
  • Frames are available in the verb sections of the WNDB export

@ekaf (Contributor) commented Sep 9, 2021

In the GWA XML file, the new Sense id format is impractical to parse because underscore is overloaded as both a key separator and a space in multi-word expressions, e.g. id="ewn-a_la_carte__3_00_00__".
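
For illustration, naive splitting cannot recover the lemma, because its internal underscores look exactly like field separators:

# The ID quoted above, split on underscores: the lemma fragments and the
# key fields are indistinguishable
sense_id = "ewn-a_la_carte__3_00_00__"
print(sense_id.split("_"))
# ['ewn-a', 'la', 'carte', '', '3', '00', '00', '', '']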

XML Relation statistics:

Count Relation
3878 also
8000 antonym
1278 attribute
221 causes
74687 derivation
1511 domain_region
6730 domain_topic
408 entails
1379 exemplifies
1511 has_domain_region
6730 has_domain_topic
12291 holo_member
9199 holo_part
830 holo_substance
91503 hypernym
91503 hyponym
8589 instance_hypernym
8589 instance_hyponym
221 is_caused_by
408 is_entailed_by
1379 is_exemplified_by
12291 mero_member
9199 mero_part
830 mero_substance
4 other
73 participle
8073 pertainym
23194 similar
384509 Total

@jmccrae (Member, Author) commented Sep 9, 2021

@ekaf sense keys are parsed and verified at every commit. The actual implementation is here:
https://github.com/globalwordnet/english-wordnet/blob/master/scripts/wordnet_yaml.py#L21

@ekaf (Contributor) commented Sep 9, 2021

All synsets in the data.* files still have lex_filenum 45 (which is not a valid number), instead of their correct lex_filenum.

In the second WNDB export, index.sense is not in the proper PWN format, because the sense keys use __ instead of the specified %. Also, in index.sense, the sense keys should not be prefixed by "ewn-", since almost all of them are PWN 3.1 keys.
But after making these two simple edits with search and replace, I found that all validity tests pass OK.
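
Those two edits amount to something like the following sketch, assuming the only deviations from the PWN key format are the "ewn-" prefix and a "__" standing in for "%":

import re

# Rewrite index.sense keys to the PWN format: drop the "ewn-" prefix and
# restore "%" as the separator (assumed to be the first "__" in each key)
with open("index.sense") as src, open("index.sense.fixed", "w") as dst:
    for line in src:
        line = re.sub(r"^ewn-", "", line)
        dst.write(line.replace("__", "%", 1))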

WNDB has fewer relations than the XML file: 384509 - 383876 = 633. This difference is fully explained by the absence of 221 is_caused_by + 408 is_entailed_by + 4 other (221 + 408 + 4 = 633), so the two relation totals agree perfectly.

WNDB relation statistics:

Count Symbol Source Target Type
4028 ! a a lex
21458 & a a sem
2 + a a lex
1 -c a a sem
1 ;c a a sem
37 \ a a lex
2704 ^ a a sem
12769 + a n lex
1118 ;c a n sem
9 ;c a n lex
76 ;r a n sem
7 ;r a n lex
227 ;u a n sem
5 ;u a n lex
639 = a n sem
4784 \ a n lex
1 ^ a n sem
9 + a r lex
1568 + a v lex
73 < a v lex
12769 + n a lex
1118 -c n a sem
9 -c n a lex
76 -r n a sem
7 -r n a lex
227 -u n a sem
5 -u n a lex
639 = n a sem
2160 ! n n lex
12291 #m n n sem
9199 #p n n sem
830 #s n n sem
12291 %m n n sem
9199 %p n n sem
830 %s n n sem
2957 + n n lex
4296 -c n n sem
4 -c n n lex
1268 -r n n sem
145 -r n n lex
658 -u n n sem
396 -u n n lex
4296 ;c n n sem
4 ;c n n lex
1268 ;r n n sem
145 ;r n n lex
658 ;u n n sem
396 ;u n n lex
78200 @ n n sem
8589 @i n n sem
12 ^ n n sem
78200 ~ n n sem
8589 ~i n n sem
39 -c n r sem
71 -u n r sem
1 -u n r lex
21518 + n v lex
1262 -c n v sem
1 -c n v lex
9 -r n v sem
6 -r n v lex
19 -u n v sem
2 -u n v lex
9 + r a lex
3251 \ r a lex
39 ;c r n sem
71 ;u r n sem
1 ;u r n lex
712 ! r r lex
1 \ r r lex
6 ^ r r sem
1568 + v a lex
21518 + v n lex
1262 ;c v n sem
1 ;c v n lex
9 ;r v n sem
6 ;r v n lex
19 ;u v n sem
2 ;u v n lex
1100 ! v v lex
1734 $ v v sem
2 $ v v lex
408 * v v sem
221 > v v sem
13303 @ v v sem
1148 ^ v v lex
7 ^ v v sem
13303 ~ v v sem
383876 Total

@ekaf (Contributor) commented Sep 9, 2021

Concerning the new XML Sense id format, I still find it impractical to extract the individual components of an id like "ewn-a_la_carte__3_00_00__", where there is no fixed number of components, since underscore is used both inside and outside lemmas, and __ may appear once or twice.

Isn't the ID format too important a change to make at the last minute before release? It deserves a (rather big) issue...

@jmccrae (Member, Author) commented Sep 9, 2021

The ID format change was proposed a few months ago in #662. It was then pointed out in #749 that using : in XML IDs breaks some systems.

Another option could be to use periods, e.g., ewn-a_la_carte__3.00.00..

@ekaf (Contributor) commented Sep 9, 2021

Yes, any separator that allows uncomplicated extraction of the individual parts would be fine.

Actually, ekaf/ski#1 (comment) already pointed out the problem of using colons (:) in XML IDs back in January 2020, since the original XML specification discourages it.

@ekaf (Contributor) commented Sep 10, 2021

After applying a few small hacks and simply replacing the 'wordnet' folder in nltk_data/corpora, the WNDB works with NLTK.

First, NLTK complains that the 'lexnames' file is absent. After copying this file from any previous release (it has not changed since WN 1.6, where it was introduced), an error is raised because 45 is not a valid lexfile number. After adding 45 to 'lexnames', everything went smoothly, although it would be better to have the correct lexfile numbers instead of 45 in data.*. Of course, index.sense was fixed to conform to the PWN sense key format.
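
The lexnames hack boils down to a single appended line, sketched below; the category name given to the bogus lexfile 45 is made up, since anything parseable will do:

# lexnames lines are "<2-digit number><TAB><lexfile name><TAB><pos ordinal>";
# append a dummy entry so NLTK accepts lex_filenum 45
with open("dict/lexnames", "a") as f:
    f.write("45\tnoun.dummy\t1\n")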

After this I used the new graph capabilities of the current development branch in NLTK to draw the hyponyms of "meeting":

import nltk
from nltk.parse.dependencygraph import dot2img  # available on NLTK's development branch
from nltk.corpus import wordnet as wn

# Draw the hyponym tree below meeting.n.01, left to right, as a PNG
dot2img(wn.digraph([wn.synset('meeting.n.01')], lambda s: s.hyponyms(),
                   attr={'rankdir': 'LR', 'compound': 'true'},
                   shapes={'': 'plaintext'}), 'png')

(attached image: hypo-meeting)

@1313ou (Contributor) commented Sep 10, 2021

Not really "a few months ago", but on July 15:
"The IDs we use on sense's XML will be changed to resemble the PWN sense IDs"
There was no idea then of what they would look like. The introduction of colons was not announced.

@ekaf (Contributor) commented Sep 10, 2021

Maybe reopen #662, so that the issue can be discussed? The introduction of single underscores was even less announced than the colons.

@jmccrae (Member, Author) commented Sep 10, 2021

I have reopened it, but I hope that the use of periods is the most acceptable solution and that we can close this quickly to proceed with the release.

@1313ou (Contributor) commented Sep 10, 2021

Wouldn't it have been simpler to keep the previous ID and also preserve the sense key as it was (parseable, with colons as field separators), a simple (legacy) key that doesn't have to meet the xsd:ID requirements?
I don't understand the point of this move. We're not pressed for space; the wordnet is not going to be stored on a microcontroller, where resources are scarce. And it unnecessarily breaks existing apps.

@ekaf (Contributor) commented Sep 10, 2021

As proposed by @1313ou, keeping the original verbatim sense keys would be ideal, if feasible.

@jmccrae (Member, Author) commented Sep 13, 2021

The reason for the change was a bug report by @1313ou:

globalwordnet/schemas#5 and globalwordnet/schemas#25

Please note that we also intend to drop the XML as a working format (see #664) and the YAML version has no issues with using the Princeton sense keys as is. Please also note that the previous XML ID scheme is highly problematic to calculate and involves lots of 'encoded' information. I am happy to reintroduce the verbatim sense keys as another attribute but it seems pointless when they can easily be obtained from the IDs.

@1313ou (Contributor) commented Sep 13, 2021

Please also note that the previous XML ID scheme is highly problematic to calculate and involves lots of 'encoded' information.

Can you elaborate on that? If it was used for the 2020 release, why shouldn't it work now?

globalwordnet/schemas#5 was not in favour of using sense keys as IDs; it was intended to clarify their status as inner (generated) keys or foreign legacy keys (referring to PWN).

@jmccrae (Member, Author) commented Sep 13, 2021

The old sense IDs encoded the ordering of members of a synset. This is now captured by the members attribute on synsets. It was a horrible hack that caused lots of headaches!

@1313ou (Contributor) commented Sep 13, 2021

So now that the ordering is readily available, I take it it should be easy to use it when generating IDs.

@jmccrae (Member, Author) commented Sep 13, 2021

I updated the Release Candidate (WNDB Export)

  • Added lexnames
  • Fixed lexfiles in data.noun (etc). There is no more lexfile number 45!
  • Fixed IDs in index.sense

@jmccrae (Member, Author) commented Sep 13, 2021

So now that the ordering is readily available, I take it it should be easy to use it when generating IDs.

Yeah... sort of... but it is easier to remove the hack entirely. Part of the issue is that incoming links would get broken by rearrangements of the order of members in the synset as this changed the sense IDs that were used as targets.

@goodmami (Member) commented:

I tried loading the XML file with Wn but got this:

>>> import wn
>>> wn.add('english-wordnet-2021.xml')
Skipping ewn:2020 (Open English WordNet); already added

>>> 

This is because the version attribute on <Lexicon> is still "2020". I understand not wanting to version it "2021" until it's ready, but it's also inaccurate to say "2020". What about "2021rc1", etc. for pre-release versions?

@ekaf (Contributor) commented Sep 14, 2021

In wndb, everything is OK now: there are no ambiguous sense keys, no transitive loops, every symmetric relation is symmetric, and the asymmetric ones are as expected.

There is only one minor annoyance left in data.noun: the gloss at offset 15539647 has a few spurious backslashes: "We, the intellectually curious, may soon find ourselves trapped in a pen,\ \ fenced in by rule-bound sticklerism and overzealous concern for our personal\ \ safety, unless we exercise our civil liberties and our curiosity," he declaims.

@ekaf (Contributor) commented Sep 14, 2021

@jmccrae wrote "I am happy to reintroduce the verbatim sense keys as another attribute but it seems pointless when they can easily be obtained from the IDs."

Yes, if sense keys can be obtained from the IDs by a simple translation of the separators, i.e. without looking them up in an extensive mapping, then there is no need to have two different attributes for the same information.
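
Assuming the period-based format proposed above (ewn-&lt;lemma&gt;__&lt;ss_type&gt;.&lt;lex_filenum&gt;.&lt;lex_id&gt;.&lt;head_word&gt;.&lt;head_id&gt;), that translation is a couple of string operations; this sketch is illustrative, not the project's code:

def id_to_sense_key(xml_id):
    # "ewn-a_la_carte__3.00.00.." -> "a_la_carte%3:00:00::" (Python 3.9+)
    lemma, _, rest = xml_id.removeprefix("ewn-").partition("__")
    return lemma + "%" + rest.replace(".", ":")

assert id_to_sense_key("ewn-a_la_carte__3.00.00..") == "a_la_carte%3:00:00::"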

@jmccrae (Member, Author) commented Sep 14, 2021

Updates on Release Candidates

@goodmami (Member) commented:

@jmccrae thanks, the XML version loads without issue now.

Sorry I'm a bit late to the discussion on XML IDs. I don't want to delay things, but I'm concerned about the method of replacement for punctuation, as noted in globalwordnet/schemas#55. Specifically, the use of dashes both to delimit escape sequences and as literal dash characters makes it ripe for collisions. Here's an example with both:

<LexicalEntry id="ewn-bird-ap-s-eye-s">

The relevant code is here:

elif c == '-':
return '-'

I don't think there are any issues now, but considering that, for example, ! becomes -ex- and some words contain ex- (e.g., ex-president), it's not a stretch to imagine we might end up with ex-ex- (e.g., ex-ex-president). Since these issues affect only XML IDs and are not specific to English, I propose that we have a lossless XML-ID sanitization function for all WN-LMF wordnets (not just OEWN). By "lossless" I really mean collision-free, as I don't think IDs should be parsed and interpreted, but it could be nice to recover the original ID.

@ekaf (Contributor) commented Oct 21, 2021

Even after sorting index.sense, though, NLTK's lemma_from_key still fails on 36938 sense keys, which reveals a very severe bug in the WNDB 2021 data.* files!

Consider for example the OEWN 2021 sense key 'abacus%1:06:01::', which NLTK expects to be 'abacus%1:06:00::', based on the fact that its lex_id is 0 in data.noun. The actual problem is not in index.sense but in data.noun, which has two different synsets for abacus, both with lex_id 0 in lexfile 06, which is not allowed. One of these lex_ids should be 1 instead, and then index.sense would (hopefully) be correct.
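
The invariant can be checked directly against data.noun (format per the wndb manual: synset offset, lex_filenum, ss_type, a hexadecimal word count, then word/lex_id pairs); a rough sketch, not the project's tooling:

from collections import defaultdict

seen = defaultdict(list)  # (lemma, lex_filenum, lex_id) -> synset offsets
with open("dict/data.noun") as f:
    for line in f:
        if line.startswith("  "):  # skip the license header lines
            continue
        fields = line.split()
        offset, lex_filenum = fields[0], fields[1]
        w_cnt = int(fields[3], 16)  # word count is hexadecimal
        for i in range(w_cnt):
            word = fields[4 + 2 * i].lower()
            lex_id = int(fields[5 + 2 * i], 16)
            seen[(word, lex_filenum, lex_id)].append(offset)

# Each (lemma, lexfile, lex_id) triple must be unique across synsets
for key, offsets in sorted(seen.items()):
    if len(offsets) > 1:
        print(key, offsets)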

@jmccrae (Member, Author) commented Oct 21, 2021

Good catch @ekaf. The WNDB export script did not understand the changed IDs. I have updated it now.

@ekaf (Contributor) commented Oct 21, 2021

It's much better now, but there is still a problem with 2562 adjective satellites, where NLTK expects the head to be the first lemma in the head synset. The head adjective is the one that has antonyms, and according to the senseidx manual:

head_word is only present if the sense is in an adjective satellite synset. It is the lemma of the first word of the satellite's head synset.

So the order of these lemmas is wrong in data.adj. Below are the first 10 examples, where the sense key is correct, but not the order of the head_word within its synset in data.adj.

WNDB 2021 sense key vs. what NLTK expects:
'a_priori%5:00:00:theoretical:00', 'a_priori%5:00:00:theoretic:00'
'abecedarian%5:00:00:alphabetic:00', 'abecedarian%5:00:00:alphabetical:00'
'aberrant%5:00:00:abnormal:00', 'aberrant%5:00:00:unnatural:03'
'ablated%5:00:00:decreased:00', 'ablated%5:00:00:reduced:04'
'ablaze%5:00:00:colorful:00', 'ablaze%5:00:00:colourful:00'
'ablaze%5:00:00:lighted:00', 'ablaze%5:00:00:lit:02'
'aboulic%5:00:00:neurotic:00', 'aboulic%5:00:00:psychoneurotic:00'
'abrasive%5:00:00:rough:00', 'abrasive%5:00:00:unsmooth:00'
'abscessed%5:00:00:septic:00', 'abscessed%5:00:00:infected:00'
'absolute%5:00:00:inalienable:00', 'absolute%5:00:00:unalienable:00'
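
The manual's rule can be checked mechanically; a minimal NLTK sketch (the synset name at the end is hypothetical, and the head's lex_id is recovered from its own sense key):

from nltk.corpus import wordnet as wn

def expected_head_fields(satellite):
    # head_word/head_id must come from the FIRST lemma of the head synset
    head_lemma = satellite.similar_tos()[0].lemmas()[0]
    # key layout: lemma%ss_type:lex_filenum:lex_id:head_word:head_id
    lex_id = head_lemma.key().split("%")[1].split(":")[2]
    return head_lemma.name().lower(), lex_id

print(expected_head_fields(wn.synset('a_priori.s.01')))  # hypothetical name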

@ekaf (Contributor) commented Oct 25, 2021

Currently, most OEWN synsets don't respect the sense ordering specified in the wndb manual:

Sense Numbers
Senses in WordNet are generally ordered from most to least frequently
used, with the most common sense numbered 1 . Frequency of use is
determined by the number of times a sense is tagged in the various
semantic concordance texts. Senses that are not semantically tagged
follow the ordered senses.

As a consequence, 2591 sense keys don't correspond to the lemma ordering in their synsets. Out of these, 23 are not adjective satellites, for ex.:

Bad key in Wn 2021: green_bean%1:20:01::
Bad key in Wn 2021: all_of_a_sudden%4:02:01::

Additionally, 86130 (73.12%) Synset wordnames from Wn 3.1 changed in Wn 2021, for example:

Key tartness%1:07:00:: in 3.1:Synset('acerbity.n.02'), 2021:Synset('tartness.n.02')
Wn 3.1: ['acerbity', 'tartness'] a sharp sour taste
Wn 2021: ['tartness', 'acerbity'] a sharp sour taste

And 4522 (3.84%) Synset wordnumbers from Wn 3.1 changed in Wn 2021, for example:

Key pasty%1:13:00:: in 3.1:Synset('pasty.n.01'), 2021:Synset('pasty.n.02')
Wn 3.1: ['pasty'] small meat pie or turnover
Wn 2021: ['pasty'] small meat pie or turnover

NLTK's synset identifiers are not expected to be completely stable across versions, but the number of changes is much lower between Princeton versions, where name changes are particularly rare:

130 (0.11%) Synset wordnames from Wn 3.0 changed in Wn 3.1
348 (0.3%) Synset wordnumbers from Wn 3.0 changed in Wn 3.1

@1313ou (Contributor) commented Oct 25, 2021

This refers to globalwordnet/schemas#17

@jmccrae (Member, Author) commented Oct 25, 2021

There was a mistake in the extraction script: it was using the position of the sense in the synset, not the entry. I have updated it now.

Note there are some slight divergences from Princeton here, as we now have a real idea of what a lexical entry is. So, for example, 'bass' is now distinguished between the two pronunciations and we don't rank the senses across the two entries. To ensure backwards compatibility I randomly list all senses of one entry before or after the other.

@ekaf (Contributor) commented Oct 26, 2021

There was a mistake in the extraction script and it was using the position of the sense in the synset not the entry. I have updated now.

Thanks @jmccrae, when and where will the updated WNDB be available? It doesn't seem to be at the usual place.

@jmccrae (Member, Author) commented Oct 26, 2021

Thanks @jmccrae, when and where will the updated WNDB be available? It doesn't seem to be at the usual place.

Should be here. I see the update at this link: http://john.mccr.ae/oewn2021/english-wordnet-2021.zip

@ekaf (Contributor) commented Oct 26, 2021

@jmccrae, that's not an update: it is still the version from last Thursday, with all the problems mentioned above.

@jmccrae (Member, Author) commented Oct 27, 2021

@ekaf This file has definitely been updated:

jmccrae@cirrocumulus [10:05:53] [~/scratch/tmp] 
-> % wget http://john.mccr.ae/oewn2021/english-wordnet-2021.zip
--2021-10-27 10:06:10--  http://john.mccr.ae/oewn2021/english-wordnet-2021.zip
Resolving john.mccr.ae (john.mccr.ae)... 128.199.47.101
Connecting to john.mccr.ae (john.mccr.ae)|128.199.47.101|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11298296 (11M) [application/zip]
Saving to: ‘english-wordnet-2021.zip’

english-wordnet-2021.zip                        100%[======================================================================================================>]  10.77M  9.13MB/s    in 1.2s    

2021-10-27 10:06:11 (9.13 MB/s) - ‘english-wordnet-2021.zip’ saved [11298296/11298296]

jmccrae@cirrocumulus [10:06:11] [~/scratch/tmp] 
-> % unzip english-wordnet-2021.zip 
Archive:  english-wordnet-2021.zip
   creating: dict/
  inflating: dict/index.noun         
  inflating: dict/index.verb         
  inflating: dict/index.sense        
  inflating: dict/cntlist            
  inflating: dict/data.noun          
  inflating: dict/verb.Framestext    
  inflating: dict/data.verb          
  inflating: dict/data.adv           
  inflating: dict/sents.vrb          
  inflating: dict/adv.exc            
  inflating: dict/cntlist.rev        
  inflating: dict/noun.exc           
  inflating: dict/data.adj           
  inflating: dict/sentidx.vrb        
  inflating: dict/verb.exc           
  inflating: dict/index.adv          
  inflating: dict/lexnames           
  inflating: dict/adj.exc            
 extracting: dict/cousin.exc         
  inflating: dict/index.adj          
jmccrae@cirrocumulus [10:06:13] [~/scratch/tmp] 
-> % grep "cat%" dict/index.sense 
abyssinian_cat%1:05:00:: 02140530 1 0
alley_cat%1:05:00:: 02138727 1 0
angora_cat%1:05:00:: 02139695 1 0
bear_cat%1:05:00:: 02530834 1 0
bearcat%1:05:00:: 02152320 1 0
bell_the_cat%2:41:00:: 02537396 1 0
big_cat%1:05:00:: 02144025 1 0
black_cat%1:05:00:: 02470966 1 0
blue_cat%1:05:00:: 02540895 1 0
blue_channel_cat%1:05:00:: 02540895 1 0
bobcat%1:05:00:: 02143699 1 0
burmese_cat%1:05:00:: 02140134 1 0
caffer_cat%1:05:00:: 02142089 1 0
calico_cat%1:05:00:: 02139459 1 0
cat%1:04:00:: 00903094 8 0
cat%1:05:00:: 02137837 1 0
cat%1:05:02:: 02144025 7 0
cat%1:06:00:: 03012554 5 0
cat%1:06:01:: 03010455 6 0
cat%1:06:02:: 03640970 4 0
cat%1:18:00:: 09991444 3 0
cat%1:18:01:: 10244691 2 0
cat%2:29:00:: 00075977 2 0
cat%2:35:00:: 01404894 1 0

@ekaf (Contributor) commented Oct 27, 2021

@jmccrae, what you write is true; however, running 'ls -l dict' shows that the data files are from "Oct 21", and comparing them with the previous release yields no difference.

@jmccrae (Member, Author) commented Oct 27, 2021

@ekaf The only file updated was index.sense. That file has a timestamp of Oct 25th.

@ekaf (Contributor) commented Oct 27, 2021

But the problem is the order of the words within their synsets in the data.* files...

@ekaf (Contributor) commented Oct 28, 2021

Updating index.sense has changed nothing w.r.t. the problem in WNDB (#747 (comment)). The same 2591 sense keys as before cannot be converted to NLTK lemmas. Here are the first 10 in sorted order:

Bad key in Wn 2021: a_priori%5:00:00:theoretical:00 (Wn31:Lemma('a_priori.s.02.a_priori'))
Bad key in Wn 2021: abecedarian%5:00:00:alphabetic:00 (Wn31:Lemma('abecedarian.s.01.abecedarian'))
Bad key in Wn 2021: aberrant%5:00:00:abnormal:00 (Wn31:Lemma('aberrant.s.01.aberrant'))
Bad key in Wn 2021: ablated%5:00:00:decreased:00 (Wn31:Lemma('ablated.s.01.ablated'))
Bad key in Wn 2021: ablaze%5:00:00:colorful:00 (Wn31:Lemma('ablaze.s.03.ablaze'))
Bad key in Wn 2021: ablaze%5:00:00:lighted:00 (Wn31:Lemma('ablaze.s.02.ablaze'))
Bad key in Wn 2021: aboulic%5:00:00:neurotic:00 (Wn31:Lemma('abulic.s.01.aboulic'))
Bad key in Wn 2021: abrasive%5:00:00:rough:00 (Wn31:Lemma('abrasive.s.01.abrasive'))
Bad key in Wn 2021: abscessed%5:00:00:septic:00 (Wn31:Lemma('abscessed.s.01.abscessed'))
Bad key in Wn 2021: absolute%5:00:00:inalienable:00 (Wn31:Lemma('absolute.s.05.absolute'))

But the problem is bigger than that: most sense keys (77%, actually) point to a different sense number than in PWN 3.1, because Wn 2021 does not order the lemmas within their synsets according to cntlist, as specified.

@ekaf (Contributor) commented Oct 28, 2021

Maybe the following output better illustrates the problem:

Bad key in Wn 2021 for Lemma('a_priori.s.02.a_priori'), index.sense:a_priori%5:00:00:theoretical:00 != datafile:a_priori%5:00:00:theoretic:00
Bad key in Wn 2021 for Lemma('abecedarian.s.01.abecedarian'), index.sense:abecedarian%5:00:00:alphabetic:00 != datafile:abecedarian%5:00:00:alphabetical:00
Bad key in Wn 2021 for Lemma('ablated.s.01.ablated'), index.sense:ablated%5:00:00:decreased:00 != datafile:ablated%5:00:00:reduced:04
Bad key in Wn 2021 for Lemma('ablaze.s.02.ablaze'), index.sense:ablaze%5:00:00:lighted:00 != datafile:ablaze%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.02.afire'), index.sense:afire%5:00:00:lighted:00 != datafile:afire%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.02.aflame'), index.sense:aflame%5:00:00:lighted:00 != datafile:aflame%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.02.aflare'), index.sense:aflare%5:00:00:lighted:00 != datafile:aflare%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.02.alight'), index.sense:alight%5:00:00:lighted:00 != datafile:alight%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.02.on_fire'), index.sense:on_fire%5:00:00:lighted:00 != datafile:on_fire%5:00:00:lit:02
Bad key in Wn 2021 for Lemma('ablaze.s.03.ablaze'), index.sense:ablaze%5:00:00:colorful:00 != datafile:ablaze%5:00:00:colourful:00

@ekaf (Contributor) commented Nov 1, 2021

A small number of WNDB data.* synsets have definitions with more than one "|", leading to parse failures.

awk -F"|" 'NF>2' data.*

02020442 00 a 01 rural 0 011 ! 02022522 a 0101 + 04961506 n 0102 & 02020981 a 0000 & 02021158 a 0000 & 02021320 a 0000 & 02021613 a 0000 & 02021727 a 0000 & 02021895 a 0000 & 02022057 a 0000 & 02022225 a 0000 & 02022388 a 0000 | of or relating to the countryside as opposed to the city; living in or characteristic of farming or country life| living in or characteristic of farming or country life; rural people; large rural households; unpaved rural roads; an economy that is basically rural; rural electrification; rural free delivery
00511460 02 r 07 suddenly 1 all_of_a_sudden 1 on_the_spur_of_the_moment 1 of_a_sudden 1 abruptly 1 dead 1 short 8 002 \ 01125863 a 0101 \ 00511460 r 0505 | quickly and without warning| happening unexpectedly| on impulse; without premeditation; he stopped suddenly; suddenly she felt a sharp pain in her side; he decided to go to Chicago on the spur of the moment; he made up his mind suddenly
07801774 13 n 0a moong 0 green_gram 0 monggo 0 green_bean 1 munggo 0 mung 0 mash_bean 0 mung_bean 0 green_soy 0 golden_gram 0 002 @ 07801444 n 0000 #p 12683521 n 0000 | seed of the mung bean plant used for food| food from the seed of the mung bean plant
12683921 20 n 0a moong 0 green_gram 0 monggo 0 green_bean 1 munggo 0 mung 1 mash_bean 0 mung_bean 1 green_soy 0 golden_gram 0 002 @ 13245376 n 0000 #p 12683521 n 0000 | seed of the mung bean plant; used for food| seed of the mung bean plant
02763303 32 v 01 tweet 1 002 @ 00987481 v 0000 @ 01065604 v 0000 02 + 02 00 + 08 00 | To post a message on Twitter| post tweets, i.e. short text messages on the popular social media website Twitter

These account for all 23 of the key errors that do not concern adjective satellites.

@jmccrae (Member, Author) commented Nov 1, 2021

I updated the WNDB files again. @ekaf I can't seem to reproduce these errors. Perhaps you could share the code that you are testing with?

@ekaf (Contributor) commented Nov 1, 2021

Of course @jmccrae, here is the code. But to support PWN 3.1, it requires the latest NLTK 'develop' branch plus the latest nltk_data; additionally, you need to copy the EWN 2021 wndb files to a directory called wordnet2021 under nltk_data/corpora and uncomment the line that mentions 'wordnet2021' in nltk/corpus/__init__.py, because NLTK does not yet support EWN.

import nltk
from nltk.corpus import wordnet31 as wn1
from nltk.corpus import wordnet2021 as wn2

v1 = wn1.get_version()
v2 = wn2.get_version()

# Collect the sense keys from each version's index.sense
keys1 = {l.split()[0].lower() for l in wn1.open('index.sense')}
keys2 = {l.split()[0].lower() for l in wn2.open('index.sense')}

# Every key shared by both versions should resolve to a lemma in both
for sk in keys1.intersection(keys2):
    try:
        lem = wn2.lemma_from_key(sk)
    except Exception:
        lem1 = wn1.lemma_from_key(sk)
        print("Bad key in Wn {}: {} (Wn {}:{})".format(v2, sk, v1, lem1))

@ekaf (Contributor) commented Nov 2, 2021

Congratulations @jmccrae, there are no more bad keys in the newest WNDB update, now that the ordering of words within their synsets is fixed.
The stability of the NLTK synset identifiers for EWN 2021 w.r.t. PWN 3.1 is much improved compared to #747 (comment), although it could still be better, since the ordering of synsets for each lemma in the index.POS* files has changed more than it probably needs to.
Here are extracts from the output of the attached test script synchanged.py.txt:

15 (0.01%) lemmas from Wn 3.1 changed in Wn 2021
These are mostly changes in capitalization, for ex.:

Key baroque%3:01:00::
    Wn 3.1: Lemma('baroque.a.02.baroque')
    Wn 2021: Lemma('baroque.a.02.Baroque')

2124 (1.8%) Synset id. wordnames from Wn 3.1 changed in Wn 2021
These are mostly proper names, with the long form now being preferred over the short form, for ex.:

Key lexington%1:04:00::
    Wn 3.1: Synset('lexington.n.03') ['Lexington', 'Concord', 'Lexington_and_Concord'] the first battle of the American Revolution (April 19, 1775)
    Wn 2021: Synset('battles_of_lexington_and_concord.n.01') ['Battles_of_Lexington_and_Concord', 'Lexington', 'Concord', 'Lexington_and_Concord'] the first battle of the American Revolution (April 19, 1775)

7277 (6.18%) Synset id. wordnumbers from Wn 3.1 changed in Wn 2021
These are caused by the new ordering of synsets in the index.POS* files, for ex.:

Key mindless%5:00:00:meaningless:00
    Wn 3.1: Synset('mindless.s.01') ['mindless'] lacking the thinking capacity characteristic of a conscious being
    Wn 2021: Synset('mindless.s.02') ['mindless'] lacking the thinking capacity characteristic of a conscious being; the shrieking of the mindless wind

Key fighting_joe_hooker%1:18:00::
    Wn 3.1: Synset('hooker.n.01') ['Hooker', 'Joseph_Hooker', 'Fighting_Joe_Hooker'] United States general in the Union Army who was defeated at Chancellorsville by Robert E. Lee (1814-1879)
    Wn 2021: Synset('hooker.n.04') ['Hooker', 'Joseph_Hooker', 'Fighting_Joe_Hooker'] United States general in the Union Army who was defeated at Chancellorsville by Robert E. Lee (1814-1879)

Key vocalist%1:18:00::
    Wn 3.1: Synset('singer.n.01') ['singer', 'vocalist', 'vocalizer', 'vocaliser'] a person who sings
    Wn 2021: Synset('singer.n.03') ['singer', 'vocalist', 'vocalizer', 'vocaliser'] a person who sings

@mirfan899 commented:

Thanks for redirecting. I'm getting the error below after updating NLTK from the latest develop branch and using the latest http://john.mccr.ae/oewn2021/english-wordnet-2021.zip

from nltk.corpus import wordnet2021

synsets = wordnet2021.all_synsets()
output = []

# Map every sense key to its "wn:<offset><pos>" identifier
for synset in synsets:
    try:
        lemmas = wordnet2021.synset(synset._name).lemmas()
        for lemma in lemmas:
            t = lemma.synset()._name.split(".")[1]      # POS letter from the synset name
            key = lemma.key()
            r = str(wordnet2021.ss2of(lemma.synset()))  # offset-pos string, e.g. "02137837-n"
            offset = r[:r.index("-")] + t
            output.append(key + " " + "wn:" + offset)
    except Exception:
        print(synset, "name Not found==================================================")

with open("sensekeys2offsets.txt", "w") as writer:
    for line in output:
        writer.write(line + "\n")

Here is the error.

Traceback (most recent call last):
  File "/home/irfan/PycharmProjects/StreusleSpacy/sensekeys2offsets.py", line 16, in <module>
    for synset in synsets:
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1721, in all_synsets
    synset = from_pos_and_line(pos_tag, line)
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1513, in _synset_from_pos_and_line
    frame_string_fmt = VERB_FRAME_STRINGS[frame_number]
IndexError: tuple index out of range

@ekaf (Contributor) commented Nov 3, 2021

The code above by @mirfan899 would run smoothly with the updated WNDB if the following out-of-range frame numbers were fixed. Isn't it surprising that such trivial bugs were not caught earlier?

Frame number 36 too high for 2764181-v
Frame number 37 too high for 2768667-v
Frame number 38 too high for 2768981-v
Frame number 39 too high for 2771745-v

@goodmami (Member) commented Nov 3, 2021

if the following out-of range frame numbers were fixed. Isn't it surprising that such trivial bugs have not been stopped earlier?

OEWN 2021 has 39 subcategorization frames instead of WordNet's 35. It seems the proper fix is for nltk.corpus.wordnet2021 to augment the VERB_FRAME_STRINGS data structure with the 4 new ones to avoid the IndexError.

edit: I was looking at the XML release of OEWN 2021 and not the WNDB one until just now. It looks like the verb.Framestext file only includes 35 frames. That could be fixed, at least, but NLTK would still require a change to VERB_FRAME_STRINGS as it's hard-coded.

Also, side note, the module name wordnet2021 is not accurate as it is not a new version of WordNet, but I agree that openenglishwordnet2021 is a bit long and oewn2021 a bit opaque. I'm pretty sure you're aware of the trademark issue, but also users of NLTK might misattribute the data to Princeton if it's just called wordnet2021.

@1313ou (Contributor) commented Nov 3, 2021

All this fuss (and need for extra code) for 4 verb frames that occur only ONCE (hardly a subcategorization then, is it?):

  • via-out-of
  • via-for
  • via via-on-anim
  • via via-at
<Sense id="oewn-scare_the_bejesus__2.37.01.." n="0" subcat="via-out-of" synset="oewn-91001671-v"/>
<Sense id="oewn-fix__2.36.02.." n="12" subcat="via-for" synset="oewn-90000281-v"/>
<Sense id="oewn-hate__2.32.01.." n="1" subcat="via via-on-anim" synset="oewn-90017091-v"/>
<Sense id="oewn-mug__2.39.01.." n="1" subcat="via via-at" synset="oewn-90015901-v"/>

They join Prince William, Duke of Cumberland, Capital: Critique of Political Economy and Yahoo! in the list of oddities
see #167

Is this worth breaking legacy code?

NB: The grinder I have written produces a wndb_compat preview version of oewn2021 that works as a seamless stand-in replacement for PWN and passes @ekaf's test with the current NLTK lib (which uses hard-wired verb frames).
https://github.com/x-englishwordnet/wndb
To be announced after oewn2021 comes out.

@ekaf (Contributor) commented Nov 3, 2021

Interesting dict, @1313ou! I confirm that it doesn't have the frame problem. It also passes the "Bad key" test, just as OEWN's latest WNDB release does. However, running the https://github.com/globalwordnet/english-wordnet/files/7459290/synchanged.py.txt script produces more changed wordnumbers, due to the index.POS* files not respecting the sense ranking frequencies: where WNDB has 7277, yours has 7938. In my opinion, this is a problem worth some "fuss":

15 (0.01%) lemmas from Wn 3.1 changed in Wn 3.1
2124 (1.8%) Synset id. wordnames from Wn 3.1 changed in Wn 3.1
7938 (6.74%) Synset id. wordnumbers from Wn 3.1 changed in Wn 3.1

NB: NLTK perceives your version as "3.1" because it extracts the version number from the 1740-byte license text in data.adj.

@jmccrae (Member, Author) commented Nov 3, 2021

I removed the links to frames > 35 in the WNDB version.

@ekaf (Contributor) commented Nov 4, 2021

There has been a lot of discussion around issue #747 since it was closed two weeks ago. Maybe it deserves reopening temporarily?

Concerning @goodmami's judicious remarks (#747 (comment)), hardcoding the verb frames as NLTK does is an NLTK bug. An issue should be raised there, but the newer verb frames should also be added to the verb.Framestext file, as in @1313ou's wndb distribution (https://github.com/x-englishwordnet/wndb), which is right now superior to the official OEWN export.

I have been thinking along the same lines as @goodmami, weighing advantages against drawbacks in calling the NLTK package 'wordnet2021'. There is no legal issue, since Princeton no longer holds a trademark on the WordNet name. Perhaps @jmccrae has some preference concerning the NLTK package name?

@goodmami (Member) commented Nov 4, 2021

hardcoding the verb frames as NLTK does is an NLTK bug. An issue should be raised there, but the newer verb frames should also be added to the verb.Framestext file

That would be a better fix, but note that the NLTK distribution of WordNet 3.0 does not include verb.Framestext, so the hardcoded values should be used for that data, while wn31 or OEWN 2021 could load them dynamically.
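
A sketch of that dynamic loading, assuming verb.Framestext lines of the form "&lt;number&gt; &lt;frame text&gt;"; the merging logic here is illustrative, not actual NLTK code:

from nltk.corpus import wordnet2021  # needs the develop-branch setup described above
from nltk.corpus.reader.wordnet import VERB_FRAME_STRINGS

frames = list(VERB_FRAME_STRINGS)
for line in wordnet2021.open("verb.Framestext"):
    num, _, text = line.strip().partition(" ")
    num = int(num)
    if num >= len(frames):           # grow the table for frames 36-39
        frames.extend([None] * (num - len(frames) + 1))
    frames[num] = text.strip()       # file entries extend the hard-coded defaults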

@jmccrae (Member, Author) commented Nov 5, 2021

Hi all, at this point I am satisfied that the remaining issues with the release are only minor. The XML and RDF versions appear to be completely stable; the WNDB is still a little problematic, but it is not possible to accommodate all of this project's modifications in that legacy format. It would IMHO be better if NLTK and other tools moved to the XML format, which is more stable.

Can I ask you all to please stop posting to this issue? Instead, please create new issues (using the "Release format issue" template).

I will post an announcement of the release to Corpora List, Linguist List, WN Users and some social media on Sunday.

I would like to thank you all for your contributions; it has all been incredibly useful, and I think we can all be proud of how much we have achieved with this release.
