2021 Release Candidate #747
The RDF version works for me. I could owl:import it into VocBench 3 (with GraphDB Free as backend and with RDF4J) without a problem, and so far everything looks fine. One very minor thing: the wn ontology (https://globalwordnet.github.io/schemas/wn#) had to be owl:imported separately (I could not find an owl:imports statement in the Turtle file). |
The first WNDB export is useless because: Also, no frames (+) are present in data.verb. |
The RDF version includes examples. FWIW, here are the statistics.
|
Thanks, I have updated the candidates:
|
In the GWA XML file, the new Sense id format is impractical to parse because the underscore is overloaded as both a key separator and a space in multi-word expressions, for ex. id="ewn-a_la_carte__3_00_00__". XML Relation statistics:
|
@ekaf sense keys are parsed and verified at every commit. The actual implementation is here: |
All synsets in the data.* files still have lex_filenum 45 (which is not a valid number) instead of their correct lex_filenum. In the second WNDB export, index.sense is not in the proper PWN format, because the sense keys use __ instead of the specified %. Also, in index.sense, the sense keys should not be prefixed with "ewn-", since almost all of them are PWN 3.1 keys. WNDB has fewer relations than the XML file: 384509 - 383876 = 633. This difference is fully explained by the absent relations. WNDB relation statistics:
|
Concerning the new XML Sense id format, I still find it impractical to extract the individual components of an id like "ewn-a_la_carte__3_00_00__", where there is no fixed number of components, since underscore is used both inside and outside lemmas, and __ may appear once or twice. Isn't the ID format too important a change to make at the last minute before release? It deserves a (rather big) issue... |
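To make the parsing difficulty concrete, here is a minimal sketch; the `split_sense_id` helper and its splitting rule are my own assumptions, not part of any release script. Splitting on the double underscore only works if no lemma ever contains a literal double underscore, and the remaining fields still have to be counted rather than named:

```python
# Hypothetical parser for IDs like "ewn-a_la_carte__3_00_00__".
# Assumption: a single "_" encodes a space inside the lemma, and the first
# "__" separates the lemma from the remaining key fields.
def split_sense_id(sense_id):
    body = sense_id[len("ewn-"):] if sense_id.startswith("ewn-") else sense_id
    lemma_part, _, key_part = body.partition("__")
    lemma = lemma_part.replace("_", " ")
    fields = key_part.split("_")  # a trailing "__" yields empty fields
    return lemma, fields

print(split_sense_id("ewn-a_la_carte__3_00_00__"))
# ('a la carte', ['3', '00', '00', '', ''])
```

Note how the empty trailing fields have to be interpreted positionally; nothing in the string itself says which component is which.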
Yes, any separator that allows uncomplicated extraction of the individual parts would be fine. Actually, ekaf/ski#1 (comment) already pointed out the problem of using colons (:) in XML IDs back in January 2020, since the original XML specification discourages it. |
Not really "a few months ago" but July 15, |
Maybe reopen #662, so that the issue can be discussed? The introduction of single underscores was even less announced than the colons. |
I have reopened, but I hope that the use of periods is the most acceptable solution and that we can close this quickly to proceed with the release. |
Wouldn't it have been simpler to keep the previous ID and also preserve the sense key as it was (parseable, with colons as field separators), as a simple (legacy) key that doesn't have to meet the xsd:ID requirements? |
As proposed by @1313ou, keeping the original verbatim sense keys would be ideal, if feasible |
The reason for the change was a bug report by @1313ou (globalwordnet/schemas#5 and globalwordnet/schemas#25). Please note that we also intend to drop the XML as a working format (see #664), and the YAML version has no issues with using the Princeton sense keys as is. Please also note that the previous XML ID scheme is highly problematic to calculate and involves a lot of 'encoded' information. I am happy to reintroduce the verbatim sense keys as another attribute, but it seems pointless when they can easily be obtained from the IDs. |
Can you elaborate on that? If it was used for the 2020 release, why shouldn't it work now? globalwordnet/schemas#5 was not in favour of using sense keys as IDs; it was intended to clarify their status as inner (generated) keys or foreign legacy keys (referring to PWN). |
The old sense IDs encoded the ordering of members of a synset. This is now captured by the |
So now that the ordering is readily available, I take it that it should be easy to use when generating IDs. |
I updated the Release Candidate (WNDB Export)
|
Yeah... sort of... but it is easier to remove the hack entirely. Part of the issue is that incoming links would get broken by rearrangements of the order of members in the synset, as this changed the sense IDs that were used as targets. |
I tried loading the XML file with Wn but got this:

```python
>>> import wn
>>> wn.add('english-wordnet-2021.xml')
Skipping ewn:2020 (Open English WordNet); already added
```

This is because the |
In wndb, everything is ok now: there are no ambiguous sense keys, no transitive loops, every symmetric relation is symmetric, and the asymmetric ones are as expected. There is only one minor annoyance left in data.noun: the gloss at offset 15539647 has a few spurious backslashes: "We, the intellectually curious, may soon find ourselves trapped in a pen,\ \ fenced in by rule-bound sticklerism and overzealous concern for our personal\ \ safety, unless we exercise our civil liberties and our curiosity," he declaims. |
@jmccrae wrote "I am happy to reintroduce the verbatim sense keys as another attribute but it seems pointless when they can easily be obtained from the IDs." Yes, if sense keys can be obtained from the IDs by a simple translation of the separators, i.e. without a lookup in an extensive mapping, then there is no need to have two different attributes for the same information. |
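If that holds, the reverse translation really is trivial. A sketch, assuming (per the discussion above, not any documented spec) that the only changes in index.sense keys are the "ewn-" prefix and "__" in place of "%":

```python
# Recover a PWN-style sense key from an OEWN index.sense key.
# Assumption: stripping "ewn-" and restoring the first "__" to "%" is the
# whole mapping; anything more would require a lookup table.
def pwn_sense_key(oewn_key):
    body = oewn_key[len("ewn-"):] if oewn_key.startswith("ewn-") else oewn_key
    return body.replace("__", "%", 1)

print(pwn_sense_key("ewn-abacus__1:06:01::"))
# abacus%1:06:01::
```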
Updates on Release Candidates
|
@jmccrae thanks, the XML version loads without issue now. Sorry I'm a bit late to the discussion on XML IDs. I don't want to delay things, but I'm concerned about the replacement method for punctuation, as noted in globalwordnet/schemas#55. Specifically, the use of dashes both to delimit escape sequences and as literal dash characters makes it ripe for collisions. Here's an example with both: <LexicalEntry id="ewn-bird-ap-s-eye-s"> The relevant code is in english-wordnet/scripts/wordnet.py, lines 774 to 775, at 82b7103
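A minimal demonstration of why such a scheme is not invertible (the `encode` function and the lemma "can-ap-t" are hypothetical; only the "-ap-" escape for an apostrophe is taken from the example above):

```python
# Encode an apostrophe as "-ap-" while also allowing literal dashes.
def encode(lemma):
    return lemma.replace("'", "-ap-")

# Two different inputs collide on the same encoded form, so the original
# lemma cannot be recovered from the ID alone.
assert encode("can't") == encode("can-ap-t") == "can-ap-t"
```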
I don't think there are any issues now, but considering that, for example, |
Even after sorting index.sense, NLTK's lemma_from_key still fails on 36938 sense keys, which reveals a very severe bug in the WNDB 2021 data.* files! Consider for ex. the OEWN 2021 sense key 'abacus%1:06:01::', which NLTK expects to be 'abacus%1:06:00::', based on the fact that its lex_id is 0 in data.noun. The actual problem is not in index.sense but in data.noun, which has two different synsets for abacus, both with lex_id 0 in lexfile 06, which is not allowed. One of these lex_ids should be 1 instead, and then index.sense would (hopefully) be correct. |
Good catch @ekaf. The WNDB export script did not understand the changed IDs. I have updated now |
It's much better now, but there is still a problem with 2562 adjective satellites, where NLTK expects the head to be the first lemma in the head synset. The head adjective is the one that has antonyms, and according to the senseidx manual:
So the order of these lemmas is wrong in data.adj. Below are the first 10 examples, where the sense key is correct but the order of the head_word within its synset in data.adj is not. WNDB 2021 sense keys vs. what NLTK expects: |
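The constraint can be illustrated with a small sketch (a hypothetical helper; the field layout follows the senseidx manual's description of satellite sense keys):

```python
# Build a satellite (ss_type 5) sense key. Per the manual, head_word must be
# the FIRST lemma of the head synset in the adjective cluster.
def satellite_sense_key(lemma, lex_filenum, lex_id, head_members, head_id):
    head_word = head_members[0]   # first member by spec, not "any member"
    return "%s%%5:%02d:%02d:%s:%02d" % (
        lemma, lex_filenum, lex_id, head_word, head_id)

print(satellite_sense_key("a_priori", 0, 0, ["theoretical", "theoretic"], 0))
# a_priori%5:00:00:theoretical:00
```

If the head synset's members are stored in the wrong order, the key built from the first member no longer matches the key in index.sense, which is exactly the mismatch NLTK trips over.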
Currently, most OEWN synsets don't respect the sense ordering specified in the wndb manual:
As a consequence, 2591 sense keys don't correspond to the lemma ordering in their synsets. Out of these, 23 are not adjective satellites, for ex.:
Additionally, 86130 (73.12%) Synset wordnames from Wn 3.1 changed in Wn 2021, for ex.:
And 4522 (3.84%) Synset wordnumbers from Wn 3.1 changed in Wn 2021, for ex.:
NLTK's synset identifiers are not expected to be completely stable across versions, but the number of changes is much lower between Princeton versions, where name changes are particularly rare:
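The intended ordering can be sketched as follows (hypothetical code; the `tag_counts` mapping from sense keys to cntlist tag counts is an assumed input, not an existing API):

```python
def order_synset_members(members, tag_counts):
    """Sort (lemma, sense_key) pairs by decreasing cntlist frequency, as
    the wndb manual specifies; Python's sort is stable, so untagged
    senses keep their original relative order."""
    return sorted(members, key=lambda m: -tag_counts.get(m[1], 0))

members = [("frame", "frame%1:06:00::"), ("abacus", "abacus%1:06:01::")]
tag_counts = {"abacus%1:06:01::": 5, "frame%1:06:00::": 1}
print(order_synset_members(members, tag_counts))
# abacus comes first: it has the higher tag count
```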
|
This refers to globalwordnet/schemas#17 |
There was a mistake in the extraction script: it was using the position of the sense in the synset, not the entry. I have updated now. Note there are some slight divergences from Princeton here, as we now have a real idea of what a lexical entry is. So, for example, 'bass' is now distinguished between the two pronunciations, and we don't rank the senses across the two entries. To ensure backwards compatibility I randomly list all senses of one entry before or after the other. |
Thanks @jmccrae, when and where will the updated WNDB be available? It doesn't seem to be at the usual place. |
Should be here. I see the update at this link: http://john.mccr.ae/oewn2021/english-wordnet-2021.zip |
@jmccrae, that's not an update: it is still the version from last Thursday, with all the problems mentioned above. |
@ekaf This file has definitely been updated
|
@jmccrae, what you write is true; however, 'ls -l dict' shows that the data files are from "Oct 21", and comparing them with the previous release yields no difference. |
@ekaf The only file updated was |
But the problem is the order of the words within their synsets in the data.* files... |
Updating index.sense has changed nothing wrt. the problem in WNDB (#747 (comment)). The same 2591 sense keys as before cannot be converted to NLTK lemmas. Here are the first 10 in sorted order:

Bad key in Wn 2021: a_priori%5:00:00:theoretical:00 (Wn31:Lemma('a_priori.s.02.a_priori'))

But the problem is bigger than that, because most sense keys (77%, actually) point to a different sense number than in PWN 3.1, because Wn 2021 does not order the lemmas within their synsets according to cntlist as specified. |
Maybe the following output better illustrates the problem:

Bad key in Wn 2021 for Lemma('a_priori.s.02.a_priori'), index.sense:a_priori%5:00:00:theoretical:00 != datafile:a_priori%5:00:00:theoretic:00 |
A small number of WNDB data.* synsets have definitions with more than one "|", leading to parse failures.
02020442 00 a 01 rural 0 011 ! 02022522 a 0101 + 04961506 n 0102 & 02020981 a 0000 & 02021158 a 0000 & 02021320 a 0000 & 02021613 a 0000 & 02021727 a 0000 & 02021895 a 0000 & 02022057 a 0000 & 02022225 a 0000 & 02022388 a 0000 | of or relating to the countryside as opposed to the city; living in or characteristic of farming or country life| living in or characteristic of farming or country life; rural people; large rural households; unpaved rural roads; an economy that is basically rural; rural electrification; rural free delivery

These explain all 23 key errors which do not concern adjective satellites.
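Parsers can be made robust against this by taking everything after the first "|" as the gloss. A defensive sketch (not the NLTK implementation):

```python
def split_data_line(line):
    # Partition on the FIRST "|" only, so any extra pipes end up inside
    # the gloss instead of breaking field parsing.
    fields, _, gloss = line.partition("|")
    return fields.split(), gloss.strip()

fields, gloss = split_data_line(
    "02020442 00 a 01 rural 0 011 | first gloss| second gloss")
print(gloss)
# first gloss| second gloss
```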
I updated the WNDB files again. @ekaf I can't seem to reproduce these errors. Perhaps you could share with me the code that you are testing? |
Of course @jmccrae, here is the code. But to support PWN 3.1, it requires the latest NLTK 'develop' branch, plus the latest nltk_data, and additionally you need to copy the EWN 2021 wndb files to a directory called wordnet2021 under nltk_data/corpora, and uncomment the line that mentions 'wordnet2021' in nltk/corpus/__init__.py, because NLTK does not yet support EWN.
|
Congratulations @jmccrae, there are no more bad keys in the newest WNDB update, now that the ordering of words within their synsets is fixed. 15 (0.01%) lemmas from Wn 3.1 changed in Wn 2021
2124 (1.8%) Synset id. wordnames from Wn 3.1 changed in Wn 2021
7277 (6.18%) Synset id. wordnumbers from Wn 3.1 changed in Wn 2021
|
Thanks for redirecting. I'm getting the error after updating.

```python
from nltk.corpus import wordnet2021

synsets = wordnet2021.all_synsets()
output = []
for synset in synsets:
    try:
        lemmas = wordnet2021.synset(synset._name).lemmas()
        for lemma in lemmas:
            t = lemma.synset()._name.split(".")[1]
            key = lemma.key()
            r = str(wordnet2021.ss2of(lemma.synset()))
            offset = r[:r.index("-")] + t
            output.append(key + " " + "wn:" + offset)
    except:
        print(synset, "name Not found==================================================")

with open("sensekeys2offsets.txt", "w") as writer:
    for line in output:
        writer.write(line + "\n")
```

Here is the error.

```
Traceback (most recent call last):
  File "/home/irfan/PycharmProjects/StreusleSpacy/sensekeys2offsets.py", line 16, in <module>
    for synset in synsets:
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1721, in all_synsets
    synset = from_pos_and_line(pos_tag, line)
  File "/home/irfan/environments/ewiser/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1513, in _synset_from_pos_and_line
    frame_string_fmt = VERB_FRAME_STRINGS[frame_number]
IndexError: tuple index out of range
```
|
The code above by @mirfan899 would run smoothly with the updated WNDB if the following out-of-range frame numbers were fixed. Isn't it surprising that such trivial bugs were not caught earlier?
|
OEWN 2021 has 39 subcategorization frames instead of WordNet's 35. It seems the proper fix is for

edit: I was looking at the XML release of OEWN 2021 and not the WNDB one, until just now. It looks like the

Also, side note, the module name |
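One defensive fix on the consumer side (a sketch; `VERB_FRAME_STRINGS` here is a stand-in list with abbreviated strings, not NLTK's actual tuple) is to tolerate unknown frame numbers instead of indexing blindly:

```python
# Stand-in table: index 0 unused, frames 1..35 defined (strings abbreviated).
VERB_FRAME_STRINGS = [None] + ["frame %d" % i for i in range(1, 36)]

def frame_string(frame_number):
    # Return None for frame numbers outside the table (e.g. OEWN's 36-39)
    # rather than raising IndexError as a bare table lookup would.
    if 0 < frame_number < len(VERB_FRAME_STRINGS):
        return VERB_FRAME_STRINGS[frame_number]
    return None

print(frame_string(35))   # frame 35
print(frame_string(38))   # None
```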
All this fuss (and the need for extra code) for 4 verb frames that each occur only ONCE (hardly a subcategorization then, is it?):
They join Prince William, Duke of Cumberland, Capital: Critique of Political Economy and Yahoo! in the list of oddities. Is this worth breaking legacy code? NB: The grinder I have written produces a wndb_compat preview version of oewn2021 that works as a seamless stand-in replacement for PWN and passes @ekaf's test with the current NLTK lib (which uses hardwired verb frames). |
Interesting dict, @1313ou! I confirm that it doesn't have the frame problem. It also passes the "Bad key" test, just as OEWN's latest WNDB release does. However, running the https://github.com/globalwordnet/english-wordnet/files/7459290/synchanged.py.txt script produces more changed wordnumbers, due to the index.POS* files not respecting the sense ranking frequencies: where WNDB has 7277, yours has 7938. In my opinion, this is a problem worth some "fuss":
NB: NLTK perceives your version as "3.1" because it extracts the version number from the 1740-byte license text in data.adj |
I removed the links to frames > 35 in the WNDB version. |
There has been a lot of discussion around issue #747 since it was closed two weeks ago. Maybe it deserves reopening temporarily? Concerning @goodmami's judicious remarks (#747 (comment)): hardcoding the verb frames as NLTK does is an NLTK bug. An issue should be raised there, but the newer verb frames should also be added to the verb.Framestext file, as in @1313ou's wndb distribution (https://github.com/x-englishwordnet/wndb), which is right now superior to the official OEWN export. I have been thinking along the same lines as @goodmami, weighing advantages against drawbacks of calling the NLTK package 'wordnet2021'. There is no legal issue, since Princeton no longer holds a trademark on the WordNet name. Perhaps @jmccrae has a preference concerning the NLTK package name? |
That would be a better fix, but note that the NLTK distribution of WordNet 3.0 does not include |
Hi all, I am at this point satisfied that the remaining issues with the release are only minor. The XML and RDF versions appear to be completely stable; the WNDB is still a little problematic, but it is not possible to accommodate all modifications made by this project within this legacy format. It would IMHO be better if NLTK and other tools moved to the XML format, which is more stable. Can I ask you all to please stop posting to this issue? Instead, please create new issues (using the "Release format issue" template). I will post an announcement of the release to Corpora List, Linguist List, WN Users and some social media on Sunday. I would like to thank you all for your contributions; it has all been incredibly useful, and I think we can all be proud of how much we have achieved with this release. |
All issues are now closed and the 2021 candidate is ready to be released.
I have the candidate files here:
Please let me know if there are any issues