
English "grey" is tokenized as NN, American "gray" as JJ #11787

Closed
vikingvynotking opened this issue Nov 10, 2022 · 5 comments
Closed

English "grey" is tokenized as NN, American "gray" as JJ #11787

vikingvynotking opened this issue Nov 10, 2022 · 5 comments
Labels
lang / en (English language data and models), perf / accuracy (Performance: accuracy)

Comments

vikingvynotking commented Nov 10, 2022

Traditional English uses grey for the colour between black and white; American English uses gray. Modern English accepts both; however, only gray is tagged as an adjective by spaCy.

If this is more appropriate as a discussion, please LMK.

How to reproduce the behaviour

In [1]: import spacy
In [2]: nlp = spacy.load('en_core_web_sm')
In [3]: doc = nlp('The grey wolf')
In [4]: doc[1].tag_, doc[1].pos_
Out[4]: ('NN', 'NOUN')
In [5]: doc = nlp('The gray wolf')
In [6]: doc[1].tag_, doc[1].pos_
Out[6]: ('JJ', 'ADJ')

Your Environment

Info about spaCy

  • spaCy version: 3.4.2
  • Platform: Linux-5.10.124-linuxkit-aarch64-with
  • Python version: 3.11.0
  • Pipelines: en_core_web_sm (3.4.1)
@polm polm added the lang / en (English language data and models) and perf / accuracy (Performance: accuracy) labels Nov 11, 2022
polm (Contributor) commented Nov 11, 2022

This is actually the kind of thing that belongs in #3052.

In my tests, while this happened specifically with "wolf", for many other phrases both "grey" and "gray" are tagged as adjectives. Looking at OntoNotes, "gray" is several times more common than "grey", so this looks like just normal variation. In particular, "Grey" is often a proper noun (a surname, or "Gandalf the Grey"), which seems to be influencing some of the wolf examples.

Since this seems to be normal variation, I don't think there's any action for us to take with it, though it would be good to revisit if we do more extensive data augmentation later.
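
For reference, a rough sketch of the kind of phrase-by-phrase comparison described above (assuming en_core_web_sm 3.4.x is installed; the exact tags will vary by model version):

# Compare how "grey" and "gray" are tagged across a handful of short phrases.
# Illustrative only; different model versions may give different tags.
import spacy

nlp = spacy.load("en_core_web_sm")

phrases = ["The grey wolf", "The gray wolf", "a grey day", "a gray day"]
for phrase in phrases:
    doc = nlp(phrase)
    token = doc[1]  # the colour word in each three-token phrase
    print(f"{phrase!r}: {token.text} -> {token.tag_} / {token.pos_}")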

@polm polm added the resolved (The issue was addressed / answered) label Nov 11, 2022
vikingvynotking (Author) commented Nov 15, 2022

Looking at OntoNotes, "gray" is several times more common than "grey"

I would be surprised if this were true for the entire corpus of English writing; it seems common enough in the texts I'm processing. I was, however, surprised to find that certain grey things are treated specially: grey wolf, grey elephant, and grey cloud have grey tagged not just as a noun but as an NNP, while grey day and a few others return grey as an adjective, as expected.

This is weird though:

In [25]:  [(x.pos_, x.tag_) for x in nlp('a grey rug')]
Out[25]: [('DET', 'DT'), ('NOUN', 'NN'), ('ADJ', 'JJ')]

In [26]:  [(x.pos_, x.tag_) for x in nlp('a gray rug')]
Out[26]: [('DET', 'DT'), ('ADJ', 'JJ'), ('NOUN', 'NN')]

I don't care where you're from, a rug is pretty much never an adjective :)

@github-actions github-actions bot removed the resolved (The issue was addressed / answered) label Nov 15, 2022
polm (Contributor) commented Nov 16, 2022

As mentioned in #3052, it's very hard to troubleshoot model issues from individual examples - the models aren't perfect, and sometimes they're just wrong for a particular case. Additionally, you mention "the entire corpus of English writing", but we have to work with the training data we have, which is OntoNotes. If we find a better dataset we'd be happy to move to it.

Note you'll probably get better results with larger models, or with complete sentences.
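
For example (just a sketch, assuming a larger pipeline such as en_core_web_lg has been downloaded with python -m spacy download en_core_web_lg; results still vary between model versions):

import spacy

# Load a larger pipeline and tag the word inside a complete sentence.
nlp_lg = spacy.load("en_core_web_lg")
doc = nlp_lg("The grey wolf trotted across the snowy field.")
print([(t.text, t.tag_, t.pos_) for t in doc])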

It does seem like there might be something weird with gray/grey, so we'll take a look at it, but it may take a while, or it may end up just being something weird due to our input data.

vikingvynotking (Author) commented Nov 16, 2022

Yeah, that's fair - I wasn't trying to imply the model should be perfect in any way; I'm just running some tests over a few documents and these things jump out at me for some reason. Thanks for your response; I'll try to find some work-arounds for the oddities I discover.
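
For example, one workaround I might try is a rule in the pipeline's attribute_ruler that re-tags lowercase "grey"/"gray" as adjectives (a rough sketch only: this blunt override would also re-tag genuine noun uses such as "a shade of grey", while capitalised "Grey"/"Gray", e.g. surnames, are left alone):

import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule to the existing attribute_ruler component; rules added later
# should take precedence over the built-in tag-map rules.
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(
    patterns=[[{"ORTH": {"IN": ["grey", "gray"]}}]],  # lowercase forms only
    attrs={"TAG": "JJ", "POS": "ADJ"},
)

doc = nlp("a grey rug")
print([(t.text, t.tag_, t.pos_) for t in doc])
# "grey" should now come out as JJ/ADJ; the tag for "rug" still comes from the tagger.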

Note you'll probably get better results ... with complete sentences.

The examples I provided are simplifications of real-world data; I didn't really want to paste the entire text here :)

Anyway, thanks again, spaCy is phenomenal so I might just be too picky!

github-actions (bot) commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2022