
English "grey" is tokenized as NN, American "gray" as JJ #11787

Closed
vikingvynotking opened this issue Nov 10, 2022 · 5 comments
Closed

English "grey" is tokenized as NN, American "gray" as JJ #11787

vikingvynotking opened this issue Nov 10, 2022 · 5 comments
Labels
lang / en (English language data and models), perf / accuracy (Performance: accuracy)

Comments

vikingvynotking commented Nov 10, 2022

Traditional English uses grey for the colour between black and white; American English uses gray. Modern English accepts both; however, only gray is tagged as an adjective by spaCy.

If this is more appropriate as a discussion, please LMK.

How to reproduce the behaviour

In [1]: import spacy
In [2]: nlp = spacy.load('en_core_web_sm')
In [3]: doc = nlp('The grey wolf')
In [4]: doc[1].tag_, doc[1].pos_
Out[4]: ('NN', 'NOUN')
In [5]: doc = nlp('The gray wolf')
In [6]: doc[1].tag_, doc[1].pos_
Out[6]: ('JJ', 'ADJ')

Your Environment

Info about spaCy

  • spaCy version: 3.4.2
  • Platform: Linux-5.10.124-linuxkit-aarch64-with
  • Python version: 3.11.0
  • Pipelines: en_core_web_sm (3.4.1)
@polm polm added the lang / en (English language data and models) and perf / accuracy (Performance: accuracy) labels Nov 11, 2022
polm (Contributor) commented Nov 11, 2022

This is actually the kind of thing that belongs in #3052.

In my tests, while this happened specifically with "wolf", for many other phrases both "grey" and "gray" are tagged as adjectives. Looking at OntoNotes, "gray" is several times more common than "grey", so this looks like just normal variation. In particular, "Grey" is often a proper noun (a surname, or "Gandalf the Grey"), which seems to be influencing some of the wolf examples.

Since this seems to be normal variation, I don't think there's any action for us to take with it, though it would be good to revisit if we do more extensive data augmentation later.
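
For reference, a rough sketch of the kind of phrase-by-phrase comparison described above (assuming en_core_web_sm 3.4.x is installed; the exact tags will vary by model version):

# Compare how "grey" and "gray" are tagged across a handful of short phrases.
# Illustrative only; different model versions may give different tags.
import spacy

nlp = spacy.load("en_core_web_sm")

phrases = ["The grey wolf", "The gray wolf", "a grey day", "a gray day"]
for phrase in phrases:
    doc = nlp(phrase)
    token = doc[1]  # the colour word in each three-token phrase
    print(f"{phrase!r}: {token.text} -> {token.tag_} / {token.pos_}")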

@polm polm added the resolved (The issue was addressed / answered) label Nov 11, 2022
vikingvynotking (Author) commented Nov 15, 2022

Looking at OntoNotes, "gray" is several times more common than "grey"

I would be surprised if this were true for the entire corpus of English writing; it seems common enough in the texts I'm processing. I was, however, surprised to find that certain grey things are treated specially: grey wolf, grey elephant, and grey cloud have grey tagged not just as a noun but as an NNP, while grey day and a few others return grey as an adjective, as expected.

This is weird though:

In [25]:  [(x.pos_, x.tag_) for x in nlp('a grey rug')]
Out[25]: [('DET', 'DT'), ('NOUN', 'NN'), ('ADJ', 'JJ')]

In [26]:  [(x.pos_, x.tag_) for x in nlp('a gray rug')]
Out[26]: [('DET', 'DT'), ('ADJ', 'JJ'), ('NOUN', 'NN')]

I don't care where you're from, a rug is pretty much never an adjective :)

@github-actions github-actions bot removed the resolved (The issue was addressed / answered) label Nov 15, 2022
polm (Contributor) commented Nov 16, 2022

As mentioned in #3052, it's very hard to troubleshoot model issues from individual examples - the models aren't perfect, and sometimes they're just wrong for a particular case. Additionally, you mention "the entire corpus of English writing", but we have to work with the training data we have, which is OntoNotes. If we find a better dataset we'd be happy to move to it.

Note you'll probably get better results with larger models, or with complete sentences.
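
For example (just a sketch, assuming a larger pipeline such as en_core_web_lg has been downloaded with python -m spacy download en_core_web_lg; results still vary between model versions):

import spacy

# Load a larger pipeline and tag the word inside a complete sentence.
nlp_lg = spacy.load("en_core_web_lg")
doc = nlp_lg("The grey wolf trotted across the snowy field.")
print([(t.text, t.tag_, t.pos_) for t in doc])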

It does seem like there might be something weird with gray/grey, so we'll take a look at it, but it may take a while, or it may end up just being something weird due to our input data.

vikingvynotking (Author) commented Nov 16, 2022

Yeah, that's fair - I wasn't trying to imply the model should be perfect in any way; I'm just running some tests over a few documents and these things jump out at me for some reason. Thanks for your response; I'll try to find some work-arounds for the oddities I discover.
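
For example, one workaround I might try is a rule in the pipeline's attribute_ruler that re-tags lowercase "grey"/"gray" as adjectives (a rough sketch only: this blunt override would also re-tag genuine noun uses such as "a shade of grey", while capitalised "Grey"/"Gray", e.g. surnames, are left alone):

import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule to the existing attribute_ruler component; rules added later
# should take precedence over the built-in tag-map rules.
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(
    patterns=[[{"ORTH": {"IN": ["grey", "gray"]}}]],  # lowercase forms only
    attrs={"TAG": "JJ", "POS": "ADJ"},
)

doc = nlp("a grey rug")
print([(t.text, t.tag_, t.pos_) for t in doc])
# "grey" should now come out as JJ/ADJ; the tag for "rug" still comes from the tagger.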

Note you'll probably get better results ... with complete sentences.

The examples I provided are simplifications of real-world data; I didn't really want to paste the entire text here :)

Anyway, thanks again, spaCy is phenomenal so I might just be too picky!

github-actions (bot) commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 17, 2022