English "grey" is tokenized as NN, American "gray" as JJ #11787
Comments
This is actually the kind of thing that belongs in #3052. In my tests, this happened specifically with "wolf"; in many other phrases both "grey" and "gray" are tagged as adjectives. Looking at OntoNotes, "gray" is several times more common than "grey", so this looks like just normal variation. In particular, "Grey" is often a proper noun (a surname, or "Gandalf the Grey"), which seems to be influencing some of the wolf examples. Since this seems to be normal variation, I don't think there's any action for us to take, though it would be good to revisit if we do more extensive data augmentation later.
I would be surprised if this is true for the entire corpus of English writing; "grey" seems common enough in the texts I'm processing. I was, however, surprised to find that a few grey things are treated specially: in "grey wolf", "grey elephant", and "grey cloud", "grey" is tagged not just as a noun but as an NNP, while "grey day" and a few others return "grey" as an adjective, as expected. This is weird though:
I don't care where you're from, a rug is pretty much never an adjective :)
As mentioned in #3052, it's very hard to troubleshoot issues with the models from individual examples: they aren't perfect, and sometimes they're just wrong for a particular case. Additionally, you mention "the entire corpus of English writing", but we have to work with the training data we have, which is OntoNotes. If we find a better dataset we'd be happy to move to it. Note that you'll probably get better results with larger models, or with complete sentences. It does seem like there might be something weird with gray/grey, so we'll take a look at it, but it may take a while, or it may end up just being something weird due to our input data.
Yeah, that's fair - I wasn't trying to imply the model should be perfect in any way, I'm just running some tests over a few documents and these things jump out at me for some reason. Thanks for your response; I'll try to find some work-arounds for the oddities I discover.
The examples I provided are simplifications of real-world data; I didn't really want to paste the entire text here :) Anyway, thanks again, spaCy is phenomenal so I might just be too picky!
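One possible work-around of the kind mentioned above is to normalize the spelling before the text reaches the tagger, since "gray" is the more common form in the training data. The sketch below is only an illustration under stated assumptions: `normalize_grey` is a hypothetical helper, not part of spaCy, and it deliberately leaves capitalized "Grey" alone so that surnames such as "Gandalf the Grey" are not rewritten.

```python
import re

# Hypothetical helper (not a spaCy API): matches the lowercase whole word only,
# so "Grey" (often a proper noun) and words like "greyhound" are left untouched.
_LOWERCASE_GREY = re.compile(r"\bgrey\b")


def normalize_grey(text: str) -> str:
    """Replace lowercase 'grey' with 'gray' before passing text to a tagger."""
    return _LOWERCASE_GREY.sub("gray", text)


if __name__ == "__main__":
    for sample in ["a grey wolf", "Gandalf the Grey", "a greyhound"]:
        print(repr(sample), "->", repr(normalize_grey(sample)))
```

Because both spellings have the same length, character offsets into the original text stay valid after normalization. A sentence-initial "Grey" is not rewritten by this sketch, and whether the tagger then gives the expected result still depends on the model.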
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Traditional English uses grey for the colour between black and white; American English uses gray. Modern English accepts both; however, only gray is tagged as an adjective in spaCy.
If this is more appropriate as a discussion, please LMK.
How to reproduce the behaviour
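The reproduction snippet is not preserved in this copy of the issue. As a rough sketch of how the comparison could be run, the code below tags the phrases mentioned in the thread with both spellings. The model name `en_core_web_sm` and the helper names `tag_table` and `main` are assumptions for illustration; the thread does not say which model was used, and results may differ between models and versions.

```python
def tag_table(nlp, phrases):
    """Return, for each phrase, a list of (text, coarse POS, fine-grained tag)."""
    rows = []
    for phrase in phrases:
        doc = nlp(phrase)
        rows.append([(token.text, token.pos_, token.tag_) for token in doc])
    return rows


def main():
    # Requires spaCy plus a downloaded English pipeline; the model name is an assumption.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    nouns = ["wolf", "elephant", "cloud", "day"]  # phrases mentioned in the thread
    phrases = [f"{colour} {noun}" for noun in nouns for colour in ("grey", "gray")]
    for row in tag_table(nlp, phrases):
        print(row)
```

Calling `main()` prints one row per phrase, so the "grey" and "gray" variants of each noun appear next to each other for comparison.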
Your Environment
Info about spaCy