Remove questionable stopwords from default lists #6307
Replies: 2 comments
---
Stop word lists are pretty much always task-specific, so it's hard to provide good general-purpose lists. They're also less relevant than they used to be, and I would personally prefer to stop shipping lists with the library (not the stop word functionality itself), partly to discourage knee-jerk stop word filtering where it may no longer make sense. But they've been in the library since the beginning, many people rely on them, and maintaining the current lists isn't a huge burden.

I'm not sure about the history of the French stop word list. It's true that we don't have good documentation on the sources of many of the stop words, and user contributions have led to modifications for many languages. We do welcome PRs here, although we'd be wary of making major changes to the English stop word list at this point, since that would be unexpected for many users.

In any case, it doesn't sound like the provided stop word lists are the right match for your task. After loading a model, you can update the stop word status of any lexeme (token type) with:

```python
nlp.vocab["word"].is_stop = True
nlp.vocab["word"].is_stop = False
```

As of v2.3.0+, the stop word flags aren't saved with the model, so you'd need to do this every time you load a model. (This is a regression that I didn't catch due to #5238. We may need to take another look here, but given how the flags are implemented, I'm not sure there's going to be a good solution.)
---
Hi @adrianeboyd,

That makes sense. Same here, I would rather get rid of general-purpose stop words in my application... but they're already there and many people use them, so maintenance is the word. I will open a PR for the French stop word list, since it has less impact than the English one. FYI, the nltk stop word lists are much more conservative; perhaps they could be an inspiration: https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip

Cheers,
Alex
---
Hi everyone,
I got myself into an interesting rabbit hole recently: stopwords.
My context is that I need to bundle default stop words in my own application. Since the application is industry-agnostic, I am looking for generic stop words that are not domain-specific. In addition, these stop words should work well with several tokenizers: a simple whitespace tokenizer and the slightly better regex tokenizer from scikit-learn.
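To make the tokenizer-compatibility concern concrete, here is a minimal sketch of the two tokenizers in question. The regex pattern is scikit-learn's documented default `token_pattern`; the function names and sample sentence are my own for illustration:

```python
import re

def whitespace_tokenize(text):
    # Naive whitespace tokenizer: punctuation stays attached to words.
    return text.split()

def regex_tokenize(text):
    # scikit-learn's default token_pattern: words of 2+ word characters,
    # so contractions are split and single letters are dropped.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

text = "Well, I don't think it's serious."
print(whitespace_tokenize(text))  # ['Well,', 'I', "don't", 'think', "it's", 'serious.']
print(regex_tokenize(text))       # ['well', 'don', 'think', 'it', 'serious']
```

The two tokenizers disagree on what counts as a token ("don't" vs. "don"), which is exactly why a single stop word list rarely fits both without post-processing.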
I am a big fan of your work 💥 so my initial idea was to use spaCy for this, with some post-processing on my side to account for the different tokenizers. Hence, I started with a small inspection of the stop words bundled in spaCy 3.0-rc1. I only looked at French and English, which I speak fluently.
I have to admit I was surprised by the inclusion of English stop words such as:

- serious
- well
- never
- very
- please

Excluding these stop words would be a loss of valuable information for any human analysis or machine learning task, e.g. sentiment analysis. I was even more surprised when I looked at the French stop words. My compatriots will quickly understand the problem 🇫🇷
- zut
- clic
- clac
- remarquable
- sacrebleu
- sapristi
- bien
- nul
- pff
- olé
- paf
- sein
- plouf
- ...

It looks like the words come from a Tintin cartoon. Joking aside, I realize this is not a simple topic, and that there is a balance between completeness, cross-domain applicability, and the need for customization. For instance, scikit-learn is quite upfront in its documentation about the issues with the English stop words it bundles, and it links an interesting academic paper on the topic: https://www.aclweb.org/anthology/W18-2502/
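One issue that paper raises is that stop word lists often contain entries a given tokenizer can never produce, so those entries silently never match. A minimal sketch of such a consistency check, under my own assumptions (the function names and the tiny sample list are illustrative, not spaCy's actual French list; the regex is scikit-learn's default `token_pattern`):

```python
import re

def regex_tokenize(text):
    # scikit-learn's default token_pattern.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def inconsistent_stop_words(stop_words, tokenize):
    # A stop word is consistent with a tokenizer only if tokenizing it
    # yields the word itself; otherwise it can never match any token.
    return sorted(w for w in stop_words if tokenize(w) != [w])

# Illustrative sample only, not a real bundled list.
sample = {"zut", "olé", "d'abord", "a"}
print(inconsistent_stop_words(sample, regex_tokenize))  # ['a', "d'abord"]
```

Here "a" is dropped by the tokenizer (single character) and "d'abord" is split into "abord", so neither entry would ever filter anything; running such a check against each bundled list per target tokenizer would surface dead entries automatically.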
Perhaps a good starting point would be to remove the questionable stop words I spotted in English and French? Longer term, I would suggest crowdsourcing with native speakers for each language.
Cheers,
Alex