spaCy preprocessing for topic modelling #10575
Replies: 1 comment 1 reply
Have you considered just filtering stopwords out of your final generated list of topic terms? You seem to have successfully zeroed vectors for tokens like you wanted.
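A minimal sketch of that post-filtering idea, assuming the topics come back as a dict mapping topic ids to `(word, score)` pairs in the shape BERTopic's `get_topics()` returns; the helper name is made up:

```python
# Post-filter stopwords out of already-generated topic term lists,
# rather than changing the embeddings themselves.
from spacy.lang.en.stop_words import STOP_WORDS


def filter_topic_terms(topic_terms):
    """Drop stopwords from each topic's representative term list."""
    return {
        topic_id: [
            (word, score)
            for word, score in terms
            if word.lower() not in STOP_WORDS
        ]
        for topic_id, terms in topic_terms.items()
    }


topics = {0: [("war", 0.9), ("the", 0.5), ("conflict", 0.4)]}
print(filter_topic_terms(topics))
# → {0: [('war', 0.9), ('conflict', 0.4)]}
```

This leaves the vectors (and therefore the clustering) untouched and only cleans up the displayed terms.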
I'm not sure that will work the way you intend it to. spacy-transformers does BERT tokenization behind the scenes and aligns the tokens to spaCy tokens, so all you're really changing here is how the BERT sub-tokens are stitched together. But maybe that will help with the results in BERTopic - I'm not really familiar with how it works. You might want to ask the BERTopic developers at their repo. Also just going to link to the general preprocessing FAQ.
Hello, I intend to use spaCy to process about 2 million tweets for a topic modelling task. I intend to use BERTopic. Its `topic_model` is initialized using the `nlp` object, and a `docs` list-like object is used as the input for the fit and transform methods. I did some testing, and in some topics I am occasionally getting stopwords included in the list of words that are supposed to be representative for a particular topic. This is not a desired outcome, so I want to find some way to ignore stopwords.

It seems that BERTopic relies on the `vector` attribute of `Doc` objects. Thus, I came up with the following solution, which zeroes out vectors for stopword and punctuation tokens. However, I am not that experienced with BERTopic and spaCy, so any opinion/suggestion would be much appreciated. Please also note the `merge_entities` and `merge_noun_chunks` pipes, which I believe should improve the topic modelling (but this is yet to be tested; any suggestions about that are also very welcome :) ). The output seems to be promising, or am I missing something here?
An ideal solution would be to replace the `tok2vec` pipe with a `better_tok2vec` that would ignore all stopwords when computing vectors (preferably on both the `Doc` and `Token` level). This is because stopwords still influence the vectors of merged noun chunks; for example, see the word "war" in the output for the following doc.

Output: