spaCy preprocessing for topic modelling #10575
Replies: 1 comment 1 reply
Have you considered just filtering stopwords out of your final generated list of topic terms? You seem to have successfully zeroed vectors for tokens like you wanted.
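A minimal sketch of that post-filtering idea, assuming the topics come back as a dict mapping topic ids to `(word, score)` pairs in the shape BERTopic's `get_topics()` returns; the helper name is made up:

```python
# Post-filter stopwords out of already-generated topic term lists,
# rather than changing the embeddings themselves.
from spacy.lang.en.stop_words import STOP_WORDS


def filter_topic_terms(topic_terms):
    """Drop stopwords from each topic's representative term list."""
    return {
        topic_id: [
            (word, score)
            for word, score in terms
            if word.lower() not in STOP_WORDS
        ]
        for topic_id, terms in topic_terms.items()
    }


topics = {0: [("war", 0.9), ("the", 0.5), ("conflict", 0.4)]}
print(filter_topic_terms(topics))
# → {0: [('war', 0.9), ('conflict', 0.4)]}
```

This leaves the vectors (and therefore the clustering) untouched and only cleans up the displayed terms.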
I'm not sure that will work the way you intend it to. spacy-transformers does BERT tokenization behind the scenes and aligns the tokens to spaCy tokens, so all you're really changing here is how the BERT sub-tokens are stitched together. But maybe that will help with the results in BERTopic - I'm not really familiar with how it works. You might want to ask the BERTopic developers at their repo. Also just going to link to the general preprocessing FAQ.
Hello, I intend to use spaCy to process about 2 million tweets for a topic modelling task. I intend to use BERTopic. Its `topic_model` is initialized using the `nlp` object, and a `docs` list-like object is used as the input for the fit and transform methods. I did some testing, and in some topics I am occasionally getting stopwords included in the list of words that are supposed to be representative for a particular topic. This is not a desired outcome, so I want to find some way to ignore stopwords.

It seems that BERTopic relies on the `vector` attribute of `Doc` objects. Thus, I came up with the following solution, which zeroes out vectors for stopword and punctuation tokens. However, I am not that experienced with BERTopic and spaCy, so any opinion/suggestion would be much appreciated. Please also note the `merge_entities` and `merge_noun_chunks` pipes, which I believe should improve the topic modelling (but this is yet to be tested; any suggestions about that are also very welcome :) ). The output seems to be promising, or am I missing something here?
An ideal solution would be to replace the `tok2vec` pipe with a `better_tok2vec` that would ignore all stopwords when computing vectors (preferably on both the `Doc` and `Token` level). This is because stopwords still influence the vectors of merged noun chunks; for example, see the word "war" in the output for the following doc.

Output: