Example for reducing the size of a vocabulary #5398
Replies: 4 comments
-
Good timing for this question! I've been working on removing the stored lexemes from the models here: #5238. This is 99% finished and planned for v2.3, and it will reduce the model size and loading time a lot for models with vectors.

For v2.2 there's not really any way to remove lexemes from an existing vocab. There's also some information in the lexemes that isn't stored elsewhere (probs, clusters, custom normalizations), which makes it a little tricky, because removing it could affect the model performance.

Ines's comment is still basically what to do: load your model and initialize a new vocab for the same language. I think you can just copy Lines 289 to 293 in c045a9c. Actually, as long as you save and reload the model, this should happen anyway, so you only really need to do this if you're modifying the vectors on the fly without reloading.

So the whole thing would look something like this (I haven't tested this, so check it carefully to make sure the model performance stays the same!):
Edited to add: you definitely need to reload the model no matter what, to reinitialize all the pipeline components with the new vocab (just as Ines said in the earlier comment). Edited to remove incorrect lemmatizer-related code.
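As a library-free illustration of the tricky part (the per-word information that isn't stored anywhere else), here is a rough sketch using plain dicts as stand-ins for spaCy's lexeme attributes; the words and values are made up:

```python
# Hypothetical sketch: rebuild a "vocab" keeping only the words you need,
# while carrying over the per-word attributes (prob, cluster, norm) that
# would otherwise be lost. Plain dicts stand in for spaCy's Lexeme objects.

def rebuild_vocab(old_vocab, keep_words):
    """Return a new vocab containing only keep_words, copying attributes."""
    new_vocab = {}
    for word in keep_words:
        if word in old_vocab:
            # copy the attributes that aren't stored anywhere else
            new_vocab[word] = dict(old_vocab[word])
    return new_vocab

old = {
    "apple": {"prob": -10.5, "cluster": 42, "norm": "apple"},
    "Apple": {"prob": -12.1, "cluster": 42, "norm": "apple"},
    "xyzzy": {"prob": -19.9, "cluster": 0, "norm": "xyzzy"},
}
new = rebuild_vocab(old, keep_words={"apple", "Apple"})
print(sorted(new))           # ['Apple', 'apple']
print(new["apple"]["prob"])  # -10.5
```

The same copy-what-you-keep idea is what the real snippet below does against spaCy's `Vocab`.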
-
Wow thanks so much for this @adrianeboyd, that was super helpful. I don't think I could have worked that out on my own. I think I've gotten it all working now, although haven't done performance profiling yet. There were a couple of little hiccups along the way, which I'll document in case it's helpful.
The problem is that when loading the model, the …

I managed to fix this by updating the …

And the end result:

```python
import spacy
from spacy.lang.en import English

def prune_model(nlp, lex_map):
    new_vocab = English.Defaults.create_vocab()
    new_vocab.lookups = nlp.vocab.lookups
    new_vocab.vectors = nlp.vocab.vectors
    for lex in nlp.vocab:
        if lex.orth_ in lex_map:
            new_vocab[lex.orth_]  # accessing the string adds a lexeme
            new_vocab[lex.orth_].prob = lex_map[lex.orth_]
    nlp.vocab = new_vocab
    removed_words = nlp.vocab.prune_vectors(len(lex_map))
    for name, pipe in nlp.pipeline:
        if hasattr(pipe, "cfg") and pipe.cfg.get("pretrained_vectors"):
            pipe.cfg["pretrained_vectors"] = nlp.vocab.vectors.name
    nlp.to_disk("reduced_model")

nlp = spacy.load("reduced_model")
```

One thing I was curious about is that the size of the …
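For checking how much the pruning actually saved on disk, a small stdlib helper like this can compare the model directories (a sketch; the directory names are placeholders):

```python
from pathlib import Path

def dir_size_mb(path):
    """Total size of all files under path, in megabytes."""
    total = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total / (1024 * 1024)

# e.g. compare the original and reduced models (paths are hypothetical):
# print(dir_size_mb("en_core_web_md-2.2.5"), dir_size_mb("reduced_model"))
```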
-
Ah, the vectors names, I forgot about the details related to the pipeline components. (A lot of those details will be less hacky in v3.) The lemmatizer bit was also wrong: if you have the right lookups and save/reload the model, the lemmatizer should be initialized correctly.

There are two different sizes here, which may be part of the confusion. One is the number of vectors and the other is the number of lexemes. In the provided models these come from two different sources, which is why there are 1.3M lexemes but 600K vectors for …

I'm not sure why …
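The two different sizes can be pictured with a toy example: a small vectors table where several vocabulary keys share one vector row, mirroring the key-to-row mapping spaCy uses (the words and numbers here are invented):

```python
import numpy as np

# 1.3M lexemes vs 600K vectors, in miniature: 4 vocab keys, 2 vector rows.
rows = np.array([[1.0, 0.0],   # row 0
                 [0.0, 1.0]])  # row 1

# several vocabulary entries can point at the same vector row
key2row = {"apple": 0, "apples": 0, "banana": 1, "bananas": 1}

n_lexemes = len(key2row)   # the "vocab size"
n_vectors = rows.shape[0]  # the "vectors size"
print(n_lexemes, n_vectors)  # 4 2
```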
-
Yep, I was aware of the difference between the number of lexemes and the number of vectors. Good to know that …

One thing I'm still a little confused about is why the vocab for the …
-
I need to use a custom spaCy model in an AWS Lambda, and I would like to have the whole model installed into the Lambda (rather than pulling it from S3 on init), so I need to reduce the size of the model on disk.
I've managed to reduce the vectors to an acceptable size by calling `Vocab.prune_vectors()` and updating the probabilities of the vocab's lexemes to match the distribution in my dataset, so the appropriate lexemes are pruned, like this: …

The lexemes on disk are still taking up a fair bit of space though (132 MB), so I would like to remove some. I found this suggestion to create a new `Vocab` instance and only keep the desired lexemes, but I can't quite work out how to make this happen.

Perhaps an example that reduces both vectors and lexemes could be a useful addition to the docs?
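For intuition, here is a rough numpy sketch of the idea behind `Vocab.prune_vectors`: keep the first N rows (assuming the rows are ordered by frequency) and remap each pruned word to its most cosine-similar kept row. This is not spaCy's actual implementation, just the concept:

```python
import numpy as np

def prune_vectors_sketch(vectors, keys, n_keep):
    """Keep the first n_keep vector rows and remap every pruned word
    to its most cosine-similar kept row. Returns (kept_vectors, key2row)."""
    kept = vectors[:n_keep]
    # normalize the kept rows so a dot product gives cosine similarity
    kept_norm = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    key2row = {}
    for i, key in enumerate(keys):
        if i < n_keep:
            key2row[key] = i
        else:
            v = vectors[i] / np.linalg.norm(vectors[i])
            key2row[key] = int(np.argmax(kept_norm @ v))
    return kept, key2row

vecs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.9, 0.1],   # closest to row 0
                 [0.1, 0.9]])  # closest to row 1
kept, key2row = prune_vectors_sketch(vecs, ["a", "b", "c", "d"], n_keep=2)
print(key2row)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

After a pruning like this, many words share a row, which is why the number of lexemes and the number of vectors can diverge so much.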