Example for reducing the size of a vocabulary #5398
Replies: 4 comments
-
Good timing for this question! I've been working on removing the stored lexemes from the models here: #5238. This is 99% finished and planned for v2.3, and it will reduce the model size and loading time a lot for models with vectors.

For v2.2 there's not really any way to remove lexemes from an existing vocab. There's also some information in the lexemes that isn't stored elsewhere (probs, clusters, custom normalizations), which makes it a little tricky, because removing it could affect the model performance.

Ines's comment is still basically what to do: load your model and initialize a new vocab for the same language. I think you can just copy Lines 289 to 293 in c045a9c. Actually, as long as you save and reload the model, this should happen anyway, so you only really need to do this if you're modifying the vectors on the fly without reloading.

So the whole thing would look something like this (I haven't tested this, so check it carefully to make sure the model performance stays the same!):
Edited to add: you definitely need to reload the model no matter what, to reinitialize all the pipeline components with the new vocab (just as Ines said in the earlier comment). Edited to remove incorrect lemmatizer-related code.
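As a library-free illustration of the tricky part (the per-word information that isn't stored anywhere else), here is a rough sketch using plain dicts as stand-ins for spaCy's lexeme attributes; the words and values are made up:

```python
# Hypothetical sketch: rebuild a "vocab" keeping only the words you need,
# while carrying over the per-word attributes (prob, cluster, norm) that
# would otherwise be lost. Plain dicts stand in for spaCy's Lexeme objects.

def rebuild_vocab(old_vocab, keep_words):
    """Return a new vocab containing only keep_words, copying attributes."""
    new_vocab = {}
    for word in keep_words:
        if word in old_vocab:
            # copy the attributes that aren't stored anywhere else
            new_vocab[word] = dict(old_vocab[word])
    return new_vocab

old = {
    "apple": {"prob": -10.5, "cluster": 42, "norm": "apple"},
    "Apple": {"prob": -12.1, "cluster": 42, "norm": "apple"},
    "xyzzy": {"prob": -19.9, "cluster": 0, "norm": "xyzzy"},
}
new = rebuild_vocab(old, keep_words={"apple", "Apple"})
print(sorted(new))           # ['Apple', 'apple']
print(new["apple"]["prob"])  # -10.5
```

The same copy-what-you-keep idea is what the real snippet below does against spaCy's `Vocab`.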
-
Wow thanks so much for this @adrianeboyd, that was super helpful. I don't think I could have worked that out on my own. I think I've gotten it all working now, although haven't done performance profiling yet. There were a couple of little hiccups along the way, which I'll document in case it's helpful.
The problem is that when loading the model, the …

I managed to fix this by updating the …

And the end result:

```python
import spacy
from spacy.lang.en import English

def prune_model(nlp, lex_map):
    new_vocab = English.Defaults.create_vocab()
    new_vocab.lookups = nlp.vocab.lookups
    new_vocab.vectors = nlp.vocab.vectors
    for lex in nlp.vocab:
        if lex.orth_ in lex_map:
            new_vocab[lex.orth_]  # accessing the string adds a lexeme
            new_vocab[lex.orth_].prob = lex_map[lex.orth_]
    nlp.vocab = new_vocab
    removed_words = nlp.vocab.prune_vectors(len(lex_map))
    for name, pipe in nlp.pipeline:
        if hasattr(pipe, "cfg") and pipe.cfg.get("pretrained_vectors"):
            pipe.cfg["pretrained_vectors"] = nlp.vocab.vectors.name
    nlp.to_disk("reduced_model")

nlp = spacy.load("reduced_model")
```

One thing I was curious about is that the size of the …
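For checking how much the pruning actually saved on disk, a small stdlib helper like this can compare the model directories (a sketch; the directory names are placeholders):

```python
from pathlib import Path

def dir_size_mb(path):
    """Total size of all files under path, in megabytes."""
    total = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total / (1024 * 1024)

# e.g. compare the original and reduced models (paths are hypothetical):
# print(dir_size_mb("en_core_web_md-2.2.5"), dir_size_mb("reduced_model"))
```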
-
Ah, the vectors names, I forgot about the details related to the pipeline components. (A lot of those details will be less hacky in v3.) The lemmatizer bit was also wrong: if you have the right lookups and save/reload the model, the lemmatizer should be initialized correctly.

There are two different sizes here, which may be part of the confusion. One is the number of vectors and the other is the number of lexemes. In the provided models these come from two different sources, which is why there are 1.3M lexemes but 600K vectors for …

I'm not sure why …
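The two different sizes can be pictured with a toy example: a small vectors table where several vocabulary keys share one vector row, mirroring the key-to-row mapping spaCy uses (the words and numbers here are invented):

```python
import numpy as np

# 1.3M lexemes vs 600K vectors, in miniature: 4 vocab keys, 2 vector rows.
rows = np.array([[1.0, 0.0],   # row 0
                 [0.0, 1.0]])  # row 1

# several vocabulary entries can point at the same vector row
key2row = {"apple": 0, "apples": 0, "banana": 1, "bananas": 1}

n_lexemes = len(key2row)   # the "vocab size"
n_vectors = rows.shape[0]  # the "vectors size"
print(n_lexemes, n_vectors)  # 4 2
```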
-
Yep, I was aware of the difference between the number of lexemes and the number of vectors. Good to know that …

One thing I'm still a little confused about is why the vocab for the …
-
I need to use a custom spaCy model in an AWS Lambda, and I would like to have the whole model installed into the Lambda (rather than pulling it from S3 on init), so I need to reduce the size of the model on disk.
I've managed to reduce the vectors to an acceptable size by calling `Vocab.prune_vectors()` and updating the probabilities of the vocab's lexemes to match the distribution in my dataset, so the appropriate lexemes are pruned, like this: …

The lexemes on disk are still taking up a fair bit of space though (132 MB), so I would like to remove some. I found this suggestion to create a new `Vocab` instance and only keep the desired lexemes, but I can't quite work out how to make this happen.

Perhaps an example that reduces both vectors and lexemes could be a useful addition to the docs?
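For intuition, here is a rough numpy sketch of the idea behind `Vocab.prune_vectors`: keep the first N rows (assuming the rows are ordered by frequency) and remap each pruned word to its most cosine-similar kept row. This is not spaCy's actual implementation, just the concept:

```python
import numpy as np

def prune_vectors_sketch(vectors, keys, n_keep):
    """Keep the first n_keep vector rows and remap every pruned word
    to its most cosine-similar kept row. Returns (kept_vectors, key2row)."""
    kept = vectors[:n_keep]
    # normalize the kept rows so a dot product gives cosine similarity
    kept_norm = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    key2row = {}
    for i, key in enumerate(keys):
        if i < n_keep:
            key2row[key] = i
        else:
            v = vectors[i] / np.linalg.norm(vectors[i])
            key2row[key] = int(np.argmax(kept_norm @ v))
    return kept, key2row

vecs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.9, 0.1],   # closest to row 0
                 [0.1, 0.9]])  # closest to row 1
kept, key2row = prune_vectors_sketch(vecs, ["a", "b", "c", "d"], n_keep=2)
print(key2row)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```

After a pruning like this, many words share a row, which is why the number of lexemes and the number of vectors can diverge so much.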