
Memory Management: Allow None embedding_model with pre-computed embeddings #2271

twright8 opened this issue Jan 23, 2025 · 1 comment
twright8 commented Jan 23, 2025

First, I want to preface that I'm not an expert with BERTopic's internals, so I apologize if I'm misunderstanding something fundamental.
I'm running into an issue where I have limited VRAM and want to use BERTopic with an LLM for topic labeling. My workflow involves:

  • Pre-computing embeddings using a large embedding model
  • Saving these embeddings to disk
  • Loading them back for topic modeling

The challenge I'm facing is that when I try to initialize BERTopic with embedding_model=None while providing pre-computed embeddings, I get an error. This means I have to keep the large embedding model in memory even though I've already generated the embeddings.

Here's a minimal example of my current workaround using a dummy embedder:

class DummyEmbedder:
    def embed_documents(self, documents, verbose=False):
        return embeddings  # Using precomputed embeddings
    
    def embed_words(self, words, verbose=False):
        return embeddings

dummy_embedder = DummyEmbedder()

topic_model = BERTopic(
    embedding_model=dummy_embedder,
    umap_model=reduced_e,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_models,
)
topics, probs = topic_model.fit_transform(data, embeddings=embeddings)

Questions

  • Is there a better way to handle this scenario?
  • Would it make sense to allow embedding_model=None when pre-computed embeddings are provided?
  • Is my dummy embedder approach safe to use, or could it cause issues I'm not aware of?
MaartenGr (Owner) commented
> The challenge I'm facing is that when I try to initialize BERTopic with embedding_model=None while providing pre-computed embeddings, I get an error. This means I have to keep the large embedding model in memory even though I've already generated the embeddings.

Before talking about solutions, it's important that we first have a clear picture of the error you mention. You can use embedding_model=None with pre-computed embeddings. The only thing that could stop you is a model in representation_models that makes use of word-level embeddings (such as KeyBERTInspired).

In other words, there are two things missing:

  • The full code that creates the error
  • The error itself
