
Memory Management: Allow None embedding_model with pre-computed embeddings #2271

twright8 opened this issue Jan 23, 2025 · 1 comment
twright8 commented Jan 23, 2025

First, I want to preface that I'm not an expert with BERTopic's internals, so I apologize if I'm misunderstanding something fundamental.
I'm running into an issue where I have limited VRAM and want to use BERTopic with an LLM for topic labeling. My workflow involves:

  • Pre-computing embeddings using a large embedding model
  • Saving these embeddings to disk
  • Loading them back for topic modeling

The challenge I'm facing is that when I try to initialize BERTopic with embedding_model=None while providing pre-computed embeddings, I get an error. This means I have to keep the large embedding model in memory even though I've already generated the embeddings.

Here's a minimal example of my current workaround using a dummy embedder:

class DummyEmbedder:
    def embed_documents(self, documents, verbose=False):
        return embeddings  # Using precomputed embeddings
    
    def embed_words(self, words, verbose=False):
        return embeddings

dummy_embedder = DummyEmbedder()

topic_model = BERTopic(
    embedding_model=dummy_embedder,
    umap_model=reduced_e,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_models,
)
topics, probs = topic_model.fit_transform(data, embeddings=embeddings)

Questions

  • Is there a better way to handle this scenario?
  • Would it make sense to allow embedding_model=None when pre-computed embeddings are provided?
  • Is my dummy embedder approach safe to use, or could it cause issues I'm not aware of?
MaartenGr (Owner) commented
> The challenge I'm facing is that when I try to initialize BERTopic with embedding_model=None while providing pre-computed embeddings, I get an error. This means I have to keep the large embedding model in memory even though I've already generated the embeddings.

Before talking about solutions, it's important that we first have a clear picture of the error you mention. You can use embedding_model=None with pre-computed embeddings. The only thing that could stop you is a model in representation_models that makes use of word-level embeddings (such as KeyBERTInspired).

In other words, there are two things missing:

  • The full code that creates the error
  • The error itself
