BERTopic random results with random_state set when input order is changed #2276

Open
JvdsReform opened this issue Jan 28, 2025 · 3 comments

Comments

@JvdsReform

JvdsReform commented Jan 28, 2025

I'm using BERTopic with UMAP and HDBSCAN. I set the random_state of UMAP to a specific number, but I still get wildly different results from fit_transform if my input array (the corpus) is in a different order. For example, when I have a collection of ~350 descriptions and I swap the first and last elements, the number of clusters goes from 20 to 24. Is this normal? Which step in the BERTopic process creates this randomness?

Code in question where I create the model:

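from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# input_corpus is the list of ~350 description strings (not shown here)
bi_encoder = SentenceTransformer("all-mpnet-base-v2")
embeddings = bi_encoder.encode(input_corpus, show_progress_bar=True)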

umap_model = UMAP(n_neighbors=3, n_components=3, min_dist=0.1, metric="cosine", random_state=40)
hdbscan_model = HDBSCAN(prediction_data=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

topic_model = BERTopic(
    embedding_model=bi_encoder,
    vectorizer_model=vectorizer_model,
    hdbscan_model=hdbscan_model,
    umap_model=umap_model,
    verbose=True,
)

Versions:

bertopic==0.16.4
umap-learn==0.5.7
hdbscan==0.8.40
sentence-transformers==3.3.1

@saipavankumar-muppalaneni

I am frustrated by the same issue. The model gave 4 good topics from the data at first, but later on it only produces 2 topics.

@JvdsReform JvdsReform changed the title BERTopic random results with random_state set BERTopic random results with random_state set when input order is changed Jan 29, 2025
@MaartenGr
Owner

@JvdsReform That is likely a result of the underlying models (HDBSCAN or UMAP) rather than something with BERTopic itself. BERTopic is a modular framework and will have no effect on the randomness of the results. If I remember correctly, the input order might have an effect with UMAP. You could check the UMAP issues page, as I recall there being one or two issues about this.
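To narrow down which step reacts to the input order, it can help to measure how much two runs actually disagree once the labels are mapped back to the same document order. The following is a minimal sketch, not part of BERTopic, UMAP, or HDBSCAN: `realign` and `pair_agreement` are hypothetical helper names, and the agreement score is a plain Rand index over document pairs, which ignores how clusters happen to be numbered.

```python
from itertools import combinations


def realign(labels_permuted, permutation):
    """Map labels from a permuted run back to the original document order.

    permutation[i] is the original index of the document that sits at
    position i in the permuted corpus.
    """
    aligned = [None] * len(permutation)
    for position, original_index in enumerate(permutation):
        aligned[original_index] = labels_permuted[position]
    return aligned


def pair_agreement(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree (Rand index).

    A pair counts as agreeing when both runs put the two documents in the
    same cluster, or both runs put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Under these assumptions, one would cluster the embeddings once in the original order and once with the first and last rows swapped, pass the second set of labels through `realign`, and compare the two with `pair_agreement`. Repeating this on the UMAP output alone and then on the HDBSCAN labels should indicate where the divergence starts. Note that this sketch treats the HDBSCAN outlier label -1 as an ordinary cluster, which may or may not be what you want.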

Also note that without more information (code, BERTopic version, etc.) it is incredibly hard to say more about this particular subject (hence why I advise opening an issue and providing the information suggested there). That said, this is most likely related to the underlying models.

@saipavankumar-muppalaneni As mentioned above, without more information it's hard for me to say why this is happening to you. Did you use a random_state? Did you run it in the exact same environment before and after? Which versions do you have? How did you initialize BERTopic? Etc.
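For the version question, a small script can collect the installed versions in the same `name==version` form used earlier in this thread. This is only a sketch using the standard library's `importlib.metadata`; `report_versions` is a hypothetical helper name, and the package list is taken from the versions quoted in the original post.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return a dict mapping each package name to its installed version."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            report[name] = "not installed"
    return report


# Packages named in this thread; adjust to your own environment.
for name, installed in report_versions(
    ["bertopic", "umap-learn", "hdbscan", "sentence-transformers"]
).items():
    print(f"{name}=={installed}")
```

Running this before and after a change in results makes it easier to confirm whether the environment really stayed the same.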

@JvdsReform
Author

@MaartenGr Indeed, sorry for the vague issue. I edited the question with some additional info. I'll also look this up in the HDBSCAN and UMAP repos.
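Until the cause is confirmed, one possible workaround is to always fit on the corpus in a canonical order, so that the same set of documents reaches the model in the same sequence however the caller ordered them. This is a minimal sketch and does not remove the order sensitivity itself: `canonical_order` and `fit_in_canonical_order` are hypothetical helpers, and `fit_fn` stands for any callable that takes a list of documents and returns one label per document.

```python
import hashlib


def canonical_order(docs):
    """Return indices that sort docs by a hash of their content.

    Sorting is stable, so identical documents keep their relative order.
    """
    def content_key(index):
        return hashlib.sha256(docs[index].encode("utf-8")).hexdigest()

    return sorted(range(len(docs)), key=content_key)


def fit_in_canonical_order(docs, fit_fn):
    """Call fit_fn on docs in canonical order; return labels in input order."""
    order = canonical_order(docs)
    labels_sorted = fit_fn([docs[i] for i in order])
    labels = [None] * len(docs)
    for position, original_index in enumerate(order):
        labels[original_index] = labels_sorted[position]
    return labels
```

With BERTopic, `fit_fn` could be something like `lambda d: topic_model.fit_transform(d)[0]`, assuming the embeddings are computed inside the model rather than passed in separately. Adding or removing documents still changes the input, so this only stabilizes results for a fixed corpus.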
