BERTopic random results with random_state set when input order is changed #2276

JvdsReform · 2025-01-28T14:29:22Z

I'm using BERTopic with UMAP and HDBSCAN. I set UMAP's random_state to a fixed number, but fit_transform still gives wildly different results when my input array (the corpus) is in a different order. For example, with a collection of ~350 descriptions, swapping the first and last elements changes the number of clusters from 20 to 24. Is this normal? Which step in the BERTopic pipeline introduces this randomness?

Code in question where I create the model:

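from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# input_corpus: list of ~350 description strings
bi_encoder = SentenceTransformer("all-mpnet-base-v2")
embeddings = bi_encoder.encode(input_corpus, show_progress_bar=True)

# Only UMAP receives a fixed seed here
umap_model = UMAP(n_neighbors=3, n_components=3, min_dist=0.1, metric="cosine", random_state=40)
hdbscan_model = HDBSCAN(prediction_data=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

topic_model = BERTopic(
    embedding_model=bi_encoder,
    vectorizer_model=vectorizer_model,
    hdbscan_model=hdbscan_model,
    umap_model=umap_model,
    verbose=True,
)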

Versions:

bertopic==0.16.4
umap-learn==0.5.7
hdbscan==0.8.40
sentence-transformers==3.3.1

saipavankumar-muppalaneni · 2025-01-29T16:38:22Z

I am frustrated by the same issue. The model initially gave 4 good topics from the data, but since then it only produces 2 topics.

MaartenGr · 2025-01-29T17:10:03Z

@JvdsReform That is likely a result of the underlying models (HDBSCAN or UMAP) rather than something in BERTopic itself. BERTopic is a modular framework and does not add randomness of its own to the results. If I remember correctly, the input order can have an effect with UMAP. You could check the UMAP issues page, as I recall one or two issues about this.
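To make the point about order concrete, here is a toy sketch of one general way a fixed seed can still give order-dependent results. It is not UMAP's or HDBSCAN's actual code, and the thread does not confirm that this is the mechanism at play; `seeded_noise_by_position` and the `doc_*` names are hypothetical. A seeded generator always produces the same sequence of draws, but each draw is handed out by position, so reordering the input changes which document receives which draw.

```python
import random


def seeded_noise_by_position(items, seed=40):
    """Attach one pseudo-random draw to each item, in input order (hypothetical helper)."""
    rng = random.Random(seed)
    return {item: rng.random() for item in items}


docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
# Swap the first and last element, as in the original report.
swapped = [docs[-1]] + docs[1:-1] + [docs[0]]

original = seeded_noise_by_position(docs)
reordered = seeded_noise_by_position(swapped)

# Same seed and same order: fully reproducible.
print(original == seeded_noise_by_position(docs))

# Same seed, different order: the draw that went to doc_a now goes to doc_d.
print(original["doc_a"] == reordered["doc_d"])
print(original["doc_a"] == reordered["doc_a"])
```

In other words, random_state makes a run repeatable for a given input order; on its own it does not promise the same result for a permuted input.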

Also note that without more information (code, BERTopic version, etc.) it is incredibly hard to say more about this particular subject (hence why I advise opening an issue and providing the information suggested there). That said, this is most likely related to the underlying models.

@saipavankumar-muppalaneni As mentioned above, without more information it's hard for me to say why this is happening to you. Did you use a random_state? Did you run it in the exact same environment before and after? Which versions do you have? How did you initialize BERTopic? Etc.

JvdsReform · 2025-01-29T17:18:06Z

@MaartenGr Indeed, sorry for the vague issue. I edited the question with some additional info. I'll also look this up in the HDBSCAN and UMAP repos.
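When comparing two runs, the number of clusters alone (20 vs 24) does not say how different the groupings are, and topic IDs are arbitrary between runs. A minimal sketch for comparing them, assuming you kept the permutation applied to the corpus: map the permuted run's labels back to the original document order, then measure how often the two runs agree on whether a pair of documents belongs together (the Rand index). The helper names and the example labels are hypothetical, and the -1 outlier label is treated as an ordinary group here, which is a simplification.

```python
from itertools import combinations


def realign(labels_permuted, perm):
    """Map labels back to original order; perm[i] is the original index of the
    document at position i in the permuted corpus."""
    aligned = [None] * len(perm)
    for pos, orig_idx in enumerate(perm):
        aligned[orig_idx] = labels_permuted[pos]
    return aligned


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which both runs agree (same group vs different group)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Made-up labels for five documents; first and last were swapped in the second run.
labels_original = [0, 0, 1, 1, -1]
perm = [4, 1, 2, 3, 0]
labels_permuted = [-1, 5, 7, 7, 5]

aligned = realign(labels_permuted, perm)
score = rand_index(labels_original, aligned)
print(aligned, score)
```

A score near 1.0 means the two runs group documents the same way even if the topic numbers differ. scikit-learn's `adjusted_rand_score` offers a chance-corrected version of the same idea.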

JvdsReform changed the title ~~BERTopic random results with random_state set~~ BERTopic random results with random_state set when input order is changed Jan 29, 2025
