metric= "cosine" error reported #2217
Comments
Getting the same error; here is my full log. I'm guessing yours is the same.
# Define HDBSCAN clustering model, also different for data types
# Pre-defined parameters (min cluster size = 10, which comes from min_topic_size)
hdbscan_model = HDBSCAN(min_cluster_size=10, cluster_selection_method="eom", prediction_data=True,
                        metric="cosine")
topic_model = BERTopic(representation_model=representation_model,
                       vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, embedding_model=sentence_model,
                       umap_model=umap_model, hdbscan_model=hdbscan_model,
                       calculate_probabilities=False, verbose=True, nr_topics='auto')
topics, probs = topic_model.fit_transform(prompts_df["prompt"].tolist(),
                                          embeddings=np.array(prompts_df["embedding"].tolist()))
{
"name": "ValueError",
"message": "Unrecognized metric 'cosine'",
"stack": "---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File sklearn\\\\metrics\\\\_dist_metrics.pyx:416, in sklearn.metrics._dist_metrics.DistanceMetric64.get_metric()
KeyError: 'cosine'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[29], line 8
3 import sklearn
5 topic_model= BERTopic(representation_model=representation_model, vectorizer_model=vectorizer_model,ctfidf_model=ctfidf_model, embedding_model=sentence_model,
6 umap_model = umap_model, hdbscan_model = hdbscan_model, calculate_probabilities=False, verbose=True, nr_topics='auto')
----> 8 topics, probs = topic_model.fit_transform(prompts_df[\"prompt\"].tolist(), embeddings=np.array(prompts_df[\"embedding\"].tolist()))
11 reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
12 min_dist=0.0, metric='cosine').fit_transform(np.array(prompts_df[\"embedding\"].tolist()))
File C:\\LocalData\\pabflore\\Bertopic\\BERTopic\\bertopic\\_bertopic.py:463, in BERTopic.fit_transform(self, documents, embeddings, images, y)
459 umap_embeddings = self.umap_model.transform(embeddings)
461 if len(documents) > 0:
462 # Cluster reduced embeddings
--> 463 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
464 if self._is_zeroshot() and len(assigned_documents) > 0:
465 documents, embeddings = self._combine_zeroshot_topics(
466 documents, embeddings, assigned_documents, assigned_embeddings
467 )
File C:\\LocalData\\pabflore\\Bertopic\\BERTopic\\bertopic\\_bertopic.py:3777, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3775 else:
3776 try:
-> 3777 self.hdbscan_model.fit(umap_embeddings, y=y)
3778 except TypeError:
3779 self.hdbscan_model.fit(umap_embeddings)
File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:1221, in HDBSCAN.fit(self, X, y)
1211 kwargs.update(self._metric_kwargs)
1212 kwargs['gen_min_span_tree'] |= self.branch_detection_data
1214 (
1215 self.labels_,
1216 self.probabilities_,
1217 self.cluster_persistence_,
1218 self._condensed_tree,
1219 self._single_linkage_tree,
1220 self._min_spanning_tree,
-> 1221 ) = hdbscan(clean_data, **kwargs)
1223 if self.metric != \"precomputed\" and not self._all_finite:
1224 # remap indices to align with original data in the case of non-finite entries.
1225 self._condensed_tree = remap_condensed_tree(
1226 self._condensed_tree, internal_to_raw, outliers
1227 )
File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:869, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
856 (single_linkage_tree, result_min_span_tree) = memory.cache(
857 _hdbscan_prims_balltree
858 )(
(...)
866 **kwargs
867 )
868 else:
--> 869 (single_linkage_tree, result_min_span_tree) = memory.cache(
870 _hdbscan_boruvka_balltree
871 )(
872 X,
873 min_samples,
874 alpha,
875 metric,
876 p,
877 leaf_size,
878 approx_min_span_tree,
879 gen_min_span_tree,
880 core_dist_n_jobs,
881 **kwargs
882 )
884 return (
885 _tree_to_labels(
886 X,
(...)
895 + (result_min_span_tree,)
896 )
File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\joblib\\memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
311 def __call__(self, *args, **kwargs):
--> 312 return self.func(*args, **kwargs)
File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:384, in _hdbscan_boruvka_balltree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
381 if X.dtype != np.float64:
382 X = X.astype(np.float64)
--> 384 tree = BallTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
385 alg = BallTreeBoruvkaAlgorithm(
386 tree,
387 min_samples,
(...)
392 **kwargs
393 )
394 min_spanning_tree = alg.spanning_tree()
File sklearn\\\\neighbors\\\\_binary_tree.pxi:904, in sklearn.neighbors._ball_tree.BinaryTree64.__init__()
File sklearn\\\\metrics\\\\_dist_metrics.pyx:207, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric()
File sklearn\\\\metrics\\\\_dist_metrics.pyx:418, in sklearn.metrics._dist_metrics.DistanceMetric64.get_metric()
ValueError: Unrecognized metric 'cosine'"
}
@PipaFlores This is not a bug related to BERTopic, but to HDBSCAN. Have you tried installing the latest version of HDBSCAN? You can find more information about this here and here.
Which version of HDBSCAN should be used with BERTopic 0.16.4? I currently use BERTopic 0.16.4 and HDBSCAN 0.8.30.
@superseanyoung You could try the latest version of HDBSCAN, which might fix your issue. If that doesn't work, try creating a new environment (that often helps with these kinds of issues). If none of that works, I would advise posting on the HDBSCAN repo, since they know more about this than we do here (it is also not related to BERTopic).
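When posting on the HDBSCAN repo, it usually helps to state exactly which versions are installed. Below is a minimal sketch using only the standard library. The package list is an assumption based on the libraries that appear in the traceback, and `installed_versions` is a hypothetical helper name, not part of BERTopic or HDBSCAN.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in the active environment.
            found[name] = None
    return found


# Distributions that show up in the traceback; adjust to your environment.
for name, ver in installed_versions(
    ["bertopic", "hdbscan", "scikit-learn", "joblib"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Run this inside the same environment (or notebook kernel) that raises the error, since a different kernel may resolve different package versions.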
Hi. I have this issue too. I installed the latest version of HDBSCAN and also created a new environment, but neither worked.
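A possible workaround, not confirmed by anyone in this thread: the traceback shows scikit-learn's BallTree rejecting 'cosine', so one option is to L2-normalize the vectors and cluster with the default Euclidean metric instead. For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so the two rank neighbors identically. The sketch below only checks that identity on hypothetical toy vectors in plain Python; the helper names are illustrative and it does not call HDBSCAN or BERTopic.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vectors standing in for (reduced) embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
squared_euclidean = euclidean_distance(a_unit, b_unit) ** 2

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
assert math.isclose(squared_euclidean, 2.0 * cosine_distance(a, b))
```

In a real pipeline the same normalization would be applied row-wise to the array passed to HDBSCAN, with `metric="euclidean"` (the HDBSCAN default). Whether that is appropriate for UMAP-reduced embeddings inside BERTopic is a judgment call, so treat it as something to experiment with rather than a fix.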
Describe the bug
I wanted to use cosine as the distance metric when customizing HDBSCAN. Instantiating HDBSCAN itself did not fail, but when I passed that hdbscan_model into BERTopic(), I got the error: Unrecognized metric 'cosine'
Reproduction
BERTopic Version
0.16.4