metric= "cosine" error reported #2217

Open
1 task done
superseanyoung opened this issue Nov 15, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@superseanyoung

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I wanted to use cosine as the distance metric in a custom HDBSCAN model. Instantiating HDBSCAN with metric='cosine' raised no error, but after passing the resulting hdbscan_model to BERTopic() and calling fit_transform, I got: Unrecognized metric 'cosine'

Reproduction

from bertopic import BERTopic
from hdbscan import HDBSCAN

# transformer_model, umap_model, ctfidf_model, representation_model,
# sentences and embeddings are defined earlier (not shown here).
hdbscan_model = HDBSCAN(
    min_cluster_size=200,
    min_samples=20,
    metric='cosine',
    prediction_data=True
)
topic_model = BERTopic(embedding_model=transformer_model,
                       #min_topic_size=3,
                       verbose=True,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       ctfidf_model=ctfidf_model,
                       representation_model=representation_model,
                       #top_n_words=10,
                       #min_topic_size=10,
                       #nr_topics=None,
                       #low_memory=False,
                       #calculate_probabilities=True
                      )
topics, probs = topic_model.fit_transform(sentences, embeddings=embeddings)

BERTopic Version

0.16.4

@superseanyoung superseanyoung added the bug Something isn't working label Nov 15, 2024
@PipaFlores
Contributor

I'm getting the same error; here is my full log. I'm guessing yours is the same.

# Define hdbscan clustering model, also different for data types
# Pre-defined parameters (min_cluster_size = 10, which comes from min_topic_size)
hdbscan_model = HDBSCAN(min_cluster_size=10, cluster_selection_method="eom",
                        prediction_data=True, metric="cosine")
topic_model = BERTopic(representation_model=representation_model,
                       vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model,
                       embedding_model=sentence_model, umap_model=umap_model,
                       hdbscan_model=hdbscan_model, calculate_probabilities=False,
                       verbose=True, nr_topics='auto')

topics, probs = topic_model.fit_transform(
    prompts_df["prompt"].tolist(),
    embeddings=np.array(prompts_df["embedding"].tolist()),
)
{
	"name": "ValueError",
	"message": "Unrecognized metric 'cosine'",
	"stack": "---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File sklearn\\\\metrics\\\\_dist_metrics.pyx:416, in sklearn.metrics._dist_metrics.DistanceMetric64.get_metric()

KeyError: 'cosine'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[29], line 8
      3 import sklearn
      5 topic_model= BERTopic(representation_model=representation_model, vectorizer_model=vectorizer_model,ctfidf_model=ctfidf_model, embedding_model=sentence_model, 
      6                                                     umap_model = umap_model, hdbscan_model = hdbscan_model, calculate_probabilities=False, verbose=True, nr_topics='auto')
----> 8 topics,  probs = topic_model.fit_transform(prompts_df[\"prompt\"].tolist(), embeddings=np.array(prompts_df[\"embedding\"].tolist()))
     11 reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
     12                             min_dist=0.0, metric='cosine').fit_transform(np.array(prompts_df[\"embedding\"].tolist()))

File C:\\LocalData\\pabflore\\Bertopic\\BERTopic\\bertopic\\_bertopic.py:463, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    459         umap_embeddings = self.umap_model.transform(embeddings)
    461 if len(documents) > 0:
    462     # Cluster reduced embeddings
--> 463     documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    464     if self._is_zeroshot() and len(assigned_documents) > 0:
    465         documents, embeddings = self._combine_zeroshot_topics(
    466             documents, embeddings, assigned_documents, assigned_embeddings
    467         )

File C:\\LocalData\\pabflore\\Bertopic\\BERTopic\\bertopic\\_bertopic.py:3777, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
   3775 else:
   3776     try:
-> 3777         self.hdbscan_model.fit(umap_embeddings, y=y)
   3778     except TypeError:
   3779         self.hdbscan_model.fit(umap_embeddings)

File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:1221, in HDBSCAN.fit(self, X, y)
   1211 kwargs.update(self._metric_kwargs)
   1212 kwargs['gen_min_span_tree'] |= self.branch_detection_data
   1214 (
   1215     self.labels_,
   1216     self.probabilities_,
   1217     self.cluster_persistence_,
   1218     self._condensed_tree,
   1219     self._single_linkage_tree,
   1220     self._min_spanning_tree,
-> 1221 ) = hdbscan(clean_data, **kwargs)
   1223 if self.metric != \"precomputed\" and not self._all_finite:
   1224     # remap indices to align with original data in the case of non-finite entries.
   1225     self._condensed_tree = remap_condensed_tree(
   1226         self._condensed_tree, internal_to_raw, outliers
   1227     )

File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:869, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    856             (single_linkage_tree, result_min_span_tree) = memory.cache(
    857                 _hdbscan_prims_balltree
    858             )(
   (...)
    866                 **kwargs
    867             )
    868         else:
--> 869             (single_linkage_tree, result_min_span_tree) = memory.cache(
    870                 _hdbscan_boruvka_balltree
    871             )(
    872                 X,
    873                 min_samples,
    874                 alpha,
    875                 metric,
    876                 p,
    877                 leaf_size,
    878                 approx_min_span_tree,
    879                 gen_min_span_tree,
    880                 core_dist_n_jobs,
    881                 **kwargs
    882             )
    884 return (
    885     _tree_to_labels(
    886         X,
   (...)
    895     + (result_min_span_tree,)
    896 )

File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\joblib\\memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    311 def __call__(self, *args, **kwargs):
--> 312     return self.func(*args, **kwargs)

File c:\\Users\\Localadmin_pabflore\\miniconda3\\envs\\Bertopicfull\\lib\\site-packages\\hdbscan\\hdbscan_.py:384, in _hdbscan_boruvka_balltree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    381 if X.dtype != np.float64:
    382     X = X.astype(np.float64)
--> 384 tree = BallTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
    385 alg = BallTreeBoruvkaAlgorithm(
    386     tree,
    387     min_samples,
   (...)
    392     **kwargs
    393 )
    394 min_spanning_tree = alg.spanning_tree()

File sklearn\\\\neighbors\\\\_binary_tree.pxi:904, in sklearn.neighbors._ball_tree.BinaryTree64.__init__()

File sklearn\\\\metrics\\\\_dist_metrics.pyx:207, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric()

File sklearn\\\\metrics\\\\_dist_metrics.pyx:418, in sklearn.metrics._dist_metrics.DistanceMetric64.get_metric()

ValueError: Unrecognized metric 'cosine'"
}

@MaartenGr
Owner

@PipaFlores This is not a bug related to BERTopic, but to HDBSCAN. Have you tried installing the latest version of HDBSCAN? You can find more information about this here and here.
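
The traceback above ends inside scikit-learn's BallTree, which HDBSCAN builds in its Boruvka code path, so the error appears to come from the tree not accepting 'cosine' at all. A minimal sketch for checking this locally, assuming scikit-learn is installed and that its `sklearn.neighbors.VALID_METRICS` dictionary reflects what the installed version accepts:

```python
from sklearn.neighbors import VALID_METRICS

# Metric names accepted by BallTree (the structure that raised the error)
# versus those accepted by brute-force pairwise distance computation.
ball_tree_metrics = set(VALID_METRICS["ball_tree"])
brute_metrics = set(VALID_METRICS["brute"])

print("cosine in ball_tree:   ", "cosine" in ball_tree_metrics)
print("euclidean in ball_tree:", "euclidean" in ball_tree_metrics)
print("cosine in brute:       ", "cosine" in brute_metrics)
```

If 'cosine' is missing from the ball_tree list, the failure does not depend on BERTopic; it would occur whenever HDBSCAN hands that metric name to a BallTree.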

@superseanyoung
Author

Which version of HDBSCAN should be used with BERTopic 0.16.4? I am currently using BERTopic 0.16.4 and HDBSCAN 0.8.30.

@MaartenGr
Owner

@superseanyoung You could try using the latest version of HDBSCAN which might fix your issue. If that doesn't work, you can try creating a new environment (that often helps with these kinds of issues). If that all doesn't work, I would advise posting on the HDBSCAN repo since they know more about that than here (it's also not related to BERTopic).

@TalaN1993

@superseanyoung You could try using the latest version of HDBSCAN which might fix your issue. If that doesn't work, you can try creating a new environment (that often helps with these kinds of issues). If that all doesn't work, I would advise posting on the HDBSCAN repo since they know more about that than here (it's also not related to BERTopic).

Hi. I have the same issue. I installed the latest version of HDBSCAN and also created a new environment, but neither fixed it.
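
A workaround sometimes used for this kind of error, though nobody in this thread has confirmed it, is to L2-normalize the vectors and cluster with metric='euclidean'. For unit-length vectors the squared Euclidean distance equals twice the cosine distance, so the two orderings of neighbours agree. The sketch below only demonstrates that identity with the standard library; the helper names and sample vectors are made up for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Ordinary Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors (hypothetical data, not taken from the issue).
a = [1.0, 2.0, 3.0]
b = [-2.0, 0.5, 4.0]

cos_d = cosine_distance(a, b)
euc_d = euclidean_distance(l2_normalize(a), l2_normalize(b))

# For unit vectors u, v: ||u - v||^2 = 2 * (1 - cos(u, v))
print(euc_d ** 2, 2 * cos_d)
```

Note that, as the traceback shows, BERTopic fits HDBSCAN on the UMAP-reduced embeddings (`self.hdbscan_model.fit(umap_embeddings, y=y)`), so any such normalization would have to be applied to those reduced vectors rather than to the original sentence embeddings.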
