Hierarchical Clustering and Intertopic Distance Map #2269

Open
chenhaola opened this issue Jan 18, 2025 · 1 comment

@chenhaola
Hello!
When I use BERTopic, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. It seems that these two functions behave very differently, and I would like to know why.
Thank you!!

The code is attached below:
import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

with open('./data/切词6.txt', 'r', encoding='utf-8') as file:
    docs = file.readlines()
print('Number of documents: ', len(docs))
print('Preview of the first document: ', docs[0])

vectorizer_model = None
embedding_model = SentenceTransformer(r"C:\Users\ASUS\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2\snapshots\ea78891063587eb050ed4166b20062eaf978037c")
embeddings = np.load(r'c:\Users\ASUS\Desktop\BERTopic-Tutorial-main\embedding\emb6.npy')
print(embeddings.shape)

embeddings = embedding_model.encode(docs)  # re-encode the documents (this overwrites the loaded embeddings)
print('Embeddings shape:', embeddings.shape)

# 2. Create the UMAP dimensionality-reduction model

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # fix the seed to avoid randomness, see https://maartengr.github.io/BERTopic/faq.html
)

# 3. Create the HDBSCAN clustering model

hdbscan_model = HDBSCAN(
    min_cluster_size=20,
    min_samples=1,
    metric='euclidean',
    prediction_data=True  # needed to compute document-topic assignments
)

# Define the stop-word list

stop_words = [
    'Adolescent', 'adolescents', 'game', 'gaming', 'addiction', 'internet',
    'study', 'research', 'analysis', 'findings', 'results', 'literature',
    'review', 'impact', 'effects', 'factors', 'behavior', 'risk', 'prevention',
    'treatment', 'gamer', 'games', 'children', 'disorder', 'gamers', 'child',
    'student', 'students', 'play', 'player', 'plays', 'players', 'addictions',
]

# Initialize CountVectorizer with the stop words

vectorizer_model = CountVectorizer(stop_words=stop_words)

# 6. Create the BERTopic model

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics='auto'
)

# Inspect the topics

topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass in the precomputed embeddings
topic_model.get_topic_info()
from bertopic.representation import MaximalMarginalRelevance  # import the MMR representation

# diversity: how diverse the selected keywords/keyphrases are.
# Values range between 0 and 1, with 0 being not diverse at all
# and 1 being the most diverse.

representation_model = MaximalMarginalRelevance(diversity=0.5)  # create the MMR representation model
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model  # pass in the representation model
)

topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
topic_info = topic_model.get_topic_info()
topic_info
topic_docs = topic_model.get_document_info(docs)
topic_docs.to_csv('./聚类结果6.csv')
with open('./data/文本6.txt', 'r', encoding='utf-8') as file:
    texts = file.readlines()
print('Number of texts:', len(texts))
topic_docs.insert(1, '原文', texts)  # insert the original texts
with open('./data/时间6.txt', 'r', encoding='utf-8') as file:
    years = file.readlines()
print('Number of years:', len(years))
topic_docs.insert(2, '时间', years)  # insert the years
topic_docs.to_csv('./聚类结果66.csv')
topic_model.visualize_barchart(top_n_topics=23, custom_labels=True)
topic_model.visualize_topics()
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, hide_document_hover=True)

hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

@MaartenGr (Owner)
When I use BERTopic, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. It seems that these two functions behave very differently, and I would like to know why.

That's true, these functions are quite different from one another because the former is a visualization tool and the latter is doing the modeling.

With the intertopic distance map, the topic embeddings are reduced to two dimensions and then visualized. With the hierarchical model, a clustering algorithm is run on the full topic embeddings (which have many more than two dimensions). Also note that reducing just a few topics to 2-D will always be less accurate than doing so with many more documents (thousands, millions, etc.).
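
As an illustration, here is a minimal sketch of why the two views can disagree. It assumes topic_model is the fitted model from the code above and that its topic_embeddings_ attribute is populated; it is not BERTopic's exact internals, only a comparison of distances in the 2-D projection versus distances in the full embedding space:

# Illustration only (not BERTopic's exact internals): compare topic distances
# in the 2-D projection with distances in the full-dimensional embedding space.
import numpy as np
from umap import UMAP
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

topic_embeddings = np.asarray(topic_model.topic_embeddings_)  # one row per topic (row 0 may be the outlier topic -1)

# Roughly what the intertopic distance map does: reduce to 2-D, then plot
coords_2d = UMAP(n_components=2, metric='cosine', random_state=42).fit_transform(topic_embeddings)
dist_2d = squareform(pdist(coords_2d))

# Roughly what the hierarchy does: cluster on the full-dimensional embeddings
dist_full = pdist(topic_embeddings, metric='cosine')
Z = linkage(dist_full, method='average')  # the dendrogram is built from these full-dimensional distances

# The nearest neighbour of a topic can differ between the two views
i = 1
print('closest topic in the 2-D map:        ', np.argsort(dist_2d[i])[1])
print('closest topic in the full-dim space: ', np.argsort(squareform(dist_full)[i])[1])

Because some distances are necessarily distorted when projecting down to 2-D, two topics that land next to each other in the map can still be merged late in the dendrogram, which is computed from the full-dimensional distances.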
