Hierarchical Clustering and Intertopic Distance Map #2269

Open
chenhaola opened this issue Jan 18, 2025 · 1 comment

@chenhaola
Hello!
When I use BERTopic, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. It seems that these two functions behave very differently, and I would like to know why.
Thank you!!

The code is attached below:
import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

with open('./data/切词6.txt', 'r', encoding='utf-8') as file:
    docs = file.readlines()
print('Number of documents: ', len(docs))
print('Preview of the first document: ', docs[0])

vectorizer_model = None
embedding_model = SentenceTransformer(r"C:\Users\ASUS\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2\snapshots\ea78891063587eb050ed4166b20062eaf978037c")
embeddings = np.load(r'c:\Users\ASUS\Desktop\BERTopic-Tutorial-main\embedding\emb6.npy')
print(embeddings.shape)

embeddings = embedding_model.encode(docs)  # re-encode the documents (this overwrites the loaded embeddings)
print('Embeddings shape:', embeddings.shape)

# 2. Create the UMAP dimensionality-reduction model

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # fix the seed to avoid randomness, see https://maartengr.github.io/BERTopic/faq.html
)

# 3. Create the HDBSCAN clustering model

hdbscan_model = HDBSCAN(
    min_cluster_size=20,
    min_samples=1,
    metric='euclidean',
    prediction_data=True  # needed to compute document-topic assignments
)

# Define the stop-word list

stop_words = [
    'Adolescent', 'adolescents', 'game', 'gaming', 'addiction', 'internet',
    'study', 'research', 'analysis', 'findings', 'results', 'literature',
    'review', 'impact', 'effects', 'factors', 'behavior', 'risk', 'prevention',
    'treatment', 'gamer', 'games', 'children', 'disorder', 'gamers', 'child',
    'student', 'students', 'play', 'player', 'plays', 'players', 'addictions',
]

# Initialize CountVectorizer with the stop words

vectorizer_model = CountVectorizer(stop_words=stop_words)

# 6. Create the BERTopic model

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics='auto'
)

# Inspect the topics

topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass in the precomputed embeddings
topic_model.get_topic_info()
from bertopic.representation import MaximalMarginalRelevance  # import the MMR representation

# diversity: how diverse the selected keywords/keyphrases are.
# Values range between 0 and 1, with 0 being not diverse at all
# and 1 being the most diverse.

representation_model = MaximalMarginalRelevance(diversity=0.5)  # create the MMR representation model
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model  # pass in the representation model
)

topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
topic_info = topic_model.get_topic_info()
topic_info
topic_docs = topic_model.get_document_info(docs)
topic_docs.to_csv('./聚类结果6.csv')
with open('./data/文本6.txt', 'r', encoding='utf-8') as file:
    texts = file.readlines()
print('Number of texts:', len(texts))
topic_docs.insert(1, '原文', texts)  # insert the original texts
with open('./data/时间6.txt', 'r', encoding='utf-8') as file:
    years = file.readlines()
print('Number of years:', len(years))
topic_docs.insert(2, '时间', years)  # insert the years
topic_docs.to_csv('./聚类结果66.csv')
topic_model.visualize_barchart(top_n_topics=23, custom_labels=True)
topic_model.visualize_topics()
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, hide_document_hover=True)

hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

@MaartenGr (Owner)
When I use BERTopic, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. It seems that these two functions behave very differently, and I would like to know why.

That's true, these functions are quite different from one another because the former is a visualization tool and the latter is doing the modeling.

With the intertopic distance map, the topic embeddings are reduced to two dimensions and then visualized. With the hierarchical model, a clustering algorithm is run on the full topic embeddings (which have many more than two dimensions). Also note that reducing just a few topics to 2-D will always be less accurate than doing so with many more documents (thousands, millions, etc.).
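
As an illustration, here is a minimal sketch of why the two views can disagree. It assumes topic_model is the fitted model from the code above and that its topic_embeddings_ attribute is populated; it is not BERTopic's exact internals, only a comparison of distances in the 2-D projection versus distances in the full embedding space:

# Illustration only (not BERTopic's exact internals): compare topic distances
# in the 2-D projection with distances in the full-dimensional embedding space.
import numpy as np
from umap import UMAP
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

topic_embeddings = np.asarray(topic_model.topic_embeddings_)  # one row per topic (row 0 may be the outlier topic -1)

# Roughly what the intertopic distance map does: reduce to 2-D, then plot
coords_2d = UMAP(n_components=2, metric='cosine', random_state=42).fit_transform(topic_embeddings)
dist_2d = squareform(pdist(coords_2d))

# Roughly what the hierarchy does: cluster on the full-dimensional embeddings
dist_full = pdist(topic_embeddings, metric='cosine')
Z = linkage(dist_full, method='average')  # the dendrogram is built from these full-dimensional distances

# The nearest neighbour of a topic can differ between the two views
i = 1
print('closest topic in the 2-D map:        ', np.argsort(dist_2d[i])[1])
print('closest topic in the full-dim space: ', np.argsort(squareform(dist_full)[i])[1])

Because some distances are necessarily distorted when projecting down to 2-D, two topics that land next to each other in the map can still be merged late in the dendrogram, which is computed from the full-dimensional distances.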
