When I use BERTopic for clustering, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. The two functions give very different results. What is the reason?
That's true. These functions are quite different from one another: the former is a visualization tool, while the latter does the modeling.
With the intertopic distance map, the topic embeddings are reduced to two dimensions and then visualized. With the hierarchical model, a clustering algorithm is applied to the full topic embeddings (which have far more than two dimensions). Also note that reducing only a handful of topics to 2-D will always be less accurate than reducing many more points (thousands or millions of documents, for example).
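To make the point above concrete, here is a minimal sketch (not BERTopic's actual code). It assumes made-up 3-D "topic embeddings", swaps in a plain PCA projection for the 2-D reduction that the visualization performs, and uses the closest pair under Euclidean distance as a stand-in for the first merge an agglomerative clustering would make. All names and values are hypothetical.

```python
import numpy as np

# Hypothetical topic embeddings: 6 topics in 3 dimensions
# (real topic embeddings have many more dimensions).
topic_vectors = np.array([
    [0.0, 0.0, 0.0],    # topic 0
    [0.5, 0.0, 3.0],    # topic 1: differs from topic 0 mostly along the third axis
    [2.0, 0.0, 0.0],    # topic 2
    [10.0, 0.0, 0.0],   # topic 3
    [0.0, 10.0, 0.0],   # topic 4
    [10.0, 10.0, 0.0],  # topic 5
])


def pairwise_distances(x):
    """Euclidean distance between every pair of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def closest_pair(dist):
    """Indices of the two closest distinct points (the first merge a
    distance-based agglomerative clustering would make)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    i, j = np.unravel_index(np.argmin(d), d.shape)
    return (int(min(i, j)), int(max(i, j)))


def project_to_2d(x):
    """PCA via SVD, used here as a simple stand-in for the 2-D reduction."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


full_pair = closest_pair(pairwise_distances(topic_vectors))
flat_pair = closest_pair(pairwise_distances(project_to_2d(topic_vectors)))

print("closest pair in the full space:", full_pair)
print("closest pair after 2-D reduction:", flat_pair)
```

In this toy setup, topics 0 and 2 are the closest pair in the full space, while topics 0 and 1 look closest after the projection, because the axis that separates them is mostly discarded. The same kind of mismatch can show up between the intertopic distance map and the hierarchy.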
Hello!
When I use BERTopic for clustering, I find that topics that appear close together in the intertopic distance map are not grouped together in the hierarchical clustering. The two functions give very different results. What is the reason?
Thank you!!
The code is attached:
import numpy as np
from bertopic import BERTopic
from transformers.pipelines import pipeline
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
with open('./data/切词6.txt', 'r', encoding='utf-8') as file:
    docs = file.readlines()
print('Number of documents: ', len(docs))
print('Preview of the first document: ', docs[0])
vectorizer_model = None
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
embedding_model = SentenceTransformer(r"C:\Users\ASUS\.cache\huggingface\hub\models--sentence-transformers--all-MiniLM-L6-v2\snapshots\ea78891063587eb050ed4166b20062eaf978037c")
# Load the precomputed embeddings
embeddings = np.load(r'c:\Users\ASUS\Desktop\BERTopic-Tutorial-main\embedding\emb6.npy')
print(embeddings.shape)
# Recompute the embeddings from the documents (this overwrites the loaded array; keep only one of the two)
embeddings = embedding_model.encode(docs)
print('Embeddings shape:', embeddings.shape)
# 2. Create the UMAP dimensionality reduction model
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42  # fixed seed to prevent random results, see https://maartengr.github.io/BERTopic/faq.html
)
# 3. Create the HDBSCAN clustering model
hdbscan_model = HDBSCAN(
    min_cluster_size=20,
    min_samples=1,
    metric='euclidean',
    prediction_data=True  # needed to compute which topic each document belongs to
)
# Define the stop word list (all lowercase, because CountVectorizer lowercases tokens by default)
stop_words = [
    'adolescent', 'adolescents', 'game', 'gaming', 'addiction', 'internet',
    'study', 'research', 'analysis', 'findings', 'results', 'literature',
    'review', 'impact', 'effects', 'factors', 'behavior', 'risk', 'prevention',
    'treatment', 'gamer', 'games', 'children', 'disorder', 'gamers', 'child',
    'student', 'students', 'play', 'player', 'plays', 'players', 'addictions',
]
# Initialize CountVectorizer with the stop words
vectorizer_model = CountVectorizer(stop_words=stop_words)
# 6. Create the BERTopic model
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics='auto'
)
# Inspect the topics
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass in the precomputed embeddings
topic_model.get_topic_info()
from bertopic.representation import MaximalMarginalRelevance  # import
# diversity: How diverse the selected keywords/keyphrases are.
# Values range between 0 and 1, with 0 being not diverse at all
# and 1 being most diverse.
representation_model = MaximalMarginalRelevance(diversity=0.5)  # create the MMR model
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model  # pass in the representation model
)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
topic_info = topic_model.get_topic_info()
topic_info
topic_docs = topic_model.get_document_info(docs)
topic_docs.to_csv('./聚类结果6.csv')
with open('./data/文本6.txt', 'r', encoding='utf-8') as file:
    texts = file.readlines()
print('Number of texts:', len(texts))
topic_docs.insert(1, '原文', texts)  # column '原文' holds the original text
with open('./data/时间6.txt', 'r', encoding='utf-8') as file:
    years = file.readlines()
print('Number of year entries:', len(years))
topic_docs.insert(2, '时间', years)  # column '时间' holds the time (year)
topic_docs.to_csv('./聚类结果66.csv')
topic_model.visualize_barchart(top_n_topics=23, custom_labels=True)
topic_model.visualize_topics()
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, hide_document_hover=True)
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)