Embedded Topic Model


This package makes it easy to run embedded topic modelling (ETM) on a given corpus.

ETM is a topic model that marries the probabilistic topic modelling of Latent Dirichlet Allocation (LDA) with the contextual information brought by word embeddings, most notably word2vec. ETM models topics as points in the word embedding space, placing topics close to words with similar contexts. As such, ETM can either learn word embeddings alongside topics, or be given pretrained embeddings to discover the topic patterns of the corpus.
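Concretely, in the original article each topic k has an embedding vector, and the topic's word distribution is a softmax over the inner products between that vector and every word embedding. A minimal NumPy sketch of this idea, with made-up toy dimensions (illustrative only, not this package's internal code):

import numpy as np

vocab_size, embed_dim, num_topics = 1000, 300, 50

rho = np.random.randn(vocab_size, embed_dim)    # word embeddings, one row per word
alpha = np.random.randn(num_topics, embed_dim)  # topic embeddings, one row per topic

# Topic-word logits are inner products between topic and word embeddings
logits = alpha @ rho.T                          # shape: (num_topics, vocab_size)
logits -= logits.max(axis=1, keepdims=True)     # stabilize the softmax
beta = np.exp(logits)
beta /= beta.sum(axis=1, keepdims=True)         # beta[k] is topic k's word distribution

Because topics live in the same space as words, a topic's nearest word embeddings are exactly the words it assigns high probability to.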

ETM was originally published by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei in a 2019 article titled "Topic Modeling in Embedding Spaces". This code is an adaptation of the original implementation provided with the article. Most of the original code is kept here, with some changes made mostly for ease of use.

With the tools provided here, you can run ETM on your dataset in a few simple steps.

Installation

For now, you can use this package by cloning this repository (git clone https://github.com/bui-thanh-lam/embedded-topic-model.git) and installing it locally, e.g. with pip install . from the repository root. Installation directly via pip will be available soon.

Usage

To use ETM on your corpus, you must first preprocess the documents into a format the model understands. This package includes a quick-use preprocessing script. The only requirement is that the corpus be composed of a list of strings, where each string corresponds to a document in the corpus.

You can preprocess your corpus as follows:

from embedded_topic_model.utils import preprocessing
import json

# Loading a dataset in JSON format. As noted above, the corpus must be a list of strings
corpus_file = 'datasets/example_dataset.json'
documents_raw = json.load(open(corpus_file, 'r'))
documents = [document['body'] for document in documents_raw]

# Preprocessing the dataset
vocabulary, train_dataset, _ = preprocessing.create_etm_datasets(
    documents,
    min_df=0.01,
    max_df=0.75,
    train_size=0.85,
)
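A quick sanity check after this step can catch overly aggressive min_df/max_df filters. Assuming vocabulary is the list of terms kept after filtering (as in the original package this code adapts):

# Make sure the document-frequency filters did not empty the vocabulary
print(f'{len(vocabulary)} terms kept in the vocabulary')
assert len(vocabulary) > 0, 'min_df/max_df filtered out every term'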

Then, you can train word2vec embeddings to use with the ETM model. This step is optional: if you're not interested in training your own embeddings, you can either pass a pretrained word2vec embeddings file to ETM or let ETM learn the embeddings itself. If you want ETM to learn its word embeddings, just pass train_embeddings=True as an instance parameter.

To pretrain the embeddings, you can do the following:

from embedded_topic_model.utils import embedding

# Training word2vec embeddings
embeddings_mapping = embedding.create_word2vec_embedding_from_dataset(documents)
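To plug these pretrained embeddings into the model, the original package accepts an embeddings argument at model construction. Assuming this fork's ProdEtm class (introduced in the next section) keeps that interface — this is an assumption, so check the ProdEtm signature — it would look like:

# Assumed interface: pass the mapping and disable embedding training
# (verify that ProdEtm actually exposes these parameters)
prodetm = ProdEtm(
    len(vocabulary),
    num_topics=50,
    train_embeddings=False,
    embeddings=embeddings_mapping,
)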

To create and fit the model using the training data, execute:

from embedded_topic_model.core.nets import ProdEtm
from embedded_topic_model.core.topic_models import Trainer

# Declare the model architecture
prodetm = ProdEtm(
    len(vocabulary),
    num_topics=50,
    train_embeddings=True
)

# Declare a trainer to train/evaluate the model
topic_model = Trainer(
    vocabulary,
    prodetm
)

# Fit the model on the training data
topic_model.fit(train_dataset)

You can then obtain the topics, topic coherence, or topic diversity of the model as follows:

topics          = topic_model.get_topics(20)
topic_coherence = topic_model.get_topic_coherence()
topic_diversity = topic_model.get_topic_diversity()
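Here get_topics(20) returns the top 20 words per topic. Assuming each topic comes back as a list of words (as in the original package), a quick way to eyeball the result:

# Print each topic's top words on one line
for k, topic_words in enumerate(topics):
    print(f'Topic {k}: {" ".join(topic_words)}')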

Citation

To cite ETM, use the original article's citation:

@article{dieng2019topic,
    title = {Topic modeling in embedding spaces},
    author = {Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},
    journal = {arXiv preprint arXiv:1907.04907},
    year = {2019}
}

Acknowledgements

Credit goes to Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei for the original work.

License

Licensed under the MIT license.
