Widget for exploring and sampling words from text data through word vectors

SODAS-KU/w2widget

w2widget

Widget for exploring and sampling words from text data through word2vec models in order to construct topic dictionaries.

Installation

Use conda to create a new environment, then install the package and its requirements from the repository root:

conda create -n w2widget python=3.9
conda activate w2widget
pip install -e .

Examples

In widget_example.ipynb you can try out the widget with pretrained data from the Reuters dataset.

For an example of the data workflow that generates the necessary input, see workflow_example.ipynb.

Doc2Vec

This module calculates and handles doc2vec embeddings. Each document's vector is computed as a weighted average of the document's word vectors, where the weights are based on inverse word frequencies.

Note

I recommend using a sentence-embedding model instead of the following approach.

from w2widget.doc2vec import calculate_inverse_frequency, Doc2Vec

# Calculate word weights from inverse frequency
word_weights = calculate_inverse_frequency(document_tokens)

# Initiate the model
dv_model = Doc2Vec(wv_model, word_weights)

# Add documents and calculate the document vectors
dv_model.add_doc2vec(document_tokens)

# Reduce the document vectors to two dimensions
dv_model.reduce_dimensions()

# Store the 2-dimensional embedding
two_dim_doc_embedding = dv_model.TSNE_embedding_array
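
To make the weighted-average idea concrete, here is a minimal pure-Python sketch. It is not the w2widget implementation: the helper names inverse_frequency_weights and weighted_document_vector, the 1 / count weighting, and the toy word vectors are all assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(documents):
    """Hypothetical helper: weight each word by 1 / (corpus-wide count)."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: 1.0 / count for word, count in counts.items()}


def weighted_document_vector(tokens, word_vectors, weights):
    """Hypothetical helper: weighted average of the vectors of known tokens."""
    known = [t for t in tokens if t in word_vectors and t in weights]
    if not known:
        return None  # no usable tokens, so no document vector
    total_weight = sum(weights[t] for t in known)
    dim = len(word_vectors[known[0]])
    return [
        sum(weights[t] * word_vectors[t][i] for t in known) / total_weight
        for i in range(dim)
    ]


# Toy data, made up for illustration only
documents = [["oil", "price", "oil"], ["price", "trade"]]
word_vectors = {"oil": [1.0, 0.0], "price": [0.0, 1.0], "trade": [1.0, 1.0]}

weights = inverse_frequency_weights(documents)
doc_vectors = [
    weighted_document_vector(doc, word_vectors, weights) for doc in documents
]
```

In this sketch the rarer word "trade" gets twice the weight of "oil" and "price", so it has more influence on the second document's vector than a plain average would give it.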

Widget

The widget module displays results from the following inputs:

  • a gensim word2vec model,
  • its 2-dimensional embedding (i.e. t-SNE),
  • the custom doc2vec model,
  • its 2-dimensional embedding (i.e. t-SNE),
  • a list of tokenized documents with whitespace, and
  • optionally, a list of initial search words.

from w2widget.widget import Widget

wv_widget = Widget(
    wv_model,
    two_dim_word_embedding,
    tokens_with_ws,
    dv_model=None,
    two_dim_doc_embedding=None,
    initial_search_words=[],
)

wv_widget.display_widget()
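
The README does not spell out the exact format of the tokenized documents with whitespace. One plausible reading, and it is only an assumption, is that each token keeps its trailing whitespace so that joining a document's tokens reproduces the original text. The sketch below builds such a list with a regular expression; the function name tokenize_with_whitespace and the sample texts are hypothetical, and workflow_example.ipynb remains the reference for the format the widget actually expects.

```python
import re


def tokenize_with_whitespace(text):
    """Hypothetical helper: split text into tokens that keep trailing whitespace."""
    # Each match is a run of non-whitespace followed by any whitespace after it.
    return re.findall(r"\S+\s*", text)


# Sample texts, made up for illustration only
texts = [
    "Oil prices rose sharply.  Traders reacted.",
    "Exports fell in March.",
]
tokens_with_ws = [tokenize_with_whitespace(text) for text in texts]

# Joining the tokens gives back the original text (for text without leading whitespace)
reconstructed = ["".join(tokens) for tokens in tokens_with_ws]
```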

You can save the topics to a JSON file from within the widget, or access them directly through the dictionary stored in wv_widget.topics.
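
If you prefer to save the topics from code rather than from the widget, a sketch along the following lines may work. It assumes that the topics dictionary maps topic names to lists of words, which the README does not confirm; the topics variable here is a made-up stand-in for wv_widget.topics, and save_topics and load_topics are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for wv_widget.topics; the real structure may differ (assumption)
topics = {"energy": ["oil", "gas", "crude"], "trade": ["export", "tariff"]}


def save_topics(topics, path):
    """Hypothetical helper: write the topic dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(topics, f, indent=2, ensure_ascii=False, sort_keys=True)


def load_topics(path):
    """Hypothetical helper: read a topic dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Round trip through a temporary directory to show that nothing is lost
with tempfile.TemporaryDirectory() as tmp_dir:
    topics_path = Path(tmp_dir) / "topics.json"
    save_topics(topics, topics_path)
    restored = load_topics(topics_path)
```

If the values turn out to be sets or other types that JSON cannot represent, they would need converting to lists before saving.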
