This repository has been archived by the owner on Jun 2, 2021. It is now read-only.

An incremental clustering system that maintains a growing number of topic clusters for news articles arriving online from a crawler.


vanam/Incremental-News-Clustering

Incremental News Clustering

The goal was to research model-based clustering methods, notably the Distance Dependent Chinese Restaurant Process (ddCRP), and propose an incremental clustering system which would be capable of maintaining the growing number of topic clusters of news articles coming online from a crawler. LDA, LSA, and doc2vec methods were used to represent a document as a fixed-length numeric vector. Cluster assignments given by a proof-of-concept implementation of such a system were evaluated using various metrics, notably purity, F-measure and V-measure. A modification of V-measure -- NV-measure -- was introduced in order to penalize an excessive or insufficient number of clusters. The best results were achieved with doc2vec and ddCRP.
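To make the evaluation metrics named above concrete, here is a minimal sketch of purity and V-measure in plain Python. It is not the project's own evaluation code: the function names `purity` and `v_measure` and the toy labels in the usage note are assumptions chosen for illustration, and V-measure follows the standard definition as a weighted harmonic mean of homogeneity and completeness. NV-measure is defined in the thesis and is not reproduced here.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        per_cluster.setdefault(cluster, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels_true)


def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


def _conditional_entropy(joint, given_counts, n):
    """H(X | Y) from joint counts keyed by (x, y) and the marginal counts of Y."""
    return -sum(c / n * math.log(c / given_counts[y]) for (_, y), c in joint.items())


def v_measure(labels_true, labels_pred, beta=1.0):
    """Standard V-measure: weighted harmonic mean of homogeneity and completeness."""
    n = len(labels_true)
    classes = Counter(labels_true)
    clusters = Counter(labels_pred)
    class_cluster = Counter(zip(labels_true, labels_pred))  # keys: (class, cluster)
    cluster_class = Counter(zip(labels_pred, labels_true))  # keys: (cluster, class)

    h_classes = _entropy(classes.values(), n)
    h_clusters = _entropy(clusters.values(), n)
    h_classes_given_clusters = _conditional_entropy(class_cluster, clusters, n)
    h_clusters_given_classes = _conditional_entropy(cluster_class, classes, n)

    # By convention both scores are 1 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_classes == 0 else 1.0 - h_classes_given_clusters / h_classes
    completeness = 1.0 if h_clusters == 0 else 1.0 - h_clusters_given_classes / h_clusters

    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
```

For example, with true topics `["a", "a", "b", "b"]`, putting every article in its own cluster (`[0, 1, 2, 3]`) still gives a purity of 1.0, while V-measure drops to about 0.67 because completeness is only 0.5. Purity alone therefore rewards an excessive number of clusters, which is why it is reported alongside other metrics.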

Due to copyright, the news articles used in the experiments are available only at the university library.

Full thesis text: thesis.pdf
Poster: Vana_Martin_2018.pdf

BibTeX citation:

@MASTERSTHESIS {martinvana2018,
    author  = "Martin Váňa",
    title   = "Incremental News Clustering",
    school  = "University of West Bohemia",
    year    = "2018",
    address = "Pilsen",
    month   = "may"
}

Installation

Requirements

  • Python 3.5
  • Pip
  • Pipenv

Ubuntu

$ sudo apt-get install python3 python3-tk python3-pip
$ pip3 install pipenv

Project dependencies

$ pipenv install --dev

If installation fails, retry while ignoring the lock file:

$ pipenv install --dev --skip-lock

Add the following line to ~/.bashrc so that Python searches the current working directory for modules:

export PYTHONPATH='.'

Development

Configure PyCharm

Activate project's virtualenv

$ pipenv shell

Run script

$ pipenv run python <script_name>.py

Run tests

$ pipenv run pytest tests
