jmichellec/VU-ILT-Thesis

Topic Modelling for Dutch Texts

This repository contains the code written as part of the MA Linguistics (Text Mining) thesis "An Empirical Framework for Topic Modelling for Dutch Texts based on Newspaper Articles on Soil Pollution" and the accompanying internship at Inspectie Leefomgeving & Transport (ILT).

Folders

This section describes the data and models that are found in the folders.

Data

data-2010-2020.tsv: original, unpreprocessed dataset of newspaper articles from 2010 to 2020
complete-clean-preprocessed-data-2010-2020-1.tsv: preprocessed version of the dataset
Woonplaatsen_in_Nederland_2020_20122021_042012.csv: CBS list of places of residence (woonplaatsen) in the Netherlands, with minor adaptations
gold_top2vec.tsv: documents with gold labels for the annotation study
human_evaluation_results.tsv: annotation study responses compared with the gold labels
ILT-survey.tsv: original annotation study results (exported directly from Google Forms)
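The `.tsv` files above are tab-separated, so they can be read with the standard library alone. The following is a minimal sketch, not code from the notebooks: the helper names `parse_tsv` and `read_tsv` are hypothetical, and it assumes each file starts with a header row and is UTF-8 encoded, which the README does not confirm.

```python
import csv
from pathlib import Path


def parse_tsv(lines):
    """Turn an iterable of tab-separated lines into a list of dicts.

    Assumes the first line is a header row (not confirmed by the README).
    """
    return list(csv.DictReader(lines, delimiter="\t"))


def read_tsv(path, encoding="utf-8"):
    """Read a .tsv file from disk; the UTF-8 encoding is an assumption."""
    with Path(path).open(encoding=encoding, newline="") as handle:
        return parse_tsv(handle)
```

With pandas, which is listed under Packages, `pandas.read_csv(path, sep="\t")` would be the equivalent call.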

Embedding models

robbert-v2-dutch-base-finetuned-model: Fine-tuned RobBERT model
d2v.docvectors: Doc2Vec document vectors, trained from scratch (used for BERTopic)
d2v.model: Doc2Vec model (used for BERTopic)
d2v.model.trainables.syn1.npy: Doc2Vec trainables (used for BERTopic)
d2v.model.wv.vectors.npy: Doc2Vec word vectors (used for BERTopic)

Evaluation

The text files contain the topic coherence scores obtained during hyperparameter tuning of each model.
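The README does not say which coherence measure was used. As an illustration of what a topic coherence score captures, the sketch below computes mean NPMI (normalised pointwise mutual information) over pairs of a topic's top words, using document-level co-occurrence. The function name and the toy documents are hypothetical; the notebooks may use a different measure or a library implementation such as gensim's `CoherenceModel`.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words (needs at least two words).

    `documents` is a list of tokenised documents; probabilities are the
    fraction of documents in which a word (or word pair) occurs.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: lower bound of NPMI
        elif p12 == 1:
            scores.append(1.0)   # both words in every document: limit of NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)
```

Scores range from -1 (the top words never appear together) to 1 (they only appear together), so a higher value suggests a more coherent topic.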

Models

A pickle file for each model. Different hyperparameter settings are indicated by a _X suffix, where X is a number; see the code for the exact settings.
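A minimal sketch of how the `_X` suffix could be separated from the model name, and how a stored model could be loaded. The example file names (`lda_3.pkl`, `nmf.pkl`) and both helper functions are hypothetical; the actual file names in the folder may follow a different pattern.

```python
import pickle
import re
from pathlib import Path

# Assumed naming scheme: "<model>_<X>.<extension>", e.g. "lda_3.pkl" (hypothetical).
SETTING_SUFFIX = re.compile(r"^(?P<model>.+)_(?P<setting>\d+)$")


def split_setting(filename):
    """Return (model_name, setting_number); the number is None without a _X suffix."""
    stem = Path(filename).stem
    match = SETTING_SUFFIX.match(stem)
    if match is None:
        return stem, None
    return match.group("model"), int(match.group("setting"))


def load_model(path):
    """Unpickle a stored model. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Unpickling only works when the library that defines the model class is installed, preferably in the version listed under Packages.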

Plots

Contains plots of statistics of the dataset.

Training results

Contains a notebook with the masked language modelling code used to fine-tune RobBERT on the dataset.

Code

All model notebooks include evaluation using topic coherence.

data_analysis.ipynb: Data cleaning and statistics. Includes removal of metadata from body text
Traditional Models.ipynb: LDA & NMF, bag-of-words and TF-IDF processing
Top2Vec.ipynb: Top2Vec implementation and document semantic similarity search for annotation study
BERTopic.ipynb: BERTopic implementation using both vanilla and fine-tuned RobBERT

LDASEQ.ipynb (unused): Code for LDA over time

Packages

gensim == 4.1.2
latextable
numpy == 1.21.5
pandas == 1.4.1
session_info == 1.0.0
spacy == 3.2.4
tabulate == 0.8.9
texttable == 1.6.4
matplotlib == 3.5.1
nltk == 3.5
top2vec == 1.0.0
umap-learn == 0.5.2 (imported as umap)
bertopic == 0.9.4
transformers == 4.16.2
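To compare an environment against the versions listed above, the pins can be checked with the standard library. This is a sketch under assumptions: the helper names are hypothetical, and `importlib.metadata.version` expects distribution names as published on PyPI, which can differ from import names.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name == version' lines into {name: version}; unpinned names map to None."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed)} for packages that are missing or differ."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or (wanted is not None and installed != wanted):
            problems[name] = (wanted, installed)
    return problems
```

For example, `check_pins(parse_pins("gensim == 4.1.2\nlatextable"))` returns an empty dict when gensim 4.1.2 and any version of latextable are installed.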

License

MIT

About

Code MA thesis Linguistics (Text Mining) at VU/ILT
