This repository contains the code created for the MA Linguistics (Text Mining) thesis and the internship at Inspectie Leefomgeving & Transport: "An Empirical Framework for Topic Modelling for Dutch Texts based on Newspaper Articles on Soil Pollution".
This section describes the data and models found in the folders.
• data-2010-2020.tsv: original unpreprocessed dataset containing newspaper articles from 2010 to 2020
• complete-clean-preprocessed-data-2010-2020-1.tsv: preprocessed data
• Woonplaatsen_in_Nederland_2020_20122021_042012.csv: CBS list of places of residence in the Netherlands (woonplaatsen), with minor adaptations
• gold_top2vec.tsv: documents with gold labels for the annotation study
• human_evaluation_results.tsv: responses from the annotation study, set against the gold labels
• ILT-survey.tsv: original results of the annotation study (straight from Google Forms)
• robbert-v2-dutch-base-finetuned-model: Fine-tuned RobBERT model
• d2v.docvectors: Doc2Vec document vectors, trained from scratch (used for BERTopic)
• d2v.model: Doc2Vec model (used for BERTopic)
• d2v.model.trainables.syn1.npy: Doc2Vec trainables (used for BERTopic)
• d2v.model.wv.vectors.npy: Doc2Vec word vectors (used for BERTopic)
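The Doc2Vec model is stored as several files, and gensim expects the `.npy` files to sit next to `d2v.model` when loading. A minimal sketch of checking for them before loading is given below. The `model_dir` argument and both function names are hypothetical, and whether the load succeeds may also depend on the gensim version the model was saved with.

```python
from pathlib import Path

# Files listed above that belong to the saved Doc2Vec model itself.
DOC2VEC_MODEL_FILES = (
    "d2v.model",
    "d2v.model.trainables.syn1.npy",
    "d2v.model.wv.vectors.npy",
)


def missing_doc2vec_files(model_dir):
    """Return the names of expected model files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in DOC2VEC_MODEL_FILES if not (folder / name).is_file()]


def load_doc2vec(model_dir):
    """Load d2v.model from model_dir (hypothetical location) after a file check."""
    missing = missing_doc2vec_files(model_dir)
    if missing:
        raise FileNotFoundError(f"Missing Doc2Vec files in {model_dir}: {missing}")
    # Imported lazily so the file check works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec.load(str(Path(model_dir) / "d2v.model"))
```

`d2v.docvectors` is a separate file and is not touched by this sketch.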
All text files contain the topic coherence scores from the hyperparameter tuning of each model.
One pickle file per model, including the different hyperparameter settings. These are marked with the suffix _X, where X is a number. See the code for the exact settings.
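The `_X` suffix can be used to index the pickled models by hyperparameter setting. The sketch below assumes file names of the form `<model>_<X>.pkl`; the `.pkl` extension, the function names, and the folder argument are assumptions, so adjust them to the actual files. Unpickling a model also requires the library that created it to be installed.

```python
import pickle
import re
from pathlib import Path

# Matches a trailing "_<number>" in a file stem, e.g. "lda_3" (assumed naming).
SUFFIX_PATTERN = re.compile(r"^(?P<name>.+)_(?P<setting>\d+)$")


def index_model_pickles(folder, extension=".pkl"):
    """Map (model name, setting number or None) to the pickle path."""
    index = {}
    for path in sorted(Path(folder).glob(f"*{extension}")):
        match = SUFFIX_PATTERN.match(path.stem)
        if match:
            key = (match.group("name"), int(match.group("setting")))
        else:
            key = (path.stem, None)  # file without a _X suffix
        index[key] = path
    return index


def load_model(path):
    """Unpickle one model. Unpickling runs code, so only load trusted files."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```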
Contains plots of dataset statistics.
Contains a notebook with code for masked language modelling, used to fine-tune RobBERT on the dataset.
All model notebooks include evaluation using topic coherence.
• data_analysis.ipynb: Data cleaning and statistics. Includes removal of metadata from body text
• Traditional Models.ipynb: LDA & NMF, bag-of-words and TF-IDF processing
• Top2Vec.ipynb: Top2Vec implementation and document semantic similarity search for annotation study
• BERTopic.ipynb: BERTopic implementation using both vanilla and fine-tuned RobBERT
• LDASEQ.ipynb (unused): Code for LDA over time
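To illustrate what a topic coherence score measures, the sketch below computes UMass coherence for one topic in plain Python. This is an assumption chosen for illustration: the notebooks may use a different coherence measure or a library implementation, and the function name and toy documents are hypothetical.

```python
import math


def umass_coherence(topic_words, documents):
    """Mean UMass coherence of one topic.

    topic_words: top words of the topic, most probable first.
    documents:   tokenised documents (each a list of tokens).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip pairs whose conditioning word never occurs
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy documents, for illustration only.
docs = [
    ["bodem", "vervuiling"],
    ["bodem", "sanering"],
    ["bodem", "vervuiling", "sanering"],
]
score = umass_coherence(["bodem", "vervuiling"], docs)
```

Scores closer to zero indicate that the top words of a topic tend to occur in the same documents.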
gensim == 4.1.2
latextable
numpy == 1.21.5
pandas == 1.4.1
session_info == 1.0.0
spacy == 3.2.4
tabulate == 0.8.9
texttable == 1.6.4
matplotlib == 3.5.1
nltk == 3.5
top2vec == 1.0.0
umap-learn == 0.5.2
bertopic == 0.9.4
transformers == 4.16.2
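A quick way to compare an environment against these pins is sketched below with the standard library (Python 3.8 or later). The `PINNED` subset and the function name are illustrative assumptions. Names must be distribution names as used by pip, which can differ from import names.

```python
from importlib.metadata import PackageNotFoundError, version

# Illustrative subset of the pins listed above.
PINNED = {"gensim": "4.1.2", "numpy": "1.21.5", "pandas": "1.4.1"}


def find_mismatches(pins):
    """Return {name: (wanted, installed)} for every pin that is not satisfied.

    installed is None when the distribution is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dictionary when all three versions match.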