
A System for Semantic Change Detection

This repository contains the result of activity A3.3 - R3.3.4, a tool for detecting semantic shifts and performing diachronic analyses, developed within the project Development of Slovene in a Digital Environment (Razvoj slovenščine v digitalnem okolju).


We propose a novel scalable method for word usage-change detection that offers large gains in processing time and significant memory savings while providing the same interpretability as, and better performance than, unscalable methods.

For more details, see the conference paper, published in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics.

Project structure

  • resources/ contains language resources (e.g. a list of stopwords) required for running the script on Slovenian datasets.
  • embeddings/ is a placeholder for generated temporal embeddings.
  • data/ contains an example dataset.

Setup

The installation instructions assume the use of the pip package manager.

Install the dependencies if needed: pip install -r requirements.txt
You also need to download 'tokenizers/punkt/english.pickle' using the NLTK library.
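
The resource can be fetched with the standard NLTK downloader, for example:

import nltk

# Downloads the Punkt sentence tokenizer models (includes english.pickle).
nltk.download('punkt')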

Prepare the data:

The script preprocess.py accepts a corpus in TSV format as input (see 'data/example_data.tsv' for an example). To run the script on the example data, run:

python preprocess.py  --data_path data/example_data.tsv --chunks_column date --text_column text --lang slo --output_dir output  --min_freq 10
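
The input file is a tab-separated table with one text per row, one column holding the time slice (here date) and one column holding the text (here text). A minimal sketch of building such a file with pandas (the file name and contents below are invented for illustration):

import pandas as pd

# Toy corpus in the expected layout: one row per document, one column for the
# time slice used to split the corpus and one column for the raw text.
df = pd.DataFrame({
    "date": ["2010", "2010", "2020", "2020"],
    "text": [
        "Prvi dokument iz leta 2010.",
        "Drugi dokument iz leta 2010.",
        "Prvi dokument iz leta 2020.",
        "Drugi dokument iz leta 2020.",
    ],
})
df.to_csv("data/my_corpus.tsv", sep="\t", index=False)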

Arguments:
--data_path Path to the tsv file containing the data
--chunks_column Name of the column in the data tsv file that should be used for splitting the corpus into chunks, between which semantic shift will be calculated
--text_column Name of the column in the data tsv file containing text
--lang Language of the corpus, currently only Slovenian ('slo') and English ('en') are supported
--output_dir Path to the folder that will contain generated output vocab and language_model training files
--min_freq Minimum frequency of the word in a specific chunk to be included in the vocabulary

Outputs:
1.) A preprocessed TSV corpus saved to the folder containing the input data
2.) vocab.pickle A pickled vocab class used as input for the script get_embeddings_scalable.py
3.) train_lm.txt An input train corpus for language model fine-tuning
4.) test_lm.txt An input test corpus for language model fine-tuning
5.) vocab_list_of_words.csv All words in the corpus vocab for which semantic shift will be calculated

Fine-tune RoBERTa language model:

To fine-tune the model on the corpus produced in the previous step, run:

python finetune_mlm.py --train_file output/train_lm.txt --validation_file output/test_lm.txt --output_dir models --data_path data/example_data.tsv --chunks_column date --model_name_or_path EMBEDDIA/sloberta --do_train true --do_eval true --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --save_steps 20000 --evaluation_strategy steps --eval_steps 20000 --overwrite_cache --num_train_epochs 10 --max_seq_length 512

Arguments:
--train_file An input train corpus for language model fine-tuning, generated by the preprocess.py script
--validation_file An input test corpus for language model fine-tuning, generated by the preprocess.py script
--output_dir Directory where the fine-tuned model is saved
--data_path Path to the tsv file containing the data
--chunks_column Name of the column in the data tsv file that should be used for splitting the corpus into chunks, between which semantic shift will be calculated
--model_name_or_path Which transformer model to use, currently only English 'roberta-base' and Slovenian 'EMBEDDIA/sloberta' models are supported

Outputs:
1.) A fine-tuned RoBERTa model that can be used for embedding generation in the script get_embeddings_scalable.py
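
After fine-tuning, the --output_dir is an ordinary Hugging Face checkpoint and can be loaded directly. A quick sanity check with the fill-mask pipeline (not part of the repository's scripts; the <mask> token assumes a RoBERTa-style tokenizer):

from transformers import pipeline

# Load the fine-tuned checkpoint saved to --output_dir ("models" in the example).
fill_mask = pipeline("fill-mask", model="models", tokenizer="models")
print(fill_mask("To je zelo dober <mask>.")[:3])  # top predictions for the masked token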

Extract embeddings:

Generate embeddings specific to each corpus chunk:

python get_embeddings_scalable.py --vocab_path output/vocab.pickle --embeddings_path embeddings/embeddings.pickle --lang slo --path_to_fine_tuned_model models --batch_size 16 --max_sequence_length 256 --device cuda

Arguments:
--vocab_path Path to the vocab pickle file generated by the preprocess.py script
--embeddings_path Path to the output pickle file containing embeddings
--lang Language of the corpus, currently only Slovenian ('slo') and English ('en') are supported
--path_to_fine_tuned_model Path to the fine-tuned model. If empty, the pretrained model is used

Outputs:
1.) An output pickle file containing the embeddings
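
For reference, the gist of this step is to run each sentence through the (fine-tuned) model and collect contextual vectors for the vocabulary words it contains. A rough, simplified illustration of that idea (the repository's actual logic lives in get_embeddings_scalable.py and keeps a vector per target word occurrence):

import torch
from transformers import AutoTokenizer, AutoModel

model_path = "models"  # fine-tuned model directory, or e.g. "EMBEDDIA/sloberta"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

sentence = "Diplomat je objavil izjavo."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]  # (num_subwords, hidden_dim)

# Drop the special tokens and average the remaining subword vectors; the real
# pipeline instead aggregates the subword vectors of each vocabulary word.
sentence_vector = hidden[1:-1].mean(dim=0)
print(sentence_vector.shape)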

Conduct clustering and measure semantic shift:

python measure_semantic_shift.py --output_dir output --embeddings_path embeddings/embeddings.pickle --random_state 123 --cluster_size_threshold 10 --metric JSD

Arguments:
--output_dir Path to the output results folder
--embeddings_path Path to input pickle file containing embeddings
--random_state Random seed
--cluster_size_threshold A cluster is removed or merged with another if it contains fewer word usages than this threshold
--metric Which metric to use for measuring semantic shift; should be JSD or WD

Outputs:
1.) word_list_results.csv A list of all words in the vocab with their semantic change scores
2.) corpus_slices.pkl A pickled list of corpus slices used as input for the script 'interpretation.py'
3.) id2sents.pkl A pickled sentence dictionary used as input for the script 'interpretation.py'
4.) kmeans_5_labels.pkl A pickled dictionary of k-means cluster labels for each word usage, used as input for the script 'interpretation.py'
5.) sents.pkl A pickled list of all sentences used as input for the script 'interpretation.py'
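
The change score for a word compares how its usages are distributed over the k-means clusters in different corpus slices. A minimal sketch of the JSD variant (assuming JSD stands for the Jensen-Shannon divergence; the cluster counts below are made up):

import numpy as np
from scipy.spatial.distance import jensenshannon

counts_slice1 = np.array([40, 5, 3, 0, 2])   # usages per cluster in slice 1
counts_slice2 = np.array([10, 2, 1, 30, 7])  # usages per cluster in slice 2

p = counts_slice1 / counts_slice1.sum()
q = counts_slice2 / counts_slice2.sum()

# SciPy returns the Jensen-Shannon distance (square root of the divergence).
jsd = jensenshannon(p, q, base=2) ** 2
print(f"JSD semantic shift score: {jsd:.3f}")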

Extract keywords for each cluster and plot cluster distributions for interpretation:

python interpretation.py  --target_words "diplomat,objava" --lang slo --input_dir output --results_dir results --cluster_size_threshold 10 --max_df 0.8 --num_keywords 10

Arguments:
--target_words Target words to analyse, separated by comma
--lang Language of the corpus, currently only Slovenian ('slo') and English ('en') are supported
--input_dir Folder containing data generated by the script 'measure_semantic_shift.py'
--results_dir Path to the folder for the final results
--max_df Words that appear in more than this fraction of clusters will not be used as keywords
--cluster_size_threshold Clusters smaller than this threshold will be deleted
--num_keywords Number of keywords per cluster

Outputs:
1.) An image showing the distribution of word usages for each target word
2.) A TSV document for each target word containing information about the sentences in which it appeared and the cluster to which each usage was assigned
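
Conceptually, the keyword step treats the sentences of each cluster as one document and ranks terms by TF-IDF, discarding terms that occur in more than --max_df of the clusters. A toy sketch of that idea with scikit-learn (the cluster texts are invented; interpretation.py may differ in detail):

from sklearn.feature_extraction.text import TfidfVectorizer

# One concatenated "document" per cluster (toy data).
cluster_texts = [
    "diplomat veleposlanik ambasada pogajanja diplomat",
    "objava članek splet novica portal objava",
    "objava razglas uradni list zakon",
]

vectorizer = TfidfVectorizer(max_df=0.8)  # drop terms appearing in >80% of clusters
tfidf = vectorizer.fit_transform(cluster_texts)
terms = vectorizer.get_feature_names_out()

num_keywords = 10
for i in range(tfidf.shape[0]):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:num_keywords]
    print(f"cluster {i}:", [terms[j] for j in top if row[j] > 0])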


The operation Development of Slovene in a Digital Environment (Razvoj slovenščine v digitalnem okolju) is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund. The operation is carried out within the Operational Programme for the Implementation of the EU Cohesion Policy in the period 2014-2020.


The SloCOREF tool (modeling and training) was partly financed by CLARIN.SI.
