Dataset used for EHRI Multi-lingual Automated Subject Indexing (MASI) paper

EHRI Multilingual Subject Indexing Test Dataset

Overview

This dataset consists of texts derived from descriptions of Holocaust-related archival material, each of which is associated with one or more subject terms from the EHRI Terms controlled vocabulary.

The dataset is described in more detail in the following paper and is published to support reproduction of its results:

Multilingual Automated Subject Indexing: a comparative study of LLMs vs alternative approaches in the context of the EHRI project

Authors

  • Maria Dermentzi, Orcid: 0000-0001-8159-7600
  • Mike Bryant, Orcid: 0000-0003-0765-7390
  • Fabio Rovigo, Orcid: 0000-0001-5760-3185
  • Herminio García-González, Orcid: 0000-0001-5590-4857

License

See accompanying file: LICENSE.txt

Data collection

The data was exported from the EHRI Portal on 2023-10-11.

Structure

The dataset is split into a training and test portion, comprising 25,732 and 10,860 descriptions respectively. There is an additional (smaller) test set for evaluation purposes (named eval) consisting of 167 items.
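As a quick sanity check after cloning, the split sizes can be verified by counting the `.txt` files in each split directory. A minimal sketch, assuming the splits live under `data/` in directories named `train`, `test`, and `eval` (`train` and `eval` appear in the commands below; the `test` directory name is an assumption):

```python
from pathlib import Path

def count_split(root: str, split: str) -> int:
    """Count the .txt description files in one corpus split."""
    return len(list(Path(root, split).glob("*.txt")))

# Directory names for train/eval come from the reproduction steps below;
# "test" is assumed. Missing directories simply count as zero.
for split in ("train", "test", "eval"):
    print(f"{split}: {count_split('data', split)} descriptions")
```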

Data processing

The processing applied to this dataset and the stratification techniques used are described in the referenced paper.

Format

The data is structured in the Annif full-text document corpus format, where the basename of each file is the ID of an item on the EHRI Portal. For example, the text for a description is contained in some-id.txt and the associated subject terms in some-id.tsv.
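For illustration, one such file pair can be read back as follows. This is a minimal sketch: `load_item` is a hypothetical helper, not part of the dataset or of Annif, and it assumes each `.tsv` line holds a subject URI in angle brackets followed by a tab and the term's label (the usual Annif TSV convention):

```python
from pathlib import Path

def load_item(corpus_dir: str, item_id: str):
    """Read one (text, subjects) pair from an Annif full-text corpus.

    Hypothetical helper; assumes subject lines look like '<uri>\\tlabel'.
    """
    txt_path = Path(corpus_dir) / f"{item_id}.txt"
    tsv_path = Path(corpus_dir) / f"{item_id}.tsv"
    text = txt_path.read_text(encoding="utf-8")
    subjects = []
    for line in tsv_path.read_text(encoding="utf-8").splitlines():
        uri, _, label = line.partition("\t")
        subjects.append((uri.strip("<>"), label))
    return text, subjects
```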

Reproducibility

To reproduce the results of the paper, follow these steps:

# Install Git LFS; this method assumes Debian-based Linux
sudo apt install git-lfs
git lfs install

# Note: *do this in a virtual environment*
# This assumes you're using virtualenvwrapper
mkvirtualenv ehri_masi
workon ehri_masi

# Checkout and install the modified version of Annif.
git clone https://github.com/EHRI/Annif.git ehri_masi
cd ehri_masi

# Checkout the branch with the modifications for this paper
git checkout ehri_masi

# Install python dependencies
pip install -r requirements-ehri.txt
pip install -e .[nn,fasttext,spacy,omikuji]

# Clone the dataset to the data directory
git clone https://github.com/EHRI/ehri-masi-data.git data

# Activate the example `projects.cfg`
cp projects.cfg.ehri projects.cfg

# Load the test vocabulary under the name used in the projects.cfg file
annif load-vocab ehri_sm data/ehri_sm.ttl

# Clone the EHRI fine-tuned BERT-based model to the `models` directory
# NB: this may take some time...
git clone https://huggingface.co/mdermentzi/finetuned-bert-base-multilingual-cased-ehri-terms models/finetuned-bert-base-multilingual-cased-ehri-terms

# Train the Annif models
for model in tfidf-ehri mllm-ehri fasttext-ehri omikuji-parabel-ehri nn-ehri; do
  echo "Training $model..."
  annif train $model data/train
done

# Evaluate the Annif models
for model in tfidf-ehri mllm-ehri fasttext-ehri omikuji-parabel-ehri nn-ehri; do
  echo "Evaluating $model..."
  annif eval $model data/eval
done

# Evaluate the EHRI fine-tuned model
annif eval bertft-ehri data/eval

# Evaluate the MDeBERTa based zero-shot model. This will be extremely slow
# unless you have a good CUDA-supporting graphics card...
annif eval mdeberta-ehri data/eval

Qualitative evaluation spreadsheets

The ehri_masi Annif branch contains a script for generating spreadsheets for qualitative evaluation, annif-to-spreadsheet.py. It takes a set of files generated by the Annif index command and creates a Google Sheet comparing the output of each tool. The tricky part is authenticating the gspread library, which you need to do as per the following instructions:

https://docs.gspread.org/en/v6.1.3/oauth2.html#enable-api-access-for-a-project

When you have obtained a credentials JSON file, save it in the project directory as .secrets/credentials.json.
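For reference, a Google service-account key (one of the authentication methods gspread supports) has roughly the following shape. All values below are placeholders, not working credentials; consult the gspread documentation linked above for the exact flow you are using:

```json
{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "PLACEHOLDER",
  "private_key": "-----BEGIN PRIVATE KEY-----\nPLACEHOLDER\n-----END PRIVATE KEY-----\n",
  "client_email": "my-bot@my-project.iam.gserviceaccount.com",
  "client_id": "PLACEHOLDER",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

Keep this file out of version control; the .secrets directory name used above makes that easy to enforce via .gitignore.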

# Generating spreadsheets for qualitative evaluation. Pick the classifiers you want to evaluate. 
# Here we will use mdeberta-ehri, bertft-ehri, and nn-ehri.
# We use the --suffix option to distinguish each set of indexed files.
for model in nn-ehri bertft-ehri mdeberta-ehri; do
  echo "Indexing docs using $model..."
  annif index --suffix ".$model" $model data/eval
done

# Now we use the `annif-to-spreadsheet.py` script to generate a spreadsheet.
python annif-to-spreadsheet.py --name "My Eval" data/eval nn-ehri bertft-ehri mdeberta-ehri 

If authentication has gone to plan, the script will print the URL of the spreadsheet it created. To see the other options of annif-to-spreadsheet.py, run:

python annif-to-spreadsheet.py --help

Version history

v1.0.0 (2025-01-15): Initial release for paper submission.
