This dataset consists of texts, derived from descriptions of Holocaust-related archival material, each of which is associated with one or more subject terms from the EHRI Terms controlled vocabulary.
This dataset is described more thoroughly in the following paper, and is published to support reproduction:
Multilingual Automated Subject Indexing: a comparative study of LLMs vs alternative approaches in the context of the EHRI project
- Maria Dermentzi, Orcid: 0000-0001-8159-7600
- Mike Bryant, Orcid: 0000-0003-0765-7390
- Fabio Rovigo, Orcid: 0000-0001-5760-3185
- Herminio García-González, Orcid: 0000-0001-5590-4857
See accompanying file: LICENSE.txt
The data was exported from the EHRI Portal on 2023-10-11.
The dataset is split into training and test portions, comprising 25,732 and 10,860
descriptions respectively. There is an additional, smaller test set for evaluation
purposes (named eval), consisting of 167 items.
The processing applied to this dataset and the stratification techniques used are described in the referenced paper.
The data is structured in the Annif full-text document corpus format,
where the basename of each file is the ID of an item on the EHRI Portal.
For example, the text for a description is contained in some-id.txt
and the associated subject terms in some-id.tsv.
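The pairing convention described above can be checked with a few lines of shell. This is a minimal sketch, not part of the published tooling: the function name `check_corpus` is hypothetical, and it assumes that every `.txt` file in a split directory such as data/train should have a `.tsv` file with the same basename alongside it.

```shell
# Hypothetical helper: report documents in a corpus directory that lack a subject file.
# Prints "<n> documents, <m> missing subject files" and returns non-zero if m > 0.
check_corpus() {
    dir="$1"
    docs=0
    missing=0
    for txt in "$dir"/*.txt; do
        # If the directory has no .txt files, the glob stays unexpanded; skip it.
        [ -e "$txt" ] || continue
        docs=$((docs + 1))
        if [ ! -f "${txt%.txt}.tsv" ]; then
            echo "no subject file for $(basename "$txt")" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$docs documents, $missing missing subject files"
    [ "$missing" -eq 0 ]
}
```

Running `check_corpus data/train` or `check_corpus data/eval` on an intact checkout should report 25,732 and 167 documents respectively, matching the split sizes given above.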
To reproduce the results of the paper, follow these steps:
# Install Git LFS; the method here assumes Debian-based Linux
sudo apt install git-lfs
git lfs install
# Note: *do this in a virtual environment*
# This assumes you're using virtualenvwrapper
mkvirtualenv ehri_masi
workon ehri_masi
# Check out and install the modified version of Annif.
git clone ehri_masi
cd ehri_masi
# Check out the branch with the modifications for this paper
git checkout ehri_masi
# Install python dependencies
pip install -r requirements-ehri.txt
pip install -e .[nn,fasttext,spacy,omikuji]
# Clone the dataset to the data directory
git clone data
# Activate the example `projects.cfg`
cp projects.cfg.ehri projects.cfg
# Load the test vocabulary with the name used in the `projects.cfg` file
annif load-vocab ehri_sm data/ehri_sm.ttl
# Clone the EHRI fine-tuned BERT-based model to the `models` directory
# NB: this may take some time...
git clone models/finetuned-bert-base-multilingual-cased-ehri-terms
# Train the Annif models
for model in tfidf-ehri mllm-ehri fasttext-ehri omikuji-parabel-ehri nn-ehri; do
    echo "Training $model..."
    annif train "$model" data/train
done
# Evaluate the Annif models
for model in tfidf-ehri mllm-ehri fasttext-ehri omikuji-parabel-ehri nn-ehri; do
    echo "Evaluating $model..."
    annif eval "$model" data/eval
done
# Evaluate the EHRI fine-tuned model
annif eval bertft-ehri data/eval
# Evaluate the MDeBERTa-based zero-shot model. This will be extremely slow
# unless you have a good CUDA-capable graphics card...
annif eval mdeberta-ehri data/eval
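Since the evaluation runs print their metrics to the terminal, it may be convenient to keep a copy of each run's output for later comparison. The following is a minimal sketch under stated assumptions: the helper name `run_and_log` and the `results` directory are hypothetical and not part of the Annif branch.

```shell
# Hypothetical helper: run a command, show its output, and keep a copy
# (stdout and stderr) in results/<name>.log.
run_and_log() {
    name="$1"
    shift
    mkdir -p results
    "$@" 2>&1 | tee "results/$name.log"
}
```

For example, `run_and_log bertft-ehri annif eval bertft-ehri data/eval` would leave the evaluation output in results/bertft-ehri.log. Note that the exit status of the pipeline is that of `tee`, not of the wrapped command.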
The ehri_masi branch includes a script for generating spreadsheets for qualitative evaluation.
It takes a set of files generated by the Annif index command and creates a Google Sheet for
comparing the output of each tool. The hard part is authenticating the gspread library,
which you need to do by following the gspread authentication instructions.
When you have obtained a credentials JSON file, save it in the project directory as .secrets/credentials.json
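Putting the credentials file in place might look like the sketch below. The function name `install_credentials` is hypothetical, and restricting the file permissions is a general precaution for secrets rather than something the script is known to require.

```shell
# Hypothetical helper: copy a downloaded credentials file to .secrets/credentials.json
# in the current (project) directory and make it readable by the owner only.
install_credentials() {
    src="$1"
    mkdir -p .secrets
    chmod 700 .secrets
    cp "$src" .secrets/credentials.json
    chmod 600 .secrets/credentials.json
}
```

Run it from the project directory, passing the path of the JSON file you downloaded as the only argument.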
# Generating spreadsheets for qualitative evaluation. Pick the classifiers you want to evaluate.
# Here we will use mdeberta-ehri, bertft-ehri, and nn-ehri.
# We use the --suffix option to distinguish each set of indexed files.
for model in nn-ehri bertft-ehri mdeberta-ehri; do
    echo "Indexing docs using $model..."
    annif index --suffix ".$model" "$model" data/eval
done
# Now we use the script to generate a spreadsheet.
python --name "My Eval" data/eval nn-ehri bertft-ehri mdeberta-ehri
If authentication has gone to plan, the script will print out the URL of the spreadsheet it created.
To see the script's other options, run it with --help.
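Before generating the spreadsheet, it can be worth confirming that the indexing step produced output for every document and every classifier. The sketch below assumes that `annif index --suffix ".nn-ehri"` writes its results next to each document with the `.txt` extension replaced by the suffix (for example some-id.nn-ehri); the function name `count_missing_output` is hypothetical.

```shell
# Hypothetical helper: print how many .txt documents in a directory
# have no companion file ending in the given suffix.
count_missing_output() {
    dir="$1"
    suffix="$2"
    missing=0
    for txt in "$dir"/*.txt; do
        # If the directory has no .txt files, the glob stays unexpanded; skip it.
        [ -e "$txt" ] || continue
        [ -f "${txt%.txt}$suffix" ] || missing=$((missing + 1))
    done
    echo "$missing"
}

# Report the number of documents without output for each classifier used above.
for model in nn-ehri bertft-ehri mdeberta-ehri; do
    echo "$model: $(count_missing_output data/eval ".$model") documents without output"
done
```

A count of zero for each model suggests the indexed files are complete; a non-zero count usually means the corresponding `annif index` run was interrupted or used a different suffix.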
v1.0.0 (2025-01-15): Initial release for paper submission.