This repository presents a benchmark for Historical Language Models, with a main focus on NER datasets such as HIPE-2022.
The following Historical Language Models are currently used in benchmarks:
| Model   | Hugging Face Model Hub Org                                            |
|---------|-----------------------------------------------------------------------|
| hmBERT  | Historical Multilingual Language Models for Named Entity Recognition  |
| hmTEAMS | Historical Multilingual TEAMS Models                                  |
| hmByT5  | Historical Multilingual and Monolingual ByT5 Models                   |
We benchmark pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used:
| Language | Datasets                                                |
|----------|---------------------------------------------------------|
| English  | AjMC - TopRes19th                                       |
| German   | AjMC - NewsEye - HIPE-2020                              |
| French   | AjMC - ICDAR-Europeana - LeTemps - NewsEye - HIPE-2020  |
| Finnish  | NewsEye                                                 |
| Swedish  | NewsEye                                                 |
| Dutch    | ICDAR-Europeana                                         |
The hmLeaderboard space on the Hugging Face Model Hub shows all results and can be accessed here.
A collection of the best-performing models can be found here, grouped by the backbone LM used:
- Fine-Tuned Historical NER Models (hmTEAMS)
- Fine-Tuned Historical NER Models (hmBERT)
- Fine-Tuned Historical NER Models (hmByT5)
We use Flair for fine-tuning NER models on datasets from the HIPE-2022 Shared Task. Additionally, the ICDAR-Europeana dataset is used for benchmarks on Dutch and French.
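Both corpus types can be loaded directly through Flair's dataset loaders. The following is a minimal sketch, assuming a Flair release that ships the NER_HIPE_2022 and NER_ICDAR_EUROPEANA loaders:

```python
from flair.datasets import NER_HIPE_2022, NER_ICDAR_EUROPEANA

# Load the German AjMC corpus from the HIPE-2022 shared task data
hipe_corpus = NER_HIPE_2022(dataset_name="ajmc", language="de")

# Load the Dutch ICDAR-Europeana corpus
icdar_corpus = NER_ICDAR_EUROPEANA(language="nl")

# Print train/dev/test split sizes as a quick sanity check
print(hipe_corpus)
print(icdar_corpus)
```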
We use a tagged version of Flair to ensure reproducibility. The following command installs all necessary dependencies:
```bash
$ pip3 install -r requirements.txt
```
In order to use the hmTEAMS models, you need to authenticate with your account on the Hugging Face Model Hub. This can be done via the CLI:
```bash
# Use access token from https://huggingface.co/settings/tokens
$ huggingface-cli login
```
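If you prefer to authenticate programmatically (e.g. inside a script), the huggingface_hub library offers an equivalent login call; the token below is a placeholder:

```python
from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`;
# paste an access token from https://huggingface.co/settings/tokens
login(token="hf_xxx")  # placeholder token
```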
We use a config-driven hyper-parameter search. The script flair-fine-tuner.py can be used to fine-tune NER models from our Model Zoo.
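For orientation, here is a minimal sketch of what such a config-driven fine-tuning run looks like with Flair. The config keys (hf_model, batch_sizes, learning_rates, epochs) and the output path are illustrative assumptions; the actual schema is defined by flair-fine-tuner.py and the files in ./configs:

```python
import json

from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load one configuration from the hyper-parameter search grid
with open("configs/ajmc/de/hmbyt5.json") as f:
    config = json.load(f)

# Load the German AjMC corpus and build the NER label dictionary
corpus = NER_HIPE_2022(dataset_name="ajmc", language="de")
label_dictionary = corpus.make_label_dictionary(label_type="ner")

# Transformer embeddings over the backbone LM named in the config
embeddings = TransformerWordEmbeddings(
    model=config["hf_model"],  # assumed config key
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
)

# Plain linear tagger on top of the fine-tuned transformer
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dictionary,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Fine-tune with the first value of each (assumed) search dimension
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ajmc-de-hmbyt5",  # illustrative output path
    learning_rate=config["learning_rates"][0],
    mini_batch_size=config["batch_sizes"][0],
    max_epochs=config["epochs"][0],
)
```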
Additionally, we provide a script that uses Hugging Face AutoTrain Advanced (Space Runner) to fine-tune models. The following snippet shows an example:
```bash
$ pip3 install git+https://github.com/huggingface/autotrain-advanced.git
$ export HF_TOKEN="" # Get token from: https://huggingface.co/settings/tokens
$ autotrain spacerunner --project-name "flair-hmbench-hmbyt5-ajmc-de" \
  --script-path $(pwd) \
  --username stefan-it \
  --token $HF_TOKEN \
  --backend spaces-t4s \
  --env "CONFIG=configs/ajmc/de/hmbyt5.json;HF_TOKEN=$HF_TOKEN;HUB_ORG_NAME=stefan-it"
```
The concrete implementation can be found in script.py.
Note: the AutoTrain implementation is currently under development!
All configurations for fine-tuning are located in the ./configs folder, using the following naming convention:
./configs/<dataset-name>/<language>/<model-name>.json
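As an illustration, a config such as configs/ajmc/de/hmbyt5.json could define a small search grid like the one below. All keys and values here are hypothetical examples (JSON does not allow comments, so the hedging lives in this sentence); consult the actual files in ./configs for the real schema:

```json
{
  "hf_model": "hmbyt5/byt5-small-historic-multilingual",
  "batch_sizes": [4, 8],
  "learning_rates": [0.00015, 0.00016],
  "epochs": [10],
  "seeds": [1, 2, 3]
}
```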
- 17.10.2023: Over 1,200 models from the hyper-parameter search are now available on the Model Hub.
- 05.10.2023: Initial version of this repository.
We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️