This repository contains the implementation used for training and evaluating language models for extremely low-resource Finno-Ugric languages.
Pre-trained:
Instruction-tuned:
- tartuNLP/Llama-SMUGRI-7B-Instruct-MTI (SupInst+TrAlpaca)
- tartuNLP/Llama-SMUGRI-7B-Instruct-MTI-Tr (SupInst+TrAlpaca+TrInst)
- tartuNLP/Llama-SMUGRI-7B-Instruct-LLMTI (SupInst+LLMTrAlpaca)
- tartuNLP/Llama-SMUGRI-7B-Instruct-LLMTI-Tr (SupInst+LLMTrAlpaca+TrInst)
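
As a rough illustration of how one of the instruction-tuned checkpoints above might be used, the sketch below loads a model with the Hugging Face `transformers` library and generates a continuation. It assumes the listed names are Hugging Face Hub model IDs that load with `AutoModelForCausalLM`. The `INSTRUCT_MODELS` mapping and the `resolve_model_id` and `generate` helpers are hypothetical names introduced here and are not part of this repository. The prompt format the models expect is not stated in this section, so the sketch passes the prompt as raw text.

```python
# Hypothetical helpers for illustration; not part of this repository.
# The model IDs are copied from the list above and are assumed to be
# Hugging Face Hub identifiers.
INSTRUCT_MODELS = {
    "MTI": "tartuNLP/Llama-SMUGRI-7B-Instruct-MTI",
    "LLMTI": "tartuNLP/Llama-SMUGRI-7B-Instruct-LLMTI",
    "LLMTI-Tr": "tartuNLP/Llama-SMUGRI-7B-Instruct-LLMTI-Tr",
}


def resolve_model_id(variant: str) -> str:
    """Map a short variant name (e.g. "MTI") to its full model ID."""
    try:
        return INSTRUCT_MODELS[variant]
    except KeyError:
        known = ", ".join(sorted(INSTRUCT_MODELS))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}")


def generate(model_id: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Load a checkpoint and return the text generated after `prompt`."""
    # Imported lazily so the helpers above work without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()

    # Assumption: the prompt is fed as raw text; adapt this if the
    # checkpoint expects a specific instruction template.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate(resolve_model_id("MTI"), "...")` downloads the model weights on first use, so it needs network access and enough memory for a 7B-parameter model.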
Belebele-SMUGRI:
SIB-SMUGRI:
Scripts for launching training are provided in:
LM-eval-harness configurations:
@misc{purason2024llmsextremelylowresourcefinnougric,
  title={LLMs for Extremely Low-Resource Finno-Ugric Languages},
  author={Taido Purason and Hele-Andra Kuulmets and Mark Fishel},
  year={2024},
  eprint={2410.18902},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.18902},
}
The implementation is built on github.com/TartuNLP/llammas.