- Tatoeba (>3600 sentence pairs). HuggingFace / Source
- Finugorbib (>30k sentence pairs). HuggingFace
- Soviet geography book (>2700 sentence pairs). HuggingFace
- FLORES-250, translation benchmark. HuggingFace / Other languages
- Udmurt news (udmddn.ru and oshmes.info, in total 36k sentences). HuggingFace
- Wikipedia dump (more than 43k sentences). Download
- MADLAD-400 (651k sentences, 9.5 million words). HuggingFace / All languages
- Glot500-c (121k sentences). GitHub
- Zerpal (1.4M sentences). HuggingFace
- Zerpal-udmdunne (8,154 rows, 5 labels). HuggingFace
- Zerpal-udmurtmedia (15,274 rows, 10 labels). HuggingFace
- Zerpal-pos-tagging (12,392 rows, 17 classes). HuggingFace
- WikiANN (the transcription is problematic: Latin and Cyrillic scripts are used inconsistently and Wikipedia markup is parsed incorrectly; if you still want to use it, see the `wikiann` directory)
- MURI-IT (2,751 rows). HuggingFace
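
The WikiANN entry above notes that Latin and Cyrillic are used inconsistently. If you want to inspect that kind of problem before using the data, a minimal sketch along these lines may help. It is not part of this repository: the function names `scripts_in`, `is_mixed_script`, and `flag_mixed_tokens` are hypothetical, and the check relies only on Unicode character names from Python's standard library.

```python
import unicodedata


def scripts_in(token):
    """Return which of 'LATIN' / 'CYRILLIC' occur among the letters of a token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits, punctuation, and markup symbols are ignored
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            found.add("LATIN")
        elif name.startswith("CYRILLIC"):
            found.add("CYRILLIC")
    return found


def is_mixed_script(token):
    """True if a single token contains both Latin and Cyrillic letters."""
    return len(scripts_in(token)) > 1


def flag_mixed_tokens(tokens):
    """Keep only the tokens that mix Latin and Cyrillic letters."""
    return [t for t in tokens if is_mixed_script(t)]


if __name__ == "__main__":
    # Hypothetical sample: the second token has Latin letters inside a Cyrillic word.
    sample = ["Izhkar", "\u0418\u0436kar", "\u0418\u0436\u043a\u0430\u0440", "2024"]
    for token in flag_mixed_tokens(sample):
        print(repr(token), sorted(scripts_in(token)))
```

This only flags mixing inside a single token. It does not catch whole sentences written in the wrong script or leftover Wikipedia markup, which would need separate checks.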