Resource matrix

The special interest group for Uralic languages hosts an up-to-date list of resources for Uralic languages. This matrix tries to capture the state of computational resources for Uralic languages using linkable, downloadable and usable resources as references (rather than expert judgment, as some other similar matrices do). For a full list of available resources, users are advised to turn to services such as meta-share.

The columns are as follows:

  • ISO 639 is the closest applicable standard language code.
  • Language is the name of the language; for related or similarly named languages with separate language codes, (group) adds the differentiating part of the language name.
  • Orth describes the status of the standard orthography.
  • Keyboard covers freely available keyboard layouts for commonly used operating systems.
  • Corpora covers freely available language data, both spoken and written, annotated or not, carefully selected or not.
  • Speech covers speech technology resources such as synthesised voices.
  • Morph covers various text analysers, morpho-syntactic or otherwise.
  • Treebank covers treebanks and parsebanks with annotations above the word level.
  • MT covers machine translators.

The resources listed are ones that we have verified to be free to use, at least for research purposes but usually for everyone: free as in no cost and free as in no restrictions on the purpose of use; copyleft licences are acceptable. Systems must also be usable, ideally already used by us or by researchers we know.

ISO 639 Language (group) Orth Keyboard Corpora Speech Morph Treebank MT
fin Finnish ??? ++++ ??? ??? ++++ ++ +-
fkv Kven ? ? ? ? ? ? ?
fit Meänkieli ? ? ? ? ? ? ?
hun Hungarian ? ? ? ? + ? ?
est Estonian ? ? ? ? ? ? [+][fin-est-1]
ekk (Estonian)
vro Võro ? ? ? ? ? ? ?
sme Sámi North ??? [+][sme-keyboard-1] ??? ? ?? ? ?
Lule ? ? ? ? ? ? ?
sma South ? ? ? ? ? ? ?
Inari ? ? ? ? ? ? ?
sms Skolt ? ? ? ? ? ? ?
Kildin ? ? ? ? ? ? ?
kpv Komi Zyrian ? ? ? ? ? ?
Permyak ? ? ? ? ? ?
udm Udmurt ? ? ? ? ? ? ?
Mari Hill ? ? ? ? ? ?
Meadow ? ? ? ? ? ?
Mordvin Erzya ? ? ? ? ? ?
Moksha ? ? ? ? ?? ?
Mansi ? ? ? ? ? ? ?
kca Khanty ? ? ? ? ? ? ?
nio Nganasan ? ? ? ? ? ? ?
Enets Tundra ? ? ? ? ? ?
Forest ? ? ? ? ? ?
Nenets Tundra ? ? ? ? ? ?
Forest ? ? ? ? ? ?
krl Karelian Varsinais- ? ? ? ? ? ?
izh Ingrian ? ? ? ? ? ?
olo Olonets ? ? ? ? ? ? -
Selkup ? ? ? ? ? ? ?

We use a plus sign + for most resources; an occasional hyphen-minus - denotes a work-in-progress version of the data or software.
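For readers who want to process the matrix programmatically, the marker convention can be tallied with a short script. This is only a sketch: the function name `tally_cell` and the idea of counting markers per cell are assumptions made for illustration, not part of the matrix itself. The text does not define the question mark, so the sketch merely counts it without interpreting it. Link labels such as `[sme-keyboard-1]` are stripped first so that hyphens inside them are not mistaken for work-in-progress markers.

```python
import re
from collections import Counter

# Matches a reference-style link such as "[+][sme-keyboard-1]" and captures
# only the visible text ("+"), so the label is not counted as markers.
REF_LINK = re.compile(r"\[([^\]]*)\]\[[^\]]*\]")


def tally_cell(cell: str) -> dict:
    """Count the markers in one matrix cell (hypothetical helper)."""
    visible = REF_LINK.sub(r"\1", cell)
    counts = Counter(ch for ch in visible if ch in "+-?")
    return {
        "plus": counts["+"],            # resources
        "hyphen": counts["-"],          # work-in-progress resources
        "question_mark": counts["?"],   # meaning not defined in the text
    }


# Cells copied from the Finnish and North Sámi rows of the matrix.
for cell in ["++++", "+-", "???", "[+][sme-keyboard-1]"]:
    print(cell, tally_cell(cell))
```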

References by language

We have tried to link all resources while avoiding spamming the list with derivations and forks of the same resource.

Finnish

Finnish keyboard

  1. Kotoistus keyboard layout, for national (SFS 5966) and international standards

It ships with all common operating systems: Microsoft’s, Linux, Apple’s and Android-based ones.

Finnish Morphology

  1. Omorfi (see also: apertium-fin, giella-fin)
  2. Voikko (also: suomi-malaga, vfst morphology)
  3. GF Finnish
  4. Morfessor Finnish (Finnish models available?)

Finnish Treebanks

  1. Universal Dependencies Finnish (see also: Turku dependency treebank)
  2. Universal Dependencies Finnish FTB (see also: FinnTreeBanks)

Finnish Machine Translation

  1. Apertium Finnish-English (high coverage, low quality)
  2. GF Finnish to any

North Saami

North Saami keyboards

  1. [Divvun’s North Saami keyboard][sme-keyboard-1]

Hungarian

Hungarian morphologies

  1. hunmorph

Uncategorised references

We maintain a list of mainly freely available and open source resources for Uralic languages. Please help us keep the list updated. Some of these resources are already used directly or indirectly in the matrix.

Multi-language / general

Finnish

Estonian

Saami languages

  • Divvun Writers' tools for Saami languages, and lots of others