| layout | title |
|---|---|
| default | Resource matrix |
The special interest group for Uralic languages hosts an up-to-date list of resources for Uralic languages. This matrix tries to capture the state of computational resources for the Uralic languages, using linkable, downloadable and usable resources as references (rather than expert judgment, as some other similar matrices do). For a full list of available resources, users are advised to turn to services such as META-SHARE.
The columns are as follows:

- ISO 639: the closest applicable standard language code
- Language: the name of the language
- (group): the differentiating part of the language name, used for related or similarly named languages with separate language codes
- Orth: the status of the standard orthography
- Keyboard: freely available keyboard layouts for commonly used operating systems
- Corpora: freely available language data, spoken or written, annotated or not, carefully selected or not
- Speech: speech technology resources, such as synthesised speakers
- Morph: text analysers, morpho-syntactic or otherwise
- Treebank: treebanks and parsebanks with annotation above the word level
- MT: machine translators

The resources listed are ones that we have verified to be free to use, at least for research purposes but usually for everyone: free as in no cost and free as in no restrictions on the purpose of use (copyleft licences are acceptable). Systems must also be usable, ideally used by us or by researchers we know.
| ISO 639 | Language | (group) | Orth | Keyboard | Corpora | Speech | Morph | Treebank | MT |
|---|---|---|---|---|---|---|---|---|---|
| fin | Finnish | | ??? | ++++ | ??? | ??? | ++++ | ++ | +- |
| fkv | Kven | | ? | ? | ? | ? | ? | ? | ? |
| fit | Meänkieli | | ? | ? | ? | ? | ? | ? | ? |
| hun | Hungarian | | ? | ? | ? | ? | + | ? | ? |
| est | Estonian | | ? | ? | ? | ? | ? | ? | [+][fin-est-1] |
| ekk | (Estonian) | | | | | | | | |
| vro | Võro | | ? | ? | ? | ? | ? | ? | ? |
| sme | Sámi | North | ??? | [+][sme-keyboard-1] | ??? | ? | ?? | ? | ? |
| | Sámi | Lule | ? | ? | ? | ? | ? | ? | ? |
| sma | Sámi | South | ? | ? | ? | ? | ? | ? | ? |
| | Sámi | Inari | ? | ? | ? | ? | ? | ? | ? |
| sms | Sámi | Skolt | ? | ? | ? | ? | ? | ? | ? |
| | Sámi | Kildin | ? | ? | ? | ? | ? | ? | ? |
| kpv | Komi | Zyrian | ? | ? | ? | ? | ? | ? | |
| | Komi | Permyak | ? | ? | ? | ? | ? | ? | |
| udm | Udmurt | | ? | ? | ? | ? | ? | ? | ? |
| | Mari | Hill | ? | ? | ? | ? | ? | ? | |
| | Mari | Meadow | ? | ? | ? | ? | ? | ? | |
| | Mordvin | Erzya | ? | ? | ? | ? | ? | ? | |
| | Mordvin | Moksha | ? | ? | ? | ? | ?? | ? | |
| | Mansi | | ? | ? | ? | ? | ? | ? | ? |
| kca | Khanty | | ? | ? | ? | ? | ? | ? | ? |
| nio | Nganasan | | ? | ? | ? | ? | ? | ? | ? |
| | Enets | Tundra | ? | ? | ? | ? | ? | ? | |
| | Enets | Forest | ? | ? | ? | ? | ? | ? | |
| | Nenets | Tundra | ? | ? | ? | ? | ? | ? | |
| | Nenets | Forest | ? | ? | ? | ? | ? | ? | |
| krl | Karelian | Varsinais- | ? | ? | ? | ? | ? | ? | |
| izh | Ingrian | | ? | ? | ? | ? | ? | ? | |
| olo | Olonets | | ? | ? | ? | ? | ? | ? | - |
| | Selkup | | ? | ? | ? | ? | ? | ? | ? |
We use one plus sign (+) for each resource; an occasional hyphen-minus (-) denotes a work-in-progress version of data or software.
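The notation can be read mechanically. The following is a minimal sketch, not part of the matrix tooling: the function name `summarize_cell` and the returned keys are hypothetical, and it assumes that cells contain only the marks `+`, `-` and `?`, optionally wrapped in a reference-style Markdown link such as `[+][fin-est-1]`. The page does not define what `?` means, so the sketch only counts it.

```python
import re
from collections import Counter

# Reference-style Markdown link such as "[+][fin-est-1]":
# keep the visible label, drop the link identifier.
_REF_LINK = re.compile(r"\[([^\]]*)\]\[[^\]]*\]")


def summarize_cell(cell: str) -> dict:
    """Count the marks in one matrix cell (hypothetical helper).

    '+' marks a resource and '-' a work-in-progress one, following the
    notation of the matrix; '?' is counted but not interpreted.
    """
    visible = _REF_LINK.sub(r"\1", cell)
    counts = Counter(ch for ch in visible if ch in "+-?")
    return {
        "resources": counts["+"],
        "work_in_progress": counts["-"],
        "question_marks": counts["?"],
    }


if __name__ == "__main__":
    for cell in ["++++", "+-", "[+][fin-est-1]", "???"]:
        print(repr(cell), summarize_cell(cell))
```

Stripping the link identifier before counting matters because identifiers such as `fin-est-1` contain hyphens that would otherwise be miscounted as work-in-progress marks.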
We have tried to link all resources while avoiding spamming the list with derivations and forks of the same resource.
- Kotoistus keyboard layout, for national (SFS 5966) and international standards.
  Comes with all common operating systems: Microsoft’s, Linux, Apple’s and Android-based.
- Omorfi (see also: apertium-fin, giella-fin)
- Voikko (also: suomi-malaga, vfst morphology)
- GF Finnish
- Morfessor Finnish (Finnish models available?)
- Universal Dependencies Finnish (see also: Turku dependency treebank)
- Universal Dependencies Finnish FTB (see also: FinnTreeBanks)
- Apertium Finnish-English (high coverage, low quality)
- Apertium Finnish-German
- Apertium Finnish-Karelian Olonets
- [Apertium Finnish-Estonian][fin-est-1]
- GF Finnish to any
- [Divvun’s North Saami keyboard][sme-keyboard-1]
We maintain a list of mainly freely available and open source resources for Uralic languages. Please help us keep the list updated. Some of these resources are already used directly or indirectly in the matrix.
- Universal Dependencies, treebanks and dependency syntax conventions for Finnish, Estonian and Hungarian plus other world languages (includes Uralic guidelines)
- OPUS, open source parallel corpora for most of the world’s languages
- Giellatekno, a repository of Uralic analysers and tools covering most Uralic languages
- Apertium, machine translation dictionaries including some Uralic languages
- Grammatical Framework, Haskell descriptions of linguistic data, including a few Uralic languages
- Korp at CSC, a corpus search interface for CSC.fi-managed corpora
- Wanca, corpora from the SUKI project on harvesting the internet for Uralic texts
- Voikko, spell-checking for many Uralic languages
- Language Bank of Finland, Finland’s central repository of language resources
- Divvun, writers' tools for Saami languages and many others