This is a formalized description of literary Udmurt morphology, which also covers a number of dialectal elements. The description follows the UniParser format and comprises a description of the inflection (paradigms.txt), a grammatical dictionary (the udm_lexemes_XXX.txt files), a list of rules that annotate combinations of lexemes and grammatical values with additional Russian translations (lex_rules.txt), and a short list of analyses that should be avoided (bad_analyses.txt). The dictionary describes individual lexemes; each entry specifies the stem, the part-of-speech tag together with other grammatical/borrowing information, the inflectional type (paradigm), and a Russian translation.
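For orientation, a dictionary entry looks roughly like the sketch below. The field layout follows the general UniParser lexeme format, but the concrete values (lexeme, stem, paradigm name, translation) are illustrative; consult the udm_lexemes_XXX.txt files for the authoritative format.

```
-lexeme
 lex: кыл
 stem: кыл.
 gramm: N
 paradigm: noun
 trans_ru: язык; слово
```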
This description can be used for morphological analysis of Udmurt texts in the following ways:
- The Wordlists directory contains a frequency list of tokens based on the Udmurt corpus (9 million words of contemporary texts) and the output of the analyzer for this list. The simplest solution is to use this analyzed wordlist for annotating your texts (see the lookup sketch after this list). The recall of the analyzer on the corpus texts is about 96%, and the corpus is sufficiently large, so if you use the wordlist, the recall on your texts will most likely exceed 90%.
- The Analyzer directory contains the UniParser set of scripts together with all the necessary language files. You can use it to analyze your own frequency word list. You have to name your list "wordlist.csv" and put it in that directory; each line should contain one token and its frequency, tab-delimited (a sketch for preparing such a list follows this list). When you run analyzer/UniParser/analyze.py, the analyzer will produce two files, one with analyzed tokens and the other with unanalyzed ones. (You can also use other file names and separators through command-line options; see the code of analyze.py.) This way, you will not be restricted by my word list, but the analyzer works rather slowly (300-400 tokens per second).
- Finally, you are free to convert/adapt the description to whatever kind of morphological analysis you prefer to use.
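If you go the wordlist route, a lookup can be as simple as loading the analyzed list into a dictionary. The sketch below assumes a tab-delimited file whose first column is the token and whose remaining columns hold its analyses; the file name is hypothetical, so check the actual layout of the files in Wordlists before relying on it.

```python
def load_analyses(path):
    # Assumed layout: token TAB analysis [TAB analysis ...] per line;
    # adjust to the real structure of the analyzed wordlist.
    analyses = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if fields[0]:
                analyses[fields[0]] = fields[1:]
    return analyses

wordlist = load_analyses('Wordlists/wordlist_analyzed.txt')  # hypothetical name
for token in ['кыл', 'мон']:
    print(token, wordlist.get(token, ['<no analysis>']))
```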
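To prepare your own wordlist.csv for the analyzer, any frequency count in the stated token-TAB-frequency format will do. A minimal sketch, assuming a plain-text corpus file and a naive regex tokenizer (replace both with your own data and tokenization):

```python
import re
from collections import Counter

counts = Counter()
with open('my_udmurt_texts.txt', encoding='utf-8') as f:  # your corpus, hypothetical name
    for line in f:
        # Naive tokenization: sequences of word characters, lowercased.
        counts.update(re.findall(r'\w+', line.lower()))

# analyze.py expects one token and its frequency per line, tab-delimited.
with open('wordlist.csv', 'w', encoding='utf-8') as out:
    for token, freq in counts.most_common():
        out.write(f'{token}\t{freq}\n')
```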
Apart from the analyzer, this repository contains a set of Constraint Grammar rules that can be used for partial disambiguation of analyzed Udmurt texts. They reduce the average number of different analyses per analyzed token from about 1.6 to about 1.3.
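Constraint Grammar rules are normally applied with the vislcg3 tool. A minimal sketch, assuming vislcg3 is installed and your analyzed text has been converted to CG input format; the rule and input file names are placeholders, so substitute the actual rule file from this repository:

```python
import subprocess

# Pipe CG-formatted analyses through vislcg3 with the repository's rules.
with open('analyzed_text.cg', encoding='utf-8') as fin, \
        open('disambiguated_text.cg', 'w', encoding='utf-8') as fout:
    subprocess.run(['vislcg3', '-g', 'disambiguation.cg3'],  # placeholder rule file
                   stdin=fin, stdout=fout, check=True)
```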