Welcome to the main page of my project! This repository stores examples of linguistics problems.
My name is Daria. I'm a software engineer with skills in natural language processing. My main research interests are knowledge bases and fact extraction. These are very important analysis tools that provide semantic analysis and text mining.
The project has the following sections:
The source code currently supports three languages: English, Russian and Finnish. I hope that the problems published next will soon implement NLP algorithms for more languages.
Source code:
- Russian tokenizer
- Sentence boundary detection
- Transliteration Russian <=> Latin (with spell-checker)
- Word decomposition
- Camel case segmenter
- Distance to anagram
- Russian number2text converter
- Soundex Algorithm Implementation
- Syllable module (syllable count for Russian/English/Finnish words and syllable list for Russian/Finnish words)
- Russian patronymic generator
- Russian diminutive names generator
- Russian cases generator (dative)
- Russian cognate words checker
- English Adjective Comparisoner
- Common English question generator
- Finnish Predicative Sentences
- Finnish POS-tagger
- Finnish case tagger
- Russian POS-tagger
- N-gram dictionary (for spelling/for language modeling)
- Simple English word filler
- N-gram language model
- Collocations
- Russian diminutive names generator with RNN
- Russian character RNN (non-smoothing)
- Russian joking language model (PI Day)
- Simple spell-checker (based on n-grams and Damerau-Levenshtein distance)
- Advanced spell-checker based on:
  - a dictionary of words from good texts with a 2-3-gram index;
  - a language model trained with 2-grams on good texts;
  - retrieval of candidates with Damerau-Levenshtein distance;
  - finding the candidate with the maximum bigram probability: max{ P(prev_word, candidate), candidate in candidates }.
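
The advanced spell-checker steps can be illustrated with a minimal sketch. It is not the code from this repository: the names `osa_distance`, `train_bigrams` and `correct`, the toy corpus, and the `max_distance=2` threshold are all assumptions made for illustration. The distance used here is the restricted Damerau-Levenshtein variant (optimal string alignment), and candidates are found by a plain scan over the vocabulary instead of a 2-3-gram index.

```python
from collections import Counter


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def train_bigrams(sentences):
    """Collect the vocabulary and word bigram counts from 'good' texts."""
    bigrams = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer
        vocab.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, vocab


def correct(prev_word, word, bigrams, vocab, max_distance=2):
    """Return the candidate that maximizes P(prev_word, candidate)."""
    if word in vocab:  # assumption: known words are left unchanged
        return word
    candidates = [w for w in vocab if osa_distance(word, w) <= max_distance]
    if not candidates:
        return word
    total = sum(bigrams.values())
    # Ties on probability go to the smaller edit distance, then alphabetical order.
    return max(
        sorted(candidates),
        key=lambda c: (bigrams[(prev_word, c)] / total, -osa_distance(word, c)),
    )


# Hypothetical toy corpus standing in for the "good texts".
corpus = ["the cat sat on the mat", "the cat ate the fish", "a hat on the mat"]
bigrams, vocab = train_bigrams(corpus)
print(correct("the", "cta", bigrams, vocab))
```

A real implementation would likely replace the linear scan with the 2-3-gram index mentioned above and smooth the bigram probabilities, so that unseen word pairs do not all score zero.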