Input Method Engine Using Neural Networks

An Input Method Engine (IME) is a program that facilitates the input of non-English languages on digital devices. This work improves upon traditional n-gram-based Chinese Pinyin IMEs by incorporating previous context and using a Seq2Seq neural network model with end-to-end training. Our model simplifies the NLP pipeline while maintaining some tolerance for Pinyin abbreviations and typos. Our evaluation shows that it significantly outperforms the baseline bigram model in prediction accuracy. We also built a Chrome extension frontend to help users type Chinese on any web page.

See our full project report for more details.

Dev Notes

Dependencies

Python 3 and TensorFlow 1.0 are used.
To install all the dependencies (note: TensorFlow is not included):
- run pip3 install -r requirement.txt

Data prep

Download the larger corpora to /data:
- Lancaster Corpus of Mandarin Chinese (LCMC), SQLite version: Download lcmc.db3
- Weibo corpus: Download weibo.txt
Run data_extractor.py to generate datasets and samples.
- data/lcmc_clean.data: Pickle-dumped byte file containing [context, pinyins, chars] triples without any added noise. In the current version, with context_window=10 and max_input_window=5, the resulting file is around 300 MB. Change the first_n parameter to generate a smaller file.
- data/sms_clean.data: SMS corpus triples, same format as above. 60 MB.
- data/weibo_clean.data: Weibo corpus triples, same format as above. Large.
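
The generated files can be inspected with a few lines of Python. The sketch below assumes each file is a single pickle-dumped list of [context, pinyins, chars] triples; the exact on-disk layout is defined by data_extractor.py, so check it if loading fails. The helper name load_triples and the sample triple are hypothetical and only illustrate the format.

```python
import os
import pickle
import tempfile


def load_triples(path, first_n=None):
    """Load [context, pinyins, chars] triples from a pickle-dumped file.

    Assumption: the file holds one pickled list of triples.
    Only unpickle files you generated yourself; pickle can run arbitrary code.
    """
    with open(path, "rb") as f:
        triples = pickle.load(f)
    return triples if first_n is None else triples[:first_n]


if __name__ == "__main__":
    # Hypothetical sample triple, written to a temp file so the sketch
    # runs without the real corpora.
    sample = [["今天天气", ["hen", "hao"], "很好"]]
    with tempfile.NamedTemporaryFile(suffix=".data", delete=False) as tmp:
        pickle.dump(sample, tmp)
    for context, pinyins, chars in load_triples(tmp.name, first_n=1):
        print(context, pinyins, chars)
    os.remove(tmp.name)
```

For the real files, call load_triples("data/lcmc_clean.data", first_n=5) to look at the first few examples.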

Evaluation

To run the evaluation:
- run python3 eval.py --model [model] --k [k]
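
The --k flag suggests that predictions are scored against the top k candidates, although the exact metric is defined in eval.py and metric.py. Under that assumption, top-k accuracy can be sketched as follows. The function top_k_accuracy and the sample candidates are hypothetical and are not code from this repository.

```python
def top_k_accuracy(candidate_lists, targets, k):
    """Fraction of examples whose target appears among the first k candidates.

    candidate_lists: one ranked list of candidate strings per example.
    targets: the expected string for each example.
    """
    if len(candidate_lists) != len(targets):
        raise ValueError("candidate_lists and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for cands, target in zip(candidate_lists, targets) if target in cands[:k]
    )
    return hits / len(targets)


# Hypothetical ranked candidates for two Pinyin inputs.
candidates = [["你好", "拟好", "你号"], ["是界", "世界", "视界"]]
targets = ["你好", "世界"]
print(top_k_accuracy(candidates, targets, k=1))  # 0.5
print(top_k_accuracy(candidates, targets, k=2))  # 1.0
```

A larger k is more forgiving, which matches how an IME presents several candidates for the user to pick from.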



brucewen05/CSE_481_NLP
