text parser Japanese tokenization issues #3

Open
tristcoil opened this issue Dec 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@tristcoil
Owner

We are using the MeCab tokenizer to split Japanese sentences into individual words.

Issue:
The word
食べてしまいます

gets split into

  • 食べて
  • しまい
  • ます

This split is rather difficult for readers to understand.

Ideally we should use a better parser that understands Japanese conjugation at a higher level.
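One lightweight workaround, without swapping parsers, would be a post-processing pass that glues non-independent verbs and auxiliaries back onto the preceding token. Here is a minimal sketch; the token list and POS tags are hand-written for illustration (assuming IPAdic-style tags such as 動詞,非自立 and 助動詞), not actual MeCab output:

```python
# POS tags (IPAdic-style) whose tokens we fold into the previous token:
# 動詞,非自立 = non-independent (subsidiary) verb, 助動詞 = auxiliary verb.
MERGE_POS = {"動詞,非自立", "助動詞"}

def merge_aux_tokens(tokens):
    """Merge (surface, pos) pairs whose POS is subsidiary/auxiliary
    into the word before them, keeping the first token's POS."""
    merged = []
    for surface, pos in tokens:
        if merged and pos in MERGE_POS:
            prev_surface, prev_pos = merged[-1]
            merged[-1] = (prev_surface + surface, prev_pos)
        else:
            merged.append((surface, pos))
    return merged

# 食べてしまいます as MeCab might split it (simplified, hand-written tags)
tokens = [("食べて", "動詞,自立"), ("しまい", "動詞,非自立"), ("ます", "助動詞")]
print(merge_aux_tokens(tokens))  # one readable word instead of three fragments
```

This would not fix every conjugation pattern, but it covers the common verb + subsidiary verb + auxiliary chains like the example above.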

@tristcoil tristcoil added the enhancement New feature or request label Dec 15, 2024
@tristcoil
Owner Author

Kuromoji can likely fix this issue; it should be a more advanced tokenizer.

https://github.com/atilika/kuromoji

Apache License
