text parser Japanese tokenization issues #3

Open
tristcoil opened this issue Dec 15, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@tristcoil
Owner

We are using the MeCab tokenizer to split Japanese sentences into individual words.

Issue:
The word
食べてしまいます

gets split into

  • 食べて
  • しまい
  • ます

This split is rather difficult for readers to understand.

Ideally we should use a better parser that understands Japanese conjugation at a higher level.
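One lightweight workaround, without swapping parsers, would be a post-processing pass that glues non-independent verbs and auxiliaries back onto the preceding token. Here is a minimal sketch; the token list and POS tags are hand-written for illustration (assuming IPAdic-style tags such as 動詞,非自立 and 助動詞), not actual MeCab output:

```python
# POS tags (IPAdic-style) whose tokens we fold into the previous token:
# 動詞,非自立 = non-independent (subsidiary) verb, 助動詞 = auxiliary verb.
MERGE_POS = {"動詞,非自立", "助動詞"}

def merge_aux_tokens(tokens):
    """Merge (surface, pos) pairs whose POS is subsidiary/auxiliary
    into the word before them, keeping the first token's POS."""
    merged = []
    for surface, pos in tokens:
        if merged and pos in MERGE_POS:
            prev_surface, prev_pos = merged[-1]
            merged[-1] = (prev_surface + surface, prev_pos)
        else:
            merged.append((surface, pos))
    return merged

# 食べてしまいます as MeCab might split it (simplified, hand-written tags)
tokens = [("食べて", "動詞,自立"), ("しまい", "動詞,非自立"), ("ます", "助動詞")]
print(merge_aux_tokens(tokens))  # one readable word instead of three fragments
```

This would not fix every conjugation pattern, but it covers the common verb + subsidiary verb + auxiliary chains like the example above.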

@tristcoil tristcoil added the enhancement New feature or request label Dec 15, 2024
@tristcoil
Owner Author

Kuromoji can likely fix this issue; it should be a more advanced tokenizer.

https://github.com/atilika/kuromoji

Apache License
