Some "numbers" having readings removed aren't actually numbers #25

Open
ahlec opened this issue Nov 25, 2022 · 1 comment

ahlec commented Nov 25, 2022

We have a config option to remove readings from numbers. It currently works by removing the readings attached to the characters 一二三四...

However, not all occurrences of those characters are outright numbers. Example: 一通り (ひととおり) uses 一, but shouldn't have its reading removed because it's part of a phrase.

A potential first step: only remove the reading when the character is 一 and the reading is one of its numeric readings (いち, いっ, etc.). That might not hold up long-term, though. To fully fix this, we might need a separate tool with a database/dictionary lookup that determines whether a word is a number or number + counter (we want to remove the reading) or a regular word (we want to keep the reading).
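To make the first-step idea concrete, here is a minimal sketch of a reading allowlist. The names `NUMERIC_READINGS` and `is_numeric_reading` are hypothetical and not taken from the project, and only the two readings named above are filled in; treat it as an illustration of the heuristic rather than the actual implementation.

```python
# Hypothetical allowlist: numeral kanji mapped to the readings treated as numeric.
# Only 一 is filled in; the "etc." readings are intentionally not guessed.
NUMERIC_READINGS: dict[str, set[str]] = {
    "一": {"いち", "いっ"},
}


def is_numeric_reading(kanji: str, reading: str) -> bool:
    """Return True when `reading` is an allowlisted numeric reading of `kanji`."""
    return reading in NUMERIC_READINGS.get(kanji, set())


print(is_numeric_reading("一", "いち"))  # True: the reading would be removed
print(is_numeric_reading("一", "ひと"))  # False: the reading would be kept (e.g. 一通り)
```

Under this check, 一通り keeps its reading because ひと is not on the allowlist, which is the behavior we want for that word.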

EDIT: Interestingly, the reading is not removed from the 一 in 一先ず. So the problem doesn't affect every such word, even currently.

Actual: 一通りの一先ず → 一通[とおり]の一先[ひとま]ず
Expected: 一通りの一先ず → 一通[ひととおり]の一先[ひとま]ず

ahlec commented Feb 23, 2023

Another test case to include here: 一切 (いっさい). It also challenges the earlier proposal of "only remove it if the reading is a regular one" (here, いっ), because 一切 is a regular word even though its 一 uses a regular numeric reading.
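Since the reading alone can't tell 一切 apart from a real number, the decision probably has to be made per word. A rough sketch of that, assuming a dictionary lookup is available: `REGULAR_WORDS` below is a hypothetical hard-coded stand-in for that lookup, and `should_strip_reading` is an illustrative name, not an existing function in the project.

```python
# Hypothetical stand-in for a real dictionary lookup: words that contain a
# numeral kanji but are regular vocabulary, so their readings should be kept.
REGULAR_WORDS = {"一通り", "一先ず", "一切"}

# The numeral kanji named in the issue; the real list is longer.
NUMERAL_KANJI = set("一二三四")


def should_strip_reading(word: str) -> bool:
    """Strip readings only from numeral words that are not known regular words."""
    has_numeral = any(ch in NUMERAL_KANJI for ch in word)
    return has_numeral and word not in REGULAR_WORDS


for word in ["一切", "一通り", "一回"]:
    print(word, should_strip_reading(word))
```

With this shape, 一切 and 一通り keep their readings while a number + counter such as 一回 still has its reading removed. The hard part is sourcing the word list, which is where the database/dictionary lookup would come in.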
