Fix MWE combining acute accent parser issue #118

Merged
arthurian merged 2 commits into master from bugfix/mwe-combining-acute on Jul 31, 2025

@arthurian

Member

Fixes the tokenizer so that Russian ordinal adverbs (во-первых, во-вторых, в-третьих) are kept as single tokens even when they carry combining acute accent marks.

Changes

  • Updated the split_hyphenated function in tokenizer.py: Added canonical comparison to handle both accented and unaccented versions of hyphenated words
  • Case-insensitive handling: Updated "по-" prefix check to be case-insensitive
  • Unit tests: Added tests for both accented (во-первы́х, во-вторы́х, в-тре́тьих) and unaccented versions with proper capitalization

Notes

The tokenizer now uses canonical forms (stripped of diacritics and lowercased) when comparing against reserved hyphenated words, ensuring that variations like "Во-первых" and "во-первы́х" are both preserved as single tokens instead of being split at the hyphen.
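The canonical comparison described above could be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the `RESERVED_HYPHENATED` set and the `canonical` helper are hypothetical names, the body of `split_hyphenated` is a guess at its shape, and treating "по-" words as single tokens is an assumption about what the prefix check does. The sketch strips only U+0301 (combining acute) rather than every combining mark, because a blanket NFD strip would also turn й into и and ё into е.

```python
from typing import List

COMBINING_ACUTE = "\u0301"  # stress mark used in accented Russian text

# Hypothetical reserved list; the real one in tokenizer.py is not shown in this PR.
RESERVED_HYPHENATED = {"во-первых", "во-вторых", "в-третьих"}


def canonical(word: str) -> str:
    """Return the comparison form: combining acute removed, lowercased."""
    return word.replace(COMBINING_ACUTE, "").lower()


def split_hyphenated(token: str) -> List[str]:
    """Split a token at hyphens unless its canonical form is reserved.

    Assumption: words starting with "по-" are also kept whole, and the
    check is done on the canonical form so it is case-insensitive.
    """
    key = canonical(token)
    if key in RESERVED_HYPHENATED or key.startswith("по-"):
        return [token]  # keep the original spelling, accents included
    return token.split("-")


if __name__ == "__main__":
    for sample in ["Во-первых", "во-первы\u0301х", "В-тре\u0301тьих", "северо-запад"]:
        print(sample, "->", split_hyphenated(sample))
```

Note that the original token is returned unchanged, so accents and capitalization survive; only the lookup key is canonicalized.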

@arthurian added the bug (Something isn't working) label on Jul 31, 2025
@arthurian merged commit 09cc2dc into master on Jul 31, 2025
2 checks passed
@arthurian deleted the bugfix/mwe-combining-acute branch on July 31, 2025 21:06