Fix MWE combining acute accent parser issue by arthurian · Pull Request #118 · Harvard-ATG/visualizing_russian_tools

arthurian · 2025-07-31T21:01:34Z

Fixes the tokenizer so that Russian ordinal adverbs (во-первых, во-вторых, в-третьих) are kept as single tokens even when they carry combining acute accent marks.

Changes

Updated the split_hyphenated function in tokenizer.py: Added canonical comparison to handle both accented and unaccented versions of hyphenated words
Case-insensitive handling: Updated "по-" prefix check to be case-insensitive
Unit tests: Added tests for both accented (во-первы́х, во-вторы́х, в-тре́тьих) and unaccented versions with proper capitalization

Notes

The tokenizer now uses canonical forms (stripped of diacritics and lowercased) when comparing against reserved hyphenated words, ensuring that variations like "Во-первых" and "во-первы́х" are both preserved as single tokens instead of being split at the hyphen.
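The canonical comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from tokenizer.py: the names `canonical`, `split_hyphenated_sketch`, and `RESERVED_HYPHENATED` are hypothetical, and the reserved list contains only the three adverbs named in this PR. The sketch also assumes that only the combining acute accent (U+0301) is stripped, so that letters such as й and ё survive normalization.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # stress mark placed after the accented vowel

# Assumed reserved list for illustration; the real list lives in tokenizer.py.
RESERVED_HYPHENATED = {"во-первых", "во-вторых", "в-третьих"}


def canonical(word: str) -> str:
    """Return the word lowercased with combining acute accents removed."""
    # NFD separates any precomposed letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    # NFC recomposes letters such as й and ё that NFD split apart.
    return unicodedata.normalize("NFC", stripped).lower()


def split_hyphenated_sketch(token: str) -> list:
    """Keep reserved hyphenated words whole; split everything else at hyphens."""
    if canonical(token) in RESERVED_HYPHENATED:
        return [token]  # original spelling (case and accent) is preserved
    parts = []
    for index, piece in enumerate(token.split("-")):
        if index > 0:
            parts.append("-")  # the hyphen becomes its own token
        if piece:
            parts.append(piece)
    return parts
```

Comparing on the canonical form while returning the original token means the accent and capitalization still reach downstream consumers; only the lookup ignores them.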

bugfix: fixed MWEs with accents not being tokenized correctly

5a8078e

arthurian added the bug label (Something isn't working) on Jul 31, 2025

lint: reformatted with ruff

f730c1d

arthurian merged commit 09cc2dc into master Jul 31, 2025
2 checks passed

arthurian deleted the bugfix/mwe-combining-acute branch July 31, 2025 21:06

arthurian merged 2 commits into master from bugfix/mwe-combining-acute
