str_split not splitting correctly on Unicode character #542
Comments
... and what would be the correct result?
Correct return should be:
All I know about Burmese is what I've just read on Wikipedia, but it sounds like you're looking to break the string up into individual code points, not characters (which, because Burmese is an abugida rather than an alphabet, represent syllables, not individual vowels and consonants). I don't see an obvious way to do this with stringi, but @gagolews might.
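To make the code point versus character distinction concrete, here is a minimal base R sketch. It is only an illustration, not a stringi or stringr solution: it writes the phrase from this issue with `\u` escapes (my own transcription of "စမ်းသပ်မှု", an assumption worth double-checking) and lists its code points.

```r
# The phrase from this issue, spelled with \u escapes so the example does not
# depend on the encoding of the source file (transcription is an assumption).
x <- "\u1005\u1019\u103a\u1038\u101e\u1015\u103a\u1019\u103e\u102f"

# utf8ToInt() returns one integer per Unicode code point.
code_points <- utf8ToInt(x)

# intToUtf8(multiple = TRUE) turns each code point back into its own string.
pieces <- intToUtf8(code_points, multiple = TRUE)

# Number of code points in the phrase.
length(pieces)

# Hex labels such as "U+1005", handy for checking which marks are involved.
sprintf("U+%04X", code_points)
```

Splitting this way separates every combining mark from its base consonant, so it is a code point split, not a syllable split.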
@hadley Thank you for raising the point. Burmese is indeed an abugida. I understand that every abugida (pseudo-alphabet) script has its own structure, which can be confusing, and that the way to break one down can even be controversial. Let me explain in detail how the phrase "စမ်းသပ်မှု" (meaning "testing" or "test") breaks down.
str_split() breaks the syllables into grammatically illegal groups. For example, it breaks "စမ်း" into ["စ", "မ်", "း"], where "မ်" and "း" cannot grammatically stand alone. I am a native Burmese NLP researcher, and I believe I could help with this implementation. I recently developed bursyl, a regex-based Burmese syllabification algorithm (with very strict grammatical rules, though they can be adjusted). Could it potentially be implemented in stringi for splitting Burmese text, @gagolews?
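As a rough illustration of what regex-based syllable grouping can look like in base R, here is a sketch under a deliberately simplified rule. It is not bursyl and not anything stringi provides: the helper name `split_syllables_sketch` is hypothetical, and the rule (a syllable starts at a consonant in U+1000..U+1021 that is not followed by asat U+103A or virama U+1039, and not preceded by virama) is an assumption that will miss many real cases.

```r
# The phrase from this issue, spelled with \u escapes (my transcription).
x <- "\u1005\u1019\u103a\u1038\u101e\u1015\u103a\u1019\u103e\u102f"

# Hypothetical helper, NOT bursyl: groups code points into syllable-like
# chunks using one simplified rule.
split_syllables_sketch <- function(text) {
  # A syllable-initial consonant: in U+1000..U+1021, not stacked under a
  # preceding virama (U+1039), and not "killed" by a following asat (U+103A)
  # or virama (U+1039).
  starts <- "(?<!\u1039)([\u1000-\u1021])(?![\u103a\u1039])"

  # Put a newline in front of every syllable-initial consonant ...
  marked <- gsub(starts, "\n\\1", text, perl = TRUE)

  # ... then split on the newline and drop the empty leading piece.
  parts <- strsplit(marked, "\n", fixed = TRUE)[[1]]
  parts[nzchar(parts)]
}

syllables <- split_syllables_sketch(x)
length(syllables)  # 3 chunks for this phrase under the simplified rule
```

Under this rule "စမ်း" stays together as one chunk instead of being cut into "စ", "မ်", "း". Stricter grammatical checks, as described for bursyl, would need considerably more than this single pattern.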
On a side note, https://unicode-org.github.io/icu/userguide/boundaryanalysis/ says that:
I am trying to split Burmese Unicode characters with stringr::str_split(), but it does not return the correct values.
str_split("စမ်းသပ်မှု", "")[[1]]
it returns:
If I use the built-in strsplit():
strsplit("စမ်းသပ်မှု", "")[[1]]
it returns a character-level split:
I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither characters nor syllables:
So, I don't think it is actually a feature like Issue:88
For further study, could someone point me to where this splitting comes from, if possible? I found that other services, such as Google, also use this incorrect splitting. TIA.