str_split not splitting correctly on Unicode character #542

Closed
alexanderbeatson opened this issue Mar 29, 2024 · 5 comments

@alexanderbeatson

I am trying to split Burmese Unicode text with stringr::str_split(), but it does not return the correct values.

str_split("စမ်းသပ်မှု", "")[[1]]

it returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit, strsplit("စမ်းသပ်မှု", "")[[1]], it splits at the character level:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but stringr::str_split() returns neither characters nor syllables. A syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So, I don't think it is actually a feature like Issue:88

For further study, if possible, could someone point me to where this splitting behavior comes from? I found that other services, such as Google, also use this incorrect splitting. TIA.

@gagolews
Contributor

gagolews commented Apr 2, 2024

... and what would be the correct result?

@alexanderbeatson
Author

Correct return should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

@hadley
Member

hadley commented Jul 15, 2024

All I know about Burmese is what I've just read on Wikipedia, but it sounds like you're looking to break the string into individual code points, not characters (which, because Burmese is an abugida rather than an alphabet, represent syllables, not individual vowels and consonants).

I don't see an obvious way to do this with stringi, but @gagolews might.
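
For reference, a minimal sketch of a code-point split using only base R, without stringi. The helper name `split_code_points` is hypothetical and used here only for illustration. It assumes a single UTF-8 string as input.

```r
# Hypothetical helper: split one string into individual Unicode code points.
# utf8ToInt() returns the integer code points of a single string;
# intToUtf8(..., multiple = TRUE) turns each code point back into its own string.
split_code_points <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  intToUtf8(utf8ToInt(x), multiple = TRUE)
}

x <- "စမ်းသပ်မှု"
code_points <- split_code_points(x)
print(code_points)
```

For the example string this should give the same code-point-level pieces as the strsplit() output quoted earlier in the issue, since each combining mark is its own code point.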

@hadley hadley closed this as completed Jul 15, 2024
@alexanderbeatson
Author

@hadley Thank you for raising the point. Burmese is indeed an abugida.

I understand that pseudo-alphabetic scripts each have their own structure, which can be confusing, and that there may even be competing systems for breaking them down.

Let me explain in detail how the phrase "စမ်းသပ်မှု" (meaning "testing" or "test") breaks down:

  • "စမ်းသပ်မှု" is a single word
  • contains 3 distinct syllables ["စမ်း", "သပ်", "မှု"]

str_split() breaks the syllables into grammatically illegal groups. For example, it breaks "စမ်း" into ["စ", "မ်", "း"], where "မ်" and "း" cannot grammatically stand alone.

I am a native Burmese NLP researcher, and I believe I could help with this implementation. I recently developed bursyl, a regex-based Burmese syllabification algorithm (with very strict grammatical rules, though they can be adjusted). Could it potentially be implemented in stringi for splitting the Burmese language, @gagolews?
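
To make the idea of regex-based syllabification concrete, here is a deliberately simplified sketch in base R. It is not the bursyl algorithm and does not implement its grammatical rules. It applies one assumed rule only: start a new syllable before a consonant (U+1000 to U+1021) that is not followed by asat (U+103A) or virama (U+1039) and is not preceded by virama. The function name `split_syllables_sketch` is hypothetical.

```r
# Simplified, illustrative syllable splitter (NOT bursyl).
# Assumed rule: a syllable starts at a consonant that is not "killed" by a
# following asat/virama and is not stacked under a preceding virama.
split_syllables_sketch <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  pattern <- "(?<!\u1039)([\u1000-\u1021])(?![\u103A\u1039])"
  # Insert a separator before each syllable-initial consonant.
  # Assumption: the input itself does not contain "|".
  marked <- gsub(pattern, "|\\1", x, perl = TRUE)
  parts <- strsplit(marked, "|", fixed = TRUE)[[1]]
  parts[nzchar(parts)]  # drop the empty piece before the first separator
}

syllables <- split_syllables_sketch("စမ်းသပ်မှု")
print(syllables)
```

For the example word, this rule places breaks before "စ", "သ", and the final "မ", which corresponds to the three syllables listed above. Digits, punctuation, independent vowels, and non-Burmese text are ignored here and would need extra rules.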

@gagolews
Contributor

gagolews commented Jul 16, 2024

On a side note, https://unicode-org.github.io/icu/userguide/boundaryanalysis/ says that:

*Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.*
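
A short sketch of how these ICU break iterators can be reached from R through stringi::stri_split_boundaries(). As far as I know, an empty pattern in stringr::str_split() is documented as equivalent to boundary("character"), which would explain where the grouping reported at the top of this issue comes from. The wrapper name `split_at_boundaries` is hypothetical, and the exact pieces returned depend on the ICU version stringi was built against, so no output is shown.

```r
# Hypothetical wrapper around stringi's ICU break-iterator interface.
# type = "character" splits at grapheme-cluster boundaries;
# type = "word" uses ICU word breaking (dictionary-based for Burmese,
# according to the ICU user guide quoted above).
split_at_boundaries <- function(x, type = c("character", "word")) {
  type <- match.arg(type)
  stringi::stri_split_boundaries(x, type = type)[[1]]
}

x <- "စမ်းသပ်မှု"
print(split_at_boundaries(x, "character"))
print(split_at_boundaries(x, "word"))
```

Neither boundary type is a syllable splitter: "character" works below the syllable level and "word" works above it.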
