str_split not splitting correctly on Unicode character #542

alexanderbeatson · 2024-03-29T06:24:50Z

I am trying to split Burmese Unicode text with stringr::str_split(), but it does not return the values I expect.

str_split("စမ်းသပ်မှု", "")[[1]]

it returns:

[1] "စ" "မ်" "း" "သ" "ပ်" "မှု"

If I use the built-in strsplit, strsplit("စမ်းသပ်မှု", "")[[1]], it splits at the character level:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

I found that str_split() treats the empty string "" as a regex, but its result is split neither by character nor by syllable. A syllable-level split would be:

[1] "စမ်း" "သပ်" "မှု"

So I don't think this is actually a feature, as it was in issue #88.

For further study, could someone point me to where this splitting comes from? I found that other services, such as Google, also use this incorrect splitting. Thanks in advance.

gagolews · 2024-04-02T13:08:01Z

... and what would be the correct result?

alexanderbeatson · 2024-04-04T06:09:55Z

The correct return value should be:

[1] "စ" "မ" "်" "း" "သ" "ပ" "်" "မ" "ှ" "ု"

hadley · 2024-07-15T21:31:09Z

All I know about Burmese is what I've just read on Wikipedia, but it sounds like you're looking to break the string up into individual code points, not characters (which, because Burmese is an abugida, not an alphabet, represent syllables, not individual vowels and consonants).

I don't see an obvious way to do this with stringi, but @gagolews might.
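
For reference, a minimal base R sketch of a code-point split, assuming that one element per code point is the desired output. It uses neither stringr nor stringi, and the variable names are illustrative only:

```r
# Split a UTF-8 string into individual Unicode code points (base R only).
# Assumes the script is read in a UTF-8 locale so the literal is valid UTF-8.
x <- "စမ်းသပ်မှု"

# utf8ToInt() returns one integer per code point.
cp_int <- utf8ToInt(x)

# intToUtf8(..., multiple = TRUE) converts each integer back into
# a separate one-code-point string.
code_points <- intToUtf8(cp_int, multiple = TRUE)

print(code_points)
print(length(code_points))
```

For this string the result has 10 elements, matching the strsplit() output in the original report.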

alexanderbeatson · 2024-07-16T06:51:54Z

@hadley Thank you for raising the point. Burmese is indeed an abugida.

I understand that every pseudo-alphabetic script has its own structure, which can be confusing, and that the right way to break one down may even be controversial.

Let me explain in detail how the phrase "စမ်းသပ်မှု" (meaning "testing" or "test") breaks down:

"စမ်းသပ်မှု" is a single word
It contains 3 distinct syllables: ["စမ်း", "သပ်", "မှု"]

str_split() breaks the syllables into (grammatically) illegal groups. For example, it breaks "စမ်း" into ["စ", "မ်", "း"], but "မ်" and "း" cannot grammatically stand alone.

I am a native Burmese NLP researcher and I believe I could help with this implementation. I recently developed bursyl, a regex-based Burmese syllabification algorithm (with very strict grammatical rules, which can be adjusted). Could I potentially implement it in stringi for splitting Burmese text, @gagolews?

gagolews · 2024-07-16T07:23:20Z

On a side note, https://unicode-org.github.io/icu/userguide/boundaryanalysis/ says that:

*Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires more than rules over character sequences. ICU provides dictionary support for word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary languages is encountered. There is no separate API, and no extra programming steps required by applications making use of the dictionaries.*
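
To see what the ICU break iterators described in the quote do with the example string, here is a hedged sketch using stringi's stri_split_boundaries(). It assumes stringi is installed. The exact pieces depend on the ICU version and its dictionary data, so no expected output is shown:

```r
# Compare ICU boundary types on the example string via stringi.
# Assumes the stringi package is installed; results may vary with ICU version.
library(stringi)

x <- "စမ်းသပ်မှု"

# Word boundaries: per the ICU guide, Burmese uses dictionary-based breaking.
words <- stri_split_boundaries(x, type = "word")[[1]]

# Character (grapheme cluster) boundaries.
graphemes <- stri_split_boundaries(x, type = "character")[[1]]

print(words)
print(graphemes)
```

Neither boundary type is a syllable splitter, so this only shows what ICU offers out of the box.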

hadley closed this as completed Jul 15, 2024
