Trying to process text in Hmong yields 500 Internal Server Error from web-api #200

joanise · 2023-03-15T19:20:09Z

Context:
During our ICLDC workshop, a participant offered to help with Hmong g2p. Before contacting them, I wanted to see what und does with it, so I copy-pasted letters in the language from https://en.wikipedia.org/wiki/Hmong_language

Observations:
The Hmong letters are in the first supplementary plane (Plane 1) of the Unicode standard: Nyiakeng Puachue Hmong starts at U+1E100, and Pahawh Hmong at U+16B00.
Refs: searching for Hmong in https://unicode.org/charts/ yields two charts:

Problem Inputs (in each case, put that input in the text box and click next step, result is from web-api):
� yields 422 Could not find any words to align in the text.
𞄤𞄦𞄣‎𖬊𖬋 yields 500 Internal Server Error
𞄤𞄦𞄣‎𖬊𖬋 asdf yields a Possible Text Processing issue, with two given strings being mapped to empty output
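To confirm where these characters sit, here is a minimal sketch using only the Python standard library. It lists the code point, plane, and Unicode name of each visible letter from the second input. The helper name `describe` is made up for illustration, and the sample is written with escapes so it survives copy-paste.

```python
import unicodedata


def describe(text):
    """Return (code point, plane, Unicode name) for each character in text."""
    rows = []
    for ch in text:
        cp = ord(ch)
        plane = cp >> 16  # plane 0 is the BMP, plane 1 is the first supplementary plane
        # Fall back to a placeholder if this Python's Unicode database lacks the name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{cp:04X}", plane, name))
    return rows


# Visible letters of the second problem input:
# three Nyiakeng Puachue Hmong letters, then two Pahawh Hmong letters.
sample = "\U0001E124\U0001E126\U0001E123\U00016B0A\U00016B0B"

for row in describe(sample):
    print(row)
```

Every row should report plane 1, which is consistent with the observation above that none of these letters are in the Basic Multilingual Plane.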

There are multiple issues at play here:

The fonts we use don't support these characters
For the first example, which gave me 422, it looks like I just failed to copy and paste it correctly: my input was literally the diamond with a question mark (U+FFFD). I guess it illustrates that it's not that easy to grab these characters in the first place.
The second example contains only higher plane chars (U+1Exxx and U+16Bxx), and we evidently don't accept that as input. A 500 is not good.
The third example gets slightly better results: the response is valid, except the higher plane chars disappear and don't get mapped to any sounds.

Desired behaviour, each of which could be its own issue:

Handle the font for this script, or maybe let the user specify an additional custom font.
Fix the 500 (we don't want that on any input the user could type)
Add support for higher plane characters to our und mapping.
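For the 500 specifically, one option is a pre-check at the API boundary that rejects input with nothing usable in it before any processing runs. The sketch below is not the web-api's actual code: the function names, the status codes returned, and the assumption that the usable range ends at U+FFFF are all hypothetical, and the root cause of the 500 has not been identified here.

```python
def find_unsupported_chars(text, max_supported=0xFFFF):
    """Return the sorted distinct characters whose code point is above max_supported."""
    return sorted({ch for ch in text if ord(ch) > max_supported})


def validate_text(text, max_supported=0xFFFF):
    """Hypothetical pre-check returning (status, message) instead of crashing later.

    Input is rejected only when it has unsupported characters and no
    letters or digits within the supported range are left to process.
    """
    unsupported = find_unsupported_chars(text, max_supported)
    usable = [ch for ch in text if ord(ch) <= max_supported and ch.isalnum()]
    if unsupported and not usable:
        listed = " ".join(f"U+{ord(ch):04X}" for ch in unsupported)
        return 422, f"No supported characters in input; unsupported: {listed}"
    return 200, "ok"


# The second problem input (only higher plane letters) would be rejected,
# while the third one (which also contains "asdf") would pass through.
print(validate_text("\U0001E124\U0001E126\U0001E123\U00016B0A\U00016B0B"))
print(validate_text("\U0001E124\U0001E126\U0001E123\U00016B0A\U00016B0B asdf"))
```

Returning 422 here mirrors the response already given for the first problem input; whether that is the right status for this case is a design choice left open.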


joanise · 2023-03-15T19:43:44Z

I just did a quick test with text_unidecode, and it only has values for plane 0 in Unicode:

$ python -c 'import text_unidecode as tu; print(len(tu._replaces));'
65535

So we would need a specific g2p for this language, and maybe we could submit an extension for text_unidecode but I'm really not sure we want to do that.

roedoejet · 2023-03-15T19:46:33Z

The font issue could be resolved with https://fonts.google.com/noto/specimen/Noto+Sans+Pahawh+Hmong. The problem is I'm not sure we want to start bundling all these fonts for every readalong, and being selective will require some more thinking.

joanise · 2023-03-15T19:51:33Z

Yeah, I don't think we want to ship with all fonts all the time. Hmong would be a use case requiring a custom font for a given RA, I think.
