Context:
During our ICLDC workshop, a participant offered to help with Hmong g2p. Before contacting them, I wanted to see what und does with it, so I copy-pasted letters in the language from https://en.wikipedia.org/wiki/Hmong_language
Observations:
The Hmong letters are in the first supplementary plane (plane 1) of the Unicode standard (Nyiakeng Puachue Hmong starts at U+1E100, and Pahawh Hmong at U+16B00)
Refs: searching for Hmong in https://unicode.org/charts/ yields two charts:
Problem Inputs (in each case, put that input in the text box and click next step; the result is from web-api):
� yields 422 Could not find any words to align in the text.
𞄤𞄦𞄣𖬊𖬋 yields 500 Internal Server Error
𞄤𞄦𞄣𖬊𖬋 asdf yields a Possible Text Processing issue, with two given strings being mapped to empty output
There are multiple issues at play here:
The fonts we use don't support these characters
For the first example, which gave me 422, it looks like I just failed to copy and paste it correctly: my input was literally the diamond with a question mark. I guess it illustrates that it's not that easy to grab these characters in the first place.
The second example contains only higher-plane characters, and we evidently don't handle that as input. A 500 is not good.
The third example gets slightly better results: the response is valid, except the U+1Exxx chars disappear and don't get mapped to any sounds.
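To check which plane each pasted character actually falls in, here is a minimal sketch using only the Python standard library. The helper name `describe` is made up for illustration, and the sample string uses the two block start points mentioned above rather than the exact letters from the problem inputs.

```python
import unicodedata


def describe(text):
    """Return (code point label, plane number, Unicode name) per character."""
    rows = []
    for ch in text:
        cp = ord(ch)
        plane = cp >> 16  # each plane spans 0x10000 code points
        # The name lookup depends on the Unicode version bundled with Python,
        # so fall back to a placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "<no name in this Python's Unicode data>")
        rows.append((f"U+{cp:04X}", plane, name))
    return rows


# Block starts quoted in the observations, written as escapes so that
# no special font is needed to read this snippet.
sample = "\U0001E100\U00016B00a"
for row in describe(sample):
    print(row)
```

Any character reported with a plane other than 0 is outside the range that the current mapping appears to cover.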
Desired behaviour, each of which could be its own issue:
Handle the font for this script, or maybe let the user specify an additional custom font.
Fix the 500 (we don't want that on any input the user could type).
Add support for higher plane characters to our und mapping.
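As a stopgap for the 500, one option would be to detect characters above U+FFFF before processing and report them to the user. The sketch below is only an illustration under that assumption: `chars_outside_bmp` and `check_input` are hypothetical names, not existing functions in our code, and the actual cause of the 500 has not been confirmed.

```python
BMP_MAX = 0xFFFF  # last code point of plane 0


def chars_outside_bmp(text):
    """Return the distinct characters above U+FFFF, in first-seen order."""
    seen = []
    for ch in text:
        if ord(ch) > BMP_MAX and ch not in seen:
            seen.append(ch)
    return seen


def check_input(text):
    """Hypothetical pre-check returning (ok, message) for a piece of input."""
    bad = chars_outside_bmp(text)
    if bad:
        labels = ", ".join(f"U+{ord(ch):04X}" for ch in bad)
        return False, f"Unsupported characters outside plane 0: {labels}"
    return True, ""


# Example with the Nyiakeng Puachue Hmong block start plus some ASCII text
print(check_input("\U0001E100 asdf"))
print(check_input("asdf"))
```

A check like this would let the API answer with a descriptive client error instead of an internal server error, until higher-plane characters are actually supported.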
I just did a quick test with text_unidecode, and it only has values for plane 0 in Unicode:
$ python -c 'import text_unidecode as tu; print(len(tu._replaces));'
65535
So we would need a specific g2p for this language. We could maybe submit an extension to text_unidecode, but I'm really not sure we want to do that.
The font issue could be resolved with https://fonts.google.com/noto/specimen/Noto+Sans+Pahawh+Hmong. The problem is that I'm not sure we want to start bundling all these fonts for every readalong, and being selective will require some more thinking.