
ASCII/Romanization for OuteTTS Multilingual Processing #10894

Open
wants to merge 2 commits into base: master
Conversation


@edwko commented Dec 19, 2024

Integrated anyascii to handle multilingual text by converting words to their romanized representation, for example "こんにちは" -> "konnichiha".
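To illustrate the idea with a toy sketch (this is not the PR's actual code; the real anyascii library maps most Unicode scripts, while the table below is a hypothetical five-character excerpt):

```python
# Toy kana-to-romaji lookup illustrating the kind of per-character mapping
# that anyascii performs. This tiny table is hypothetical; the real library
# covers far more characters and scripts.
KANA_TO_ROMAJI = {
    "こ": "ko", "ん": "n", "に": "ni", "ち": "chi", "は": "ha",
}

def romanize(text: str) -> str:
    # Fall back to the character itself when no mapping exists (e.g. ASCII).
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)

print(romanize("こんにちは"))  # -> konnichiha
```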

The current implementation handles romanization but does not implement proper word segmentation for languages written without word boundaries, such as Japanese.

Example of current vs desired output:
Input: 私は学生です
Current: watashihagakuseidesu
Expected: watashi ha gakusei desu

A morphological analyzer would be needed to achieve proper word segmentation. For now the feature is still usable, but for such languages word boundaries must be added manually in the input.
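As a rough illustration of what segmentation involves, a naive heuristic that inserts a boundary wherever the script class changes (kanji vs. hiragana, by Unicode range) happens to handle the example sentence above. This is only a sketch: it is no substitute for a real morphological analyzer (e.g. MeCab) and breaks on kanji compounds, katakana loanwords, particles written in kanji, and many other cases.

```python
def script(ch: str) -> str:
    """Classify a character as kanji, hiragana, or other by Unicode block."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= cp <= 0x309F:  # Hiragana
        return "hiragana"
    return "other"

def naive_segment(text: str) -> str:
    # Insert a space wherever the script class changes between
    # adjacent characters.
    out = []
    for i, ch in enumerate(text):
        if i > 0 and script(ch) != script(text[i - 1]):
            out.append(" ")
        out.append(ch)
    return "".join(out)

print(naive_segment("私は学生です"))  # -> 私 は 学生 です
```

Romanizing each of those chunks separately would then yield the desired spaced output instead of one run-together word.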

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Dec 19, 2024
@ggerganov (Owner) commented

Thanks. My understanding is that this pre-processing can be achieved easily with available command-line utilities, so introducing this partial solution into the codebase is not well justified. The purpose of the example is mainly to demonstrate how to use libllama for TTS, not to be a full-fledged TTS application. The latter would require many additional steps and features that are not very suitable to implement at the current state of the project. Let's leave this PR as a demo for now; with time we can revisit the priorities.

@edwko (Author) commented Dec 19, 2024

I see, you're right: users can handle text preprocessing separately with existing command-line tools or libraries. Implementing proper segmentation would probably add unnecessary complexity and dependencies. Let's keep this as a demo; it could be a useful reference for text processing if needed. :)

@ngxson (Collaborator) commented Dec 19, 2024

I think in the future the trend will be minimal pre-processing, with the model simply understanding all input tokens without any transformations. It would be cool if someone could fine-tune the model to support multiple languages / sound / music / etc.

@edwko (Author) commented Dec 19, 2024

> I think in the future the trend will be minimal pre-processing, with the model simply understanding all input tokens without any transformations. It would be cool if someone could fine-tune the model to support multiple languages / sound / music / etc.

It should be possible; it's just a limitation of the alignment model currently used, which only supports Latin characters. Using something like Whisper to get timestamps for each word could be a better alternative, though a bit slower. Languages like Chinese or Japanese would still probably need some kind of segmentation, or we could explore other methods that are easier to implement.

Labels: demo (Demonstrate some concept or idea, not intended to be merged), examples
3 participants