
ASCII/Romanization for OuteTTS Multilingual Processing #10894

Open
wants to merge 2 commits into base: master
Conversation


@edwko commented Dec 19, 2024

Integrated anyascii to handle multilingual text by converting words to their romanized representation, for example "こんにちは" -> "konnichiha".
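To illustrate the idea with a toy sketch (this is not the PR's actual code; the real anyascii library maps most Unicode scripts, while the table below is a hypothetical five-character excerpt):

```python
# Toy kana-to-romaji lookup illustrating the kind of per-character mapping
# that anyascii performs. This tiny table is hypothetical; the real library
# covers far more characters and scripts.
KANA_TO_ROMAJI = {
    "こ": "ko", "ん": "n", "に": "ni", "ち": "chi", "は": "ha",
}

def romanize(text: str) -> str:
    # Fall back to the character itself when no mapping exists (e.g. ASCII).
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)

print(romanize("こんにちは"))  # -> konnichiha
```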

The current implementation handles romanization but does not implement proper word segmentation for languages written without word boundaries, such as Japanese.

Example of current vs desired output:
Input: 私は学生です
Current: watashihagakuseidesu
Expected: watashi ha gakusei desu

A morphological analyzer would be needed to achieve proper word segmentation. For now the feature is still usable, but for such languages word boundaries must be added manually in the input.
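As a rough illustration of what segmentation involves, a naive heuristic that inserts a boundary wherever the script class changes (kanji vs. hiragana, by Unicode range) happens to handle the example sentence above. This is only a sketch: it is no substitute for a real morphological analyzer (e.g. MeCab) and breaks on kanji compounds, katakana loanwords, particles written in kanji, and many other cases.

```python
def script(ch: str) -> str:
    """Classify a character as kanji, hiragana, or other by Unicode block."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= cp <= 0x309F:  # Hiragana
        return "hiragana"
    return "other"

def naive_segment(text: str) -> str:
    # Insert a space wherever the script class changes between
    # adjacent characters.
    out = []
    for i, ch in enumerate(text):
        if i > 0 and script(ch) != script(text[i - 1]):
            out.append(" ")
        out.append(ch)
    return "".join(out)

print(naive_segment("私は学生です"))  # -> 私 は 学生 です
```

Romanizing each of those chunks separately would then yield the desired spaced output instead of one run-together word.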

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Dec 19, 2024
@ggerganov (Owner) commented

Thanks. My understanding is that this pre-processing can be achieved easily with available command-line utilities, so introducing this partial solution into the codebase is not well justified. The purpose of the example is mainly to demonstrate how to use libllama for TTS, not to be a full-fledged TTS application. The latter would require many additional steps and features that are not very suitable to implement at the current state of the project. Let's leave this PR as a demo for now; with time we can revisit the priorities.

@edwko (Author) commented Dec 19, 2024

I see, you're right: users can handle text preprocessing separately with existing command-line tools or libraries. Implementing proper segmentation would probably add unnecessary complexity and dependencies. Let's keep this as a demo; it could be a useful reference for text processing if needed. :)

@ngxson (Collaborator) commented Dec 19, 2024

I think in the future the trend will be minimal pre-processing, with the model simply understanding all input tokens without any transformations. It would be cool if someone could fine-tune the model to support multiple languages / sound / music / etc.

@edwko (Author) commented Dec 19, 2024

> I think in the future the trend will be minimal pre-processing, with the model simply understanding all input tokens without any transformations. It would be cool if someone could fine-tune the model to support multiple languages / sound / music / etc.

It should be possible; it's just a limitation of the alignment model currently used, which only supports Latin characters. Using something like Whisper to get timestamps for each word could be a better alternative, though a bit slower. Languages like Chinese or Japanese would still probably need some kind of segmentation, or we could explore other methods that are easier to implement.

Labels: demo (Demonstrate some concept or idea, not intended to be merged), examples
3 participants