The sudden change of speaker's voice in streaming inference #428

Open
JeneeeF opened this issue Sep 24, 2024 · 3 comments

JeneeeF commented Sep 24, 2024

Great work, thanks.

I successfully fine-tuned the 300M model on my own data. However, when I use the latest streaming inference method and specify one female speaker for long-text speech synthesis, some words or phrases suddenly switch to a male speaker's voice, which is alarming.

I'm curious about the reason behind this sudden change in the speaker's voice during streaming inference. Could it be because the speech tokens are segmented before being passed as input? Is it possible to solve this problem by adjusting "token_min_hop_len", "token_max_hop_len", "token_overlap_len", or other parameters?

Hope for your reply.
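
To make the segmentation question concrete, here is a minimal sketch of hop-and-overlap chunking over a token sequence. It is only an illustration of the general technique: the function `chunk_tokens` and its arguments `hop_len` and `overlap_len` are hypothetical names chosen for this example, and the project's actual handling of "token_min_hop_len", "token_max_hop_len", and "token_overlap_len" may differ.

```python
from typing import List, Tuple


def chunk_tokens(tokens: List[int], hop_len: int, overlap_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for streaming-style chunks.

    Each chunk advances by `hop_len` tokens and additionally sees
    `overlap_len` tokens that it shares with the following chunk.
    Hypothetical helper for illustration only.
    """
    if hop_len <= 0 or overlap_len < 0:
        raise ValueError("hop_len must be positive and overlap_len non-negative")

    spans: List[Tuple[int, int]] = []
    n = len(tokens)
    start = 0
    while start < n:
        # The chunk covers the new hop plus the shared overlap, clipped to the end.
        end = min(start + hop_len + overlap_len, n)
        spans.append((start, end))
        if end == n:
            break  # the last chunk already reaches the end of the sequence
        start += hop_len
    return spans


if __name__ == "__main__":
    # Ten tokens, hop of 4, overlap of 2.
    print(chunk_tokens(list(range(10)), hop_len=4, overlap_len=2))
```

In a scheme like this, each chunk is decoded with only a short window of context, which is one place where a model trained on whole utterances could behave differently than it does in non-streaming mode.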

aluminumbox (Collaborator) commented

Set stream=False.
This is due to a mismatch between training and inference. First, we use fp16 inference; we will add AMP training later. Second, we train on whole utterances, so performance degrades in streaming inference mode, and this is not easy to solve.

JeneeeF commented Sep 25, 2024

Thanks for your reply. Do you have any recommendations for mitigating the voice change during streaming inference? It appears much less often when synthesizing short sentences than when synthesizing long text.
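
One workaround that follows from two observations in this thread (the advice to set stream=False and the report that short sentences are rarely affected) is to split long text into sentences and synthesize each one separately in non-streaming mode. The sketch below is an untested assumption, not a fix confirmed by the maintainers. `split_sentences` and `synthesize_by_sentence` are hypothetical helpers, and `synth_fn` stands in for whatever non-streaming synthesis call you already use.

```python
import re
from typing import Callable, Iterable, List

# A run of non-terminator characters followed by any sentence-ending
# punctuation (ASCII and CJK). Naive: it also splits decimals such as "3.14".
_SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the trailing punctuation."""
    return [m.strip() for m in _SENTENCE_RE.findall(text) if m.strip()]


def synthesize_by_sentence(
    text: str, synth_fn: Callable[[str], Iterable[float]]
) -> List[float]:
    """Synthesize each sentence on its own and concatenate the samples.

    `synth_fn` is a caller-supplied function (hypothetical here) that maps
    one sentence to audio samples, e.g. a wrapper around a non-streaming
    inference call.
    """
    audio: List[float] = []
    for sentence in split_sentences(text):
        audio.extend(synth_fn(sentence))
    return audio


if __name__ == "__main__":
    # Stand-in synthesizer that emits one sample per sentence.
    fake_synth = lambda s: [float(len(s))]
    print(split_sentences("Hello world. How are you? Fine."))
    print(synthesize_by_sentence("Hello world. How are you? Fine.", fake_synth))
```

The trade-off is latency: the first audio is available only after the first full sentence has been synthesized, rather than after the first token chunk.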

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Oct 26, 2024