I successfully fine-tuned the 300M model with my own data. However, when using the latest streaming inference methods and specifying one female speaker for long-text speech synthesis, I find that some words or phrases suddenly switch to a male speaker's voice, which is quite jarring.
I'm curious about the reason behind this sudden change in the speaker's voice during streaming inference. Could it be because speech tokens are segmented as input? Is it possible to solve this problem by adjusting `token_min_hop_len`, `token_max_hop_len`, `token_overlap_len`, or other parameters?
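For context, streaming synthesis typically feeds the decoder overlapping chunks of speech tokens, and the hop/overlap parameters named above control where those chunk boundaries fall. A minimal sketch of such overlap segmentation (a hypothetical helper for illustration, not the actual CosyVoice implementation):

```python
def segment_tokens(tokens, hop_len, overlap_len):
    """Split a token sequence into overlapping chunks for streaming synthesis.

    Each chunk carries `overlap_len` extra tokens of lookahead context.
    If that context is too short, the model conditions on less history at
    each boundary, which is where artifacts like voice drift can appear.
    """
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + hop_len + overlap_len, len(tokens))
        chunks.append(tokens[start:end])
        start += hop_len
    return chunks

# e.g. 10 tokens, hop of 4, overlap of 2 -> adjacent chunks share 2 tokens
print(segment_tokens(list(range(10)), hop_len=4, overlap_len=2))
# [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9], [8, 9]]
```

Under this reading, increasing the overlap (or the minimum hop) trades latency for more context at each boundary, which is why tuning those parameters could plausibly affect the artifact.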
Looking forward to your reply.
Set `stream=False`.
This is due to a train/inference mismatch. First, we use fp16 inference (we will add AMP training later). Second, we train on whole utterances, so there is some performance degradation in streaming inference mode, and this is not easy to solve.
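A minimal numeric illustration of the fp16/fp32 mismatch mentioned above: weights trained in fp32 lose precision when cast to fp16 for inference, and this quantization error can compound across many layers (the values below are illustrative, not from the model):

```python
import numpy as np

# A weight value as trained (fp32) versus as used at fp16 inference time.
w32 = np.float32(0.1234567)
w16 = np.float16(w32)  # fp16 has a 10-bit mantissa (~3 decimal digits)

# The cast is lossy: the fp16 value no longer equals the trained value.
error = abs(float(w16) - float(w32))
print(w32, float(w16), error)
```

AMP training (mixed fp16/fp32) would expose the model to this reduced precision during training, closing part of the train/inference gap.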
Thanks for your reply. Do you have any recommendations for mitigating the voice-changing phenomenon during streaming inference? It rarely appears in short-sentence synthesis compared with long-text synthesis.
Great work, thanks.