[debug] support flow cache, for sharper tts_mel output #412

Closed

boji123 (Contributor) commented Sep 20, 2024

[Figure: flow tts_mel output with and without cache]

This is Boji (柏基).
This PR is a fix for problem 2 in #379.

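In flow matching, z and mu are not fixed for a given index across chunks. This is one of the factors that blurs the spectrum where chunks join (at root this is a problem with the flow's attention context, which has no real fix).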
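The idea of pinning z and mu on the overlap can be sketched roughly as follows. This is a minimal NumPy illustration, not the code in this PR: the function name `prepare_flow_inputs`, the `(n_mels, T)` layout, and the cache being a plain tuple are all assumptions made for the example. The overlap of 34 frames is taken from the figure description.

```python
import numpy as np

OVERLAP = 34  # overlap length in mel frames (value from the figure description)


def prepare_flow_inputs(mu, rng, cache=None, overlap=OVERLAP):
    """Draw noise z for one chunk, reusing cached z and mu on the overlap.

    mu:    (n_mels, T) conditioning for this chunk (hypothetical layout)
    cache: None, or a (z_tail, mu_tail) pair saved from the previous chunk
    """
    mu = mu.copy()
    z = rng.standard_normal(mu.shape)
    if cache is not None:
        z_tail, mu_tail = cache
        n = z_tail.shape[1]
        # Pin the overlapping frames to the values the previous chunk used,
        # so the same index sees the same z and mu in both chunks.
        z[:, :n] = z_tail
        mu[:, :n] = mu_tail
    # Save the last `overlap` frames for the next chunk.
    new_cache = (z[:, -overlap:].copy(), mu[:, -overlap:].copy())
    return z, mu, new_cache


rng = np.random.default_rng(0)
z1, mu1, cache = prepare_flow_inputs(rng.standard_normal((80, 100)), rng)
z2, mu2, _ = prepare_flow_inputs(rng.standard_normal((80, 120)), rng, cache)
```

Without the cache, `z2[:, :34]` would be a fresh random draw and would not match `z1[:, -34:]`, which is the mismatch described above.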

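The figure shows the flow's tts_mel output, used to compare the context and the spectral blurring.
Large figure, column 1: without cache; column 2: with cache.
Small panels, left: last 34 frames of the previous chunk; middle: (previous + next) / 2; right: first 34 frames of the next chunk.
With the cache, the tts_mel spectrum is visibly sharper.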

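*Because the later mel fade, hifigan cache, and speech fade stages already compensate for much of the mismatch, this change is unlikely to produce an audible improvement in most cases, even though it addresses a more fundamental cause. With more testing, though, some bad cases do improve.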
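For readers unfamiliar with the mel fade step mentioned here, the general idea can be sketched as a crossfade over the overlapping frames. This is only an illustration under assumptions: the function name `crossfade_mel` is hypothetical, and a linear ramp is used for simplicity, which may differ from the window the project actually applies.

```python
import numpy as np


def crossfade_mel(prev_tail, next_mel):
    """Blend the saved tail of the previous chunk into the head of the next.

    prev_tail: (n_mels, overlap) frames kept from the previous chunk
    next_mel:  (n_mels, T) mel of the next chunk, with T >= overlap
    """
    overlap = prev_tail.shape[1]
    # Linear ramp (an assumption): weights of the two chunks sum to 1.
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = next_mel.copy()
    out[:, :overlap] = next_mel[:, :overlap] * fade_in + prev_tail * fade_out
    return out
```

A crossfade like this smooths the join, but when the two chunks disagree on the overlap it averages two different spectra, which is consistent with the blur visible in the middle panels of the figure.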

boji123 (Contributor, Author) commented Sep 24, 2024

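Addendum: this can also reduce abrupt volume changes between chunks in streaming inference (with the cache, the next chunk has a volume reference).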

aluminumbox (Collaborator) commented

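True streaming with a transformer requires causal inference. Although the overlap result is kept, the context seen by the flow matching decoder during each chunk's diffusion has already changed, so the newly generated mel still does not line up with the mel from the previous overlap. We are already training a true streaming model, so I am closing this PR for now.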
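The point about context can be illustrated with a toy attention layer. This is a sketch only, not the CosyVoice decoder: it uses a single head with identity projections, and the helper name `self_attention` is hypothetical. It checks whether the outputs for the first 50 frames stay the same after 20 more frames are appended.

```python
import numpy as np


def self_attention(x, causal):
    """Single-head dot-product self-attention with identity projections.

    x: (T, d) sequence. With causal=True, frame t attends only to frames <= t.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])
    if causal:
        # Mask out attention to future frames.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
short = rng.standard_normal((50, 8))
longer = np.concatenate([short, rng.standard_normal((20, 8))])

# Do the first 50 outputs survive the arrival of 20 more frames?
full_same = np.allclose(self_attention(short, False), self_attention(longer, False)[:50])
causal_same = np.allclose(self_attention(short, True), self_attention(longer, True)[:50])
```

Under full attention the earlier outputs shift once new frames arrive, whereas under the causal mask they do not, which is why overlap frames generated with non-causal context do not reproduce exactly in the next chunk.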

boji123 (Contributor, Author) commented Sep 29, 2024

Using cached values for the flow matching inputs Z and MU is still meaningful. Your causal approach will need this cache as well, because of the randomness in flow matching.
