[debug] support flow cache, for sharper tts_mel output (handle prompt bug) #455

Merged

@boji123 boji123 commented Sep 30, 2024

Fix for problem 2 in #379.

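In flow matching, z and mu are not fixed for a given frame index across chunks, which is one of the causes of the blurred spectrum at chunk boundaries. (The root cause is the attention context of the flow, which this change cannot solve: that would require retraining the model and adding a context cache, which is very complex.)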
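The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `sample_with_cache`, the time-major list-of-frames layout, and the assumption that each chunk begins with the frames that ended the previous chunk are all hypothetical. The overlap length of 34 is taken from the figure description in this PR. Each chunk is assumed to be at least `OVERLAP` frames long.

```python
import random

OVERLAP = 34  # assumed overlap in mel frames, taken from the figure description


def sample_with_cache(mu, cache=None, rng=None):
    """Draw noise z for one chunk, pinning the overlap region to cached values.

    mu    : list of frames, each frame a list of floats (time-major)
    cache : None, or (z_tail, mu_tail) saved from the previous chunk
    Returns (z, mu, new_cache); the input mu is not modified.
    """
    rng = rng if rng is not None else random.Random()
    mu = [frame[:] for frame in mu]  # work on a copy
    z = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in mu]

    if cache is not None:
        z_tail, mu_tail = cache
        n = len(z_tail)
        # Overwrite the leading frames so the shared region gets identical
        # z and mu in both chunks instead of freshly sampled values.
        z[:n] = [f[:] for f in z_tail]
        mu[:n] = [f[:] for f in mu_tail]

    # Save the trailing frames for the next chunk.
    new_cache = ([f[:] for f in z[-OVERLAP:]], [f[:] for f in mu[-OVERLAP:]])
    return z, mu, new_cache
```

Calling the function once per chunk and passing `new_cache` forward keeps z and mu identical on the shared frames. It does not address the attention-context issue mentioned above.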

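Benefits:
1. Reduces blurred spectrum and unclear pronunciation at chunk boundaries
2. Reduces audio dropouts at chunk boundaries
3. Reduces sudden or abnormal volume changes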

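The figure shows the flow's tts_mel output, to compare the context and the spectral blur.
Large figure, column 1: without cache; column 2: with cache
Small panels, left: last 34 frames of the previous chunk; middle: (previous + next) / 2; right: first 34 frames of the next chunk
With the cache, the tts_mel spectrum is visibly sharper.
Note that for zeroshot, a prompt is part of the input, so the cache for the prompt portion also has to be handled.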
[image: tts_mel comparison, without cache vs. with cache]
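The three small panels can be reproduced from any two consecutive mel chunks. The sketch below is an assumption about how such a comparison could be computed, not the plotting code used for the figure. `seam_panels` and `mean_abs_diff` are hypothetical helpers, and mels are represented as time-major lists of frames.

```python
def seam_panels(prev_mel, next_mel, overlap=34):
    """Return (prev_tail, average, next_head) for the overlap region.

    prev_mel, next_mel : lists of frames, each frame a list of floats
    overlap            : number of shared frames (34 in the figure description)
    """
    prev_tail = prev_mel[-overlap:]
    next_head = next_mel[:overlap]
    average = [
        [(a + b) / 2 for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(prev_tail, next_head)
    ]
    return prev_tail, average, next_head


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally shaped frame lists."""
    diffs = [abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)
```

A small `mean_abs_diff(prev_tail, next_head)` means the two chunks agree on the shared frames, so the averaged middle panel stays sharp. A large value means the average smears two different spectra together.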

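Notes:
1. Later stages (mel fade, HiFi-GAN cache, speech fade) already compensate for the problem, so although this fix addresses a more fundamental cause, it is unlikely to produce a large audible improvement.
2. With the default configuration, dropouts can occur every 2 s. If nothing is being spoken at that moment, the difference is hard to notice, so not every case is guaranteed to show at least one improvement.
3. Using cached values for the flow matching inputs Z and MU is meaningful. First, it reduces randomness and makes the output more stable. Second, a future causal implementation will need the same cache, so it can be reused.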

Test cases verified to work:
SFT
from cosyvoice.cli.cosyvoice import CosyVoice
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
cosyvoice.inference_sft('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '中文女', stream=True)

zeroshot
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-25Hz')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=True)
zeroshot, before the change
image
zeroshot, after the change
image

boji123 commented Sep 30, 2024

Other test cases (only unchanged and improved cases were found; no regressions so far):
1.
噢,唱歌吗?当然会啦!我虽然不是专业歌手,但是我喜欢用欢快的歌声来表达自己的情感。有时候我会和花朵们一起唱歌,它们也会和我一起合唱呢!
SFT 中文女, at 4 s: dropout improved
image
image

2.
你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?
zeroshot zero_shot_prompt.wav, at 4.45 s: an abnormal noise resembling a mouse click is reduced (audible only; not visible in the spectrogram)

3.
<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing.
zeroshot zero_shot_prompt.wav, at 2.10 s: hoarse pronunciation and blurred spectrum improved
image
image

PasiKoodaa added a commit to PasiKoodaa/CosyVoice-optimized that referenced this pull request Oct 6, 2024
@aluminumbox aluminumbox changed the base branch from main to dev/lyuxiang.lx October 16, 2024 06:11
@aluminumbox aluminumbox merged commit ace734d into FunAudioLLM:dev/lyuxiang.lx Oct 16, 2024
1 of 2 checks passed