[debug] support flow cache, for sharper tts_mel output (handle prompt bug) #455

boji123 · 2024-09-30T02:23:22Z

Solution to problem 2 in #379.

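In flow matching, z and mu are not fixed for a given frame index from one chunk to the next, and this is one of the causes of the blurred spectrum at chunk boundaries. (The root cause is the flow's attention context, which cannot be fixed here: that would require retraining the model and adding a context cache, which is very complex.)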
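To make the mechanism concrete, here is a minimal sketch in plain Python. It is not the code from this PR. It assumes each chunk is a list of per-frame values, that consecutive chunks share 34 frames (the number used in the figures), and that `draw_chunk` is a hypothetical helper. Without a cache, the noise on the shared frames is redrawn for every chunk; with a cache, the next chunk reuses the previous chunk's z and mu on those frames.

```python
import random

OVERLAP = 34  # assumed number of frames shared by consecutive chunks


def draw_chunk(mu, cache=None, overlap=OVERLAP, rng=random):
    """Hypothetical helper: sample noise z for one chunk of conditioning mu.

    If `cache` holds (z_tail, mu_tail) from the previous chunk, the first
    frames of this chunk reuse those values instead of fresh ones.
    Returns (z, mu, new_cache).
    """
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    mu = list(mu)
    if cache is not None:
        z_tail, mu_tail = cache
        z[:len(z_tail)] = z_tail
        mu[:len(mu_tail)] = mu_tail
    return z, mu, (z[-overlap:], mu[-overlap:])


rng = random.Random(0)
mu_a = [0.1 * i for i in range(100)]         # previous chunk
# Next chunk: its head covers the same frames as the tail of mu_a,
# but the values differ because they were computed with another context.
mu_b = [0.1 * i + 0.05 for i in range(100)]

z_a, mu_a_used, cache = draw_chunk(mu_a, rng=rng)
z_b_plain, mu_b_plain, _ = draw_chunk(mu_b, rng=rng)           # no cache
z_b_cached, mu_b_cached, _ = draw_chunk(mu_b, cache, rng=rng)  # with cache

print(z_b_plain[:OVERLAP] == z_a[-OVERLAP:])   # False: noise differs at the seam
print(z_b_cached[:OVERLAP] == z_a[-OVERLAP:])  # True: same noise on shared frames
```

Pinning both inputs on the shared frames removes one source of mismatch between the two renderings of the same frames; it does not address the attention-context issue described above.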

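Benefits:
1. Reduces blurred spectrum and unclear pronunciation at chunk boundaries
2. Reduces audio dropouts at chunk boundaries
3. Reduces sudden volume jumps and abnormal volume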

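The figures show the flow's tts_mel output, to compare the context and the spectral blurring.
Large figure, column 1: without cache; column 2: with cache.
Small figure, left: last 34 frames of the previous chunk; middle: (previous + next) / 2; right: first 34 frames of the next chunk.
With the cache, the tts_mel spectrum is visibly sharper.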
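For readers who want to reproduce this kind of comparison, the three small panels can be computed roughly as follows. This is a sketch under assumptions: each tts_mel output is a list of frames, each frame a list of mel-bin values, and `seam_panels` is a hypothetical helper name.

```python
def seam_panels(prev_mel, next_mel, overlap=34):
    """Return the (left, middle, right) panels for one chunk boundary.

    left:   last `overlap` frames of the previous chunk
    right:  first `overlap` frames of the next chunk
    middle: element-wise average of left and right
    """
    left = prev_mel[-overlap:]
    right = next_mel[:overlap]
    middle = [
        [(a + b) / 2 for a, b in zip(frame_l, frame_r)]
        for frame_l, frame_r in zip(left, right)
    ]
    return left, middle, right


# Toy data: 50 frames with 2 mel bins each.
prev_mel = [[float(t), float(t) + 1.0] for t in range(50)]
next_mel = [[float(t) + 2.0, float(t) + 3.0] for t in range(16, 66)]

left, middle, right = seam_panels(prev_mel, next_mel)
print(len(left), len(middle), len(right))  # 34 34 34
```

If the two chunks agree on the shared frames, the middle panel looks like the other two; the more they disagree, the more the average smears.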
Note that for zeroshot the input includes a prompt, so the cache for the prompt portion also has to be handled separately.
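One possible cache layout for the zeroshot case is sketched below in plain Python. The assumptions are: every chunk's frame sequence is laid out as prompt frames followed by generated frames, the prompt length is the same for every chunk, and the overlap is 34 frames. `build_cache` and `apply_cache` are hypothetical names, not identifiers from this PR, and the string labels stand in for per-frame z (or mu) values.

```python
OVERLAP = 34  # assumed overlap between consecutive chunks, in frames


def build_cache(seq, prompt_len, overlap=OVERLAP):
    """Keep the prompt frames plus the last `overlap` generated frames."""
    return seq[:prompt_len] + seq[-overlap:]


def apply_cache(seq, cache):
    """Overwrite the head of `seq` (prompt + overlap region) with cached values."""
    out = list(seq)
    out[:len(cache)] = cache
    return out


prompt_len = 5
chunk1 = [f"p{i}" for i in range(prompt_len)] + [f"a{i}" for i in range(60)]
chunk2 = [f"q{i}" for i in range(prompt_len)] + [f"b{i}" for i in range(60)]

cache = build_cache(chunk1, prompt_len)
fixed = apply_cache(chunk2, cache)

print(fixed[:prompt_len])                                  # ['p0', 'p1', 'p2', 'p3', 'p4']
print(fixed[prompt_len], fixed[prompt_len + OVERLAP - 1])  # a26 a59
print(fixed[prompt_len + OVERLAP])                         # b34
```

The same two functions would be applied to z and to mu. If the prompt part were left out of the cache, the prompt frames would get fresh values on every chunk even though they describe the same audio.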

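Notes:
1. Because the later mel fade, hifigan cache, and speech fade stages already mask much of the problem, this change addresses a more fundamental cause but is less likely to produce an audible improvement in the final output.
2. With the default configuration, dropouts occur with a period of 2 s. If no speech happens to fall at that moment, the difference is hard to notice, so there is no guarantee that every case shows at least one improvement.
3. Using cached results for the flow matching inputs Z and MU is worthwhile. First, it reduces randomness and makes the results more stable. Second, a future causal implementation will need this cache as well, so it can be reused.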
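The mel fade mentioned in note 1 can be pictured as a cross-fade over the overlapping frames. The sketch below uses a raised-cosine ramp purely as an assumption; the window used in the repository may differ, and `crossfade` is a hypothetical helper operating on one value per frame.

```python
import math


def crossfade(prev_tail, next_head):
    """Blend two renderings of the same frames, fading prev out and next in."""
    n = len(prev_tail)
    out = []
    for i, (a, b) in enumerate(zip(prev_tail, next_head)):
        # Weight of the next chunk: rises smoothly from about 0 to about 1.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / n)
        out.append((1.0 - w) * a + w * b)
    return out


blended = crossfade([1.0] * 34, [0.0] * 34)
print(blended[0] > 0.99, blended[-1] < 0.01)  # True True
```

A fade like this hides a mismatch between the two renderings by blending them, which is why the final audio can sound acceptable even when the underlying tts_mel frames disagree.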

Test cases verified to work:
SFT
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
cosyvoice.inference_sft('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '中文女', stream=True)

zeroshot
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-25Hz')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=True)
zeroshot, before the change

zeroshot, after the change

Dev/lyuxiang.lx

boji123 · 2024-09-30T02:56:53Z

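Other test cases that work (so far only cases with no change or with an improvement have been found; none got worse):
1.
噢，唱歌吗？当然会啦！我虽然不是专业歌手，但是我喜欢用欢快的歌声来表达自己的情感。有时候我会和花朵们一起唱歌，它们也会和我一起合唱呢！
SFT, 中文女: dropout at 4 s improved

2.
你好，我是通义生成式语音大模型，请问有什么可以帮您的吗？
zeroshot, zero_shot_prompt.wav: abnormal noise resembling a mouse click at 4.45 s improved (you have to listen for it; it is not visible in the spectrogram)

3.
<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing.
zeroshot, zero_shot_prompt.wav: hoarse pronunciation and blurred spectrum at 2.10 s improved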

FunAudioLLM#455

aluminumbox and others added 3 commits September 29, 2024 14:54

Merge pull request FunAudioLLM#453 from FunAudioLLM/dev/lyuxiang.lx

d492598

Dev/lyuxiang.lx

[debug] support flow cache, for sharper tts_mel output

c9acce1

[debug] handle cache with prompt

8130abb

mwbdcz approved these changes Sep 30, 2024

PasiKoodaa added a commit to PasiKoodaa/CosyVoice-optimized that referenced this pull request Oct 6, 2024

Update flow.py

69b2480

FunAudioLLM#455

PasiKoodaa added a commit to PasiKoodaa/CosyVoice-optimized that referenced this pull request Oct 6, 2024

Update flow_matching.py

b9a14f7

FunAudioLLM#455

PasiKoodaa added a commit to PasiKoodaa/CosyVoice-optimized that referenced this pull request Oct 6, 2024

Update model.py

b8adba7

FunAudioLLM#455

aluminumbox changed the base branch from main to dev/lyuxiang.lx October 16, 2024 06:11

aluminumbox merged commit ace734d into FunAudioLLM:dev/lyuxiang.lx Oct 16, 2024
1 of 2 checks passed
