[debug] support flow cache, for sharper tts_mel output (handle prompt bug) #455
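Solution to problem 2 in #379.
In flow matching, z and mu are not fixed at a given index from one chunk to the next, which is one of the causes of the blurred spectrum at chunk boundaries. (The root cause is the flow's attention context, which cannot be solved here: it would require retraining the model and adding a context cache, which is very complex.)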
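Benefits:
1. Reduces spectral blurring and unclear pronunciation at chunk boundaries
2. Reduces audio dropouts at chunk boundaries
3. Reduces sudden volume jumps and abnormal volume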
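The figures show the flow's tts_mel output, to compare the context and the spectral blurring.
Large figure, column 1: without cache; column 2: with cache
Small figures, left: last 34 frames of the previous chunk; middle: (previous + next) / 2; right: first 34 frames of the next chunk
With the cache, the tts_mel spectrum is visibly sharper.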
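The three small panels described above can be reproduced with a few lines of NumPy. This is only an illustrative sketch: `overlap_panels` and `overlap_mismatch` are hypothetical helpers, not functions from the repository, and it assumes each mel array has shape (n_mels, frames) with any prompt frames already stripped.

```python
import numpy as np

def overlap_panels(prev_mel, next_mel, overlap=34):
    """Return (prev_tail, mean, next_head), each of shape (n_mels, overlap).

    prev_mel, next_mel: tts_mel of two consecutive chunks, shape (n_mels, T).
    Assumption: the last `overlap` frames of prev_mel and the first
    `overlap` frames of next_mel cover the same stretch of audio.
    """
    prev_tail = prev_mel[:, -overlap:]
    next_head = next_mel[:, :overlap]
    mean = 0.5 * (prev_tail + next_head)
    return prev_tail, mean, next_head

def overlap_mismatch(prev_mel, next_mel, overlap=34):
    """Mean absolute difference between the two renderings of the overlap."""
    prev_tail, _, next_head = overlap_panels(prev_mel, next_mel, overlap)
    return float(np.mean(np.abs(prev_tail - next_head)))
```

The closer the two renderings of the overlap are, the less detail is lost when they are averaged, so a smaller `overlap_mismatch` is one rough way to quantify what the middle panel shows visually.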
Note that for zero-shot, a prompt is also fed in, so the cache for that portion needs to be handled as well.
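The idea of pinning z and mu across chunks, including the prompt portion, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code in this PR: `apply_flow_cache` is a hypothetical name, and the layout is assumed to be [prompt frames | chunk frames], where the last `overlap` frames of one chunk are regenerated as the first `overlap` frames after the prompt in the next chunk.

```python
import numpy as np

def apply_flow_cache(mu, cache, prompt_len, rng, overlap=34):
    """Draw noise z for one chunk, reusing cached z and mu where available.

    mu:    (n_mels, T) conditioning for [prompt | chunk] frames
    cache: (2, n_mels, C) stacked (z, mu) saved from the previous chunk,
           or None for the first chunk
    Assumes the chunk part is at least `overlap` frames long.
    Returns (z, mu, new_cache).
    """
    z = rng.standard_normal(mu.shape)
    mu = mu.copy()
    if cache is not None:
        c = cache.shape[-1]
        # Pin the prompt and overlap frames to the values used last time.
        z[:, :c] = cache[0]
        mu[:, :c] = cache[1]

    def keep(x):
        # Save the prompt frames plus the tail that the next chunk repeats.
        return np.concatenate([x[:, :prompt_len], x[:, -overlap:]], axis=1)

    new_cache = np.stack([keep(z), keep(mu)])
    return z, mu, new_cache
```

With this layout the saved cache lines up with the start of the next chunk's input, so the same noise and conditioning are seen at every shared index. For SFT, `prompt_len` would simply be 0.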
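Notes:
1. Because the later stages (mel fade, hifigan cache, speech fade) already compensate, this change, although more fundamental, has only a small chance of producing an audible improvement in the final output.
2. With the default configuration, dropouts occur with a period of 2 s. If nothing is being voiced at that moment, the difference is hard to notice, so there is no guarantee that every case shows at least one improvement.
3. Using cached values for the flow matching inputs Z and MU is meaningful. First, it reduces randomness and makes the results more stable; second, a future causal implementation will need the same cache, so it can be reused.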
Test cases verified to work:
SFT
from cosyvoice.cli.cosyvoice import CosyVoice
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
cosyvoice.inference_sft('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '中文女', stream=True)
zeroshot
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-25Hz')
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=True)
zeroshot, before the change
zeroshot, after the change