Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion
Demo page: https://consistencyvc.github.io/ConsistencyVC-demo-page
The whisper medium model can be downloaded here: https://drive.google.com/file/d/1PZsfQg3PUZuu1k6nHvavd6OcOB_8m1Aa/view?usp=drive_link
The pre-trained models are available here: https://drive.google.com/drive/folders/1KvMN1V8BWCzJd-N8hfyP283rLQBKIbig?usp=sharing
Note: The audio must be sampled at 16 kHz for both training and inference.
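As a quick sanity check for the 16 kHz requirement, the sketch below reads the sampling rate of a WAV file using only the Python standard library. The function names (`wav_sample_rate`, `needs_resampling`) are illustrative and not part of this repository, and the sketch only detects a mismatch; resampling itself would need an audio tool of your choice.

```python
import wave

TARGET_SR = 16000  # sampling rate required for training and inference


def wav_sample_rate(path):
    """Return the sampling rate (Hz) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sr=TARGET_SR):
    """Return True if the file's sampling rate differs from target_sr."""
    return wav_sample_rate(path) != target_sr
```

Running `needs_resampling` over the source and reference files before preprocessing may catch mismatched audio early.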
Generate the WEO of the source speech in src with preprocess_ppg.py.
Copy the path of the reference speech to tgt.
Run whisperconvert_exp.py to perform voice conversion using the WEO as content information.
For ConsistencyEVC, run ppgemoconvert_exp.py to perform voice conversion using the PPG as content information.
I uploaded a new py file for inference on long audio. You don't need to run Whisper from a separate file; just change this part and run this py file.
Use ppg.py to generate the PPG.
Use preprocess_ppg.py to generate the WEO.
First, train the model without the speaker consistency loss for 100k steps.
Change this line to:
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo
Then run the py file:
python train_whisper_emo.py -c configs/cvc-whispers-multi.json -m cvc-whispers-three
Then change this line back and fine-tune the model with the speaker consistency loss:
python train_whisper_emo.py -c configs/cvc-whispers-three-emo.json -m cvc-whispers-three
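The two-stage schedule above amounts to leaving `loss_emo` out of the generator loss in the first stage and adding it back for fine-tuning. A minimal sketch of that toggle follows, assuming the restored line simply adds `loss_emo` to the other terms; the function name and the `use_consistency` flag are hypothetical, since the training scripts switch stages by editing the line by hand.

```python
def generator_loss(loss_gen, loss_fm, loss_mel, loss_kl, loss_emo,
                   use_consistency):
    """Combine the generator loss terms for one training step.

    use_consistency is a hypothetical flag: False mirrors the first
    stage (loss_emo commented out), True mirrors the fine-tuning stage.
    """
    total = loss_gen + loss_fm + loss_mel + loss_kl
    if use_consistency:
        total = total + loss_emo
    return total
```

The same function works on plain floats or on tensors, since it only uses addition.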
First, train the model without the speaker consistency loss for 100k steps.
Change this line to:
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo
Then run the py file:
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo.json -m cvc-eng-ppgs-three-emo
Then change this line back and fine-tune the model with the speaker consistency loss:
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo-cycleloss.json -m cvc-eng-ppgs-three-emo
The code structure is based on FreeVC-s. Suggestion: please follow the FreeVC instructions to install the Python requirements.
The WEO content feature is based on LoraSVC.
The PPG is from the phoneme recognition model.