MQTTS

  • Official implementation for the paper: A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech.
  • Audio samples (40 per system) can be accessed here.
  • A quick demo can be accessed here (some parts are still TODO).
  • The paper appendix is here.

Set up the environment

  1. Set up the conda environment:
conda create --name mqtts python=3.9
conda activate mqtts
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt

(Update) You may need to create an access token to use the pyannote speaker embedding, as pyannote updated their access policy. If that's the case, follow the pyannote repo and change every Inference("pyannote/embedding", window="whole") call accordingly.
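A minimal sketch of what that change might look like, assuming pyannote.audio 2.x (the token string and audio path below are placeholders):

from pyannote.audio import Inference, Model

# Load the gated embedding model with a Hugging Face access token,
# then wrap it in an Inference that returns one embedding per file.
model = Model.from_pretrained("pyannote/embedding", use_auth_token="YOUR_HF_TOKEN")
inference = Inference(model, window="whole")
embedding = inference("example_speaker.wav")  # numpy array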

  2. Download the pretrained phonemizer checkpoint:
wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt
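
As an optional sanity check, a short sketch (assuming the stock DeepPhonemizer API; the sample sentence is arbitrary) that loads this checkpoint, which infer.py later consumes via --phonemizer_dict_path:

from dp.phonemizer import Phonemizer

# Load the downloaded forward model and phonemize a test sentence.
phonemizer = Phonemizer.from_checkpoint("en_us_cmudict_forward.pt")
print(phonemizer("Spontaneous speech is hard to synthesize.", lang="en_us"))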

Preprocess the dataset

  1. Get the GigaSpeech dataset from the official repo
  2. Install FFmpeg, then
conda install ffmpeg=4.3=hf484d3e_0
conda update ffmpeg
  3. Run the preprocessing script:
python preprocess.py --giga_speech_dir GIGASPEECH --outputdir datasets 

Train the quantizer and run inference

  1. Train
cd quantizer/
python train.py --input_wavs_dir ../datasets/audios \
                --input_training_file ../datasets/training.txt \
                --input_validation_file ../datasets/validation.txt \
                --checkpoint_path ./checkpoints \
                --config config.json
  2. Run inference to get codes for training the second stage:
python get_labels.py --input_json ../datasets/train.json \
                     --input_wav_dir ../datasets/audios \
                     --output_json ../datasets/train_q.json \
                     --checkpoint_file ./checkpoints/g_{training_steps}
python get_labels.py --input_json ../datasets/dev.json \
                     --input_wav_dir ../datasets/audios \
                     --output_json ../datasets/dev_q.json \
                     --checkpoint_file ./checkpoints/g_{training_steps}

Train the transformer (below is an example for the 100M version)

cd ..
mkdir ckpt
python train.py \
     --distributed \
     --saving_path ckpt/ \
     --sampledir logs/ \
     --vocoder_config_path quantizer/checkpoints/config.json \
     --vocoder_ckpt_path quantizer/checkpoints/g_{training_steps} \
     --datadir datasets/audios \
     --metapath datasets/train_q.json \
     --val_metapath datasets/dev_q.json \
     --use_repetition_token \
     --ar_layer 4 \
     --ar_ffd_size 1024 \
     --ar_hidden_size 256 \
     --ar_nheads 4 \
     --speaker_embed_dropout 0.05 \
     --enc_nlayers 6 \
     --dec_nlayers 6 \
     --ffd_size 3072 \
     --hidden_size 768 \
     --nheads 12 \
     --batch_size 200 \
     --precision bf16 \
     --training_step 800000 \
     --layer_norm_eps 1e-05

You can view the progress using:

tensorboard --logdir logs/

Run batched inference

You'll have to change speaker_to_text.json; the file in the repo is just an example.
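
Purely as a hypothetical illustration (the field layout, keys, and paths below are invented; follow the example file in the repo for the actual schema), such a mapping from a reference speaker recording to the sentences to synthesize could be generated like this:

import json

# Hypothetical layout: keys and paths are placeholders, not the repo's actual schema.
speaker_to_text = {
    "reference_speakers/speaker_0.wav": [
        "Hello, this is a spontaneous speech sample.",
        "The quick brown fox jumps over the lazy dog.",
    ],
}

with open("speaker_to_text.json", "w") as f:
    json.dump(speaker_to_text, f, indent=2)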

mkdir infer_samples
CUDA_VISIBLE_DEVICES=0 python infer.py \
    --phonemizer_dict_path en_us_cmudict_forward.pt \
    --model_path ckpt/last.ckpt \
    --config_path ckpt/config.json \
    --input_path speaker_to_text.json \
    --outputdir infer_samples \
    --batch_size {batch_size} \
    --top_p 0.8 \
    --min_top_k 2 \
    --max_output_length {Maximum Output Frames to prevent infinite loop} \
    --phone_context_window 3 \
    --clean_speech_prior

Pretrained checkpoints

  1. Quantizer (put it under quantizer/checkpoints/): here

  2. Transformer (100M version) (put it under ckpt/): model, config
