This repository is built on top of ESPnet. The paper has been (re)submitted to Signal Processing Letters. Audio samples are here.

Various notebook examples:

- Aligner
- Text effects (long phonemes, questions)
- Prosody transfer
- Zero-shot language transfer:
  - English to German
    - Evaluation on CSS10 with Whisper
  - English to Hungarian
    - Evaluation on CSS10 with Whisper
  - English to Spanish
    - Evaluation on CSS10 with Whisper
  - English to Mandarin
    - For info on how to run the baseline, refer to my clone of zm-text-tts
    - Evaluation on AISHELL-3 with FunASR Paraformer