Description
I've been testing the latest version of VoiceBox (v0.3.0) by uploading a voice with a thick Australian male accent (originally created with Gemini TTS) and attempting to clone it, and found the following issues. I'm using an MSI gaming laptop with 16GB of RAM, a 6GB Nvidia RTX 4050 graphics card, and a 13th-gen Intel Core i7-13620H.
Qwen3-TTS (both 1.7B and 0.6B) - Best of the models, although generation takes about 6 times the length of the audio (30 seconds of audio takes 3 minutes to generate). The audio is clear, but it sometimes misses natural pauses (sounding like a run-on sentence) or makes the last word sound like it's going to continue on instead of observing a full stop. Fails to observe emotion tags.
ChatterBox Turbo - The best for emotion, even without the emotion tags (it doesn't observe them; it actually reads them aloud), BUT, and this is a big BUT, it randomly turns to crap in the last third of each generation, with totally made-up nonsense words, or repeating itself mid-sentence. Annoying, because it generates quickly and the first half of the file sounds great.
Chatterbox - Like Qwen3-TTS, it takes a while to generate the audio. It lost all accent, creating an American voice. It also fails to observe the emotion tags and reads them aloud instead.
LuxTTS - Sounds like the person is calling a horse race! For no reason whatsoever, it speeds up the voice. Some generations also glitch, producing a warped sound partway through. It does observe emotion tags though, and is lightning fast at generation.