Audio Generation Quality Issues (Model by Model Breakdown)

I've been testing the latest version of VoiceBox (v0.3.0) by uploading a voice with a thick Australian male accent (originally created with Gemini TTS) and attempting to clone it, and found the following issues. I'm using an MSI gaming laptop with 16GB of RAM, an NVIDIA RTX 4050 graphics card with 6GB of VRAM, and a 13th-gen Intel Core i7-13620H.

Qwen3-TTS (both 1.7B and 0.6B) - The best of the models, although generation takes about six times the length of the audio (30 seconds of audio takes 3 minutes to generate). The output is clear, but it sometimes misses natural pauses (producing a run-on sentence) or makes the last word sound like it's going to continue instead of observing a full stop. It fails to observe emotion tags.
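To make the speed figure comparable across models, it can be expressed as a real-time factor (generation time divided by audio length). This is only a minimal sketch: the function name is made up for illustration, and the numbers are the rough Qwen3-TTS timings quoted above, not a careful benchmark.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return how many seconds of compute are spent per second of audio.

    A value above 1.0 means generation is slower than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Rough figures from the Qwen3-TTS run: 30 s of audio took about 3 minutes.
rtf = real_time_factor(generation_seconds=3 * 60, audio_seconds=30)
print(f"Real-time factor: {rtf:.1f}x")
```

Timing the other models the same way would turn "takes a while" and "lightning fast" into numbers that can be compared directly.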

Chatterbox Turbo - The best for emotion, even without the emotion tags (it doesn't so much ignore them as read them aloud instead of observing them), BUT, and this is a big BUT, it randomly turns to crap in the last third of each generation, with totally made-up nonsense words or repeating itself mid-sentence. Annoying, because it generates quickly and the first half of the file sounds great.

Chatterbox - Like Qwen3-TTS, it takes a while to generate the audio. It lost all accent, creating an American voice. It also fails to observe the emotion tags and reads them aloud instead.

LuxTTS - Sounds like the person is calling a horse race! For no reason whatsoever, it speeds up the voice. Some generations also glitch, producing a warped sound part-way through. It does observe emotion tags though, and is lightning fast at generation.
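For the models that read the emotion tags aloud (Chatterbox and Chatterbox Turbo), one possible stopgap is to strip the tags from the text before generation. The sketch below is not VoiceBox code: it assumes tags are written in square brackets such as `[laughs]`, which may not match the actual tag syntax, and `strip_emotion_tags` is a hypothetical helper name.

```python
import re

# Assumption: emotion tags look like "[laughs]" or "[excited]".
# Adjust the pattern if the real tag syntax differs.
TAG_PATTERN = re.compile(r"\s*\[[^\[\]]+\]")


def strip_emotion_tags(text: str) -> str:
    """Remove bracketed tags and tidy up the whitespace left behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


script = "[excited] G'day mate! [laughs] How are you?"
print(strip_emotion_tags(script))
```

This only stops the tags from being spoken. It does not make those models act on them, so the underlying issue still stands.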

Audio Generation Quality Issues (Model by Model Breakdown) #307
