Audio Generation Quality Issues (Model by Model Breakdown) #307

@YachtRockAM

I've been testing the latest version of VoiceBox (v0.3.0) by uploading a voice with a thick Australian male accent (originally created with Gemini TTS) and attempting to clone it, and found the following issues. I'm using an MSI gaming laptop with 16 GB of RAM, an Nvidia RTX 4050 graphics card with 6 GB of VRAM, and a 13th-gen Intel Core i7-13620H.

Qwen3-TTS (both 1.7B and 0.6B) - Best of the models, although generation takes about six times the length of the audio (30 seconds of audio takes 3 minutes to generate). The output is clear, but it sometimes misses natural pauses (producing a run-on sentence) or makes the last word sound like it's going to continue instead of observing a full stop. Fails to observe emotion tags.
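
To make timings comparable across models, the slowdown above can be expressed as a real-time factor: generation time divided by audio duration. This is only a minimal sketch of that arithmetic using the figures reported here; the function name `real_time_factor` is hypothetical and not part of VoiceBox.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration.

    Hypothetical helper for illustration; not a VoiceBox API.
    A value above 1.0 means generation is slower than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the Qwen3-TTS report: 3 minutes to generate 30 seconds of audio.
rtf = real_time_factor(generation_seconds=3 * 60, audio_seconds=30)
print(f"RTF: {rtf:.1f}x")  # RTF: 6.0x
```

The same calculation could be repeated for the other models if their generation times are measured on the same clip.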

ChatterBox Turbo - The best for emotion, even without the emotion tags (it doesn't observe them; it actually reads them aloud), BUT, and this is a big BUT, it randomly turns to crap in the last third of each generation, with totally made-up nonsense words or repeating itself mid-sentence. Annoying, because it generates quickly and the first half of the file sounds great.

Chatterbox - Like Qwen3-TTS, takes a while to generate the audio. It lost the accent entirely, producing an American voice. It also fails to observe the emotion tags and reads them aloud instead.

LuxTTS - Sounds like the person is calling a horse race! For no reason whatsoever, it speeds up the voice. Some generations also go wrong, with a warped sound partway through. It does observe emotion tags, though, and generation is lightning fast.
