Skip to content

jamiepine/voicebox

Repository files navigation

Voicebox

Voicebox

The open-source voice synthesis studio.
Clone voices. Generate speech. Apply effects. Build voice-powered apps.
All running locally on your machine.

Downloads Release Stars License

voicebox.shDocsDownloadFeaturesAPI


Voicebox App Screenshot

Click the image above to watch the demo video on voicebox.sh


Voicebox Screenshot 2

Voicebox Screenshot 3


What is Voicebox?

Voicebox is a local-first voice cloning studio — a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio, generate speech in 23 languages across 5 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.

  • Complete privacy — models and voice data stay on your machine
  • 5 TTS engines — Qwen3-TTS, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, and HumeAI TADA
  • 23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more
  • Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters
  • Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo
  • Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters
  • Stories editor — multi-track timeline for conversations, podcasts, and narratives
  • API-first — REST API for integrating voice synthesis into your own projects
  • Native performance — built with Tauri (Rust), not Electron
  • Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker

Download

Platform Download
macOS (Apple Silicon) Download DMG
macOS (Intel) Download DMG
Windows Download MSI
Docker docker compose up

View all binaries →

Linux — Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.


Features

Multi-Engine Voice Cloning

Five TTS engines with different strengths, switchable per-generation:

Engine Languages Strengths
Qwen3-TTS (0.6B / 1.7B) 10 High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper")
LuxTTS English Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU
Chatterbox Multilingual 23 Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more
Chatterbox Turbo English Fast 350M model with paralinguistic emotion/sound tags
TADA (1B / 3B) 10 HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment

Emotions & Paralinguistic Tags

Type / in the text input to insert expressive tags that the model synthesizes inline with speech (Chatterbox Turbo):

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

Post-Processing Effects

8 audio effects powered by Spotify's pedalboard library. Apply after generation, preview in real time, build reusable presets.

Effect Description
Pitch Shift Up or down by up to 12 semitones
Reverb Configurable room size, damping, wet/dry mix
Delay Echo with adjustable time, feedback, and mix
Chorus / Flanger Modulated delay for metallic or lush textures
Compressor Dynamic range compression
Gain Volume adjustment (-40 to +40 dB)
High-Pass Filter Remove low frequencies
Low-Pass Filter Remove high frequencies

Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.

Unlimited Generation Length

Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

  • Configurable auto-chunking limit (100–5,000 chars)
  • Crossfade slider (0–200ms) for smooth transitions
  • Max text length: 50,000 characters
  • Smart splitting respects abbreviations, CJK punctuation, and [tags]

Generation Versions

Every generation supports multiple versions with provenance tracking:

  • Original — clean TTS output, always preserved
  • Effects versions — apply different effects chains from any source version
  • Takes — regenerate with a new seed for variation
  • Source tracking — each version records its lineage
  • Favorites — star generations for quick access

Async Generation Queue

Generation is non-blocking. Submit and immediately start typing the next one.

  • Serial execution queue prevents GPU contention
  • Real-time SSE status streaming
  • Failed generations can be retried
  • Stale generations from crashes auto-recover on startup

Voice Profile Management

  • Create profiles from audio files or record directly in-app
  • Import/export profiles to share or back up
  • Multi-sample support for higher quality cloning
  • Per-profile default effects chains
  • Organize with descriptions and language tags

Stories Editor

Multi-voice timeline editor for conversations, podcasts, and narratives.

  • Multi-track composition with drag-and-drop
  • Inline audio trimming and splitting
  • Auto-playback with synchronized playhead
  • Version pinning per track clip

Recording & Transcription

  • In-app recording with waveform visualization
  • System audio capture (macOS and Windows)
  • Automatic transcription powered by Whisper (including Whisper Turbo)
  • Export recordings in multiple formats

Model Management

  • Per-model unload to free GPU memory without deleting downloads
  • Custom models directory via VOICEBOX_MODELS_DIR
  • Model folder migration with progress tracking
  • Download cancel/clear UI

GPU Support

Platform Backend Notes
macOS (Apple Silicon) MLX (Metal) 4-5x faster via Neural Engine
Windows / Linux (NVIDIA) PyTorch (CUDA) Auto-downloads CUDA binary from within the app
Linux (AMD) PyTorch (ROCm) Auto-configures HSA_OVERRIDE_GFX_VERSION
Windows (any GPU) DirectML Universal Windows GPU support
Intel Arc IPEX/XPU Intel discrete GPU acceleration
Any CPU Works everywhere, just slower

API

Voicebox exposes a full REST API for integrating voice synthesis into your own apps.

# Generate speech
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'

# List voice profiles
curl http://localhost:17493/profiles

# Create a profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{"name": "My Voice", "language": "en"}'

Use cases: game dialogue, podcast production, accessibility tools, voice assistants, content automation.

Full API documentation available at http://localhost:17493/docs.


Tech Stack

Layer Technology
Desktop App Tauri (Rust)
Frontend React, TypeScript, Tailwind CSS
State Zustand, React Query
Backend FastAPI (Python)
TTS Engines Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA
Effects Pedalboard (Spotify)
Transcription Whisper / Whisper Turbo (PyTorch or MLX)
Inference MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU)
Database SQLite
Audio WaveSurfer.js, librosa

Roadmap

Feature Description
Real-time Streaming Stream audio as it generates, word by word
Voice Design Create new voices from text descriptions
More Models XTTS, Bark, and other open-source voice models
Plugin Architecture Extend with custom models and effects
Mobile Companion Control Voicebox from your phone

Development

See CONTRIBUTING.md for detailed setup and contribution guidelines.

Quick Start

git clone https://github.com/jamiepine/voicebox.git
cd voicebox

just setup   # creates Python venv, installs all deps
just dev     # starts backend + desktop app

Install just: brew install just or cargo install just. Run just --list to see all commands.

Prerequisites: Bun, Rust, Python 3.11+, Tauri Prerequisites, and Xcode on macOS.

Building Locally

just build          # Build CPU server binary + Tauri app
just build-local    # (Windows) Build CPU + CUDA server binaries + Tauri app

Adding New Voice Models

The multi-engine architecture makes adding new TTS engines straightforward. A step-by-step guide covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.

The guide is optimized for AI coding agents. An agent skill can pick up a model name and handle the entire integration autonomously — you just test the build locally.

Project Structure

voicebox/
├── app/              # Shared React frontend
├── tauri/            # Desktop app (Tauri + Rust)
├── web/              # Web deployment
├── backend/          # Python FastAPI server
├── landing/          # Marketing website
└── scripts/          # Build & release scripts

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

  1. Fork the repo
  2. Create a feature branch
  3. Make your changes
  4. Submit a PR

Security

Found a security vulnerability? Please report it responsibly. See SECURITY.md for details.


License

MIT License — see LICENSE for details.


voicebox.sh