A beautiful, modern web application for AI-powered voice synthesis using Microsoft's VibeVoice model. Generate natural-sounding speech from text with custom voice profiles.
- Voice Training: Upload audio files or record your voice directly
- Text-to-Speech: Convert text or text files to natural speech
- Multiple Speakers: Support for up to 4 distinct speakers
- Voice Library: Save and manage custom voice profiles
- Beautiful UI: Modern, responsive design with dark/light themes
- Real-time Processing: Fast speech generation with streaming support
- Audio Visualization: Live waveform display during recording
- Download & Save: Export generated audio files
- Python 3.9 or higher
- CUDA-capable GPU (recommended)
- 8GB+ RAM
- Clone the repository
git clone https://github.com/shamspias/vibevoice-studio.git
cd vibevoice-studio
- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install VibeVoice
git clone https://github.com/shamspias/VibeVoice
cd VibeVoice
pip install -e .
cd ..
- Install dependencies
pip install -r requirements.txt
- Configure environment
cp .env.example .env
# Edit .env with your settings
- Run the application
python -m app.main
- Open in browser
http://localhost:8000
- Upload or record voices
- Support for WAV, MP3, M4A, FLAC
- Organized voice library
- Manual input or upload .txt files
- Multi-speaker support for conversations
- Voice strength (CFG scale 1.0-2.0)
- Up to 4 speakers
- Adjustable inference steps
- Play in browser
- Download WAV file
- Save to library
Edit .env:
HOST=0.0.0.0
PORT=8000
DEBUG=False
MODEL_PATH=microsoft/VibeVoice-1.5B
DEVICE=cuda
CFG_SCALE=1.3
SAMPLE_RATE=24000
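As a rough illustration of how these values might be read and checked at startup, the sketch below parses the same keys from a mapping such as `os.environ`. The `Settings` dataclass and `load_settings` function are hypothetical names, not part of VibeVoice Studio, and the application may load its configuration differently. The 1.0-2.0 bound on `CFG_SCALE` comes from the range given elsewhere in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the keys shown in .env."""
    host: str
    port: int
    debug: bool
    model_path: str
    device: str
    cfg_scale: float
    sample_rate: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the values listed above."""
    cfg_scale = float(env.get("CFG_SCALE", "1.3"))
    # The README documents a CFG scale range of 1.0 to 2.0.
    if not 1.0 <= cfg_scale <= 2.0:
        raise ValueError(f"CFG_SCALE must be between 1.0 and 2.0, got {cfg_scale}")
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # Treat common truthy spellings as True; anything else is False.
        debug=env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        model_path=env.get("MODEL_PATH", "microsoft/VibeVoice-1.5B"),
        device=env.get("DEVICE", "cuda"),
        cfg_scale=cfg_scale,
        sample_rate=int(env.get("SAMPLE_RATE", "24000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment, for example `load_settings({"PORT": "9000"})`.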
- Select/upload a voice
- Enter text
- Click "Generate Speech"
Speaker 1: Hello, welcome!
Speaker 2: Thanks, glad to be here.
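To check a multi-speaker script before submitting it, something like the following sketch could split the `Speaker N: text` lines shown above into turns. `parse_script` is a hypothetical helper, not a function provided by the application, and the exact syntax the server accepts may be more lenient or stricter than this pattern. The limit of 4 distinct speakers is taken from this README.

```python
import re
from typing import List, Tuple

MAX_SPEAKERS = 4  # limit stated in this README
_TURN = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(script: str) -> List[Tuple[int, str]]:
    """Return (speaker_number, text) pairs, one per non-empty line."""
    turns: List[Tuple[int, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        match = _TURN.match(line)
        if match is None:
            raise ValueError(f"Expected 'Speaker N: text', got: {line!r}")
        turns.append((int(match.group(1)), match.group(2).strip()))
    # Count distinct speaker numbers, not the number of turns.
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"At most {MAX_SPEAKERS} speakers are supported, got {len(speakers)}")
    return turns
```

For the two-line example above, this returns `[(1, "Hello, welcome!"), (2, "Thanks, glad to be here.")]`.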
- Record 10-30 s of clear speech
- Save with name
- Use for TTS generation
- GET /api/voices: list voices
- POST /api/voices/upload: upload voice
- POST /api/voices/record: record voice
- POST /api/generate: generate speech
- GET /api/audio/{filename}: download audio
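The two GET endpoints listed above can be reached from a script using only the standard library. The sketch below is a minimal client, assuming the server runs at the default `http://localhost:8000` address and that `GET /api/voices` returns JSON. The response format is not documented here, so treat that as an assumption. The function names are illustrative, and the POST endpoints are left out because their request bodies are not specified in this README.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # default address from this README


def voices_url(base_url: str = BASE_URL) -> str:
    """URL for GET /api/voices."""
    return urljoin(base_url, "/api/voices")


def audio_url(filename: str, base_url: str = BASE_URL) -> str:
    """URL for GET /api/audio/{filename}, with the filename percent-encoded."""
    return urljoin(base_url, "/api/audio/" + quote(filename, safe=""))


def list_voices(base_url: str = BASE_URL):
    """Fetch the voice list. Assumes the endpoint responds with JSON."""
    with urlopen(voices_url(base_url), timeout=10) as response:
        return json.load(response)


def download_audio(filename: str, destination: str, base_url: str = BASE_URL) -> None:
    """Save a generated audio file to a local path."""
    with urlopen(audio_url(filename, base_url), timeout=60) as response:
        with open(destination, "wb") as out:
            out.write(response.read())
```

Percent-encoding the filename keeps names containing spaces or other reserved characters from producing a malformed URL.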
Minimum: Python 3.9+, 8GB RAM, CPU with AVX
Recommended: Python 3.10+, 16GB RAM, NVIDIA GPU (8GB+ VRAM)
- OOM: Use smaller model, reduce batch size
- Low quality: Use better voice samples, adjust CFG scale
- Slow generation: Enable GPU, shorten text
- Use GPU for 10-20× speed
- Batch texts
- Cache voices
- Try quantized models
- Fork repo
- Create feature branch
- Commit & push
- Open PR
MIT License
- Microsoft VibeVoice team
- FastAPI community
- Contributors & users
- Issues: GitHub Issues