Skip to content

TTS toolkit built on Kokoro-82M with librosa audio enhancement, MCP server for Claude/Cursor, CLI & Python API. Free & open-source for YouTube creators.

License

Notifications You must be signed in to change notification settings

aparsoft/kokoro-mcp-server

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

51 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Kokoro MCP SERVER: Text To Speech (TTS)

A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.

Python 3.10+ License: Apache 2.0 Code style: black


πŸ“Ί Demo Video

Watch the Demo Video


Features

  • Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
  • 🌐 Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
  • Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
  • MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
  • CLI Interface: Command-line tools for quick generation (OPTIONAL)
  • Batch Processing: Generate multiple audio files efficiently
  • Script Processing: Convert complete video scripts with automatic text chunking
  • Docker Support: Containerization with docker-compose
  • Enterprise Features: Structured logging, configuration management, comprehensive testing
  • CI/CD: GitHub Actions pipeline with automated testing

Streamlit Web Interface (Optional)

Beautiful web UI for managing all TTS functionality:

  • 🎯 Single Generation - Convert text with real-time preview
  • πŸ“¦ Batch Processing - Process multiple texts in one go
  • πŸ“„ Script Processing - Complete video script conversion
  • πŸ” Voice Explorer - Compare all 12 voices side-by-side
  • βš™οΈ Configuration - Manage settings visually
  • πŸ“Š Analytics - Track generations with charts and statistics

Install with: pip install -e ".[streamlit]" or pip install -e ".[complete]"

Quick Start:

python run_streamlit.py
# Opens at http://localhost:8501

πŸ“š See STREAMLIT_README.md for complete Streamlit documentation.


Building on Kokoro-82M

What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.

What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:

  1. Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
  2. Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
  3. IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
  4. Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
  5. Batch processing - CLI and Python APIs for processing multiple segments efficiently

Technical Implementation

Audio Enhancement (librosa Integration):

This toolkit adds an audio processing pipeline on Kokoro generated TTS output:

# Without enhancement - raw Kokoro output
audio = kokoro_pipeline(text)

# With enhancement
audio = enhance_audio(
    kokoro_output,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True         # Smooth transitions
)

Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.

MCP Server Integration:

Traditional workflow:

# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment

With MCP server:

# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.

Workflow Enhancement:

  • Content creators: Write scripts in AI editors, generate voiceovers inline
  • Developers: Generate test audio during development without context switching
  • Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
  • Automation: AI agents can generate audio as part of content pipelines

Deployment Features:

The toolkit wraps Kokoro with common deployment and development needs:

  • Configuration management - Environment-based settings, no hardcoded values
  • Structured logging - JSON logs for aggregation, correlation IDs for tracing
  • Error handling - Custom exceptions, graceful failures, detailed error context
  • Testing - Comprehensive test suite, CI/CD integration
  • Docker deployment - Containerized with health checks, resource limits
  • CLI interface - Quick access without writing code

Use Cases

YouTube/Podcast Production:

# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)

AI-Assisted Content Creation:

# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*

Batch Content Generation:

# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")

Development/Testing:

# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav

Quick Start

Installation

System Dependencies (Required):

# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1

# macOS
brew install espeak ffmpeg

# Windows: Download from
# - espeak-ng: http://espeak.sourceforge.net/
# - ffmpeg: https://ffmpeg.org/download.html

Python Package - Choose Your Installation:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"

# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"

# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"

# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .

# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"

πŸ“š See INSTALLATION.md for detailed installation options and troubleshooting.

Quick Launch

Streamlit Web UI:

# Cross-platform launcher
python run_streamlit.py

# Or use platform-specific scripts
./run_streamlit.sh      # Linux/macOS
run_streamlit.bat        # Windows

# Or direct
streamlit run streamlit_app.py

MCP Server (for Claude Desktop/Cursor):

Basic Usage

from aparsoft_tts import TTSEngine

# Initialize engine
engine = TTSEngine()

# Generate speech
engine.generate(
    text="Welcome to Kokoro YouTube TTS",
    output_path="output.wav"
)

CLI Usage

# Generate audio
aparsoft-tts generate "Hello world" -o output.wav

# List available voices
aparsoft-tts voices

# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav

# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/

Available Voices

Male Voices:

  • am_adam - American male (natural inflection)
  • am_michael - American male (deeper tones, professional)
  • bm_george - British male (classic accent)
  • bm_lewis - British male (modern accent)

Female Voices:

  • af_bella - American female (warm tones)
  • af_nicole - American female (dynamic range)
  • af_sarah - American female (clear articulation)
  • af_sky - American female (youthful energy)
  • bf_emma - British female (professional)
  • bf_isabella - British female (soft tones)

Special Voices:

  • af - Default mix (50-50 blend of Bella and Sarah)

Advanced Usage

Custom Configuration

from aparsoft_tts import TTSEngine, TTSConfig

# Create custom configuration
config = TTSConfig(
    voice="bm_george",
    speed=1.2,
    enhance_audio=True,
    fade_duration=0.2
)

engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")

Audio Enhancement

from aparsoft_tts.utils.audio import enhance_audio

# Generate raw audio
audio = engine.generate("Test audio")

# Apply custom enhancement
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,
    trim_silence=True,
    trim_db=25.0,
    noise_reduction=True,
    add_fade=True,
    fade_duration=0.15
)

Batch Processing

# Process multiple texts
texts = [
    "Welcome to the tutorial",
    "Let's explore the features",
    "Thanks for watching"
]

paths = engine.batch_generate(
    texts=texts,
    output_dir="segments/",
    voice="am_michael"
)

Script Processing

# Process complete video script with automatic text chunking
engine.process_script(
    script_path="video_script.txt",
    output_path="complete_voiceover.wav",
    gap_duration=0.5,  # Gap between paragraphs
    voice="am_michael",
    speed=1.0
)

# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.

Podcast Generation (Multi-Voice)

Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.

Via MCP (Claude Desktop/Cursor):

"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"

Via Python API:

from aparsoft_tts.utils.audio import combine_audio_segments, save_audio

# Define podcast segments with different voices/speeds
segments = [
    {"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
    {"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
    {"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]

# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)

# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)

# Save
save_audio(combined, "podcast.wav", sample_rate=24000)

Via Streamlit UI:

  1. Open Streamlit: python run_streamlit.py
  2. Navigate to "πŸŽ™οΈ Podcast Generation" tab
  3. Click "βž• Add Segment" for each speaker
  4. Configure voice, speed, and text per segment
  5. Adjust gap duration in settings panel
  6. Click "🎧 Generate Podcast"

Features:

  • Per-segment voice control (host/guest conversations)
  • Individual speed settings (emphasis/pacing)
  • Configurable gaps between segments
  • Audio enhancement (normalization, crossfades)
  • Segment reordering (move up/down)
  • Template support for quick start

Streaming Generation

# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated
    process_audio_chunk(chunk)

Model Context Protocol (MCP) Integration

Quick MCP Setup (5 Minutes)

What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.

For Developers: Quick Start

# 1. Find your Python path
which python  # Linux/Mac
where python  # Windows

# Example output: /home/ram/projects/youtube-creator/venv/bin/python

Claude Desktop:

# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json  # macOS
code ~/.config/Claude/claude_desktop_config.json  # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json  # Windows

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Claude (Cmd/Ctrl + R)

Cursor:

# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Cursor completely

Testing MCP Server

# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help

# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
  --command "/path/to/venv/bin/python" \
  --args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274

Usage Examples

In Claude Desktop or Cursor, just ask naturally:

# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"

# Voice discovery
"List all available TTS voices"

# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"

# Script processing
"Process video_script.txt and create a complete voiceover"

# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"

MCP Tools Available

  1. generate_speech: Single audio generation with full control

    • Text input (up to 10,000 characters)
    • Voice selection (6 voices)
    • Speed control (0.5x - 2.0x)
    • Audio enhancement toggle
  2. list_voices: Get voice catalog with descriptions

  3. batch_generate: Process multiple texts efficiently

  4. process_script: Complete video script conversion

    • Automatic paragraph detection
    • Configurable gap duration
    • Handles long texts via automatic chunking

Troubleshooting MCP

"Could not attach to MCP server"

  • Use absolute path: /full/path/to/venv/bin/python
  • Test server runs: python -m aparsoft_tts.mcp_server
  • Check Python version: python --version (needs 3.10+)

"Tool not found"

# Reinstall MCP dependencies
pip install -e ".[mcp]"

# Verify FastMCP
python -c "from fastmcp import FastMCP; print('βœ… OK')"

Detailed Documentation: See TUTORIAL.md for comprehensive MCP guide with advanced features, debugging, and production deployment.


Docker Deployment

Build and Run

# Build image
docker build -t aparsoft-tts:latest .

# Run MCP server
docker run -d \
  --name aparsoft-tts \
  -v $(pwd)/outputs:/app/outputs \
  -v $(pwd)/logs:/app/logs \
  aparsoft-tts:latest

# Run CLI commands
docker run --rm \
  -v $(pwd)/outputs:/app/outputs \
  aparsoft-tts:latest \
  aparsoft-tts generate "Docker test" -o /app/outputs/test.wav

Docker Compose

# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Environment Variables

# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true

# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json

Project Structure

youtube-creator/
β”œβ”€β”€ aparsoft_tts/
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   └── engine.py          # TTS engine
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ audio.py           # Audio processing with librosa
β”‚   β”‚   β”œβ”€β”€ logging.py         # Structured logging
β”‚   β”‚   └── exceptions.py      # Custom exceptions
β”‚   β”œβ”€β”€ config.py              # Configuration management
β”‚   β”œβ”€β”€ cli.py                 # CLI interface
β”‚   └── mcp_server.py          # MCP server (FastMCP)
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ unit/                  # Unit tests
β”‚   └── integration/           # Integration tests
β”œβ”€β”€ examples/                  # Usage examples
β”œβ”€β”€ pyproject.toml             # Project metadata
β”œβ”€β”€ Dockerfile                 # Docker configuration
└── docker-compose.yml         # Docker Compose config

Audio Processing

The toolkit enhances Kokoro's output with professional audio processing:

Features:

  • Normalization: Consistent volume levels
  • Silence Trimming: Remove quiet sections (configurable threshold)
  • Noise Reduction: Spectral gating for cleaner audio
  • Fade In/Out: Smooth transitions, prevents clicks
  • Custom Processing: Extensible with librosa/scipy

Enhancement Pipeline:

from aparsoft_tts.utils.audio import enhance_audio, save_audio

# Generate raw audio
audio = engine.generate("Your text here")

# Apply enhancement pipeline
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,      # Normalize volume
    trim_silence=True,   # Trim silence
    trim_db=20.0,        # Threshold in dB
    noise_reduction=True,  # Apply noise gate
    add_fade=True,       # Add fade in/out
    fade_duration=0.1    # 100ms fade
)

# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)

Configuration

Using Configuration Files

from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config

# TTS settings
tts_config = TTSConfig(
    voice="am_michael",
    speed=1.0,
    enhance_audio=True,
    sample_rate=24000,
    output_format="wav"
)

# MCP server settings
mcp_config = MCPConfig(
    server_name="aparsoft-tts-production",
    enable_rate_limiting=True,
    rate_limit_calls=100
)

# Logging settings
logging_config = LoggingConfig(
    level="INFO",
    format="json",
    output="file"
)

# Combined configuration
config = Config(
    tts=tts_config,
    mcp=mcp_config,
    logging=logging_config
)

Environment Variables

Create .env file:

# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000

# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html

# Run specific test file
pytest tests/unit/test_engine.py

# Run only fast tests
pytest -m "not slow"

Development

Setup Development Environment

# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts

# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"

# Install pre-commit hooks
pre-commit install

Running CI Locally

The project includes GitHub Actions workflow for CI/CD:

  • Code quality checks (Black, Ruff, mypy)
  • Tests on multiple Python versions (3.10, 3.11, 3.12)
  • Docker build verification
  • Security scanning with Trivy

API Reference

TTSEngine

Initialization:

TTSEngine(config: TTSConfig | None = None)

Methods:

  • generate(text, output_path, voice, speed, enhance) - Generate speech
  • generate_stream(text, voice, speed) - Stream audio chunks
  • batch_generate(texts, output_dir, voice, speed) - Batch processing
  • process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
  • list_voices() - Get available voices

Configuration Classes

  • TTSConfig - TTS engine settings
  • MCPConfig - MCP server configuration
  • LoggingConfig - Logging configuration
  • Config - Main application configuration

Audio Utilities

  • enhance_audio(audio, ...) - Apply audio enhancement
  • combine_audio_segments(segments, ...) - Combine audio files
  • save_audio(audio, path, ...) - Save audio to file
  • load_audio(path, ...) - Load audio from file
  • chunk_audio(audio, ...) - Split audio into chunks
  • get_audio_duration(audio, ...) - Get audio duration

Examples

See the examples/ directory for complete examples:

  • basic_usage.py - Simple generation examples
  • youtube_workflow.py - Complete YouTube video production workflow

Troubleshooting

espeak-ng not found

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak

# Windows: Download from http://espeak.sourceforge.net/

Audio quality issues

Enable audio enhancement:

engine.generate(text="Your text", enhance=True)

Import errors

Ensure virtual environment is activated:

source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

Docker issues

Check container logs:

docker logs aparsoft-tts

Performance

Benchmarks (on typical consumer hardware):

  • Model Loading: ~2-3 seconds (one-time)
  • Generation Speed: ~0.5s per second of audio
  • Memory Usage: ~2GB RAM (model loaded)
  • Token Processing: Up to 510 tokens per pass

Text Length Limits:

Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:

  • Automatic chunking: Engine automatically splits long texts
  • Script processing: Handles unlimited length via intelligent segmentation
  • Batch processing: Each segment processed independently

Optimization Tips:

  1. Reuse engine instances (avoid reloading model)
  2. Disable enhancement for draft generations (enhance=False)
  3. Use streaming for long texts (automatic chunking)
  4. Batch process multiple files for efficiency
  5. Enable GPU acceleration on supported platforms
  6. For very long texts, use process_script() for optimal chunking

Credits & Acknowledgements

This project builds upon excellent open-source software:

Core Dependencies

  • Kokoro-82M by hexgrad - Apache License 2.0

    • Open-weight TTS model with 82M parameters
    • Processes up to 510 tokens per pass
    • Architectured by @yl4579 (StyleTTS 2)
    • 24kHz audio output, <100 hours training data
  • librosa - ISC License

    • Audio analysis and processing
  • FastMCP - MIT License

    • Model Context Protocol server framework

Additional Dependencies

  • soundfile - Audio I/O
  • pydantic - Configuration management
  • structlog - Structured logging
  • typer - CLI framework
  • pytest - Testing framework

Special Thanks

  • πŸ› οΈ @yl4579 for StyleTTS 2 architecture
  • πŸ† hexgrad team for Kokoro model and inference library
  • 🌐 Anthropic for Model Context Protocol
  • πŸ“Š All contributors to the open-source dependencies

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Third-Party Licenses:

  • Kokoro-82M: Apache License 2.0
  • librosa: ISC License
  • FastMCP: MIT License

Support


Citation

If you use this toolkit in your research or project, please cite:

@software{kokoro_youtube_tts,
  author = {Aparsoft},
  title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
  year = {2025},
  url = {https://github.com/aparsoft/kokoro-youtube-tts}
}

For the Kokoro model:

@software{kokoro_tts,
  author = {hexgrad},
  title = {Kokoro-82M: Open-weight TTS Model},
  year = {2024},
  url = {https://huggingface.co/hexgrad/Kokoro-82M}
}

Built with ❀️ for the video creator community

About

TTS toolkit built on Kokoro-82M with librosa audio enhancement, MCP server for Claude/Cursor, CLI & Python API. Free & open-source for YouTube creators.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published