A comprehensive demonstration project showcasing speech recognition and translation capabilities using the Groq API. This project features speaker diarization, real-time processing, and both CLI and web interfaces to illustrate potential customer use cases.
⚠️ Important Disclaimer: This is a demonstrative solution created to showcase various speech processing use cases with the Groq API. It is not production-grade and should be used as a reference implementation for learning and prototyping purposes. For production deployments, additional testing, security hardening, error handling, and performance optimization are required.
Demo video: Demonstrating the Groq Speech Demo Solution (mp4)
- Python 3.8+
- Node.js 22 (for web UI)
- Groq API key (get one from the Groq Console)
- Hugging Face token (required for diarization):
  - Get token: https://huggingface.co/settings/tokens
  - ⚠️ You MUST accept the model licenses first (see HuggingFace Models & License Requirements below)
Run the automated setup script to install all dependencies and configure environments for this demonstration:
# Clone the repository
git clone https://github.com/build-with-groq/groq-speech
cd groq-speech
# Run the setup script (installs everything)
./setup.sh

The setup script will:
- ✅ Create a Python virtual environment (`.venv/`)
- ✅ Install all Python dependencies (core library, API server, examples)
- ✅ Install Node.js dependencies for the web UI
- ✅ Create `.env.api` and `.env.ui` configuration files
After setup completes:
- Edit `.env.api` and add your API keys:
  GROQ_API_KEY=your_actual_groq_api_key_here
  HF_TOKEN=your_huggingface_token_here  # Get from: https://huggingface.co/settings/tokens (see HuggingFace section below for license requirements)
- Activate the virtual environment (required for Python commands):
  source .venv/bin/activate
If you prefer manual setup:
- Clone the repository:
git clone https://github.com/build-with-groq/groq-speech
cd groq-speech
- Create and activate Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- Install Python dependencies:
# Upgrade pip
pip install --upgrade pip
# Install the core package
pip install -e .
# Install all requirements
pip install -r requirements.txt
pip install -r groq_speech/requirements.txt
- Install Node.js dependencies for the web UI:
cd examples/groq-speech-ui
npm install
cd ../..
- Configure environment variables:
# Copy template files
cp .env.api.template .env.api
cp .env.ui.template .env.ui
# Edit .env.api with your API keys
# GROQ_API_KEY=your_actual_groq_api_key_here
# HF_TOKEN=your_huggingface_token_here # Get from: https://huggingface.co/settings/tokens (see HuggingFace section below for license requirements)
# .env.ui defaults should work for local development

Note: Make sure to activate the Python virtual environment before running any Python commands:
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Activate virtual environment first
source .venv/bin/activate
# File transcription
python examples/speech_demo.py --file audio.wav
# File transcription with diarization
python examples/speech_demo.py --file audio.wav --diarize
# Microphone single mode
python examples/speech_demo.py --microphone-mode single
# Microphone continuous mode with diarization
python examples/speech_demo.py --microphone-mode continuous --diarize
# Translation mode
python examples/speech_demo.py --file audio.wav --operation translation --diarize

# Terminal 1: Start API server
source .venv/bin/activate # Activate virtual environment
cd api && python server.py
# Terminal 2: Start frontend (in another terminal)
cd examples/groq-speech-ui
npm run dev
# Open http://localhost:3000 in your browser

# One-command startup (starts both backend and frontend)
./scripts/dev/run-dev.sh
# With verbose logging
./scripts/dev/run-dev.sh --verbose
# Clean up existing processes
./scripts/dev/run-dev.sh --clean
# Access the application at:
# - Frontend: https://localhost:3443 (HTTPS for microphone access)
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs

Features:
- ✅ Automatically starts both backend and frontend
- ✅ Checks and validates environment configuration
- ✅ Installs missing dependencies
- ✅ HTTPS support for microphone access (self-signed certificate)
- ✅ Verbose mode for detailed logging
- ✅ Auto-cleanup of existing processes
Note: Your browser will show a security warning for the self-signed certificate. Click "Advanced" and "Proceed to localhost" to continue.
# Quick deployment with helper script
./deployment/docker/deploy-local.sh
# Or manually with docker-compose
docker-compose -f deployment/docker/docker-compose.yml up --build
# Access the application at:
# - Frontend: https://localhost:3443 (HTTPS for microphone access)
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs

Docker Management:
# View logs
docker-compose -f deployment/docker/docker-compose.yml logs -f
# Stop services
docker-compose -f deployment/docker/docker-compose.yml down
# Restart services
docker-compose -f deployment/docker/docker-compose.yml restart
# View specific service logs
docker-compose -f deployment/docker/docker-compose.yml logs -f groq-speech-api
docker-compose -f deployment/docker/docker-compose.yml logs -f groq-speech-ui

Requirements for Docker:
- Create `deployment/docker/.env.api` and `deployment/docker/.env.ui` files
- The `deploy-local.sh` script will copy templates if they don't exist
- Make sure to set your actual `GROQ_API_KEY` and `HF_TOKEN` in `.env.api`
🚨 IMPORTANT: You MUST accept model licenses on HuggingFace BEFORE using diarization features, or you will encounter authentication errors.
This project uses the following HuggingFace models for speaker diarization:
pyannote/segmentation-3.0
- Purpose: Audio segmentation and speaker turn detection
- License: MIT License
pyannote/speaker-diarization-3.1
- Purpose: Complete speaker diarization pipeline
- License: MIT License
❌ What happens if you don't accept the licenses:
GatedRepoError: Access to model pyannote/speaker-diarization-3.1 is restricted.
You must be authenticated to access it and have accepted the model's terms and conditions.
Or you might see:
OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/pyannote/speaker-diarization-3.1
and pass a token having permission to this repo either by logging in with
`huggingface-cli login` or by passing `use_auth_token=<your_token>`.
Follow these steps BEFORE running diarization:
- Go to https://huggingface.co/join
- Create a free account (if you don't have one)
You must accept the license for each model individually:
For pyannote/segmentation-3.0:
- Visit: https://huggingface.co/pyannote/segmentation-3.0
- Scroll down to the model card
- Click "Agree and access repository" button
- You may need to fill out a form with:
- Your name
- Organization (can be "Individual" or "Personal")
- Country
- Agree to terms checkbox
For pyannote/speaker-diarization-3.1:
- Visit: https://huggingface.co/pyannote/speaker-diarization-3.1
- Scroll down to the model card
- Click "Agree and access repository" button
- Fill out the same form as above
Example of what you'll see:
┌─────────────────────────────────────────────────────────┐
│ Access pyannote/speaker-diarization-3.1 │
│ │
│ By clicking below, you agree to share your contact │
│ information (username and email) with the model authors.│
│ │
│ Name: [Your Name] │
│ Email: [[email protected]] │
│ Org: [Individual/Company] │
│ Country: [Your Country] │
│ │
│ ☐ I have read the License and agree to its terms │
│ │
│ [Agree and Access Repository] │
└─────────────────────────────────────────────────────────┘
- Go to https://huggingface.co/settings/tokens
- Click "New token"
- Name it (e.g., "groq-speech-diarization")
- Select "Read" permission (minimum required)
- Click "Generate token"
- Copy the token (you won't be able to see it again!)
Add the token to your .env.api file:
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Note: HuggingFace tokens start with `hf_`
To verify your licenses are accepted, run:
# Activate virtual environment
source .venv/bin/activate
# Test with a simple diarization
python examples/speech_demo.py --file examples/test_audio.wav --diarize

If licenses are properly accepted, you should see:
🎭 Running CORRECT diarization pipeline...
1. Pyannote.audio → Speaker detection
✅ Pipeline loaded and moved to cuda
✅ Detected X speaker segments
If licenses are NOT accepted, you'll see authentication errors as shown above.
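As an additional sanity check outside this project, you can exercise the gated pipeline directly with pyannote.audio. The snippet below is a minimal sketch (assuming pyannote.audio 3.x, a local audio.wav file, and HF_TOKEN exported in your environment), not the project's `speaker_diarization.py`:

```python
# Minimal sketch: confirm the HF token and accepted licenses by loading the gated pipeline.
# Assumes pyannote.audio 3.x is installed and HF_TOKEN is set in the environment.
import os

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # raises a gated-repo error if licenses are not accepted
)

# Move to GPU when available (mirrors the "Pipeline loaded and moved to cuda" log above).
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

diarization = pipeline("audio.wav")  # path to a local WAV file
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```

If this loads without errors, the project's diarization features should work with the same token.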
graph TB
subgraph "Layer 3: UI Client"
UI[groq-speech-ui/<br/>EnhancedSpeechDemo.tsx<br/>PerformanceMetrics.tsx]
end
subgraph "Layer 2b: API Client"
API[api/server.py<br/>FastAPI REST API only]
end
subgraph "Layer 2a: CLI Client"
CLI[speech_demo.py<br/>Command Line Interface]
end
subgraph "Layer 1: SDK"
SDK[groq_speech/<br/>speech_recognizer.py<br/>speaker_diarization.py<br/>vad_service.py<br/>audio_utils.py]
end
UI -->|HTTP REST| API
CLI -->|Direct Calls| SDK
API -->|Direct Calls| SDK
style UI fill:#1976D2,color:#ffffff
style API fill:#7B1FA2,color:#ffffff
style CLI fill:#7B1FA2,color:#ffffff
style SDK fill:#388E3C,color:#ffffff
- `speech_recognizer.py` - Main orchestrator, handles all speech processing
- `speech_config.py` - Configuration management with factory methods
- `speaker_diarization.py` - Speaker diarization using Pyannote.audio
- `vad_service.py` - Voice Activity Detection service
- `audio_utils.py` - Audio format utilities and conversion
- `exceptions.py` - Custom exception classes
- `result_reason.py` - Result status enums
- `server.py` - FastAPI server with REST endpoints only
- `models/` - Pydantic request/response models
- REST API - HTTP endpoints for all operations
- `EnhancedSpeechDemo.tsx` - Main UI component with all features
- `audio-recorder.ts` - Unified audio recording (standard + optimized)
- `continuous-audio-recorder.ts` - VAD-based continuous recording
- `client-vad-service.ts` - Client-side Voice Activity Detection
- `audio-converter.ts` - Unified audio conversion (standard + optimized)
- `groq-api.ts` - REST API client
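To make the layering concrete, here is a rough sketch of how a Layer 2 client (the CLI or the API server) consumes the SDK layer directly. The class and method names below are illustrative assumptions, not the actual `groq_speech` API; see the SDK Reference in `docs/` for the real interface:

```python
# Hypothetical direct-SDK usage (the Layer 2a path). All names below are
# illustrative assumptions, NOT the actual groq_speech API.
import numpy as np

from groq_speech.speech_config import SpeechConfig          # hypothetical import
from groq_speech.speech_recognizer import SpeechRecognizer  # hypothetical import

config = SpeechConfig()                  # hypothetical: reads GROQ_API_KEY / HF_TOKEN from .env.api
recognizer = SpeechRecognizer(config)    # hypothetical constructor

# One second of 16 kHz mono silence stands in for real audio here.
audio = np.zeros(16_000, dtype=np.float32)
result = recognizer.recognize(audio)     # hypothetical call: no HTTP hop, direct SDK processing
print(result)
```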
CLI (direct SDK): Audio Input → numpy array → SDK Processing → Console Output

Web UI (REST API): Audio Input → Frontend Processing → HTTP REST → API Server → SDK Processing → JSON Response → UI Display
- File Processing: Base64-encoded WAV → HTTP REST → base64 decode → numpy array (see the decode sketch after this list)
- Microphone Processing: Float32Array → HTTP REST → array conversion → numpy array
- VAD Processing: Client-side for real-time performance
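A minimal sketch of these two payload conversions, assuming 16 kHz mono audio as listed under Audio Specifications below; it illustrates the idea rather than reproducing the project's `audio_utils.py` (the use of soundfile here is also just one option):

```python
# Sketch of the two request-payload conversions (illustrative, not the actual server code).
import base64
import io
from typing import Sequence

import numpy as np
import soundfile as sf  # assumption: any WAV reader works; soundfile is one option


def decode_wav_payload(wav_b64: str) -> np.ndarray:
    """File path: base64-encoded WAV -> float32 numpy array (mono, 16 kHz expected)."""
    wav_bytes = base64.b64decode(wav_b64)
    audio, sample_rate = sf.read(io.BytesIO(wav_bytes), dtype="float32")
    assert sample_rate == 16_000, "expected 16 kHz audio"
    return audio


def decode_microphone_payload(samples: Sequence[float]) -> np.ndarray:
    """Microphone path: JSON array of Float32Array samples -> float32 numpy array."""
    return np.asarray(samples, dtype=np.float32)
```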
- ✅ File-based transcription
- ✅ Microphone single mode
- ✅ Microphone continuous mode with VAD
- ✅ Real-time audio level visualization
- ✅ Silence detection and chunking
- ✅ File-based translation
- ✅ Microphone translation
- ✅ Multi-language support
- ✅ Target language configuration
- ✅ Pyannote.audio integration
- ✅ GPU acceleration support
- ✅ Multi-speaker detection
- ✅ Speaker-specific segments
- ✅ Client-side real-time processing
- ✅ 15-second silence detection
- ✅ Audio level visualization
- ✅ Automatic chunk creation
- ✅ Unified audio recorders (standard + optimized)
- ✅ Unified audio converters (standard + optimized)
- ✅ Client-side VAD for real-time processing
- ✅ Chunked processing for large files
- ✅ Memory-efficient operations
- `POST /api/v1/recognize` - File transcription
- `POST /api/v1/translate` - File translation
- `POST /api/v1/recognize-microphone` - Single microphone processing
- `POST /api/v1/recognize-microphone-continuous` - Continuous microphone processing
- `GET /health` - Health check
- `GET /api/v1/models` - Available models
- `GET /api/v1/languages` - Supported languages
- `POST /api/log` - Frontend logging
- `POST /api/v1/vad/should-create-chunk` - VAD chunk detection
- `POST /api/v1/vad/audio-level` - Audio level analysis
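A minimal client-side sketch of calling the file-transcription endpoint from Python. The JSON field names (`audio_data`, `enable_diarization`) are assumptions for illustration; check the interactive API docs at http://localhost:8000/docs for the actual request schema:

```python
# Sketch: call POST /api/v1/recognize with a base64-encoded WAV file.
# Field names are hypothetical; see http://localhost:8000/docs for the real schema.
import base64

import requests

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "http://localhost:8000/api/v1/recognize",
    json={"audio_data": audio_b64, "enable_diarization": True},  # hypothetical field names
    timeout=120,
)
response.raise_for_status()
print(response.json())
```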
# Standard deployment
docker-compose -f deployment/docker/docker-compose.yml up
# GPU-enabled deployment
docker-compose -f deployment/docker/docker-compose.gpu.yml up
# Development with hot reload
docker-compose -f deployment/docker/docker-compose.dev.yml up

# Deploy to Cloud Run with GPU support
cd deployment/gcp
./deploy.sh

- Direct SDK access - No network overhead
- Real-time VAD - Local processing
- Memory efficient - Direct numpy array handling
- Client-side VAD - Real-time silence detection
- Unified components - Optimized for both short and long audio
- Chunked processing - Handles large files efficiently
- REST API - Scalable and maintainable
The project uses two separate environment files for better isolation:
Used by: SDK, speech_demo.py, API server
# Required: Groq API Key
GROQ_API_KEY=your_groq_api_key_here
# Required for speaker diarization (Get from: https://huggingface.co/settings/tokens - see HuggingFace Models section for license requirements)
HF_TOKEN=your_huggingface_token_here
# Optional: API Configuration
GROQ_API_BASE=https://api.groq.com/openai/v1
GROQ_MODEL_ID=whisper-large-v3
GROQ_TEMPERATURE=0.0
GROQ_RESPONSE_FORMAT=verbose_json
# Optional: Server Settings
API_HOST=0.0.0.0
API_PORT=8000
API_WORKERS=1
# Optional: GPU Configuration
CUDA_VISIBLE_DEVICES=0

Used by: groq-speech-ui web application
# Required: API Connection
NEXT_PUBLIC_API_URL=http://localhost:8000
# Optional: UI Settings
NEXT_PUBLIC_FRONTEND_URL=http://localhost:3000
NEXT_PUBLIC_VERBOSE=false
NEXT_PUBLIC_DEBUG=false

Getting API Keys:
- GROQ_API_KEY: Get from Groq Console
- HF_TOKEN: Get from HuggingFace Tokens - See HuggingFace Models & License Requirements section above for complete setup instructions
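For context, the GROQ_MODEL_ID, GROQ_TEMPERATURE, and GROQ_RESPONSE_FORMAT settings above correspond to a Groq transcription request. The sketch below shows such a call made directly with the official `groq` Python SDK against a local `audio.wav`; it is illustrative only, not the project's internal code path, which wraps this for you:

```python
# Sketch: a direct Groq transcription call using the settings from .env.api.
# For context only; the groq_speech SDK handles calls like this internally.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

with open("audio.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=("audio.wav", audio_file),
        model=os.environ.get("GROQ_MODEL_ID", "whisper-large-v3"),
        temperature=float(os.environ.get("GROQ_TEMPERATURE", "0.0")),
        response_format=os.environ.get("GROQ_RESPONSE_FORMAT", "verbose_json"),
    )

print(transcription.text)
```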
- Sample Rate: 16kHz (standard)
- Channels: Mono (1 channel)
- Format: Float32Array (microphone), WAV (files)
- VAD Threshold: 0.003 RMS (conservative detection)
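A minimal sketch of the RMS-based silence check these numbers imply (thresholding a chunk's RMS level against 0.003), shown as an illustration of the idea rather than the project's actual VAD implementation:

```python
# Sketch: RMS-based silence detection against the 0.003 threshold listed above.
# Illustrative only; the real VAD lives in vad_service.py / client-vad-service.ts.
import numpy as np

VAD_THRESHOLD = 0.003  # conservative RMS threshold from the audio specifications above


def audio_level(chunk: np.ndarray) -> float:
    """RMS level of a float32 mono chunk with values in [-1.0, 1.0]."""
    return float(np.sqrt(np.mean(np.square(chunk))))


def is_silence(chunk: np.ndarray) -> bool:
    return audio_level(chunk) < VAD_THRESHOLD
```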
All documentation is now organized in docs/ for better maintainability.
- Quick Start Guide - Get up and running in minutes
- Environment Setup - Detailed environment configuration
- Scripts Reference - Complete guide to all scripts
- SDK Reference - Complete SDK API documentation
- Architecture Guide - System architecture and design
- Deployment Guide - Docker, Cloud Run, and GKE deployment
- Contributing Guide - Development guidelines
- Debugging Guide - Troubleshooting and debugging
- Changelog - Version history
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the documentation
- Review existing issues
- Create a new issue with detailed information
Built with ❤️ by:
- Sreenivas Manyam Rajaram - LinkedIn
Technologies:
- Groq - Lightning-fast AI inference
- Pyannote.audio - Speaker diarization
- Modern web technologies (Next.js, React, TypeScript)
Built with ❤️ using Groq, Pyannote.audio, and modern web technologies.