📖 Documentation · 🚀 Quick Start · 🏗️ Architecture · 🎨 Community
TL;DR: Build real-time, multimodal, omnichannel agents on Azure in minutes, not months. Our approach is code-first, modular, ops-friendly, and extensible.

You own the agentic design; this repo handles the end-to-end voice plumbing. We keep a clean separation of concerns—telephony (ACS), app middleware, AI inference loop (STT → LLM → TTS), and orchestration—so you can swap parts without starting from zero.

Shipping voice agents is more than "voice-to-voice." You need predictable latency budgets, media handoffs, error paths, channel fan-out, barge-in, noise cancellation, and more. This framework gives you the end-to-end working spine so you can focus on what differentiates you—your tools, agentic design, and orchestration logic (multi-agent ready).
📺 Full Overview 🎬 Demo Walkthrough
💡 What you get
- **Omnichannel, including first-class telephony.** Azure Communication Services (ACS) integration for PSTN, SIP transfer, IVR/DTMF routing, and number provisioning—extendable for contact centers and custom IVR trees.
- **Transport that scales.** FastAPI + WebSockets for true bidirectional streaming; runs locally and scales out in Kubernetes. Leverages ACS bidirectional media streaming for low-latency ingest/playback (barge-in ready), with helper classes to wire your UI WebSocket client or loop back into ACS—the plumbing is done for you.
- **Model freedom.** Use GPT-family or your provider of choice behind a slim adapter; swap models without touching the transport.
- **Clear seams for customization.** Replace code, switch STT/TTS providers, add tool routers, or inject domain policies—without tearing down the whole app.
- **Build from scratch (maximum control).** Use our AI inference layer and patterns to wire STT → LLM → TTS with your preferred Azure services and assessments. Own the event loop, intercept any step, and tailor latency/quality trade-offs for your use case. Ideal for on‑prem/hybrid, strict compliance, or deep customization.
- **Managed path (ship fast, enterprise‑ready).** Leverage the latest addition to the Azure AI family—Azure Voice Live API (preview)—for voice-to-voice media, and connect to Azure AI Foundry Agents for built-in tool/function calling. Keep your hooks; let Azure AI Foundry handle the media layer, scaling, noise suppression, and barge-in.
- **Bring your own voice‑to‑voice model.** Drop in your model behind a slim adapter (e.g., the latest gpt‑realtime or equivalent). Transport/orchestration (including ACS telephony) stays the same—no app changes.
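To make the "build from scratch" seam concrete: owning the event loop means each stage is just a callable you can swap or intercept. A minimal sketch with stub stages standing in for the Azure services (all names here are illustrative, not the repo's API):

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stage signatures: the repo wires Azure Speech, an LLM, and
# Azure TTS behind similar seams; these aliases only show the shape.
STT = Callable[[bytes], Awaitable[str]]
LLM = Callable[[str], Awaitable[str]]
TTS = Callable[[str], Awaitable[bytes]]

async def cascade_turn(audio_in: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """One STT -> LLM -> TTS turn; intercept any step by swapping a callable."""
    transcript = await stt(audio_in)   # speech-to-text
    reply = await llm(transcript)      # agent/LLM response
    return await tts(reply)            # text-to-speech audio out

# Stub providers for a local smoke test
async def fake_stt(audio: bytes) -> str: return audio.decode()
async def fake_llm(text: str) -> str: return f"echo: {text}"
async def fake_tts(text: str) -> bytes: return text.encode()

audio_out = asyncio.run(cascade_turn(b"hello", fake_stt, fake_llm, fake_tts))
print(audio_out)  # b'echo: hello'
```

Because each stage is injected, you can wrap any callable with logging, latency timing, or a policy filter without touching the transport.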
The question of the century: Is it production-ready?
“Production” means different things, but our intent is clear: this is an accelerator—it gets you ~80% of the way with battle-tested plumbing. You bring the last mile: hardening, infrastructure policies, security posture, SRE/DevOps, and your enterprise release process.
We ship the scaffolding to make that last mile fast: structured logging, metrics/tracing hooks, and a load-testing harness so you can profile end-to-end latency and concurrency, then tune or harden as needed to reach your target volume.
Two orchestration modes—same agent framework, different audio paths:
| Mode | Path | Latency | Best For |
|---|---|---|---|
| SpeechCascade | Azure Speech STT → LLM → TTS | ~400ms | Custom VAD, phrase lists, Azure voices |
| VoiceLive | Azure VoiceLive SDK (gpt-4o-realtime) | ~200ms | Fastest setup, lowest latency |
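Which orchestrator runs is decided by the `ACS_STREAMING_MODE` environment variable. A minimal dispatch sketch (the class names here are illustrative stand-ins, not the repo's actual names):

```python
import os

# Hypothetical dispatch table keyed by ACS_STREAMING_MODE values
ORCHESTRATORS = {
    "MEDIA": "SpeechCascadeOrchestrator",   # STT -> LLM -> TTS
    "VOICE_LIVE": "VoiceLiveOrchestrator",  # voice-to-voice
}

def select_orchestrator() -> str:
    mode = os.environ.get("ACS_STREAMING_MODE", "MEDIA").upper()
    try:
        return ORCHESTRATORS[mode]
    except KeyError:
        raise ValueError(f"Unknown ACS_STREAMING_MODE: {mode!r}") from None

os.environ["ACS_STREAMING_MODE"] = "VOICE_LIVE"
print(select_orchestrator())  # VoiceLiveOrchestrator
```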
```bash
# Select mode via environment variable
export ACS_STREAMING_MODE=MEDIA       # SpeechCascade (default)
export ACS_STREAMING_MODE=VOICE_LIVE  # VoiceLive
```

🔧 SpeechCascade — Full Control
You own each step: STT → LLM → TTS with granular hooks.
| Feature | Description |
|---|---|
| Custom VAD | Control silence detection, barge-in thresholds |
| Azure Speech Voices | Full neural TTS catalog, styles, prosody |
| Phrase Lists | Boost domain-specific recognition |
| Sentence Streaming | Natural pacing with per-sentence TTS |
Best for: On-prem/hybrid, compliance requirements, deep customization.
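Sentence streaming is the trick behind natural pacing in the cascade: instead of waiting for the full LLM reply, complete sentences are flushed to TTS as soon as they close. A self-contained sketch of the idea (the repo's actual logic is more elaborate; this helper is hypothetical):

```python
import re
from typing import Iterator

def sentences(token_stream: Iterator[str]) -> Iterator[str]:
    """Yield complete sentences from an LLM token stream so each one
    can be handed to TTS immediately, instead of waiting for the
    whole reply."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush on sentence-final punctuation followed by whitespace
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush the trailing sentence

tokens = ["Hello", " there.", " How can", " I help", " you today?"]
print(list(sentences(tokens)))  # ['Hello there.', 'How can I help you today?']
```

Per-sentence flushing trades a small amount of prosody continuity for a large cut in time-to-first-audio, which is usually the right call for conversational agents.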
⚡ VoiceLive — Ship Fast
> [!NOTE]
> Uses the Azure VoiceLive SDK with gpt-realtime in the backend.
Managed voice-to-voice: Azure-hosted GPT-4o Realtime handles audio in one hop.
| Feature | Description |
|---|---|
| ~200ms latency | Direct audio streaming, no separate STT/TTS |
| Server-side VAD | Automatic turn detection, noise reduction |
| Native tools | Built-in function calling via Realtime API |
| Azure Neural Voices | HD voices like `en-US-Ava:DragonHDLatestNeural` |
Best for: Speed to production, lowest latency requirements.
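For the "native tools" row above, function calling is driven by JSON-schema tool definitions. The sketch below shows the general shape used by Realtime-style APIs; the exact field names the VoiceLive SDK or Foundry Agents expect may differ, and the tool itself is hypothetical:

```python
import json

# Illustrative JSON-schema tool definition (hypothetical get_weather tool);
# verify field names against the SDK you actually use.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

print(json.dumps(weather_tool, indent=2))
```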
| Requirement | Quick Check |
|---|---|
| Azure CLI | `az --version` |
| Azure Developer CLI | `azd version` |
| Docker | `docker --version` |
| Azure Subscription | `az account show` |
| Contributor Access | Required for resource creation |
```bash
# 1. Clone the repository
git clone https://github.com/Azure-Samples/art-voice-agent-accelerator.git
cd art-voice-agent-accelerator

# 2. Log in to Azure
azd auth login

# 3. Deploy everything
azd up  # ~15 min for complete infra and code deployment
```

> [!NOTE]
> If you encounter any issues, please refer to TROUBLESHOOTING.md.
Done! Your voice agent is running. Open the frontend URL shown in the output.
📁 apps/artagent/ # Main application
├── 🔧 backend/ # FastAPI + WebSockets voice pipeline
│ ├── registries/ # Agent & scenario definitions
│ │ ├── agentstore/ # YAML agent configs + Jinja2 prompts
│ │ ├── scenariostore/ # Multi-agent orchestration flows
│ │ └── toolstore/ # Pluggable business tools
│ └── voice/ # Orchestrators (SpeechCascade, VoiceLive)
└── 🌐 frontend/ # Vite + React demo client
📁 src/ # Core libraries (ACS, Speech, AOAI, Redis, Cosmos, VAD)
📁 samples/ # Tutorials (hello_world, voice_live_sdk, labs)
📁 infra/ # Infrastructure as Code (Terraform + Bicep)
📁 docs/ # Guides and references
📁 tests/ # Pytest suite and load testing
📁 utils/ # Logging/telemetry helpers
- Start here: Getting started
- Deploy in ~15 minutes: Quick start
- Run locally: Local development
- Setup: Prerequisites
- Try the UI: Demo guide
- Production guidance: Deployment guide
- Understand the system: Architecture
- IaC details (repo): infra/README.md
ARTist = Artist + ART (Azure Real-Time Voice Agent Framework)
Join the community of practitioners building real-time voice AI agents! The ARTist Certification Program recognizes builders at three levels:
- Level 1: Apprentice — Run the UI, demonstrate the framework, and understand the architecture
- Level 2: Creator — Build custom agents with YAML config and tool integrations
- Level 3: Maestro — Lead production deployments, optimize performance, and mentor others
Earn your badge, join the Hall of Fame, and connect with fellow ARTists!
👉 Learn about ARTist Certification →
PRs & issues welcome—see CONTRIBUTING.md before pushing.
Released under MIT. This sample is not an official Microsoft product—validate compliance (HIPAA, PCI, GDPR, etc.) before production use.
> [!IMPORTANT]
> This software is provided for demonstration purposes only. It is not intended to be relied upon for any production workload. The creators of this software make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability of the software or related content. Any reliance placed on such information is strictly at your own risk.