Integrate Voice Agents into Agents SDK #542
Open: renandincer wants to merge 10 commits into cloudflare:main from itzmanish:RTK-6762
- bf94316 feat: add examples for realtime agents (itzmanish)
- ba441ba feat: add realtime capability on agents class (itzmanish)
- a399da3 chore: add example env (itzmanish)
- fd6494a chore: move example env (itzmanish)
- daead29 fix: use single websocket for bi-di communication (itzmanish)
- 0e41670 fix: lint (itzmanish)
- c087df6 chore: better websocket connection handling for realtime agents
- 6903879 chore: move out realtime codes into seperate class (itzmanish)
- d964ac9 Merge branch 'main' into pr/542 (threepointone)
- 2b99f3c attempt to fix a bunch of errors (threepointone)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| CF_ACCOUNT_ID= | ||
| CF_API_TOKEN= | ||
| DEEPGRAM_API_KEY= | ||
| ELEVENLABS_API_KEY= | ||
| RTK_MEETING_ID= | ||
| RTK_AUTH_TOKEN= |
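
A blank value in an env file like the one above tends to surface only as a failed provider call at runtime, so a small startup check can help. The following is a minimal sketch, not part of this PR: `missingEnvKeys` and `REQUIRED_KEYS` are hypothetical names, the key names are copied from the example env file, and the two `RTK_*` values are assumed to be optional, as the README in this PR describes them.

```typescript
// Assumption: key names mirror the example env file above; the RTK_* values
// are treated as optional and therefore not listed here.
const REQUIRED_KEYS = [
  "CF_ACCOUNT_ID",
  "CF_API_TOKEN",
  "DEEPGRAM_API_KEY",
  "ELEVENLABS_API_KEY",
] as const;

// Hypothetical helper (not part of the PR): returns the names of required
// keys that are absent or blank in the given environment object.
function missingEnvKeys(
  env: Record<string, string | undefined>,
  required: readonly string[] = REQUIRED_KEYS,
): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Example: one key set, one blank, two absent.
const missing = missingEnvKeys({ CF_ACCOUNT_ID: "abc", DEEPGRAM_API_KEY: " " });
console.log(missing);
```

Passing the key list as a parameter keeps the check usable if the variable names change, for instance if the unprefixed `ACCOUNT_ID` and `API_TOKEN` names are used instead.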
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,198 @@ | ||
| # Realtime Voice Assistant Agent | ||
|
|
||
| This example demonstrates how to build a complete voice assistant using Cloudflare's AI Agent framework with realtime capabilities. The assistant can: | ||
|
|
||
| - Listen to audio input via RealtimeKit | ||
| - Convert speech to text using Deepgram STT | ||
| - Process conversations with intelligent responses | ||
| - Convert responses back to speech using ElevenLabs TTS | ||
| - Stream audio output back to the client | ||
|
|
||
| ## Architecture | ||
|
|
||
| The voice assistant uses a pipeline architecture: | ||
|
|
||
| ``` | ||
| Audio Input → RealtimeKit → Deepgram STT → Agent Logic → ElevenLabs TTS → Audio Output | ||
| ``` | ||
|
|
||
| ## Setup | ||
|
|
||
| 1. **Environment Variables**: Configure the following in your `wrangler.toml` or environment: | ||
|
|
||
| ```toml | ||
| [vars] | ||
| ACCOUNT_ID = "your-cloudflare-account-id" | ||
| API_TOKEN = "your-cloudflare-api-token" | ||
| DEEPGRAM_API_KEY = "your-deepgram-api-key" | ||
| ELEVENLABS_API_KEY = "your-elevenlabs-api-key" | ||
| RTK_MEETING_ID = "your-realtimekit-meeting-id" # Optional | ||
| RTK_AUTH_TOKEN = "your-realtimekit-auth-token" # Optional | ||
| ``` | ||
|
|
||
| 2. **API Keys**: | ||
| - Get a Deepgram API key from [https://deepgram.com](https://deepgram.com) | ||
| - Get an ElevenLabs API key from [https://elevenlabs.io](https://elevenlabs.io) | ||
| - Get your Cloudflare Account ID and API token from the Cloudflare dashboard | ||
|
|
||
| 3. **Deploy**: | ||
|
|
||
| ```bash | ||
| npm run dev # For local development | ||
| wrangler deploy # For production deployment | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| Once deployed, the agent creates WebSocket connections for real-time voice interaction. | ||
|
|
||
| ### Basic Flow: | ||
|
|
||
| 1. Client connects to the agent WebSocket endpoint | ||
| 2. Agent initializes the realtime pipeline | ||
| 3. Client streams audio → Agent processes → Agent streams audio back | ||
| 4. Agent handles conversation logic in the `onRealtimeTranscript()` method | ||
|
|
||
| ### Customization: | ||
|
|
||
| - Modify the `onRealtimeTranscript()` method to add your own conversational AI logic | ||
| - Integrate with OpenAI, Anthropic, or other language models | ||
| - Add knowledge base queries, tool calling, or context management | ||
| - Customize voice settings in the `ElevenLabsTTS` configuration | ||
|
|
||
| ## Key Components | ||
|
|
||
| ### RealtimeVoiceAgent | ||
|
|
||
| - Extends the `Agent` class with realtime pipeline components | ||
| - Implements `onRealtimeTranscript()` for conversation handling | ||
| - Manages pipeline initialization and cleanup via `realtimePipelineComponents` | ||
|
|
||
| ### MyAgent (Durable Object) | ||
|
|
||
| - Manages agent lifecycle and WebSocket connections | ||
| - Handles client connect/disconnect events | ||
| - Implements alarm handling for maintenance tasks | ||
|
|
||
| ### Pipeline Components: | ||
|
|
||
| - **RealtimeKitTransport**: Audio input/output via RealtimeKit | ||
| - **DeepgramSTT**: Speech-to-text conversion | ||
| - **ElevenLabsTTS**: Text-to-speech synthesis | ||
|
|
||
| ## Pipeline Configuration | ||
|
|
||
| The agent uses a pipeline component system exposed through the `realtimePipelineComponents` property, which in this example points at `createRealtimePipeline()`: | ||
|
|
||
| ```typescript | ||
| createRealtimePipeline() { | ||
| const rtk = new RealtimeKitTransport( | ||
| this.env.RTK_MEETING_ID || "default-meeting", | ||
| this.env.RTK_AUTH_TOKEN || "default-token", | ||
| [{ | ||
| media_kind: "audio", | ||
| stream_kind: "microphone", | ||
| preset_name: "*" | ||
| }] | ||
| ); | ||
|
|
||
| const stt = new DeepgramSTT(this.env.DEEPGRAM_API_KEY); | ||
| const tts = new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY); | ||
|
|
||
| // Pipeline: Audio Input → STT → Agent → TTS → Audio Output | ||
| return [rtk, stt, this, tts, rtk]; | ||
| } | ||
| ``` | ||
|
|
||
| ### Pipeline Flow | ||
|
|
||
| 1. **Audio Input**: RealtimeKit captures microphone audio | ||
| 2. **Speech Recognition**: Deepgram converts audio to text | ||
| 3. **Agent Processing**: Your agent receives transcribed text via `onRealtimeTranscript()` | ||
| 4. **Response Generation**: Agent generates text response | ||
| 5. **Speech Synthesis**: ElevenLabs converts response to audio | ||
| 6. **Audio Output**: RealtimeKit streams audio back to client | ||
|
|
||
| ### Customizing the Pipeline | ||
|
|
||
| You can modify the pipeline components in `createRealtimePipeline()`: | ||
|
|
||
| ```typescript | ||
| // Different STT provider | ||
| const stt = new CustomSTT(this.env.CUSTOM_API_KEY); | ||
|
|
||
| // Multiple TTS voices | ||
| const tts1 = new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY, { voice_id: "voice1" }); | ||
| const tts2 = new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY, { voice_id: "voice2" }); | ||
|
|
||
| // Audio preprocessing | ||
| const processor = new AudioProcessor(); | ||
|
|
||
| return [rtk, processor, stt, this, tts1, rtk]; | ||
| ``` | ||
|
|
||
| ## Implementation Details | ||
|
|
||
| The `Agent` class implements the `RealtimePipelineComponent` interface, allowing it to be used directly in realtime pipelines: | ||
|
|
||
| ```typescript | ||
| class RealtimeVoiceAgent extends Agent<Env> { | ||
| realtimePipelineComponents = this.createRealtimePipeline; | ||
|
|
||
| createRealtimePipeline() { | ||
| const rtk = new RealtimeKitTransport(...); | ||
| const stt = new DeepgramSTT(...); | ||
| const tts = new ElevenLabsTTS(...); | ||
|
|
||
| // Use 'this' to include the agent in the pipeline | ||
| return [rtk, stt, this, tts, rtk]; | ||
| } | ||
|
|
||
| // This method receives transcribed text | ||
| onRealtimeTranscript(text: string, reply: (response: string) => void) { | ||
| // Your conversation logic here | ||
| const response = processConversation(text); | ||
| reply(response); | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| **Key Features:** | ||
|
|
||
| - ✅ **Direct agent integration** - Use `this` to include your agent in the pipeline | ||
| - ✅ **Type safety** - Full TypeScript support for pipeline components | ||
| - ✅ **Flexible positioning** - Place the agent anywhere in the processing flow | ||
| - ✅ **Clean separation** - Clear distinction between pipeline setup and conversation logic | ||
|
|
||
| ## Examples | ||
|
|
||
| The current implementation includes basic conversational responses like: | ||
|
|
||
| - Greetings and farewells | ||
| - Time and date queries | ||
| - Simple jokes | ||
| - Help information | ||
|
|
||
| You can extend this by integrating with: | ||
|
|
||
| - OpenAI GPT models for advanced conversations | ||
| - Knowledge bases for domain-specific responses | ||
| - Weather APIs, calendars, or other external services | ||
| - Custom business logic and workflows | ||
|
|
||
| ## Development | ||
|
|
||
| Run locally: | ||
|
|
||
| ```bash | ||
| npm run dev | ||
| ``` | ||
|
|
||
| The agent will be available at the WebSocket endpoint provided by Wrangler. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| - Ensure all API keys are properly configured | ||
| - Check Cloudflare account ID and API token permissions | ||
| - Verify RealtimeKit meeting configuration | ||
| - Monitor logs for pipeline initialization errors | ||
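
The README calls `processConversation(text)` inside `onRealtimeTranscript()` without defining it, and lists greetings, farewells, time and date queries, and help as the built-in responses. A minimal sketch of what such a function might look like follows. It is an assumption for illustration, not the implementation in this PR: the matching rules, the reply strings, and the optional `now` parameter (added so the function can be exercised deterministically) are all hypothetical.

```typescript
// Hypothetical rule-based responder; not the PR's actual conversation logic.
// `now` is injectable so time and date replies can be checked deterministically.
function processConversation(text: string, now: Date = new Date()): string {
  const input = text.trim().toLowerCase();

  if (input === "") {
    return "Sorry, I didn't catch that.";
  }
  // Whole-word matches, so "this" does not count as "hi".
  if (/\b(hello|hi|hey)\b/.test(input)) {
    return "Hello! How can I help you?";
  }
  if (/\b(bye|goodbye)\b/.test(input)) {
    return "Goodbye!";
  }
  if (/\btime\b/.test(input)) {
    // toISOString() is always UTC: "YYYY-MM-DDTHH:mm:ss.sssZ".
    return `The time is ${now.toISOString().slice(11, 16)} UTC.`;
  }
  if (/\bdate\b/.test(input)) {
    return `Today is ${now.toISOString().slice(0, 10)}.`;
  }
  if (/\bhelp\b/.test(input)) {
    return "You can greet me, ask for the time or date, or say goodbye.";
  }
  // Fallback: echo the transcript back.
  return `You said: ${text.trim()}`;
}

// Usage, shaped like the README's handler: reply(processConversation(text)).
console.log(processConversation("Hello there"));
console.log(processConversation("what time is it", new Date("2024-01-02T03:04:05Z")));
```

Keeping the responder a pure function of the transcript makes it straightforward to replace later with a call to a language model while leaving the pipeline setup untouched.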
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| /* eslint-disable */ | ||
| // Generated by Wrangler by running `wrangler types env.d.ts --include-runtime false` (hash: 94d1687f592f0bb5cbcd056355820198) | ||
| declare namespace Cloudflare { | ||
| interface GlobalProps { | ||
| mainModule: typeof import("./src/index"); | ||
| durableNamespaces: "RealtimeVoiceAgent"; | ||
| } | ||
| interface Env { | ||
| ACCOUNT_ID: ""; | ||
| API_TOKEN: ""; | ||
| REALTIME_VOICE_AGENT: DurableObjectNamespace< | ||
| import("./src/index").RealtimeVoiceAgent | ||
| >; | ||
| } | ||
| } | ||
| interface Env extends Cloudflare.Env {} | ||
| type StringifyValues<EnvType extends Record<string, unknown>> = { | ||
| [Binding in keyof EnvType]: EnvType[Binding] extends string | ||
| ? EnvType[Binding] | ||
| : string; | ||
| }; | ||
| declare namespace NodeJS { | ||
| interface ProcessEnv | ||
| extends StringifyValues<Pick<Cloudflare.Env, "ACCOUNT_ID" | "API_TOKEN">> {} | ||
| } |
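
The `StringifyValues` mapped type in the generated file above keeps string-typed bindings as they are and maps every other binding to `string`, which matches how values appear in `process.env`. The sketch below repeats the type and applies it to a made-up `ExampleEnv`; the `PORT` binding and the sample values are assumptions for illustration only.

```typescript
// Same definition as in the generated env.d.ts above.
type StringifyValues<EnvType extends Record<string, unknown>> = {
  [Binding in keyof EnvType]: EnvType[Binding] extends string
    ? EnvType[Binding]
    : string;
};

// Hypothetical env shape: one string-literal binding, one non-string binding.
type ExampleEnv = {
  ACCOUNT_ID: "";
  PORT: number;
};

// Resolves to { ACCOUNT_ID: ""; PORT: string }.
type ExampleProcessEnv = StringifyValues<ExampleEnv>;

// PORT must now be a string; assigning the number 8787 would not compile.
const sampleEnv: ExampleProcessEnv = { ACCOUNT_ID: "", PORT: "8787" };
console.log(typeof sampleEnv.PORT);
```

Note that `ACCOUNT_ID: ""` is a string-literal type, so the generated declaration only admits the empty string for that binding until the types are regenerated with real values configured.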
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| { | ||
| "name": "@cloudflare/realtime-agents-example", | ||
| "author": "Manish", | ||
| "keywords": [], | ||
| "private": true, | ||
| "scripts": { | ||
| "dev": "wrangler dev", | ||
| "types": "wrangler types env.d.ts --include-runtime false" | ||
| }, | ||
| "type": "module" | ||
| } |
What is this use case?