Add a WebRTC Audio Streaming Example in Python #882

Open · wants to merge 4 commits into main
2 changes: 2 additions & 0 deletions fern/docs.yml

@@ -76,6 +76,8 @@ navigation:
            path: docs/pages/cookbooks/legacy/text-to-speech/streaming.mdx
          - page: WebSockets
            path: docs/pages/cookbooks/legacy/text-to-speech/websockets.mdx
+         - page: WebRTC
+           path: docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx
          - page: Request stitching
            path: docs/pages/cookbooks/legacy/text-to-speech/request-stitching.mdx
          - page: Pronunciation dictionaries
125 changes: 125 additions & 0 deletions fern/docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx
@@ -0,0 +1,125 @@
---
title: Real-time audio streaming with WebRTC
subtitle: Learn how to convert text to speech via a WebRTC connection.
---

## Introduction

WebRTC is a technology that enables real-time communication between web browsers and servers. It allows for low-latency, high-quality audio and video, and is supported by most modern browsers.

In this guide, we'll build a simple WebRTC application entirely in Python.
Our application will continuously listen for incoming audio from a microphone and then repeat the audio back to the user in a different voice.
We'll use the `fastrtc` library to handle the WebRTC connection and the `elevenlabs` library to handle the speech-to-text and text-to-speech conversion.

This is a preview of what we'll build:

<video
controls
className="w-full"
src="https://github.com/user-attachments/assets/149bcda7-7381-4a15-bc63-e7c244e61f75"
></video>

## Setup

Install the required packages: `python-dotenv` to manage environment variables, `fastrtc` to handle the WebRTC connection, and the `elevenlabs` SDK for speech-to-text and text-to-speech:

```bash
pip install python-dotenv
pip install "fastrtc[vad]"
pip install elevenlabs
```

Next, create a `.env` file in your project directory and add your API key:

```bash .env
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
```

Create a new file named `webrtc-streaming.py` for our code.

## Initialize the client

First, let's initialize the ElevenLabs client with the API key from the `.env` file:

```python
import os
from dotenv import load_dotenv
from elevenlabs import ElevenLabs

# Read ELEVENLABS_API_KEY from the .env file into the environment.
load_dotenv()

elevenlabs_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
```
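If the key is missing or misnamed in `.env`, `os.getenv` silently returns `None` and the failure only surfaces on the first API call. A small guard makes the problem obvious at startup (a minimal sketch; the error message is illustrative):

```python
api_key = os.getenv("ELEVENLABS_API_KEY")
if not api_key:
    raise RuntimeError("ELEVENLABS_API_KEY is not set; check your .env file")

elevenlabs_client = ElevenLabs(api_key=api_key)
```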

## Define the `echo` function

The `echo` function takes the user's audio and the desired voice ID as input. It transcribes the incoming audio to text and then converts that text back to speech with the ElevenLabs client, streaming the resulting audio back in chunks.

```python
import numpy as np
from numpy.typing import NDArray
from fastrtc import audio_to_bytes

def echo(audio: tuple[int, NDArray[np.int16]], voice_id: str):
    # Transcribe the incoming microphone audio with the Scribe model.
    transcription = elevenlabs_client.speech_to_text.convert(
        file=audio_to_bytes(audio),
        model_id="scribe_v1",
        tag_audio_events=True,
        language_code="eng",
    )
    # Convert the transcript back to speech in the selected voice, streaming
    # raw 16-bit PCM at 24 kHz.
    for chunk in elevenlabs_client.text_to_speech.convert_as_stream(
        text=transcription.text,  # type: ignore
        voice_id=voice_id,
        model_id="eleven_multilingual_v2",
        output_format="pcm_24000",
    ):
        # fastrtc expects (sample_rate, ndarray) with shape (channels, samples).
        audio_array = np.frombuffer(chunk, dtype=np.int16).reshape(1, -1)
        yield (24000, audio_array)
```
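Because `echo` is a plain Python generator, you can sanity-check it without a browser by feeding it a prerecorded clip. This is a rough sketch, not part of the final app: it assumes a 16-bit mono file named `sample.wav` and uses `scipy` (an extra dependency) to read it:

```python
from scipy.io import wavfile

# Read a short 16-bit mono recording and present it the way fastrtc would:
# a (sample_rate, samples) tuple with samples shaped (channels, num_samples).
rate, samples = wavfile.read("sample.wav")
test_audio = (rate, samples.astype(np.int16).reshape(1, -1))

for out_rate, chunk in echo(test_audio, voice_id="Xb7hH8MSUJpSbSDYk0k2"):
    print(out_rate, chunk.shape)
```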

## Define the FastRTC application

Now we'll create a FastRTC `Stream` object to turn our `echo` function into a WebRTC stream. We'll wrap the function with `ReplyOnPause` to handle turn-taking and voice activity detection, and add a dropdown so users can select different voices:

```python
import gradio as gr
from fastrtc import ReplyOnPause, Stream

stream = Stream(
    # ReplyOnPause runs voice activity detection and calls echo() once the
    # user stops speaking.
    ReplyOnPause(echo),
    modality="audio",
    mode="send-receive",
    additional_inputs=[
        # The selected voice ID is passed to echo() as its second argument.
        gr.Dropdown(
            value="Xb7hH8MSUJpSbSDYk0k2",
            choices=[
                ("Alice", "Xb7hH8MSUJpSbSDYk0k2"),
                ("Aria", "9BWtsMINqrJLrRacOk9x"),
                ("Bill", "pqHfZKP75CvOlQylNhV4"),
                ("Brian", "nPczCjzI2devNBz1zQrb"),
            ],
        )
    ],
    ui_args={
        "title": "Echo Audio with ElevenLabs",
        "subtitle": "Choose a voice and speak naturally. The model will echo it back in a different voice.",
    },
)

stream.ui.launch()
```
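`stream.ui` is a regular Gradio interface, so the usual `launch()` options apply. For example, to make the demo reachable from other devices on your network rather than just localhost (standard Gradio arguments; pick a host and port that suit your setup):

```python
# Bind to all network interfaces instead of localhost only.
stream.ui.launch(server_name="0.0.0.0", server_port=7860)
```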

## Run the application

```bash
python webrtc-streaming.py
```
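Running the script starts the Gradio UI and prints a local URL (by default something like `http://127.0.0.1:7860`). Open it in your browser, grant microphone access, pick a voice from the dropdown, and start speaking; after you pause, your words are echoed back in the selected voice.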

You can see the full code [here](https://gist.github.com/freddyaboulton/2a50928337b177205264112531d7552c).

## Conclusion

You've now implemented a WebRTC streaming application in just 50 lines of Python! This example demonstrates how to create a real-time audio processing pipeline that leverages ElevenLabs' speech-to-text and text-to-speech capabilities.

For more information on customizing your WebRTC application, check out the [fastrtc documentation](https://fastrtc.org).