diff --git a/fern/docs.yml b/fern/docs.yml
index 196de1eb..30d4ed17 100644
--- a/fern/docs.yml
+++ b/fern/docs.yml
@@ -76,6 +76,8 @@ navigation:
               path: docs/pages/cookbooks/legacy/text-to-speech/streaming.mdx
             - page: WebSockets
               path: docs/pages/cookbooks/legacy/text-to-speech/websockets.mdx
+            - page: WebRTC
+              path: docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx
             - page: Request stitching
               path: docs/pages/cookbooks/legacy/text-to-speech/request-stitching.mdx
             - page: Pronunciation dictionaries
diff --git a/fern/docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx b/fern/docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx
new file mode 100644
index 00000000..df0507f1
--- /dev/null
+++ b/fern/docs/pages/cookbooks/legacy/text-to-speech/webrtc.mdx
@@ -0,0 +1,125 @@
+---
+title: Real-time audio streaming with WebRTC
+subtitle: Learn how to convert text to speech via a WebRTC connection.
+---
+
+## Introduction
+
+WebRTC is a technology that enables real-time communication between web browsers and servers. It allows for low-latency, high-quality audio and video communication, and is supported by most modern browsers.
+
+In this guide, we'll build a simple WebRTC application entirely in Python. Our application will continuously listen for incoming audio from a microphone and then repeat the audio back to the user in a different voice. We'll use the `fastrtc` library to handle the WebRTC connection and the `elevenlabs` library to handle the speech-to-text and text-to-speech conversion.
+
+This is a preview of what we'll build:
+
+## Setup
+
+Install the required packages for managing environment variables, handling the WebRTC connection, and calling the ElevenLabs API:
+
+```bash
+pip install python-dotenv
+pip install "fastrtc[vad]"
+pip install elevenlabs
+```
+
+Next, create a `.env` file in your project directory and add your API key:
+
+```bash .env
+ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
+```
+
+Create a new file named `webrtc-streaming.py` for our code.
+
+## Initialize the client
+
+First, let's initialize the ElevenLabs client with the API key from the `.env` file:
+
+```python
+import os
+from dotenv import load_dotenv
+from elevenlabs import ElevenLabs
+
+# Read ELEVENLABS_API_KEY from the .env file
+load_dotenv()
+
+elevenlabs_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
+```
+
+## Define the echo function
+
+The `echo` function takes the user's audio and the desired voice ID as input. It transcribes the incoming audio with the ElevenLabs Scribe model, then streams the transcription back as speech in the selected voice.
+
+```python
+import numpy as np
+from numpy.typing import NDArray
+from fastrtc import audio_to_bytes
+
+def echo(audio: tuple[int, NDArray[np.int16]], voice_id: str):
+    # Transcribe the incoming audio with Scribe
+    transcription = elevenlabs_client.speech_to_text.convert(
+        file=audio_to_bytes(audio),
+        model_id="scribe_v1",
+        tag_audio_events=True,
+        language_code="eng",
+    )
+    # Stream the transcription back as 24 kHz PCM audio in the selected voice
+    for chunk in elevenlabs_client.text_to_speech.convert_as_stream(
+        text=transcription.text,  # type: ignore
+        voice_id=voice_id,
+        model_id="eleven_multilingual_v2",
+        output_format="pcm_24000",
+    ):
+        audio_array = np.frombuffer(chunk, dtype=np.int16).reshape(1, -1)
+        yield (24000, audio_array)
+```
+
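+If the voice activity detector triggers on silence or background noise, Scribe can return an empty transcription. As a small defensive tweak that isn't part of the original snippet, you could skip the text-to-speech call in that case. A minimal sketch, reusing the client and imports defined above (the `echo_with_guard` name is hypothetical):
+
+```python
+def echo_with_guard(audio: tuple[int, NDArray[np.int16]], voice_id: str):
+    transcription = elevenlabs_client.speech_to_text.convert(
+        file=audio_to_bytes(audio),
+        model_id="scribe_v1",
+        tag_audio_events=True,
+        language_code="eng",
+    )
+    text = (transcription.text or "").strip()
+    if not text:
+        return  # nothing intelligible was transcribed, so skip synthesis
+    for chunk in elevenlabs_client.text_to_speech.convert_as_stream(
+        text=text,
+        voice_id=voice_id,
+        model_id="eleven_multilingual_v2",
+        output_format="pcm_24000",
+    ):
+        yield (24000, np.frombuffer(chunk, dtype=np.int16).reshape(1, -1))
+```
+
+If you use this variant, pass it to `ReplyOnPause` in the next step instead of `echo`.
+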
+## Define the FastRTC application
+
+Now we'll create a FastRTC `Stream` object to turn our echo function into a WebRTC stream. We'll wrap the `echo` function with `ReplyOnPause` to handle turn-taking and voice activity detection. We'll also add a dropdown to let users select different voices:
+
+```python
+import gradio as gr
+from fastrtc import ReplyOnPause, Stream
+
+stream = Stream(
+    # ReplyOnPause waits for the user to stop speaking before calling echo
+    ReplyOnPause(echo),
+    modality="audio",
+    mode="send-receive",
+    additional_inputs=[
+        # The selected voice ID is passed to echo as its second argument
+        gr.Dropdown(
+            value="Xb7hH8MSUJpSbSDYk0k2",
+            choices=[
+                ("Alice", "Xb7hH8MSUJpSbSDYk0k2"),
+                ("Aria", "9BWtsMINqrJLrRacOk9x"),
+                ("Bill", "pqHfZKP75CvOlQylNhV4"),
+                ("Brian", "nPczCjzI2devNBz1zQrb"),
+            ],
+        )
+    ],
+    ui_args={
+        "title": "Echo Audio with ElevenLabs",
+        "subtitle": "Choose a voice and speak naturally. The model will echo it back in a different voice.",
+    },
+)
+
+stream.ui.launch()
+```
+
+## Run the application
+
+```bash
+python webrtc-streaming.py
+```
+
+You can see the full code [here](https://gist.github.com/freddyaboulton/2a50928337b177205264112531d7552c).
+
+## Conclusion
+
+You've now implemented a WebRTC streaming application in just 50 lines of Python! This example demonstrates how to create a real-time audio processing pipeline that leverages ElevenLabs' speech-to-text and text-to-speech capabilities.
+
+For more information on customizing your WebRTC application, check out the [fastrtc documentation](https://fastrtc.org).
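+
+For example, instead of launching the built-in Gradio UI, the fastrtc documentation describes mounting a `Stream` on a FastAPI app, which is a better fit for production deployments. A minimal sketch, assuming `Stream.mount` is available in your installed fastrtc version and that you've also installed `fastapi` and `uvicorn`:
+
+```python
+# Hypothetical FastAPI entry point that reuses the `stream` object defined above
+import uvicorn
+from fastapi import FastAPI
+
+app = FastAPI()
+# Assumption: fastrtc's Stream.mount registers the WebRTC endpoints on the app
+stream.mount(app)
+
+if __name__ == "__main__":
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+```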