Issue when processing real time audio from a Twilio media stream #304

Pablo-Merino opened this issue Mar 22, 2025 · 6 comments

@Pablo-Merino

Hello! I'm trying to hook up a Twilio media stream to an Agent with the voice pipeline.

My process is more or less the following:

  1. I receive the Twilio call through the websockets and start processing the events
  2. I transcode the Twilio audio from 8kHz mu-Law to the expected 24kHz Mono PCM
  3. I add the audio to the instance of StreamedAudioInput
  4. I listen to the events from the pipeline run
  5. I transcode back the audio from OpenAI to Twilio

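For reference, this is roughly what the websocket side looks like (a sketch rather than my exact code: handle_twilio_message and audio_input are illustrative names, and it assumes Twilio's media events carry base64-encoded μ-law and that StreamedAudioInput.add_audio is awaited):

import base64
import json

from agents.voice import StreamedAudioInput

audio_input = StreamedAudioInput()

async def handle_twilio_message(message: str) -> None:
    msg = json.loads(message)
    if msg.get("event") == "media":
        # Twilio media payloads are base64-encoded 8kHz mu-law
        mulaw_bytes = base64.b64decode(msg["media"]["payload"])
        pcm_24k = mulaw_to_openai_pcm(mulaw_bytes)  # conversion helper shown below
        await audio_input.add_audio(pcm_24k)
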
Now I have a "working" system: I can accept a call, the audio gets processed and sent to OpenAI, and I get a response back that I can hear through the phone call.

The issue is that the transcript is garbled. I don't mean the audio itself (if I listen to it, it's clear enough, and a standard transcription call transcribes it just fine); it's the pipeline's transcript that bears no resemblance to what's actually being said.

Here's two examples:

  • Here I'm just saying "test" but the transcription is "Hi."

This is the audio file: https://filebin.net/3do528busqpnegro/span_239e5f3eb17349dfa9fc64ec-input.wav

  • Here I'm saying "probando" (Spanish for "testing"), and the transcription comes out as two Chinese characters (看看)

This is the audio file: https://filebin.net/3do528busqpnegro/span_9577c4b6fd674cb794081ada-input.wav

I can't figure out what's wrong; I have a feeling it's related to how the audio is being processed.

This is how I'm setting up the pipeline:

pipeline = VoicePipeline(
    workflow=SingleAgentVoiceWorkflow(
        agent,
    ),
    config=VoicePipelineConfig(
        model_provider=OpenAIVoiceModelProvider(
            api_key=OPENAI_API_KEY,
        ),
        workflow_name="Agent",
        stt_settings=STTModelSettings(
            turn_detection={"type": "semantic_vad", "eagerness": "low"},
        ),
    ),
)
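
And, continuing the sketch above, roughly how I consume the pipeline result and push audio back to Twilio (illustrative names again: websocket is the Twilio media-stream connection and stream_sid comes from Twilio's "start" event; the event types and the turn_started lifecycle event are as described in the SDK docs, and Twilio's "clear" message is what I use to flush buffered playback):

result = await pipeline.run(audio_input)

async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        # Convert the model's 24kHz PCM back to 8kHz mu-law and send it to Twilio
        payload = base64.b64encode(openai_audio_to_twilio_mulaw(event.data)).decode("ascii")
        await websocket.send(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        }))
    elif event.type == "voice_stream_event_lifecycle" and event.event == "turn_started":
        # The caller started speaking: ask Twilio to drop any queued playback
        await websocket.send(json.dumps({"event": "clear", "streamSid": stream_sid}))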

This is the audio processing code:

import numpy as np
import audioop
import soxr


def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)

    audio_24k = soxr.resample(audio_np, 8000, 24000)

    # Convert to float32 range [-1.0, 1.0] as expected by OpenAI
    return (audio_24k / 32768.0).astype(np.float32)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    # Normalize dtype
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")

    # Resample from 24kHz → 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)

    # Convert to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)

    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)

PS: I'm adding the processed Twilio audio to the StreamedAudioInput as soon as I receive it; maybe that has something to do with it?

Pablo-Merino added the question label on Mar 22, 2025
@rm-openai
Collaborator

@dkundel-openai mind taking a look when you have a sec?

@dkundel-openai
Contributor

Hey @Pablo-Merino, I shared the issue with the team to figure out what might be happening. I see you are converting the data from int16 to float32 while doing the conversion from μ-law. The issue might be there.

Did you try passing the audio in as int16 instead of converting it to float32? From looking at the code, I think we might have a bug where we don't transform the audio from float32 to int16 before passing it to the Realtime API for transcription (we do perform that conversion when uploading the audio to the tracing API).

Let me know if changing it to int16 works.

@Pablo-Merino
Author

@dkundel-openai amazing, now it's transcribing way better! I guess it's as good as it can get with ulaw 8kHz, which is already pretty lossy. Thanks a lot for the suggestion!

This is the code I ended up using:

import numpy as np
import audioop
import soxr
import webrtcvad


vad = webrtcvad.Vad(3)  # Set aggressiveness mode (0-3)
def is_speech(audio_data: bytes) -> bool:
    """Check if the audio data contains speech using VAD."""
    # Note: webrtcvad expects 16-bit mono PCM in exact 10/20/30 ms frames
    return vad.is_speech(audio_data, 8000)

def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
    """
    Convert μ-law encoded audio to PCM format.
    This function converts μ-law encoded audio data to PCM format and resamples it from 8kHz to 24kHz.
    The input audio data is expected to be in μ-law format, and the output will be a NumPy array of
    int16 PCM samples.
    Args:
        mulaw_bytes (bytes): The μ-law encoded audio data.
    Returns:
        np.ndarray: The PCM audio data resampled to 24kHz.
    """
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)

    # Resample from 8kHz to 24kHz, maintaining int16 format
    audio_24k = soxr.resample(audio_np, 8000, 24000)

    # Return as int16 instead of converting to float32
    return audio_24k.astype(np.int16)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    """
    Convert OpenAI PCM audio to Twilio μ-law format.
    This function converts PCM audio data to μ-law format and resamples it from 24kHz to 8kHz.
    The input audio data is expected to be in PCM format, and the output will be μ-law encoded bytes.
    Args:
        audio_data (np.ndarray): The PCM audio data.
    Returns:
        bytes: The μ-law encoded audio data.
    """
    # Normalize dtype
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")

    # Resample from 24kHz → 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)

    # Convert to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)

    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)

I added an is_speech method to try to avoid sending audio that isn't detected as speech, although that probably isn't working the way I want. Any suggestions?

Also, I've got an extra question regarding turns: when I receive a turn_started event, that means the person is speaking, and turn_ended means they've stopped speaking, right?

I'm having some issues with turn detection. I've tried semantic_vad and server_vad, and I'm trying to clear Twilio's stream when the user starts speaking. Maybe I need to roll my own?

Again, thanks a bunch!

@dkundel-openai
Contributor

Glad it's working now!

Technically semantic_vad or server_vad should handle the silence detection for you. You could try enabling noise reduction to further improve the experience.

Right now we also don't really have a good way for you to do your own VAD & turn detection but we should add that as an additional option.

I'm planning to build a Twilio app myself later this week and will report back with any additional tips I can find.
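
For anyone trying their own VAD on the Twilio side in the meantime, one thing to watch for: webrtcvad only accepts 16-bit mono PCM at 8/16/32/48 kHz in exact 10/20/30 ms frames, so the μ-law payload has to be decoded first. A rough sketch, assuming Twilio's usual 20 ms media frames:

import audioop
import webrtcvad

vad = webrtcvad.Vad(3)
FRAME_BYTES = 320  # 20 ms at 8 kHz, 16-bit mono PCM

def frame_is_speech(mulaw_frame: bytes) -> bool:
    # Decode mu-law to 16-bit PCM before handing it to webrtcvad
    pcm = audioop.ulaw2lin(mulaw_frame, 2)
    if len(pcm) != FRAME_BYTES:
        return True  # can't classify odd-sized frames; pass them through
    return vad.is_speech(pcm, 8000)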

@Pablo-Merino
Author

@dkundel-openai thanks a lot for the help! I'll be looking forward to that Twilio app feedback. Meanwhile I'll try the noise reduction and see how it fares. Thanks again!


github-actions bot commented Apr 2, 2025

This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label on Apr 2, 2025