Issue when processing real-time audio from a Twilio media stream #304
Comments
@dkundel-openai mind taking a look when you have a sec?
Hey @Pablo-Merino, I shared the issue with the team to figure out what might be happening. I see you are converting the data from int16 to float32 while doing the conversion from μ-law; the issue might be there. Did you try passing the audio in as int16 as opposed to converting it to float32? From looking at the code, I think we might have a bug where we don't transform the audio from float32 to int16 before passing it to the Realtime API for transcription. (For uploading the audio to the tracing API we do the conversion.) Let me know if changing it to int16 works.
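In other words, the suggestion boils down to something like this sketch (assuming `mulaw_bytes` holds the raw μ-law payload from Twilio):

```python
import audioop

import numpy as np

pcm16 = audioop.ulaw2lin(mulaw_bytes, 2)      # mu-law -> 16-bit linear PCM
audio = np.frombuffer(pcm16, dtype=np.int16)  # keep the samples as int16
# audio = audio.astype(np.float32) / 32768.0  # <- the float32 conversion to drop
```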
@dkundel-openai amazing, now it's transcribing way better! I guess it's as good as it can get with μ-law 8kHz, which is already pretty lossy. Thanks a lot for the suggestion! This is the code I ended up using:

```python
import audioop  # deprecated since Python 3.11, removed in 3.13

import numpy as np
import soxr
import webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness mode (0-3); 3 filters non-speech most aggressively


def is_speech(audio_data: bytes) -> bool:
    """Check if the audio data contains speech using VAD."""
    # webrtcvad expects 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames
    return vad.is_speech(audio_data, 8000)


def mulaw_to_openai_pcm(mulaw_bytes: bytes) -> np.ndarray:
    """
    Convert μ-law encoded audio to PCM format.

    This function converts μ-law encoded audio data to PCM format and resamples
    it from 8kHz to 24kHz. The input audio data is expected to be in μ-law
    format, and the output will be a NumPy array of int16 PCM samples.

    Args:
        mulaw_bytes (bytes): The μ-law encoded audio data.

    Returns:
        np.ndarray: The PCM audio data resampled to 24kHz.
    """
    pcm = audioop.ulaw2lin(mulaw_bytes, 2)
    audio_np = np.frombuffer(pcm, dtype=np.int16)
    # Resample from 8kHz to 24kHz, maintaining int16 format
    audio_24k = soxr.resample(audio_np, 8000, 24000)
    # Return as int16 instead of converting to float32
    return audio_24k.astype(np.int16)


def openai_audio_to_twilio_mulaw(audio_data: np.ndarray) -> bytes:
    """
    Convert OpenAI PCM audio to Twilio μ-law format.

    This function converts PCM audio data to μ-law format and resamples it from
    24kHz to 8kHz. The input audio data is expected to be in PCM format, and the
    output will be μ-law encoded bytes.

    Args:
        audio_data (np.ndarray): The PCM audio data.

    Returns:
        bytes: The μ-law encoded audio data.
    """
    # Normalize dtype to float32 in [-1, 1] for resampling
    if audio_data.dtype == np.int16:
        audio_data = audio_data.astype(np.float32) / 32768.0
    elif audio_data.dtype != np.float32:
        raise ValueError(f"Unsupported dtype: {audio_data.dtype}")
    # Resample from 24kHz to 8kHz
    resampled = soxr.resample(audio_data, 24000, 8000)
    # Convert back to int16
    resampled_int16 = np.clip(resampled * 32768.0, -32768, 32767).astype(np.int16)
    # μ-law encode
    return audioop.lin2ulaw(resampled_int16.tobytes(), 2)
```

I added a VAD check (`is_speech` above) to filter out silence before pushing audio into the pipeline.

Also I've got an extra question, regarding turns. When I receive a turn_started event, that means the person is speaking, and turn_ended means the person stopped speaking, right? I'm having some issues with turn detection; I tried with both semantic_vad and server_vad. Again, thanks a bunch!
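For the return path, a rough sketch of streaming the pipeline's audio back to Twilio; the `voice_stream_event_audio` event type comes from the Agents SDK voice pipeline, while the websocket object and `stream_sid` are placeholders:

```python
import base64
import json


async def send_pipeline_audio(result, websocket, stream_sid: str) -> None:
    """Stream the voice pipeline's audio output back to the caller as mu-law."""
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            # event.data is 24 kHz PCM from the pipeline; convert with the helper above
            payload = base64.b64encode(openai_audio_to_twilio_mulaw(event.data))
            await websocket.send(json.dumps({
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": payload.decode("ascii")},
            }))
```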
Glad it's working now! Technically semantic_vad or server_vad should handle the silence detection for you. You could try enabling noise reduction to further improve the experience. Right now we also don't really have a good way for you to do your own VAD & turn detection, but we should add that as an additional option. I'm planning to build a Twilio app myself later this week and will report back with any additional tips I can find.
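For reference, turn detection can be tuned through the pipeline's STT settings; a rough sketch, assuming `STTModelSettings` exposes a `turn_detection` dict, with an illustrative `eagerness` value and an `agent` defined elsewhere. Noise reduction is a separate Realtime/transcription-session option (`input_audio_noise_reduction`):

```python
from agents.voice import (
    SingleAgentVoiceWorkflow,
    STTModelSettings,
    VoicePipeline,
    VoicePipelineConfig,
)

config = VoicePipelineConfig(
    stt_settings=STTModelSettings(
        # Let the server decide when a turn starts/ends
        turn_detection={"type": "semantic_vad", "eagerness": "medium"},
    ),
)
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent), config=config)
```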
@dkundel-openai thanks a lot for the help! I'll be looking forward to that Twilio app feedback. Meanwhile I'll try the noise reduction and see how it fares. Thanks again!
Hello! I'm trying to hook up a Twilio media stream to an Agent with the voice pipeline.
My process is more or less the following:

- Receive the 8 kHz μ-law audio from the Twilio media stream over a websocket
- Decode it, convert it to PCM, and resample it to 24 kHz
- Push it into a StreamedAudioInput for the voice pipeline
- Convert the pipeline's audio output back to 8 kHz μ-law and send it to Twilio
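Roughly, the receiving side of that process looks like the sketch below; the websocket object and handler name are placeholders, `audio_input` is a StreamedAudioInput, and `mulaw_to_openai_pcm` is the conversion helper shown earlier in the thread, but the Twilio media-stream message shape is standard:

```python
import base64
import json


async def handle_twilio_stream(websocket, audio_input) -> None:
    """Read Twilio media-stream frames and feed the audio to the pipeline."""
    stream_sid = None
    async for raw in websocket:
        message = json.loads(raw)
        if message["event"] == "start":
            stream_sid = message["start"]["streamSid"]  # needed to send audio back later
        elif message["event"] == "media":
            # Twilio sends 8 kHz mu-law audio, base64-encoded
            mulaw_bytes = base64.b64decode(message["media"]["payload"])
            await audio_input.add_audio(mulaw_to_openai_pcm(mulaw_bytes))
        elif message["event"] == "stop":
            break
```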
Now, what I have is a "working" system: I can accept a call, the audio gets processed properly and sent to OpenAI, and I get a response back that I'm able to hear through the phone call.
The issue I have is that the audio transcript is just garbled. I don't mean the audio itself: if I listen to it, it's clear enough, and I'm able to transcribe it just fine with a standard transcription call. It's the transcript produced by the pipeline that bears no resemblance to what is actually being said.
Here are two examples:
This is the audio file: https://filebin.net/3do528busqpnegro/span_239e5f3eb17349dfa9fc64ec-input.wav
This is the audio file: https://filebin.net/3do528busqpnegro/span_9577c4b6fd674cb794081ada-input.wav
I cannot figure out what's wrong; I have a feeling it has to do with how the audio is processed for some reason.
This is how I'm setting up the pipeline:
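A minimal voice-pipeline setup along the lines of the Agents SDK quickstart looks roughly like this; the agent name and instructions are illustrative assumptions, not the exact code from this issue:

```python
from agents import Agent
from agents.voice import SingleAgentVoiceWorkflow, StreamedAudioInput, VoicePipeline

agent = Agent(
    name="Phone Assistant",
    instructions="You are a helpful voice assistant on a phone call.",
)

pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
audio_input = StreamedAudioInput()

# In the call handler:
# result = await pipeline.run(audio_input)
# (then stream result events back to Twilio)
```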
This is the audio processing code:
PS: I'm adding the processed Twilio audio to the StreamedAudioInput as soon as I receive it; maybe it's got to do with that?