Medium model output is nonsense for batched pipeline (for short 15s audio clips) #977
Comments
Can you upload the audio to reproduce?
So I'm using it to do live streaming with Whisper, hence my wanting to use the medium model for better latency. This means I'm using a 15-second rolling window of my mic input. I'm using the following video for testing: https://www.youtube.com/watch?v=kYnNSORARFk. I've fixed the complete nonsense output by improving data quality (I was converting my mic input suboptimally; large-v2 could apparently still decipher it, but medium couldn't). What I'm running into now is that the transcriptions seem to cycle between normal transcriptions of the 15 seconds of audio and very shortened versions of it. I'm using the following code in combination with the above to get the output as a string:
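A minimal sketch of the kind of call described here, assuming faster-whisper's `BatchedInferencePipeline`; the buffer name `rolling_window`, the `batch_size`, and the device/compute settings are illustrative rather than the exact original code:

```python
import numpy as np
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Illustrative setup: medium model behind the batched pipeline
# (device, compute_type and batch_size are assumed values).
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

def transcribe_window(rolling_window: np.ndarray) -> str:
    """rolling_window: the last ~15 s of mic audio as 16 kHz mono float32."""
    segments, _info = batched_model.transcribe(rolling_window, batch_size=16)
    # segments is a generator of Segment objects; joining their text yields one string.
    return " ".join(segment.text.strip() for segment in segments)
```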
Then for the first ~15 s of the clip I linked I get, alternately, e.g.:
Batching will not be useful for live transcription unless you are doing it over multiple streams/files. Also check this.
Alright, intuitively that makes sense, but when I used it with the large models it did perform much faster than the unbatched version and gave good results (very similar to the unbatched version). Is there any explanation for that? It feels like there is something there. And thanks for the link, by the way; I have tried WhisperLive but couldn't get it to work as I'd like for my use case (transcribing meetings). My approach is very similar but incorporates some elements from whisper_streaming. Planning to take a look at https://github.com/backspacetg/simul_whisper too.
Somewhat related: even when using the unbatched version, faster-whisper will sometimes take a very long time to transcribe an audio clip of <15 s (think 8-40 s, where it usually takes about 1 s). I'm assuming this is caused by some hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.
You can disable fallback by setting a single temperature (e.g. temperature=0) instead of the default temperature schedule.
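For reference, a sketch of that call, assuming `model` is an already-loaded `WhisperModel` and `audio` is a file path or 16 kHz float32 array:

```python
# A single temperature leaves nothing to fall back to, so the decoder never
# retries a segment at higher temperatures when the quality thresholds fail.
segments, info = model.transcribe(
    audio,
    beam_size=5,
    temperature=0,  # default is a schedule (0.0, 0.2, ..., 1.0) that enables fallback retries
)
text = " ".join(segment.text.strip() for segment in segments)
```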
Thanks, that's super useful. It almost completely eliminated the very long transcription times!
@tjongsma I have the same issue and the same application: real-time transcription with ~3 s chunks that sometimes takes a long time. How did you fix it? Thank you.
Setting beam_size=5, temperature=0 and max tokens=224 worked for me! Let me know if it does for you too.
Like the title implies, when using the batched pipeline with the medium model, the model output is nonsense (empty, repeats the initial prompt, or says 'I'm sorry'). I'm using something along the following lines:
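As a representative sketch of that setup (the file name, `batch_size`, and `compute_type` are assumed values, not the exact original code):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# "clip_15s.wav" stands in for the short (~15 s) audio clip being transcribed.
segments, info = batched_model.transcribe("clip_15s.wav", batch_size=16)
print(" ".join(segment.text.strip() for segment in segments))
```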
It works fine with both large-v2 and large-v3. Any idea as to why and/or a way to fix it? Thank you!