
Medium model output is nonsense for batched pipeline (for short 15s audio clips) #977

Open
tjongsma opened this issue Aug 26, 2024 · 9 comments

Comments

@tjongsma

Like the title implies, when using the batched commits with the medium model, the output is nonsense (empty, repeats the initial prompt, or says 'I'm sorry'). I'm using something along the following lines:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model_size = "medium"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model,
                                         use_vad_model=True)
segments, info = batched_model.transcribe(audio_tensor,
                                          batch_size=24,
                                          word_timestamps=True)

text = []
# Iterate over the segments and store words with timestamps
for segment in segments:
    for word in segment.words:
        text.append(WordWithTimestamp(word.word, word.start, word.end))

It works fine with both large-v2 and large-v3. Any idea as to why and/or a way to fix it? Thank you!

@MahmoudAshraf97
Collaborator

Can you upload the audio so we can reproduce this?

@tjongsma
Author

So I'm using it to do live streaming with Whisper, hence my wanting to use the medium model for better latency. This means I'm using a 15-second rolling window of my mic input. I'm using the following video for testing: https://www.youtube.com/watch?v=kYnNSORARFk. I've fixed the complete-nonsense output by improving the input quality (I was converting my mic input suboptimally; large-v2 could apparently still decipher it, but medium couldn't). What I'm running into now is that the transcriptions cycle between a normal transcription of the 15 seconds of audio and heavily shortened versions of it. I'm using the following code in combination with the above to get the output as a string:

class WordWithTimestamp:
    def __init__(self, word, start, end):
        self.word = word
        self.start = start
        self.end = end

    def __str__(self):
        return self.word

"".join(str(word) for word in text)

Then for the first ~15s of the clip I linked, I alternately get, e.g.:
"So give the president a chance.
Governor romney, i'm glad that you recognize that al qaeda is a threat.
Because a few months ago, when you were asked what's the biggest geopolitical threat facing america, you said russia, not al qaeda."
"So give the president a chance."
"So"
"The"
Any ideas on why this happens? I'm starting to think that maybe the batched output differs from the normal output, and the reason I'm getting this problem with medium but not large-v2 is that medium lets my GPU take more advantage of batching (I'm running it on a laptop 3060 with 6 GB of VRAM).

@MahmoudAshraf97
Collaborator

Batching will not be useful for live transcription unless you are doing it over multiple streams/files; also check this.
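
For reference, a minimal sketch of the multi-file case where batching does pay off, reusing the same API as the snippet earlier in the thread (the file names and batch size here are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

# Batching shines when there are many chunks to fill a batch, e.g. a backlog
# of recorded files, rather than a single short rolling window of mic input.
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

for path in ["meeting_part1.wav", "meeting_part2.wav"]:  # placeholder files
    segments, info = batched_model.transcribe(path, batch_size=16)
    print(path, "".join(segment.text for segment in segments))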

@tjongsma tjongsma changed the title Medium model output is nonsense for batched pipeline Medium model output is nonsense for batched pipeline (for short 15s audio clips) Aug 28, 2024
@tjongsma
Author

tjongsma commented Aug 28, 2024

Alright, intuitively that makes sense, but when I used it with the large models it did perform much faster than the unbatched version and gave good results (very similar to the unbatched output). Is there any explanation for that? It feels like there is something there.

And thanks for the link btw. I have tried Whisperlive, but I couldn't get it to work as I'd like for my use case (transcribing meetings). My approach is very similar but incorporates some elements from whisper_streaming. I'm planning to take a look at https://github.com/backspacetg/simul_whisper too.

@tjongsma
Author

Somewhat related: even when using the unbatched version, faster-whisper will sometimes take a very long time to transcribe an audio clip of <15s (think 8-40s, where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

@MahmoudAshraf97
Collaborator

You can disable fallback by setting temperature to a single value instead of the default list.
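
For example, a minimal sketch of the single-value setting (the file path is a placeholder; by default transcribe falls back through the list [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] when a decode fails its quality checks):

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# With the default temperature list, a failed quality check triggers
# re-decoding at the next temperature, which can multiply latency on
# difficult clips. A single value disables that fallback loop.
segments, info = model.transcribe("clip.wav", temperature=0.0)
print("".join(segment.text for segment in segments))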

@tjongsma
Author

Thanks, that's super useful! It almost completely eliminated the very long transcription times.

@asr-lord

asr-lord commented Aug 31, 2024

Somewhat related: even when using the unbatched version, faster-whisper will sometimes take a very long time to transcribe an audio clip of <15s (think 8-40s, where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

@tjongsma I have the same issue and the same application: real-time transcription with ~3s chunks that sometimes takes far too long. How did you fix it? Thank you.

@tjongsma
Author

tjongsma commented Aug 31, 2024 via email
