
Medium model output is nonsense for batched pipeline (for short 15s audio clips) #977

Open
tjongsma opened this issue Aug 26, 2024 · 9 comments

Comments

@tjongsma

Like the title implies, when using the batched commits with the medium model, the output is nonsense (empty, repeats the initial prompt, or says 'I'm sorry'). I'm using something along the following lines:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model_size = "medium"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model,
                                         use_vad_model=True)
segments, info = batched_model.transcribe(audio_tensor,
                                          batch_size=24,
                                          word_timestamps=True)

text = []
# Iterate over the segments and store words with timestamps
for segment in segments:
    for word in segment.words:
        text.append(WordWithTimestamp(word.word, word.start, word.end))

It works fine with both large-v2 and large-v3. Any idea as to why and/or a way to fix it? Thank you!

@MahmoudAshraf97
Collaborator

Can you upload the audio so we can reproduce this?

@tjongsma
Author

So I'm using it to do live streaming with Whisper, hence my wanting to use the medium model for better latency. This means I'm using a 15-second rolling window of my mic input. I'm using the following video for testing: https://www.youtube.com/watch?v=kYnNSORARFk. I've fixed the complete-nonsense output by improving the input quality (I was converting my mic input suboptimally; large-v2 could apparently still decipher it, but medium couldn't). What I'm running into now is that the transcriptions cycle between a normal transcription of the 15 seconds of audio and heavily shortened versions of it. I'm using the following code in combination with the above to get the output as a string:

class WordWithTimestamp:
    def __init__(self, word, start, end):
        self.word = word
        self.start = start
        self.end = end

    def __str__(self):
        return self.word

"".join(str(word) for word in text)

Then for the first ~15s of the clip I linked, I alternately get, e.g.:
"So give the president a chance.
Governor romney, i'm glad that you recognize that al qaeda is a threat.
Because a few months ago, when you were asked what's the biggest geopolitical threat facing america, you said russia, not al qaeda."
"So give the president a chance."
"So"
"The"
Any ideas on why this happens? I'm starting to think that maybe the batched output differs from the normal output, and the reason I'm getting this problem with medium but not large-v2 is that medium lets my GPU take more advantage of batching (I'm running it on a laptop 3060 with 6 GB of VRAM).

@MahmoudAshraf97
Collaborator

Batching will not be useful for live transcription unless you are doing it over multiple streams/files; also check this.
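
For reference, a minimal sketch of the multi-file case where batching does pay off, reusing the same API as the snippet earlier in the thread (the file names and batch size here are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

# Batching shines when there are many chunks to fill a batch, e.g. a backlog
# of recorded files, rather than a single short rolling window of mic input.
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

for path in ["meeting_part1.wav", "meeting_part2.wav"]:  # placeholder files
    segments, info = batched_model.transcribe(path, batch_size=16)
    print(path, "".join(segment.text for segment in segments))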

@tjongsma tjongsma changed the title Medium model output is nonsense for batched pipeline Medium model output is nonsense for batched pipeline (for short 15s audio clips) Aug 28, 2024
@tjongsma
Author

tjongsma commented Aug 28, 2024

Alright, intuitively that makes sense, but when I used it with the large models it did perform much faster than the unbatched version and gave good results (very similar to the unbatched output). Is there any explanation for that? It feels like there is something there.

And thanks for the link btw. I have tried Whisperlive, but I couldn't get it to work as I'd like for my use case (transcribing meetings). My approach is very similar but incorporates some elements from whisper_streaming. I'm planning to take a look at https://github.com/backspacetg/simul_whisper too.

@tjongsma
Author

Somewhat related: even when using the unbatched version, faster-whisper will sometimes take a very long time to transcribe an audio clip of <15s (think 8-40s, where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

@MahmoudAshraf97
Collaborator

You can disable fallback by setting temperature to a single value instead of the default list.
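
For example, a minimal sketch of the single-value setting (the file path is a placeholder; by default transcribe falls back through the list [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] when a decode fails its quality checks):

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# With the default temperature list, a failed quality check triggers
# re-decoding at the next temperature, which can multiply latency on
# difficult clips. A single value disables that fallback loop.
segments, info = model.transcribe("clip.wav", temperature=0.0)
print("".join(segment.text for segment in segments))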

@tjongsma
Author

Thanks, that's super useful! It almost completely eliminated the very long transcription times.

@asr-lord

asr-lord commented Aug 31, 2024

Somewhat related: even when using the unbatched version, faster-whisper will sometimes take a very long time to transcribe an audio clip of <15s (think 8-40s, where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

@tjongsma I have the same issue and the same application: real-time transcription with ~3s chunks that sometimes takes far too long. How did you fix it? Thank you.

@tjongsma
Author

tjongsma commented Aug 31, 2024 via email
