why does word_timestamps=True change the transcription output? #2535
Replies: 1 comment
This is a very good question, and you're not imagining it. In principle, enabling word_timestamps=True should only add alignment information after decoding. However, in the current Whisper implementation, it can indirectly affect the final transcription. What happens when
Hi, this is more of a theoretical question. If I run Whisper with the same parameters, and the audio does not trigger temperature fallbacks that introduce randomness, repeated runs on the same long audio files are usually quite deterministic for me. At most, the differences are incredibly tiny, probably due to float16 rounding errors (?).
However, running it with word_timestamps on and off on the same audio gives quite different text results, even with "condition_on_previous_text" disabled so that small changes cannot add up to bigger changes over time via the "previous prompt".
I am curious as to why. As far as I can tell, the word_timestamps option is not even passed to the model.decode() function and does not appear anywhere inside it. I also cannot find any code that alters the slicing of the audio chunks / seek based on word_timestamps, as long as you keep hallucination_silence_threshold disabled.
Does retrieving the attention weights in find_alignment() alter some internal state of the model that causes it to give different results? At that point, however, the output segments have already been formed. I also see the text differences if I manually slice a long audio file into 30-second files with ffmpeg and transcribe each one in a fresh Python instance that loads Whisper for a single transcription, with "word_timestamps" either on or off.
Am I overlooking something? I just want to understand how the outputs can be different. If I had to make a guess about the nature of the transcription differences, I would say transcripts with "word_timestamps" = True "overlook" more content overall, but it is hard to tell.
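One way to put a number on the "overlooks more content" impression is a word-level diff of the two transcripts. Below is a minimal sketch using only the Python standard library; the helper names `words` and `compare_transcripts` and the result keys are hypothetical, chosen just for illustration, and the two input strings stand in for the text produced with word_timestamps off and on.

```python
import difflib
import re


def words(text):
    # Lowercase and drop punctuation so only actual wording changes are counted.
    return re.findall(r"[\w']+", text.lower())


def compare_transcripts(text_off, text_on):
    """Word-level diff: text_off = word_timestamps off, text_on = word_timestamps on."""
    a, b = words(text_off), words(text_on)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    missing, extra = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(a[i1:i2])  # words present only in the "off" transcript
        if tag in ("insert", "replace"):
            extra.extend(b[j1:j2])    # words present only in the "on" transcript
    return {
        "ratio": matcher.ratio(),            # 1.0 means identical word sequences
        "missing_with_timestamps": missing,
        "extra_with_timestamps": extra,
    }


if __name__ == "__main__":
    # Made-up example strings, not real Whisper output.
    result = compare_transcripts(
        "Hello there, this is a short test of the system.",
        "Hello there, this is a test.",
    )
    print(result)
```

In practice the two strings would be the text returned by two transcription runs that differ only in word_timestamps. If "missing_with_timestamps" is consistently longer than "extra_with_timestamps" across files, that would support the guess that content is being dropped rather than merely reworded.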
Cheers!