why does word_timestamps=True change the transcription output? #2535

whispy-woods · 2025-02-21T11:05:04Z

Hi, this is more of a theoretical question. If I run Whisper with the same parameters, and the audio does not trigger temperature fallbacks that introduce randomness, repeated runs on the same long audio files are usually quite deterministic for me. At most, the differences are tiny, probably due to float16 rounding errors (?).

However, running it with word_timestamps on and off on the same audio gives quite different text results, even with "condition_on_previous_text" disabled so that small changes cannot add up to bigger changes over time via the "previous prompt".

I am curious as to why. As far as I can tell, the word_timestamps option is not passed to the model.decode() function and is not used anywhere inside it. I also cannot find any code that alters the slicing of the audio chunks / seek based on word_timestamps, as long as hallucination_silence_threshold stays disabled.

Does retrieving the attention weights in find_alignment() alter some internal state of the model that causes it to give different results? At that point, however, the output segments have already been formed. I also see the text differences if I manually slice a long audio file into 30-second files with ffmpeg and transcribe each one with a single call in a fresh Python instance, with "word_timestamps" either on or off.

Am I overlooking something? I just want to understand how the outputs can be different. If I had to make a guess about the nature of the transcription differences, I would say transcripts with "word_timestamps" = True "overlook" more content overall, but it is hard to tell.

Cheers!

Advait251206 · 2026-06-24T18:32:02Z

This is a very good question, and you're not imagining it. In principle, enabling:

word_timestamps=True

should only add alignment information after decoding. However, in the current Whisper implementation, it can indirectly affect the final transcription.

What happens when `word_timestamps=False`?

The pipeline is roughly:

Audio
 ↓
Decode segment
 ↓
Generate text
 ↓
Return result

Segment boundaries are determined primarily by:

timestamp tokens
silence heuristics
seek progression

The decoded text is returned largely unchanged.

What happens when `word_timestamps=True`?

The pipeline becomes:

Audio
 ↓
Decode segment
 ↓
Find token-word alignment
 ↓
Adjust segment timing
 ↓
Apply word-level timestamp heuristics
 ↓
Potentially modify segment boundaries
 ↓
Return result

The key point is that Whisper does more than simply attach timestamps to existing words.

Segment boundary adjustments

When word timestamps are enabled, Whisper uses attention-based alignment to estimate where words occur.

During this process it may:

split segments differently
merge segments differently
trim words near boundaries
shift segment start/end times

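These changes can affect what audio is included in subsequent decoding windows. As far as I can tell from the openai/whisper source, when word timestamps are enabled and a window does not end on a single timestamp token, transcribe() sets the next seek position from the end of the last aligned word rather than from the last timestamp token. This happens whether or not hallucination_silence_threshold is set.

For long audio, even small timing changes can propagate forward.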

Even with `condition_on_previous_text=False`

You're correct that disabling:

condition_on_previous_text=False

removes one major source of cascading differences.

However, another source remains:

segment timing
↓
seek position
↓
next audio chunk
↓
different decoding result

If the alignment logic causes Whisper to advance or rewind slightly differently, later chunks may no longer be identical.
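The chain above can be sketched with a toy calculation. This is an illustration, not Whisper's actual code: `next_seek` and its arguments are hypothetical names, the times are made up, and the rule it encodes (use the last word end when available, otherwise the last timestamp token) is my reading of the transcription loop. The only value taken from Whisper is the 100 mel frames per second.

```python
FRAMES_PER_SECOND = 100  # Whisper's mel spectrogram uses 100 frames per second


def next_seek(window_start_s, last_timestamp_s, last_word_end_s=None):
    """Return the frame index where the next decoding window starts.

    Hypothetical helper; all times are absolute seconds.
    """
    # Assumed rule: with word timestamps, the last aligned word end wins.
    if last_word_end_s is not None and last_word_end_s > window_start_s:
        return round(last_word_end_s * FRAMES_PER_SECOND)
    # Without word timestamps, fall back to the last timestamp token.
    return round(last_timestamp_s * FRAMES_PER_SECOND)


# Same decoded window, two ways of choosing where the next window begins.
seek_plain = next_seek(0.0, last_timestamp_s=26.0)
seek_words = next_seek(0.0, last_timestamp_s=26.0, last_word_end_s=25.42)
print(seek_plain, seek_words)
```

With these made-up numbers the two runs start their second window more than half a second apart, so the model sees different audio from that point on.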

Hallucination-related heuristics

Even when:

hallucination_silence_threshold=None

some word-timestamp-related heuristics are still active.

The alignment code computes information such as:

word durations
pauses
punctuation timing

These values feed into the final word end times, which in turn determine how segment boundaries are finalized.

As a result, the effective segmentation path may diverge from the non-word-timestamp run.

Does `find_alignment()` modify model weights or decoder state?

Generally:

No

The attention extraction performed by:

find_alignment()

uses forward hooks but does not intentionally modify model parameters.

So the differences are not typically caused by:

changed weights
changed KV cache
changed decoder logits

for an already-decoded segment.

The more likely explanation is downstream processing and segmentation.

Why differences can appear even on isolated chunks

You mentioned that you:

1. Cut audio into 30-second files
2. Start a fresh Python process
3. Run with word_timestamps=True vs False

and still see text differences.

That is more interesting.

In that case, there are two possibilities.

1. Different decoding path due to timestamp handling

Whisper's decoding logic is tightly coupled with timestamp tokens.

Even if word_timestamps isn't passed directly into:

model.decode()

the overall transcription pipeline can take slightly different paths regarding:

timestamp token interpretation
segment post-processing
token filtering

which can affect the final text emitted.

2. Numerical effects from attention extraction

This is less likely but possible.

The alignment code registers hooks and performs additional forward passes through parts of the model.

On GPU, especially with:

float16
Tensor Cores
non-deterministic kernels

very small numerical differences can occur.

Normally these should not matter, but if decoding is near a decision boundary:

token A: 0.501
token B: 0.499

a tiny perturbation can flip the selected token.

Over long sequences, those small differences may become noticeable.
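To make the decision-boundary point concrete, here is a minimal sketch using the two probabilities from above. The perturbation size is an arbitrary illustrative value, not a measured float16 error, and `argmax` is a small helper defined here rather than anything from Whisper.

```python
def argmax(scores):
    """Index of the largest score (greedy token choice)."""
    return max(range(len(scores)), key=lambda i: scores[i])


probs = [0.501, 0.499]          # token A, token B
perturbation = [-0.002, 0.002]  # illustrative numerical noise, assumed size

perturbed = [p + d for p, d in zip(probs, perturbation)]

choice_before = argmax(probs)
choice_after = argmax(perturbed)
print(choice_before, choice_after)
```

Greedy decoding picks token A before the perturbation and token B after it, and every later token is then conditioned on a different prefix.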

Why transcripts with `word_timestamps=True` sometimes miss content

This observation matches reports from other users.

A common pattern is:

word_timestamps=False
→ slightly more complete transcript

word_timestamps=True
→ cleaner timing but occasional dropped words/phrases

This is usually caused by alignment-based boundary adjustments.

Words near:

segment start
segment end
long pauses

are the most vulnerable.
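One way to check which words go missing is a word-level diff of the two transcripts. The sketch below uses Python's standard `difflib`; the helper name `dropped_words` and the two sample strings are made up for illustration, and words that were replaced rather than dropped are reported as well.

```python
import difflib


def dropped_words(reference, candidate):
    """Words in `reference` that were dropped or replaced in `candidate`."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(ref_words[i1:i2])
    return missing


# Made-up transcripts standing in for result["text"] from each run.
text_off = "so um I think we should start the meeting now"
text_on = "so I think we should start the meeting"
print(dropped_words(text_off, text_on))
```

Running this over the real outputs and noting where in the audio the missing words fall should show whether they cluster near segment boundaries and pauses.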

How to verify

A useful experiment is:

result = model.transcribe(
    audio,
    word_timestamps=False,
    verbose=True
)

and compare against:

result = model.transcribe(
    audio,
    word_timestamps=True,
    verbose=True
)

while inspecting:

result["segments"]

Specifically compare:

start
end
seek
tokens

You'll often find that segment boundaries differ even when the underlying audio is identical.
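A small helper can make that comparison less tedious. This is a sketch, assuming each segment is a dict with `seek`, `start`, `end` and `text` keys as in result["segments"]; `compare_segments` is a hypothetical name and the sample segments are invented.

```python
def compare_segments(segs_a, segs_b, keys=("seek", "start", "end", "text")):
    """Return (index, key, value_a, value_b) for every differing field."""
    diffs = []
    for i, (a, b) in enumerate(zip(segs_a, segs_b)):
        for key in keys:
            if a.get(key) != b.get(key):
                diffs.append((i, key, a.get(key), b.get(key)))
    # Report a length mismatch at the first unpaired index.
    if len(segs_a) != len(segs_b):
        first_unpaired = min(len(segs_a), len(segs_b))
        diffs.append((first_unpaired, "count", len(segs_a), len(segs_b)))
    return diffs


# Invented segments in the shape of result["segments"], for illustration only.
off = [
    {"seek": 0, "start": 0.0, "end": 4.0, "text": " Hello there."},
    {"seek": 2600, "start": 26.0, "end": 29.0, "text": " Goodbye."},
]
on = [
    {"seek": 0, "start": 0.0, "end": 4.0, "text": " Hello there."},
    {"seek": 2542, "start": 25.42, "end": 29.0, "text": " Goodbye."},
]
for diff in compare_segments(off, on):
    print(diff)
```

The first index at which `seek` differs is the window where the two runs stopped seeing the same audio.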

Bottom line

Your intuition is mostly correct: word_timestamps=True does not simply "decorate" an existing transcript with timestamps. It triggers Whisper's alignment pipeline, which can alter segment boundaries and timing decisions. Those changes can propagate into different decoding windows, leading to different text output. For isolated chunks, any remaining differences are likely due to timestamp-related post-processing and, in some cases, small numerical differences introduced by additional alignment passes.
