What special tokens to use for transcription, and why is <|startoflm|> so highly probable? #2578

belson17 · 2025-04-19T00:56:10Z

Hi all, I'm trying to determine the probability of a sequence of "target tokens" for a given audio recording of speech. To do so, I'm passing in this sequence of tokens and the encoded Mel spectrogram to the decoder. My problem is I'm unsure what special tokens to prepend to the sequence to indicate to the model to transcribe in English. In particular, for the "turbo" and "large-v3" models, I see the token "<|startoflm|>" has a very high probability output from the decoder, so it seems the model expects this token, but I don't know how to use it. I also don't see this token predicted with high probability when using other models, e.g., "base".

What prefix special tokens should I use for the turbo and large models? Below are snippets of code to show what I'm doing. If there is documentation, please point me there. Thanks!

import numpy as np
import torch
import torch.nn.functional as F
import whisper

# `tokenizer`, `mel`, and `target_token_ids` (shape [1, T]) are defined elsewhere.
model = whisper.load_model("turbo", device="cuda")

prefix_token_ids = [50258, 50259, 50359, 50363]
prefix_tokens = ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
prefix_tensor = torch.tensor(prefix_token_ids).unsqueeze(0)

# Teacher forcing: feed tokens [0, n-1) and score tokens [1, n).
full_sequence_token_ids = torch.cat([prefix_tensor, target_token_ids], dim=1)
decoder_input_token_ids = full_sequence_token_ids[:, :-1]
target_token_ids_to_eval = full_sequence_token_ids[:, 1:]

logits = model.decoder(decoder_input_token_ids, model.encoder(mel))
log_probs = F.log_softmax(logits, dim=-1)

# Print the five most likely next tokens at each position next to the target token.
for i in range(len(logits[0])):
    pred_top = logits[0, i].topk(5).indices
    pred_top_probs = np.exp(log_probs[0, i, pred_top].detach().cpu().numpy())
    print(f"Time step {i}: top tokens = {[tokenizer.decode([t]) for t in pred_top]}, probs={pred_top_probs}")
    print(f"Target token: {tokenizer.decode([target_token_ids_to_eval[0, i]])}")

Outputs:

Time step 0: top tokens = ['<|en|>', '<|notimestamps|>', '<|la|>', '<|de|>', '<|fr|>'], probs=[9.9240476e-01 4.9215057e-03 4.9104204e-04 2.9613255e-04 2.5190943e-04]
Target token: <|en|>
Time step 1: top tokens = ['<|startoflm|>', '<|transcribe|>', '', '<|nospeech|>', '<|endoftext|>'], probs=[9.9999833e-01 1.6491989e-06 2.2530656e-08 1.3922099e-08 4.4615808e-10]
Target token: <|transcribe|>
Time step 2: top tokens = ['', '', '', '', ''], probs=[5.7725841e-01 4.1457579e-01 5.2851834e-04 5.1052548e-04 4.3363243e-04]
Target token: <|notimestamps|>
Time step 3: top tokens = ['<|endoftext|>', ',', ' and', ' -', ' ...'], probs=[1.0000000e+00 3.1554283e-11 2.2344248e-11 1.2954151e-11 1.1298531e-11]
Target token:  They
Time step 4: top tokens = [' regain', ' reg', ' began', ' re', ' begin'], probs=[0.6244507  0.309043   0.00685263 0.00403515 0.00258522]
Target token:  regain

Advait251206 · 2026-06-24T18:27:25Z

What you're seeing most likely stems from a tokenizer (vocabulary) difference between large-v3/turbo and earlier Whisper models, not from anything unusual in your audio.

Short answer

For transcription with Large-v3 and Turbo, you should generally use the same transcription prefix that Whisper itself uses:

<|startoftranscript|>
<|en|>
<|transcribe|>
<|notimestamps|>

However, if you're manually computing sequence probabilities by calling:

model.decoder(...)

the token IDs you hardcode may not map to the tokens you intend for these checkpoints, which would explain why you're seeing:

P(<|startoflm|>) ≈ 0.999998

at position 1.

Why is `<|startoflm|>` so probable?

The key observation is:

Input:
<|startoftranscript|>

Prediction:
<|en|>

which looks correct.

Then:

Input:
<|startoftranscript|> <|en|>

Prediction:
<|startoflm|>

instead of:

<|transcribe|>

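This suggests a mismatch between the token IDs the model was trained on and the labels your tokenizer assigns to them. large-v3 and turbo add one language token, which shifts every later special-token ID up by one, so the ID the model predicts here may be its own `<|transcribe|>`, which a 99-language tokenizer decodes as `<|startoflm|>`.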

In other words:

Your prefix:
<|startoftranscript|>
<|en|>
<|transcribe|>
<|notimestamps|>

may not exactly match the sequence used during training or generation.

Important distinction: decoder vs decode()

When Whisper runs normally:

model.decode(...)

or

model.transcribe(...)

it constructs the prompt using tokenizer logic and decoding rules.

When you directly call:

model.decoder(token_ids, audio_features)

you bypass that machinery and become responsible for reproducing the exact token sequence expected by the model.

This becomes particularly important for:

Large-v3
Turbo

because their vocabulary has 100 language tokens instead of 99, so their special-token IDs differ from those of older checkpoints.

Check what Whisper actually uses

I would inspect:

tokenizer = whisper.tokenizer.get_tokenizer(
    model.is_multilingual,
    num_languages=model.num_languages,  # 100 for large-v3/turbo; the default is 99
    language="en",
    task="transcribe"
)

and then examine:

tokenizer.sot_sequence

and related attributes.

For example:

print(tokenizer.sot_sequence)

You may find that the actual decoding prompt differs from the four-token prefix you're manually constructing.
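
One thing worth checking is whether the hardcoded IDs mean what you think under the large-v3/turbo vocabulary. The sketch below reconstructs the special-token layout as I understand it from openai-whisper's `tokenizer.py`: a 50257-entry multilingual base vocabulary, then `<|endoftext|>`, `<|startoftranscript|>`, one token per language, and finally the task and control tokens. The names `special_token_ids`, `name_of`, and `BASE_VOCAB_SIZE` are my own, and the layout is an assumption you should verify against your installed version.

```python
# Assumed layout, based on my reading of openai-whisper's tokenizer.py; verify locally.
BASE_VOCAB_SIZE = 50257  # multilingual BPE entries that precede the special tokens

CONTROL_TOKENS = [
    "<|translate|>",
    "<|transcribe|>",
    "<|startoflm|>",
    "<|startofprev|>",
    "<|nospeech|>",
    "<|notimestamps|>",
]


def special_token_ids(num_languages: int) -> dict[str, int]:
    """Map control-token names to IDs for a vocabulary with `num_languages` language tokens."""
    ids = {
        "<|endoftext|>": BASE_VOCAB_SIZE,
        "<|startoftranscript|>": BASE_VOCAB_SIZE + 1,
    }
    # Language tokens occupy the IDs between <|startoftranscript|> and the control tokens.
    first_control = BASE_VOCAB_SIZE + 2 + num_languages
    for offset, name in enumerate(CONTROL_TOKENS):
        ids[name] = first_control + offset
    return ids


def name_of(token_id: int, num_languages: int) -> str:
    """Reverse lookup; returns a placeholder for IDs that are not control tokens."""
    reverse = {v: k for k, v in special_token_ids(num_languages).items()}
    return reverse.get(token_id, "<other>")


# Compare how the same IDs are labelled under a 99- and a 100-language vocabulary.
for token_id in (50359, 50360, 50363):
    print(token_id, name_of(token_id, 99), name_of(token_id, 100))
```

If this layout holds, ID 50360 is `<|startoflm|>` in the 99-language vocabulary but `<|transcribe|>` in the 100-language one, and 50363 moves from `<|notimestamps|>` to `<|nospeech|>`. A tokenizer built with the default `num_languages` would then mislabel what the model is predicting.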

Another clue: your probabilities collapse

You observe:

Time step 3:
<|endoftext|> = 1.0

before the model has generated meaningful text.

That's a strong signal that the decoder believes the sequence is malformed.

If the prefix were correct, you'd expect:

They
The
I
We
...

to appear among the most likely next tokens.

Instead, the model appears to think:

sequence finished

or

invalid context

which often happens when a control token in the prefix is wrong or missing.

How I'd debug this

Compare the logits from:

model.decode(...)

with those from:

model.decoder(...)

for the same audio.

Specifically:

Run a normal decode.
Capture the exact token sequence being fed to the decoder.
Use that sequence in your probability computation.

That removes any ambiguity about special tokens.
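
A rough sketch of that comparison, assuming the `get_tokenizer` keyword `num_languages`, the model property `num_languages`, and the tokenizer attribute `sot_sequence_including_notimestamps` behave as they do in the openai-whisper version I've looked at. The helper names `reference_prefix` and `first_mismatch` are hypothetical.

```python
def reference_prefix(model, language="en"):
    """Build the no-timestamps transcription prefix using the model's own vocabulary size."""
    # Imported lazily so first_mismatch below can be used without whisper installed.
    from whisper.tokenizer import get_tokenizer

    tokenizer = get_tokenizer(
        model.is_multilingual,
        num_languages=model.num_languages,
        language=language,
        task="transcribe",
    )
    return list(tokenizer.sot_sequence_including_notimestamps)


def first_mismatch(reference_ids, manual_ids):
    """Return the first position where the two ID sequences differ, or None if identical."""
    for position, (ref, manual) in enumerate(zip(reference_ids, manual_ids)):
        if ref != manual:
            return position
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(reference_ids) != len(manual_ids):
        return min(len(reference_ids), len(manual_ids))
    return None
```

With a loaded model you would call `first_mismatch(reference_prefix(model), [50258, 50259, 50359, 50363])` and inspect the reported position before scoring any text tokens.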

About `<|startoflm|>`

As far as I know, `<|startoflm|>` is not part of the prompt Whisper builds for transcription, either in large-v3/turbo or in earlier models, so I wouldn't try to insert it by hand.

The fact that:

P(<|startoflm|>) ≈ 0.999998

implies the model strongly expects that token at that position under the context you've provided.

Rather than manually guessing where it belongs, I'd recommend extracting the actual prompt sequence generated by Whisper's tokenizer and decoding pipeline for those checkpoints.

Recommendation

Instead of hardcoding:

[50258, 50259, 50359, 50363]

use Whisper's tokenizer utilities to obtain the exact start-of-transcript sequence for the model you're evaluating.

If your goal is sequence likelihood evaluation, the safest approach is:

tokenizer = get_tokenizer(...)
prompt = tokenizer.sot_sequence

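and then append `<|notimestamps|>` if you're scoring without timestamps (`tokenizer.sot_sequence` stops at the task token), followed by your target text tokens exactly as Whisper would.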

The unusually high probability of <|startoflm|> is a strong indication that the manually specified prefix does not fully match the decoding prompt expected by Large-v3/Turbo, even though it works reasonably for earlier models such as base.
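
Once the prefix is right, the likelihood itself is a sum of per-token log-probabilities over the target tokens only. Here is a minimal pure-Python sketch of that bookkeeping, using the same shift as the snippet in the question (the logits at position t score the token at position t + 1). The function names are my own, and in practice you would do this with `torch.log_softmax` and a gather on the decoder output rather than Python lists.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_log_prob(step_logits, token_ids, prefix_len):
    """Sum log P(token_ids[t] | token_ids[:t]) over the target positions t >= prefix_len.

    step_logits[t] holds the logits produced after feeding token_ids[:t + 1],
    so it scores token_ids[t + 1]. The prefix tokens themselves are not scored.
    """
    total = 0.0
    for t in range(prefix_len, len(token_ids)):
        total += log_softmax(step_logits[t - 1])[token_ids[t]]
    return total
```

Leaving the prefix out of the sum keeps the score comparable across candidate transcripts, since every candidate shares the same special tokens.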

2 replies

belson17 Jun 25, 2026
Author

This reply reads like AI slop to me.

Advait251206 Jun 25, 2026

A little bit
