Whisper V3 Large Turbo – Words/Sec Capped at ~284? Bottleneck or Parallelism Limit? #2594

AryanSakhala · 2025-05-14T02:49:14Z


While benchmarking openai/whisper-large-v3-turbo, I noticed some strange behaviour.
Regardless of the concurrency or the sampling rate of the audio, throughput stayed constant at around 284 words/sec.

Am I missing something?
I am using nginx as the load balancer.
I have deployed the model using vLLM.

The architecture only uses 4 decoder layers (compared to 32 in Whisper Large), so I expected higher parallelism, but it seems capped.
Is this:

A limitation of the model architecture?

A runtime/framework bottleneck?

Or am I missing an optimization?

Would love to hear from others who’ve tried pushing this model to its limits.

Advait251206 · 2026-06-24T18:25:32Z


A constant throughput of ~284 words/sec regardless of concurrency or audio sampling rate usually suggests you're hitting a system bottleneck rather than a fundamental limit of the Whisper Turbo architecture itself.

Why this is probably not a model limitation

Whisper Large-v3 Turbo reduces the decoder from:

Large-v3: 32 decoder layers
Turbo:     4 decoder layers

which significantly reduces autoregressive decoding cost. The encoder is unchanged, so the encoding cost per audio window is the same as in Large-v3.

If the model were the bottleneck, you would normally expect throughput to change when:

concurrency increases,
audio length changes,
batch size changes,
hardware utilization changes.

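Seeing the same ~284 words/sec across different workloads often means some other stage is saturated. The sampling rate having no effect is expected on its own: Whisper operates on 16 kHz audio, so inputs at other rates are resampled before feature extraction and the model sees the same features either way.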

Things to investigate

1. GPU utilization

First check:

nvidia-smi dmon

or similar monitoring.

Questions:

Is GPU utilization near 100%?
Is memory bandwidth saturated?
Are Tensor Cores active?
Is the GPU spending time idle between requests?

If utilization is low while throughput is capped, the bottleneck is likely elsewhere.
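To log utilization alongside the benchmark instead of watching it by eye, something like the following could work. It is a minimal sketch that assumes `nvidia-smi` is on the PATH. The `--query-gpu` and `--format` options are standard `nvidia-smi` flags; the helper names (`parse_gpu_csv`, `sample_gpu`, `looks_idle`) and the 50% threshold are illustrative choices of mine, not part of any tool.

```python
import subprocess

# Fields requested from nvidia-smi; memory.used is reported in MiB.
QUERY = "utilization.gpu,utilization.memory,memory.used"


def parse_gpu_csv(text):
    """Parse the output of
    `nvidia-smi --query-gpu=<QUERY> --format=csv,noheader,nounits`
    into one dict per GPU line."""
    rows = []
    for line in text.strip().splitlines():
        gpu_util, mem_util, mem_used = (float(x) for x in line.split(","))
        rows.append({
            "gpu_util_pct": gpu_util,
            "mem_util_pct": mem_util,
            "mem_used_mib": mem_used,
        })
    return rows


def sample_gpu():
    """Take one utilization sample per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def looks_idle(samples, threshold_pct=50.0):
    """True if mean GPU utilization is below an (arbitrary) threshold."""
    mean = sum(s["gpu_util_pct"] for s in samples) / len(samples)
    return mean < threshold_pct
```

Calling `sample_gpu()` once per second while the load test runs, then passing the collected samples to `looks_idle`, gives a rough answer to the "is the GPU idle between requests?" question. Some GPUs report `[N/A]` for certain fields, in which case the parser above would need to handle that.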

2. Decoder serialization

Although Turbo has only 4 decoder layers, it is still:

autoregressive

meaning tokens are generated sequentially:

token_1
   ↓
token_2
   ↓
token_3

You cannot fully parallelize token generation across time steps.

Reducing decoder depth lowers latency per token, but it does not remove the sequential nature of decoding.

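So each sequence still has a ceiling of roughly one token per decoder step. Batching raises aggregate throughput, but it does not make any single sequence decode faster, unlike the encoder, which processes a whole audio window in one parallel pass.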
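As a back-of-the-envelope model of that ceiling: if every decoder step emits one token per sequence in the batch, aggregate tokens/sec is batch size divided by step latency. The sketch below encodes only that arithmetic. The step latencies and the 1.3 tokens-per-word ratio are made-up illustrative numbers, not measurements of Whisper Turbo.

```python
def decode_tokens_per_sec(batch_size, step_latency_ms):
    """Aggregate tokens/sec when each decoder step emits one token
    for every sequence in the batch."""
    return batch_size * 1000.0 / step_latency_ms


def words_per_sec(tokens_per_sec, tokens_per_word=1.3):
    """Convert tokens/sec to words/sec using an assumed tokens-per-word ratio."""
    return tokens_per_sec / tokens_per_word


# Hypothetical numbers: a single sequence at 5 ms per step, and a batch
# of 8 where each step takes slightly longer (6 ms).
single = decode_tokens_per_sec(batch_size=1, step_latency_ms=5.0)
batched = decode_tokens_per_sec(batch_size=8, step_latency_ms=6.0)
```

Under this model, throughput that does not move with concurrency would mean either the effective batch size is not growing or the step latency is growing in proportion to it. Both are worth checking before blaming the architecture.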

3. vLLM effectiveness for Whisper

vLLM was originally designed for LLM serving.

Whisper has a different workload profile:

Audio encoder
+
Autoregressive decoder

Some optimizations that work extremely well for text generation may provide smaller gains for speech models.

Questions worth checking:

Are requests actually being batched together?
Is continuous batching active?
Are encoder outputs being reused efficiently?

If batching efficiency plateaus, throughput may also plateau.
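One way to see whether requests are actually being batched is to read the server's Prometheus metrics while the load test runs. The sketch below assumes the vLLM OpenAI-compatible server exposes a `/metrics` endpoint with gauges named `vllm:num_requests_running` and `vllm:num_requests_waiting`; please verify the names against your vLLM version. The URL and the helper names are placeholders.

```python
from urllib.request import urlopen

# Assumed gauge names; check them against the /metrics output of your server.
GAUGES = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def read_gauges(metrics_text, names=GAUGES):
    """Extract selected metrics from Prometheus text exposition format,
    summing across label sets."""
    found = {}
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]
        if metric in names:
            found[metric] = found.get(metric, 0.0) + float(value)
    return found


def fetch_gauges(url="http://localhost:8000/metrics"):
    """Fetch and parse the metrics page (placeholder URL)."""
    with urlopen(url, timeout=5) as resp:
        return read_gauges(resp.read().decode("utf-8"))
```

If the running count stays near 1 while the waiting count grows as you add clients, requests are being serialized rather than batched, which would produce exactly this kind of flat ceiling.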

4. Audio preprocessing bottlenecks

A surprising number of ASR pipelines spend time in:

audio decoding
resampling
feature extraction
mel spectrogram generation

rather than inference itself.

Profile:

Audio preprocessing
Encoder
Decoder
Post-processing

separately.

You may find the decoder is not the dominant cost anymore.
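A small timer like the one below is enough to get a first per-stage breakdown. This is a generic sketch and the stage names are whatever you choose to pass in. When the model is served through vLLM, the encoder and decoder run inside the server, so client-side timing can only separate preprocessing, the request round trip, and post-processing; splitting encoder from decoder needs server-side profiling.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Usage would look like `with timer.stage("preprocess"): ...` around audio loading and resampling, and `with timer.stage("request"): ...` around the HTTP call, followed by `timer.shares()` at the end of the run.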

5. Nginx and request scheduling

Because you're using:

Client
 ↓
Nginx
 ↓
vLLM
 ↓
GPU

it's worth checking whether:

connection limits
request buffering
worker counts
queueing delays

are creating an artificial ceiling.

A throughput plateau that remains constant across concurrency levels sometimes indicates the serving layer has become the bottleneck.
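Little's law gives a quick consistency check here: for a client that keeps a fixed number of requests in flight, mean latency is roughly concurrency divided by requests/sec. If requests/sec is flat while concurrency rises, latency must be rising in proportion, which means the extra requests are sitting in a queue somewhere. A minimal sketch follows; the 5% tolerance and the function names are my own choices.

```python
def implied_latency_s(concurrency, requests_per_sec):
    """Mean latency implied by Little's law for a closed-loop client."""
    return concurrency / requests_per_sec


def throughput_is_flat(runs, tolerance=0.05):
    """runs: list of (concurrency, requests_per_sec) pairs.
    True if requests/sec varies by no more than `tolerance` (relative)."""
    rates = [rps for _, rps in runs]
    return (max(rates) - min(rates)) / max(rates) <= tolerance
```

Comparing `implied_latency_s` against the latency measured at the client, and against the latency logged by nginx and by vLLM, shows which layer the queueing time accumulates in.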

6. Words/sec may be misleading

Throughput metrics can hide what's actually happening.

For example:

Words/sec ≈ constant

while:

Tokens/sec
Requests/sec
GPU utilization
Latency

change significantly.

I'd recommend measuring:

Encoder ms/audio-second
Decoder tokens/sec
Requests/sec
GPU utilization
Queue time

rather than only words/sec.
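The client-side metrics in that list can be derived from per-request records, as in the sketch below. The `RequestRecord` fields are hypothetical and assume the benchmark client logs audio duration, output token count, output word count, and latency for each request. Encoder time and queue time are not included because they have to come from the server.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    audio_seconds: float   # duration of the input audio
    output_tokens: int     # tokens in the transcript
    output_words: int      # whitespace-separated words in the transcript
    latency_s: float       # client-observed request latency


def summarize(records, wall_clock_s):
    """Aggregate throughput metrics over one benchmark run."""
    n = len(records)
    tokens = sum(r.output_tokens for r in records)
    words = sum(r.output_words for r in records)
    audio = sum(r.audio_seconds for r in records)
    return {
        "requests_per_sec": n / wall_clock_s,
        "words_per_sec": words / wall_clock_s,
        "tokens_per_sec": tokens / wall_clock_s,
        # Seconds of audio transcribed per second of wall-clock time.
        "audio_sec_per_sec": audio / wall_clock_s,
        "mean_latency_s": sum(r.latency_s for r in records) / n,
        "tokens_per_word": tokens / words,
    }
```

Reporting `audio_sec_per_sec` next to words/sec is useful because words/sec depends on how densely people speak in the test audio, while audio-seconds per second does not.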

What I'd expect theoretically

Large-v3 Turbo should generally scale better than Large-v3 because:

Decoder depth is much smaller.
Decoder FLOPs are reduced.
Memory traffic is lower.

However, once decoder cost is no longer dominant, another component becomes the bottleneck:

Encoder
Feature extraction
Batching
Serving framework
Memory bandwidth
Request scheduling

At that point, adding concurrency won't increase throughput much.
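Put differently, the sustained throughput of a pipeline is bounded by its slowest stage. The sketch below just picks the minimum of per-stage capacities, and the capacities in the example are invented numbers for illustration, not measurements.

```python
def pipeline_bottleneck(stage_capacity):
    """stage_capacity maps stage name -> audio-seconds per second that the
    stage can sustain in isolation. Returns (slowest stage, its capacity)."""
    name = min(stage_capacity, key=stage_capacity.get)
    return name, stage_capacity[name]


# Hypothetical capacities: the decoder is fast, but preprocessing caps the pipeline.
example = {"preprocess": 120.0, "encoder": 300.0, "decoder": 900.0}
stage, cap = pipeline_bottleneck(example)
```

In this made-up example, making the decoder even faster changes nothing, which is the same pattern as a throughput number that refuses to move when concurrency goes up.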

My suspicion

A flat ceiling around ~284 words/sec independent of concurrency is more consistent with:

A serving/runtime bottleneck (vLLM scheduling, batching, or queueing),
A preprocessing bottleneck,
GPU saturation in a component other than the decoder,

than with a hard architectural limit of Whisper Large-v3 Turbo itself.

To narrow it down, I'd be interested in:

GPU model
Batch size
Average audio duration
vLLM version
GPU utilization (%)
Decoder tokens/sec
Requests/sec

Those metrics would quickly reveal whether you're hitting a runtime bottleneck or the actual limits of the hardware.
