Whisper V3 Large Turbo – Words/Sec Capped at ~284? Bottleneck or Parallelism Limit? #2594
A constant throughput of ~284 words/sec regardless of concurrency or audio sampling rate usually means you're hitting a system bottleneck rather than a fundamental limit of the Whisper Turbo architecture itself.

Why this is probably not a model limitation

Whisper Large-v3 Turbo reduces the decoder from 32 layers to 4, which significantly reduces autoregressive decoding cost. If the model were the bottleneck, you would normally expect throughput to change when the load on the model changes, for example when concurrency goes up or down.
Seeing the same ~284 words/sec across different workloads often means some other stage is saturated.

Things to investigate

1. GPU utilization

First check GPU utilization with nvidia-smi dmon or similar monitoring while the benchmark is running.
If utilization is low while throughput is capped, the bottleneck is likely elsewhere.

2. Decoder serialization

Although Turbo has only 4 decoder layers, it is still autoregressive, meaning tokens are generated sequentially: each token depends on the tokens generated before it. You cannot fully parallelize token generation across time steps. Reducing decoder depth lowers latency per token, but it does not remove the sequential nature of decoding. So there is still a throughput ceiling compared to fully parallel encoder workloads.

3. vLLM effectiveness for Whisper

vLLM was originally designed for LLM serving. Whisper has a different workload profile, so some optimizations that work extremely well for text generation may provide smaller gains for speech models. It is worth checking how well requests are actually being batched.
If batching efficiency plateaus, throughput may also plateau.

4. Audio preprocessing bottlenecks

A surprising number of ASR pipelines spend time in audio preprocessing rather than inference itself. Profile each stage of the pipeline separately. You may find the decoder is not the dominant cost anymore.

5. Nginx and request scheduling

Because you're using nginx in front of vLLM, it's worth checking whether the load balancer or the request scheduling behind it is creating an artificial ceiling. A throughput plateau that remains constant across concurrency levels sometimes indicates the serving layer has become the bottleneck.

6. Words/sec may be misleading

Throughput metrics can hide what's actually happening. For example, words/sec can stay flat while other quantities change significantly. I'd recommend measuring several metrics rather than only words/sec.

What I'd expect theoretically

Large-v3 Turbo should generally scale better than Large-v3 because its decoder is much shallower, which lowers the cost of each decoding step.
However, once decoder cost is no longer dominant, another component becomes the bottleneck. At that point, adding concurrency won't increase throughput much.

My suspicion

A flat ceiling of ~284 words/sec independent of concurrency is more consistent with a bottleneck elsewhere in the system than with a hard architectural limit of Whisper Large-v3 Turbo itself.

To narrow it down, I'd be interested in the measurements suggested above, such as GPU utilization and per-stage timings. Those metrics would quickly reveal whether you're hitting a runtime bottleneck or the actual limits of the hardware.
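To make point 6 concrete, here is a minimal Python sketch of how a benchmark run could be summarized with more than one metric. The `RequestSample` record layout, the `summarize` helper, and the choice of metrics are my own assumptions for illustration, not anything vLLM or nginx reports out of the box. The two example runs are made-up numbers chosen so that words/sec is identical while latency differs.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestSample:
    """One completed transcription request (hypothetical record layout)."""
    start_s: float        # wall-clock time the request was sent
    end_s: float          # wall-clock time the response arrived
    audio_seconds: float  # duration of the submitted audio
    words: int            # number of words in the returned transcript


def summarize(samples: list[RequestSample]) -> dict[str, float]:
    """Aggregate one benchmark run into throughput and latency metrics."""
    if not samples:
        raise ValueError("need at least one sample")
    # Wall-clock span of the whole run, from first send to last response.
    wall = max(s.end_s for s in samples) - min(s.start_s for s in samples)
    if wall <= 0:
        raise ValueError("run must span a positive amount of time")
    latencies = [s.end_s - s.start_s for s in samples]
    return {
        "words_per_sec": sum(s.words for s in samples) / wall,
        # Seconds of audio transcribed per second of wall-clock time.
        "audio_sec_per_sec": sum(s.audio_seconds for s in samples) / wall,
        "requests_per_sec": len(samples) / wall,
        "median_latency_s": median(latencies),
    }


# Made-up run at concurrency 1: four requests handled back to back.
sequential_run = [
    RequestSample(start_s=float(i), end_s=float(i + 1), audio_seconds=30.0, words=100)
    for i in range(4)
]

# Made-up run at concurrency 4: all four sent at once, all finish at t=4.
concurrent_run = [
    RequestSample(start_s=0.0, end_s=4.0, audio_seconds=30.0, words=100)
    for _ in range(4)
]

if __name__ == "__main__":
    for name, run in [("sequential", sequential_run), ("concurrent", concurrent_run)]:
        print(name, summarize(run))
```

In this toy data both runs report the same words/sec, but median latency is four times higher in the concurrent run. That pattern, flat throughput with latency growing in step with concurrency, is what you would expect if requests are queueing behind a saturated stage rather than being processed in parallel.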
During a benchmarking run I have been doing, I found openai/whisper-large-v3-turbo showing some strange behaviour.
Irrespective of the concurrency or sampling rate of the audio, words/sec stayed constant at around 284.
Am I missing something?
I am using nginx as the load balancer.
I have deployed the model using vLLM.
The architecture only uses 4 decoder layers (compared to 32 in Whisper Large), so I expected higher parallelism, but it seems capped.
Is this a bottleneck, or a parallelism limit?
Would love to hear from others who’ve tried pushing this model to its limits.