Summary
Under concurrent serving, kvarn_k4v2_g128 aggregate throughput saturates at ~80 tok/s regardless of batch size, while fp16 KV and another KV-quant backend scale to ~2000 tok/s on the same machine. Single-stream decode is fine (~44 tok/s), so this is specific to batched/concurrent serving.
Scaling curve (Qwen3-4B, warm runs, aggregate output tok/s)
| concurrency N | 1 | 8 | 16 | 32 | 64 | 128 |
| --- | --- | --- | --- | --- | --- | --- |
| kvarn_k4v2_g128 | 27 | 80 | 78 | 83 | 83 | 81 |
| fp16 KV | 66 | 362 | 663 | 1087 | (noisy) | 2172 |
KVarN flatlines from N=8 onward; fp16 scales ~linearly. A second KV-quant backend on the same box/flags also scaled to ~1950 tok/s at N=128, so this is not a WSL / pin_memory artifact.
Root cause (your own jit_monitor flags it)
During the concurrent burst the engine logs:
WARNING [jit_monitor] Triton kernel JIT compilation during inference:
_sinkhorn_log_kernel. This causes a latency spike;
consider extending warmup to cover this shape/config.
The running-set shape changes every decode step under dynamic batching, so _sinkhorn_log_kernel recompiles continuously. The scheduler never runs more than ~22 concurrent requests with 60+ waiting and GPU KV-cache usage at ~3% (so KV capacity is not the bottleneck):
Avg generation throughput: 1.5 tokens/s, Running: 15 reqs, Waiting: 53 reqs, GPU KV cache usage: 2.6%
Generation throughput oscillates between ~1.5 and ~277 tok/s as the kernel recompiles.
Warmup does not fix it
Running each concurrency level twice (a discarded warmup burst, then a measured burst) gives essentially the same throughput cold vs warm at every N (e.g. N=128: 76 cold vs 81 warm). So this is per-step recompilation driven by the changing running-set shape, not a one-time first-compile cost.
Env
RTX 4090 (WSL2, Ubuntu 24.04), Qwen3-4B fp16, vLLM dev build (v0.1.dev58+gd784952a9), flags: --enforce-eager --no-enable-prefix-caching --no-enable-chunked-prefill --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 256.
Suggested direction
Shape-specialize / bucket the Sinkhorn kernel over a fixed set of batch sizes (pad the running-set dimension to the next bucket) so the compiled kernel is reused across decode steps instead of recompiling each step. Happy to share the full probe script and per-N engine logs.
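A minimal sketch of the bucketing idea, assuming the recompiles are keyed on the running-set size. The bucket set, `next_bucket`, `distinct_shapes`, and the sample sizes below are all hypothetical, made up for illustration; none of them are vLLM or KVarN APIs, and the sizes are not taken from the engine logs.

```python
import bisect

# Hypothetical bucket set; real values would be tuned against --max-num-seqs.
BUCKETS = (1, 2, 4, 8, 16, 32, 64, 128, 256)


def next_bucket(n, buckets=BUCKETS):
    """Return the smallest bucket that is >= n (the padded running-set size)."""
    if n < 1:
        raise ValueError("running-set size must be positive")
    i = bisect.bisect_left(buckets, n)
    if i == len(buckets):
        raise ValueError(f"running-set size {n} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def distinct_shapes(running_sizes, bucketed):
    """Count how many distinct batch dimensions the kernel would be launched with."""
    return len({next_bucket(n) if bucketed else n for n in running_sizes})


# Illustrative running-set sizes over ten decode steps (invented, not from logs).
sizes = [15, 16, 17, 19, 22, 21, 18, 15, 9, 5]

print("distinct shapes, unpadded:", distinct_shapes(sizes, bucketed=False))
print("distinct shapes, bucketed:", distinct_shapes(sizes, bucketed=True))
```

With padding, the number of distinct shapes is bounded by the number of buckets, so a warmup pass over the bucket set could in principle cover every shape seen at serving time. The cost is wasted work on the padded rows, which is why the bucket spacing would need tuning.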