Skip to content

kvarn_k4v2_g128 throughput does not scale with batch size (Sinkhorn kernel JIT-recompiles per step under dynamic batching) #15

Description

@sztlink

Summary

Under concurrent serving, kvarn_k4v2_g128 aggregate throughput saturates at ~80 tok/s regardless of batch size, while fp16 KV and another KV-quant backend scale to ~2000 tok/s on the same machine. Single-stream decode is fine (~44 tok/s), so this is specific to batched/concurrent serving.

Scaling curve (Qwen3-4B, warm runs, aggregate output tok/s)

concurrency N 1 8 16 32 64 128
kvarn_k4v2_g128 27 80 78 83 83 81
fp16 KV 66 362 663 1087 (noisy) 2172

KVarN flatlines from N=8 onward; fp16 scales ~linearly. A second KV-quant backend on the same box/flags also scaled to ~1950 tok/s at N=128, so this is not a WSL / pin_memory artifact.

Root cause (your own jit_monitor flags it)

During the concurrent burst the engine logs:

WARNING [jit_monitor] Triton kernel JIT compilation during inference:
_sinkhorn_log_kernel. This causes a latency spike;
consider extending warmup to cover this shape/config.

The running-set shape changes every decode step under dynamic batching, so _sinkhorn_log_kernel recompiles continuously. The scheduler never runs more than ~22 concurrent requests with 60+ waiting and GPU KV-cache usage at ~3% (so KV capacity is not the bottleneck):

Avg generation throughput: 1.5 tokens/s, Running: 15 reqs, Waiting: 53 reqs, GPU KV cache usage: 2.6%

generation throughput oscillates between ~1.5 and ~277 tok/s as the kernel recompiles.

Warmup does not fix it

Running each concurrency level twice (a discarded warmup burst, then a measured burst) gives essentially the same throughput cold vs warm at every N (e.g. N=128: 76 cold vs 81 warm). So this is per-step recompilation driven by the changing running-set shape, not a one-time first-compile cost.

Env

RTX 4090 (WSL2, Ubuntu 24.04), Qwen3-4B fp16, vLLM dev build (v0.1.dev58+gd784952a9), flags: --enforce-eager --no-enable-prefix-caching --no-enable-chunked-prefill --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 256.

Suggested direction

Shape-specialize / bucket the Sinkhorn kernel over a fixed set of batch sizes (pad the running-set dimension to the next bucket) so the compiled kernel is reused across decode steps instead of recompiling each step. Happy to share the full probe script and per-N engine logs.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions