kvarn_k4v2_g128 throughput does not scale with batch size (Sinkhorn kernel JIT-recompiles per step under dynamic batching)

## Summary

Under concurrent serving, `kvarn_k4v2_g128` aggregate throughput saturates at ~80 tok/s regardless of batch size, while fp16 KV and another KV-quant backend scale to ~2000 tok/s on the same machine. Single-stream decode is fine (~44 tok/s), so this is specific to batched/concurrent serving.

## Scaling curve (Qwen3-4B, warm runs, aggregate output tok/s)

| concurrency N | 1 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| **kvarn_k4v2_g128** | 27 | 80 | 78 | 83 | 83 | **81** |
| fp16 KV | 66 | 362 | 663 | 1087 | (noisy) | **2172** |

KVarN flatlines from N=8 onward, while fp16 keeps scaling with N (about 33x from N=1 to N=128, so sublinear but nowhere near saturated). A second KV-quant backend on the same box with the same flags also scaled to ~1950 tok/s at N=128, so this is not a WSL / `pin_memory` artifact.
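To make the gap concrete, here is a small sketch that turns the table above into speedups relative to N=1. The numbers are copied from the table; the helper name `speedups` is only for illustration, and N=64 is left out because the fp16 value there is reported as noisy.

```python
# Aggregate output tok/s from the scaling table (N=64 omitted: fp16 run was noisy).
concurrency = [1, 8, 16, 32, 128]
kvarn = [27, 80, 78, 83, 81]
fp16 = [66, 362, 663, 1087, 2172]


def speedups(tps):
    """Throughput at each concurrency level relative to the N=1 value."""
    base = tps[0]
    return [round(t / base, 1) for t in tps]


for name, tps in (("kvarn_k4v2_g128", kvarn), ("fp16 KV", fp16)):
    print(name, dict(zip(concurrency, speedups(tps))))
```

KVarN tops out at roughly 3x its N=1 throughput from N=8 onward, while fp16 reaches roughly 33x at N=128.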

## Root cause (your own jit_monitor flags it)

During the concurrent burst the engine logs:

```
WARNING [jit_monitor] Triton kernel JIT compilation during inference:
_sinkhorn_log_kernel. This causes a latency spike;
consider extending warmup to cover this shape/config.
```

The running-set shape changes every decode step under dynamic batching, so `_sinkhorn_log_kernel` recompiles continuously. The scheduler never runs more than ~22 concurrent requests, even with 60+ waiting and GPU KV-cache usage at ~3%, so KV capacity is not the bottleneck:

```
Avg generation throughput: 1.5 tokens/s, Running: 15 reqs, Waiting: 53 reqs, GPU KV cache usage: 2.6%
```

The reported generation throughput oscillates between ~1.5 and ~277 tok/s as the kernel recompiles.

## Warmup does not fix it

Running each concurrency level twice (a discarded warmup burst, then a measured burst) gives essentially the same throughput cold vs warm at every N (e.g. N=128: 76 cold vs 81 warm). So this is per-step recompilation driven by the changing running-set shape, not a one-time first-compile cost.

## Env

RTX 4090 (WSL2, Ubuntu 24.04), Qwen3-4B fp16, vLLM dev build (`v0.1.dev58+gd784952a9`), flags: `--enforce-eager --no-enable-prefix-caching --no-enable-chunked-prefill --gpu-memory-utilization 0.9 --max-model-len 4096 --max-num-seqs 256`.

## Suggested direction

Shape-specialize / bucket the Sinkhorn kernel over a fixed set of batch sizes (pad the running-set dimension to the next bucket) so the compiled kernel is reused across decode steps instead of recompiling each step. Happy to share the full probe script and per-N engine logs.
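A minimal sketch of the bucketing idea, in plain Python so it runs without the engine. Everything here is hypothetical: the bucket set, the helper names (`next_bucket`, `pad_rows`, `distinct_shapes`), and the sample running-set sizes are illustrative and not taken from the KVarN code or the logs. It only shows how padding to a fixed set of sizes reduces the number of distinct batch dimensions a shape-keyed kernel cache would see.

```python
from bisect import bisect_left

# Hypothetical bucket set; a real choice would be bounded by --max-num-seqs (256 here).
BUCKETS = (1, 2, 4, 8, 16, 32, 64, 128, 256)


def next_bucket(n, buckets=BUCKETS):
    """Return the smallest bucket that is >= n."""
    if n < 1:
        raise ValueError("running-set size must be >= 1")
    i = bisect_left(buckets, n)
    if i == len(buckets):
        raise ValueError(f"running-set size {n} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_rows(rows, pad_row):
    """Pad a list of per-request rows up to the next bucket size."""
    target = next_bucket(len(rows))
    return rows + [pad_row] * (target - len(rows))


def distinct_shapes(sizes, bucketed):
    """Count the distinct batch dimensions seen across decode steps."""
    return len({next_bucket(s) if bucketed else s for s in sizes})


# Made-up running-set sizes over consecutive decode steps (not from the logs).
steps = [15, 17, 22, 21, 19, 16, 14, 12, 9, 7]
print("unbucketed shapes:", distinct_shapes(steps, bucketed=False))
print("bucketed shapes:  ", distinct_shapes(steps, bucketed=True))
```

With these sample sizes the ten distinct batch dimensions collapse to three buckets (8, 16, 32). In a real kernel the padded rows would also need to be masked or discarded so they do not affect the outputs of live requests.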

