From f02dc9d097c524b40aa936d73cf8217594fa948b Mon Sep 17 00:00:00 2001 From: Felipe Sztutman Date: Thu, 11 Jun 2026 07:10:46 -0300 Subject: [PATCH 1/2] docs: add Running on WSL2 section (residency limit, spill diagnostic, UVA) --- README.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/README.md b/README.md index 08e70b42..80561aaf 100644 --- a/README.md +++ b/README.md @@ -162,6 +162,20 @@ once all of its tokens are accepted, so rejected draft tokens never corrupt hist --- +### Running on WSL2 + +KVarN works on WSL2, with two caveats. + +**1. Keep VRAM allocation under the residency limit.** On WSL2 (WDDM), allocating past the GPU's resident limit does not fail: Windows silently spills pages to system RAM, and identical CUDA kernels run at PCIe speed. Since KVarN adds an fp16 tail pool on top of weights and KV cache, the default `--gpu-memory-utilization 0.90` can cross that line on consumer cards. On a 24 GB RTX 4090 with Qwen2.5-7B-Instruct, `0.80` keeps everything resident and restores throughput (692 to 3523 tok/s at 128 concurrent decodes, quality unchanged; `0.85` still spilled). + +Diagnostic: if `nvidia-smi` shows 100% GPU utilization but power draw is near idle (about 100 W on a 4090), the cache has spilled. Power draw is the reliable signal; utilization is not. + +**2. UVA must be enabled.** vLLM disables UVA on WSL2 by policy, which prevents the engine from starting. Workaround: make `is_uva_available()` in `vllm/utils/platform_utils.py` return `True` on WSL2 (UVA itself works fine there). Re-apply after pulling. + +Background and measurements: [issue #15](https://github.com/huawei-csl/KVarN/issues/15). + +*Validated on WSL 2.7.3 (kernel 6.6.114), Windows 11 build 26100, NVIDIA driver 595.79. The residency limit and spill behavior are properties of the WDDM paravirtualization stack and may shift across driver versions.* + ## How does KVarN work?

From 7a5722212b28ff8fd3878514aca2bad3f33acaf1 Mon Sep 17 00:00:00 2001 From: Felipe Sztutman Date: Fri, 12 Jun 2026 08:30:30 -0300 Subject: [PATCH 2/2] docs(wsl2): generalize section, drop vendor/card references per review Move the agency to the WDDM paravirtualization layer (the actual source of the residency limit and spill), express the spill diagnostic as the utilization/power divergence rather than a card-specific wattage, and keep the 24 GB GPU + driver only as a validation data point. Add a live-monitor command. --- README.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 80561aaf..38416326 100644 --- a/README.md +++ b/README.md @@ -164,17 +164,23 @@ once all of its tokens are accepted, so rejected draft tokens never corrupt hist ### Running on WSL2 -KVarN works on WSL2, with two caveats. +KVarN works on WSL2, with two caveats. Both come from system layers (the Windows graphics virtualization model and a vLLM policy), not from any particular hardware. -**1. Keep VRAM allocation under the residency limit.** On WSL2 (WDDM), allocating past the GPU's resident limit does not fail: Windows silently spills pages to system RAM, and identical CUDA kernels run at PCIe speed. Since KVarN adds an fp16 tail pool on top of weights and KV cache, the default `--gpu-memory-utilization 0.90` can cross that line on consumer cards. On a 24 GB RTX 4090 with Qwen2.5-7B-Instruct, `0.80` keeps everything resident and restores throughput (692 to 3523 tok/s at 128 concurrent decodes, quality unchanged; `0.85` still spilled). +**1. Keep VRAM allocation under the resident limit.** Under WSL2 the GPU runs through the Windows display driver model (WDDM), which paravirtualizes graphics memory. Allocating past the resident limit does not fail: the system silently spills pages to host RAM, and identical CUDA kernels then run at PCIe speed instead of VRAM speed. KVarN adds an fp16 tail pool on top of weights and KV cache, so a `--gpu-memory-utilization` that fits a plain model can cross the resident limit once KVarN is active. Lower it until the whole working set stays resident. -Diagnostic: if `nvidia-smi` shows 100% GPU utilization but power draw is near idle (about 100 W on a 4090), the cache has spilled. Power draw is the reliable signal; utilization is not. +How far to lower it depends on total VRAM and the model footprint, so locate the edge by measurement. On a 24 GB GPU running Qwen2.5-7B-Instruct, `0.85` still spilled while `0.80` kept everything resident, recovering throughput from 692 to 3523 tok/s at 128 concurrent decodes with quality unchanged. -**2. UVA must be enabled.** vLLM disables UVA on WSL2 by policy, which prevents the engine from starting. Workaround: make `is_uva_available()` in `vllm/utils/platform_utils.py` return `True` on WSL2 (UVA itself works fine there). Re-apply after pulling. +Diagnostic, the behavioral signature of a spill: `nvidia-smi` shows GPU utilization pinned near 100% while power draw sits near idle. The two signals diverge because the kernels are stalled on the PCIe path rather than doing resident work. Power draw is the reliable signal here; utilization is not. Read the divergence between the two, not a fixed wattage, since the idle baseline varies by device. To watch both signals live: + +``` +watch -n 1 'nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader' +``` + +**2. UVA must be enabled.** vLLM disables UVA on WSL2 by policy, which prevents the engine from starting. UVA itself works fine under WSL2, so the fix is to make `is_uva_available()` in `vllm/utils/platform_utils.py` return `True` on WSL2. Re-apply after pulling. Background and measurements: [issue #15](https://github.com/huawei-csl/KVarN/issues/15). -*Validated on WSL 2.7.3 (kernel 6.6.114), Windows 11 build 26100, NVIDIA driver 595.79. The residency limit and spill behavior are properties of the WDDM paravirtualization stack and may shift across driver versions.* +*Validated on WSL 2.7.3 (kernel 6.6.114), Windows 11 build 26100. The resident limit and silent spill are properties of the WDDM paravirtualization stack, not of any GPU model, and may shift across driver and Windows versions.* ## How does KVarN work?