feat(qwen3): vLLM-prefill P/D — cross-engine KV compat, verified end-to-end by xiaguan · Pull Request #540 · openinfer-project/openinfer

xiaguan · 2026-07-03T18:56:54Z

What

Qwen3 decode nodes can now consume KV prefilled by a vLLM + PegaKVConnector peer over the pegaflow P2P mesh: vLLM prefills and seals blocks under its own prefix-cache hashes; openinfer-D derives byte-identical keys, discovers the blocks via the MetaServer, RDMA-fetches and restores them, and decodes — zero vLLM/connector changes.

openinfer-kv-offload/src/vllm_hash.rs — VllmBlockHasher: replicates vLLM's xxhash_cbor block-hash chain (hand-rolled canonical CBOR + xxh3_128, NONE_HASH from PYTHONHASHSEED). Golden vectors captured from a real vLLM process.
D-side wait window — a cold request whose query misses re-queries (5ms/req throttle) until the producer's registration lands (--kv-pd-miss-wait-ms, default 5000), bridging the P→D save/publish race. remote_fetch.rs decision cascade extended with wait_on_miss; 12 unit tests cover the branch matrix.
Unhappy path is loud and bounded (toxic-review hardening): startup fingerprint log (seed/namespace/block_size/NONE_HASH/KV geometry — diff it against the P side), warn on every exhausted miss window, and a miss breaker — after 3 consecutive exhausted windows (or 15s hard timeouts) new requests skip the wait until a remote hit re-arms it. Flags validate at parse time (seed must be decimal u32, namespace 8-hex); miss-wait >= 15s deadline is rejected at startup.
Server flags — --kv-pd-vllm-seed, --kv-pd-vllm-namespace, --kv-pd-miss-wait-ms (all gated on the P2P mesh flags).
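The parse-time rules above (seed is a decimal u32, namespace is 8 hex characters, miss-wait stays below the 15 s deadline) can be sketched as follows. This is an illustration in Python, not the server's actual flag parser; the function names are hypothetical, and accepting both upper- and lower-case hex is an assumption.

```python
import re

U32_MAX = 2**32 - 1
REMOTE_FETCH_DEADLINE_MS = 15_000  # the 15 s hard deadline named in this PR


def parse_seed(raw: str) -> int:
    """Accept only a non-empty decimal string that fits in a u32."""
    if not re.fullmatch(r"[0-9]+", raw):
        raise ValueError(f"seed must be a decimal u32, got {raw!r}")
    value = int(raw)
    if value > U32_MAX:
        raise ValueError(f"seed {value} does not fit in a u32")
    return value


def parse_namespace(raw: str) -> str:
    """Accept exactly 8 hex characters (case-insensitivity is an assumption)."""
    if not re.fullmatch(r"[0-9a-fA-F]{8}", raw):
        raise ValueError(f"namespace must be 8 hex characters, got {raw!r}")
    return raw


def validate_miss_wait_ms(ms: int) -> int:
    """Reject a miss window that would reach or exceed the hard deadline."""
    if ms < 0 or ms >= REMOTE_FETCH_DEADLINE_MS:
        raise ValueError(f"miss-wait must be in [0, {REMOTE_FETCH_DEADLINE_MS}) ms")
    return ms
```

Rejecting an empty seed up front matters because, as one of the commit messages notes, an empty seed is a well-formed key space that can never match.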

Compatibility contract (verified on hardware)

P's blocks are loadable by D iff: same namespace (an 8-hex digest that does not include block size), same block granularity + hash scheme (P must run --block-size <D's page size> and --prefix-caching-hash-algo xxhash_cbor with a pinned PYTHONHASHSEED), equal per-layer slot counts (36 for Qwen3-8B on both sides), and compatible per-slot K/V segment layout — the last is absorbed by pegaflow novitalabs/pegaflow#382 (contiguous device layouts loading split-KV-sealed blocks). Full derivation in docs/models/glm52/pd-vllm-prefill.md §3.1.
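One way to read the contract is as a field-by-field comparison of the two sides' fingerprints. The sketch below is illustrative only: the `KvFingerprint` type, its field names, and the example values (including the namespace) are hypothetical, and the per-slot K/V segment layout condition is left out because the PR delegates it to pegaflow.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class KvFingerprint:
    """Hypothetical record of the settings both sides must agree on."""
    namespace: str        # 8-hex connector namespace digest
    block_size: int       # P's --block-size vs. D's page size
    hash_algo: str        # must be xxhash_cbor on P
    hash_seed: int        # pinned PYTHONHASHSEED
    slots_per_layer: int  # per-layer slot count (36 for Qwen3-8B per this PR)


def mismatches(p: KvFingerprint, d: KvFingerprint) -> list[str]:
    """Return the names of the fields on which P and D disagree."""
    return [f.name for f in fields(KvFingerprint)
            if getattr(p, f.name) != getattr(d, f.name)]


# Example values are made up for illustration.
d_side = KvFingerprint("0a1b2c3d", 16, "xxhash_cbor", 0, 36)
p_side = replace(d_side, block_size=32)
print(mismatches(p_side, d_side))  # ['block_size']
```

An empty result corresponds to the "loadable" case; any non-empty result names the setting to fix on one side.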

Evidence (jz node 34, H200 ×8: vLLM-P on GPU0, openinfer-D on GPU1)

Correctness — 3-prompt greedy smoke (short / cross-block / block-aligned) via the P/D router vs direct-D baseline: all outputs byte-identical. Content-addressed delta reuse confirmed: the aligned prompt (sharing a 480-token prefix with the previous one) fetched exactly 1 incremental block.
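The delta-reuse result follows from parent-chained, content-addressed block keys: two prompts that share a block-aligned prefix produce identical keys for the shared blocks. The sketch below illustrates the idea with SHA-256 as a stand-in; it is not vLLM's xxhash_cbor scheme (canonical CBOR + xxh3_128), and the block size of 16 and the prompt lengths are assumptions chosen for the example.

```python
import hashlib


def block_keys(tokens: list[int], block_size: int, seed: bytes = b"seed") -> list[bytes]:
    """Chain-hash every full block; each key depends on all preceding blocks."""
    parent = hashlib.sha256(seed).digest()  # stand-in for NONE_HASH
    keys = []
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"".join(t.to_bytes(4, "little") for t in block)
        parent = hashlib.sha256(payload).digest()
        keys.append(parent)
    return keys


BLOCK = 16                                   # assumed page size
shared = list(range(480))                    # 480-token shared prefix = 30 blocks
previous = shared + [1000 + i for i in range(7)]   # unaligned tail, not hashed
aligned = shared + [2000 + i for i in range(16)]   # exactly one extra full block

already_sealed = set(block_keys(previous, BLOCK))
to_fetch = [k for k in block_keys(aligned, BLOCK) if k not in already_sealed]
print(len(to_fetch))  # 1
```

Because each key folds in its parent, a block is only reusable when every block before it matches as well, which is why the shared prefix has to be block-aligned to be found.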

TTFT (unique-prefix cold runs, median of 3; overhead = PD via router − P direct):

prompt tokens	P direct	P/D	handoff overhead
~470	34.1 ms	47.7 ms	+14 ms
~1790	54.9 ms	105.9 ms	+51 ms
~7070	203.9 ms	351.0 ms	+147 ms
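As a quick check of the table, the overhead column is the P/D median minus the P-direct median, rounded to the nearest millisecond. The snippet only re-derives the numbers already shown above.

```python
# (prompt tokens, P direct ms, P/D ms) copied from the table above
rows = [(470, 34.1, 47.7), (1790, 54.9, 105.9), (7070, 203.9, 351.0)]


def handoff_overhead_ms(p_direct_ms: float, pd_ms: float) -> float:
    """Overhead as defined in the PR: P/D via router minus P direct."""
    return pd_ms - p_direct_ms


overheads = [round(handoff_overhead_ms(p, pd)) for _, p, pd in rows]
print(overheads)  # [14, 51, 147]
```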

The 7k overhead is dominated by physically moving ~1 GiB of KV (RDMA fetch alone: 992 MiB in 41.8 ms, ≈23.2 GiB/s); pipelining P-side save / discovery / H2D is follow-up territory.

Failure injection — P killed, 4 sequential cold requests: first 3 wait the 5s window (one warn each), breaker opens, 4th fails over in 38 ms; P restored, one routed hit re-arms waiting.
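The sequence observed here matches a simple consecutive-miss breaker. The sketch below is a minimal model of that behaviour, assuming a threshold of 3 as stated in this PR; the class and method names are hypothetical and do not come from the codebase.

```python
class MissBreaker:
    """Skip the miss wait after `threshold` consecutive exhausted windows."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_exhausted = 0

    def should_wait(self) -> bool:
        # Breaker is closed (waiting allowed) while below the threshold.
        return self.consecutive_exhausted < self.threshold

    def record_exhausted_window(self) -> None:
        self.consecutive_exhausted += 1

    def record_remote_hit(self) -> None:
        # Any remote hit re-arms waiting.
        self.consecutive_exhausted = 0


breaker = MissBreaker()
waited = []
for _ in range(4):                 # P is down: every cold request misses
    if breaker.should_wait():
        waited.append(True)        # request sits out the full miss window
        breaker.record_exhausted_window()
    else:
        waited.append(False)       # breaker open: fail over immediately
print(waited)  # [True, True, True, False]

breaker.record_remote_hit()        # P restored, one routed hit
print(breaker.should_wait())       # True
```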

Not in this PR (tracked in the design doc)

Tail-block connector extension + router t1-forwarding (D currently computes the ≤ block_size partial tail locally).
Strict no-prefill mode (429/500 on miss instead of scratch fallback).
Namespace↔model-identity canary handshake (the namespace digest carries no model identity; the startup fingerprint is the current guard).

Dependencies

Runtime-correct only with pegaflow ≥ #382; builds fine against the currently pinned rev (the fix is load-path-only). I'll bump the pegaflow-core rev here once #382 merges.

Test plan

cargo test --release -p openinfer-kv-offload -p openinfer-qwen3 --lib — golden vectors + remote-fetch branch matrix, all green.
e2e smoke + TTFT + failure injection on node 34 as above (/data/pd-stack/{stack,smoke,failtest}.sh, ttft.py).

🤖 Generated with Claude Code

GLM prefill rides vLLM (openinfer glm52 kernel surface is decode-only); P->D readiness = D-side bounded fast-poll over pegaflow CPU P2P; cross-engine keys via a vLLM-hash-compat provider (xxhash_cbor + pinned PYTHONHASHSEED); tail-block connector extension + router t1 forwarding planned. Roadmap: qwen3 vLLM-P + openinfer-D smoke test first.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t window

Decode-node support for a vLLM prefill peer over pegaflow CPU P2P:

- openinfer-kv-offload::VllmBlockHasher replicates vLLM's xxhash_cbor prefix-cache hashing (canonical CBOR + xxh3_128, NONE_HASH from PYTHONHASHSEED, parent-chained full blocks + derivable tail keys); golden vectors captured from a real vLLM process.
- Qwen3VllmCompatOptions switches cold-request offload queries to the vLLM key chain over the same [gpu_hit..cacheable) window (local GPU naming stays kvbm), joins the P side's connector namespace, skips LoRA requests (unreplicated extra_keys salting), and disables self-saves (kvbm keys are unfindable in a vLLM-keyed domain).
- remote_fetch_action grows wait_on_miss: during the P/D handoff race a zero hit means the producer's registration hasn't landed yet, so the request stays parked until a bounded miss deadline instead of degrading to a local prefill.
- Server flags: --kv-pd-vllm-seed / --kv-pd-vllm-namespace / --kv-pd-miss-wait-ms.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
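The wait_on_miss branch described in this commit can be pictured as a three-way decision. The sketch below is a simplified illustration, not the real remote_fetch_action cascade (which has more branches, such as a Loading answer); the function name, parameters, and `Action` enum are hypothetical.

```python
from enum import Enum


class Action(Enum):
    FETCH = "fetch"                  # remote blocks found: fetch and restore
    PARK = "park"                    # zero hit, still inside the miss window
    LOCAL_PREFILL = "local_prefill"  # give up and prefill from scratch


def sketch_remote_fetch_action(hit_blocks: int, waited_ms: int,
                               miss_wait_ms: int, wait_on_miss: bool) -> Action:
    """Decide what a cold request does after one offload query."""
    if hit_blocks > 0:
        return Action.FETCH
    # A zero hit during the handoff race may only mean the producer's
    # registration has not landed yet, so keep the request parked.
    if wait_on_miss and waited_ms < miss_wait_ms:
        return Action.PARK
    return Action.LOCAL_PREFILL


print(sketch_remote_fetch_action(0, 120, 5000, True))    # Action.PARK
print(sketch_remote_fetch_action(0, 5000, 5000, True))   # Action.LOCAL_PREFILL
```

Keeping the deadline check in the same function as the hit check makes the fallback bounded by construction: a request can never stay parked past `miss_wait_ms`.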

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…bility contract

qwen3 vLLM-P + openinfer-D smoke passed on node 34: byte-identical outputs, delta reuse, TTFT overhead +14ms/+51ms/+147ms at 0.5k/1.8k/7k tokens. Distill the verified compatibility equation (namespace factors, block-size == D page size, per-layer slot counts, K/V segment layout absorbed by pegaflow#382) into section 3.1 and mark the remaining milestone-1 gaps (tail-block extension, strict no-prefill mode).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… path

Toxic-review findings on the vLLM-compat branch: every failure mode of the P/D handoff (wrong seed, wrong namespace, block-size mismatch, peer down) degraded into silent scratch prefills that read as "network is a bit slow". Address the unhappy path:

- Warn when a request exhausts the zero-hit wait window (the single symptom all misconfigurations share), with blocks queried and time waited.
- Log a startup fingerprint (seed, namespace, block size, NONE_HASH, KV geometry) an operator can diff against the vLLM peer's config.
- Miss breaker: after 3 consecutive exhausted windows, new requests skip the wait, so a dead or misconfigured peer no longer taxes every cold request the full window. Any remote hit re-arms waiting.
- Throttle parked re-queries to one MetaServer RPC per 5ms per request so a miss storm cannot turn scheduler ticks into serial RPC pumps (the throttle the design doc promised but the code never had).
- Reject miss-wait >= the 15s remote-fetch deadline at startup instead of silently capping it; validate seed (decimal u32) and namespace (8-hex) at flag parse time, since an empty seed is a well-formed key space that can never match.
- Drop the speculative include_tail parameter from key_chain (the tail golden vector stays covered via hash_block) and take the query window from the probe without materializing kvbm hashes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
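The per-request re-query throttle mentioned in this commit can be sketched as a minimum-interval gate. This is an illustration under assumed names (`RequeryThrottle`, `try_acquire`) and an assumed 1 ms scheduler tick, not the code in the PR.

```python
from typing import Optional


class RequeryThrottle:
    """Allow at most one MetaServer re-query per `min_interval_ms` for one request."""

    def __init__(self, min_interval_ms: int = 5) -> None:
        self.min_interval_ms = min_interval_ms
        self.last_query_ms: Optional[int] = None

    def try_acquire(self, now_ms: int) -> bool:
        """Return True if a re-query may be issued at `now_ms`."""
        if (self.last_query_ms is not None
                and now_ms - self.last_query_ms < self.min_interval_ms):
            return False
        self.last_query_ms = now_ms
        return True


throttle = RequeryThrottle()
# A parked request polled on every tick of a hypothetical 1 ms scheduler loop.
rpcs = sum(throttle.try_acquire(now_ms) for now_ms in range(20))
print(rpcs)  # 4 (queries issued at t = 0, 5, 10, 15 ms)
```

With the gate in place, the RPC rate per parked request depends on elapsed time rather than on how often the scheduler ticks.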

Failure injection (P down, 4 sequential cold requests) showed the 4th request still waiting the full miss window after the breaker opened: a transient Loading answer on the first query parks a request past the breaker gate, and the poll path recomputed wait-on-miss from the deadline alone. Make the poll consult the breaker too, and only emit the degradation warning (and count a miss window) when the request actually waited the window out, so breaker-shortened requests do not spam the log the breaker warning already covers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…tone table Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A Loading-stuck peer rides requests to the hard deadline instead of the miss window; without counting those, the breaker never opens and every cold request pays the full 15s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

xiaguan and others added 8 commits July 3, 2026 12:21

chore(kv-offload): commit lockfile entry for xxhash-rust

410fbbe

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

docs(glm52): reflect miss-path observability and breaker in the miles…

9c43d66

…tone table Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

xiaguan mentioned this pull request Jul 4, 2026

Make the qwen3 KV page size configurable (currently hardcoded to 16) #545

Open


xiaguan wants to merge 8 commits into main from feat/pd-vllm-hash-compat
