feat(qwen3): vLLM-prefill P/D — cross-engine KV compat, verified end-to-end#540

Open
xiaguan wants to merge 8 commits into main from feat/pd-vllm-hash-compat

xiaguan commented Jul 3, 2026

Collaborator

What

Qwen3 decode nodes can now consume KV prefilled by a vLLM + PegaKVConnector peer over the pegaflow P2P mesh: vLLM prefills and seals blocks under its own prefix-cache hashes; openinfer-D derives byte-identical keys, discovers the blocks via the MetaServer, RDMA-fetches and restores them, and decodes — zero vLLM/connector changes.

  • openinfer-kv-offload/src/vllm_hash.rs (VllmBlockHasher): replicates vLLM's xxhash_cbor block-hash chain (hand-rolled canonical CBOR + xxh3_128, NONE_HASH from PYTHONHASHSEED). Golden vectors captured from a real vLLM process.
  • D-side wait window — a cold request whose query misses re-queries (5ms/req throttle) until the producer's registration lands (--kv-pd-miss-wait-ms, default 5000), bridging the P→D save/publish race. remote_fetch.rs decision cascade extended with wait_on_miss; 12 unit tests cover the branch matrix.
  • Unhappy path is loud and bounded (toxic-review hardening): startup fingerprint log (seed/namespace/block_size/NONE_HASH/KV geometry — diff it against the P side), warn on every exhausted miss window, and a miss breaker — after 3 consecutive exhausted windows (or 15s hard timeouts) new requests skip the wait until a remote hit re-arms it. Flags validate at parse time (seed must be decimal u32, namespace 8-hex); miss-wait >= 15s deadline is rejected at startup.
  • Server flags: --kv-pd-vllm-seed, --kv-pd-vllm-namespace, --kv-pd-miss-wait-ms (all gated on the P2P mesh flags).
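
The parent-chained keying that lets D derive the same block keys as P can be illustrated with a minimal sketch. It is not the PR's hasher: `stand_in_hash`, `key_chain`, and `shared_blocks` are hypothetical names, the standard library's `DefaultHasher` stands in for canonical CBOR + xxh3_128, and `root` plays the role that NONE_HASH (derived from PYTHONHASHSEED) plays in vLLM. Only the chaining structure is meant to carry over.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in 64-bit hash (assumption). The PR uses canonical CBOR + xxh3_128;
/// DefaultHasher only illustrates how a block key commits to its parent.
fn stand_in_hash(parent: u64, tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    parent.hash(&mut h);
    tokens.hash(&mut h);
    h.finish()
}

/// One key per full block. Each key is hashed together with its parent key,
/// so a key identifies the entire prefix up to and including that block.
/// A trailing partial block gets no key here. `block_size` must be non-zero.
fn key_chain(root: u64, tokens: &[u32], block_size: usize) -> Vec<u64> {
    let mut parent = root;
    tokens
        .chunks_exact(block_size)
        .map(|block| {
            parent = stand_in_hash(parent, block);
            parent
        })
        .collect()
}

/// Number of leading blocks on which two key chains agree.
fn shared_blocks(a: &[u64], b: &[u64]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

Under these assumptions, two prompts that share a prefix share exactly the keys of the full blocks inside that prefix, which is why a consumer only needs to fetch the blocks after the shared part. A different `root` (a different seed) yields a disjoint key space, so both sides must agree on it.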

Compatibility contract (verified on hardware)

P's blocks are loadable by D iff: same namespace (an 8-hex digest that does not include block size), same block granularity + hash scheme (P must run --block-size <D's page size> and --prefix-caching-hash-algo xxhash_cbor with a pinned PYTHONHASHSEED), equal per-layer slot counts (36 for Qwen3-8B on both sides), and compatible per-slot K/V segment layout — the last is absorbed by pegaflow novitalabs/pegaflow#382 (contiguous device layouts loading split-KV-sealed blocks). Full derivation in docs/models/glm52/pd-vllm-prefill.md §3.1.
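
The contract above can be read as a field-by-field equality check between the two sides. The sketch below is only an illustration of that reading: `KvPeerConfig`, its field names, and `contract_mismatches` are hypothetical and do not come from the PR. The K/V segment layout is deliberately left out because the text says pegaflow #382 absorbs it.

```rust
/// Hypothetical per-side settings; names are illustrative, not the PR's types.
#[derive(Debug, Clone, PartialEq, Eq)]
struct KvPeerConfig {
    namespace: String, // 8-hex connector namespace digest
    block_size: u32,   // P: --block-size, D: page size
    hash_algo: String, // e.g. "xxhash_cbor"
    hash_seed: u32,    // pinned PYTHONHASHSEED on P, --kv-pd-vllm-seed on D
    layer_slots: u32,  // per-layer slot count named in the contract
}

/// Returns the names of the fields on which P and D disagree.
/// An empty result means this sketch of the contract is satisfied.
fn contract_mismatches(p: &KvPeerConfig, d: &KvPeerConfig) -> Vec<&'static str> {
    let mut out = Vec::new();
    if p.namespace != d.namespace {
        out.push("namespace");
    }
    if p.block_size != d.block_size {
        out.push("block_size");
    }
    if p.hash_algo != d.hash_algo {
        out.push("hash_algo");
    }
    if p.hash_seed != d.hash_seed {
        out.push("hash_seed");
    }
    if p.layer_slots != d.layer_slots {
        out.push("layer_slots");
    }
    out
}
```

Returning the list of differing fields, rather than a single boolean, mirrors how an operator would diff the startup fingerprint against the P side's configuration.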

Evidence (jz node 34, H200 ×8: vLLM-P on GPU0, openinfer-D on GPU1)

Correctness — 3-prompt greedy smoke (short / cross-block / block-aligned) via the P/D router vs direct-D baseline: all outputs byte-identical. Content-addressed delta reuse confirmed: the aligned prompt (sharing a 480-token prefix with the previous one) fetched exactly 1 incremental block.

TTFT (unique-prefix cold runs, median of 3; overhead = PD via router − P direct):

| Prompt tokens | P direct | P/D handoff | Overhead |
|---|---|---|---|
| ~470 | 34.1 ms | 47.7 ms | +14 ms |
| ~1790 | 54.9 ms | 105.9 ms | +51 ms |
| ~7070 | 203.9 ms | 351.0 ms | +147 ms |

The 7k overhead is dominated by physically moving ~1 GiB of KV (RDMA fetch alone: 992 MiB in 41.8 ms, 23.7 GiB/s); pipelining P-side save / discovery / H2D is follow-up territory.

Failure injection — P killed, 4 sequential cold requests: first 3 wait the 5s window (one warn each), breaker opens, 4th fails over in 38 ms; P restored, one routed hit re-arms waiting.
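
The breaker behaviour seen in this run can be modelled as a small counter. This is a sketch under assumptions, not the PR's code: `MissBreaker` and `wait_budget_ms` are hypothetical names, and the real implementation also counts hard timeouts and interacts with the poll path.

```rust
/// Hypothetical miss breaker: opens after `threshold` consecutive exhausted
/// wait windows and re-arms on any remote hit.
#[derive(Debug)]
struct MissBreaker {
    threshold: u32,
    consecutive: u32,
}

impl MissBreaker {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive: 0 }
    }

    /// Open means new cold requests should skip the wait.
    fn is_open(&self) -> bool {
        self.consecutive >= self.threshold
    }

    /// Call when a request waited its full window without a hit.
    fn record_exhausted(&mut self) {
        self.consecutive = self.consecutive.saturating_add(1);
    }

    /// Call on any remote hit; waiting is re-armed.
    fn record_hit(&mut self) {
        self.consecutive = 0;
    }
}

/// Wait budget in milliseconds for a new cold request.
fn wait_budget_ms(breaker: &MissBreaker, miss_wait_ms: u64) -> u64 {
    if breaker.is_open() { 0 } else { miss_wait_ms }
}
```

With `threshold = 3` and a 5000 ms window, three exhausted requests each get the full budget, the fourth gets a budget of zero and falls through immediately, and a single recorded hit restores the 5000 ms budget.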

Not in this PR (tracked in the design doc)

  • Tail-block connector extension + router t1-forwarding (D currently computes the ≤ block_size partial tail locally).
  • Strict no-prefill mode (429/500 on miss instead of scratch fallback).
  • Namespace↔model-identity canary handshake (the namespace digest carries no model identity; the startup fingerprint is the current guard).

Dependencies

Runtime-correct only with pegaflow ≥ #382; builds fine against the currently pinned rev (the fix is load-path-only). I'll bump the pegaflow-core rev here once #382 merges.

Test plan

  • cargo test --release -p openinfer-kv-offload -p openinfer-qwen3 --lib — golden vectors + remote-fetch branch matrix, all green.
  • e2e smoke + TTFT + failure injection on node 34 as above (/data/pd-stack/{stack,smoke,failtest}.sh, ttft.py).

🤖 Generated with Claude Code

xiaguan and others added 8 commits July 3, 2026 12:21
GLM prefill rides vLLM (openinfer glm52 kernel surface is decode-only);
P->D readiness = D-side bounded fast-poll over pegaflow CPU P2P; cross-
engine keys via a vLLM-hash-compat provider (xxhash_cbor + pinned
PYTHONHASHSEED); tail-block connector extension + router t1 forwarding
planned. Roadmap: qwen3 vLLM-P + openinfer-D smoke test first.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t window

Decode-node support for a vLLM prefill peer over pegaflow CPU P2P:

- openinfer-kv-offload::VllmBlockHasher replicates vLLM's xxhash_cbor
  prefix-cache hashing (canonical CBOR + xxh3_128, NONE_HASH from
  PYTHONHASHSEED, parent-chained full blocks + derivable tail keys);
  golden vectors captured from a real vLLM process.
- Qwen3VllmCompatOptions switches cold-request offload queries to the
  vLLM key chain over the same [gpu_hit..cacheable) window (local GPU
  naming stays kvbm), joins the P side's connector namespace, skips
  LoRA requests (unreplicated extra_keys salting), and disables
  self-saves (kvbm keys are unfindable in a vLLM-keyed domain).
- remote_fetch_action grows wait_on_miss: during the P/D handoff race
  a zero hit means the producer's registration hasn't landed yet, so
  the request stays parked until a bounded miss deadline instead of
  degrading to a local prefill.
- Server flags: --kv-pd-vllm-seed / --kv-pd-vllm-namespace /
  --kv-pd-miss-wait-ms.
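
The wait_on_miss branch described above can be sketched as a three-way decision. The real `remote_fetch_action` signature and its other branches (such as the Loading state) are not shown in this PR text, so `Action` and `decide` below are hypothetical and cover only the zero-hit case.

```rust
use std::time::Duration;

/// Hypothetical outcome of one offload query for a cold request.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// The producer's blocks are registered: fetch and restore them.
    Fetch { blocks: usize },
    /// Zero hit inside the miss window: stay parked and re-query later.
    Park,
    /// Zero hit and no waiting left (or waiting disabled): prefill locally.
    LocalPrefill,
}

/// Decide what to do after a query returned `hit_blocks` remote blocks.
/// `waited` is how long the request has been parked so far.
fn decide(
    hit_blocks: usize,
    wait_on_miss: bool,
    waited: Duration,
    miss_wait: Duration,
) -> Action {
    if hit_blocks > 0 {
        Action::Fetch { blocks: hit_blocks }
    } else if wait_on_miss && waited < miss_wait {
        Action::Park
    } else {
        Action::LocalPrefill
    }
}
```

The point of the extra flag is that a zero hit is ambiguous during the handoff race: with `wait_on_miss` set it is treated as "not registered yet" until the bounded deadline, and only then as a genuine miss.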

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…bility contract

qwen3 vLLM-P + openinfer-D smoke passed on node 34: byte-identical
outputs, delta reuse, TTFT overhead +14ms/+51ms/+147ms at
0.5k/1.8k/7k tokens. Distill the verified compatibility equation
(namespace factors, block-size == D page size, per-layer slot counts,
K/V segment layout absorbed by pegaflow#382) into section 3.1 and mark
the remaining milestone-1 gaps (tail-block extension, strict
no-prefill mode).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… path

Toxic-review findings on the vLLM-compat branch: every failure mode of
the P/D handoff (wrong seed, wrong namespace, block-size mismatch, peer
down) degraded into silent scratch prefills that read as "network is a
bit slow". Address the unhappy path:

- Warn when a request exhausts the zero-hit wait window (the single
  symptom all misconfigurations share), with blocks queried and time
  waited.
- Log a startup fingerprint (seed, namespace, block size, NONE_HASH,
  KV geometry) an operator can diff against the vLLM peer's config.
- Miss breaker: after 3 consecutive exhausted windows, new requests
  skip the wait — a dead or misconfigured peer no longer taxes every
  cold request the full window. Any remote hit re-arms waiting.
- Throttle parked re-queries to one MetaServer RPC per 5ms per request
  so a miss storm cannot turn scheduler ticks into serial RPC pumps
  (the throttle the design doc promised but the code never had).
- Reject miss-wait >= the 15s remote-fetch deadline at startup instead
  of silently capping it; validate seed (decimal u32) and namespace
  (8-hex) at flag parse time — an empty seed is a well-formed key space
  that can never match.
- Drop the speculative include_tail parameter from key_chain (the tail
  golden vector stays covered via hash_block) and take the query window
  from the probe without materializing kvbm hashes.
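
The parse-time checks listed above can be sketched with the standard library alone. The function names are hypothetical and the error strings are invented for illustration. Accepting both upper- and lower-case hex digits in the namespace is an assumption, since the text only says "8-hex".

```rust
/// Seed must be a non-empty decimal u32. The explicit digit check rejects
/// forms such as "+5" that `u32::from_str` would otherwise accept.
fn parse_seed(s: &str) -> Result<u32, String> {
    if s.is_empty() || !s.bytes().all(|b| b.is_ascii_digit()) {
        return Err(format!("seed must be a decimal u32, got {s:?}"));
    }
    s.parse::<u32>()
        .map_err(|e| format!("seed out of range for u32: {e}"))
}

/// Namespace must be exactly 8 hex digits (case handling is an assumption).
fn parse_namespace(s: &str) -> Result<String, String> {
    if s.len() == 8 && s.bytes().all(|b| b.is_ascii_hexdigit()) {
        Ok(s.to_string())
    } else {
        Err(format!("namespace must be 8 hex digits, got {s:?}"))
    }
}

/// Reject a miss-wait at or above the remote-fetch deadline instead of
/// silently capping it.
fn check_miss_wait(miss_wait_ms: u64, deadline_ms: u64) -> Result<u64, String> {
    if miss_wait_ms >= deadline_ms {
        Err(format!(
            "miss-wait {miss_wait_ms} ms must be below the {deadline_ms} ms deadline"
        ))
    } else {
        Ok(miss_wait_ms)
    }
}
```

Failing at parse time matters here because a malformed seed or namespace still produces a well-formed key space; it simply never matches the producer's, and every request would otherwise degrade without an obvious cause.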

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Failure injection (P down, 4 sequential cold requests) showed the 4th
request still waiting the full miss window after the breaker opened: a
transient Loading answer on the first query parks a request past the
breaker gate, and the poll path recomputed wait-on-miss from the
deadline alone. Make the poll consult the breaker too, and only emit
the degradation warning (and count a miss window) when the request
actually waited the window out, so breaker-shortened requests do not
spam the log the breaker warning already covers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tone table

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A Loading-stuck peer rides requests to the hard deadline instead of the
miss window; without counting those, the breaker never opens and every
cold request pays the full 15s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>