feat(qwen3): vLLM-prefill P/D — cross-engine KV compat, verified end-to-end#540
Open
xiaguan wants to merge 8 commits into
GLM prefill rides vLLM (openinfer glm52 kernel surface is decode-only); P->D readiness = D-side bounded fast-poll over pegaflow CPU P2P; cross-engine keys via a vLLM-hash-compat provider (xxhash_cbor + pinned PYTHONHASHSEED); tail-block connector extension + router t1 forwarding planned. Roadmap: qwen3 vLLM-P + openinfer-D smoke test first. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t window Decode-node support for a vLLM prefill peer over pegaflow CPU P2P:
- openinfer-kv-offload::VllmBlockHasher replicates vLLM's xxhash_cbor prefix-cache hashing (canonical CBOR + xxh3_128, NONE_HASH from PYTHONHASHSEED, parent-chained full blocks + derivable tail keys); golden vectors captured from a real vLLM process.
- Qwen3VllmCompatOptions switches cold-request offload queries to the vLLM key chain over the same [gpu_hit..cacheable) window (local GPU naming stays kvbm), joins the P side's connector namespace, skips LoRA requests (unreplicated extra_keys salting), and disables self-saves (kvbm keys are unfindable in a vLLM-keyed domain).
- remote_fetch_action grows wait_on_miss: during the P/D handoff race a zero hit means the producer's registration hasn't landed yet, so the request stays parked until a bounded miss deadline instead of degrading to a local prefill.
- Server flags: --kv-pd-vllm-seed / --kv-pd-vllm-namespace / --kv-pd-miss-wait-ms.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…bility contract qwen3 vLLM-P + openinfer-D smoke passed on node 34: byte-identical outputs, delta reuse, TTFT overhead +14ms/+51ms/+147ms at 0.5k/1.8k/7k tokens. Distill the verified compatibility equation (namespace factors, block-size == D page size, per-layer slot counts, K/V segment layout absorbed by pegaflow#382) into section 3.1 and mark the remaining milestone-1 gaps (tail-block extension, strict no-prefill mode). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… path Toxic-review findings on the vLLM-compat branch: every failure mode of the P/D handoff (wrong seed, wrong namespace, block-size mismatch, peer down) degraded into silent scratch prefills that read as "network is a bit slow". Address the unhappy path:
- Warn when a request exhausts the zero-hit wait window (the single symptom all misconfigurations share), with blocks queried and time waited.
- Log a startup fingerprint (seed, namespace, block size, NONE_HASH, KV geometry) an operator can diff against the vLLM peer's config.
- Miss breaker: after 3 consecutive exhausted windows, new requests skip the wait, so a dead or misconfigured peer no longer taxes every cold request the full window. Any remote hit re-arms waiting.
- Throttle parked re-queries to one MetaServer RPC per 5ms per request so a miss storm cannot turn scheduler ticks into serial RPC pumps (the throttle the design doc promised but the code never had).
- Reject miss-wait >= the 15s remote-fetch deadline at startup instead of silently capping it; validate seed (decimal u32) and namespace (8-hex) at flag parse time, since an empty seed is a well-formed key space that can never match.
- Drop the speculative include_tail parameter from key_chain (the tail golden vector stays covered via hash_block) and take the query window from the probe without materializing kvbm hashes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Failure injection (P down, 4 sequential cold requests) showed the 4th request still waiting the full miss window after the breaker opened: a transient Loading answer on the first query parks a request past the breaker gate, and the poll path recomputed wait-on-miss from the deadline alone. Make the poll consult the breaker too, and only emit the degradation warning (and count a miss window) when the request actually waited the window out, so breaker-shortened requests do not spam the log the breaker warning already covers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tone table Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A Loading-stuck peer rides requests to the hard deadline instead of the miss window; without counting those, the breaker never opens and every cold request pays the full 15s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
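The wait-on-miss behavior these commits describe can be sketched as a small decision function. This is a minimal illustration, not the code in `remote_fetch.rs`: the `FetchAction` enum, the `decide` function, and its parameters are hypothetical names, and the real cascade handles more states (for example the transient Loading answer and the 15s hard deadline).

```rust
use std::time::Duration;

/// Outcome of one poll for a parked cold request (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FetchAction {
    /// The query found remote blocks: fetch them.
    Fetch { blocks: usize },
    /// Zero hit inside the miss window: stay parked and re-query later.
    KeepWaiting,
    /// Window exhausted or breaker open: fall back to a local prefill.
    LocalPrefill,
}

/// Decide what to do after one MetaServer query (hypothetical signature).
/// `hit_blocks` is the number of blocks the query found, `waited` is how
/// long the request has been parked, `miss_wait` is the bounded miss window,
/// and `breaker_open` says whether the miss breaker is currently open.
pub fn decide(
    hit_blocks: usize,
    waited: Duration,
    miss_wait: Duration,
    breaker_open: bool,
) -> FetchAction {
    if hit_blocks > 0 {
        return FetchAction::Fetch { blocks: hit_blocks };
    }
    // The poll consults the breaker as well as the deadline, so a request
    // parked before the breaker opened does not wait the full window.
    let wait_on_miss = !breaker_open && waited < miss_wait;
    if wait_on_miss {
        FetchAction::KeepWaiting
    } else {
        FetchAction::LocalPrefill
    }
}
```

With a 5000 ms window, a zero hit at 100 ms keeps the request parked, while the same zero hit with the breaker open degrades to a local prefill immediately.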
What
Qwen3 decode nodes can now consume KV prefilled by a vLLM + PegaKVConnector peer over the pegaflow P2P mesh: vLLM prefills and seals blocks under its own prefix-cache hashes; openinfer-D derives byte-identical keys, discovers the blocks via the MetaServer, RDMA-fetches and restores them, and decodes — zero vLLM/connector changes.
- `openinfer-kv-offload/src/vllm_hash.rs`, `VllmBlockHasher`: replicates vLLM's `xxhash_cbor` block-hash chain (hand-rolled canonical CBOR + xxh3_128, `NONE_HASH` from `PYTHONHASHSEED`). Golden vectors captured from a real vLLM process.
- … `--kv-pd-miss-wait-ms`, default 5000), bridging the P→D save/publish race. `remote_fetch.rs` decision cascade extended with `wait_on_miss`; 12 unit tests cover the branch matrix.
- … `NONE_HASH`/KV geometry; diff it against the P side), warn on every exhausted miss window, and a miss breaker: after 3 consecutive exhausted windows (or 15s hard timeouts) new requests skip the wait until a remote hit re-arms it. Flags validate at parse time (seed must be decimal u32, namespace 8-hex); a miss-wait >= the 15s deadline is rejected at startup.
- `--kv-pd-vllm-seed`, `--kv-pd-vllm-namespace`, `--kv-pd-miss-wait-ms` (all gated on the P2P mesh flags).
Compatibility contract (verified on hardware)
P's blocks are loadable by D iff: same namespace (an 8-hex digest that does not include block size), same block granularity + hash scheme (P must run `--block-size <D's page size>` and `--prefix-caching-hash-algo xxhash_cbor` with a pinned `PYTHONHASHSEED`), equal per-layer slot counts (36 for Qwen3-8B on both sides), and compatible per-slot K/V segment layout; the last is absorbed by pegaflow novitalabs/pegaflow#382 (contiguous device layouts loading split-KV-sealed blocks). Full derivation in `docs/models/glm52/pd-vllm-prefill.md` §3.1.
Evidence (jz node 34, H200 ×8: vLLM-P on GPU0, openinfer-D on GPU1)
Correctness — 3-prompt greedy smoke (short / cross-block / block-aligned) via the P/D router vs direct-D baseline: all outputs byte-identical. Content-addressed delta reuse confirmed: the aligned prompt (sharing a 480-token prefix with the previous one) fetched exactly 1 incremental block.
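Why a shared prefix yields a one-block delta follows from parent-chained block keys. The sketch below is illustrative only: `key_chain` and `blocks_to_fetch` are hypothetical helpers, `DefaultHasher` is a stand-in for the xxhash_cbor scheme (it will not reproduce vLLM's hashes), and the block size of 16 and the 496-token prompt length in the usage note are assumptions, not values from this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Derive one key per full block. Each key commits to its parent key, so a
/// key identifies the whole prefix up to and including that block.
/// NOTE: DefaultHasher is a stand-in, not vLLM's canonical CBOR + xxh3_128.
pub fn key_chain(tokens: &[u32], block_size: usize, seed: u64) -> Vec<u64> {
    // `seed` plays the role of NONE_HASH: the parent of the first block.
    let mut parent = seed;
    let mut keys = Vec::new();
    // chunks_exact skips the partial tail block; only full blocks get keys.
    for block in tokens.chunks_exact(block_size) {
        let mut hasher = DefaultHasher::new();
        parent.hash(&mut hasher);
        block.hash(&mut hasher);
        parent = hasher.finish();
        keys.push(parent);
    }
    keys
}

/// Count the wanted keys that are not already held, i.e. the blocks that
/// still have to be fetched.
pub fn blocks_to_fetch(wanted: &[u64], held: &HashSet<u64>) -> usize {
    wanted.iter().filter(|key| !held.contains(*key)).count()
}
```

Under these assumptions, a 480-token prompt produces 30 keys and a 496-token prompt sharing that prefix produces the same 30 keys plus one more, so only one block is missing once the first prompt's blocks are held. A different seed changes the first key and therefore every key after it, which is why both sides must agree on the seed.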
TTFT (unique-prefix cold runs, median of 3; overhead = PD via router − P direct): +14 ms / +51 ms / +147 ms at 0.5k / 1.8k / 7k tokens.
The 7k overhead is dominated by physically moving ~1 GiB of KV (RDMA fetch alone: 992 MiB in 41.8 ms, 23.7 GiB/s); pipelining P-side save / discovery / H2D is follow-up territory.
Failure injection — P killed, 4 sequential cold requests: first 3 wait the 5s window (one warn each), breaker opens, 4th fails over in 38 ms; P restored, one routed hit re-arms waiting.
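The breaker behavior exercised here (three exhausted windows open it, one remote hit re-arms waiting) can be sketched as a consecutive-failure counter. This is a minimal sketch under assumed names: `MissBreaker`, its methods, and `OPEN_AFTER` are hypothetical, and the actual implementation may track more state.

```rust
/// Consecutive exhausted windows before the breaker opens (3 per this PR).
const OPEN_AFTER: u32 = 3;

/// Hypothetical miss breaker: counts consecutive exhausted miss windows.
#[derive(Debug, Default)]
pub struct MissBreaker {
    consecutive_exhausted: u32,
}

impl MissBreaker {
    /// Record a request that waited out its full miss window
    /// (or rode to the hard deadline).
    pub fn record_exhausted(&mut self) {
        self.consecutive_exhausted = self.consecutive_exhausted.saturating_add(1);
    }

    /// Record a remote hit: this resets the count and re-arms waiting.
    pub fn record_remote_hit(&mut self) {
        self.consecutive_exhausted = 0;
    }

    /// While the breaker is open, new requests skip the wait.
    pub fn is_open(&self) -> bool {
        self.consecutive_exhausted >= OPEN_AFTER
    }
}
```

Replaying the failure injection against this sketch: three calls to `record_exhausted` leave `is_open()` true, so a fourth request would skip the wait, and a single `record_remote_hit` closes the breaker again.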
Not in this PR (tracked in the design doc)
Dependencies
Runtime-correct only with pegaflow ≥ #382; builds fine against the currently pinned rev (the fix is load-path-only). I'll bump the `pegaflow-core` rev here once #382 merges.
Test plan
- `cargo test --release -p openinfer-kv-offload -p openinfer-qwen3 --lib`: golden vectors + remote-fetch branch matrix, all green.
- … `/data/pd-stack/{stack,smoke,failtest}.sh`, `ttft.py`).
🤖 Generated with Claude Code