feat(qwen35): add batched DFlash speculative decoding #507
Pull request overview
This PR adds an opt-in DFlash speculative decoding path for the Qwen3.5-4B model line, activated via --dflash-draft-model-path, while keeping the default Qwen3.5 decode behavior unchanged when no draft model is provided. It also introduces supporting infrastructure for batched target verification, hybrid-state commit/rollback, additional benchmarking metrics, and documentation/tests to gate correctness and performance.
Changes:
- Wire `--dflash-draft-model-path` through `openinfer-server` into the Qwen3.5 engine, with explicit incompatibility checks for unsupported modes.
- Implement Qwen3.5 DFlash draft loading, fixed-buffer batched verification, and hybrid-state (KV + recurrent + conv) commit/rollback in the model scheduler/executor paths.
- Extend `bench_serving` reporting with `effective_tpot_ms` and optional full token traces, and add new Qwen3.5 speculative verification tests, losslessness gate tests, and documentation.
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openinfer-server/src/main.rs | Adds Qwen3.5 DFlash flag gating and forwards draft path into the Qwen3.5 launcher with compatibility checks. |
| openinfer-server/src/config.rs | Updates CLI arg help/validation to allow DFlash for Qwen3.5 and adjusts decode-overlap enum derivations. |
| openinfer-server/src/bin/bench_serving/runners.rs | Adds effective_tpot_ms aggregation and rendering. |
| openinfer-server/src/bin/bench_serving/report.rs | Extends bench report schema with effective_tpot_ms and optional full token trace field. |
| openinfer-server/src/bin/bench_serving/metrics.rs | Adds env-gated “full token trace” capture to bench output. |
| openinfer-server/src/bin/bench_serving/main.rs | Wires DFlash draft path into bench engine start for Qwen3 and Qwen3.5 and updates metric tests. |
| openinfer-server/src/bin/bench_serving/exec.rs | Refactors scheduler bench execution to drain TokenStreamReceiver directly and collect timing components for new metrics. |
| openinfer-server/src/bin/bench_serving/cli.rs | Adds --dflash-draft-model-path to the bench CLI. |
| openinfer-qwen35-4b/tests/speculative_verify.rs | Adds targeted executor-level speculative verify state correctness tests. |
| openinfer-qwen35-4b/tests/dflash_speculative_gate.rs | Adds scheduler-level losslessness gate for Qwen3.5 DFlash under greedy decode and concurrency. |
| openinfer-qwen35-4b/src/weights.rs | Adds DFlash-aware memory reservation to KV sizing and exposes tied projection/embedding helpers for drafter integration. |
| openinfer-qwen35-4b/src/verify_buffers.rs | Introduces fixed scratch buffers for batched Qwen3.5 verification. |
| openinfer-qwen35-4b/src/unified_forward.rs | Adds prefill hidden-state capture plumbing for DFlash feature extraction. |
| openinfer-qwen35-4b/src/speculative.rs | Adds executor-level speculative verify path with accept logic and commit/replay handling. |
| openinfer-qwen35-4b/src/scheduler.rs | Implements scheduler-level DFlash speculative decode, capture, verify, commit, and fallback logic. |
| openinfer-qwen35-4b/src/recurrent.rs | Relaxes scratch size assertions to permit reuse of larger scratch allocations. |
| openinfer-qwen35-4b/src/recurrent_state.rs | Adds D2D copy helper for recurrent+conv state to support commit/rollback. |
| openinfer-qwen35-4b/src/prefill.rs | Refactors prefill to support capture and adds batched verification forward (prefill_verify_into). |
| openinfer-qwen35-4b/src/prefill_buffers.rs | Adds row-capacity tracking and helpers for resizing views over preallocated scratch. |
| openinfer-qwen35-4b/src/ops.rs | Re-exports additional copy helpers used by verification/capture paths. |
| openinfer-qwen35-4b/src/lib.rs | Wires DFlash draft path into engine startup and re-exports speculative/diagnostic types. |
| openinfer-qwen35-4b/src/kernel_plan.rs | Updates kernel plan docstrings to reflect refactored prefill entrypoint naming. |
| openinfer-qwen35-4b/src/executor.rs | Exposes state needed for speculative verification and adds debug state summaries. |
| openinfer-qwen35-4b/src/dflash/reservation.rs | Adds draft-memory reservation accounting to inform target KV sizing before loading draft weights. |
| openinfer-qwen35-4b/src/dflash/loading.rs | Implements safetensors loading for the DFlash draft model and rope precompute. |
| openinfer-qwen35-4b/src/dflash/config.rs | Adds draft config parsing/validation against the target model config. |
| openinfer-qwen35-4b/src/dflash.rs | Implements DFlash draft forward (batched dense ops + per-request varlen ops) and request state management. |
| openinfer-qwen35-4b/src/decode_buffers.rs | Adds captured hidden buffer storage to decode buffers for DFlash context capture. |
| openinfer-qwen35-4b/src/batch_decode.rs | Adds decode hidden capture plumbing and separates capture vs non-capture CUDA graph state. |
| openinfer-qwen35-4b/src/batch_decode_graph.rs | Adds capture-specific CUDA graphs and recurrent-state copy helpers. |
| openinfer-qwen35-4b/Cargo.toml | Registers the new DFlash losslessness gate test under feature gating. |
| openinfer-kernels/src/ops/attention.rs | Adds causal-window single-prefill attention wrapper. |
| openinfer-kernels/src/ops.rs | Re-exports the new causal-window attention op. |
| openinfer-kernels/src/ffi/shared.rs | Declares FFI for the new causal-window prefill kernel. |
| openinfer-kernels/src/ffi/qwen35.rs | Declares FFI for batched Qwen3.5 prefill attention prep kernel. |
| openinfer-kernels/csrc/shared/paged_attention.cu | Implements the new causal-window single-prefill kernel wrapper. |
| openinfer-kernels/csrc/qwen35/prefill_attention_hd256.cu | Implements batched Qwen3.5 QK norm + RoPE + KV write kernels for verification. |
| openinfer-core/src/page_pool.rs | Adds OwnedPagePermit::truncate to return tail pages back to the pool. |
| openinfer-core/src/ops.rs | Re-exports the new causal-window attention op at the core layer. |
| openinfer-core/src/kv_pool.rs | Adds KvState::truncate_to for rolling back logical KV length and returning pages. |
| docs/models/qwen35/roadmap.md | Updates Qwen3.5 roadmap TL;DR and adds DFlash entry. |
| docs/models/qwen35/dflash-speculative-decoding.md | Adds new doc describing enabling, contract, validation, benchmarks, and boundaries for Qwen3.5 DFlash. |
| docs/index.md | Adds routing entry for the new DFlash Qwen3.5 doc and updates roadmap summary row. |
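The "accept logic" mentioned for `speculative.rs` is not shown in this conversation. As a rough illustration only, the standard greedy acceptance rule for speculative decoding can be sketched as below. The function name `greedy_accept` and the token values are hypothetical, and the sketch assumes the target model produced one argmax token per draft position plus one extra position.

```rust
/// Hypothetical sketch of greedy speculative acceptance (not the PR's code).
///
/// `draft` holds the k tokens proposed by the draft model.
/// `target_argmax` holds k + 1 tokens: entry i is the target model's greedy
/// choice after consuming the committed prefix plus `draft[..i]`.
///
/// Returns the number of accepted draft tokens and the tokens to emit:
/// the accepted draft prefix followed by one token from the target model.
fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> (usize, Vec<u32>) {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    // Accept the longest prefix on which draft and target agree.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    let mut emitted = draft[..accepted].to_vec();
    // The target's own token at the first mismatch (or the bonus token when
    // every draft token was accepted) is always emitted.
    emitted.push(target_argmax[accepted]);
    (accepted, emitted)
}

fn main() {
    let (accepted, emitted) = greedy_accept(&[5, 7, 9], &[5, 7, 2, 4]);
    println!("accepted {accepted} draft tokens, emitted {emitted:?}");
}
```

State written for the rejected draft positions has to be undone after a step like this. Per the file summary above, that is the role of `KvState::truncate_to` for KV pages and of the recurrent+conv copy helper for the linear-attention state.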
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a74cd58ae6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
a74cd58 to e385429
Heads-up before we dig into a proper review: sharing how we measure DFlash / speculative-decode performance on Qwen3-4B, since the benchmark table here shows a notably different shape and it may help to have our methodology as reference.

Our measurement harness (Qwen3-4B DFlash)
1. Single-stream latency A/B: bs=1, fixed 256-token budget,
2. Losslessness gate: greedy spec must be lossless, but an exact

The shape contrast worth noting
On Qwen3-4B (full attention), single-stream was the direct win and concurrent throughput initially inverted it: the per-request serial draft loop was launch-bound (24.8 ms/step at batch 16 was almost all kernel-launch overhead; a skip-attention A/B showed attention compute <2%). The fix was (a) batching the dense draft forward into one N×block pass, then (b) a piecewise verify CUDA Graph (dense ops captured, attention eager). Only after both landed did c1/c8/c16 all reach or beat vLLM.

This PR's table shows the opposite: single-concurrency flat-to-negative (−0.5% to −0.9%), wins concentrated at c4/c8/c16. Qwen3.5 is hybrid (24 linear + 8 full attention), so the baseline decode is less memory-bound than Qwen3-4B's full-attention path. That would explain why the single-stream amortization buys less here, and why the win shows up under concurrent batch pressure instead.

It would help the doc to call out explicitly whether the single-stream regression is expected for the architecture, or whether there is a launch-overhead component (like the one we hit) that a graph pass could recover. On Qwen3-4B the concurrent win only appeared after we killed the per-request serial draft loop and captured the verify graph, and the c1 gap was ~84% dense-op kernel-launch overhead.

Two concrete suggestions, neither a review blocker:

Not a review, just flagging the methodology and the contrast so it is on the radar before we look at the code.
Follow-up on my earlier comment — I missed the serving-level bench scripts, which are the other half of how we measure DFlash on Qwen3-4B (besides the two Serving-level A/B scripts (
| Layer | Harness | What it measures |
|---|---|---|
| single-stream latency | tests/dflash_speculative_perf.rs | bs=1 spec OFF vs ON tok/s + speedup ratio (the direct spec-decode win) |
| correctness | tests/dflash_speculative_gate.rs | greedy losslessness via regret check (not exact match; prefill vs decode kernel ~1 bf16 ULP) |
| concurrent throughput | tools/bench/run_serving_bench.sh (vllm-bench) | QPS + concurrency sweeps, effective TPOT vs raw ITL p99, against a real serving endpoint |
The reason both layers matter: on Qwen3-4B the single-stream harness showed the win (1.56–1.82×) while the concurrent sweep showed the inversion that forced the batched-draft + piecewise-verify-graph fix — the .rs perf test would have stayed green while the serving sweep went red. If Qwen3.5 DFlash has an equivalent serving-level sweep script (or could reuse these), it would close the same gap.
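The regret check named in the correctness row is not spelled out in this thread. One plausible reading, sketched below under stated assumptions, is that a token mismatch between the speculative and baseline runs is tolerated only when the speculative token's baseline logit is within a small tolerance of the baseline maximum. The names `regret` and `is_near_tie`, the logit values, and the tolerance are all hypothetical.

```rust
/// Hypothetical sketch of a regret-style losslessness check (not the gate's code).
///
/// Regret of picking `token`: the gap between the best baseline logit and the
/// baseline logit of the token that was actually emitted.
fn regret(baseline_logits: &[f32], token: usize) -> f32 {
    let best = baseline_logits
        .iter()
        .copied()
        .fold(f32::NEG_INFINITY, f32::max);
    best - baseline_logits[token]
}

/// A mismatch counts as a benign near-tie when the regret is within `tol`.
/// The tolerance is an assumption; a real gate would pick it from the
/// numeric precision of the kernels being compared.
fn is_near_tie(baseline_logits: &[f32], token: usize, tol: f32) -> bool {
    regret(baseline_logits, token) <= tol
}

fn main() {
    // Tokens 1 and 2 differ by 0.015625, a gap that reduced-precision
    // rounding differences could plausibly flip.
    let logits = [1.0_f32, 3.5, 3.484375, -2.0];
    let tol = 0.03125;
    println!("token 2 near-tie: {}", is_near_tie(&logits, 2, tol));
    println!("token 0 near-tie: {}", is_near_tie(&logits, 0, tol));
}
```

Under this reading, an exact-match gate would fail on token 2 even though both runs made an equally defensible greedy choice, while a real divergence such as token 0 still fails.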
e385429 to 9b5f921
Signed-off-by: CAICAIIs <3360776475@qq.com>
9b5f921 to 20f6c7f
Signed-off-by: CAICAIIs <3360776475@qq.com>
Thanks for the Qwen3 DFlash measurement notes. I updated the Qwen3.5 docs/tests to make the same boundaries explicit: regret-oracle handling for bf16 near-ties, visible TTFT/effective TPOT/raw ITL/output tok/s, and a clear claim that Qwen3.5 DFlash is a c4/c8/c16 throughput win while c1 is flat/slightly negative. I kept the benchmark claim scoped to in-process serving evidence; a full HTTP pressure sweep with the serving scripts is still a useful follow-up before claiming broader serving-level performance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87213d6207
Signed-off-by: CAICAIIs <3360776475@qq.com>
Summary
Addresses #434.
This PR adds opt-in Qwen3.5 DFlash speculative decoding behind `--dflash-draft-model-path`. The default Qwen3.5 decode path remains unchanged unless a draft model is explicitly provided.

Key changes:
Validation
Validated on a real RTX 5090 GPU host.
The updated docs were checked for private paths, SSH hosts, credentials, and token-like strings.
Benchmark
Same host, same source snapshot (`8cd46cb`), in-process `bench_serving request`, greedy synthetic distinct prompts, `output_len=256`, warmup `3`, iters `8`.

`effective_tpot_ms` is the amortized per-request decode time. Raw token-event ITL can spike under speculative decode because accepted spans emit multiple tokens in one scheduler step; keep both metrics visible when reviewing tails. Single-concurrency runs are flat to slightly negative, so the intended performance claim is multi-active throughput.

Profile
Collected `nsys profile --trace=cuda,nvtx --cuda-graph-trace=node` for baseline and DFlash.

Representative stats:
- `prompt=1,c=8`: `gated_delta_rule_decode_kernel` total dropped from `2.04s` to `1.53s`
- `prompt=1024,c=8`: batch decode attention dropped from `550.2ms` to `71.8ms`; batched prefill verify cost was `49.6ms`
- `prompt=4096,c=16`: batch decode attention dropped from `2.44s` to `1.68s`; batched prefill verify cost was `537.9ms`

The previous per-request verifier overhead is removed by the batched verifier path, and commit, replay, and copy work is not the main bottleneck in the final profile.
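To illustrate why the Benchmark section asks reviewers to keep both `effective_tpot_ms` and raw ITL visible, here is a minimal sketch with made-up timestamps. It assumes effective TPOT is the time from first to last token divided by the number of decode intervals; the actual `bench_serving` formula is not shown in this PR description, and both helper functions are hypothetical.

```rust
/// Hypothetical helper (not the bench_serving implementation): amortized
/// decode time per token, assuming (last - first) / (tokens - 1).
fn effective_tpot_ms(token_times_ms: &[f64]) -> Option<f64> {
    let (first, last) = (token_times_ms.first()?, token_times_ms.last()?);
    if token_times_ms.len() < 2 {
        return None;
    }
    Some((last - first) / (token_times_ms.len() - 1) as f64)
}

/// Raw inter-token latencies: the gap between consecutive token events.
fn raw_itl_ms(token_times_ms: &[f64]) -> Vec<f64> {
    token_times_ms.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // Made-up trace: first token at t=0, then two speculative steps that
    // each emit three tokens at once, 30 ms apart.
    let times = [0.0, 30.0, 30.0, 30.0, 60.0, 60.0, 60.0];
    let itl = raw_itl_ms(&times);
    let worst = itl.iter().copied().fold(0.0_f64, f64::max);
    println!("effective TPOT: {:?} ms", effective_tpot_ms(&times));
    println!("raw ITL gaps: {itl:?}, worst gap: {worst} ms");
}
```

In this trace the amortized cost is 10 ms per token while the worst raw gap is 30 ms, so a tail metric on raw ITL alone would overstate per-token latency and the amortized metric alone would hide the bursty delivery.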
Boundaries
--dflash-draft-model-path.