feat(qwen35): add batched DFlash speculative decoding#507

Open
CAICAIIs wants to merge 3 commits into openinfer-project:main from CAICAIIs:feat/qwen35-dflash
CAICAIIs commented Jul 2, 2026

Collaborator

Summary

Addresses #434.

This PR adds opt-in Qwen3.5 DFlash speculative decoding behind --dflash-draft-model-path. The default Qwen3.5 decode path remains unchanged unless a draft model is explicitly provided.

Key changes:

  • add Qwen3.5 DFlash draft loading, memory reservation, admission headroom, and acceptance tracing
  • add fixed-buffer batched target verification for Qwen3.5 DFlash
  • add hybrid speculative state handling for full-attention KV, recurrent state, conv state, and sequence length
  • commit full-span accepts by copying verified state; commit partial accepts by rolling back and replaying the accepted span
  • fail closed or fall back for unsupported modes such as non-greedy sampling, logprobs, LoRA, multi-device, KV offload, and decode overlap
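
A minimal sketch of the greedy accept rule and the commit decision described in the list above. The function names are hypothetical and this is not the code in `speculative.rs`. It assumes the verifier scores K+1 positions for K draft tokens and that the target's own token is emitted at the first mismatch, which is the usual greedy speculative-decoding rule but is not spelled out in this description.

```python
from typing import List, Tuple


def greedy_accept(draft: List[int], target_argmax: List[int]) -> Tuple[List[int], bool]:
    """Return (tokens to emit, whether the full draft span was accepted).

    draft holds K proposed tokens. target_argmax holds K+1 entries, where
    target_argmax[i] is the target's greedy pick given the context plus draft[:i].
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("verifier must score K+1 positions for K draft tokens")
    accepted = 0
    for proposed, verified in zip(draft, target_argmax):
        if proposed != verified:
            break
        accepted += 1
    # Emit the accepted prefix plus the target's own token at the next position
    # (the correction on a mismatch, or the bonus token on a full accept).
    emitted = draft[:accepted] + [target_argmax[accepted]]
    return emitted, accepted == len(draft)


def commit_action(full_span: bool) -> str:
    """Pick the commit path: copy verified state, or roll back and replay."""
    return "copy_verified_state" if full_span else "rollback_and_replay"
```

On a full accept the verified KV, recurrent, and conv state already match the emitted tokens, so a copy is enough. On a partial accept the verified recurrent and conv state has presumably also consumed the rejected tokens, so it cannot simply be truncated the way paged KV can; that is a plausible reason for the rollback-and-replay path, not something this description states.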

Validation

Validated on a real RTX 5090 GPU host.

cargo fmt --all --check
git diff --check
cargo build --release -p openinfer-server --features qwen35-4b --bin bench_serving
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test dflash_speculative_gate -- --nocapture --test-threads=1
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test speculative_verify -- --nocapture --test-threads=1
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test hf_golden_gate -- --nocapture
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test e2e_scheduler -- --nocapture

The updated docs were checked for private paths, SSH hosts, credentials, and token-like strings.

Benchmark

Same host, same source snapshot (8cd46cb), in-process bench_serving request, greedy synthetic distinct prompts, output_len=256, warmup 3, iters 8.

| Prompt | Concurrency | Baseline tok/s | DFlash tok/s | Delta | Baseline effective TPOT p50 | DFlash effective TPOT p50 | Baseline raw ITL p99 | DFlash raw ITL p99 | Baseline TTFT p50 | DFlash TTFT p50 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 152.225 | 150.792 | -0.94% | 6.569 ms | 6.632 ms | 6.642 ms | 6.663 ms | 9.126 ms | 9.223 ms |
| 1 | 4 | 112.028 | 129.303 | +15.42% | 8.905 ms | 8.523 ms | 8.986 ms | 23.210 ms | 39.734 ms | 33.406 ms |
| 1 | 8 | 92.073 | 110.319 | +19.82% | 10.836 ms | 8.996 ms | 10.908 ms | 34.054 ms | 71.238 ms | 64.958 ms |
| 1 | 16 | 66.781 | 75.236 | +12.66% | 14.936 ms | 14.351 ms | 15.117 ms | 62.012 ms | 135.142 ms | 128.442 ms |
| 1024 | 1 | 138.754 | 138.032 | -0.52% | 7.207 ms | 7.244 ms | 7.281 ms | 7.275 ms | 46.991 ms | 46.906 ms |
| 1024 | 4 | 102.211 | 344.014 | +236.57% | 9.875 ms | 3.015 ms | 9.693 ms | 19.854 ms | 154.986 ms | 138.428 ms |
| 1024 | 8 | 82.904 | 242.829 | +192.90% | 12.165 ms | 4.125 ms | 54.483 ms | 28.213 ms | 266.918 ms | 231.595 ms |
| 1024 | 16 | 59.390 | 86.722 | +46.02% | 16.978 ms | 11.717 ms | 60.756 ms | 55.041 ms | 497.441 ms | 417.980 ms |
| 4096 | 1 | 110.688 | 109.782 | -0.82% | 9.035 ms | 9.101 ms | 9.110 ms | 9.113 ms | 191.241 ms | 191.210 ms |
| 4096 | 4 | 80.516 | 231.923 | +188.05% | 12.792 ms | 4.676 ms | 57.662 ms | 23.121 ms | 644.821 ms | 573.805 ms |
| 4096 | 8 | 63.302 | 89.473 | +41.34% | 16.204 ms | 11.663 ms | 60.615 ms | 37.011 ms | 1113.506 ms | 951.100 ms |
| 4096 | 16 | 44.292 | 55.315 | +24.89% | 23.181 ms | 17.808 ms | 65.180 ms | 66.158 ms | 2078.919 ms | 1708.289 ms |

effective_tpot_ms is the amortized per-request decode time. Raw token-event ITL can spike under speculative decode because accepted spans emit multiple tokens in one scheduler step; keep both metrics visible when reviewing tails. Single-concurrency runs are flat to slightly negative, so the intended performance claim is multi-active throughput.
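
To make the two metrics concrete, here is a small sketch of how they can diverge. The definitions are assumptions for illustration: raw ITL is taken as the gap between consecutive token-emission events, and effective TPOT as the decode wall time divided by the number of tokens emitted after the first event. The exact formulas in `bench_serving` may differ.

```python
from typing import List


def raw_itl_ms(event_times_ms: List[float]) -> List[float]:
    """Gaps between consecutive token-emission events (assumed definition)."""
    return [later - earlier for earlier, later in zip(event_times_ms, event_times_ms[1:])]


def effective_tpot_ms(event_times_ms: List[float], tokens_per_event: List[int]) -> float:
    """Decode wall time amortized over every token emitted after the first event."""
    decode_tokens = sum(tokens_per_event[1:])
    if decode_tokens == 0:
        raise ValueError("need at least one decode token")
    return (event_times_ms[-1] - event_times_ms[0]) / decode_tokens


# Plain decode: one token per step, one step every 8 ms (made-up numbers).
baseline_times = [0.0, 8.0, 16.0, 24.0]
baseline_tokens = [1, 1, 1, 1]

# Speculative decode: slower steps, but each step emits a 4-token accepted span.
spec_times = [0.0, 20.0, 40.0, 60.0]
spec_tokens = [1, 4, 4, 4]
```

With these made-up numbers the speculative run has a larger event gap (20 ms against 8 ms) but a smaller amortized per-token time (5 ms against 8 ms), which is the pattern in the ITL p99 and effective TPOT columns above.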

Profile

Collected nsys profile --trace=cuda,nvtx --cuda-graph-trace=node for baseline and DFlash.

Representative stats:

  • prompt=1,c=8: gated_delta_rule_decode_kernel total dropped from 2.04s to 1.53s
  • prompt=1024,c=8: batch decode attention dropped from 550.2ms to 71.8ms; batched prefill verify cost was 49.6ms
  • prompt=4096,c=16: batch decode attention dropped from 2.44s to 1.68s; batched prefill verify cost was 537.9ms

The previous per-request verifier overhead is removed by the batched verifier path, and commit, replay, and copy work is not the main bottleneck in the final profile.
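
The representative stats can be turned into relative drops and a rough net saving with a few lines of arithmetic. This is only a worked calculation over the numbers quoted above; treating "decode attention after plus verify cost" as the DFlash-side total is a simplifying assumption, since the draft forward and commit work are not included.

```python
def pct_drop(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return (before - after) / before * 100.0


def net_saving_ms(attn_before_ms: float, attn_after_ms: float, verify_ms: float) -> float:
    """Attention time saved after paying for the batched prefill verify."""
    return attn_before_ms - (attn_after_ms + verify_ms)


# Numbers quoted in the profile bullets, converted to milliseconds.
gdr_drop = pct_drop(2040.0, 1530.0)                   # prompt=1, c=8
attn_drop_1024 = pct_drop(550.2, 71.8)                # prompt=1024, c=8
attn_drop_4096 = pct_drop(2440.0, 1680.0)             # prompt=4096, c=16
net_1024 = net_saving_ms(550.2, 71.8, 49.6)
net_4096 = net_saving_ms(2440.0, 1680.0, 537.9)
```

The drops come out to roughly 25%, 87%, and 31%. Under the stated assumption the net saving is about 429 ms at prompt=1024,c=8 and about 222 ms at prompt=4096,c=16, so the verify cost eats a much larger share of the gain at the longer prompt.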

Boundaries

  • DFlash is default-off and requires --dflash-draft-model-path.
  • The main performance win is for multi-active decode, especially c4/c8/c16.
  • Raw streaming ITL p99 can spike because accepted speculative tokens are emitted in bursts; effective TPOT and throughput are the primary performance metrics for this path.
  • DFlash uses extra memory for draft weights, draft KV, verify buffers, scratch, and request headroom.
  • Unsupported modes use normal decode or fail closed instead of partially enabling speculation.
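
A minimal sketch of the "fall back or fail closed" boundary. All type and field names here are hypothetical, and the split between engine-level options that fail closed at startup and request-level options that fall back to normal decode is an illustrative assumption; the PR does not say which unsupported mode takes which path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DecodeMode(Enum):
    NORMAL = "normal"
    SPECULATIVE = "speculative"


@dataclass
class EngineConfig:  # hypothetical stand-in for the server options
    dflash_draft_model_path: Optional[str] = None
    lora: bool = False
    num_devices: int = 1
    kv_offload: bool = False
    decode_overlap: bool = False


@dataclass
class Request:  # hypothetical stand-in for per-request sampling options
    temperature: float = 0.0
    logprobs: bool = False


def dflash_enabled(cfg: EngineConfig) -> bool:
    """Fail closed: refuse to start if a draft path is combined with an unsupported engine mode."""
    if cfg.dflash_draft_model_path is None:
        return False  # default-off, normal decode path is untouched
    blockers = []
    if cfg.lora:
        blockers.append("LoRA")
    if cfg.num_devices > 1:
        blockers.append("multi-device")
    if cfg.kv_offload:
        blockers.append("KV offload")
    if cfg.decode_overlap:
        blockers.append("decode overlap")
    if blockers:
        raise ValueError("DFlash is incompatible with: " + ", ".join(blockers))
    return True


def mode_for_request(enabled: bool, req: Request) -> DecodeMode:
    """Fall back: never partially enable speculation for an unsupported request."""
    if enabled and req.temperature == 0.0 and not req.logprobs:
        return DecodeMode.SPECULATIVE
    return DecodeMode.NORMAL
```

The point of the sketch is only that each request ends up in exactly one of the two modes, and that an unsupported combination is rejected before any speculative state is allocated.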

Copilot AI left a comment

Pull request overview

This PR adds an opt-in DFlash speculative decoding path for the Qwen3.5-4B model line, activated via --dflash-draft-model-path, while keeping the default Qwen3.5 decode behavior unchanged when no draft model is provided. It also introduces supporting infrastructure for batched target verification, hybrid-state commit/rollback, additional benchmarking metrics, and documentation/tests to gate correctness and performance.

Changes:

  • Wire --dflash-draft-model-path through openinfer-server into the Qwen3.5 engine, with explicit incompatibility checks for unsupported modes.
  • Implement Qwen3.5 DFlash draft loading, fixed-buffer batched verification, and hybrid-state (KV + recurrent + conv) commit/rollback in the model scheduler/executor paths.
  • Extend bench_serving reporting with effective_tpot_ms and optional full token traces, plus add new Qwen3.5 speculative verification and losslessness gate tests and documentation.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| openinfer-server/src/main.rs | Adds Qwen3.5 DFlash flag gating and forwards draft path into the Qwen3.5 launcher with compatibility checks. |
| openinfer-server/src/config.rs | Updates CLI arg help/validation to allow DFlash for Qwen3.5 and adjusts decode-overlap enum derivations. |
| openinfer-server/src/bin/bench_serving/runners.rs | Adds effective_tpot_ms aggregation and rendering. |
| openinfer-server/src/bin/bench_serving/report.rs | Extends bench report schema with effective_tpot_ms and optional full token trace field. |
| openinfer-server/src/bin/bench_serving/metrics.rs | Adds env-gated “full token trace” capture to bench output. |
| openinfer-server/src/bin/bench_serving/main.rs | Wires DFlash draft path into bench engine start for Qwen3 and Qwen3.5 and updates metric tests. |
| openinfer-server/src/bin/bench_serving/exec.rs | Refactors scheduler bench execution to drain TokenStreamReceiver directly and collect timing components for new metrics. |
| openinfer-server/src/bin/bench_serving/cli.rs | Adds --dflash-draft-model-path to the bench CLI. |
| openinfer-qwen35-4b/tests/speculative_verify.rs | Adds targeted executor-level speculative verify state correctness tests. |
| openinfer-qwen35-4b/tests/dflash_speculative_gate.rs | Adds scheduler-level losslessness gate for Qwen3.5 DFlash under greedy decode and concurrency. |
| openinfer-qwen35-4b/src/weights.rs | Adds DFlash-aware memory reservation to KV sizing and exposes tied projection/embedding helpers for drafter integration. |
| openinfer-qwen35-4b/src/verify_buffers.rs | Introduces fixed scratch buffers for batched Qwen3.5 verification. |
| openinfer-qwen35-4b/src/unified_forward.rs | Adds prefill hidden-state capture plumbing for DFlash feature extraction. |
| openinfer-qwen35-4b/src/speculative.rs | Adds executor-level speculative verify path with accept logic and commit/replay handling. |
| openinfer-qwen35-4b/src/scheduler.rs | Implements scheduler-level DFlash speculative decode, capture, verify, commit, and fallback logic. |
| openinfer-qwen35-4b/src/recurrent.rs | Relaxes scratch size assertions to permit reuse of larger scratch allocations. |
| openinfer-qwen35-4b/src/recurrent_state.rs | Adds D2D copy helper for recurrent+conv state to support commit/rollback. |
| openinfer-qwen35-4b/src/prefill.rs | Refactors prefill to support capture and adds batched verification forward (prefill_verify_into). |
| openinfer-qwen35-4b/src/prefill_buffers.rs | Adds row-capacity tracking and helpers for resizing views over preallocated scratch. |
| openinfer-qwen35-4b/src/ops.rs | Re-exports additional copy helpers used by verification/capture paths. |
| openinfer-qwen35-4b/src/lib.rs | Wires DFlash draft path into engine startup and re-exports speculative/diagnostic types. |
| openinfer-qwen35-4b/src/kernel_plan.rs | Updates kernel plan docstrings to reflect refactored prefill entrypoint naming. |
| openinfer-qwen35-4b/src/executor.rs | Exposes state needed for speculative verification and adds debug state summaries. |
| openinfer-qwen35-4b/src/dflash/reservation.rs | Adds draft-memory reservation accounting to inform target KV sizing before loading draft weights. |
| openinfer-qwen35-4b/src/dflash/loading.rs | Implements safetensors loading for the DFlash draft model and rope precompute. |
| openinfer-qwen35-4b/src/dflash/config.rs | Adds draft config parsing/validation against the target model config. |
| openinfer-qwen35-4b/src/dflash.rs | Implements DFlash draft forward (batched dense ops + per-request varlen ops) and request state management. |
| openinfer-qwen35-4b/src/decode_buffers.rs | Adds captured hidden buffer storage to decode buffers for DFlash context capture. |
| openinfer-qwen35-4b/src/batch_decode.rs | Adds decode hidden capture plumbing and separates capture vs non-capture CUDA graph state. |
| openinfer-qwen35-4b/src/batch_decode_graph.rs | Adds capture-specific CUDA graphs and recurrent-state copy helpers. |
| openinfer-qwen35-4b/Cargo.toml | Registers the new DFlash losslessness gate test under feature gating. |
| openinfer-kernels/src/ops/attention.rs | Adds causal-window single-prefill attention wrapper. |
| openinfer-kernels/src/ops.rs | Re-exports the new causal-window attention op. |
| openinfer-kernels/src/ffi/shared.rs | Declares FFI for the new causal-window prefill kernel. |
| openinfer-kernels/src/ffi/qwen35.rs | Declares FFI for batched Qwen3.5 prefill attention prep kernel. |
| openinfer-kernels/csrc/shared/paged_attention.cu | Implements the new causal-window single-prefill kernel wrapper. |
| openinfer-kernels/csrc/qwen35/prefill_attention_hd256.cu | Implements batched Qwen3.5 QK norm + RoPE + KV write kernels for verification. |
| openinfer-core/src/page_pool.rs | Adds OwnedPagePermit::truncate to return tail pages back to the pool. |
| openinfer-core/src/ops.rs | Re-exports the new causal-window attention op at the core layer. |
| openinfer-core/src/kv_pool.rs | Adds KvState::truncate_to for rolling back logical KV length and returning pages. |
| docs/models/qwen35/roadmap.md | Updates Qwen3.5 roadmap TL;DR and adds DFlash entry. |
| docs/models/qwen35/dflash-speculative-decoding.md | Adds new doc describing enabling, contract, validation, benchmarks, and boundaries for Qwen3.5 DFlash. |
| docs/index.md | Adds routing entry for the new DFlash Qwen3.5 doc and updates roadmap summary row. |
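
The `KvState::truncate_to` and `OwnedPagePermit::truncate` entries above describe rolling back the logical KV length and returning tail pages to the pool. A minimal sketch of that bookkeeping, assuming fixed-size pages and a plain list of page ids; the function names and signatures are illustrative and do not mirror the Rust API.

```python
from typing import List, Tuple


def pages_needed(num_tokens: int, page_size: int) -> int:
    """Number of fixed-size pages required to hold num_tokens (ceiling division)."""
    if num_tokens < 0 or page_size <= 0:
        raise ValueError("num_tokens must be >= 0 and page_size must be > 0")
    return (num_tokens + page_size - 1) // page_size


def truncate_to(pages: List[int], new_len: int, page_size: int) -> Tuple[List[int], List[int]]:
    """Shrink the logical length to new_len tokens.

    Returns (pages still owned, tail pages to hand back to the pool).
    """
    keep = pages_needed(new_len, page_size)
    if keep > len(pages):
        raise ValueError("cannot truncate to a length longer than the current allocation")
    return pages[:keep], pages[keep:]
```

For example, with 16-token pages, a sequence holding pages `[10, 11, 12]` that rolls back to 17 tokens keeps `[10, 11]` and returns `[12]`; the partially filled second page stays owned because the logical length still reaches into it.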

Comment thread openinfer-qwen35-4b/src/verify_buffers.rs
Comment thread openinfer-server/src/config.rs
Comment thread openinfer-kernels/src/ops/attention.rs

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a74cd58ae6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread openinfer-qwen35-4b/src/scheduler.rs Outdated
Comment thread openinfer-server/src/bin/bench_serving/exec.rs Outdated
CAICAIIs force-pushed the feat/qwen35-dflash branch from a74cd58 to e385429 on July 2, 2026 15:07
xiaguan commented Jul 2, 2026

Collaborator

Heads-up before we dig into a proper review — sharing how we measure DFlash / speculative-decode performance on Qwen3-4B, since the benchmark table here shows a notably different shape and it may help to have our methodology as reference.

Our measurement harness (Qwen3-4B DFlash)

1. Single-stream latency A/B: openinfer-qwen3/tests/dflash_speculative_perf.rs

bs=1, fixed 256-token budget, ignore_eos, one warm-up discarded, spec OFF vs ON, reports tok/s and the speedup ratio. This is the canonical "does spec decode actually help" measurement: plain decode is memory-bound (one target forward per token), spec amortizes that forward over the accepted run. Qwen3-4B measures 1.82× on 5070 Ti, 1.56× on 5090. The harness asserts only that spec is not catastrophically slower (a guard against the draft mispredicting everything) — the real signal is the printed speedup number (--nocapture).

2. Losslessness gate: openinfer-qwen3/tests/dflash_speculative_gate.rs

Greedy spec must be lossless, but an exact spec == baseline token match is the wrong gate: the verify path runs the prefill attention kernel over the K+1 span while plain decode runs the decode kernel, and the two differ by ~1 bf16 ULP — enough to flip an argmax on a near-tie, after which two greedy runs fan out completely. We use a regret test (same idea as hf_golden_gate): at the first position the two sequences disagree (where they still share an identical context, so the comparison is valid), how far below the argmax does the speculative pick sit, measured in the prefill kernel's own distribution (obtained by a re-prefill of the shared context). Within MARGIN_TOL = 0.20 ⇒ benign numerical tie; clearly worse (or outside the prefill top-K) ⇒ a real verify/accept/capture bug. A systematic bug corrupts the non-tie positions too, so it cannot hide behind the tie band.
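
The regret check described above can be sketched as follows. This is an illustration, not the code in `dflash_speculative_gate.rs`: the gap is assumed to be measured in raw logits (it could equally be log-probabilities), `top_k=20` is a made-up default, and the logits are assumed to come from a re-prefill of the shared context.

```python
from typing import List, Optional

MARGIN_TOL = 0.20  # tolerance quoted in the comment above


def first_divergence(baseline: List[int], spec: List[int]) -> Optional[int]:
    """Index of the first position where the two greedy sequences differ, if any."""
    for i, (b, s) in enumerate(zip(baseline, spec)):
        if b != s:
            return i
    return None


def is_benign_tie(
    prefill_logits: List[float],
    spec_token: int,
    top_k: int = 20,
    tol: float = MARGIN_TOL,
) -> bool:
    """True if the speculative pick sits within tol of the prefill argmax.

    prefill_logits is the re-prefilled distribution at the first divergent
    position, where both runs still share an identical context.
    """
    ranked = sorted(range(len(prefill_logits)), key=lambda t: prefill_logits[t], reverse=True)
    if spec_token not in ranked[:top_k]:
        return False  # outside the prefill top-K: treat as a real bug
    regret = prefill_logits[ranked[0]] - prefill_logits[spec_token]
    return regret <= tol
```

Only the first divergent position is checked because every later position is conditioned on a different prefix, so comparing tokens there says nothing about the verify path.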

The shape contrast worth noting

On Qwen3-4B (full attention), single-stream was the direct win and concurrent throughput initially inverted it — the per-request serial draft loop was launch-bound (24.8 ms/step at batch 16 was almost all kernel-launch overhead; a skip-attention A/B showed attention compute <2%). The fix was (a) batching the dense draft forward into one N×block pass, then (b) a piecewise verify CUDA Graph (dense ops captured, attention eager). Only after both landed did c1/c8/c16 all reach or beat vLLM.

This PR's table shows the opposite: single-concurrency flat-to-negative (−0.5% to −0.9%), wins concentrated at c4/c8/c16. Qwen3.5 is hybrid (24 linear + 8 full attention), so the baseline decode is less memory-bound than Qwen3-4B's full-attention path — which would explain why the single-stream amortization buys less here, and why the win shows up under concurrent batch pressure instead. It would help the doc to call out explicitly whether the single-stream regression is expected for the architecture, or whether there is a launch-overhead component (like the one we hit) that a graph pass could recover — since on Qwen3-4B the concurrent win only appeared after we killed the per-request serial draft loop and captured the verify graph, and the c1 gap was ~84% dense-op kernel-launch overhead.

Two concrete suggestions, neither a review blocker:

  • Consider adding a single-stream A/B in the same style as dflash_speculative_perf.rs (bs=1, fixed budget, spec OFF vs ON speedup ratio) so the "single-stream is not the win" claim is a measured number rather than a row in the concurrent sweep.
  • The effective_tpot_ms vs raw ITL p99 distinction you call out is exactly right — on Qwen3-4B we hit the same burst-emission effect. Worth keeping both visible in any tail-latency gate.

Not a review — just flagging the methodology and the contrast so it is on the radar before we look at the code.

xiaguan commented Jul 2, 2026

Collaborator

Follow-up on my earlier comment — I missed the serving-level bench scripts, which are the other half of how we measure DFlash on Qwen3-4B (besides the two tests/*.rs harnesses).

Serving-level A/B scripts (tools/bench/)

run_serving_bench.sh — one-shot end-to-end: builds + launches an openinfer (or vLLM) server, waits for readiness, runs a QPS sweep and a DSpark/DFlash concurrency sweep, then summarizes. The key flags for spec decode:

# openinfer Qwen3-4B + DSpark, concurrency sweep only (single-stream A/B is the .rs test)
MODEL=/data/Qwen3-4B DRAFT_MODEL=/data/dspark_qwen3_4b_block7 GPU=7 \
  QPS_LIST="" CONCURRENCY_LIST="1 4 8" tools/bench/run_serving_bench.sh

DRAFT_MODEL wires --dflash-draft-model-path into the server; omit it for the plain-decode baseline. --temperature 0 is enforced by the script (non-greedy silently disables spec decoding — see dspark-integration.md Bug 1), and --percentile-metrics ttft,tpot,itl,e2el keeps the effective-TPOT vs raw-ITL split visible. Results land as one JSON per sweep point; summarize_qps_sweep.py prints the table.

qps_sweep.sh — the same vllm-bench driver, but against an already-running server (useful when you bring your own server lifecycle). Same --temperature 0 / --ignore-eos / --percentile-metrics defaults.

What the two layers cover

| Layer | Harness | What it measures |
|---|---|---|
| single-stream latency | tests/dflash_speculative_perf.rs | bs=1 spec OFF vs ON tok/s + speedup ratio (the direct spec-decode win) |
| correctness | tests/dflash_speculative_gate.rs | greedy losslessness via regret check (not exact match; prefill and decode kernels differ by ~1 bf16 ULP) |
| concurrent throughput | tools/bench/run_serving_bench.sh (vllm-bench) | QPS + concurrency sweeps, effective TPOT vs raw ITL p99, against a real serving endpoint |

The reason both layers matter: on Qwen3-4B the single-stream harness showed the win (1.56–1.82×) while the concurrent sweep showed the inversion that forced the batched-draft + piecewise-verify-graph fix — the .rs perf test would have stayed green while the serving sweep went red. If Qwen3.5 DFlash has an equivalent serving-level sweep script (or could reuse these), it would close the same gap.

CAICAIIs marked this pull request as draft on July 2, 2026 15:10
CAICAIIs force-pushed the feat/qwen35-dflash branch from e385429 to 9b5f921 on July 2, 2026 15:12
Signed-off-by: CAICAIIs <3360776475@qq.com>
CAICAIIs force-pushed the feat/qwen35-dflash branch from 9b5f921 to 20f6c7f on July 2, 2026 15:19
Signed-off-by: CAICAIIs <3360776475@qq.com>
CAICAIIs marked this pull request as ready for review on July 2, 2026 17:47

CAICAIIs commented Jul 2, 2026

Collaborator Author

Thanks for the Qwen3 DFlash measurement notes. I updated the Qwen3.5 docs/tests to make the same boundaries explicit: regret-oracle handling for bf16 near-ties, visible TTFT/effective TPOT/raw ITL/output tok/s, and a clear claim that Qwen3.5 DFlash is a c4/c8/c16 throughput win while c1 is flat/slightly negative.

I kept the benchmark claim scoped to in-process serving evidence; a full HTTP pressure sweep with the serving scripts is still a useful follow-up before claiming broader serving-level performance.

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87213d6207

Comment thread openinfer-qwen35-4b/src/scheduler.rs
Comment thread openinfer-qwen35-4b/src/scheduler.rs Outdated
Signed-off-by: CAICAIIs <3360776475@qq.com>
CAICAIIs commented Jul 3, 2026

Collaborator Author

I also opened follow-up issues (#513, #514, #515) for the remaining single-stream, HTTP serving, and c1 profiling work.

3 participants