feat(qwen35): add batched DFlash speculative decoding by CAICAIIs · Pull Request #507 · openinfer-project/openinfer

CAICAIIs · 2026-07-02T14:55:09Z

Summary

Addresses #434.

This PR adds opt-in Qwen3.5 DFlash speculative decoding behind --dflash-draft-model-path. The default Qwen3.5 decode path remains unchanged unless a draft model is explicitly provided.

Key changes:

- add Qwen3.5 DFlash draft loading, memory reservation, admission headroom, and acceptance tracing
- add fixed-buffer batched target verification for Qwen3.5 DFlash
- add hybrid speculative state handling for full-attention KV, recurrent state, conv state, and sequence length
- commit full-span accepts by copying verified state; commit partial accepts by rolling back and replaying the accepted span
- fail closed or fall back for unsupported modes such as non-greedy sampling, logprobs, LoRA, multi-device, KV offload, and decode overlap
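The full-span versus partial-accept split in the list above can be illustrated with a minimal Rust sketch of greedy acceptance. This is not the PR's code: `CommitPlan`, `plan_commit`, and the assumption that the target produces K+1 greedy picks for K draft tokens are illustrative, and the real scheduler also has to manage KV, recurrent, and conv state.

```rust
/// How a verified speculative span should be committed.
/// Hypothetical type; the PR's real names and fields may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitPlan {
    /// Every draft token matched: the verified state can be copied as-is.
    FullSpan { emitted: Vec<u32> },
    /// Only a prefix matched: roll back, then replay the accepted span.
    Partial { emitted: Vec<u32>, accepted: usize },
}

/// Greedy acceptance. `draft` holds K proposed tokens; `target_argmax` holds
/// K+1 target picks, where `target_argmax[i]` is the target's greedy token
/// after consuming `draft[..i]` (assumed layout).
pub fn plan_commit(draft: &[u32], target_argmax: &[u32]) -> CommitPlan {
    assert_eq!(target_argmax.len(), draft.len() + 1, "verify span must be K+1");
    // Length of the longest prefix on which draft and target agree.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // Emit the accepted draft tokens plus one token chosen by the target:
    // the correction on a mismatch, or the extra token on a full accept.
    let mut emitted = draft[..accepted].to_vec();
    emitted.push(target_argmax[accepted]);
    if accepted == draft.len() {
        CommitPlan::FullSpan { emitted }
    } else {
        CommitPlan::Partial { emitted, accepted }
    }
}
```

With `draft = [5, 6, 7]` and target picks `[5, 9, 1, 2]`, only the first draft token is accepted, so the sketch emits `[5, 9]` and reports a partial accept; the state written for the rejected positions is what a rollback would have to discard.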

Validation

Validated on a real RTX 5090 GPU host.

cargo fmt --all --check
git diff --check
cargo build --release -p openinfer-server --features qwen35-4b --bin bench_serving
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test dflash_speculative_gate -- --nocapture --test-threads=1
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test speculative_verify -- --nocapture --test-threads=1
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test hf_golden_gate -- --nocapture
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test e2e_scheduler -- --nocapture

The updated docs were checked for private paths, SSH hosts, credentials, and token-like strings.

Benchmark

Same host, same source snapshot (8cd46cb), in-process bench_serving request, greedy synthetic distinct prompts, output_len=256, warmup 3, iters 8.

| Prompt | Concurrency | Baseline tok/s | DFlash tok/s | Delta | Baseline effective TPOT p50 | DFlash effective TPOT p50 | Baseline raw ITL p99 | DFlash raw ITL p99 | Baseline TTFT p50 | DFlash TTFT p50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 152.225 | 150.792 | -0.94% | 6.569 ms | 6.632 ms | 6.642 ms | 6.663 ms | 9.126 ms | 9.223 ms |
| 1 | 4 | 112.028 | 129.303 | +15.42% | 8.905 ms | 8.523 ms | 8.986 ms | 23.210 ms | 39.734 ms | 33.406 ms |
| 1 | 8 | 92.073 | 110.319 | +19.82% | 10.836 ms | 8.996 ms | 10.908 ms | 34.054 ms | 71.238 ms | 64.958 ms |
| 1 | 16 | 66.781 | 75.236 | +12.66% | 14.936 ms | 14.351 ms | 15.117 ms | 62.012 ms | 135.142 ms | 128.442 ms |
| 1024 | 1 | 138.754 | 138.032 | -0.52% | 7.207 ms | 7.244 ms | 7.281 ms | 7.275 ms | 46.991 ms | 46.906 ms |
| 1024 | 4 | 102.211 | 344.014 | +236.57% | 9.875 ms | 3.015 ms | 9.693 ms | 19.854 ms | 154.986 ms | 138.428 ms |
| 1024 | 8 | 82.904 | 242.829 | +192.90% | 12.165 ms | 4.125 ms | 54.483 ms | 28.213 ms | 266.918 ms | 231.595 ms |
| 1024 | 16 | 59.390 | 86.722 | +46.02% | 16.978 ms | 11.717 ms | 60.756 ms | 55.041 ms | 497.441 ms | 417.980 ms |
| 4096 | 1 | 110.688 | 109.782 | -0.82% | 9.035 ms | 9.101 ms | 9.110 ms | 9.113 ms | 191.241 ms | 191.210 ms |
| 4096 | 4 | 80.516 | 231.923 | +188.05% | 12.792 ms | 4.676 ms | 57.662 ms | 23.121 ms | 644.821 ms | 573.805 ms |
| 4096 | 8 | 63.302 | 89.473 | +41.34% | 16.204 ms | 11.663 ms | 60.615 ms | 37.011 ms | 1113.506 ms | 951.100 ms |
| 4096 | 16 | 44.292 | 55.315 | +24.89% | 23.181 ms | 17.808 ms | 65.180 ms | 66.158 ms | 2078.919 ms | 1708.289 ms |

effective_tpot_ms is the amortized per-request decode time. Raw token-event ITL can spike under speculative decode because accepted spans emit multiple tokens in one scheduler step; keep both metrics visible when reviewing tails. Single-concurrency runs are flat to slightly negative, so the intended performance claim is multi-active throughput.
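To make the difference between the two metrics concrete, here is a minimal Rust sketch. It assumes effective TPOT is (last emit time - first emit time) / (tokens - 1) and that raw ITL is the gap between consecutive distinct emission events; `TokenTrace` and its methods are illustrative names, not the `bench_serving` implementation.

```rust
/// Emission timestamps (ms since request start), one per output token.
/// Tokens accepted in the same scheduler step share one timestamp.
pub struct TokenTrace {
    pub emit_ms: Vec<f64>,
}

impl TokenTrace {
    /// Time to first token.
    pub fn ttft_ms(&self) -> Option<f64> {
        self.emit_ms.first().copied()
    }

    /// Amortized decode time per token (assumed definition).
    pub fn effective_tpot_ms(&self) -> Option<f64> {
        let n = self.emit_ms.len();
        if n < 2 {
            return None;
        }
        Some((self.emit_ms[n - 1] - self.emit_ms[0]) / (n as f64 - 1.0))
    }

    /// Gaps between consecutive distinct emission events. A burst of tokens
    /// counts as one event, so the gap before it covers the whole step.
    pub fn raw_event_gaps_ms(&self) -> Vec<f64> {
        self.emit_ms
            .windows(2)
            .map(|w| w[1] - w[0])
            .filter(|gap| *gap > 0.0)
            .collect()
    }
}
```

For a plain-decode trace `[10, 17, 24, 31, 38]` both views agree on 7 ms. For a speculative trace `[10, 30, 30, 30, 30]`, where four tokens arrive in one step, the effective TPOT is 5 ms while the only raw gap is 20 ms, which is the tail effect described above.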

Profile

Collected traces with nsys profile --trace=cuda,nvtx --cuda-graph-trace=node for both baseline and DFlash.

Representative stats:

- prompt=1,c=8: gated_delta_rule_decode_kernel total dropped from 2.04s to 1.53s
- prompt=1024,c=8: batch decode attention dropped from 550.2ms to 71.8ms; batched prefill verify cost was 49.6ms
- prompt=4096,c=16: batch decode attention dropped from 2.44s to 1.68s; batched prefill verify cost was 537.9ms

The previous per-request verifier overhead is removed by the batched verifier path, and commit, replay, and copy work is not the main bottleneck in the final profile.

Boundaries

- DFlash is default-off and requires --dflash-draft-model-path.
- The main performance win is for multi-active decode, especially c4/c8/c16.
- Raw streaming ITL p99 can spike because accepted speculative tokens are emitted in bursts; effective TPOT and throughput are the primary performance metrics for this path.
- DFlash uses extra memory for draft weights, draft KV, verify buffers, scratch, and request headroom.
- Unsupported modes use normal decode or fail closed instead of partially enabling speculation.
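The "normal decode or fail closed" rule can be sketched as a single gate that never returns a half-enabled state. Everything below is hypothetical: the `Mode` fields, the `Decision` enum, and in particular which modes fall back and which are rejected are assumptions made for illustration, since the PR description does not spell out that mapping.

```rust
/// Hypothetical flags describing the engine and the current request.
#[derive(Debug, Default, Clone, Copy)]
pub struct Mode {
    pub draft_loaded: bool,
    pub greedy: bool,
    pub logprobs: bool,
    pub lora: bool,
    pub multi_device: bool,
    pub kv_offload: bool,
    pub decode_overlap: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Speculate,
    NormalDecode,
    /// Fail closed, naming the incompatible feature.
    Reject(&'static str),
}

pub fn decide(m: &Mode) -> Decision {
    // No draft model: the default decode path is untouched.
    if !m.draft_loaded {
        return Decision::NormalDecode;
    }
    // ASSUMPTION: engine-level features are rejected outright.
    for (on, name) in [
        (m.lora, "LoRA"),
        (m.multi_device, "multi-device"),
        (m.kv_offload, "KV offload"),
        (m.decode_overlap, "decode overlap"),
    ] {
        if on {
            return Decision::Reject(name);
        }
    }
    // ASSUMPTION: request-level features fall back to normal decode.
    if !m.greedy || m.logprobs {
        return Decision::NormalDecode;
    }
    Decision::Speculate
}
```

The point of the shape is that every request gets exactly one of three outcomes, so there is no code path where speculation is enabled for part of an unsupported configuration.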

Copilot

Pull request overview

This PR adds an opt-in DFlash speculative decoding path for the Qwen3.5-4B model line, activated via --dflash-draft-model-path, while keeping the default Qwen3.5 decode behavior unchanged when no draft model is provided. It also introduces supporting infrastructure for batched target verification, hybrid-state commit/rollback, additional benchmarking metrics, and documentation/tests to gate correctness and performance.

Changes:

- Wire --dflash-draft-model-path through openinfer-server into the Qwen3.5 engine, with explicit incompatibility checks for unsupported modes.
- Implement Qwen3.5 DFlash draft loading, fixed-buffer batched verification, and hybrid-state (KV + recurrent + conv) commit/rollback in the model scheduler/executor paths.
- Extend bench_serving reporting with effective_tpot_ms and optional full token traces, plus add new Qwen3.5 speculative verification and losslessness gate tests and documentation.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| openinfer-server/src/main.rs | Adds Qwen3.5 DFlash flag gating and forwards draft path into the Qwen3.5 launcher with compatibility checks. |
| openinfer-server/src/config.rs | Updates CLI arg help/validation to allow DFlash for Qwen3.5 and adjusts decode-overlap enum derivations. |
| openinfer-server/src/bin/bench_serving/runners.rs | Adds `effective_tpot_ms` aggregation and rendering. |
| openinfer-server/src/bin/bench_serving/report.rs | Extends bench report schema with `effective_tpot_ms` and optional full token trace field. |
| openinfer-server/src/bin/bench_serving/metrics.rs | Adds env-gated “full token trace” capture to bench output. |
| openinfer-server/src/bin/bench_serving/main.rs | Wires DFlash draft path into bench engine start for Qwen3 and Qwen3.5 and updates metric tests. |
| openinfer-server/src/bin/bench_serving/exec.rs | Refactors scheduler bench execution to drain `TokenStreamReceiver` directly and collect timing components for new metrics. |
| openinfer-server/src/bin/bench_serving/cli.rs | Adds `--dflash-draft-model-path` to the bench CLI. |
| openinfer-qwen35-4b/tests/speculative_verify.rs | Adds targeted executor-level speculative verify state correctness tests. |
| openinfer-qwen35-4b/tests/dflash_speculative_gate.rs | Adds scheduler-level losslessness gate for Qwen3.5 DFlash under greedy decode and concurrency. |
| openinfer-qwen35-4b/src/weights.rs | Adds DFlash-aware memory reservation to KV sizing and exposes tied projection/embedding helpers for drafter integration. |
| openinfer-qwen35-4b/src/verify_buffers.rs | Introduces fixed scratch buffers for batched Qwen3.5 verification. |
| openinfer-qwen35-4b/src/unified_forward.rs | Adds prefill hidden-state capture plumbing for DFlash feature extraction. |
| openinfer-qwen35-4b/src/speculative.rs | Adds executor-level speculative verify path with accept logic and commit/replay handling. |
| openinfer-qwen35-4b/src/scheduler.rs | Implements scheduler-level DFlash speculative decode, capture, verify, commit, and fallback logic. |
| openinfer-qwen35-4b/src/recurrent.rs | Relaxes scratch size assertions to permit reuse of larger scratch allocations. |
| openinfer-qwen35-4b/src/recurrent_state.rs | Adds D2D copy helper for recurrent+conv state to support commit/rollback. |
| openinfer-qwen35-4b/src/prefill.rs | Refactors prefill to support capture and adds batched verification forward (`prefill_verify_into`). |
| openinfer-qwen35-4b/src/prefill_buffers.rs | Adds row-capacity tracking and helpers for resizing views over preallocated scratch. |
| openinfer-qwen35-4b/src/ops.rs | Re-exports additional copy helpers used by verification/capture paths. |
| openinfer-qwen35-4b/src/lib.rs | Wires DFlash draft path into engine startup and re-exports speculative/diagnostic types. |
| openinfer-qwen35-4b/src/kernel_plan.rs | Updates kernel plan docstrings to reflect refactored prefill entrypoint naming. |
| openinfer-qwen35-4b/src/executor.rs | Exposes state needed for speculative verification and adds debug state summaries. |
| openinfer-qwen35-4b/src/dflash/reservation.rs | Adds draft-memory reservation accounting to inform target KV sizing before loading draft weights. |
| openinfer-qwen35-4b/src/dflash/loading.rs | Implements safetensors loading for the DFlash draft model and rope precompute. |
| openinfer-qwen35-4b/src/dflash/config.rs | Adds draft config parsing/validation against the target model config. |
| openinfer-qwen35-4b/src/dflash.rs | Implements DFlash draft forward (batched dense ops + per-request varlen ops) and request state management. |
| openinfer-qwen35-4b/src/decode_buffers.rs | Adds captured hidden buffer storage to decode buffers for DFlash context capture. |
| openinfer-qwen35-4b/src/batch_decode.rs | Adds decode hidden capture plumbing and separates capture vs non-capture CUDA graph state. |
| openinfer-qwen35-4b/src/batch_decode_graph.rs | Adds capture-specific CUDA graphs and recurrent-state copy helpers. |
| openinfer-qwen35-4b/Cargo.toml | Registers the new DFlash losslessness gate test under feature gating. |
| openinfer-kernels/src/ops/attention.rs | Adds causal-window single-prefill attention wrapper. |
| openinfer-kernels/src/ops.rs | Re-exports the new causal-window attention op. |
| openinfer-kernels/src/ffi/shared.rs | Declares FFI for the new causal-window prefill kernel. |
| openinfer-kernels/src/ffi/qwen35.rs | Declares FFI for batched Qwen3.5 prefill attention prep kernel. |
| openinfer-kernels/csrc/shared/paged_attention.cu | Implements the new causal-window single-prefill kernel wrapper. |
| openinfer-kernels/csrc/qwen35/prefill_attention_hd256.cu | Implements batched Qwen3.5 QK norm + RoPE + KV write kernels for verification. |
| openinfer-core/src/page_pool.rs | Adds `OwnedPagePermit::truncate` to return tail pages back to the pool. |
| openinfer-core/src/ops.rs | Re-exports the new causal-window attention op at the core layer. |
| openinfer-core/src/kv_pool.rs | Adds `KvState::truncate_to` for rolling back logical KV length and returning pages. |
| docs/models/qwen35/roadmap.md | Updates Qwen3.5 roadmap TL;DR and adds DFlash entry. |
| docs/models/qwen35/dflash-speculative-decoding.md | Adds new doc describing enabling, contract, validation, benchmarks, and boundaries for Qwen3.5 DFlash. |
| docs/index.md | Adds routing entry for the new DFlash Qwen3.5 doc and updates roadmap summary row. |
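The table mentions `KvState::truncate_to` and `OwnedPagePermit::truncate` for rolling back logical KV length and returning tail pages. The general idea can be sketched as follows; `PagedKv`, its fields, and the plain `Vec<u32>` free list are hypothetical stand-ins, not the openinfer-core types.

```rust
/// Hypothetical paged-KV bookkeeping for one request (illustrative only).
pub struct PagedKv {
    page_size: usize,
    len: usize,      // logical number of cached tokens
    pages: Vec<u32>, // ids of pages owned by this request
}

impl PagedKv {
    pub fn new(page_size: usize) -> Self {
        assert!(page_size > 0);
        PagedKv { page_size, len: 0, pages: Vec::new() }
    }

    /// Number of pages needed to hold `tokens` entries.
    fn pages_for(&self, tokens: usize) -> usize {
        (tokens + self.page_size - 1) / self.page_size
    }

    /// Grow by `tokens`, taking pages from `free`. Returns false and changes
    /// nothing when the pool cannot cover the request.
    pub fn append(&mut self, tokens: usize, free: &mut Vec<u32>) -> bool {
        let needed = self.pages_for(self.len + tokens);
        let extra = needed.saturating_sub(self.pages.len());
        if free.len() < extra {
            return false;
        }
        for _ in 0..extra {
            self.pages.push(free.pop().expect("availability checked above"));
        }
        self.len += tokens;
        true
    }

    /// Roll the logical length back to `new_len` and return whole tail pages
    /// to `free`. Returns how many pages were released.
    pub fn truncate_to(&mut self, new_len: usize, free: &mut Vec<u32>) -> usize {
        assert!(new_len <= self.len, "truncate_to cannot grow the sequence");
        let keep = self.pages_for(new_len);
        let released = self.pages.len() - keep;
        free.extend(self.pages.drain(keep..));
        self.len = new_len;
        released
    }

    pub fn len(&self) -> usize { self.len }
    pub fn page_count(&self) -> usize { self.pages.len() }
}
```

In a speculative step, a verify pass would first append the whole candidate span and, on a partial accept, truncate back to the accepted length so that pages holding only rejected tokens go back to the pool.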

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a74cd58ae6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

xiaguan · 2026-07-02T15:09:06Z

Heads-up before we dig into a proper review — sharing how we measure DFlash / speculative-decode performance on Qwen3-4B, since the benchmark table here shows a notably different shape and it may help to have our methodology as reference.

Our measurement harness (Qwen3-4B DFlash)

1. Single-stream latency A/B — openinfer-qwen3/tests/dflash_speculative_perf.rs

bs=1, fixed 256-token budget, ignore_eos, one warm-up discarded, spec OFF vs ON, reports tok/s and the speedup ratio. This is the canonical "does spec decode actually help" measurement: plain decode is memory-bound (one target forward per token), spec amortizes that forward over the accepted run. Qwen3-4B measures 1.82× on 5070 Ti, 1.56× on 5090. The harness asserts only that spec is not catastrophically slower (a guard against the draft mispredicting everything) — the real signal is the printed speedup number (--nocapture).

2. Losslessness gate — openinfer-qwen3/tests/dflash_speculative_gate.rs

Greedy spec must be lossless, but an exact spec == baseline token match is the wrong gate. The verify path runs the prefill attention kernel over the K+1 span while plain decode runs the decode kernel, and the two differ by ~1 bf16 ULP. That is enough to flip an argmax on a near-tie, after which two greedy runs diverge completely. We use a regret test instead (same idea as hf_golden_gate). At the first position where the two sequences disagree (they still share an identical context there, so the comparison is valid), we measure how far below the argmax the speculative pick sits in the prefill kernel's own distribution, obtained by re-prefilling the shared context. Within MARGIN_TOL = 0.20 ⇒ benign numerical tie; clearly worse (or outside the prefill top-K) ⇒ a real verify/accept/capture bug. A systematic bug corrupts the non-tie positions too, so it cannot hide behind the tie band.
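A minimal Rust sketch of the regret check described above, under stated assumptions: `regret_check`, `Verdict`, and the `prefill_scores_at` oracle are hypothetical names, the scores are treated as a full-vocabulary vector rather than a top-K list, and the unit of the margin (logits, log-probs, or probabilities) is left open because the comment does not specify it.

```rust
pub const MARGIN_TOL: f32 = 0.20;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// No disagreement on the compared prefix.
    Identical,
    /// First divergence is within the tolerance band.
    BenignTie { pos: usize, regret: f32 },
    /// First divergence is clearly worse than the argmax, or not scored at all.
    RealDivergence { pos: usize },
}

/// `prefill_scores_at(pos)` stands in for re-prefilling the shared context
/// and returning the prefill kernel's score per token id at position `pos`.
pub fn regret_check<F>(baseline: &[u32], spec: &[u32], mut prefill_scores_at: F) -> Verdict
where
    F: FnMut(usize) -> Vec<f32>,
{
    // Only the first disagreement is comparable: both runs still share context.
    let pos = match baseline.iter().zip(spec).position(|(b, s)| b != s) {
        Some(p) => p,
        None => return Verdict::Identical,
    };
    let scores = prefill_scores_at(pos);
    let best = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    match scores.get(spec[pos] as usize) {
        Some(&picked) if best - picked <= MARGIN_TOL => {
            Verdict::BenignTie { pos, regret: best - picked }
        }
        // Too far below the argmax, or the pick has no score (out of range).
        _ => Verdict::RealDivergence { pos },
    }
}
```

With scores `[1.0, 3.0, 2.9, 0.5]` at the first divergent position, a speculative pick of token 2 sits 0.1 below the best score and is treated as a tie, while a pick of token 3 sits 2.5 below and is flagged as a real divergence.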

The shape contrast worth noting

On Qwen3-4B (full attention), single-stream was the direct win and concurrent throughput initially inverted it — the per-request serial draft loop was launch-bound (24.8 ms/step at batch 16 was almost all kernel-launch overhead; a skip-attention A/B showed attention compute <2%). The fix was (a) batching the dense draft forward into one N×block pass, then (b) a piecewise verify CUDA Graph (dense ops captured, attention eager). Only after both landed did c1/c8/c16 all reach or beat vLLM.

This PR's table shows the opposite: single-concurrency is flat to negative (−0.5% to −0.9%), with wins concentrated at c4/c8/c16. Qwen3.5 is hybrid (24 linear + 8 full attention), so the baseline decode is less memory-bound than Qwen3-4B's full-attention path. That would explain why the single-stream amortization buys less here and why the win shows up under concurrent batch pressure instead. It would help the doc to state explicitly whether the single-stream regression is expected for the architecture, or whether there is a launch-overhead component (like the one we hit) that a graph pass could recover. On Qwen3-4B the concurrent win only appeared after we killed the per-request serial draft loop and captured the verify graph, and the c1 gap was ~84% dense-op kernel-launch overhead.

Two concrete suggestions, neither a review blocker:

- Consider adding a single-stream A/B in the same style as dflash_speculative_perf.rs (bs=1, fixed budget, spec OFF vs ON speedup ratio) so the "single-stream is not the win" claim is a measured number rather than a row in the concurrent sweep.
- The effective_tpot_ms vs raw ITL p99 distinction you call out is exactly right: on Qwen3-4B we hit the same burst-emission effect. Worth keeping both visible in any tail-latency gate.
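The arithmetic behind such a single-stream A/B is small enough to sketch. The helper names and the guard floor below are hypothetical and are not taken from dflash_speculative_perf.rs; the sketch only shows one discarded warm-up, one timed run, and a speedup ratio that is reported rather than tightly asserted.

```rust
use std::time::Instant;

/// Run `generate` once as a discarded warm-up, then time a second run.
/// `generate` returns the number of tokens it produced. Returns tok/s.
pub fn measure_tok_per_s<F: FnMut() -> usize>(mut generate: F) -> f64 {
    let _ = generate(); // warm-up, result discarded
    let start = Instant::now();
    let tokens = generate();
    tokens as f64 / start.elapsed().as_secs_f64()
}

/// Speedup of spec ON over spec OFF; values below 1.0 mean a regression.
pub fn speedup_ratio(baseline_tok_s: f64, spec_tok_s: f64) -> f64 {
    spec_tok_s / baseline_tok_s
}

/// Guard only against a catastrophic slowdown; `floor` is a hypothetical
/// threshold chosen by the caller, and the printed ratio is the real signal.
pub fn check_not_catastrophic(ratio: f64, floor: f64) -> Result<(), String> {
    if ratio >= floor {
        Ok(())
    } else {
        Err(format!("spec decode speedup {:.3} is below floor {:.3}", ratio, floor))
    }
}
```

Plugging in the c1, prompt=1 row from the PR's table (152.225 vs 150.792 tok/s) gives a ratio of about 0.99, which would pass a loose guard while still making the flat single-stream result visible as a number.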

Not a review — just flagging the methodology and the contrast so it is on the radar before we look at the code.

xiaguan · 2026-07-02T15:09:43Z

Follow-up on my earlier comment — I missed the serving-level bench scripts, which are the other half of how we measure DFlash on Qwen3-4B (besides the two tests/*.rs harnesses).

Serving-level A/B scripts (`tools/bench/`)

run_serving_bench.sh — one-shot end-to-end: builds + launches an openinfer (or vLLM) server, waits for readiness, runs a QPS sweep and a DSpark/DFlash concurrency sweep, then summarizes. The key flags for spec decode:

# openinfer Qwen3-4B + DSpark, concurrency sweep only (single-stream A/B is the .rs test)
MODEL=/data/Qwen3-4B DRAFT_MODEL=/data/dspark_qwen3_4b_block7 GPU=7 \
  QPS_LIST="" CONCURRENCY_LIST="1 4 8" tools/bench/run_serving_bench.sh

DRAFT_MODEL wires --dflash-draft-model-path into the server; omit it for the plain-decode baseline. --temperature 0 is enforced by the script (non-greedy silently disables spec decoding — see dspark-integration.md Bug 1), and --percentile-metrics ttft,tpot,itl,e2el keeps the effective-TPOT vs raw-ITL split visible. Results land as one JSON per sweep point; summarize_qps_sweep.py prints the table.

qps_sweep.sh — the same vllm-bench driver, but against an already-running server (useful when you bring your own server lifecycle). Same --temperature 0 / --ignore-eos / --percentile-metrics defaults.

What the two layers cover

| Layer | Harness | What it measures |
| --- | --- | --- |
| single-stream latency | `tests/dflash_speculative_perf.rs` | bs=1 spec OFF vs ON tok/s + speedup ratio (the direct spec-decode win) |
| correctness | `tests/dflash_speculative_gate.rs` | greedy losslessness via regret check (not exact match: prefill vs decode kernel ~1 bf16 ULP) |
| concurrent throughput | `tools/bench/run_serving_bench.sh` (vllm-bench) | QPS + concurrency sweeps, effective TPOT vs raw ITL p99, against a real serving endpoint |

The reason both layers matter: on Qwen3-4B the single-stream harness showed the win (1.56–1.82×) while the concurrent sweep showed the inversion that forced the batched-draft + piecewise-verify-graph fix — the .rs perf test would have stayed green while the serving sweep went red. If Qwen3.5 DFlash has an equivalent serving-level sweep script (or could reuse these), it would close the same gap.

CAICAIIs · 2026-07-02T17:48:03Z

Thanks for the Qwen3 DFlash measurement notes. I updated the Qwen3.5 docs/tests to make the same boundaries explicit: regret-oracle handling for bf16 near-ties, visible TTFT/effective TPOT/raw ITL/output tok/s, and a clear claim that Qwen3.5 DFlash is a c4/c8/c16 throughput win while c1 is flat/slightly negative.

I kept the benchmark claim scoped to in-process serving evidence; a full HTTP pressure sweep with the serving scripts is still a useful follow-up before claiming broader serving-level performance.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87213d6207

CAICAIIs · 2026-07-03T02:38:35Z

I also opened follow-up issues (#513, #514, #515) for the remaining single-stream, HTTP serving, and c1 profiling work.

CAICAIIs requested a review from Copilot July 2, 2026 14:58

Copilot started reviewing on behalf of CAICAIIs July 2, 2026 14:58

Copilot AI reviewed Jul 2, 2026

Comment thread openinfer-qwen35-4b/src/verify_buffers.rs

Comment thread openinfer-server/src/config.rs

Comment thread openinfer-kernels/src/ops/attention.rs

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread openinfer-qwen35-4b/src/scheduler.rs Outdated

Comment thread openinfer-server/src/bin/bench_serving/exec.rs Outdated

CAICAIIs force-pushed the feat/qwen35-dflash branch from a74cd58 to e385429 July 2, 2026 15:07

CAICAIIs marked this pull request as draft July 2, 2026 15:10

CAICAIIs force-pushed the feat/qwen35-dflash branch from e385429 to 9b5f921 July 2, 2026 15:12

feat(qwen35): add batched dflash speculative decoding

20f6c7f

Signed-off-by: CAICAIIs <3360776475@qq.com>

CAICAIIs force-pushed the feat/qwen35-dflash branch from 9b5f921 to 20f6c7f July 2, 2026 15:19

test(qwen35): tighten dflash evidence gates

87213d6

Signed-off-by: CAICAIIs <3360776475@qq.com>

CAICAIIs marked this pull request as ready for review July 2, 2026 17:47

This was referenced Jul 2, 2026

- qwen35: add single-stream DFlash latency A/B gate #513 (Open)
- qwen35: run DFlash HTTP serving pressure sweep #514 (Open)
- qwen35: investigate DFlash c1 launch overhead and graph opportunities #515 (Open)

chatgpt-codex-connector Bot reviewed Jul 2, 2026

Comment thread openinfer-qwen35-4b/src/scheduler.rs

Comment thread openinfer-qwen35-4b/src/scheduler.rs Outdated

fix(qwen35): guard dflash scheduler edge cases

3ab00a4

Signed-off-by: CAICAIIs <3360776475@qq.com>

CAICAIIs wants to merge 3 commits into openinfer-project:main from CAICAIIs:feat/qwen35-dflash
