Background
PR #507 adds opt-in Qwen3.5 DFlash speculative decoding and documents that the measured win is concentrated in multi-active batches (c4/c8/c16), while single-concurrency c1 is flat to slightly slower.
The Qwen3 DFlash path already has a dedicated single-stream latency A/B harness (dflash_speculative_perf.rs): fixed 256-token budget, greedy / ignore-eos, one warmup discarded, spec OFF vs ON, and printed tok/s speedup. Qwen3.5 should have an equivalent harness so the c1 behavior is measured directly and remains reproducible.
Goal
Add a Qwen3.5 DFlash single-stream performance gate that reports baseline vs DFlash speed under the same fixed-token setup.
Suggested scope
- Add a Qwen3.5-specific dflash_speculative_perf test or bench.
- Use fixed output budget, greedy decoding, ignore-eos behavior, and one discarded warmup.
- Print baseline tok/s, DFlash tok/s, speedup ratio, acceptance length/rate, and token sanity.
- Keep the assertion conservative: fail only on catastrophic slowdown or invalid output, while leaving the actual speedup visible in --nocapture.
- Document the current Qwen3.5 expectation: c1 may be flat/slightly negative unless a later optimization recovers launch overhead.
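As a rough illustration of the scope above, the timing and gate logic could look something like the sketch below. Everything in it is hypothetical and not taken from the existing Qwen3 harness: the names RunStats, timed_run, and check_gate, the 0.5 slowdown floor, and the closure standing in for the real engine call are all assumptions. Acceptance length/rate reporting is left out because it depends on counters the engine would have to expose.

```rust
use std::time::{Duration, Instant};

/// Hypothetical floor for the gate: fail only if DFlash runs below half of
/// baseline throughput. The real threshold is a project decision.
const CATASTROPHIC_SPEEDUP_FLOOR: f64 = 0.5;

/// Outcome of one timed generation run.
#[derive(Debug, Clone, Copy)]
struct RunStats {
    tokens: usize,
    elapsed: Duration,
}

impl RunStats {
    /// Decode throughput in tokens per second.
    fn tok_per_sec(&self) -> f64 {
        self.tokens as f64 / self.elapsed.as_secs_f64()
    }
}

/// Calls `generate` twice: the first call is a warmup whose result is
/// discarded, the second is timed. `generate` stands in for the real engine
/// call (assumption) and returns the number of tokens it produced.
fn timed_run<F: FnMut() -> usize>(mut generate: F) -> RunStats {
    let _warmup = generate();
    let start = Instant::now();
    let tokens = generate();
    RunStats { tokens, elapsed: start.elapsed() }
}

/// Prints the A/B numbers and returns the speedup ratio. Errors only on
/// invalid output (wrong token count, non-finite ratio) or on a slowdown
/// below the catastrophic floor.
fn check_gate(baseline: RunStats, dflash: RunStats, budget: usize) -> Result<f64, String> {
    if baseline.tokens != budget || dflash.tokens != budget {
        return Err(format!(
            "token count mismatch: baseline={} dflash={} budget={}",
            baseline.tokens, dflash.tokens, budget
        ));
    }
    let speedup = dflash.tok_per_sec() / baseline.tok_per_sec();
    println!(
        "baseline: {:.2} tok/s | dflash: {:.2} tok/s | speedup: {:.3}x",
        baseline.tok_per_sec(),
        dflash.tok_per_sec(),
        speedup
    );
    if !speedup.is_finite() {
        return Err("speedup is not finite (zero elapsed time?)".to_string());
    }
    if speedup < CATASTROPHIC_SPEEDUP_FLOOR {
        return Err(format!("catastrophic slowdown: {:.3}x", speedup));
    }
    Ok(speedup)
}
```

With a gate shaped like this, a flat or slightly negative c1 result (for example a ratio around 0.95) still passes and is printed under --nocapture, while a wrong token count or a collapse in throughput fails the test.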
Validation
cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test dflash_speculative_perf -- --nocapture --test-threads=1
- Same-host A/B output included in the issue or follow-up PR.
Related: #434, #507.