qwen35: add single-stream DFlash latency A/B gate #513

Description

@CAICAIIs

Background

PR #507 adds opt-in Qwen3.5 DFlash speculative decoding and documents that the measured win is concentrated in multi-active batches (c4/c8/c16), while single-concurrency c1 is flat to slightly slower.

The Qwen3 DFlash path already has a dedicated single-stream latency A/B harness (dflash_speculative_perf.rs). It uses a fixed 256-token budget and greedy decoding with EOS ignored, discards one warmup run, compares speculative decoding OFF vs ON, and prints the tok/s speedup. Qwen3.5 should have an equivalent harness so that c1 behavior is measured directly and stays reproducible.
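
As a rough illustration of the measurement discipline described above (fixed budget, EOS ignored, one discarded warmup), here is a minimal sketch. It is not the code in dflash_speculative_perf.rs: `timed_run`, `RunStats`, and the `generate` closure are hypothetical names, and the closure stands in for whatever engine call produces tokens with speculative decoding OFF or ON.

```rust
use std::time::Instant;

/// Result of one timed single-stream run (hypothetical shape).
pub struct RunStats {
    pub tokens: Vec<u32>,
    pub tok_per_sec: f64,
}

/// Times one generation of exactly `budget` tokens after one discarded warmup.
/// `generate` is a stand-in for the real engine call; it receives the token
/// budget and is assumed to decode greedily while ignoring EOS.
pub fn timed_run<F>(mut generate: F, budget: usize) -> RunStats
where
    F: FnMut(usize) -> Vec<u32>,
{
    // Warmup: absorbs one-time costs (allocation, kernel load); result discarded.
    let _ = generate(budget);

    let start = Instant::now();
    let tokens = generate(budget);
    // Clamp to avoid dividing by zero on a trivially fast stand-in generator.
    let secs = start.elapsed().as_secs_f64().max(1e-9);

    // With EOS ignored, a run that stops early means the setup is wrong.
    assert_eq!(tokens.len(), budget, "run did not fill the fixed token budget");

    let tok_per_sec = tokens.len() as f64 / secs;
    RunStats { tokens, tok_per_sec }
}
```

Calling `timed_run` once per arm with the same budget (for example 256) yields two tok/s figures that can be compared directly, since both arms decode the same number of tokens.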

Goal

Add a Qwen3.5 DFlash single-stream performance gate that reports baseline vs DFlash speed under the same fixed-token setup.

Suggested scope

  • Add a Qwen3.5-specific dflash_speculative_perf test or bench.
  • Use fixed output budget, greedy decoding, ignore-eos behavior, and one discarded warmup.
  • Print baseline tok/s, DFlash tok/s, speedup ratio, acceptance length/rate, and token sanity.
  • Keep the assertion conservative: fail only on catastrophic slowdown or invalid output, and leave the actual speedup visible in the --nocapture output.
  • Document the current Qwen3.5 expectation: c1 may be flat/slightly negative unless a later optimization recovers launch overhead.
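
The reporting and assertion policy in the list above could be sketched roughly as follows. Every name here (`Arm`, `SpecStats`, `Report`, `evaluate`) is hypothetical, and the metric definitions are assumptions: acceptance rate is taken as accepted / drafted tokens, and acceptance length as accepted tokens per verify step. The `min_speedup` floor is a caller-chosen parameter, not a value from this issue.

```rust
/// One arm of the A/B comparison: tokens produced and wall time in seconds.
pub struct Arm {
    pub tokens: usize,
    pub secs: f64,
}

impl Arm {
    pub fn tok_per_sec(&self) -> f64 {
        self.tokens as f64 / self.secs
    }
}

/// Counters from the speculative arm (assumed to be exposed by the engine).
pub struct SpecStats {
    pub drafted: usize,
    pub accepted: usize,
    pub verify_steps: usize,
}

pub struct Report {
    pub speedup: f64,
    pub acceptance_rate: f64,
    pub mean_accept_len: f64,
}

/// Prints the A/B numbers and fails only on invalid output or a slowdown
/// below `min_speedup`; a flat or slightly negative result still passes.
pub fn evaluate(
    base: &Arm,
    dflash: &Arm,
    spec: &SpecStats,
    budget: usize,
    min_speedup: f64,
) -> Result<Report, String> {
    // Token sanity: both arms must fill the same fixed budget.
    if base.tokens != budget || dflash.tokens != budget {
        return Err(format!(
            "token count mismatch: baseline={} dflash={} budget={}",
            base.tokens, dflash.tokens, budget
        ));
    }
    if base.secs <= 0.0 || dflash.secs <= 0.0 {
        return Err("non-positive wall time".to_string());
    }
    if spec.drafted == 0 || spec.verify_steps == 0 {
        return Err("DFlash arm recorded no speculative steps".to_string());
    }

    let report = Report {
        speedup: dflash.tok_per_sec() / base.tok_per_sec(),
        acceptance_rate: spec.accepted as f64 / spec.drafted as f64,
        mean_accept_len: spec.accepted as f64 / spec.verify_steps as f64,
    };

    // Always print, so the real speedup is visible under --nocapture.
    println!(
        "baseline {:.1} tok/s | dflash {:.1} tok/s | speedup {:.2}x | accept rate {:.2} | accept len {:.2}",
        base.tok_per_sec(),
        dflash.tok_per_sec(),
        report.speedup,
        report.acceptance_rate,
        report.mean_accept_len
    );

    if report.speedup < min_speedup {
        return Err(format!(
            "catastrophic slowdown: {:.2}x < {:.2}x",
            report.speedup, min_speedup
        ));
    }
    Ok(report)
}
```

Returning a `Result` instead of asserting inline keeps the printed numbers separate from the pass/fail decision, so a test can unwrap it while a bench can log the error and continue.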

Validation

  • cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test dflash_speculative_perf -- --nocapture --test-threads=1
  • Include same-host A/B output in this issue or in the follow-up PR.

Related: #434, #507.

Labels

  • enhancement: New feature or request
  • qwen35: Qwen3.5-4B model crate (pegainfer-qwen35-4b)