PR: feat(qwen3): n-gram speculative decoding by wjinxu · Pull Request #349 · openinfer-project/openinfer

wjinxu · 2026-06-11T10:54:08Z

Adds lossless n-gram (prompt-lookup) speculative decoding for Qwen3, built on the method-agnostic speculative core that landed in #436 / #442 (merged into this branch). Off by default; enable with --ngram-speculative.

This started as a parallel speculative stack (its own speculative.rs, verify, KV lifecycle). It has been re-based onto #436's shared core: n-gram is now just a second propose implementation behind the same verify / accept / KV transaction — no duplicated controller.

The shape

Speculative decoding is one optimistic transaction; only propose differs by method:

Propose — produce up to K candidate tokens.
Verify — one target forward over the K + 1 span [current, draft_1..draft_K], argmax at every position.
Accept — accept_greedy keeps the longest prefix matching the target argmax, then appends the target's own token at the first mismatch (always commits 1..=K + 1 tokens).
Commit / roll back — accepted KV committed; the unused draft tail is LIFO-released.

The draft↔verify boundary is a pure token span. So accept_greedy / build_verify_results, execute_speculative_verify_impl, and the schedule/apply/revert_speculative KV transaction are reused unchanged. Only two things are method-specific: where the drafts come from, and whether verify captures hidden states.

What n-gram adds

Proposer (ngram.rs): NgramProposer — longest-suffix prompt-lookup over the request's own token history, no draft model. Tries the longest configured suffix first, falls back to shorter. + NgramConfig (from_env) + 8 unit tests.
Propose is host-side: execute_speculative_draft runs the lookup on the executor thread — no worker forward, no draft model, no readiness handshake. DFlash instead dispatches StepCommand::SpeculativeDraft to the lane.
No-capture verify: execute_ngram_verify reuses batch_prefill_into with zero capture layers (a token-only proposer keeps no model state to seed), then argmax + build_verify_results. The worker routes verify by lane.dflash.is_some() — the methods are mutually exclusive.
Context tracking (ngram_ctx): per-request running token context, seeded at prefill (prompt + first_token), appended at every commit (verify / plain decode / fused unified decode), dropped on retire. Always ends with the request's dangling token.
Method state (SpeculativeMethod enum): the two mutually-exclusive executor fields are collapsed into one Option<SpeculativeMethod>{ Dflash | Ngram } — single source of truth, no "assert not both set" smell. This is the pre-trait consolidation feat(qwen3): DFlash speculative decoding #436 deferred. A full SpeculativeProposer trait object is not introduced yet: DFlash's propose is a worker RPC (needs &mut executor + lane), so it can't be a self-contained proposer without borrow gymnastics. Enum dispatch is the honest shape until a third method (EAGLE, feat(qwen3): explore speculative decoding #325) makes a shared trait pay off.

Config

--ngram-speculative — enable. Single-GPU greedy only; mutually exclusive with --dflash-draft-model-path, --enable-lora, --kv-offload, --decode-overlap; requires --tp-size=1. Forces the prefix cache off (verify writes each request's own speculative KV span).
OPENINFER_QWEN3_NGRAM_TOKENS — draft count K (default 4).
OPENINFER_QWEN3_NGRAM_MAX_NGRAM — longest suffix to match (default 3).

Losslessness

Greedy speculation is lossless (the accepted tokens are the target's own greedy continuation), which is a built-in oracle: spec-on must equal spec-off token-for-token (modulo the benign prefill-vs-decode bf16 tie flip).

GPU-validated on RTX 4090 (Qwen3-4B), tests/ngram_speculative_gate.rs, 3/3 pass, every prompt 100% token-identical to plain greedy:

ngram_speculative_greedy_matches_plain_greedy — bs=1, 5 prompts.
ngram_concurrent_heterogeneous_is_lossless — 4 concurrent, staggered budgets (bs>1 verify-span path).
ngram_mixed_greedy_and_sampling_is_lossless — a greedy request stays identical while a sampling request holds the batch off the speculative path.

A length-1 verify span (n-gram "no match") already degrades to a plain single-token decode in build_verify_results, so a miss needs no special fallback.

Files

File	Change
`openinfer-qwen3-4b/src/ngram.rs`	New: `NgramProposer` + `NgramConfig` + 8 unit tests
`openinfer-qwen3-4b/src/executor.rs`	`SpeculativeMethod` enum; `load_ngram_drafter`; host-side `execute_speculative_draft`; `execute_ngram_verify`; `ngram_ctx` lifecycle (prefill / decode / unified / drop)
`openinfer-qwen3-4b/src/executor/spec.rs`	Generalize the verify transaction to be method-agnostic (relax DFlash-only asserts; append accepted tokens to `ngram_ctx`)
`openinfer-qwen3-4b/src/verify_graph.rs`	`span()` accessor for the no-capture verify-buffer sizing
`openinfer-qwen3-4b/src/scheduler.rs`, `src/lib.rs`, `openinfer-server/src/{config,main}.rs`	`--ngram-speculative` wiring + mutual-exclusion validation
`openinfer-qwen3-4b/tests/ngram_speculative_gate.rs`	New: engine-level regret-based losslessness gate (3 tests)

Sits on top of main after merging #436 / #442, so the diff is the n-gram addition, not a re-derivation of the shared core.

Stateless prompt-lookup proposer: finds the most recent earlier occurrence of the current token suffix in the running context and proposes the tokens that followed it, longest-suffix-first with shorter-suffix fallback. No draft model. This is the proposer half of n-gram speculative decoding; verification and KV rollback in the decode path are a follow-up. Pure-CPU and unit-tested (recent-match continuation, longest-suffix preference, fallback, no-match, K cap, short context, config clamping). Co-authored-by: Cursor <cursoragent@cursor.com>

Adds the verification-side brain of n-gram speculative decoding: `accept_greedy` commits the longest proposed prefix the target model agrees with plus one model token (correction or bonus), so a verify step always makes >=1 token of progress; `num_accepted` reports how many speculative KV positions to keep on rollback. Also adds `SpeculativeConfig` (disabled by default). Pure-CPU and unit-tested. The GPU verify forward (multi-token decode returning per-position logits) and the executor / kvbm KV-rollback wiring are a follow-up. Co-authored-by: Cursor <cursoragent@cursor.com>

kvbm's SchedulableSequence already implements speculative decode (schedule_speculative + apply_speculative with automatic LIFO release of rejected drafts' blocks), but RequestKv only wrapped the single-token decode path. Add thin pass-throughs so the Qwen3 decode path can schedule a speculative step and commit the accepted prefix; KV rollback for rejected drafts is handled by kvbm, not the caller. The verify forward reuses prefill_view(1 + num_draft_tokens). Unit-tested: accepted-prefix commit advances kv_position by accepted.len(); over-commit past the scheduled budget is rejected. Co-authored-by: Cursor <cursoragent@cursor.com>

Captures the n-gram (prompt-lookup) speculative-decode design for Qwen3-4B: per-step data flow, the lossless greedy-acceptance correctness oracle, the token/KV accounting against kvbm's verified speculative contract, the layering mirrored from vLLM V1 (proposer in the scheduler layer, GPU-side argmax, scheduler reserves KV), and the remaining GPU verify-forward + executor wiring. Records the building blocks already landed (proposer, acceptance, RequestKv speculative pass-throughs) and the open interface questions. Co-authored-by: Cursor <cursoragent@cursor.com>

Add the GPU verify step for n-gram speculative decode, as an additive path that leaves prefill/decode/unified untouched: - LocalQwen3Lane::execute_verify runs the model on [dangling, drafts..] over the request's existing paged KV by reusing batch_prefill(echo=true) for the per-position logits, then takes a GPU argmax (argmax_batch_bf16_into) and copies back only the per-position token ids. - New StepCommand::SpeculativeVerify / WorkerStepOutcome::Speculative carry it through the rank-worker channel (TP-safe: workers run the forward, primary returns the argmax). - Qwen3Executor::execute_speculative orchestrates one request's step: schedule_speculative(K+1) -> verify -> accept_greedy -> apply_speculative (kvbm releases rejected drafts' blocks). Returns the committed tokens. - SpeculativeStepItem carries the dangling token + drafts; exported via runtime. Scheduler wiring (proposer invocation over token history) and GPU validation of the lossless spec-on == spec-off property are the remaining steps. Co-authored-by: Cursor <cursoragent@cursor.com>

Move the verify forward + executor step from "remaining" to "landed"; the remaining work is scheduler wiring and GPU validation of the lossless property. Co-authored-by: Cursor <cursoragent@cursor.com>

End-to-end test on real Qwen3-4B weights: greedy n-gram speculative decode (proposer -> schedule_speculative -> verify forward -> accept_greedy -> apply_speculative) produces a token-identical sequence to plain greedy decode. Prefix cache disabled for cold, identically-shaped runs; repetitive prompt keeps the greedy argmax well-separated despite the verify (prefill kernel) vs reference (decode kernel) path difference. Skips cleanly without GPU/weights. Passes on RTX (sm_89), Qwen3-4B: 24-token greedy continuation matches exactly. Co-authored-by: Cursor <cursoragent@cursor.com>

…code Co-authored-by: Cursor <cursoragent@cursor.com>

Add execute_speculative to ModelExecutor (default: unsupported) so a generic scheduler can drive speculative decode; Qwen3Executor's trait method delegates to the existing inherent body. No behaviour change — nothing calls it yet. Co-authored-by: Cursor <cursoragent@cursor.com>

Add the serving-loop integration behind a (default-disabled) SpeculativeConfig: - ActiveRequestState gains token_history (prompt + generated), maintained on promote (resolve) and each committed decode token (effects). - New scheduler/speculative.rs: speculative_decode_step proposes drafts per active request, runs execute_speculative (or falls back to a single decode when there is no draft), then streams the committed tokens applying stop / max-token handling — kept isolated from the one-token-per-step plan/resolve/ effects pipeline. - scheduler_loop routes pure-decode ticks through it when enabled. Mock-tested (FakeExecutor.execute_speculative): a step streams every committed token and advances state; commits past max_tokens truncate, finish, and retire. Enabling it from the public API (a config knob) is a follow-up. Co-authored-by: Cursor <cursoragent@cursor.com>

…+ perf Co-authored-by: Cursor <cursoragent@cursor.com>

Add SpeculativeConfig::from_env (OPENINFER_QWEN3_NGRAM_SPEC[_TOKENS|_MAX_NGRAM]) and have scheduler_loop build the config from it instead of hardcoding off, so the speculative decode path can be turned on end-to-end. Log once on enable. Co-authored-by: Cursor <cursoragent@cursor.com>

Add an ignored GPU benchmark that times greedy vs. speculative decode and reports forward-pass count, accepted-per-verify, and wall-clock speedup. On the periodic synthetic prompt (best case) it shows 3.96x / 5.00 tokens per verify; document the numbers and the best-case caveat. Co-authored-by: Cursor <cursoragent@cursor.com>

Introduce SpeculativeProposer (propose(&[u32]) -> Vec<u32>) as the single axis of variation between speculative methods, with NgramProposer implementing it. SpeculativeConfig now carries a SpeculativeMethod enum + build_proposer factory; scheduler_loop builds one boxed proposer and speculative_decode_step takes &dyn SpeculativeProposer. Verify forward, KV rollback, greedy acceptance, and the streaming step are reused unchanged. Acceptance (greedy-only) and the token-only proposer context are left as documented future seams. Co-authored-by: Cursor <cursoragent@cursor.com>

…action The module-level doc still described the file as ngram-specific glue, which contradicts the SpeculativeProposer trait it now defines. Rewrite it to list the three method-agnostic pieces (proposer trait, config/factory, greedy acceptance). Co-authored-by: Cursor <cursoragent@cursor.com>

SpeculativeConfig::from_env was hardcoding the n-gram-specific OPENINFER_QWEN3_ NGRAM_SPEC_* variables, so the generic config knew the proposer's knobs. Move those to NgramConfig::from_env (the proposer owns its own parameters) and leave only the generic switch on SpeculativeConfig. Rename the switch to the generic OPENINFER_QWEN3_SPEC and the n-gram knobs to OPENINFER_QWEN3_NGRAM_{TOKENS, MAX_NGRAM}; a OPENINFER_QWEN3_SPEC_METHOD dispatch slots in when a second proposer lands. Co-authored-by: Cursor <cursoragent@cursor.com>

Address review of the speculative abstraction: - Stop over-promising. The module/trait/docs claimed a method-agnostic core reused unchanged; in reality the verify forward returns argmax (greedy- specific), accept_greedy is greedy-only, and the scheduler step assumes a stateless proposer. Rewrite to say this is a greedy, n-gram-sized seam and spell out what a draft-model/EAGLE proposer would additionally need. - Down-level describe(): drop SpeculativeConfig::describe and add Display on SpeculativeMethod + NgramConfig, so a method owns formatting of its own fields. scheduler logs spec_config.method. - num_accepted: now an internal helper for accept_greedy (dedup the accept loop), demoted from pub since the executor uses the committed length. Co-authored-by: Cursor <cursoragent@cursor.com>

Two robustness fixes for the speculative decode step: - Transactional schedule: execute_speculative now reverts the schedule_speculative KV reservation (new RequestKv::revert_schedule) if the verify/commit body fails, so a mid-step error leaves the sequence idle for the next step instead of stranded half-scheduled. - Budget cap: the scheduler caps the draft count to the request's remaining token budget, so a verify commits at most up to max_tokens. Otherwise schedule_speculative clamps to the budget and the larger accepted run is rejected. Test renamed/updated to assert the cap (drafts capped, commit lands exactly on the limit) and the docs drop the now-private num_accepted. Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f6d3e45b6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Addresses Codex review on openinfer-project#349. With OPENINFER_QWEN3_SPEC on, the scheduler routed every pure-decode tick through the speculative path regardless of each request's SamplingParams: a temperature>0 request was forced to argmax and a logprobs request silently lost its decode logprobs. speculative_decode_step now decides per request: only greedy (SamplingParams::is_greedy()) requests that ask for no decode logprobs take the speculative step; everyone else takes a normal sampled single-token decode with their own params, logprobs, and a fresh random_val. Factored emit_one out of apply_committed so both paths share the stop/max-token handling. Adds two tests (sampled and logprobs requests bypass speculation) and documents the gate. Co-authored-by: Cursor <cursoragent@cursor.com>

…infer-project#442) into n-gram branch Adopt openinfer-project#436's method-agnostic speculative core (speculative.rs draft/verify seam, executor/spec.rs, KV schedule/apply/revert in pool.rs) wholesale, discarding openinfer-project#349's now-superseded parallel scaffolding (its own speculative.rs, scheduler/speculative.rs, executor verify path). The reusable n-gram IP (ngram.rs prompt-lookup + tests, design doc) is kept unreferenced, to be re-based onto openinfer-project#436's proposer seam in follow-up commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ect#436 core Plug host-side n-gram (prompt-lookup) speculation into the method-agnostic speculative core landed by openinfer-project#436, instead of openinfer-project#349's parallel stack. The shared verify/accept/KV transaction (speculative.rs, executor/spec.rs, pool.rs) is reused as-is; only the propose step and a no-capture verify forward are added. - ngram.rs: keep the suffix-lookup proposer + unit tests; drop the obsolete SpeculativeProposer impl (the trait is gone — openinfer-project#436 kept the core concrete). - executor: NgramProposer + per-request token context (ngram_ctx), maintained at prefill (prompt + first token), verify-commit, and decode; dropped on retire. execute_speculative_draft proposes host-side (no worker forward); execute_speculative_verify reuses the shared KV transaction with the DFlash readiness/capture asserts relaxed. - executor lane: execute_ngram_verify runs batch_prefill_into with zero capture layers (no hidden-state capture), argmax, build_verify_results. Verify is routed by lane.dflash.is_some() (the methods are mutually exclusive). - load_ngram_drafter + speculative_enabled/request_ready cover the n-gram method. - CLI: --ngram-speculative (single-GPU greedy; mutually exclusive with DFlash, LoRA, KV offload, decode overlap), threaded through launch -> start_qwen3. - Remove tests/ngram_speculative.rs (drove openinfer-project#349's deleted low-level execute_speculative API); the proposer is unit-tested in ngram.rs. GPU losslessness to be validated engine-level on the 5090. Compiles green: cargo check -p openinfer-qwen3-4b {--lib,--tests} and -p openinfer-server; ngram unit tests pass. GPU losslessness validation pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Greedy n-gram speculation must be lossless: spec-on tokens equal plain greedy decode (modulo the benign prefill-vs-decode bf16 tie flip). Mirrors the DFlash gate's regret check (re-prefill the shared context at a divergence, tolerate only picks within MARGIN_TOL of the prefill-kernel argmax), driven through the engine with --ngram-speculative (no draft model). Repetitive/structured prompts make the proposer fire so the verify path is actually exercised. - ngram_speculative_greedy_matches_plain_greedy: bs=1 over 5 prompts. - ngram_concurrent_heterogeneous_is_lossless: 4 concurrent staggered budgets, exercising the bs>1 verify-span path and per-request context isolation. GPU + Qwen3-4B weights required; skips cleanly without OPENINFER_TEST_MODEL_PATH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Collapse the two mutually-exclusive executor fields (speculative: Option<DFlashMeta> and ngram: Option<NgramProposer>) into a single speculative_method: Option<SpeculativeMethod> { Dflash(..) | Ngram(..) }. One source of truth for 'is speculation on and which kind', removing the 'assert not both set' smell. dflash_meta()/ngram_proposer() accessors keep every call site a one-liner and DFlash behaviour bit-identical. This is the natural pre-trait consolidation now that n-gram is the second method openinfer-project#436 was waiting for. A full SpeculativeProposer trait object is deliberately not introduced yet: DFlash's propose is a worker RPC (needs &mut executor + lane), so it can't be a self-contained proposer object without borrow gymnastics — the enum-dispatch is the honest shape until a third method (EAGLE, openinfer-project#325) makes a shared trait pay off. cargo check -p openinfer-qwen3-4b {--lib,--tests} green; ngram units pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cargo fmt --all --check clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fer-project#349 review Review of openinfer-project#349 surfaced a real gap plus polish: - 🔴 The fused unified decode path (execute_unified) committed tokens via apply_decode but never appended them to ngram_ctx — only execute_decode and the verify path did. Reachable whenever a non-greedy request shares the batch (n-gram does not reject temperature>0): should_speculative_decode flips false, and the co-resident greedy request decodes via Unified. Not a correctness bug (the verify seed is the request's real last_token, not ngram_ctx), but the context tracker drifts, silently degrading n-gram acceptance once the batch returns to pure greedy and violating the 'always ends with the dangling token' invariant. Fixed by mirroring execute_decode's sync in the Unified branch, and defensively in the SplitDecodeReady branch (unreachable today — overlap is off under speculation — but guards a future relaxation). - Add ngram_mixed_greedy_and_sampling_is_lossless: a greedy request must stay token-identical to its solo baseline while a sampling request holds the batch non-greedy off the speculative path. - 🟡 Fix the misplaced doc comment: the load_dflash_draft_model header had been split from its body by the inserted dflash_meta()/ngram_proposer() accessors. - 🟢 max_context_tokens: document why n-gram needs no DFlash-style block_size reduction (verify span is budget-clamped, so it never writes past the window). - 🟢 Soften the verify-buffer 'allocates at most once' comment (it grows 1->K+1) and note propose()'s O(context_len) per-step scan + the vLLM look-back-window option for long contexts. cargo check -p openinfer-qwen3-4b {--lib,--tests} green; fmt clean; ngram units pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NgramProposer/new/propose are only used within the crate (mod ngram is private; the gate test drives through the engine, not the proposer directly), so their pub visibility tripped the workspace unreachable_pub lint. NgramConfig stays pub — it appears in load_ngram_drafter's public signature. GPU-validated: ngram_speculative_gate passes 3/3 on RTX 4090 (Qwen3-4B), every prompt 100% token-identical to plain greedy across bs=1, concurrent, and mixed greedy+sampling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wjinxu · 2026-06-24T15:02:45Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f27ad7ada

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Addresses openinfer-project#349 review (P2): when n-gram proposes nothing (non-repetitive prompt, or OPENINFER_QWEN3_NGRAM_TOKENS=0) the verify span is just [current_token], yet the scheduler had already picked SpeculativeDecode — so the no-hit token went through the prefill-style verify forward instead of the optimized decode kernel. That cost a full verify for one token and could flip a near-tie greedy pick by the prefill-vs-decode bf16 gap, with zero speculative upside. Now execute_plan partitions the post-propose batch: requests with drafts (verify span > 1) take the verify forward; requests with no draft fall back to execute_decode — same kernel, same cost, and bit-identical to baseline. All-miss steps (common on non-repetitive text) degrade exactly to plain decode, so enabling n-gram never regresses below baseline. DFlash is unaffected: its draft always emits block_size tokens, so its spans are never length 1. GPU-validated: ngram_speculative_gate still 3/3 on RTX 4090 (Qwen3-4B), all prompts 100% token-identical (the 17x23 reasoning prompt exercises the no-draft fallback path). cargo check --lib/--tests green; fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wjinxu · 2026-06-24T15:23:58Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c5b5adc4a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

…list the design note Address openinfer-project#349 review. The gate's success check `matched == base.len().min(spec.len())` returned Ok(()) whenever every overlapping token matched, even when the two runs had different lengths — so a speculative path that emits EOS/Length early (or runs on past the baseline) passed as "100% lossless". It also risked an out-of-bounds panic on the divergence diagnostic when `spec` was the shorter run. Reject length mismatches explicitly: greedy speculation is lossless only if the streams are token-for-token AND length identical. Also list `models/qwen3/ngram-speculative.md` in docs/index.md per the docs workflow (every doc must appear in the routing table). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…oser # Conflicts: # openinfer-server/src/config.rs

wjinxu · 2026-06-25T08:25:24Z

@codex review

chatgpt-codex-connector · 2026-06-25T08:34:24Z

Codex Review: Didn't find any major issues. Breezy!

Reviewed commit: a47adbea29

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

wjinxu · 2026-06-25T08:35:35Z

Codex Review: Didn't find any major issues. Breezy!

Reviewed commit: a47adbea29

ℹ️ About Codex in GitHub

Thank you

…oser # Conflicts: # openinfer-qwen3-4b/tests/dflash_speculative_gate.rs # openinfer-qwen3-4b/tests/dflash_speculative_perf.rs

wjinxu · 2026-06-26T12:53:46Z

@xiaguan Rebased onto latest main — the only conflict was the enable_kv_events field added in #457 colliding with this branch's ngram_speculative field in the DFlash test launch_options(); kept both. Builds clean (cargo test --no-run, exit 0), n-gram unit tests + the GPU losslessness gate all pass. Ready for review whenever you have a chance 🙏

xiaguan · 2026-06-28T10:27:22Z

Thanks for putting this together. The correctness side looks solid from the losslessness gate; the remaining thing I’d like to see before we position this as a performance win is a serving-level A/B with vllm bench serve datasets.

I ran a quick local pass on an RTX 5070 Ti, Qwen3-4B, temperature=0, ignore_eos, no prefix cache, 12 prompts × 128 output tokens:

dataset / mode	plain	n-gram	DFlash
`sonnet` c1	90.2 tok/s	95.9 tok/s (+6.3%)	140.3 tok/s (+55.6%)
`sonnet` c4	307.0 tok/s	198.2 tok/s (-35.4%)	353.8 tok/s (+15.2%)
`random` c1	90.1 tok/s	112.8 tok/s (+25.2%)	not run
`random` c4	306.6 tok/s	220.1 tok/s (-28.2%)	not run

So the current shape looks promising for some single-stream cases, but the concurrent serving path needs more investigation before we call it a general speedup. Could you add vllm bench serve A/B results to the PR and either address the c4 regression or document the feature as experimental / single-stream-oriented for now?

Here are the exact commands I used. Paths may need adjusting for your machine:

set -euo pipefail

MODEL=/data/models/Qwen3-4B
DRAFT=/data/models/Qwen3-4B-DFlash-b16
PORT=18080
RESULT_DIR=/tmp/openinfer-ngram-pr349
SONNET=/tmp/openinfer-sonnet.txt
mkdir -p "$RESULT_DIR"

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/sonnet.txt -o "$SONNET"
PATH=/usr/local/cuda-13.3/bin:$PATH cargo build --release

Start one server mode at a time:

# plain
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85

# n-gram
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85 \
  --ngram-speculative

# DFlash
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85 \
  --dflash-draft-model-path "$DRAFT"

Run sonnet for c1/c4:

MODE=plain   # plain | ngram | dflash
CONCURRENCY=1  # then repeat with 4

uv run --with vllm vllm bench serve \
  --backend openai \
  --base-url "http://127.0.0.1:${PORT}" \
  --model qwen3-4b \
  --tokenizer "$MODEL" \
  --dataset-name sonnet \
  --dataset-path "$SONNET" \
  --sonnet-input-len 550 \
  --sonnet-output-len 128 \
  --sonnet-prefix-len 200 \
  --num-prompts 12 \
  --num-warmups 1 \
  --max-concurrency "$CONCURRENCY" \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --ready-check-timeout-sec 60 \
  --disable-tqdm \
  --metric-percentiles 50,90,99 \
  --save-result \
  --save-detailed \
  --result-dir "$RESULT_DIR" \
  --result-filename "sonnet-${MODE}-c${CONCURRENCY}.json"

Run random for a control workload:

MODE=plain   # plain | ngram
CONCURRENCY=1  # then repeat with 4

uv run --with vllm vllm bench serve \
  --backend openai \
  --base-url "http://127.0.0.1:${PORT}" \
  --model qwen3-4b \
  --tokenizer "$MODEL" \
  --dataset-name random \
  --random-input-len 550 \
  --random-output-len 128 \
  --random-range-ratio 0 \
  --num-prompts 12 \
  --num-warmups 1 \
  --max-concurrency "$CONCURRENCY" \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --ready-check-timeout-sec 60 \
  --disable-tqdm \
  --metric-percentiles 50,90,99 \
  --save-result \
  --save-detailed \
  --result-dir "$RESULT_DIR" \
  --result-filename "random-${MODE}-c${CONCURRENCY}.json"

xiaguan · 2026-06-28T10:27:30Z

One more note: I’m fine with keeping the optimization work as a follow-up PR if we want to land the correctness / feature scaffolding first, but then I think we should open an issue to explicitly track the concurrent serving regression.

The next step should be profiling why c4 regresses before changing the implementation. My guess is around the current speculative batch/verify path, but that is only a hypothesis until we have an nsys or per-step timing breakdown for plain vs n-gram at c4.

On non-repetitive text (prose) prompt-lookup drafts are almost never the target's greedy continuation (~5-7% acceptance), so every speculative step pays a K+1-wide verify forward — plus, when only some requests in the batch matched, a second plain-decode pass — for ~no accepted tokens. At higher concurrency the GPU has no spare compute to hide that, so n-gram regressed serving throughput badly (sonnet -26% c1 / -40% c4 on Qwen3-4B / RTX 4090). Add an engine-wide acceptance gate (`NgramGate`): an EWMA of accepted draft tokens per drafted step. When it falls below `accept_threshold` the proposer returns nothing for the step, so the existing draft/undraft partition routes the whole batch to plain decode — no verify forward, no regression. A periodic probe re-opens the gate so a shift into repetitive text is picked back up. Gating only changes whether we speculate, never the committed tokens, so the losslessness gate still passes 3/3. Serving A/B (vllm bench serve, Qwen3-4B, RTX 4090), n-gram vs plain: sonnet c1 -26.2% -> +0.6% | c4 -40.0% -> -2.2% random c1 -10.3% -> +21.8%| c4 -33.8% -> -20.0% Prose regression is recovered (now better than vLLM's -6%, which keeps speculating); random c1 matches vLLM (+21.8% vs +19.3%). random c4 stays negative at high (~37%) acceptance — that gap is the verify path itself (two forwards + prefill-kernel verify vs vLLM's single fused forward), tracked as a follow-up in docs/models/qwen3/ngram-speculative.md. Tunable via OPENINFER_QWEN3_NGRAM_ACCEPT_THRESHOLD (default 0.3, 0 disables). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wjinxu · 2026-06-29T03:21:33Z

Added serving-level A/B with vllm bench serve (RTX 4090, Qwen3-4B, temperature=0, --ignore-eos, --no-prefix-cache, 12×128), sonnet + random, vs vLLM0.21/0.23.

The regression was from paying the verify forward at near-zero acceptance. Added an acceptance gate (NgramGate): an EWMA of accepted draft tokens/step; below threshold the proposer emits nothing and the batch falls back to plain decode (no verify forward), with a periodic probe to re-open on repetitive text. Lossless — the gate only changes whether we speculate; the losslessness gate still passes 3/3.

n-gram vs plain, after the gate:

dataset / conc	acceptance	vLLM 0.23	openinfer
sonnet c1	~5%	−5.8%	+0.6%
sonnet c4	~7%	−5.8%	−2.2%
random c1	~33%	+19.3%	+21.8%
random c4	~37%	+17.0%	−20.0%

Prose regression is gone (better than vLLM, which keeps speculating); random c1 matches vLLM. random c4 is the remaining gap (same ~37% acceptance,but vLLM's single fused forward beats our two-forward + prefill-kernel verify) — root-caused with nsys, tracked as a follow-up PR

xiaguan · 2026-06-29T13:59:28Z

Thanks for the update and for adding the serving A/B data.

Two code-quality things I’d like to see cleaned up before merge:

The new docs/models/qwen3/ngram-speculative.md still describes an older version of this branch in a few places. It mentions things like tests/ngram_speculative.rs, scheduler/speculative.rs, SpeculativeProposer, and OPENINFER_QWEN3_SPEC, but the current implementation is based on --ngram-speculative, SpeculativeMethod::{Dflash, Ngram}, executor-side n-gram context, and the shared verify/accept/KV transaction. Could you refresh the doc to match the actual code path?
The n-gram lifecycle is spread across executor.rs: context seeding, append-on-decode, append-on-unified, append-on-verify, drop cleanup, and gate bookkeeping all live in separate branches. Since executor.rs is already very large, I’d prefer to keep this feature’s state in a small dedicated helper, e.g. an NgramState/NgramRuntime that owns the context map, proposer, and gate. Then the executor can call intent-level methods instead of manually maintaining n-gram state in every execution path.

The feature shape looks reasonable; I mainly want the documentation and ownership boundary to stay maintainable.

The n-gram lifecycle was spread across executor.rs: a context HashMap field, an acceptance-gate field, and manual maintenance of both in every execution path (prefill seed, decode/unified/overlap append, verify append + gate record, retire drop, and the draft step's gate decision). Move all of it behind crate::ngram::NgramRuntime, which owns the proposer, the gate, and the per-request context map, and exposes intent-level methods: seed / append_committed / drop_request / is_ready / propose_step / record_verified. SpeculativeMethod::Ngram now holds the runtime, so the two executor fields (ngram_ctx, ngram_gate) and the per-path bookkeeping collapse to single calls through ngram_runtime()/ngram_runtime_mut(). No behaviour change: unit tests 12/12, losslessness gate 3/3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The doc still described an earlier version of the branch: a SpeculativeProposer trait, scheduler/speculative.rs::speculative_decode_step, OPENINFER_QWEN3_SPEC, tests/ngram_speculative.rs, and a standalone verify_forward. Rewrite it to the actual path: --ngram-speculative, the SpeculativeMethod{ Dflash, Ngram } enum, NgramRuntime owning proposer+gate+context, host-side propose, and the shared verify/accept/KV transaction. Keeps the serving A/B, acceptance-gate, and single-fused-forward follow-up sections. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wjinxu · 2026-06-30T02:46:01Z

Ownership — all n-gram state now lives in a dedicated NgramRuntime (proposer + gate + per-request context). SpeculativeMethod::Ngram holds it; the two executor fields and the per-path bookkeeping collapse to intent-level calls (seed / append_committed / drop_request / is_ready / propose_step / record_verified). No behaviour change.
Docs — rewrote ngram-speculative.md to the shipped path: dropped the stale SpeculativeProposer trait, scheduler/speculative.rs, OPENINFER_QWEN3_SPEC, tests/ngram_speculative.rs, verify_forward; now reflects --ngram-speculative, the SpeculativeMethod { Dflash, Ngram } enum, NgramRuntime, and the shared verify/accept/KV transaction.

Green locally: build + fmt + clippy clean, unit tests 12/12, GPU losslessness gate 3/3.

xiaguan · 2026-06-30T14:14:46Z

Review: n-gram speculative decoding — thanks!

Great to see this land on top of the #436 shared core rather than as a parallel stack. Reusing verify / accept / the KV transaction and only swapping propose is the right call, and the enum dispatch instead of a premature trait is honest. The GPU losslessness gate (3/3, token-identical) is solid evidence. A few things I'd like to see addressed before this is ready — none of them block the design, mostly scope/defaults and some polish.

Substantive

random c4 is −20% in the PR's own A/B, and the fix (fused verify forward) is a follow-up. The doc is transparent about it, which I appreciate — but it does mean --ngram-speculative, in what should be spec decode's sweet spot (high concurrency + moderate acceptance), regresses below plain decode (vLLM is +17% there). Two options I'd be happy with:
- Land the fused verify forward in this PR, or
- Mark the flag experimental and document that concurrency > 1 can regress until fused verify lands (so nobody enables it in prod expecting a flat win).
NgramGate is engine-wide; the mixed-load case isn't tested. record_verified feeds mean(accepted_draft_tokens) across the drafted requests in a step, so a repetitive request (accept=4) co-resident with prose requests (accept=0) keeps the EWMA healthy and the gate open — the prose requests then keep paying the prefill-kernel verify for ~0 accepted tokens. The A/B tables are all homogeneous datasets. Could you add a 50/50 mixed A/B (repetitive + prose) to show the gate doesn't degrade the prose side? If it does, a per-request gate (or per-request EWMA) is probably the real shape.
accept_threshold semantics drift with K. It's compared against "mean accepted draft tokens per step" (absolute, 0..K), so 0.3 means 7.5% acceptance at K=4 but 3.75% at K=8. The doc quotes acceptance as percentages (~5% / ~37%), but the knob is in absolute tokens — users tuning K have to re-tune the threshold. Making it accepted / drafted (ratio, 0..1) would make it K-portable and match the doc's mental model directly. Happy if this is a follow-up, but worth noting now since it touches the gate's public surface.

Minor

execute_ngram_verify reallocs verify scratch as the span grows 1 → K+1. K is known at load_ngram_drafter time; pre-allocating at K+1 there avoids the startup alloc stalls on the lane. DFlash already does the equivalent with its fixed block_size.
NgramProposer::propose is O(context_len · max_ngram) per step, full scan, no look-back window. Fine for bounded contexts, but it'll erode the win on long generations — vLLM bounds the look-back. The doc mentions this as a future concern; it'll become one, so worth a tracked issue at least.
ngram_mixed_greedy_and_sampling_is_lossless can't deterministically hit the unified ngram_ctx sync branch (the test's own comment says so), and commit 1d7b6c98 fixed a bug exactly there. The fact that the IT can't pin it suggests the unified sync depends on scheduler arrival timing — a mild design smell. Not blocking, but worth a comment in the test or a deterministic harness later.
propose_step's anyhow::ensure!(item.params.is_greedy()) is unreachable — should_speculative_decode already guarantees the whole batch is greedy. Either drop it or demote to debug_assert! to mark it as an invariant guard.
NgramConfig fields are all pub. The type needs to be pub (it's in load_ngram_drafter's signature), but the fields are only constructed by from_env(); pub(crate) would be enough and keeps the contract tight.

Comment density

This is subjective, but I found the commenting heavy. A lot of the prose explains what the code does line-by-line rather than why — e.g. the execute_plan::SpeculativeDecode block, the NgramRuntime method docs, the append_committed sync blocks duplicated across decode/unified/split paths. Where the code already reads as a self-explanatory sequence, letting it speak cuts the surface area reviewers have to hold and makes the why comments (the ones that actually earn their keep, like the bf16 tie-flip rationale in the gate test) stand out more. Not asking for a mass strip — just biasing toward "delete unless it explains intent the code can't."

Performance

The acceptance gate is a nice save for prose, and the nsys root-cause for random-c4 (two forwards + prefill-kernel verify vs vLLM's single fused forward) is exactly the right diagnosis. The fused unified verify forward is the obvious next prize — looking forward to it closing the c4 gap. Happy to review that PR.

Conclusion

Solid foundation and the losslessness story is convincing. I'd just like to see the c4 framing tightened up and a mixed-load A/B before this merges — the rest can be follow-ups.

wjinxu and others added 18 commits June 11, 2026 13:25

docs(qwen3): mark speculative verify forward landed in design note

6464578

Move the verify forward + executor step from "remaining" to "landed"; the remaining work is scheduler wiring and GPU validation of the lossless property. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(qwen3): record GPU-validated lossless property for ngram spec de…

4fd4c7a

…code Co-authored-by: Cursor <cursoragent@cursor.com>

docs(qwen3): record scheduler wiring landed; remaining = config knob …

2a6e2aa

…+ perf Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread openinfer-qwen3-4b/src/scheduler.rs Outdated

Comment thread openinfer-qwen3-4b/src/scheduler/speculative.rs Outdated

wjinxu marked this pull request as draft June 12, 2026 03:05

wjinxu mentioned this pull request Jun 15, 2026

feat(qwen3): add DFlash speculative decoding #380

Closed

xiaguan mentioned this pull request Jun 23, 2026

[Roadmap] Speculative decoding (qwen3 first, shared primitives) #443

Open

wjinxu and others added 7 commits June 24, 2026 15:55

style(qwen3): rustfmt the n-gram losslessness gate

c17975a

cargo fmt --all --check clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wjinxu marked this pull request as ready for review June 24, 2026 15:01

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread openinfer-qwen3-4b/src/executor.rs Outdated

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread openinfer-qwen3-4b/tests/ngram_speculative_gate.rs Outdated

Comment thread docs/models/qwen3/ngram-speculative.md

wjinxu and others added 2 commits June 25, 2026 16:10

Merge remote-tracking branch 'origin/main' into feat/qwen3-ngram-prop…

a47adbe

…oser # Conflicts: # openinfer-server/src/config.rs

Merge remote-tracking branch 'origin/main' into feat/qwen3-ngram-prop…

9a74be1

…oser # Conflicts: # openinfer-qwen3-4b/tests/dflash_speculative_gate.rs # openinfer-qwen3-4b/tests/dflash_speculative_perf.rs

xiaguan self-assigned this Jun 28, 2026

wjinxu and others added 2 commits June 29, 2026 11:10

style(qwen3): rustfmt the n-gram acceptance gate

9a955c1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wjinxu and others added 2 commits June 29, 2026 22:45

Uh oh!

Conversation

wjinxu commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The shape

What n-gram adds

Config

Losslessness

Files

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

wjinxu commented Jun 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

wjinxu commented Jun 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

wjinxu commented Jun 25, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 25, 2026

Uh oh!

wjinxu commented Jun 25, 2026

Uh oh!

wjinxu commented Jun 26, 2026

Uh oh!

xiaguan commented Jun 28, 2026

Uh oh!

xiaguan commented Jun 28, 2026

Uh oh!

wjinxu commented Jun 29, 2026

Uh oh!

xiaguan commented Jun 29, 2026

Uh oh!

wjinxu commented Jun 30, 2026

Uh oh!

xiaguan commented Jun 30, 2026

Review: n-gram speculative decoding — thanks!

Substantive

Minor

Comment density

Performance

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wjinxu commented Jun 11, 2026 •

edited

Loading