PR: feat(qwen3): n-gram speculative decoding#349
Conversation
Stateless prompt-lookup proposer: finds the most recent earlier occurrence of the current token suffix in the running context and proposes the tokens that followed it, longest-suffix-first with shorter-suffix fallback. No draft model. This is the proposer half of n-gram speculative decoding; verification and KV rollback in the decode path are a follow-up. Pure-CPU and unit-tested (recent-match continuation, longest-suffix preference, fallback, no-match, K cap, short context, config clamping). Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the verification-side brain of n-gram speculative decoding: `accept_greedy` commits the longest proposed prefix the target model agrees with plus one model token (correction or bonus), so a verify step always makes >=1 token of progress; `num_accepted` reports how many speculative KV positions to keep on rollback. Also adds `SpeculativeConfig` (disabled by default). Pure-CPU and unit-tested. The GPU verify forward (multi-token decode returning per-position logits) and the executor / kvbm KV-rollback wiring are a follow-up. Co-authored-by: Cursor <cursoragent@cursor.com>
kvbm's SchedulableSequence already implements speculative decode (schedule_speculative + apply_speculative with automatic LIFO release of rejected drafts' blocks), but RequestKv only wrapped the single-token decode path. Add thin pass-throughs so the Qwen3 decode path can schedule a speculative step and commit the accepted prefix; KV rollback for rejected drafts is handled by kvbm, not the caller. The verify forward reuses prefill_view(1 + num_draft_tokens). Unit-tested: accepted-prefix commit advances kv_position by accepted.len(); over-commit past the scheduled budget is rejected. Co-authored-by: Cursor <cursoragent@cursor.com>
Captures the n-gram (prompt-lookup) speculative-decode design for Qwen3-4B: per-step data flow, the lossless greedy-acceptance correctness oracle, the token/KV accounting against kvbm's verified speculative contract, the layering mirrored from vLLM V1 (proposer in the scheduler layer, GPU-side argmax, scheduler reserves KV), and the remaining GPU verify-forward + executor wiring. Records the building blocks already landed (proposer, acceptance, RequestKv speculative pass-throughs) and the open interface questions. Co-authored-by: Cursor <cursoragent@cursor.com>
Add the GPU verify step for n-gram speculative decode, as an additive path that leaves prefill/decode/unified untouched: - LocalQwen3Lane::execute_verify runs the model on [dangling, drafts..] over the request's existing paged KV by reusing batch_prefill(echo=true) for the per-position logits, then takes a GPU argmax (argmax_batch_bf16_into) and copies back only the per-position token ids. - New StepCommand::SpeculativeVerify / WorkerStepOutcome::Speculative carry it through the rank-worker channel (TP-safe: workers run the forward, primary returns the argmax). - Qwen3Executor::execute_speculative orchestrates one request's step: schedule_speculative(K+1) -> verify -> accept_greedy -> apply_speculative (kvbm releases rejected drafts' blocks). Returns the committed tokens. - SpeculativeStepItem carries the dangling token + drafts; exported via runtime. Scheduler wiring (proposer invocation over token history) and GPU validation of the lossless spec-on == spec-off property are the remaining steps. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the verify forward + executor step from "remaining" to "landed"; the remaining work is scheduler wiring and GPU validation of the lossless property. Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end test on real Qwen3-4B weights: greedy n-gram speculative decode (proposer -> schedule_speculative -> verify forward -> accept_greedy -> apply_speculative) produces a token-identical sequence to plain greedy decode. Prefix cache disabled for cold, identically-shaped runs; repetitive prompt keeps the greedy argmax well-separated despite the verify (prefill kernel) vs reference (decode kernel) path difference. Skips cleanly without GPU/weights. Passes on RTX (sm_89), Qwen3-4B: 24-token greedy continuation matches exactly. Co-authored-by: Cursor <cursoragent@cursor.com>
…code Co-authored-by: Cursor <cursoragent@cursor.com>
Add execute_speculative to ModelExecutor (default: unsupported) so a generic scheduler can drive speculative decode; Qwen3Executor's trait method delegates to the existing inherent body. No behaviour change — nothing calls it yet. Co-authored-by: Cursor <cursoragent@cursor.com>
Add the serving-loop integration behind a (default-disabled) SpeculativeConfig: - ActiveRequestState gains token_history (prompt + generated), maintained on promote (resolve) and each committed decode token (effects). - New scheduler/speculative.rs: speculative_decode_step proposes drafts per active request, runs execute_speculative (or falls back to a single decode when there is no draft), then streams the committed tokens applying stop / max-token handling — kept isolated from the one-token-per-step plan/resolve/ effects pipeline. - scheduler_loop routes pure-decode ticks through it when enabled. Mock-tested (FakeExecutor.execute_speculative): a step streams every committed token and advances state; commits past max_tokens truncate, finish, and retire. Enabling it from the public API (a config knob) is a follow-up. Co-authored-by: Cursor <cursoragent@cursor.com>
…+ perf Co-authored-by: Cursor <cursoragent@cursor.com>
Add SpeculativeConfig::from_env (OPENINFER_QWEN3_NGRAM_SPEC[_TOKENS|_MAX_NGRAM]) and have scheduler_loop build the config from it instead of hardcoding off, so the speculative decode path can be turned on end-to-end. Log once on enable. Co-authored-by: Cursor <cursoragent@cursor.com>
Add an ignored GPU benchmark that times greedy vs. speculative decode and reports forward-pass count, accepted-per-verify, and wall-clock speedup. On the periodic synthetic prompt (best case) it shows 3.96x / 5.00 tokens per verify; document the numbers and the best-case caveat. Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce SpeculativeProposer (propose(&[u32]) -> Vec<u32>) as the single axis of variation between speculative methods, with NgramProposer implementing it. SpeculativeConfig now carries a SpeculativeMethod enum + build_proposer factory; scheduler_loop builds one boxed proposer and speculative_decode_step takes &dyn SpeculativeProposer. Verify forward, KV rollback, greedy acceptance, and the streaming step are reused unchanged. Acceptance (greedy-only) and the token-only proposer context are left as documented future seams. Co-authored-by: Cursor <cursoragent@cursor.com>
…action The module-level doc still described the file as ngram-specific glue, which contradicts the SpeculativeProposer trait it now defines. Rewrite it to list the three method-agnostic pieces (proposer trait, config/factory, greedy acceptance). Co-authored-by: Cursor <cursoragent@cursor.com>
SpeculativeConfig::from_env was hardcoding the n-gram-specific OPENINFER_QWEN3_
NGRAM_SPEC_* variables, so the generic config knew the proposer's knobs. Move
those to NgramConfig::from_env (the proposer owns its own parameters) and leave
only the generic switch on SpeculativeConfig. Rename the switch to the generic
OPENINFER_QWEN3_SPEC and the n-gram knobs to OPENINFER_QWEN3_NGRAM_{TOKENS,
MAX_NGRAM}; a OPENINFER_QWEN3_SPEC_METHOD dispatch slots in when a second
proposer lands.
Co-authored-by: Cursor <cursoragent@cursor.com>
Address review of the speculative abstraction: - Stop over-promising. The module/trait/docs claimed a method-agnostic core reused unchanged; in reality the verify forward returns argmax (greedy- specific), accept_greedy is greedy-only, and the scheduler step assumes a stateless proposer. Rewrite to say this is a greedy, n-gram-sized seam and spell out what a draft-model/EAGLE proposer would additionally need. - Down-level describe(): drop SpeculativeConfig::describe and add Display on SpeculativeMethod + NgramConfig, so a method owns formatting of its own fields. scheduler logs spec_config.method. - num_accepted: now an internal helper for accept_greedy (dedup the accept loop), demoted from pub since the executor uses the committed length. Co-authored-by: Cursor <cursoragent@cursor.com>
Two robustness fixes for the speculative decode step: - Transactional schedule: execute_speculative now reverts the schedule_speculative KV reservation (new RequestKv::revert_schedule) if the verify/commit body fails, so a mid-step error leaves the sequence idle for the next step instead of stranded half-scheduled. - Budget cap: the scheduler caps the draft count to the request's remaining token budget, so a verify commits at most up to max_tokens. Otherwise schedule_speculative clamps to the budget and the larger accepted run is rejected. Test renamed/updated to assert the cap (drafts capped, commit lands exactly on the limit) and the docs drop the now-private num_accepted. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f6d3e45b6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Addresses Codex review on openinfer-project#349. With OPENINFER_QWEN3_SPEC on, the scheduler routed every pure-decode tick through the speculative path regardless of each request's SamplingParams: a temperature>0 request was forced to argmax and a logprobs request silently lost its decode logprobs. speculative_decode_step now decides per request: only greedy (SamplingParams::is_greedy()) requests that ask for no decode logprobs take the speculative step; everyone else takes a normal sampled single-token decode with their own params, logprobs, and a fresh random_val. Factored emit_one out of apply_committed so both paths share the stop/max-token handling. Adds two tests (sampled and logprobs requests bypass speculation) and documents the gate. Co-authored-by: Cursor <cursoragent@cursor.com>
…infer-project#442) into n-gram branch Adopt openinfer-project#436's method-agnostic speculative core (speculative.rs draft/verify seam, executor/spec.rs, KV schedule/apply/revert in pool.rs) wholesale, discarding openinfer-project#349's now-superseded parallel scaffolding (its own speculative.rs, scheduler/speculative.rs, executor verify path). The reusable n-gram IP (ngram.rs prompt-lookup + tests, design doc) is kept unreferenced, to be re-based onto openinfer-project#436's proposer seam in follow-up commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ect#436 core Plug host-side n-gram (prompt-lookup) speculation into the method-agnostic speculative core landed by openinfer-project#436, instead of openinfer-project#349's parallel stack. The shared verify/accept/KV transaction (speculative.rs, executor/spec.rs, pool.rs) is reused as-is; only the propose step and a no-capture verify forward are added. - ngram.rs: keep the suffix-lookup proposer + unit tests; drop the obsolete SpeculativeProposer impl (the trait is gone — openinfer-project#436 kept the core concrete). - executor: NgramProposer + per-request token context (ngram_ctx), maintained at prefill (prompt + first token), verify-commit, and decode; dropped on retire. execute_speculative_draft proposes host-side (no worker forward); execute_speculative_verify reuses the shared KV transaction with the DFlash readiness/capture asserts relaxed. - executor lane: execute_ngram_verify runs batch_prefill_into with zero capture layers (no hidden-state capture), argmax, build_verify_results. Verify is routed by lane.dflash.is_some() (the methods are mutually exclusive). - load_ngram_drafter + speculative_enabled/request_ready cover the n-gram method. - CLI: --ngram-speculative (single-GPU greedy; mutually exclusive with DFlash, LoRA, KV offload, decode overlap), threaded through launch -> start_qwen3. - Remove tests/ngram_speculative.rs (drove openinfer-project#349's deleted low-level execute_speculative API); the proposer is unit-tested in ngram.rs. GPU losslessness to be validated engine-level on the 5090. Compiles green: cargo check -p openinfer-qwen3-4b {--lib,--tests} and -p openinfer-server; ngram unit tests pass. GPU losslessness validation pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greedy n-gram speculation must be lossless: spec-on tokens equal plain greedy decode (modulo the benign prefill-vs-decode bf16 tie flip). Mirrors the DFlash gate's regret check (re-prefill the shared context at a divergence, tolerate only picks within MARGIN_TOL of the prefill-kernel argmax), driven through the engine with --ngram-speculative (no draft model). Repetitive/structured prompts make the proposer fire so the verify path is actually exercised. - ngram_speculative_greedy_matches_plain_greedy: bs=1 over 5 prompts. - ngram_concurrent_heterogeneous_is_lossless: 4 concurrent staggered budgets, exercising the bs>1 verify-span path and per-request context isolation. GPU + Qwen3-4B weights required; skips cleanly without OPENINFER_TEST_MODEL_PATH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the two mutually-exclusive executor fields (speculative: Option<DFlashMeta>
and ngram: Option<NgramProposer>) into a single speculative_method:
Option<SpeculativeMethod> { Dflash(..) | Ngram(..) }. One source of truth for
'is speculation on and which kind', removing the 'assert not both set' smell.
dflash_meta()/ngram_proposer() accessors keep every call site a one-liner and
DFlash behaviour bit-identical.
This is the natural pre-trait consolidation now that n-gram is the second method
openinfer-project#436 was waiting for. A full SpeculativeProposer trait object is deliberately
not introduced yet: DFlash's propose is a worker RPC (needs &mut executor +
lane), so it can't be a self-contained proposer object without borrow gymnastics
— the enum-dispatch is the honest shape until a third method (EAGLE, openinfer-project#325) makes
a shared trait pay off.
cargo check -p openinfer-qwen3-4b {--lib,--tests} green; ngram units pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cargo fmt --all --check clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fer-project#349 review Review of openinfer-project#349 surfaced a real gap plus polish: - 🔴 The fused unified decode path (execute_unified) committed tokens via apply_decode but never appended them to ngram_ctx — only execute_decode and the verify path did. Reachable whenever a non-greedy request shares the batch (n-gram does not reject temperature>0): should_speculative_decode flips false, and the co-resident greedy request decodes via Unified. Not a correctness bug (the verify seed is the request's real last_token, not ngram_ctx), but the context tracker drifts, silently degrading n-gram acceptance once the batch returns to pure greedy and violating the 'always ends with the dangling token' invariant. Fixed by mirroring execute_decode's sync in the Unified branch, and defensively in the SplitDecodeReady branch (unreachable today — overlap is off under speculation — but guards a future relaxation). - Add ngram_mixed_greedy_and_sampling_is_lossless: a greedy request must stay token-identical to its solo baseline while a sampling request holds the batch non-greedy off the speculative path. - 🟡 Fix the misplaced doc comment: the load_dflash_draft_model header had been split from its body by the inserted dflash_meta()/ngram_proposer() accessors. - 🟢 max_context_tokens: document why n-gram needs no DFlash-style block_size reduction (verify span is budget-clamped, so it never writes past the window). - 🟢 Soften the verify-buffer 'allocates at most once' comment (it grows 1->K+1) and note propose()'s O(context_len) per-step scan + the vLLM look-back-window option for long contexts. cargo check -p openinfer-qwen3-4b {--lib,--tests} green; fmt clean; ngram units pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NgramProposer/new/propose are only used within the crate (mod ngram is private; the gate test drives through the engine, not the proposer directly), so their pub visibility tripped the workspace unreachable_pub lint. NgramConfig stays pub — it appears in load_ngram_drafter's public signature. GPU-validated: ngram_speculative_gate passes 3/3 on RTX 4090 (Qwen3-4B), every prompt 100% token-identical to plain greedy across bs=1, concurrent, and mixed greedy+sampling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f27ad7ada
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Addresses openinfer-project#349 review (P2): when n-gram proposes nothing (non-repetitive prompt, or OPENINFER_QWEN3_NGRAM_TOKENS=0) the verify span is just [current_token], yet the scheduler had already picked SpeculativeDecode — so the no-hit token went through the prefill-style verify forward instead of the optimized decode kernel. That cost a full verify for one token and could flip a near-tie greedy pick by the prefill-vs-decode bf16 gap, with zero speculative upside. Now execute_plan partitions the post-propose batch: requests with drafts (verify span > 1) take the verify forward; requests with no draft fall back to execute_decode — same kernel, same cost, and bit-identical to baseline. All-miss steps (common on non-repetitive text) degrade exactly to plain decode, so enabling n-gram never regresses below baseline. DFlash is unaffected: its draft always emits block_size tokens, so its spans are never length 1. GPU-validated: ngram_speculative_gate still 3/3 on RTX 4090 (Qwen3-4B), all prompts 100% token-identical (the 17x23 reasoning prompt exercises the no-draft fallback path). cargo check --lib/--tests green; fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c5b5adc4a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…list the design note Address openinfer-project#349 review. The gate's success check `matched == base.len().min(spec.len())` returned Ok(()) whenever every overlapping token matched, even when the two runs had different lengths — so a speculative path that emits EOS/Length early (or runs on past the baseline) passed as "100% lossless". It also risked an out-of-bounds panic on the divergence diagnostic when `spec` was the shorter run. Reject length mismatches explicitly: greedy speculation is lossless only if the streams are token-for-token AND length identical. Also list `models/qwen3/ngram-speculative.md` in docs/index.md per the docs workflow (every doc must appear in the routing table). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oser # Conflicts: # openinfer-server/src/config.rs
|
@codex review |
|
Codex Review: Didn't find any major issues. Breezy! Reviewed commit: ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
Thank you |
…oser # Conflicts: # openinfer-qwen3-4b/tests/dflash_speculative_gate.rs # openinfer-qwen3-4b/tests/dflash_speculative_perf.rs
|
@xiaguan Rebased onto latest main — the only conflict was the enable_kv_events field added in #457 colliding with this branch's ngram_speculative field in the DFlash test launch_options(); kept both. Builds clean (cargo test --no-run, exit 0), n-gram unit tests + the GPU losslessness gate all pass. Ready for review whenever you have a chance 🙏 |
|
Thanks for putting this together. The correctness side looks solid from the losslessness gate; the remaining thing I’d like to see before we position this as a performance win is a serving-level A/B with I ran a quick local pass on an RTX 5070 Ti, Qwen3-4B,
So the current shape looks promising for some single-stream cases, but the concurrent serving path needs more investigation before we call it a general speedup. Could you add Here are the exact commands I used. Paths may need adjusting for your machine: set -euo pipefail
MODEL=/data/models/Qwen3-4B
DRAFT=/data/models/Qwen3-4B-DFlash-b16
PORT=18080
RESULT_DIR=/tmp/openinfer-ngram-pr349
SONNET=/tmp/openinfer-sonnet.txt
mkdir -p "$RESULT_DIR"
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/sonnet.txt -o "$SONNET"
PATH=/usr/local/cuda-13.3/bin:$PATH cargo build --releaseStart one server mode at a time: # plain
RUST_LOG=warn target/release/openinfer \
--model-path "$MODEL" \
--served-model-name qwen3-4b \
--port "$PORT" \
--no-prefix-cache \
--gpu-memory-utilization 0.85
# n-gram
RUST_LOG=warn target/release/openinfer \
--model-path "$MODEL" \
--served-model-name qwen3-4b \
--port "$PORT" \
--no-prefix-cache \
--gpu-memory-utilization 0.85 \
--ngram-speculative
# DFlash
RUST_LOG=warn target/release/openinfer \
--model-path "$MODEL" \
--served-model-name qwen3-4b \
--port "$PORT" \
--no-prefix-cache \
--gpu-memory-utilization 0.85 \
--dflash-draft-model-path "$DRAFT"Run MODE=plain # plain | ngram | dflash
CONCURRENCY=1 # then repeat with 4
uv run --with vllm vllm bench serve \
--backend openai \
--base-url "http://127.0.0.1:${PORT}" \
--model qwen3-4b \
--tokenizer "$MODEL" \
--dataset-name sonnet \
--dataset-path "$SONNET" \
--sonnet-input-len 550 \
--sonnet-output-len 128 \
--sonnet-prefix-len 200 \
--num-prompts 12 \
--num-warmups 1 \
--max-concurrency "$CONCURRENCY" \
--request-rate inf \
--ignore-eos \
--temperature 0 \
--ready-check-timeout-sec 60 \
--disable-tqdm \
--metric-percentiles 50,90,99 \
--save-result \
--save-detailed \
--result-dir "$RESULT_DIR" \
--result-filename "sonnet-${MODE}-c${CONCURRENCY}.json"Run MODE=plain # plain | ngram
CONCURRENCY=1 # then repeat with 4
uv run --with vllm vllm bench serve \
--backend openai \
--base-url "http://127.0.0.1:${PORT}" \
--model qwen3-4b \
--tokenizer "$MODEL" \
--dataset-name random \
--random-input-len 550 \
--random-output-len 128 \
--random-range-ratio 0 \
--num-prompts 12 \
--num-warmups 1 \
--max-concurrency "$CONCURRENCY" \
--request-rate inf \
--ignore-eos \
--temperature 0 \
--ready-check-timeout-sec 60 \
--disable-tqdm \
--metric-percentiles 50,90,99 \
--save-result \
--save-detailed \
--result-dir "$RESULT_DIR" \
--result-filename "random-${MODE}-c${CONCURRENCY}.json" |
|
One more note: I’m fine with keeping the optimization work as a follow-up PR if we want to land the correctness / feature scaffolding first, but then I think we should open an issue to explicitly track the concurrent serving regression. The next step should be profiling why c4 regresses before changing the implementation. My guess is around the current speculative batch/verify path, but that is only a hypothesis until we have an nsys or per-step timing breakdown for plain vs n-gram at c4. |
On non-repetitive text (prose) prompt-lookup drafts are almost never the target's greedy continuation (~5-7% acceptance), so every speculative step pays a K+1-wide verify forward — plus, when only some requests in the batch matched, a second plain-decode pass — for ~no accepted tokens. At higher concurrency the GPU has no spare compute to hide that, so n-gram regressed serving throughput badly (sonnet -26% c1 / -40% c4 on Qwen3-4B / RTX 4090). Add an engine-wide acceptance gate (`NgramGate`): an EWMA of accepted draft tokens per drafted step. When it falls below `accept_threshold` the proposer returns nothing for the step, so the existing draft/undraft partition routes the whole batch to plain decode — no verify forward, no regression. A periodic probe re-opens the gate so a shift into repetitive text is picked back up. Gating only changes whether we speculate, never the committed tokens, so the losslessness gate still passes 3/3. Serving A/B (vllm bench serve, Qwen3-4B, RTX 4090), n-gram vs plain: sonnet c1 -26.2% -> +0.6% | c4 -40.0% -> -2.2% random c1 -10.3% -> +21.8%| c4 -33.8% -> -20.0% Prose regression is recovered (now better than vLLM's -6%, which keeps speculating); random c1 matches vLLM (+21.8% vs +19.3%). random c4 stays negative at high (~37%) acceptance — that gap is the verify path itself (two forwards + prefill-kernel verify vs vLLM's single fused forward), tracked as a follow-up in docs/models/qwen3/ngram-speculative.md. Tunable via OPENINFER_QWEN3_NGRAM_ACCEPT_THRESHOLD (default 0.3, 0 disables). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Added serving-level A/B with vllm bench serve (RTX 4090, Qwen3-4B, temperature=0, --ignore-eos, --no-prefix-cache, 12×128), sonnet + random, vs vLLM0.21/0.23. The regression was from paying the verify forward at near-zero acceptance. Added an acceptance gate (NgramGate): an EWMA of accepted draft tokens/step; below threshold the proposer emits nothing and the batch falls back to plain decode (no verify forward), with a periodic probe to re-open on repetitive text. Lossless — the gate only changes whether we speculate; the losslessness gate still passes 3/3. n-gram vs plain, after the gate:
Prose regression is gone (better than vLLM, which keeps speculating); random c1 matches vLLM. random c4 is the remaining gap (same ~37% acceptance,but vLLM's single fused forward beats our two-forward + prefill-kernel verify) — root-caused with nsys, tracked as a follow-up PR |
|
Thanks for the update and for adding the serving A/B data. Two code-quality things I’d like to see cleaned up before merge:
The feature shape looks reasonable; I mainly want the documentation and ownership boundary to stay maintainable. |
The n-gram lifecycle was spread across executor.rs: a context HashMap field, an acceptance-gate field, and manual maintenance of both in every execution path (prefill seed, decode/unified/overlap append, verify append + gate record, retire drop, and the draft step's gate decision). Move all of it behind crate::ngram::NgramRuntime, which owns the proposer, the gate, and the per-request context map, and exposes intent-level methods: seed / append_committed / drop_request / is_ready / propose_step / record_verified. SpeculativeMethod::Ngram now holds the runtime, so the two executor fields (ngram_ctx, ngram_gate) and the per-path bookkeeping collapse to single calls through ngram_runtime()/ngram_runtime_mut(). No behaviour change: unit tests 12/12, losslessness gate 3/3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The doc still described an earlier version of the branch: a SpeculativeProposer
trait, scheduler/speculative.rs::speculative_decode_step, OPENINFER_QWEN3_SPEC,
tests/ngram_speculative.rs, and a standalone verify_forward. Rewrite it to the
actual path: --ngram-speculative, the SpeculativeMethod{ Dflash, Ngram } enum,
NgramRuntime owning proposer+gate+context, host-side propose, and the shared
verify/accept/KV transaction. Keeps the serving A/B, acceptance-gate, and
single-fused-forward follow-up sections.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Green locally: build + fmt + clippy clean, unit tests 12/12, GPU losslessness gate 3/3. |
Review: n-gram speculative decoding — thanks!Great to see this land on top of the #436 shared core rather than as a parallel stack. Reusing verify / accept / the KV transaction and only swapping Substantive
Minor
Comment densityThis is subjective, but I found the commenting heavy. A lot of the prose explains what the code does line-by-line rather than why — e.g. the PerformanceThe acceptance gate is a nice save for prose, and the nsys root-cause for random-c4 (two forwards + prefill-kernel verify vs vLLM's single fused forward) is exactly the right diagnosis. The fused unified verify forward is the obvious next prize — looking forward to it closing the c4 gap. Happy to review that PR. ConclusionSolid foundation and the losslessness story is convincing. I'd just like to see the c4 framing tightened up and a mixed-load A/B before this merges — the rest can be follow-ups. |
Adds lossless n-gram (prompt-lookup) speculative decoding for Qwen3, built on the method-agnostic speculative core that landed in #436 / #442 (merged into this branch). Off by default; enable with
--ngram-speculative.This started as a parallel speculative stack (its own
speculative.rs, verify, KV lifecycle). It has been re-based onto #436's shared core: n-gram is now just a second propose implementation behind the same verify / accept / KV transaction — no duplicated controller.The shape
Speculative decoding is one optimistic transaction; only propose differs by method:
Kcandidate tokens.K + 1span[current, draft_1..draft_K], argmax at every position.accept_greedykeeps the longest prefix matching the target argmax, then appends the target's own token at the first mismatch (always commits1..=K + 1tokens).The draft↔verify boundary is a pure token span. So
accept_greedy/build_verify_results,execute_speculative_verify_impl, and theschedule/apply/revert_speculativeKV transaction are reused unchanged. Only two things are method-specific: where the drafts come from, and whether verify captures hidden states.What n-gram adds
ngram.rs):NgramProposer— longest-suffix prompt-lookup over the request's own token history, no draft model. Tries the longest configured suffix first, falls back to shorter. +NgramConfig(from_env) + 8 unit tests.execute_speculative_draftruns the lookup on the executor thread — no worker forward, no draft model, no readiness handshake. DFlash instead dispatchesStepCommand::SpeculativeDraftto the lane.execute_ngram_verifyreusesbatch_prefill_intowith zero capture layers (a token-only proposer keeps no model state to seed), then argmax +build_verify_results. The worker routes verify bylane.dflash.is_some()— the methods are mutually exclusive.ngram_ctx): per-request running token context, seeded at prefill (prompt + first_token), appended at every commit (verify / plain decode / fused unified decode), dropped on retire. Always ends with the request's dangling token.SpeculativeMethodenum): the two mutually-exclusive executor fields are collapsed into oneOption<SpeculativeMethod>{ Dflash | Ngram }— single source of truth, no "assert not both set" smell. This is the pre-trait consolidation feat(qwen3): DFlash speculative decoding #436 deferred. A fullSpeculativeProposertrait object is not introduced yet: DFlash's propose is a worker RPC (needs&mut executor+ lane), so it can't be a self-contained proposer without borrow gymnastics. Enum dispatch is the honest shape until a third method (EAGLE, feat(qwen3): explore speculative decoding #325) makes a shared trait pay off.Config
--ngram-speculative— enable. Single-GPU greedy only; mutually exclusive with--dflash-draft-model-path,--enable-lora,--kv-offload,--decode-overlap; requires--tp-size=1. Forces the prefix cache off (verify writes each request's own speculative KV span).OPENINFER_QWEN3_NGRAM_TOKENS— draft countK(default 4).OPENINFER_QWEN3_NGRAM_MAX_NGRAM— longest suffix to match (default 3).Losslessness
Greedy speculation is lossless (the accepted tokens are the target's own greedy continuation), which is a built-in oracle: spec-on must equal spec-off token-for-token (modulo the benign prefill-vs-decode bf16 tie flip).
GPU-validated on RTX 4090 (Qwen3-4B),
tests/ngram_speculative_gate.rs, 3/3 pass, every prompt 100% token-identical to plain greedy:ngram_speculative_greedy_matches_plain_greedy— bs=1, 5 prompts.ngram_concurrent_heterogeneous_is_lossless— 4 concurrent, staggered budgets (bs>1 verify-span path).ngram_mixed_greedy_and_sampling_is_lossless— a greedy request stays identical while a sampling request holds the batch off the speculative path.A length-1 verify span (n-gram "no match") already degrades to a plain single-token decode in
build_verify_results, so a miss needs no special fallback.Files
openinfer-qwen3-4b/src/ngram.rsNgramProposer+NgramConfig+ 8 unit testsopeninfer-qwen3-4b/src/executor.rsSpeculativeMethodenum;load_ngram_drafter; host-sideexecute_speculative_draft;execute_ngram_verify;ngram_ctxlifecycle (prefill / decode / unified / drop)openinfer-qwen3-4b/src/executor/spec.rsngram_ctx)openinfer-qwen3-4b/src/verify_graph.rsspan()accessor for the no-capture verify-buffer sizingopeninfer-qwen3-4b/src/scheduler.rs,src/lib.rs,openinfer-server/src/{config,main}.rs--ngram-speculativewiring + mutual-exclusion validationopeninfer-qwen3-4b/tests/ngram_speculative_gate.rsSits on top of
mainafter merging #436 / #442, so the diff is the n-gram addition, not a re-derivation of the shared core.