Skip to content

PR: feat(qwen3): n-gram speculative decoding#349

Open
wjinxu wants to merge 34 commits into
openinfer-project:mainfrom
wjinxu:feat/qwen3-ngram-proposer
Open

PR: feat(qwen3): n-gram speculative decoding#349
wjinxu wants to merge 34 commits into
openinfer-project:mainfrom
wjinxu:feat/qwen3-ngram-proposer

Conversation

@wjinxu

@wjinxu wjinxu commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Adds lossless n-gram (prompt-lookup) speculative decoding for Qwen3, built on the method-agnostic speculative core that landed in #436 / #442 (merged into this branch). Off by default; enable with --ngram-speculative.

This started as a parallel speculative stack (its own speculative.rs, verify, KV lifecycle). It has been re-based onto #436's shared core: n-gram is now just a second propose implementation behind the same verify / accept / KV transaction — no duplicated controller.

The shape

Speculative decoding is one optimistic transaction; only propose differs by method:

  1. Propose — produce up to K candidate tokens.
  2. Verify — one target forward over the K + 1 span [current, draft_1..draft_K], argmax at every position.
  3. Acceptaccept_greedy keeps the longest prefix matching the target argmax, then appends the target's own token at the first mismatch (always commits 1..=K + 1 tokens).
  4. Commit / roll back — accepted KV committed; the unused draft tail is LIFO-released.

The draft↔verify boundary is a pure token span. So accept_greedy / build_verify_results, execute_speculative_verify_impl, and the schedule/apply/revert_speculative KV transaction are reused unchanged. Only two things are method-specific: where the drafts come from, and whether verify captures hidden states.

What n-gram adds

  • Proposer (ngram.rs): NgramProposer — longest-suffix prompt-lookup over the request's own token history, no draft model. Tries the longest configured suffix first, falls back to shorter. + NgramConfig (from_env) + 8 unit tests.
  • Propose is host-side: execute_speculative_draft runs the lookup on the executor thread — no worker forward, no draft model, no readiness handshake. DFlash instead dispatches StepCommand::SpeculativeDraft to the lane.
  • No-capture verify: execute_ngram_verify reuses batch_prefill_into with zero capture layers (a token-only proposer keeps no model state to seed), then argmax + build_verify_results. The worker routes verify by lane.dflash.is_some() — the methods are mutually exclusive.
  • Context tracking (ngram_ctx): per-request running token context, seeded at prefill (prompt + first_token), appended at every commit (verify / plain decode / fused unified decode), dropped on retire. Always ends with the request's dangling token.
  • Method state (SpeculativeMethod enum): the two mutually-exclusive executor fields are collapsed into one Option<SpeculativeMethod>{ Dflash | Ngram } — single source of truth, no "assert not both set" smell. This is the pre-trait consolidation feat(qwen3): DFlash speculative decoding #436 deferred. A full SpeculativeProposer trait object is not introduced yet: DFlash's propose is a worker RPC (needs &mut executor + lane), so it can't be a self-contained proposer without borrow gymnastics. Enum dispatch is the honest shape until a third method (EAGLE, feat(qwen3): explore speculative decoding #325) makes a shared trait pay off.

Config

  • --ngram-speculative — enable. Single-GPU greedy only; mutually exclusive with --dflash-draft-model-path, --enable-lora, --kv-offload, --decode-overlap; requires --tp-size=1. Forces the prefix cache off (verify writes each request's own speculative KV span).
  • OPENINFER_QWEN3_NGRAM_TOKENS — draft count K (default 4).
  • OPENINFER_QWEN3_NGRAM_MAX_NGRAM — longest suffix to match (default 3).

Losslessness

Greedy speculation is lossless (the accepted tokens are the target's own greedy continuation), which is a built-in oracle: spec-on must equal spec-off token-for-token (modulo the benign prefill-vs-decode bf16 tie flip).

GPU-validated on RTX 4090 (Qwen3-4B), tests/ngram_speculative_gate.rs, 3/3 pass, every prompt 100% token-identical to plain greedy:

  • ngram_speculative_greedy_matches_plain_greedy — bs=1, 5 prompts.
  • ngram_concurrent_heterogeneous_is_lossless — 4 concurrent, staggered budgets (bs>1 verify-span path).
  • ngram_mixed_greedy_and_sampling_is_lossless — a greedy request stays identical while a sampling request holds the batch off the speculative path.

A length-1 verify span (n-gram "no match") already degrades to a plain single-token decode in build_verify_results, so a miss needs no special fallback.

Files

File Change
openinfer-qwen3-4b/src/ngram.rs New: NgramProposer + NgramConfig + 8 unit tests
openinfer-qwen3-4b/src/executor.rs SpeculativeMethod enum; load_ngram_drafter; host-side execute_speculative_draft; execute_ngram_verify; ngram_ctx lifecycle (prefill / decode / unified / drop)
openinfer-qwen3-4b/src/executor/spec.rs Generalize the verify transaction to be method-agnostic (relax DFlash-only asserts; append accepted tokens to ngram_ctx)
openinfer-qwen3-4b/src/verify_graph.rs span() accessor for the no-capture verify-buffer sizing
openinfer-qwen3-4b/src/scheduler.rs, src/lib.rs, openinfer-server/src/{config,main}.rs --ngram-speculative wiring + mutual-exclusion validation
openinfer-qwen3-4b/tests/ngram_speculative_gate.rs New: engine-level regret-based losslessness gate (3 tests)

Sits on top of main after merging #436 / #442, so the diff is the n-gram addition, not a re-derivation of the shared core.

wjinxu and others added 18 commits June 11, 2026 13:25
Stateless prompt-lookup proposer: finds the most recent earlier occurrence
of the current token suffix in the running context and proposes the tokens
that followed it, longest-suffix-first with shorter-suffix fallback. No draft
model. This is the proposer half of n-gram speculative decoding; verification
and KV rollback in the decode path are a follow-up.

Pure-CPU and unit-tested (recent-match continuation, longest-suffix
preference, fallback, no-match, K cap, short context, config clamping).

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds the verification-side brain of n-gram speculative decoding:
`accept_greedy` commits the longest proposed prefix the target model agrees
with plus one model token (correction or bonus), so a verify step always
makes >=1 token of progress; `num_accepted` reports how many speculative KV
positions to keep on rollback. Also adds `SpeculativeConfig` (disabled by
default). Pure-CPU and unit-tested.

The GPU verify forward (multi-token decode returning per-position logits) and
the executor / kvbm KV-rollback wiring are a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
kvbm's SchedulableSequence already implements speculative decode
(schedule_speculative + apply_speculative with automatic LIFO release of
rejected drafts' blocks), but RequestKv only wrapped the single-token decode
path. Add thin pass-throughs so the Qwen3 decode path can schedule a
speculative step and commit the accepted prefix; KV rollback for rejected
drafts is handled by kvbm, not the caller.

The verify forward reuses prefill_view(1 + num_draft_tokens). Unit-tested:
accepted-prefix commit advances kv_position by accepted.len(); over-commit
past the scheduled budget is rejected.

Co-authored-by: Cursor <cursoragent@cursor.com>
Captures the n-gram (prompt-lookup) speculative-decode design for Qwen3-4B:
per-step data flow, the lossless greedy-acceptance correctness oracle, the
token/KV accounting against kvbm's verified speculative contract, the layering
mirrored from vLLM V1 (proposer in the scheduler layer, GPU-side argmax,
scheduler reserves KV), and the remaining GPU verify-forward + executor wiring.
Records the building blocks already landed (proposer, acceptance, RequestKv
speculative pass-throughs) and the open interface questions.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add the GPU verify step for n-gram speculative decode, as an additive path
that leaves prefill/decode/unified untouched:

- LocalQwen3Lane::execute_verify runs the model on [dangling, drafts..] over
  the request's existing paged KV by reusing batch_prefill(echo=true) for the
  per-position logits, then takes a GPU argmax (argmax_batch_bf16_into) and
  copies back only the per-position token ids.
- New StepCommand::SpeculativeVerify / WorkerStepOutcome::Speculative carry it
  through the rank-worker channel (TP-safe: workers run the forward, primary
  returns the argmax).
- Qwen3Executor::execute_speculative orchestrates one request's step:
  schedule_speculative(K+1) -> verify -> accept_greedy -> apply_speculative
  (kvbm releases rejected drafts' blocks). Returns the committed tokens.
- SpeculativeStepItem carries the dangling token + drafts; exported via runtime.

Scheduler wiring (proposer invocation over token history) and GPU validation
of the lossless spec-on == spec-off property are the remaining steps.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the verify forward + executor step from "remaining" to "landed"; the
remaining work is scheduler wiring and GPU validation of the lossless property.

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end test on real Qwen3-4B weights: greedy n-gram speculative decode
(proposer -> schedule_speculative -> verify forward -> accept_greedy ->
apply_speculative) produces a token-identical sequence to plain greedy decode.
Prefix cache disabled for cold, identically-shaped runs; repetitive prompt
keeps the greedy argmax well-separated despite the verify (prefill kernel) vs
reference (decode kernel) path difference. Skips cleanly without GPU/weights.

Passes on RTX (sm_89), Qwen3-4B: 24-token greedy continuation matches exactly.

Co-authored-by: Cursor <cursoragent@cursor.com>
…code

Co-authored-by: Cursor <cursoragent@cursor.com>
Add execute_speculative to ModelExecutor (default: unsupported) so a generic
scheduler can drive speculative decode; Qwen3Executor's trait method delegates
to the existing inherent body. No behaviour change — nothing calls it yet.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add the serving-loop integration behind a (default-disabled) SpeculativeConfig:

- ActiveRequestState gains token_history (prompt + generated), maintained on
  promote (resolve) and each committed decode token (effects).
- New scheduler/speculative.rs: speculative_decode_step proposes drafts per
  active request, runs execute_speculative (or falls back to a single decode
  when there is no draft), then streams the committed tokens applying stop /
  max-token handling — kept isolated from the one-token-per-step plan/resolve/
  effects pipeline.
- scheduler_loop routes pure-decode ticks through it when enabled.

Mock-tested (FakeExecutor.execute_speculative): a step streams every committed
token and advances state; commits past max_tokens truncate, finish, and retire.
Enabling it from the public API (a config knob) is a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
…+ perf

Co-authored-by: Cursor <cursoragent@cursor.com>
Add SpeculativeConfig::from_env (OPENINFER_QWEN3_NGRAM_SPEC[_TOKENS|_MAX_NGRAM])
and have scheduler_loop build the config from it instead of hardcoding off, so
the speculative decode path can be turned on end-to-end. Log once on enable.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add an ignored GPU benchmark that times greedy vs. speculative decode and
reports forward-pass count, accepted-per-verify, and wall-clock speedup. On the
periodic synthetic prompt (best case) it shows 3.96x / 5.00 tokens per verify;
document the numbers and the best-case caveat.

Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce SpeculativeProposer (propose(&[u32]) -> Vec<u32>) as the single axis
of variation between speculative methods, with NgramProposer implementing it.
SpeculativeConfig now carries a SpeculativeMethod enum + build_proposer factory;
scheduler_loop builds one boxed proposer and speculative_decode_step takes
&dyn SpeculativeProposer. Verify forward, KV rollback, greedy acceptance, and
the streaming step are reused unchanged. Acceptance (greedy-only) and the
token-only proposer context are left as documented future seams.

Co-authored-by: Cursor <cursoragent@cursor.com>
…action

The module-level doc still described the file as ngram-specific glue, which
contradicts the SpeculativeProposer trait it now defines. Rewrite it to list the
three method-agnostic pieces (proposer trait, config/factory, greedy acceptance).

Co-authored-by: Cursor <cursoragent@cursor.com>
SpeculativeConfig::from_env was hardcoding the n-gram-specific OPENINFER_QWEN3_
NGRAM_SPEC_* variables, so the generic config knew the proposer's knobs. Move
those to NgramConfig::from_env (the proposer owns its own parameters) and leave
only the generic switch on SpeculativeConfig. Rename the switch to the generic
OPENINFER_QWEN3_SPEC and the n-gram knobs to OPENINFER_QWEN3_NGRAM_{TOKENS,
MAX_NGRAM}; a OPENINFER_QWEN3_SPEC_METHOD dispatch slots in when a second
proposer lands.

Co-authored-by: Cursor <cursoragent@cursor.com>
Address review of the speculative abstraction:
- Stop over-promising. The module/trait/docs claimed a method-agnostic core
  reused unchanged; in reality the verify forward returns argmax (greedy-
  specific), accept_greedy is greedy-only, and the scheduler step assumes a
  stateless proposer. Rewrite to say this is a greedy, n-gram-sized seam and
  spell out what a draft-model/EAGLE proposer would additionally need.
- Down-level describe(): drop SpeculativeConfig::describe and add Display on
  SpeculativeMethod + NgramConfig, so a method owns formatting of its own
  fields. scheduler logs spec_config.method.
- num_accepted: now an internal helper for accept_greedy (dedup the accept
  loop), demoted from pub since the executor uses the committed length.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two robustness fixes for the speculative decode step:
- Transactional schedule: execute_speculative now reverts the
  schedule_speculative KV reservation (new RequestKv::revert_schedule) if the
  verify/commit body fails, so a mid-step error leaves the sequence idle for the
  next step instead of stranded half-scheduled.
- Budget cap: the scheduler caps the draft count to the request's remaining
  token budget, so a verify commits at most up to max_tokens. Otherwise
  schedule_speculative clamps to the budget and the larger accepted run is
  rejected. Test renamed/updated to assert the cap (drafts capped, commit lands
  exactly on the limit) and the docs drop the now-private num_accepted.

Co-authored-by: Cursor <cursoragent@cursor.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f6d3e45b6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread openinfer-qwen3-4b/src/scheduler.rs Outdated
Comment thread openinfer-qwen3-4b/src/scheduler/speculative.rs Outdated
Addresses Codex review on openinfer-project#349. With OPENINFER_QWEN3_SPEC on, the scheduler
routed every pure-decode tick through the speculative path regardless of each
request's SamplingParams: a temperature>0 request was forced to argmax and a
logprobs request silently lost its decode logprobs.

speculative_decode_step now decides per request: only greedy
(SamplingParams::is_greedy()) requests that ask for no decode logprobs take the
speculative step; everyone else takes a normal sampled single-token decode with
their own params, logprobs, and a fresh random_val. Factored emit_one out of
apply_committed so both paths share the stop/max-token handling. Adds two tests
(sampled and logprobs requests bypass speculation) and documents the gate.

Co-authored-by: Cursor <cursoragent@cursor.com>
wjinxu and others added 7 commits June 24, 2026 15:55
…infer-project#442) into n-gram branch

Adopt openinfer-project#436's method-agnostic speculative core (speculative.rs draft/verify
seam, executor/spec.rs, KV schedule/apply/revert in pool.rs) wholesale,
discarding openinfer-project#349's now-superseded parallel scaffolding (its own speculative.rs,
scheduler/speculative.rs, executor verify path). The reusable n-gram IP
(ngram.rs prompt-lookup + tests, design doc) is kept unreferenced, to be
re-based onto openinfer-project#436's proposer seam in follow-up commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ect#436 core

Plug host-side n-gram (prompt-lookup) speculation into the method-agnostic
speculative core landed by openinfer-project#436, instead of openinfer-project#349's parallel stack. The shared
verify/accept/KV transaction (speculative.rs, executor/spec.rs, pool.rs) is
reused as-is; only the propose step and a no-capture verify forward are added.

- ngram.rs: keep the suffix-lookup proposer + unit tests; drop the obsolete
  SpeculativeProposer impl (the trait is gone — openinfer-project#436 kept the core concrete).
- executor: NgramProposer + per-request token context (ngram_ctx), maintained
  at prefill (prompt + first token), verify-commit, and decode; dropped on
  retire. execute_speculative_draft proposes host-side (no worker forward);
  execute_speculative_verify reuses the shared KV transaction with the DFlash
  readiness/capture asserts relaxed.
- executor lane: execute_ngram_verify runs batch_prefill_into with zero capture
  layers (no hidden-state capture), argmax, build_verify_results. Verify is
  routed by lane.dflash.is_some() (the methods are mutually exclusive).
- load_ngram_drafter + speculative_enabled/request_ready cover the n-gram method.
- CLI: --ngram-speculative (single-GPU greedy; mutually exclusive with DFlash,
  LoRA, KV offload, decode overlap), threaded through launch -> start_qwen3.
- Remove tests/ngram_speculative.rs (drove openinfer-project#349's deleted low-level
  execute_speculative API); the proposer is unit-tested in ngram.rs. GPU
  losslessness to be validated engine-level on the 5090.

Compiles green: cargo check -p openinfer-qwen3-4b {--lib,--tests} and
-p openinfer-server; ngram unit tests pass. GPU losslessness validation pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greedy n-gram speculation must be lossless: spec-on tokens equal plain greedy
decode (modulo the benign prefill-vs-decode bf16 tie flip). Mirrors the DFlash
gate's regret check (re-prefill the shared context at a divergence, tolerate
only picks within MARGIN_TOL of the prefill-kernel argmax), driven through the
engine with --ngram-speculative (no draft model). Repetitive/structured prompts
make the proposer fire so the verify path is actually exercised.

- ngram_speculative_greedy_matches_plain_greedy: bs=1 over 5 prompts.
- ngram_concurrent_heterogeneous_is_lossless: 4 concurrent staggered budgets,
  exercising the bs>1 verify-span path and per-request context isolation.

GPU + Qwen3-4B weights required; skips cleanly without OPENINFER_TEST_MODEL_PATH.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the two mutually-exclusive executor fields (speculative: Option<DFlashMeta>
and ngram: Option<NgramProposer>) into a single speculative_method:
Option<SpeculativeMethod> { Dflash(..) | Ngram(..) }. One source of truth for
'is speculation on and which kind', removing the 'assert not both set' smell.
dflash_meta()/ngram_proposer() accessors keep every call site a one-liner and
DFlash behaviour bit-identical.

This is the natural pre-trait consolidation now that n-gram is the second method
openinfer-project#436 was waiting for. A full SpeculativeProposer trait object is deliberately
not introduced yet: DFlash's propose is a worker RPC (needs &mut executor +
lane), so it can't be a self-contained proposer object without borrow gymnastics
— the enum-dispatch is the honest shape until a third method (EAGLE, openinfer-project#325) makes
a shared trait pay off.

cargo check -p openinfer-qwen3-4b {--lib,--tests} green; ngram units pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cargo fmt --all --check clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fer-project#349 review

Review of openinfer-project#349 surfaced a real gap plus polish:

- 🔴 The fused unified decode path (execute_unified) committed tokens via
  apply_decode but never appended them to ngram_ctx — only execute_decode and
  the verify path did. Reachable whenever a non-greedy request shares the batch
  (n-gram does not reject temperature>0): should_speculative_decode flips false,
  and the co-resident greedy request decodes via Unified. Not a correctness bug
  (the verify seed is the request's real last_token, not ngram_ctx), but the
  context tracker drifts, silently degrading n-gram acceptance once the batch
  returns to pure greedy and violating the 'always ends with the dangling token'
  invariant. Fixed by mirroring execute_decode's sync in the Unified branch, and
  defensively in the SplitDecodeReady branch (unreachable today — overlap is off
  under speculation — but guards a future relaxation).
- Add ngram_mixed_greedy_and_sampling_is_lossless: a greedy request must stay
  token-identical to its solo baseline while a sampling request holds the batch
  non-greedy off the speculative path.
- 🟡 Fix the misplaced doc comment: the load_dflash_draft_model header had been
  split from its body by the inserted dflash_meta()/ngram_proposer() accessors.
- 🟢 max_context_tokens: document why n-gram needs no DFlash-style block_size
  reduction (verify span is budget-clamped, so it never writes past the window).
- 🟢 Soften the verify-buffer 'allocates at most once' comment (it grows 1->K+1)
  and note propose()'s O(context_len) per-step scan + the vLLM look-back-window
  option for long contexts.

cargo check -p openinfer-qwen3-4b {--lib,--tests} green; fmt clean; ngram units pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NgramProposer/new/propose are only used within the crate (mod ngram is private;
the gate test drives through the engine, not the proposer directly), so their
pub visibility tripped the workspace unreachable_pub lint. NgramConfig stays pub
— it appears in load_ngram_drafter's public signature.

GPU-validated: ngram_speculative_gate passes 3/3 on RTX 4090 (Qwen3-4B), every
prompt 100% token-identical to plain greedy across bs=1, concurrent, and
mixed greedy+sampling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjinxu wjinxu marked this pull request as ready for review June 24, 2026 15:01
@wjinxu

wjinxu commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f27ad7ada

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread openinfer-qwen3-4b/src/executor.rs Outdated
Addresses openinfer-project#349 review (P2): when n-gram proposes nothing (non-repetitive
prompt, or OPENINFER_QWEN3_NGRAM_TOKENS=0) the verify span is just
[current_token], yet the scheduler had already picked SpeculativeDecode — so the
no-hit token went through the prefill-style verify forward instead of the
optimized decode kernel. That cost a full verify for one token and could flip a
near-tie greedy pick by the prefill-vs-decode bf16 gap, with zero speculative
upside.

Now execute_plan partitions the post-propose batch: requests with drafts (verify
span > 1) take the verify forward; requests with no draft fall back to
execute_decode — same kernel, same cost, and bit-identical to baseline. All-miss
steps (common on non-repetitive text) degrade exactly to plain decode, so
enabling n-gram never regresses below baseline. DFlash is unaffected: its draft
always emits block_size tokens, so its spans are never length 1.

GPU-validated: ngram_speculative_gate still 3/3 on RTX 4090 (Qwen3-4B), all
prompts 100% token-identical (the 17x23 reasoning prompt exercises the no-draft
fallback path). cargo check --lib/--tests green; fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjinxu

wjinxu commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c5b5adc4a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread openinfer-qwen3-4b/tests/ngram_speculative_gate.rs Outdated
Comment thread docs/models/qwen3/ngram-speculative.md
wjinxu and others added 2 commits June 25, 2026 16:10
…list the design note

Address openinfer-project#349 review.

The gate's success check `matched == base.len().min(spec.len())` returned
Ok(()) whenever every overlapping token matched, even when the two runs had
different lengths — so a speculative path that emits EOS/Length early (or runs
on past the baseline) passed as "100% lossless". It also risked an
out-of-bounds panic on the divergence diagnostic when `spec` was the shorter
run. Reject length mismatches explicitly: greedy speculation is lossless only
if the streams are token-for-token AND length identical.

Also list `models/qwen3/ngram-speculative.md` in docs/index.md per the docs
workflow (every doc must appear in the routing table).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oser

# Conflicts:
#	openinfer-server/src/config.rs
@wjinxu

wjinxu commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

Reviewed commit: a47adbea29

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@wjinxu

wjinxu commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Codex Review: Didn't find any major issues. Breezy!

Reviewed commit: a47adbea29

ℹ️ About Codex in GitHub

Thank you

…oser

# Conflicts:
#	openinfer-qwen3-4b/tests/dflash_speculative_gate.rs
#	openinfer-qwen3-4b/tests/dflash_speculative_perf.rs
@wjinxu

wjinxu commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

@xiaguan Rebased onto latest main — the only conflict was the enable_kv_events field added in #457 colliding with this branch's ngram_speculative field in the DFlash test launch_options(); kept both. Builds clean (cargo test --no-run, exit 0), n-gram unit tests + the GPU losslessness gate all pass. Ready for review whenever you have a chance 🙏

xiaguan commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Thanks for putting this together. The correctness side looks solid from the losslessness gate; the remaining thing I’d like to see before we position this as a performance win is a serving-level A/B with vllm bench serve datasets.

I ran a quick local pass on an RTX 5070 Ti, Qwen3-4B, temperature=0, ignore_eos, no prefix cache, 12 prompts × 128 output tokens:

dataset / mode plain n-gram DFlash
sonnet c1 90.2 tok/s 95.9 tok/s (+6.3%) 140.3 tok/s (+55.6%)
sonnet c4 307.0 tok/s 198.2 tok/s (-35.4%) 353.8 tok/s (+15.2%)
random c1 90.1 tok/s 112.8 tok/s (+25.2%) not run
random c4 306.6 tok/s 220.1 tok/s (-28.2%) not run

So the current shape looks promising for some single-stream cases, but the concurrent serving path needs more investigation before we call it a general speedup. Could you add vllm bench serve A/B results to the PR and either address the c4 regression or document the feature as experimental / single-stream-oriented for now?

Here are the exact commands I used. Paths may need adjusting for your machine:

set -euo pipefail

MODEL=/data/models/Qwen3-4B
DRAFT=/data/models/Qwen3-4B-DFlash-b16
PORT=18080
RESULT_DIR=/tmp/openinfer-ngram-pr349
SONNET=/tmp/openinfer-sonnet.txt
mkdir -p "$RESULT_DIR"

curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/sonnet.txt -o "$SONNET"
PATH=/usr/local/cuda-13.3/bin:$PATH cargo build --release

Start one server mode at a time:

# plain
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85

# n-gram
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85 \
  --ngram-speculative

# DFlash
RUST_LOG=warn target/release/openinfer \
  --model-path "$MODEL" \
  --served-model-name qwen3-4b \
  --port "$PORT" \
  --no-prefix-cache \
  --gpu-memory-utilization 0.85 \
  --dflash-draft-model-path "$DRAFT"

Run sonnet for c1/c4:

MODE=plain   # plain | ngram | dflash
CONCURRENCY=1  # then repeat with 4

uv run --with vllm vllm bench serve \
  --backend openai \
  --base-url "http://127.0.0.1:${PORT}" \
  --model qwen3-4b \
  --tokenizer "$MODEL" \
  --dataset-name sonnet \
  --dataset-path "$SONNET" \
  --sonnet-input-len 550 \
  --sonnet-output-len 128 \
  --sonnet-prefix-len 200 \
  --num-prompts 12 \
  --num-warmups 1 \
  --max-concurrency "$CONCURRENCY" \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --ready-check-timeout-sec 60 \
  --disable-tqdm \
  --metric-percentiles 50,90,99 \
  --save-result \
  --save-detailed \
  --result-dir "$RESULT_DIR" \
  --result-filename "sonnet-${MODE}-c${CONCURRENCY}.json"

Run random for a control workload:

MODE=plain   # plain | ngram
CONCURRENCY=1  # then repeat with 4

uv run --with vllm vllm bench serve \
  --backend openai \
  --base-url "http://127.0.0.1:${PORT}" \
  --model qwen3-4b \
  --tokenizer "$MODEL" \
  --dataset-name random \
  --random-input-len 550 \
  --random-output-len 128 \
  --random-range-ratio 0 \
  --num-prompts 12 \
  --num-warmups 1 \
  --max-concurrency "$CONCURRENCY" \
  --request-rate inf \
  --ignore-eos \
  --temperature 0 \
  --ready-check-timeout-sec 60 \
  --disable-tqdm \
  --metric-percentiles 50,90,99 \
  --save-result \
  --save-detailed \
  --result-dir "$RESULT_DIR" \
  --result-filename "random-${MODE}-c${CONCURRENCY}.json"

xiaguan commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

One more note: I’m fine with keeping the optimization work as a follow-up PR if we want to land the correctness / feature scaffolding first, but then I think we should open an issue to explicitly track the concurrent serving regression.

The next step should be profiling why c4 regresses before changing the implementation. My guess is around the current speculative batch/verify path, but that is only a hypothesis until we have an nsys or per-step timing breakdown for plain vs n-gram at c4.

@xiaguan xiaguan self-assigned this Jun 28, 2026
wjinxu and others added 2 commits June 29, 2026 11:10
On non-repetitive text (prose) prompt-lookup drafts are almost never the
target's greedy continuation (~5-7% acceptance), so every speculative step
pays a K+1-wide verify forward — plus, when only some requests in the batch
matched, a second plain-decode pass — for ~no accepted tokens. At higher
concurrency the GPU has no spare compute to hide that, so n-gram regressed
serving throughput badly (sonnet -26% c1 / -40% c4 on Qwen3-4B / RTX 4090).

Add an engine-wide acceptance gate (`NgramGate`): an EWMA of accepted draft
tokens per drafted step. When it falls below `accept_threshold` the proposer
returns nothing for the step, so the existing draft/undraft partition routes
the whole batch to plain decode — no verify forward, no regression. A periodic
probe re-opens the gate so a shift into repetitive text is picked back up.
Gating only changes whether we speculate, never the committed tokens, so the
losslessness gate still passes 3/3.

Serving A/B (vllm bench serve, Qwen3-4B, RTX 4090), n-gram vs plain:
  sonnet c1  -26.2% -> +0.6% | c4  -40.0% -> -2.2%
  random c1  -10.3% -> +21.8%| c4  -33.8% -> -20.0%
Prose regression is recovered (now better than vLLM's -6%, which keeps
speculating); random c1 matches vLLM (+21.8% vs +19.3%). random c4 stays
negative at high (~37%) acceptance — that gap is the verify path itself
(two forwards + prefill-kernel verify vs vLLM's single fused forward), tracked
as a follow-up in docs/models/qwen3/ngram-speculative.md.

Tunable via OPENINFER_QWEN3_NGRAM_ACCEPT_THRESHOLD (default 0.3, 0 disables).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wjinxu

wjinxu commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Added serving-level A/B with vllm bench serve (RTX 4090, Qwen3-4B, temperature=0, --ignore-eos, --no-prefix-cache, 12×128), sonnet + random, vs vLLM0.21/0.23.

The regression was from paying the verify forward at near-zero acceptance. Added an acceptance gate (NgramGate): an EWMA of accepted draft tokens/step; below threshold the proposer emits nothing and the batch falls back to plain decode (no verify forward), with a periodic probe to re-open on repetitive text. Lossless — the gate only changes whether we speculate; the losslessness gate still passes 3/3.

n-gram vs plain, after the gate:

dataset / conc acceptance vLLM 0.23 openinfer
sonnet c1 ~5% −5.8% +0.6%
sonnet c4 ~7% −5.8% −2.2%
random c1 ~33% +19.3% +21.8%
random c4 ~37% +17.0% −20.0%

Prose regression is gone (better than vLLM, which keeps speculating); random c1 matches vLLM. random c4 is the remaining gap (same ~37% acceptance,but vLLM's single fused forward beats our two-forward + prefill-kernel verify) — root-caused with nsys, tracked as a follow-up PR

xiaguan commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the update and for adding the serving A/B data.

Two code-quality things I’d like to see cleaned up before merge:

  1. The new docs/models/qwen3/ngram-speculative.md still describes an older version of this branch in a few places. It mentions things like tests/ngram_speculative.rs, scheduler/speculative.rs, SpeculativeProposer, and OPENINFER_QWEN3_SPEC, but the current implementation is based on --ngram-speculative, SpeculativeMethod::{Dflash, Ngram}, executor-side n-gram context, and the shared verify/accept/KV transaction. Could you refresh the doc to match the actual code path?

  2. The n-gram lifecycle is spread across executor.rs: context seeding, append-on-decode, append-on-unified, append-on-verify, drop cleanup, and gate bookkeeping all live in separate branches. Since executor.rs is already very large, I’d prefer to keep this feature’s state in a small dedicated helper, e.g. an NgramState/NgramRuntime that owns the context map, proposer, and gate. Then the executor can call intent-level methods instead of manually maintaining n-gram state in every execution path.

The feature shape looks reasonable; I mainly want the documentation and ownership boundary to stay maintainable.

wjinxu and others added 2 commits June 29, 2026 22:45
The n-gram lifecycle was spread across executor.rs: a context HashMap field,
an acceptance-gate field, and manual maintenance of both in every execution
path (prefill seed, decode/unified/overlap append, verify append + gate
record, retire drop, and the draft step's gate decision).

Move all of it behind crate::ngram::NgramRuntime, which owns the proposer, the
gate, and the per-request context map, and exposes intent-level methods:
seed / append_committed / drop_request / is_ready / propose_step /
record_verified. SpeculativeMethod::Ngram now holds the runtime, so the two
executor fields (ngram_ctx, ngram_gate) and the per-path bookkeeping collapse
to single calls through ngram_runtime()/ngram_runtime_mut(). No behaviour
change: unit tests 12/12, losslessness gate 3/3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The doc still described an earlier version of the branch: a SpeculativeProposer
trait, scheduler/speculative.rs::speculative_decode_step, OPENINFER_QWEN3_SPEC,
tests/ngram_speculative.rs, and a standalone verify_forward. Rewrite it to the
actual path: --ngram-speculative, the SpeculativeMethod{ Dflash, Ngram } enum,
NgramRuntime owning proposer+gate+context, host-side propose, and the shared
verify/accept/KV transaction. Keeps the serving A/B, acceptance-gate, and
single-fused-forward follow-up sections.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wjinxu

wjinxu commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author
  1. Ownership — all n-gram state now lives in a dedicated NgramRuntime (proposer + gate + per-request context). SpeculativeMethod::Ngram holds it; the two executor fields and the per-path bookkeeping collapse to intent-level calls (seed / append_committed / drop_request / is_ready / propose_step / record_verified). No behaviour change.

  2. Docs — rewrote ngram-speculative.md to the shipped path: dropped the stale SpeculativeProposer trait, scheduler/speculative.rs, OPENINFER_QWEN3_SPEC, tests/ngram_speculative.rs, verify_forward; now reflects --ngram-speculative, the SpeculativeMethod { Dflash, Ngram } enum, NgramRuntime, and the shared verify/accept/KV transaction.

Green locally: build + fmt + clippy clean, unit tests 12/12, GPU losslessness gate 3/3.

@xiaguan

xiaguan commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Review: n-gram speculative decoding — thanks!

Great to see this land on top of the #436 shared core rather than as a parallel stack. Reusing verify / accept / the KV transaction and only swapping propose is the right call, and the enum dispatch instead of a premature trait is honest. The GPU losslessness gate (3/3, token-identical) is solid evidence. A few things I'd like to see addressed before this is ready — none of them block the design, mostly scope/defaults and some polish.

Substantive

  1. random c4 is −20% in the PR's own A/B, and the fix (fused verify forward) is a follow-up. The doc is transparent about it, which I appreciate — but it does mean --ngram-speculative, in what should be spec decode's sweet spot (high concurrency + moderate acceptance), regresses below plain decode (vLLM is +17% there). Two options I'd be happy with:

    • Land the fused verify forward in this PR, or
    • Mark the flag experimental and document that concurrency > 1 can regress until fused verify lands (so nobody enables it in prod expecting a flat win).
  2. NgramGate is engine-wide; the mixed-load case isn't tested. record_verified feeds mean(accepted_draft_tokens) across the drafted requests in a step, so a repetitive request (accept=4) co-resident with prose requests (accept=0) keeps the EWMA healthy and the gate open — the prose requests then keep paying the prefill-kernel verify for ~0 accepted tokens. The A/B tables are all homogeneous datasets. Could you add a 50/50 mixed A/B (repetitive + prose) to show the gate doesn't degrade the prose side? If it does, a per-request gate (or per-request EWMA) is probably the real shape.

  3. accept_threshold semantics drift with K. It's compared against "mean accepted draft tokens per step" (absolute, 0..K), so 0.3 means 7.5% acceptance at K=4 but 3.75% at K=8. The doc quotes acceptance as percentages (~5% / ~37%), but the knob is in absolute tokens — users tuning K have to re-tune the threshold. Making it accepted / drafted (ratio, 0..1) would make it K-portable and match the doc's mental model directly. Happy if this is a follow-up, but worth noting now since it touches the gate's public surface.

Minor

  1. execute_ngram_verify reallocs verify scratch as the span grows 1 → K+1. K is known at load_ngram_drafter time; pre-allocating at K+1 there avoids the startup alloc stalls on the lane. DFlash already does the equivalent with its fixed block_size.

  2. NgramProposer::propose is O(context_len · max_ngram) per step, full scan, no look-back window. Fine for bounded contexts, but it'll erode the win on long generations — vLLM bounds the look-back. The doc mentions this as a future concern; it'll become one, so worth a tracked issue at least.

  3. ngram_mixed_greedy_and_sampling_is_lossless can't deterministically hit the unified ngram_ctx sync branch (the test's own comment says so), and commit 1d7b6c98 fixed a bug exactly there. The fact that the IT can't pin it suggests the unified sync depends on scheduler arrival timing — a mild design smell. Not blocking, but worth a comment in the test or a deterministic harness later.

  4. propose_step's anyhow::ensure!(item.params.is_greedy()) is unreachableshould_speculative_decode already guarantees the whole batch is greedy. Either drop it or demote to debug_assert! to mark it as an invariant guard.

  5. NgramConfig fields are all pub. The type needs to be pub (it's in load_ngram_drafter's signature), but the fields are only constructed by from_env(); pub(crate) would be enough and keeps the contract tight.

Comment density

This is subjective, but I found the commenting heavy. A lot of the prose explains what the code does line-by-line rather than why — e.g. the execute_plan::SpeculativeDecode block, the NgramRuntime method docs, the append_committed sync blocks duplicated across decode/unified/split paths. Where the code already reads as a self-explanatory sequence, letting it speak cuts the surface area reviewers have to hold and makes the why comments (the ones that actually earn their keep, like the bf16 tie-flip rationale in the gate test) stand out more. Not asking for a mass strip — just biasing toward "delete unless it explains intent the code can't."

Performance

The acceptance gate is a nice save for prose, and the nsys root-cause for random-c4 (two forwards + prefill-kernel verify vs vLLM's single fused forward) is exactly the right diagnosis. The fused unified verify forward is the obvious next prize — looking forward to it closing the c4 gap. Happy to review that PR.

Conclusion

Solid foundation and the losslessness story is convincing. I'd just like to see the c4 framing tightened up and a mixed-load A/B before this merges — the rest can be follow-ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants