PR: feat(qwen3): n-gram speculative decoding #349
Open
wjinxu wants to merge 34 commits into openinfer-project:main from wjinxu:feat/qwen3-ngram-proposer
Changes from 27 commits
Commits (34), all authored by wjinxu:
- b387352 feat(qwen3): add n-gram (prompt-lookup) speculative proposer
- 2a85569 feat(qwen3): add greedy speculative acceptance + spec config
- 1e8ff27 feat(kv-cache): expose speculative schedule/apply on RequestKv
- 28f9295 docs(qwen3): design note for n-gram speculative decoding
- 05d90b7 feat(qwen3): wire speculative verify forward into the executor
- 6464578 docs(qwen3): mark speculative verify forward landed in design note
- 2181646 test(qwen3): GPU validation that n-gram speculative decode is lossless
- 4fd4c7a docs(qwen3): record GPU-validated lossless property for ngram spec de…
- 75bfd35 feat(qwen3): expose execute_speculative on the ModelExecutor trait
- b811467 feat(qwen3): wire n-gram speculative decode into the scheduler
- 2a6e2aa docs(qwen3): record scheduler wiring landed; remaining = config knob …
- 9cb3309 feat(qwen3): env switch for n-gram speculative decode
- 83f7e14 test(qwen3): n-gram speculative decode speedup benchmark
- ca5324a refactor(qwen3): abstract speculative decode behind a proposer trait
- 1adef43 docs(qwen3): refresh speculative module header for the proposer abstr…
- 31e0aec refactor(qwen3): push n-gram env parsing down to NgramConfig::from_env
- 85581b9 refactor(qwen3): honest speculative docs + tighten cohesion
- 2f6d3e4 fix(qwen3): make speculative step transactional and budget-capped
- 8969a2f fix(qwen3): gate speculation to greedy, no-logprobs requests
- 09bc511 Merge origin/main (DFlash speculative core #436/#442) into n-gram branch
- 2842d15 feat(qwen3): n-gram speculative decoding on the shared #436 core
- 5fb92a2 test(qwen3): engine-level n-gram speculative losslessness gate
- 8cf77c2 refactor(qwen3): unify speculative method state into one enum
- c17975a style(qwen3): rustfmt the n-gram losslessness gate
- 1d7b6c9 fix(qwen3): sync ngram_ctx on the unified decode path; address #349 r…
- 3f27ad7 style(qwen3): pub(crate) the n-gram proposer to clear unreachable_pub
- 8c5b5ad perf(qwen3): route no-draft n-gram steps through plain decode
- 569af20 test(qwen3): fail the n-gram lossless gate on strict-prefix outputs; …
- a47adbe Merge remote-tracking branch 'origin/main' into feat/qwen3-ngram-prop…
- 9a74be1 Merge remote-tracking branch 'origin/main' into feat/qwen3-ngram-prop…
- 8ae1b33 perf(qwen3): gate n-gram speculation on draft acceptance
- 9a955c1 style(qwen3): rustfmt the n-gram acceptance gate
- 3ac292c refactor(qwen3): consolidate n-gram state into NgramRuntime
- ebfac0a docs(qwen3): refresh n-gram doc to match the shipped code path
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,205 @@ | ||
| # Qwen3-4B n-gram speculative decoding (design) | ||
|
|
||
| **TL;DR**: Draft-model-free speculative decoding for Qwen3-4B. An n-gram | ||
| (prompt-lookup) proposer suggests `K` continuation tokens from the running | ||
| context; the target model verifies them in one forward pass and commits the | ||
| longest greedy-agreed prefix plus one model token. Greedy verification is | ||
| **lossless** — output is bit-for-bit identical to plain greedy decode, just | ||
| fewer forward passes on repetitive / structured text (code, quoting, JSON). | ||
| The KV layer (kvbm) already supports speculative scheduling natively, so | ||
| rejected drafts need no manual rollback. | ||
|
|
||
| Last touched: 2026-06 · Status: **end-to-end implemented and gated behind a | ||
| (default-off) `SpeculativeConfig`: proposer, greedy acceptance, KV speculative | ||
| pass-throughs, the executor verify forward (`execute_speculative`, | ||
| GPU-validated lossless), and the scheduler serving-loop wiring | ||
| (`speculative_decode_step`, mock-tested). Remaining: a first-class typed config | ||
| knob to replace the env switch, and speedup numbers on realistic prompts.** | ||
|
|
||
| ## Why this is cheap to add here | ||
|
|
||
| Three of the four pieces already exist or are trivial: | ||
|
|
||
| - **Proposer** — `openinfer-qwen3-4b/src/ngram.rs` (`NgramProposer`). Done, unit-tested. | ||
| - **Acceptance** — `openinfer-qwen3-4b/src/speculative.rs` (`accept_greedy`, | ||
| `SpeculativeConfig`). Done, unit-tested. | ||
| - **KV scheduling / rollback** — kvbm's `SchedulableSequence` already implements | ||
| `schedule_speculative` / `apply_speculative` with **automatic LIFO release of | ||
| the blocks pre-allocated for rejected drafts**. Exposed on `RequestKv` | ||
| (`openinfer-kv-cache/src/pool.rs`). Done, unit-tested. | ||
| - **GPU verify forward + orchestration**: the one piece that needed new work; now landed (see the "Landed" sections below). | ||
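To make the proposer piece concrete, here is a minimal sketch of prompt-lookup proposing. It is not the `NgramProposer` from `ngram.rs`; the function name `propose_ngram` and its exact matching policy (longest suffix first, most recent earlier occurrence wins) are assumptions for illustration. The `max_ngram` and `k` parameters mirror the knobs this note describes.

```rust
/// Hypothetical prompt-lookup proposer (illustration only, not the PR's code).
/// Finds the most recent earlier occurrence of the longest matching suffix of
/// `history` (up to `max_ngram` tokens) and returns up to `k` tokens after it.
fn propose_ngram(history: &[u32], max_ngram: usize, k: usize) -> Vec<u32> {
    let len = history.len();
    // Try the longest suffix first, then fall back to shorter ones.
    for n in (1..=max_ngram.min(len.saturating_sub(1))).rev() {
        let suffix = &history[len - n..];
        // Scan earlier windows from most recent to oldest; the range stops
        // before the suffix's own position so it never matches itself.
        for start in (0..len - n).rev() {
            if &history[start..start + n] == suffix {
                let from = start + n;
                let to = (from + k).min(len);
                return history[from..to].to_vec();
            }
        }
    }
    // No match: the caller falls back to a plain one-token decode.
    Vec::new()
}

fn main() {
    let history = [1, 2, 3, 4, 1, 2, 3];
    // The suffix [1, 2, 3] also occurs at index 0, so the draft is what followed it.
    println!("{:?}", propose_ngram(&history, 3, 4));
    // No repeated token anywhere: no draft.
    println!("{:?}", propose_ngram(&[1, 2, 3], 3, 4));
}
```

An empty result corresponds to the "empty -> plain decode" branch in the data flow.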
|
|
||
| ## Per-step data flow (single request, greedy) | ||
|
|
||
| Layering mirrors vLLM V1 (proposer in the runner/scheduler layer that owns | ||
| token history; the executor reserves KV, runs the target forward, and accepts): | ||
|
|
||
| ``` | ||
| scheduler owns the request's token history (prompt + generated) | ||
| 1. drafts = NgramProposer.propose(history) # K candidates; empty -> plain decode | ||
| 2. verify_inputs = [d0, c0, .., c_{K-1}] # d0 = last committed token (dangling) | ||
| 3. RequestKv.schedule_speculative(K + 1) # room for drafts + bonus token | ||
| 4. argmax[0..=K] = verify_forward(verify_inputs, prefill_view(1 + K)) # K + 1 tokens | ||
| 5. committed = accept_greedy(drafts, argmax) # m accepted drafts + 1 model token | ||
| 6. RequestKv.apply_speculative(committed) # kvbm releases rejected blocks | ||
| 7. scheduler appends committed to history; applies stop / max_tokens; | ||
| committed.last() becomes the next step's d0 | ||
| ``` | ||
|
|
||
| ### Token / KV accounting (matches kvbm's verified contract) | ||
|
|
||
| - The verify forward computes KV for `1 + K` positions (`d0` + `K` drafts) at | ||
| `base_pos = kv_position`. Structurally this is a prefill of `1 + K` tokens at | ||
| the current position, so the `KvView` is `RequestKv::prefill_view(1 + K)` and | ||
| the forward reuses the existing paged-prefill attention path. | ||
| - `argmax[i]` is the model's greedy token *after* consuming `verify_inputs[i]`: | ||
| `argmax[0]` follows `d0` (the true next token), `argmax[i]` follows `c_{i-1}`, | ||
| `argmax[K]` is the bonus continuation. This index convention is exactly what | ||
| `accept_greedy(proposed = drafts, target_argmax = argmax)` expects. | ||
| - `apply_speculative(committed)` advances `kv_position` by `committed.len()` | ||
| (`m + 1`); kvbm LIFO-drops the over-allocated blocks. `committed.last()` (the | ||
| model token) becomes the new dangling token. Since `m <= K`, the commit | ||
| never exceeds the `K + 1` slots reserved by `schedule_speculative(K + 1)`. | ||
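The accounting above can be traced with a toy model. The `KvAccounting` struct below is a hypothetical stand-in, not the kvbm or `RequestKv` API: it counts token slots rather than blocks and only mirrors the reserve / commit / release arithmetic described in this section.

```rust
/// Hypothetical slot-level stand-in for the KV accounting (kvbm works in blocks).
struct KvAccounting {
    kv_position: usize,
    reserved: usize,
}

impl KvAccounting {
    /// Reserve room for the verify step: K drafts plus one bonus token.
    fn schedule_speculative(&mut self, slots: usize) {
        self.reserved = slots;
    }

    /// Commit the accepted tokens and return how many reserved slots are released.
    fn apply_speculative(&mut self, committed: &[u32]) -> usize {
        assert!(committed.len() <= self.reserved, "commit exceeds reservation");
        let released = self.reserved - committed.len();
        self.kv_position += committed.len();
        self.reserved = 0;
        released
    }
}

/// verify_inputs = [d0, c0, .., c_{K-1}]: the dangling token followed by the drafts.
fn build_verify_inputs(d0: u32, drafts: &[u32]) -> Vec<u32> {
    let mut inputs = Vec::with_capacity(1 + drafts.len());
    inputs.push(d0);
    inputs.extend_from_slice(drafts);
    inputs
}

fn main() {
    let drafts = [10, 11, 12, 13]; // K = 4
    let inputs = build_verify_inputs(9, &drafts);
    let mut kv = KvAccounting { kv_position: 100, reserved: 0 };
    kv.schedule_speculative(drafts.len() + 1);
    // Suppose m = 2 drafts were accepted and the model token is 99.
    let released = kv.apply_speculative(&[10, 11, 99]);
    println!("inputs={:?} kv_position={} released={}", inputs, kv.kv_position, released);
}
```

With `K = 4` and `m = 2`, the commit advances the position by `m + 1 = 3` and frees the remaining 2 reserved slots.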
|
|
||
| ### Why it is lossless | ||
|
|
||
| `accept_greedy` only keeps the prefix where `draft[i] == argmax[i]` and then | ||
| appends one of the model's own tokens, so the committed sequence is identical | ||
| to what plain greedy decode would have produced one token at a time. This gives | ||
| a free correctness oracle: **speculative-on must equal speculative-off** for the | ||
| same prompt under greedy params. | ||
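The losslessness argument can be checked on a toy target. In the sketch below, `accept_greedy` follows the index convention stated above (`proposed` has `K` entries, `target_argmax` has `K + 1`), but its body is a reconstruction, not the code in `speculative.rs`. The target `toy_next` and the deliberately half-wrong drafts are assumptions made up for the demonstration.

```rust
/// Keep the longest prefix where draft[i] == argmax[i], then append the
/// target's own token at the first disagreement (or the bonus token).
fn accept_greedy(proposed: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), proposed.len() + 1);
    let m = proposed
        .iter()
        .zip(target_argmax)
        .take_while(|(d, a)| d == a)
        .count();
    let mut committed = proposed[..m].to_vec();
    committed.push(target_argmax[m]);
    committed
}

/// Toy deterministic "target model": next token depends only on the last token.
fn toy_next(ctx: &[u32]) -> u32 {
    (ctx[ctx.len() - 1] + 1) % 4
}

fn plain_greedy(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut history = prompt.to_vec();
    for _ in 0..max_new {
        let t = toy_next(&history);
        history.push(t);
    }
    history
}

/// Returns the final history and the number of verify passes used.
fn speculative_greedy(prompt: &[u32], max_new: usize) -> (Vec<u32>, usize) {
    let mut history = prompt.to_vec();
    let target_len = prompt.len() + max_new;
    let mut passes = 0;
    while history.len() < target_len {
        let last = history[history.len() - 1];
        // Two correct guesses followed by two wrong ones (9 is never produced).
        let drafts = vec![(last + 1) % 4, (last + 2) % 4, 9, 9];
        // Simulated verify forward: argmax[i] follows verify_inputs[i].
        let mut ctx = history.clone();
        let mut argmax = vec![toy_next(&ctx)];
        for &d in &drafts {
            ctx.push(d);
            argmax.push(toy_next(&ctx));
        }
        passes += 1;
        history.extend(accept_greedy(&drafts, &argmax));
        history.truncate(target_len); // max_tokens handling
    }
    (history, passes)
}

fn main() {
    let (spec, passes) = speculative_greedy(&[0], 10);
    assert_eq!(spec, plain_greedy(&[0], 10));
    println!("identical output in {passes} verify passes instead of 10 decode steps");
}
```

Wrong drafts only cost throughput: each pass still commits at least the target's own next token, so the output never diverges from plain greedy.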
|
|
||
| ## Landed: GPU verify forward + executor step | ||
|
|
||
| - **`LocalQwen3Lane::execute_verify(verify_tokens, kv_view, lora)`** — reuses | ||
| `batch_prefill(echo = true)` for per-position logits over the existing paged | ||
| KV, then a GPU argmax (`argmax_batch_bf16_into`) returns one token per verify | ||
| position. Only the token ids cross to host, never the `[vocab, n]` logits | ||
| (vLLM-style). Kept additive: `batch_prefill` / `batch_decode` are untouched. | ||
| - **`StepCommand::SpeculativeVerify` + `Qwen3Executor::execute_speculative`** — | ||
| `schedule_speculative(K + 1)` → `prefill_view(1 + K)` → verify step → argmax → | ||
| `accept_greedy` → `apply_speculative`, returning the committed tokens. The | ||
| rank-worker channel carries it (TP-safe). | ||
|
|
||
| ## Validated | ||
|
|
||
| - **Lossless on GPU** — `tests/ngram_speculative.rs` runs real Qwen3-4B and | ||
| asserts greedy n-gram speculative decode is token-identical to plain greedy | ||
| decode (prefix cache off; repetitive prompt). Confirms the full pipeline | ||
| (proposer → schedule_speculative → verify forward → accept_greedy → | ||
| apply_speculative) end-to-end. | ||
|
|
||
| ## Landed: scheduler serving-loop wiring | ||
|
|
||
| - `ActiveRequestState` carries `token_history` (prompt + generated), maintained | ||
| on promote and each committed decode token. | ||
| - `scheduler/speculative.rs::speculative_decode_step` proposes per active | ||
| request, runs `execute_speculative` (or a single decode when no draft), and | ||
| streams the committed tokens with stop / max-token handling — isolated from | ||
| the one-token-per-step plan/resolve/effects pipeline. `scheduler_loop` routes | ||
| pure-decode ticks through it when `SpeculativeConfig.enabled`. | ||
| - Mock-tested (`FakeExecutor::execute_speculative`): every committed token is | ||
| streamed and state advances; a commit that runs past `max_tokens` is | ||
| truncated, after which the request finishes and retires. | ||
|
|
||
| ## Proposer seam (closed-set, n-gram-sized) | ||
|
|
||
| The proposer is factored out as the one piece meant to vary between methods: | ||
|
|
||
| - `speculative::SpeculativeProposer` — `fn propose(&self, context: &[u32]) -> Vec<u32>`. | ||
| - `SpeculativeConfig.method: SpeculativeMethod` (a *closed* enum, one variant per | ||
| method) + `build_proposer()` factory. `scheduler_loop` builds one boxed | ||
| `dyn SpeculativeProposer` at startup; `speculative_decode_step` takes | ||
| `&dyn SpeculativeProposer`. This is closed-set enum dispatch, not an open | ||
| plugin system — the idiomatic Rust choice for a small known set. | ||
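A rough sketch of this seam, assuming shapes the note does not spell out: the trait signature is the one quoted above, while the enum payload, the free-function form of `build_proposer`, and the `NgramSketch` proposer (a one-token lookup, far simpler than the real n-gram matcher) are hypothetical.

```rust
/// Trait signature as quoted in the design note.
trait SpeculativeProposer {
    fn propose(&self, context: &[u32]) -> Vec<u32>;
}

/// Closed set of methods: adding a proposer means adding a variant here.
/// The payload field is an assumption for this sketch.
enum SpeculativeMethod {
    Ngram { num_tokens: usize },
}

/// Placeholder proposer: looks up the previous occurrence of the last token
/// and proposes what followed it.
struct NgramSketch {
    num_tokens: usize,
}

impl SpeculativeProposer for NgramSketch {
    fn propose(&self, context: &[u32]) -> Vec<u32> {
        let Some((&last, earlier)) = context.split_last() else {
            return Vec::new();
        };
        match earlier.iter().rposition(|&t| t == last) {
            Some(i) => context[i + 1..]
                .iter()
                .take(self.num_tokens)
                .copied()
                .collect(),
            None => Vec::new(),
        }
    }
}

/// Factory: one exhaustive match, so the compiler flags any unhandled variant.
fn build_proposer(method: &SpeculativeMethod) -> Box<dyn SpeculativeProposer> {
    match method {
        SpeculativeMethod::Ngram { num_tokens } => Box::new(NgramSketch {
            num_tokens: *num_tokens,
        }),
    }
}

fn main() {
    // Built once at startup; the step function would take `&dyn SpeculativeProposer`.
    let proposer = build_proposer(&SpeculativeMethod::Ngram { num_tokens: 2 });
    println!("{:?}", proposer.propose(&[1, 2, 3, 1]));
}
```

The exhaustive `match` is what makes the set closed: a new method cannot be added without touching the factory.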
|
|
||
| This is a good **n-gram** seam, not yet a general proposer abstraction. The | ||
| trait fits stateless, token-emitting proposers; a draft-model / EAGLE proposer | ||
| would need a wider trait (`&mut self` + per-request create/drop lifecycle, the | ||
| request id, and returning draft probabilities for rejection sampling) **and** | ||
| changes to the scheduler step and verify path. Concretely, the parts below the | ||
| proposer are **greedy-specific**, not method-agnostic: | ||
|
|
||
| - the verify forward returns argmax (part of the greedy acceptance rule; sampling | ||
| acceptance needs distributions), | ||
| - `accept_greedy` is greedy-only, | ||
| - `speculative_decode_step` assumes a stateless proposer (no per-request | ||
| create/drop). | ||
|
|
||
| Widening these is deferred until a second proposer actually lands, so the shapes | ||
| are validated against a real implementation rather than guessed at now. | ||
|
|
||
| ## Enabling it | ||
|
|
||
| `scheduler_loop` builds the config via `SpeculativeConfig::from_env()` | ||
| (default-off). The generic switch lives on `SpeculativeConfig`; each method | ||
| parses its own knobs (`NgramConfig::from_env`): | ||
|
|
||
| - `OPENINFER_QWEN3_SPEC=1` — turn speculation on (generic). | ||
| - `OPENINFER_QWEN3_NGRAM_TOKENS=K` — draft count (n-gram, default 4). | ||
| - `OPENINFER_QWEN3_NGRAM_MAX_NGRAM=N` — longest matched suffix (n-gram, default 3). | ||
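One way the knob parsing could look, as a sketch rather than the actual `from_env` code. The variable names and defaults come from the list above; the struct `NgramKnobs`, the helper names, the rule that only the literal value `1` enables speculation, and the fallback to the default on an unparsable value are all assumptions.

```rust
use std::env;

/// Hypothetical holder for the two n-gram knobs.
#[derive(Debug, PartialEq)]
struct NgramKnobs {
    num_tokens: usize,
    max_ngram: usize,
}

/// Parse an optional raw value, falling back to `default` when the variable
/// is unset or not a valid number (assumed policy).
fn parse_or(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(default)
}

/// Generic switch: treated as on only for the value "1" (assumed policy).
fn spec_enabled(raw: Option<String>) -> bool {
    raw.as_deref().map(str::trim) == Some("1")
}

/// The lookup is injected so the parsing can be exercised without touching
/// the process environment.
fn ngram_knobs(lookup: impl Fn(&str) -> Option<String>) -> NgramKnobs {
    NgramKnobs {
        num_tokens: parse_or(lookup("OPENINFER_QWEN3_NGRAM_TOKENS"), 4),
        max_ngram: parse_or(lookup("OPENINFER_QWEN3_NGRAM_MAX_NGRAM"), 3),
    }
}

fn main() {
    let enabled = spec_enabled(env::var("OPENINFER_QWEN3_SPEC").ok());
    let knobs = ngram_knobs(|key| env::var(key).ok());
    println!("enabled={enabled} knobs={knobs:?}");
}
```

Keeping the generic switch separate from the method-specific knobs matches the split described above, where each method parses its own settings.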
|
|
||
| Only the non-LoRA `scheduler_loop` reads it; the unified prefill+decode tick | ||
| still uses plain decode. | ||
|
|
||
| **Per-request eligibility.** Even with the switch on, only requests that are | ||
| greedy (`SamplingParams::is_greedy()`) **and** ask for no decode logprobs | ||
| (`logprobs == 0`) take the speculative path. Speculation verifies with argmax | ||
| and emits no per-token logprobs, so a sampled request would otherwise be forced | ||
| to argmax and a logprobs request would silently lose them. Any ineligible | ||
| request takes a normal sampled single-token decode (its own params, logprobs, | ||
| and a fresh `random_val`) on that tick, so enabling speculation never changes a | ||
| sampled request's output or strips requested logprobs. | ||
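The eligibility rule reduces to a small predicate. The sketch below is illustrative: `SamplingParams` is cut down to two hypothetical fields, the `is_greedy` body (temperature equal to zero) is an assumption about how greediness is defined, and `DecodePath` / `choose_path` are made-up names.

```rust
/// Reduced stand-in for the real sampling parameters (fields are assumptions).
#[derive(Debug, Clone)]
struct SamplingParams {
    temperature: f32,
    logprobs: u32,
}

impl SamplingParams {
    /// Assumed definition of greedy for this sketch.
    fn is_greedy(&self) -> bool {
        self.temperature == 0.0
    }
}

#[derive(Debug, PartialEq)]
enum DecodePath {
    Speculative,
    PlainDecode,
}

/// Speculation only when the switch is on, the request is greedy, and it
/// asks for no decode logprobs; everything else decodes one token normally.
fn choose_path(spec_enabled: bool, params: &SamplingParams) -> DecodePath {
    if spec_enabled && params.is_greedy() && params.logprobs == 0 {
        DecodePath::Speculative
    } else {
        DecodePath::PlainDecode
    }
}

fn main() {
    let greedy = SamplingParams { temperature: 0.0, logprobs: 0 };
    let sampled = SamplingParams { temperature: 0.8, logprobs: 0 };
    let wants_logprobs = SamplingParams { temperature: 0.0, logprobs: 5 };
    println!("{:?}", choose_path(true, &greedy));
    println!("{:?}", choose_path(true, &sampled));
    println!("{:?}", choose_path(true, &wants_logprobs));
}
```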
|
|
||
| ## Remaining work | ||
|
|
||
| 1. **First-class config knob**: env var is the current switch; thread a typed | ||
| knob through `start_qwen3*` / `start_engine*` (and a server flag) so it shows | ||
| up in the engine config rather than the environment. | ||
| 2. **Speedup measurement** (initial numbers below — generalize to realistic | ||
| prompts and the scheduler loop). | ||
| 3. **Batched / vLLM-style verify** (perf): fold the verify tokens into the | ||
| unified batched forward (FlashInfer varlen) with a GPU rejection step, | ||
| instead of the current per-request `batch_prefill`-based verify. | ||
|
|
||
| ## Measured speedup (best case) | ||
|
|
||
| `tests/ngram_speculative.rs::ngram_speculative_speedup` (ignored; needs GPU + | ||
| weights) times greedy vs. speculative on Qwen3-4B (eager, single request, 192 | ||
| tokens). On the perfectly periodic synthetic prompt: | ||
|
|
||
| | metric | greedy | speculative | | ||
| | ----------------- | -------- | ----------- | | ||
| | forward passes | 191 | 39 | | ||
| | ms / token | 9.99 | 2.52 | | ||
| | accepted / verify | n/a | 5.00 (max with K=4: 4 drafts + 1 model token) | | ||
| | wall-clock | 1908 ms | 481 ms (**3.96x**) | | ||
|
|
||
| This is the ceiling: the prompt is exactly periodic so every draft is accepted. | ||
| Real prompts accept a fraction of drafts, so expect smaller wins; the benchmark | ||
| exists to track acceptance-rate / TPOT as the proposer and verify path evolve. | ||
| Run: `cargo test -p openinfer-qwen3-4b --release --test ngram_speculative \ | ||
| ngram_speculative_speedup -- --ignored --nocapture` (`OPENINFER_BENCH_TOKENS` | ||
| overrides the 192-token default). | ||
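The table's figures are mutually consistent under one reading, sketched here as a quick check. The reading is an assumption: 191 decode steps for 192 tokens (the first token coming out of prefill), every verify pass committing `K + 1 = 5` tokens, and ms / token computed as wall-clock over 191 in both columns. The helper names are hypothetical.

```rust
/// Ceiling division: verify passes needed to emit `decode_tokens` tokens
/// when every pass commits `per_pass` tokens.
fn verify_passes(decode_tokens: u32, per_pass: u32) -> u32 {
    (decode_tokens + per_pass - 1) / per_pass
}

/// Average milliseconds per decoded token.
fn ms_per_token(wall_ms: f64, decode_tokens: u32) -> f64 {
    wall_ms / f64::from(decode_tokens)
}

fn main() {
    let decode_tokens = 191; // assumed: 192 tokens minus the one from prefill
    let per_pass = 4 + 1; // K = 4 accepted drafts plus one model token
    println!("verify passes: {}", verify_passes(decode_tokens, per_pass));
    println!("greedy ms/token: {:.2}", ms_per_token(1908.0, decode_tokens));
    println!("speculative ms/token: {:.2}", ms_per_token(481.0, decode_tokens));
}
```

Note that the wall-clock gain (about 4x) is smaller than the 5x reduction in forward passes, which suggests a verify pass over 5 positions costs somewhat more than a single-token decode step.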
|
|
||
| ## Scope / deferred | ||
|
|
||
| First cut is **single-request, greedy, non-CUDA-graph**. Deferred: batched | ||
| speculation (ragged verify across requests), sampling (non-greedy) acceptance, | ||
| CUDA-graph-captured verify, interaction with the unified prefill+decode step, | ||
| and pipelined ahead-of-time proposal. | ||
|
|
||
| ## Open questions for review | ||
|
|
||
| 1. `verify_forward` as a standalone model method (preferred) vs. reusing | ||
| `batch_prefill(echo = true)`. | ||
| 2. Confirm the GPU-side argmax via `select_batch_tokens_into` over the `K + 1` | ||
| verify positions is acceptable (vs. a dedicated argmax kernel later). | ||
| 3. Proposer placement in the scheduler layer (owns token history), matching | ||
| vLLM's runner / `request.spec_token_ids` split. | ||
|
|
||
| ## Prior art | ||
|
|
||
| - vLLM V1 n-gram / prompt-lookup spec decode (`NgramProposer` in the runner, | ||
| GPU rejection sampler, scheduler reserves KV for `k` draft tokens). | ||
| - kvbm `SchedulableSequence` speculative lifecycle (`schedule_speculative` / | ||
| `apply_speculative`, LIFO block release). | ||