feat(qwen3): EAGLE-3 Chain Speculative decoding #454
Squash of the 17-commit eagle3 series (loader → KV budget → lane load → 3-layer capture → chain draft/verify → lossless gate → shift fix → γ tuning → GSM8K harness). Granular history preserved in branch eagle3-backup-prerebase. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Draft chain step now uses a dedicated FlashInfer single-query decode path (single_decode_nhd) instead of borrowing DFlash's noncausal prefill kernel, so the draft's per-step attention is structurally single-query and no longer name-collides with the prefill path.

Add an EAGLE-3 drafter golden gate validating prefill_batched against the official SafeAILab/EAGLE cnets.Model reference (regret=0, max rel logit delta 1.25% bf16). Fixture + generator under tools/accuracy. Drops the old kernel-vs-kernel check, which used DFlash's kernel as oracle and so could pass while both kernels shared an EAGLE-wrong bug.

Perf A/B harness: tracing-subscriber dev-dep to surface the acceptance-rate debug trace, plus an OPENINFER_TEST_MEM_UTIL override to fit target+draft on a 16GB card.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
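The two golden-gate numbers quoted above (regret and max relative logit delta) can be illustrated with a small sketch. The definitions used here are assumptions, since the commit message does not spell them out: `regret` is taken as the reference logit lost by picking the candidate's argmax instead of the reference's, and `max_rel_delta` as the largest absolute logit difference divided by the largest reference magnitude. The function names are hypothetical and are not taken from the repo.

```rust
/// Index of the largest logit (first one wins on ties).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Assumed definition: reference logit lost by following the candidate's
/// top-1 choice. Zero means both implementations pick the same token
/// (or a token the reference scores equally).
fn regret(reference: &[f32], candidate: &[f32]) -> f32 {
    reference[argmax(reference)] - reference[argmax(candidate)]
}

/// Assumed definition: max |ref - cand| scaled by max |ref|.
fn max_rel_delta(reference: &[f32], candidate: &[f32]) -> f32 {
    let scale = reference.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let diff = reference
        .iter()
        .zip(candidate)
        .fold(0.0f32, |m, (&r, &c)| m.max((r - c).abs()));
    if scale == 0.0 {
        0.0
    } else {
        diff / scale
    }
}
```

Under these assumed definitions, a gate would pass when `regret` is exactly zero on every checked position and `max_rel_delta` stays under a bf16-appropriate tolerance.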
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5d350206e
    dflash_draft_model_path,
    eagle3_draft_model_path,
Reject EAGLE when batch-invariant is enabled
When --batch-invariant is combined with --eagle3-draft-model-path, this path still reaches scheduler::start_qwen3 because the guard above only passes dflash_draft_model_path.is_some() into ensure_batch_invariant_supported. Pin mode's GEMM path bails on unwarmed drafter shapes, and EAGLE adds fc/qkv/lm_head GEMMs with different (M,K) shapes, so the first captured EAGLE request will error at runtime instead of being rejected at startup. Please include EAGLE in the same batch-invariant rejection as DFlash.
    let max_cache_len = (req.prompt_tokens.len()
        + req.max_output_tokens
        + crate::eagle3::EAGLE3_CHAIN_LENGTH)
        .min(max_pos)
Reserve EAGLE's transient chain slots before admitting
For an EAGLE request where prompt_tokens.len() + max_output_tokens == max_pos, this clamp allocates no room for the 3 transient draft positions even though the request is still admitted under the target context limit. Near the end of that request, the scheduler still asks chain_round for EAGLE3_CHAIN_LENGTH drafts, and draft_step rejects once the transient position reaches state.max_cache_len, turning a valid length-capped generation into an error. Please either cap EAGLE admission by the chain length or clamp the drafted span before rolling the chain.
The batch-invariant guard only forwarded dflash_draft_model_path.is_some() into ensure_batch_invariant_supported, so --batch-invariant + --eagle3-draft-model-path slipped through to the scheduler. Pin mode warms/self-checks only the base projections; EAGLE-3's fc/qkv/lm_head GEMMs are new (M,K) shapes the pin never covers, so the first captured EAGLE request would bail at the GEMM boundary at runtime instead of being rejected at startup.

Fold EAGLE-3 into the same rejection as DFlash (OR the two draft-path flags; generalize the message) + regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
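A minimal sketch of the guard described in this commit, assuming a two-bool form of `ensure_batch_invariant_supported` (the real signature is not shown in this conversation). The `Args` struct, `check_args`, and the error text are hypothetical stand-ins for illustration only.

```rust
/// Hypothetical stand-in for the CLI options named in the review.
#[derive(Default)]
struct Args {
    batch_invariant: bool,
    dflash_draft_model_path: Option<String>,
    eagle3_draft_model_path: Option<String>,
}

/// Assumed shape of the guard: reject batch-invariant mode whenever any
/// speculative drafter is configured.
fn ensure_batch_invariant_supported(
    batch_invariant: bool,
    has_drafter: bool,
) -> Result<(), String> {
    if batch_invariant && has_drafter {
        return Err(
            "--batch-invariant is not supported with a draft model (DFlash or EAGLE-3)"
                .to_string(),
        );
    }
    Ok(())
}

/// OR the two draft-path flags so EAGLE-3 is rejected at startup like DFlash.
fn check_args(args: &Args) -> Result<(), String> {
    let has_drafter =
        args.dflash_draft_model_path.is_some() || args.eagle3_draft_model_path.is_some();
    ensure_batch_invariant_supported(args.batch_invariant, has_drafter)
}
```

The point of the OR is that the rejection happens before the scheduler starts, rather than at the first GEMM with an unwarmed shape.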
…osition limit

The draft-KV reservation clamps max_cache_len to the drafter's max_position_embeddings (max_pos), so a request with prompt + max_output within EAGLE3_CHAIN_LENGTH of max_pos got no room for the transient draft positions the chain writes each round. Near the end of such a request the scheduler still asks chain_round for CHAIN_LENGTH drafts and draft_step bails (position >= max_cache_len), turning a valid length-capped generation into a runtime error.

Gate EAGLE admission on prompt + max_output + CHAIN_LENGTH <= max_pos: near-limit requests fall back to plain decode (still correct, just not accelerated) instead of erroring. Guarantees the reservation's .min(max_pos) clamp never eats the chain headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
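The admission rule in this commit can be sketched as follows. The constant value 3 matches the "3 transient draft positions" mentioned in the review; the `DecodeMode` enum and the `admit` function are hypothetical names used only for illustration, not the repo's actual types.

```rust
/// Chain length: 3 transient draft positions per round, per the review comment.
const EAGLE3_CHAIN_LENGTH: usize = 3;

/// Hypothetical outcome of admission for a single request.
#[derive(Debug, PartialEq)]
enum DecodeMode {
    /// Speculative path, with the draft-KV length that must be reserved.
    Eagle3Chain { max_cache_len: usize },
    /// Fallback: still correct, just not accelerated.
    PlainDecode,
}

/// Admit to EAGLE only if prompt + max_output + CHAIN_LENGTH fits in max_pos,
/// so the later `.min(max_pos)` clamp can never remove the chain headroom.
fn admit(prompt_len: usize, max_output_tokens: usize, max_pos: usize) -> DecodeMode {
    let needed = prompt_len
        .saturating_add(max_output_tokens)
        .saturating_add(EAGLE3_CHAIN_LENGTH);
    if needed <= max_pos {
        DecodeMode::Eagle3Chain { max_cache_len: needed.min(max_pos) }
    } else {
        DecodeMode::PlainDecode
    }
}
```

With `max_pos = 131`, a request with a 100-token prompt and 28 output tokens is admitted with exactly 131 reserved slots, while 29 output tokens falls back to plain decode.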
This is a PR for issue #443
Roadmap
step1: Chain draft
Change for step 1
Kernel:
- `eagle3_rope_kernel`
- `single_prefill_nhd_noncausal_cuda`

Model
- `openinfer-qwen3-4b/src/eagle3/loading.rs`
- `openinfer-qwen3-4b/src/eagle3/reservation.rs`
- `openinfer-qwen3-4b/src/eagle3/forward.rs`

Correctness check: EAGLE3 Python golden gate
- `eagle3_drafter_golden_gate` from `tools/accuracy/dump_qwen3_4b_eagle3_golden.py`

SD Loop
- `capture_layer_ids` in `process_all_layers_batch_multi`
- `openinfer-qwen3-4b/src/executor/eagle3_lane.rs`: maintains the draft weights and each request's buffer, state, and temporary `captured_hidden`
- `execute_eagle3_verify` in `openinfer-qwen3-4b/src/executor.rs`
- `SpeculativeVerify` and `SpeculativeDraft` in `execute_step_on_lane`

Correctness check:
- `openinfer-qwen3-4b/tests/eagle3_speculative_gate.rs`

Perf:
- `openinfer-qwen3-4b/tests/eagle3_speculative_perf.rs` and `openinfer-qwen3-4b/tests/eagle3_gsm8k_perf.rs`

Result
Chain SD, k=3, max new tokens=256, RTX 5070 Ti, greedy, 10 prompts from GSM8K
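For readers unfamiliar with the greedy chain setting measured here, the acceptance rule can be sketched as below. This is the textbook greedy verification step, not necessarily what `execute_eagle3_verify` does internally; `accept_chain` and its argument layout are assumptions made for illustration.

```rust
/// Greedy chain verification sketch.
///
/// `draft` holds the k drafted tokens. `target_argmax` holds the target
/// model's greedy choice at each of the k + 1 verified positions: entry i is
/// what the target would emit after the first i draft tokens.
///
/// Returns the tokens committed this round: the longest draft prefix the
/// target agrees with, plus one token from the target itself (a correction on
/// mismatch, or a bonus token when every draft token is accepted).
fn accept_chain(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    let mut out = draft[..accepted].to_vec();
    out.push(target_argmax[accepted]);
    out
}
```

Because every committed token equals the target's own greedy choice, the output matches plain greedy decoding token for token; with k=3, each round commits between 1 and 4 tokens.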
Next Step: faithful Tree speculative decoding
The next step is to implement a faithful version, for which we need: