feat(qwen3): EAGLE-3 Chain Speculative decoding by scatyf3 · Pull Request #454 · openinfer-project/openinfer

scatyf3 · 2026-06-24T21:19:52Z

This is a PR for issue #443.

Roadmap

Step 1: Chain draft

Draft model loading
1. Model Loader ✅
2. Memory Allocator ✅
Draft model forward pass ✅
1. Chain (for testing) ✅
Target model forward pass: Target feature capture for draft injection ✅
Executor: wire EAGLE speculative decoding into the current steps ✅
Reserve memory for both target and draft ✅
Frontend: add EAGLE3 args, wire to existing generation pipeline ✅

Changes for step 1

Kernel:

RoPE without QK norm: `eagle3_rope_kernel`
prefill over contiguous KV: `single_prefill_nhd_noncausal_cuda`
decode over contiguous KV

Model

Draft model loader: `openinfer-qwen3-4b/src/eagle3/loading.rs`
Memory allocator: `openinfer-qwen3-4b/src/eagle3/reservation.rs`
Draft forward pass: `openinfer-qwen3-4b/src/eagle3/forward.rs`

Correctness check: EAGLE3 Python golden gate `eagle3_drafter_golden_gate`, generated by `tools/accuracy/dump_qwen3_4b_eagle3_golden.py`

SD Loop

Reuse DFlash's target feature capture (`capture_layer_ids`) in `process_all_layers_batch_multi`.
Draft executor (`openinfer-qwen3-4b/src/executor/eagle3_lane.rs`): maintains the draft weights and each request's buffer, state, and temporary `captured_hidden`.
Target Executor:
1. add execute_eagle3_verify in openinfer-qwen3-4b/src/executor.rs
2. add eagle3 to SpeculativeVerify and SpeculativeDraft in execute_step_on_lane
For scheduling and sampling, we reuse the dflash util
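
The verify half of a greedy chain round can be sketched as follows. This is a minimal illustration, not the PR's `execute_eagle3_verify`: the names `verify_chain_greedy` and `VerifyOutcome`, and the assumption that the target returns k + 1 greedy tokens for k drafts, are hypothetical and chosen only for the example.

```rust
/// Outcome of verifying one chain of draft tokens against the target model.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct VerifyOutcome {
    /// Number of draft tokens accepted (length of the matching prefix).
    pub accepted: usize,
    /// Tokens appended to the output: accepted drafts plus one target token.
    pub emitted: Vec<u32>,
}

/// Greedy chain verification.
///
/// `draft` holds the k drafted tokens. `target_argmax` is assumed to hold
/// k + 1 greedy target tokens: entry i is the target's choice at the position
/// draft[i] proposes, and the last entry follows a fully accepted chain.
pub fn verify_chain_greedy(draft: &[u32], target_argmax: &[u32]) -> VerifyOutcome {
    assert_eq!(
        target_argmax.len(),
        draft.len() + 1,
        "target must score every draft position plus one extra"
    );
    // Accept drafts until the first disagreement with the target.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // The target's own token at the first mismatch (or after the full chain)
    // is always emitted, so every round makes progress.
    let mut emitted = draft[..accepted].to_vec();
    emitted.push(target_argmax[accepted]);
    VerifyOutcome { accepted, emitted }
}
```

Under greedy sampling this keeps the output identical to plain target decoding, because every emitted token is one the target would have picked itself; the draft only decides how many of them arrive per round.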

Correctness check: `openinfer-qwen3-4b/tests/eagle3_speculative_gate.rs`

Perf: `openinfer-qwen3-4b/tests/eagle3_speculative_perf.rs` and `openinfer-qwen3-4b/tests/eagle3_gsm8k_perf.rs`

Result

Chain SD, k=3, max new tokens=256, RTX 5070 Ti, greedy, 10 prompts from GSM8K

| engine | spec OFF | spec ON | speedup |
| --- | --- | --- | --- |
| openinfer | 83.9 | 108.7 | 1.30× |
| vLLM 0.21 | 81.9 | 101.1 | 1.23× |
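
The speedup column is the ratio of the spec ON figure to the spec OFF figure. A small sketch of that arithmetic, assuming the two middle columns are throughputs where higher is better (the helper names `speedup` and `round2` are made up for the example):

```rust
/// Speedup of speculative decoding over the baseline, as a throughput ratio.
pub fn speedup(spec_off: f64, spec_on: f64) -> f64 {
    spec_on / spec_off
}

/// Round to two decimal places, matching the precision used in the table.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

With the table's numbers, `round2(speedup(83.9, 108.7))` gives 1.30 for openinfer and `round2(speedup(81.9, 101.1))` gives 1.23 for vLLM.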

Next Step: faithful Tree speculative decoding

The next step is to implement a faithful version, which requires:

Draft Tree decoding
Target Tree verify
Runtime: CUDA Graph support for dynamic tree
KV Management for Tree

Squash of the 17-commit eagle3 series (loader → KV budget → lane load → 3-layer capture → chain draft/verify → lossless gate → shift fix → γ tuning → GSM8K harness). Granular history preserved in branch eagle3-backup-prerebase. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Draft chain step now uses a dedicated FlashInfer single-query decode path (single_decode_nhd) instead of borrowing DFlash's noncausal prefill kernel, so the draft's per-step attention is structurally single-query and no longer name-collides with the prefill path.

Add an EAGLE-3 drafter golden gate validating prefill_batched against the official SafeAILab/EAGLE cnets.Model reference (regret=0, max rel logit delta 1.25% bf16). Fixture + generator under tools/accuracy. Drops the old kernel-vs-kernel check, which used DFlash's kernel as oracle and so could pass while both kernels shared an EAGLE-wrong bug.

Perf A/B harness: tracing-subscriber dev-dep to surface the acceptance-rate debug trace, plus an OPENINFER_TEST_MEM_UTIL override to fit target+draft on a 16GB card.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
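
The two golden-gate figures quoted above (regret and max relative logit delta) could be computed along these lines. The commit message does not define either metric, so the definitions here are assumptions: regret is taken as how much reference logit is lost by picking the candidate's argmax instead of the reference's, and the relative delta is the largest absolute logit difference divided by the largest absolute reference logit. All function names are hypothetical.

```rust
/// Index of the largest element (first one wins on ties).
pub fn argmax(xs: &[f32]) -> usize {
    assert!(!xs.is_empty(), "logits must not be empty");
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Assumed definition of regret: reference logit of the reference's top token
/// minus reference logit of the candidate's top token. Zero means both
/// implementations pick the same (or an equally scored) token.
pub fn regret(reference: &[f32], candidate: &[f32]) -> f32 {
    assert_eq!(reference.len(), candidate.len());
    reference[argmax(reference)] - reference[argmax(candidate)]
}

/// Assumed definition of max relative logit delta:
/// max |ref - cand| divided by max |ref|.
pub fn max_rel_delta(reference: &[f32], candidate: &[f32]) -> f32 {
    assert_eq!(reference.len(), candidate.len());
    let max_abs_diff = reference
        .iter()
        .zip(candidate)
        .map(|(r, c)| (r - c).abs())
        .fold(0.0_f32, f32::max);
    let scale = reference.iter().map(|r| r.abs()).fold(0.0_f32, f32::max);
    max_abs_diff / scale
}
```

A gate built on these would assert `regret(..) == 0.0` and bound `max_rel_delta(..)` by a bf16-appropriate tolerance; the exact threshold used in the PR's fixture is not stated here.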

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scatyf3 · 2026-07-03T20:34:33Z

@xiaguan @CAICAIIs this adds the chain version of EAGLE-3 and is ready for review — let me know if you'd like any changes before merging.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5d350206e


chatgpt-codex-connector · 2026-07-03T20:37:29Z

        dflash_draft_model_path,
+        eagle3_draft_model_path,


Reject EAGLE when batch-invariant is enabled

When --batch-invariant is combined with --eagle3-draft-model-path, this path still reaches scheduler::start_qwen3 because the guard above only passes dflash_draft_model_path.is_some() into ensure_batch_invariant_supported. Pin mode's GEMM path bails on unwarmed drafter shapes, and EAGLE adds fc/qkv/lm_head GEMMs with different (M,K) shapes, so the first captured EAGLE request will error at runtime instead of being rejected at startup. Please include EAGLE in the same batch-invariant rejection as DFlash.


chatgpt-codex-connector · 2026-07-03T20:37:29Z

+                let max_cache_len = (req.prompt_tokens.len()
+                    + req.max_output_tokens
+                    + crate::eagle3::EAGLE3_CHAIN_LENGTH)
+                    .min(max_pos)


Reserve EAGLE's transient chain slots before admitting

For an EAGLE request where prompt_tokens.len() + max_output_tokens == max_pos, this clamp allocates no room for the 3 transient draft positions even though the request is still admitted under the target context limit. Near the end of that request, the scheduler still asks chain_round for EAGLE3_CHAIN_LENGTH drafts, and draft_step rejects once the transient position reaches state.max_cache_len, turning a valid length-capped generation into an error. Please either cap EAGLE admission by the chain length or clamp the drafted span before rolling the chain.


The batch-invariant guard only forwarded dflash_draft_model_path.is_some() into ensure_batch_invariant_supported, so --batch-invariant + --eagle3-draft-model-path slipped through to the scheduler. Pin mode warms/self-checks only the base projections; EAGLE-3's fc/qkv/lm_head GEMMs are new (M,K) shapes the pin never covers, so the first captured EAGLE request would bail at the GEMM boundary at runtime instead of being rejected at startup. Fold EAGLE-3 into the same rejection as DFlash (OR the two draft-path flags; generalize the message) + regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
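
The shape of that fix can be sketched as below. The real signature of `ensure_batch_invariant_supported` is not shown in this thread, so the two-boolean form, the error text, and the wrapper `check_startup` are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the startup guard named in the review:
/// batch-invariant (pin) mode is rejected whenever any draft model is set.
pub fn ensure_batch_invariant_supported(
    batch_invariant: bool,
    has_draft_model: bool,
) -> Result<(), String> {
    if batch_invariant && has_draft_model {
        return Err(
            "--batch-invariant is not supported with a speculative draft model".to_string(),
        );
    }
    Ok(())
}

/// Startup check that ORs the two draft-path flags, so EAGLE-3 is rejected
/// at startup in the same way as DFlash instead of failing at runtime.
pub fn check_startup(
    batch_invariant: bool,
    dflash_draft_model_path: Option<&str>,
    eagle3_draft_model_path: Option<&str>,
) -> Result<(), String> {
    let has_draft = dflash_draft_model_path.is_some() || eagle3_draft_model_path.is_some();
    ensure_batch_invariant_supported(batch_invariant, has_draft)
}
```

A regression test in this style asserts that `check_startup(true, None, Some("draft"))` returns an error, which is the case the original guard let through.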

…osition limit The draft-KV reservation clamps max_cache_len to the drafter's max_position_embeddings (max_pos), so a request with prompt + max_output within EAGLE3_CHAIN_LENGTH of max_pos got no room for the transient draft positions the chain writes each round. Near the end of such a request the scheduler still asks chain_round for CHAIN_LENGTH drafts and draft_step bails (position >= max_cache_len), turning a valid length-capped generation into a runtime error. Gate EAGLE admission on prompt + max_output + CHAIN_LENGTH <= max_pos: near-limit requests fall back to plain decode (still correct, just not accelerated) instead of erroring. Guarantees the reservation's .min(max_pos) clamp never eats the chain headroom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
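
A minimal sketch of the admission gate described in this commit. `EAGLE3_CHAIN_LENGTH` is set to 3 to match the three transient draft positions mentioned in the review; the `DecodeMode` enum and the function names `admit` and `draft_cache_len` are hypothetical.

```rust
/// Number of transient draft positions written per chain round
/// (3 in this PR's chain configuration).
pub const EAGLE3_CHAIN_LENGTH: usize = 3;

/// How a request is served after admission (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum DecodeMode {
    /// Speculative decoding with the EAGLE-3 chain drafter.
    Eagle3,
    /// Plain target decoding: still correct, just not accelerated.
    Plain,
}

/// Admit a request to EAGLE-3 only if the prompt, the full output budget,
/// and the chain headroom all fit under the drafter's position limit.
pub fn admit(prompt_len: usize, max_output_tokens: usize, max_pos: usize) -> DecodeMode {
    let needed = prompt_len
        .saturating_add(max_output_tokens)
        .saturating_add(EAGLE3_CHAIN_LENGTH);
    if needed <= max_pos {
        DecodeMode::Eagle3
    } else {
        DecodeMode::Plain
    }
}

/// Draft KV length reserved for an admitted request. For requests that pass
/// `admit`, the `.min(max_pos)` clamp never removes the chain headroom.
pub fn draft_cache_len(prompt_len: usize, max_output_tokens: usize, max_pos: usize) -> usize {
    prompt_len
        .saturating_add(max_output_tokens)
        .saturating_add(EAGLE3_CHAIN_LENGTH)
        .min(max_pos)
}
```

For example, with `max_pos = 152`, a request with a 100-token prompt and 50 output tokens needs 153 positions and falls back to plain decode, while the same request with `max_pos = 153` is admitted and reserves all 153 positions.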

scatyf3 changed the title ~~feat(qwen3): EAGLE-3 drafter loader (WIP)~~ feat(qwen3): EAGLE-3 Speculative decoding (WIP) Jun 24, 2026

xiaguan mentioned this pull request Jun 25, 2026

[Roadmap] Speculative decoding (qwen3 first, shared primitives) #443

Open

scatyf3 force-pushed the feat/eagle3-loader branch 2 times, most recently from 1ddef8d to 9fc79df July 1, 2026 23:04

scatyf3 and others added 4 commits July 3, 2026 16:14

docs(eagle3): trim reservation/loader/attention comments

15dd5c3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(eagle3): trim chain-length, aux-layer, and lane comments

c5d3502

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scatyf3 force-pushed the feat/eagle3-loader branch from 1a4d6b1 to c5d3502 July 3, 2026 20:24

scatyf3 changed the title ~~feat(qwen3): EAGLE-3 Speculative decoding (WIP)~~ feat(qwen3): EAGLE-3 Chain Speculative decoding Jul 3, 2026

scatyf3 marked this pull request as ready for review July 3, 2026 20:30

chatgpt-codex-connector Bot reviewed Jul 3, 2026


scatyf3 and others added 2 commits July 3, 2026 17:22

scatyf3 wants to merge 6 commits into openinfer-project:main from scatyf3:feat/eagle3-loader
