feat(qwen3): EAGLE-3 Chain Speculative decoding#454

Open
scatyf3 wants to merge 6 commits into openinfer-project:main from scatyf3:feat/eagle3-loader

@scatyf3 scatyf3 commented Jun 24, 2026

This is a PR for issue #443.

Roadmap

Step 1: Chain draft

  1. Draft model loading
    1. Model Loader ✅
    2. Memory Allocator ✅
  2. Draft model forward pass ✅
    1. Chain (for testing) ✅
  3. Target model forward pass: target feature capture for draft injection ✅
  4. Executor: wire EAGLE speculative decoding into the current steps ✅
  5. Reserve memory for both target and draft ✅
  6. Frontend: add EAGLE3 args, wire to existing generation pipeline ✅

Changes for step 1

Kernel:

  1. RoPE without QK norm: eagle3_rope_kernel
  2. Prefill over contiguous KV: single_prefill_nhd_noncausal_cuda
  3. Decode over contiguous KV

Model

  1. Draft model loader: openinfer-qwen3-4b/src/eagle3/loading.rs
  2. Memory allocator: openinfer-qwen3-4b/src/eagle3/reservation.rs
  3. Draft forward pass: openinfer-qwen3-4b/src/eagle3/forward.rs

Correctness check: EAGLE-3 Python golden gate eagle3_drafter_golden_gate, from tools/accuracy/dump_qwen3_4b_eagle3_golden.py

SD Loop

  1. Reuse DFlash's target feature capture: capture_layer_ids in process_all_layers_batch_multi
  2. Draft executor: openinfer-qwen3-4b/src/executor/eagle3_lane.rs, which maintains the draft weights plus each request's buffer, state, and temporary captured_hidden
  3. Target executor:
    1. Add execute_eagle3_verify in openinfer-qwen3-4b/src/executor.rs
    2. Add EAGLE-3 to SpeculativeVerify and SpeculativeDraft in execute_step_on_lane
  4. For scheduling and sampling, reuse the DFlash utilities
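
The draft/verify round described in the list above can be illustrated with a minimal sketch of greedy chain verification. This is not the PR's code: `VerifyOutcome` and `verify_chain_greedy` are hypothetical names, and the sketch assumes the target returns one greedy token per draft position plus one extra, so the committed tokens always match what plain greedy decoding would have produced.

```rust
/// Result of verifying one chain of draft tokens (hypothetical type).
#[derive(Debug, PartialEq)]
struct VerifyOutcome {
    /// Number of draft tokens the target agreed with.
    accepted: usize,
    /// Tokens committed this round: accepted drafts plus one target token.
    committed: Vec<u32>,
}

/// Greedy chain verification (illustrative, not the PR's implementation).
///
/// `draft` holds the k tokens proposed by the drafter.
/// `target_argmax` holds k + 1 tokens: entry i is the target's greedy
/// choice after the committed prefix followed by draft[..i].
fn verify_chain_greedy(draft: &[u32], target_argmax: &[u32]) -> VerifyOutcome {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    // Accept the longest prefix on which drafter and target agree.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // The target's token at the first disagreement (or the bonus token
    // when everything matched) is always committed.
    let mut committed = draft[..accepted].to_vec();
    committed.push(target_argmax[accepted]);
    VerifyOutcome { accepted, committed }
}

fn main() {
    // k = 3: the third draft token is rejected and replaced by the target's.
    let out = verify_chain_greedy(&[11, 22, 33], &[11, 22, 99, 44]);
    println!("{out:?}");
}
```

Each round commits between 1 and k + 1 tokens, which is where the speedup over plain decode comes from when the acceptance rate is high.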

Correctness check: openinfer-qwen3-4b/tests/eagle3_speculative_gate.rs

Perf: openinfer-qwen3-4b/tests/eagle3_speculative_perf.rs and openinfer-qwen3-4b/tests/eagle3_gsm8k_perf.rs

Result

Chain SD, k=3, max new tokens=256, RTX 5070 Ti, greedy, 10 prompts from GSM8K

| engine | spec OFF | spec ON | speedup |
| --- | --- | --- | --- |
| openinfer | 83.9 | 108.7 | 1.30× |
| vLLM 0.21 | 81.9 | 101.1 | 1.23× |
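
As a quick check, the speedup column is the ratio of the spec ON figure to the spec OFF figure. The sketch below recomputes it from the reported numbers; the helper name `speedup` is illustrative only, and the units are whatever the measurements above were taken in.

```rust
/// Ratio of the spec ON measurement to the spec OFF measurement.
fn speedup(spec_off: f64, spec_on: f64) -> f64 {
    spec_on / spec_off
}

fn main() {
    // (engine, spec OFF, spec ON) as reported in the result table.
    let rows = [("openinfer", 83.9, 108.7), ("vLLM 0.21", 81.9, 101.1)];
    for (engine, off, on) in rows {
        println!("{engine}: {:.2}x", speedup(off, on));
    }
}
```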

Next Step: faithful Tree speculative decoding

The next step is to implement a faithful version, which requires:

  1. Draft Tree decoding
  2. Target Tree verify
  3. Runtime: CUDA Graph support for dynamic tree
  4. KV Management for Tree

@scatyf3 scatyf3 changed the title from "feat(qwen3): EAGLE-3 drafter loader (WIP)" to "feat(qwen3): EAGLE-3 Speculative decoding (WIP)" Jun 24, 2026
@scatyf3 scatyf3 force-pushed the feat/eagle3-loader branch 2 times, most recently from 1ddef8d to 9fc79df, July 1, 2026 23:04
scatyf3 and others added 4 commits July 3, 2026 16:14
Squash of the 17-commit eagle3 series (loader → KV budget → lane load →
3-layer capture → chain draft/verify → lossless gate → shift fix → γ tuning
→ GSM8K harness). Granular history preserved in branch eagle3-backup-prerebase.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Draft chain step now uses a dedicated FlashInfer single-query decode path
(single_decode_nhd) instead of borrowing DFlash's noncausal prefill kernel,
so the draft's per-step attention is structurally single-query and no longer
name-collides with the prefill path.

Add an EAGLE-3 drafter golden gate validating prefill_batched against the
official SafeAILab/EAGLE cnets.Model reference (regret=0, max rel logit
delta 1.25% bf16). Fixture + generator under tools/accuracy. Drops the old
kernel-vs-kernel check, which used DFlash's kernel as oracle and so could
pass while both kernels shared an EAGLE-wrong bug.

Perf A/B harness: tracing-subscriber dev-dep to surface the acceptance-rate
debug trace, plus an OPENINFER_TEST_MEM_UTIL override to fit target+draft
on a 16GB card.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scatyf3 scatyf3 force-pushed the feat/eagle3-loader branch from 1a4d6b1 to c5d3502, July 3, 2026 20:24
@scatyf3 scatyf3 changed the title from "feat(qwen3): EAGLE-3 Speculative decoding (WIP)" to "feat(qwen3): EAGLE-3 Chain Speculative decoding" Jul 3, 2026
@scatyf3 scatyf3 marked this pull request as ready for review July 3, 2026 20:30

scatyf3 commented Jul 3, 2026

@xiaguan @CAICAIIs this adds the chain version of EAGLE-3 and is ready for review — let me know if you'd like any changes before merging.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5d350206e

Comment on lines 394 to +395
dflash_draft_model_path,
eagle3_draft_model_path,

P2: Reject EAGLE when batch-invariant is enabled

When --batch-invariant is combined with --eagle3-draft-model-path, this path still reaches scheduler::start_qwen3 because the guard above only passes dflash_draft_model_path.is_some() into ensure_batch_invariant_supported. Pin mode's GEMM path bails on unwarmed drafter shapes, and EAGLE adds fc/qkv/lm_head GEMMs with different (M,K) shapes, so the first captured EAGLE request will error at runtime instead of being rejected at startup. Please include EAGLE in the same batch-invariant rejection as DFlash.
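
A minimal sketch of the requested startup check follows. It is a hypothetical stand-in rather than the repository's `ensure_batch_invariant_supported`: the `Args` struct, the `check_batch_invariant` function, and the error text are assumptions for illustration, and only the two draft-path field names come from the diff above.

```rust
/// Hypothetical subset of the startup arguments.
struct Args {
    batch_invariant: bool,
    dflash_draft_model_path: Option<String>,
    eagle3_draft_model_path: Option<String>,
}

/// Rejects batch-invariant mode when any speculative drafter is configured,
/// so the failure happens at startup instead of on the first request.
fn check_batch_invariant(args: &Args) -> Result<(), String> {
    // OR the two draft-path flags: either drafter adds GEMM shapes
    // that the pinned path has not warmed.
    let has_drafter = args.dflash_draft_model_path.is_some()
        || args.eagle3_draft_model_path.is_some();
    if args.batch_invariant && has_drafter {
        return Err(
            "--batch-invariant is not supported with a speculative draft model".to_string(),
        );
    }
    Ok(())
}

fn main() {
    let args = Args {
        batch_invariant: true,
        dflash_draft_model_path: None,
        eagle3_draft_model_path: Some("/path/to/eagle3".to_string()),
    };
    println!("{:?}", check_batch_invariant(&args));
}
```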

Comment on lines +132 to +135
let max_cache_len = (req.prompt_tokens.len()
+ req.max_output_tokens
+ crate::eagle3::EAGLE3_CHAIN_LENGTH)
.min(max_pos)

P2: Reserve EAGLE's transient chain slots before admitting

For an EAGLE request where prompt_tokens.len() + max_output_tokens == max_pos, this clamp allocates no room for the 3 transient draft positions even though the request is still admitted under the target context limit. Near the end of that request, the scheduler still asks chain_round for EAGLE3_CHAIN_LENGTH drafts, and draft_step rejects once the transient position reaches state.max_cache_len, turning a valid length-capped generation into an error. Please either cap EAGLE admission by the chain length or clamp the drafted span before rolling the chain.
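
The first option, capping EAGLE admission by the chain length, might look like the sketch below. The function name `admits_eagle3` is hypothetical, and the constant is set to 3 only because the comment mentions 3 transient draft positions; a request that fails the check would have to be served some other way, for example by plain decode.

```rust
/// Assumed value, based on the "3 transient draft positions" mentioned above.
const EAGLE3_CHAIN_LENGTH: usize = 3;

/// Returns true when the request leaves room for the transient chain slots
/// under the drafter's position limit (hypothetical helper).
fn admits_eagle3(prompt_len: usize, max_output_tokens: usize, max_pos: usize) -> bool {
    prompt_len
        .checked_add(max_output_tokens)
        .and_then(|n| n.checked_add(EAGLE3_CHAIN_LENGTH))
        // Overflow is treated as "does not fit".
        .map_or(false, |needed| needed <= max_pos)
}

fn main() {
    // Exactly at the limit once the chain headroom is included: admitted.
    println!("{}", admits_eagle3(100, 28, 131));
    // prompt + output alone fill max_pos, leaving no chain slots: not admitted.
    println!("{}", admits_eagle3(100, 28, 128));
}
```

With this gate in place, the `.min(max_pos)` clamp in the reservation can never cut into the chain headroom for an admitted request.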

scatyf3 and others added 2 commits July 3, 2026 17:22
The batch-invariant guard only forwarded dflash_draft_model_path.is_some() into
ensure_batch_invariant_supported, so --batch-invariant + --eagle3-draft-model-path
slipped through to the scheduler. Pin mode warms/self-checks only the base
projections; EAGLE-3's fc/qkv/lm_head GEMMs are new (M,K) shapes the pin never
covers, so the first captured EAGLE request would bail at the GEMM boundary at
runtime instead of being rejected at startup. Fold EAGLE-3 into the same rejection
as DFlash (OR the two draft-path flags; generalize the message) + regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…osition limit

The draft-KV reservation clamps max_cache_len to the drafter's max_position_embeddings
(max_pos), so a request with prompt + max_output within EAGLE3_CHAIN_LENGTH of max_pos
got no room for the transient draft positions the chain writes each round. Near the end
of such a request the scheduler still asks chain_round for CHAIN_LENGTH drafts and
draft_step bails (position >= max_cache_len), turning a valid length-capped generation
into a runtime error.

Gate EAGLE admission on prompt + max_output + CHAIN_LENGTH <= max_pos: near-limit
requests fall back to plain decode (still correct, just not accelerated) instead of
erroring. Guarantees the reservation's .min(max_pos) clamp never eats the chain headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>