
Conversation

@SiqiLi-Fighting
Collaborator

No description provided.

@gemini-code-assist

Summary of Changes

Hello @SiqiLi-Fighting, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a major new feature: Multi-Token Prediction (MTP) within the EAGLE speculative decoding framework for SGLang. The primary objective is to significantly boost inference throughput for large language models by enabling them to predict and verify multiple tokens concurrently, moving beyond the limitations of sequential single-token generation. This involves a comprehensive set of changes, including extending the attention mechanism to support custom, non-causal masks essential for parallel verification, and integrating a dedicated EAGLE speculative decoding worker. The modifications touch core components such as the scheduler, model executor, and attention backends, ensuring a robust and configurable implementation of this advanced decoding strategy.

Highlights

  • Multi-Token Prediction (MTP) RFC: A comprehensive Request for Comments (RFC) document has been added, detailing the proposal for implementing Multi-Token Prediction (MTP) as an enhancement to the existing EAGLE speculative decoding algorithm. This RFC outlines the motivation, goals, design, and implementation plan for MTP, aiming to significantly improve inference throughput.
  • EAGLE Speculative Decoding Core Logic: New files and extensive modifications introduce the core logic for EAGLE speculative decoding. This includes the definition of EagleDraftInput and EagleVerifyInput dataclasses, functions for managing cache locations, building the speculative tree structure, and implementing token verification algorithms like verify_tree_greedy and tree_speculative_sampling_target_only.
  • Custom Attention Mask Support: The FlashAttention kernel and backend have been significantly extended to support custom attention masks and a causal parameter. This is a crucial change that enables non-causal attention patterns, which are necessary for parallel verification of multiple speculative tokens in the EAGLE framework.
  • Modular Speculative Algorithm Framework: A new SpeculativeAlgorithm enum has been introduced, providing a structured way to define and manage different speculative decoding strategies (e.g., EAGLE, EAGLE3, STANDALONE). This allows for flexible selection and integration of various speculative decoding approaches.
  • Configurable Speculative Decoding Parameters: Numerous new command-line arguments have been added to server_args.py, allowing users to configure various aspects of speculative decoding, including the chosen algorithm, draft model path, number of speculative steps, EAGLE top-k value, number of draft tokens, and acceptance thresholds.
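The greedy verification pass described in the highlights can be sketched as a walk down the draft tree, accepting a child node only when it matches the target model's prediction at its parent. The dataclass below is a hypothetical, heavily simplified stand-in for the `EagleVerifyInput` introduced in this PR; the field names and the flat parent-pointer tree layout are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EagleVerifyInput:
    # Illustrative fields only -- the real dataclass in the PR differs.
    draft_tokens: List[int]        # candidate tokens, one per tree node
    parents: List[int]             # parents[i] = parent index of node i, -1 for root
    target_predictions: List[int]  # target model's greedy token at each node

def verify_tree_greedy(inp: EagleVerifyInput) -> List[int]:
    """Walk the draft tree from the root, accepting a child only when its
    draft token matches the target model's greedy prediction at the parent.
    When no child matches, commit the target's own token and stop."""
    accepted = []
    node = 0  # root corresponds to the last committed token
    while True:
        children = [i for i, p in enumerate(inp.parents) if p == node]
        match = next(
            (c for c in children
             if inp.draft_tokens[c] == inp.target_predictions[node]),
            None,
        )
        if match is None:
            # No draft branch survives: the target model's token is still
            # a valid next token, so one token is always committed per step.
            accepted.append(inp.target_predictions[node])
            break
        accepted.append(inp.draft_tokens[match])
        node = match
    return accepted
```

This illustrates why verification never regresses below normal decoding: even a fully rejected tree still yields the target model's one greedy token.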
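The custom-mask requirement in the highlights comes from the tree layout: each draft token must attend to the committed prefix plus its own ancestors, but not to sibling branches, so a plain causal mask cannot express it. A minimal sketch of such a mask builder follows; the function name, the parent-pointer encoding, and the `[prefix, draft]` key layout are assumptions for illustration, not the PR's actual kernel interface:

```python
from typing import List

def build_tree_attention_mask(parents: List[int], prefix_len: int) -> List[List[bool]]:
    """Boolean mask for parallel tree verification.

    parents[i] is the parent index of draft node i (-1 for the root).
    Returns mask[q][k] == True where draft query q may attend to key k,
    with keys laid out as [prefix tokens, then draft tokens]."""
    n = len(parents)
    mask = [[False] * (prefix_len + n) for _ in range(n)]
    for i in range(n):
        # Every draft token sees the full committed prefix...
        for k in range(prefix_len):
            mask[i][k] = True
        # ...plus itself and its chain of ancestors, but no sibling branch.
        node = i
        while node != -1:
            mask[i][prefix_len + node] = True
            node = parents[node]
    return mask
```

Passing a mask like this (in the backend's expected tensor format) is what the new `custom_mask` / `causal=False` path in the FlashAttention backend would consume.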
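The `SpeculativeAlgorithm` enum mentioned above might look roughly like the sketch below. The variant names `EAGLE`, `EAGLE3`, and `STANDALONE` come from the PR description; the `NONE` sentinel and the helper methods are illustrative guesses at the kind of API such an enum typically carries:

```python
from enum import Enum, auto

class SpeculativeAlgorithm(Enum):
    # NONE and the helpers below are hypothetical; only EAGLE, EAGLE3,
    # and STANDALONE are named in the PR description.
    NONE = auto()
    EAGLE = auto()
    EAGLE3 = auto()
    STANDALONE = auto()

    @staticmethod
    def from_string(name):
        """Map a CLI string (case-insensitive) to an enum member."""
        if not name:
            return SpeculativeAlgorithm.NONE
        return SpeculativeAlgorithm[name.upper()]

    def is_eagle(self):
        return self in (SpeculativeAlgorithm.EAGLE, SpeculativeAlgorithm.EAGLE3)
```

Centralizing the strategy in an enum lets the scheduler and model executor branch on `algorithm.is_eagle()` rather than scattering string comparisons.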
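A server launch using the new arguments might look like the fragment below. The flag spellings follow the common SGLang naming convention, but since the PR does not list exact names, treat every flag and value here as illustrative:

```shell
# Hypothetical invocation -- flag names and values are illustrative only.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path <draft-model-path> \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8
```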
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/release-pypi.yml

@SiqiLi-Fighting force-pushed the feat/eagle-support-rebase branch from 91a2644 to dde5c16 on October 23, 2025 at 04:24
@SiqiLi-Fighting force-pushed the feat/eagle-support-rebase branch from dde5c16 to e4474e2 on October 23, 2025 at 04:25
@SiqiLi-Fighting force-pushed the feat/eagle-support-rebase branch from adb2610 to fac0b3e on October 23, 2025 at 18:32
Iamleos and others added 4 commits on October 25, 2025 at 16:01
* add llama eagle3 model file

* fix padding bug

* fix some padding problem

* rm some debug log
* qwen eagle3

* rm log