
[Megatron] Support routing replay on NPU with performance and compatibility enhancements#5298

Open
755651978 wants to merge 4 commits into verl-project:main from 755651978:main-0212

Conversation

@755651978

What does this PR do?

This PR enables MoE Routing Replay on NPU (MindSpeed) platforms and provides critical compatibility patches for Megatron 0.12.1. It addresses the absence of key routing methods in older Megatron versions and ensures end-to-end data alignment from the rollout phase to the training phase.

Key Enhancements

  1. NPU-Compatible Routing Replay & Version Patching (router_replay_patch.py)

Compatibility Injection: Since Megatron 0.12.1 lacks the is_aux_loss_enabled() method, this PR implements it at the module level. Using types.MethodType, we dynamically bind this method to the router instance, ensuring consistency with newer Megatron APIs.
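A minimal sketch of this kind of instance-level binding (the method body and the `patch_router` helper are illustrative assumptions, not the exact code in `router_replay_patch.py`):

```python
import types


def _is_aux_loss_enabled(self) -> bool:
    # Assumed semantics: aux loss counts as enabled when the config carries a
    # non-zero moe_aux_loss_coeff; adjust to mirror the newer Megatron API.
    return bool(getattr(self.config, "moe_aux_loss_coeff", 0.0))


def patch_router(router) -> None:
    # Bind the function to this specific router instance only when the
    # installed Megatron (e.g. 0.12.1) does not already provide the method,
    # so newer releases keep their native implementation.
    if not hasattr(router, "is_aux_loss_enabled"):
        router.is_aux_loss_enabled = types.MethodType(_is_aux_loss_enabled, router)
```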

MindSpeed Resilience: Implements class-level attribute injection for enable_routing_replay and moe_router_fusion. This prevents MindSpeed’s dynamic dataclass reconstruction from stripping verl-specific configurations during NPU initialization.
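A sketch of the class-level injection (attribute names are taken from the description above; the defaults are assumptions):

```python
from megatron.core.transformer.transformer_config import TransformerConfig


def inject_verl_config_defaults() -> None:
    # Defaults live on the class, so config instances built without these
    # fields still resolve them through normal class attribute lookup
    # instead of raising AttributeError after MindSpeed rebuilds the config.
    if not hasattr(TransformerConfig, "enable_routing_replay"):
        TransformerConfig.enable_routing_replay = False
    if not hasattr(TransformerConfig, "moe_router_fusion"):
        TransformerConfig.moe_router_fusion = False
```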

Dynamic Signature Detection: Uses inspect.signature to adapt to different TransformerConfig versions and vp_stage (Virtual Pipeline) logic, ensuring correct layer offset mapping in complex pipeline-parallel NPU setups.
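A hedged sketch of the signature probing (import path and helper name follow recent Megatron-LM layouts and may differ across versions):

```python
import inspect

from megatron.core.transformer.transformer_layer import get_transformer_layer_offset


def layer_offset(tf_config, vp_rank=None) -> int:
    # Pass vp_stage only when the installed Megatron exposes it; the 0.12.0
    # build shipped with the NPU stack does not accept this argument.
    params = inspect.signature(get_transformer_layer_offset).parameters
    if "vp_stage" in params:
        return get_transformer_layer_offset(tf_config, vp_stage=vp_rank if vp_rank is not None else 0)
    return get_transformer_layer_offset(tf_config)
```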

  2. Robust Data Alignment for Agent Loops (tool_agent_loop.py & vllm_async_rollout.py)

Deterministic Rollout: Standardizes max_tokens to a fixed response_length in the vLLM rollout worker. This prevents shape mismatches in routed_experts caused by fluctuating prompt lengths.
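A hypothetical illustration of the idea (the helper name and argument plumbing are assumptions; the real change lives in vllm_async_rollout.py):

```python
def build_sampling_kwargs(base_kwargs: dict, response_length: int) -> dict:
    # Fix max_tokens to the configured response_length instead of deriving it
    # from the remaining context window, so the per-token routed_experts
    # captured during rollout always share the same trailing shape.
    kwargs = dict(base_kwargs)
    kwargs["max_tokens"] = response_length
    return kwargs
```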

Data Preservation: Uses safe getattr calls to ensure that routing metadata captured during the agent's interaction loop is successfully passed to AgentLoopOutput for training.
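A small sketch of the safe-access pattern (names are illustrative):

```python
def collect_routed_experts(agent_data):
    # routed_experts is only attached when routing replay is active on the
    # rollout side; fall back to None so non-MoE runs pass through unchanged
    # and AgentLoopOutput can still be constructed.
    return getattr(agent_data, "routed_experts", None)
```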

Testing & Validation

Environment: Tested on Ascend NPU with MindSpeed.

Routing Consistency: Verified that the routed_experts generated during the rollout phase perfectly match the indices replayed during the training phase.

Performance: Benchmarked the forward pass; relocating helper functions resulted in a measurable reduction in Python-level overhead per iteration.

@CLAassistant

CLAassistant commented Feb 12, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for MoE Routing Replay on NPU platforms and adds several compatibility patches for older Megatron versions. The changes include dynamic method injection, adaptive configuration patching using introspection, and data alignment fixes for deterministic rollouts. While the changes are generally well-implemented and improve compatibility, I've identified a critical copy-paste bug in the TransformerConfig patching logic that could lead to incorrect behavior or runtime errors. Please address this issue.



def _get_aux_loss_coeff(_self, aux_loss_type: str) -> float:
    """Return the coefficient for the given auxiliary loss type."""
Collaborator


Comments should consistently be written in English; please remove unrelated comments.

if vp_rank is None:
    vp_rank = 0
num_layers_to_build = get_num_layers_to_build(tf_config, vp_stage=vp_rank)
offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank)
Collaborator


Why does this block need a separate check for whether the vp_stage argument is available?

Contributor


The Megatron version in the current NPU stack only goes up to 0.12.0, and the get_num_layers_to_build function imported from Megatron has no vp_stage parameter, so for now there is no other way to use that argument.


try:
    sig = inspect.signature(TransformerConfig.__init__)
    native_params = sig.parameters
Collaborator


Save a copy of the original TransformerConfig parameters first.

# Simple solution: remove the unknown parameter before calling original constructor
enable_routing_replay = kwargs.pop("enable_routing_replay", TransformerConfig.enable_routing_replay)
if "enable_routing_replay" not in native_params:
enable_routing_replay = kwargs.pop("enable_routing_replay", TransformerConfig.enable_routing_replay)
Collaborator


If the original TransformerConfig does not have this parameter, use the newly injected parameter value instead.
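A minimal sketch of that pattern, assuming a wrapper around the original constructor (this is not the PR's exact code, and it folds the duplicated pop above into a single guarded call):

```python
import inspect

from megatron.core.transformer.transformer_config import TransformerConfig

_original_init = TransformerConfig.__init__
_native_params = inspect.signature(_original_init).parameters


def _patched_init(self, *args, **kwargs):
    if "enable_routing_replay" not in _native_params:
        # The installed TransformerConfig does not know this field: strip it
        # before delegating, then attach the value to the instance afterwards.
        replay = kwargs.pop(
            "enable_routing_replay", getattr(TransformerConfig, "enable_routing_replay", False)
        )
        _original_init(self, *args, **kwargs)
        self.enable_routing_replay = replay
    else:
        _original_init(self, *args, **kwargs)


TransformerConfig.__init__ = _patched_init
```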

response_ids=response_ids[: self.response_length],
response_mask=agent_data.response_mask[: self.response_length],
multi_modal_data=multi_modal_data,
routed_experts=routed_experts,
Collaborator


Why wasn't this passed in before either? Is this a bugfix, or something NPU-specific?

# Add class attribute with default value
TransformerConfig.enable_routing_replay = False
try:
    global_args = get_args()
Collaborator


verl/utils/megatron/router_replay_patch.py:340:23: F821 Undefined name get_args
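One possible way to clear the F821 (a hedged sketch; the actual fix in the PR may differ) is an explicit, version-guarded import at module level:

```python
try:
    # Recent Megatron-LM exposes the global-args accessor here.
    from megatron.training import get_args
except ImportError:
    # Older releases kept it at the package root.
    from megatron import get_args
```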
