
Conversation

@SwekeR-463

Fixes #1194

  • Aligned the act_max of the first linear projections (gate and up) across all experts in MoE blocks so that FP8 dispatch can use a single shared input_scale (see the sketch below).
  • Added a unit test with DeepSeek-V2-Lite-Chat.
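
For readers new to the issue, here is a minimal sketch of the alignment step, assuming a DeepSeek-style MoE block where each expert exposes gate_proj/up_proj modules carrying a recorded act_max tensor. The function name and attribute layout are illustrative, not the PR's actual code:

```python
import torch

def unify_first_proj_act_max(experts, proj_names=("gate_proj", "up_proj")):
    """Give every expert's gate/up projection the same act_max so FP8 dispatch
    can derive a single shared input_scale (illustrative sketch only)."""
    # Take the largest recorded activation max across all experts' first projections.
    shared = max(float(getattr(expert, name).act_max.max())
                 for expert in experts for name in proj_names)
    # Write the shared value back so every expert yields an identical input_scale.
    for expert in experts:
        for name in proj_names:
            proj = getattr(expert, name)
            proj.act_max = torch.full_like(proj.act_max, shared)
    return shared
```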


@yiliu30 left a comment


Thanks for the PR!
Overall LGTM. Left a few comments.


yiliu30 commented Jan 5, 2026

@chensuyue @XuehaoSun Looks like the XPU test failed, and it doesn’t seem related to this PR. Could you take a look?


@WeiweiZhang1 left a comment


LGTM


@yiliu30 left a comment


LGTM, thanks!


xin3he commented Jan 5, 2026

@SwekeR-463 @yiliu30 I think there is a missing change here: set_amax_for_all_moe_layers is not applied for fp8 during tuning.

if is_nv_fp(self.act_data_type) or is_static_wfp8afp8(self):

if is_nv_fp(self.act_data_type):
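
For context, a hedged sketch of the gap being pointed at, assuming the first line above is the broader check used elsewhere and the second is the tuning-path check that currently skips FP8; only the predicate and helper names come from the quoted lines, while the wrapper and call arguments are made up for illustration:

```python
def maybe_unify_moe_amax(quantizer, model):
    # Current tuning path (second quoted line): only nv_fp activations trigger
    # set_amax_for_all_moe_layers, so static W8A8-FP8 models are skipped.
    # Suggested condition (first quoted line): include static wfp8afp8 as well.
    if is_nv_fp(quantizer.act_data_type) or is_static_wfp8afp8(quantizer):
        set_amax_for_all_moe_layers(model)
```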

@xin3he
Copy link
Contributor

xin3he commented Jan 5, 2026

BTW, if my understanding is correct that vLLM requires AR_ENABLE_UNIFY_MOE_INPUT_SCALE, I think the default value should be True since vLLM is our main target.

@yiliu30
Copy link
Contributor

yiliu30 commented Jan 5, 2026

> @SwekeR-463 @yiliu30 I think there is a missing change here: set_amax_for_all_moe_layers is not applied for fp8 during tuning.
>
> if is_nv_fp(self.act_data_type) or is_static_wfp8afp8(self):
>
> if is_nv_fp(self.act_data_type):

Hi @xin3he, this seems to be a general gap in FP8_STATIC rather than something related to this enhancement. Since this PR's focus is the RTN case, it's fine to ignore it here. Please feel free to create another PR to fix that part.

@yiliu30
Copy link
Contributor

yiliu30 commented Jan 5, 2026

> BTW, if my understanding is correct that vLLM requires AR_ENABLE_UNIFY_MOE_INPUT_SCALE, I think the default value should be True since vLLM is our main target.

FP8 dispatch is primarily for extreme inference speed, and it's disabled by default in vllm-gaudi. Sharing input scales across all experts may also degrade accuracy, so I prefer to keep this option disabled by default.
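
For anyone who does want FP8 dispatch, a hedged example of opting in via the flag discussed above; exactly where auto-round reads AR_ENABLE_UNIFY_MOE_INPUT_SCALE is not shown in this thread, so treat this as assumed usage rather than documented behavior:

```python
import os

# Opt in explicitly before quantization; the flag stays off by default because
# sharing one input scale across all experts can cost accuracy (see the reply above).
os.environ["AR_ENABLE_UNIFY_MOE_INPUT_SCALE"] = "1"
```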


@xin3he left a comment


Thanks for your reply, LGTM since it's for RTN only.

@yiliu30 added the ready label (only add when the PR is ready to merge) on Jan 5, 2026

yiliu30 commented Jan 5, 2026

Hi @XuehaoSun, looks like the CI was blocked by the CodeQL check stuck at "Expected". Could you please take a look, thx!

@wenhuach21 merged commit 41a8377 into intel:main on Jan 6, 2026
25 checks passed


Linked issue (may be closed by this PR): Align the first input scale of MoE experts for FP8 dispatch