Conversation

@xutizhou
Contributor

Power-Law Distribution Support for Prefill MoE Benchmarking

This change adds power-law token distribution simulation for MoE prefill phase benchmarking.

Overview:

  • Simulates realistic token-to-expert assignment patterns observed in production workloads
  • Configurable alpha parameter controls distribution skewness:
    • alpha < 1.0: More uniform distribution (e.g., 0.6, 0.8)
    • alpha ~ 1.0: Zipf-like distribution (e.g., 1.02)
    • alpha > 1.0: Heavy-tailed distribution with few dominant experts (e.g., 1.2)
  • Multiple samples (5x) are generated per configuration to reduce variance from
    single-sample outliers
  • Ensures max tokens per expert stays within bounds via power_law_logits_v3/v4
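
The effect of the alpha knob can be illustrated with a minimal sketch. `sample_power_law` is named in the diff but its body isn't shown here, so the following is an assumed implementation: rank-based weights proportional to rank^(-alpha), rescaled into [min_val, max_val].

```python
import numpy as np

def sample_power_law(num_experts, alpha, min_val, max_val, seed=None):
    # Assumed body (the real sample_power_law is not shown in this diff):
    # rank-based weights ~ rank**(-alpha), rescaled into [min_val, max_val].
    rng = np.random.default_rng(seed)
    weights = np.arange(1, num_experts + 1, dtype=np.float64) ** (-alpha)
    rng.shuffle(weights)  # which expert is "hot" is random per sample
    span = weights.max() - weights.min()
    counts = min_val + (weights - weights.min()) / span * (max_val - min_val)
    return np.round(counts).astype(np.int64)

# Larger alpha concentrates tokens on fewer experts; with the same scaling,
# mid-rank experts receive less, so the total shrinks as alpha grows.
skewed = sample_power_law(num_experts=8, alpha=1.2, min_val=1, max_val=100, seed=0)
flat = sample_power_law(num_experts=8, alpha=0.6, min_val=1, max_val=100, seed=0)
```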

Implementation:

  • power_law_logits_v3: Generates power-law distributed topk_idx, topk_weights, and
    num_recv_tokens_per_expert for prefill phase
  • power_law_logits_v4: Similar to v3 but ensures max tokens per expert <= num_tokens,
    used for decode phase
  • Results are logged with distribution type (uniform/power_law_{alpha}) for analysis
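
As a rough illustration of the v3 output shape (`build_topk_assignment` is a hypothetical helper, and NumPy stands in for the PyTorch tensors the benchmark actually uses): expand per-expert counts into a flat assignment, reshape to (num_tokens, topk), and mask unfilled slots to -1 as the PR does.

```python
import numpy as np

def build_topk_assignment(num_tokens, topk, tokens_per_expert):
    # Hypothetical sketch: one expert id per assigned token slot.
    counts = np.asarray(tokens_per_expert, dtype=np.int64)
    expert_ids = np.repeat(np.arange(counts.size, dtype=np.int64), counts)
    slots = num_tokens * topk
    if expert_ids.size < slots:  # mask unfilled slots to -1, not 0
        pad = np.full(slots - expert_ids.size, -1, dtype=np.int64)
        expert_ids = np.concatenate([expert_ids, pad])
    topk_idx = expert_ids[:slots].reshape(num_tokens, topk)
    topk_weights = (topk_idx >= 0).astype(np.float64) / topk
    return topk_idx, topk_weights

idx, w = build_topk_assignment(num_tokens=4, topk=2, tokens_per_expert=[3, 2, 2, 1])
```

A real implementation would also permute slots so each token's topk experts are distinct; this sketch only shows the shapes and the -1 masking.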

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

- Introduced `power_law_logits_v3` function to calculate token distribution among experts based on a power law.
- Added debug option for logging expert token assignments.
- Enhanced the handling of token distribution adjustments to ensure proper expert allocation.
- Expanded `get_moe_prefill_test_cases` to include distribution types and power law alpha values.
- Updated `benchmark_moe_layer_prefill` to handle new test case format and added logging for distribution type.
- Refined token distribution logic in `power_law_logits_v3` to return additional metrics for expert allocation.
- Improved handling of weights and indices for uniform and power-law distributions.
- Removed unnecessary debug print statements in `power_law_logits_v3` and `benchmark_moe_layer_prefill`.
- Updated `expert_assignments` tensor type to `int64` for consistency.
- Changed masking logic in `topk_idx` to set out-of-range indices to -1 instead of 0 for better clarity.
- Ensured `topk_idx_iter` is directly moved to the device without type conversion.
- Updated the `num_experts` descriptions to accurately reflect the experts-per-GPU calculations for different expert parallel sizes.
- Ensured clarity in the documentation for users configuring expert settings.
- Implemented multiple sampling in `benchmark_moe_layer_prefill` for power law distribution to mitigate outlier effects.
- Adjusted handling of weights and indices for both power law and uniform distributions.
- Ensured consistent processing of samples during warmup and iteration phases to improve performance and reliability.
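
The multi-sampling described above can be sketched as follows (`benchmark_with_samples` and `run_once` are hypothetical names, not from the PR): average several independent draws so a single outlier power-law sample does not dominate the reported result.

```python
import numpy as np

def benchmark_with_samples(run_once, num_samples=5, seed=0):
    # Hypothetical harness: run_once(rng) draws one power-law assignment,
    # runs the layer, and returns a latency; we aggregate across samples.
    rng = np.random.default_rng(seed)
    latencies = [run_once(rng) for _ in range(num_samples)]
    return float(np.mean(latencies)), float(np.std(latencies))

mean_ms, std_ms = benchmark_with_samples(lambda rng: 2.0)
```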
@xutizhou xutizhou requested a review from AichenF as a code owner November 26, 2025 04:14
@copy-pr-bot

copy-pr-bot bot commented Nov 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@github-actions github-actions bot added the feat label Nov 26, 2025
@xutizhou xutizhou changed the title feat: power law feat: add sglang prefill moe power law Nov 26, 2025
Contributor

@AichenF AichenF left a comment

self.context_ops.extend(
could also be modified, as the context_moe uses "uniform" as the default workload_distribution for now.


def power_law_logits_v3(num_tokens, num_experts, topk, ep, alpha):
    if num_tokens * topk > num_experts:
        num_tokens_per_expert = sample_power_law(num_experts, alpha, 1, num_tokens * 0.8)
Contributor


Why is 0.8 here? Should this parameter be fixed?

Contributor Author

It's a hyperparameter, referring to:

num_tokens_per_expert = sample_power_law(num_experts, alpha, 1, num_tokens * 0.8)

Contributor

0.8 means the number of tokens for the heaviest expert is 0.8 * num_tokens, which comes from observed statistics. You can change the coefficient if you need to.
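
A minimal sketch of the cap being discussed (`capped_power_law_counts` is a hypothetical name, and the Pareto draw is only a stand-in for the PR's actual sampler): draw heavy-tailed counts, then clamp the heaviest expert to the empirical 0.8 * num_tokens coefficient.

```python
import numpy as np

def capped_power_law_counts(num_tokens, num_experts, alpha, heavy_frac=0.8, seed=0):
    # Hypothetical sketch: heavy-tailed draws normalized to the token budget,
    # with the heaviest expert clamped to heavy_frac * num_tokens.
    rng = np.random.default_rng(seed)
    raw = rng.pareto(alpha, num_experts) + 1.0   # heavy-tailed draws
    counts = raw / raw.sum() * num_tokens        # normalize to the token budget
    counts = np.minimum(counts, heavy_frac * num_tokens)
    return np.round(counts).astype(np.int64)

counts = capped_power_law_counts(num_tokens=1000, num_experts=16, alpha=1.02)
```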

Contributor

@AichenF AichenF left a comment

self._power_law_alpha
Do we need to use different _power_law_alpha values for prefill and decode in the modeling?

@xutizhou
Contributor Author

self.context_ops.extend(

could also be modified, as the context_moe uses "uniform" as the default workload_distribution for now.

We can modify it in another PR.

@xutizhou
Contributor Author

self._power_law_alpha Do we need to use different _power_law_alpha values for prefill and decode in the modeling?

Yes, I think it's more flexible.

@tianhaox tianhaox self-requested a review November 27, 2025 17:05
- Changed the hardcoded "uniform" string to the variable `workload_distribution` for improved flexibility in model configuration.
@xutizhou
Contributor Author

xutizhou commented Dec 1, 2025

self.context_ops.extend(

could also be modified, as the context_moe uses "uniform" as the default workload_distribution for now.

done
