
Conversation

@CedricHwong commented Dec 26, 2025

What does this PR do?

Type of change: Bug fix

Overview:

  • Synchronizes MSE calibration amax across distributed groups (DP/EP/TP) after calibration finishes (a rough sketch of this kind of sync is shown after this list).
  • Adds a multi-GPU test that verifies amax values match when distributed_sync=True and differ when distributed_sync=False.
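
For context, a minimal sketch of what post-calibration amax synchronization could look like, assuming ModelOpt's TensorQuantizer exposes an amax buffer and that group is the relevant DP/EP/TP process group (this is not necessarily the PR's exact implementation):

  import torch.distributed as dist
  from modelopt.torch.quantization.nn import TensorQuantizer  # assumed import path

  def sync_amax_across_group(model, group=None):
      # Take the element-wise max of every quantizer's calibrated amax across the group.
      for module in model.modules():
          if isinstance(module, TensorQuantizer) and module.amax is not None:
              dist.all_reduce(module.amax.data, op=dist.ReduceOp.MAX, group=group)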

Usage

  import copy
  import modelopt.torch.quantization as mtq

  # Build a quantization config that uses MSE calibration
  cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
  cfg["algorithm"] = {
      "method": "mse",
      "distributed_sync": True,
  }
  # Run quantization + calibration (forward_loop feeds calibration data)
  model = mtq.quantize(model, cfg, forward_loop)
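
The snippet above assumes a forward_loop callable that feeds calibration data through the model; a minimal sketch, with calib_loader as a hypothetical DataLoader of calibration batches, might look like:

  import torch

  def forward_loop(model):
      # Run a handful of calibration batches through the model.
      model.eval()
      with torch.no_grad():
          for batch in calib_loader:  # placeholder calibration DataLoader
              model(batch)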

Testing

  PYTHONPATH=/root/epfs/workspace/code/personal_repos/Model-Optimizer pytest -q tests/gpu/torch/quantization/test_mse_calibrate_sync.py
  • Result: 3 passed, 1 skipped (the single-GPU case is skipped)

Additional Information

  • New test validates distributed amax synchronization for MSE calibration under NCCL.
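
For readers without access to the test file, a hedged sketch of the kind of NCCL check such a test could perform is below; the helper name _check_amax_sync, the toy Linear model, and the assumption that the converted module exposes an input_quantizer with an amax buffer are illustrative, not taken from the PR:

  import copy
  import os

  import torch
  import torch.distributed as dist
  import torch.multiprocessing as mp

  import modelopt.torch.quantization as mtq

  def _check_amax_sync(rank, world_size):
      os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
      os.environ.setdefault("MASTER_PORT", "29500")
      dist.init_process_group("nccl", rank=rank, world_size=world_size)
      torch.cuda.set_device(rank)

      model = torch.nn.Linear(16, 16).cuda()
      cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
      cfg["algorithm"] = {"method": "mse", "distributed_sync": True}

      def forward_loop(m):
          # Different data per rank, so unsynchronized amax values would diverge.
          torch.manual_seed(rank)
          with torch.no_grad():
              for _ in range(8):
                  m(torch.randn(4, 16, device="cuda"))

      mtq.quantize(model, cfg, forward_loop)

      # Gather this rank's input-quantizer amax and compare against every other rank.
      amax = model.input_quantizer.amax.detach().float().reshape(1)
      gathered = [torch.empty_like(amax) for _ in range(world_size)]
      dist.all_gather(gathered, amax)
      assert all(torch.allclose(gathered[0], g) for g in gathered)
      dist.destroy_process_group()

  if __name__ == "__main__":
      world_size = torch.cuda.device_count()
      mp.spawn(_check_amax_sync, args=(world_size,), nprocs=world_size)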

@CedricHwong requested a review from a team as a code owner December 26, 2025 11:23
@CedricHwong requested a review from ajrasane December 26, 2025 11:23

copy-pr-bot bot commented Dec 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@realAsma (Contributor) commented Jan 8, 2026

Can we hold off on this PR? We are currently running into issues related to Megatron MoE max calibration; see #752.

In addition, amax calibration for MSE is not as straightforward as implemented in this PR. To be correct, MSE distributed sync should synchronize the mse_losses across the relevant distributed parallelism, not just the final amax values.

Considering that we are still gathering MSE calibration results, this feature is premature in my opinion.
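
In code terms, the distinction being raised could look roughly like the following (hypothetical helper, not ModelOpt API): rather than all-reducing the chosen amax, the per-candidate MSE losses would be reduced across ranks before the argmin is taken.

  import torch
  import torch.distributed as dist

  def pick_amax_with_global_mse(candidate_amaxes, local_mse_losses, group=None):
      # candidate_amaxes: [K] candidate clip values tried during the MSE search
      # local_mse_losses: [K] this rank's loss for each candidate
      global_losses = local_mse_losses.clone()
      dist.all_reduce(global_losses, op=dist.ReduceOp.SUM, group=group)
      return candidate_amaxes[torch.argmin(global_losses)]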
