Commit fe52b2a
Bias running average computation in float (#738)
## What does this PR do?

**Type of change:** Bug fix

**Overview:** Computing the bias running average in bf16 produces incorrect estimates. Impact on accuracy for the Qwen2.5-7B model:

| Running average dtype | NVFP4_AFFINE_KV |
| --- | --- |
| BF16 | 59.11% |
| Float | 71.81% |

## Usage

Use examples/lm_eval/mmlu.py with a batch size of 1.

Note: the issue is masked at larger batch sizes.

## Testing

- Ran the MMLU benchmark with mmlu.py and nv-eval.
- Also plotted the bf16 and float running averages for different layers; one example, for layer 0 of Qwen2.5-7B:

<img width="2100" height="600" alt="bf16 vs float running average, layer 0 of Qwen2.5-7B" src="https://github.com/user-attachments/assets/715059c5-34a4-495e-b6f1-0b57cf0c08af" />

Note: for the larger values, bf16 shows a smaller value than float.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: NA
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: NA

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
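To make the failure mode described in the overview concrete, here is a minimal sketch (not part of the commit; the magnitudes and sample count are invented) comparing the running-average recurrence in bf16 and float32. Because the intermediate product `avg * cnt` grows with the sample count, bf16's roughly 8 bits of mantissa eventually round away each newly added sample:

```python
import torch

# Illustrative only: the recurrence avg <- (avg * cnt + x) / (cnt + 1)
# degrades in bf16 because avg * cnt grows with cnt and bf16 rounding
# then swallows the newly added sample. Values here are made up.
torch.manual_seed(0)
samples = 300.0 + torch.randn(512)  # large-magnitude "bias" samples

avg_bf16 = samples[0].to(torch.bfloat16)
avg_fp32 = samples[0].float()
for cnt in range(1, len(samples)):
    x = samples[cnt]
    # bf16 update: every intermediate value is rounded to bf16 precision
    avg_bf16 = (avg_bf16 * cnt + x.to(torch.bfloat16)) / (cnt + 1)
    # float32 update (what the fix does): accumulate in full precision
    avg_fp32 = (avg_fp32 * cnt + x) / (cnt + 1)

print(f"bf16 running mean: {avg_bf16.item():.4f}")
print(f"fp32 running mean: {avg_fp32.item():.4f}")
print(f"true mean:         {samples.mean().item():.4f}")
```

With values around 300 and a few hundred samples, the bf16 estimate drifts visibly from the true mean, while the float32 estimate tracks it closely.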
1 parent 4eb1835 commit fe52b2a

File tree

1 file changed: +7 -1 lines changed

  • modelopt/torch/quantization/calib

modelopt/torch/quantization/calib/bias.py

Lines changed: 7 additions & 1 deletion
```diff
@@ -134,7 +134,13 @@ def collect(self, x: torch.Tensor):
             if self._calib_bias is None:
                 self._calib_bias = bias_
             else:
-                self._calib_bias = (self._calib_bias * self._cnt + bias_) / (self._cnt + 1)
+                dtype = bias_.dtype
+                # Convert bias to float for numerical stability
+                self._calib_bias = (self._calib_bias.float() * self._cnt + bias_.float()) / (
+                    self._cnt + 1
+                )
+                self._calib_bias = self._calib_bias.to(dtype)
+
             self._cnt += 1
         elif self._method == "max_min":
             max_, min_ = compute_maxmin(x, self._axis)
```
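The essence of the fix, as a standalone sketch (hypothetical helper name, mirroring the diff above): the recurrence accumulates in float32, and only the stored statistic is cast back to the original dtype, so each step rounds a value near the mean once instead of rounding the large intermediate product `avg * cnt`.

```python
import torch

def running_average_step(avg: torch.Tensor, new: torch.Tensor, cnt: int) -> torch.Tensor:
    """One step of the patched recurrence: accumulate in float32,
    then cast back to the incoming (possibly bf16) dtype."""
    dtype = new.dtype
    out = (avg.float() * cnt + new.float()) / (cnt + 1)
    return out.to(dtype)

# Usage: the stored average stays bf16 between steps, but the arithmetic
# that combines it with the count happens in float32.
avg = torch.tensor(300.0, dtype=torch.bfloat16)
avg = running_average_step(avg, torch.tensor(304.0, dtype=torch.bfloat16), cnt=1)
print(avg)  # tensor(302., dtype=torch.bfloat16)
```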
