Thanks for the document about quantization here: https://github.com/mlcommons/inference_results_v2.1/blob/master/closed/NVIDIA/documentation/calibration.md
I'm learning the FP8 part and the FP8 quantization equation doesn't make sense to me. As we know, FP8 is a floating-point format, so an FP8-quantized FP32 number should also be a floating-point number, just with fewer exponent and significand bits than FP32. However, the FP8 quantization equation in the link above is:

x_q = round(clip(x / dr * m, -m, m))

which means x_q would be an integer after the round operation.
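For example, here is a minimal sketch of how I read that equation; the concrete values of dr and m are hypothetical and only for illustration (dr being the calibrated dynamic range and m the clipping maximum):

```python
import numpy as np

# My reading of x_q = round(clip(x / dr * m, -m, m)):
# scale by m / dr, clip to [-m, m], then round to the nearest value.
def quantize(x, dr, m):
    return np.round(np.clip(x / dr * m, -m, m))

# Hypothetical example values, not taken from the document.
x = np.array([0.01, -0.37, 1.5], dtype=np.float32)
print(quantize(x, dr=1.0, m=448.0))  # every result comes out as a whole number
```

If that interpretation is right, every x_q is a whole number, which is what confuses me for a floating-point target format.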
It would be appreciated if you could share some explanation. Thanks a lot.