Thanks for the document about quantization here: https://github.com/mlcommons/inference_results_v2.1/blob/master/closed/NVIDIA/documentation/calibration.md
I'm learning the FP8 part and the FP8 quantization equation doesn't make sense to me. As we know, FP8 is a floating-point format, so an FP8-quantized FP32 number should also be a floating-point number, just with fewer exponent and significand bits than FP32. However, the FP8 quantization equation in the link above is:

x_q = round(clip(x / dr * m, -m, m))

which means x_q would be an integer after the round operation.
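For example, here is a minimal sketch of how I read that equation; the concrete values of dr and m are hypothetical and only for illustration (dr being the calibrated dynamic range and m the clipping maximum):

```python
import numpy as np

# My reading of x_q = round(clip(x / dr * m, -m, m)):
# scale by m / dr, clip to [-m, m], then round to the nearest value.
def quantize(x, dr, m):
    return np.round(np.clip(x / dr * m, -m, m))

# Hypothetical example values, not taken from the document.
x = np.array([0.01, -0.37, 1.5], dtype=np.float32)
print(quantize(x, dr=1.0, m=448.0))  # every result comes out as a whole number
```

If that interpretation is right, every x_q is a whole number, which is what confuses me for a floating-point target format.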
It would be appreciated if you could share some explanation. Thanks a lot.