FP8 quantization equations #5

Open
zhuango opened this issue Nov 2, 2022 · 0 comments
zhuango commented Nov 2, 2022

Thanks for the document about quantization here: https://github.com/mlcommons/inference_results_v2.1/blob/master/closed/NVIDIA/documentation/calibration.md
I'm learning the FP8 part and found that the FP8 quantization equation doesn't make sense to me. As we know, FP8 is a floating-point format, so an FP8-quantized FP32 number should also be a floating-point number, just with fewer exponent and significand bits than FP32. However, from the FP8 quantization equation in the link above:
x_q = round(clip(x / dr * m, -m, m))
x_q would be an integer after the round operation.
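A minimal sketch of that equation (with hypothetical values for `dr`, the calibrated dynamic range, and `m`, the clipping bound, here 127 as in the int8 case) shows that every output is indeed a whole number:

```python
import numpy as np

def quantize(x, dr, m):
    # x_q = round(clip(x / dr * m, -m, m)), as written in the linked document
    return np.round(np.clip(x / dr * m, -m, m))

x = np.array([-3.0, 0.5, 2.0])
x_q = quantize(x, dr=2.0, m=127.0)
print(x_q)  # [-127.   32.  127.] -- integer codes, not floating-point FP8 values
```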
I would appreciate it if you could share some explanation. Thanks a lot.
