
Latency 20x with quant_mode = true #21

Open
LiamPKU opened this issue Mar 8, 2022 · 1 comment

Labels
question (Further information is requested)

Comments

LiamPKU commented Mar 8, 2022

In the Hugging Face config, I set quant_mode = True.
The weight_integer buffer remains all zeros, and the result is wrong.
Moreover, the inference latency in integer mode is 20 times that of float mode.
Can you please explain the reason for this?
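
For context, here is a minimal sketch of the setup being described, assuming the "Hugging Face config" refers to the transformers I-BERT implementation (whose IBertConfig exposes a quant_mode flag and whose quantized linear layers carry the weight_integer buffer mentioned above); the checkpoint name is illustrative, not taken from the issue:

```python
# Minimal sketch -- assumes transformers' IBertConfig / IBertModel and an
# illustrative I-BERT checkpoint; neither is stated explicitly in the issue.
from transformers import IBertConfig, IBertModel

config = IBertConfig.from_pretrained("kssteven/ibert-roberta-base", quant_mode=True)
model = IBertModel.from_pretrained("kssteven/ibert-roberta-base", config=config).eval()

# Inspect the integer weight buffer of one quantized linear layer.
# Per the report above, it is still all zeros right after loading.
qlinear = model.encoder.layer[0].attention.self.query
print(qlinear.weight_integer.abs().sum())
```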

LiamPKU added the question (Further information is requested) label Mar 8, 2022
huu4ontocord commented

Hi,

Similar to this, I also found it is much slower with quant_mode = True. Here's a notebook with a slightly modified version of the HF code that allows dynamically switching quant_mode, so you can see the timing difference:

https://colab.research.google.com/drive/1DkYFGc18oPvAn5nyGEL1aIFHmD_aNlXW
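
A rough sketch of that kind of timing comparison is below. This is not the notebook's code; it assumes that flipping the quant_mode attribute on every submodule that defines it is enough to switch the compute path at runtime (roughly the modification described above), and the checkpoint name is again illustrative:

```python
# Rough sketch of a quant_mode timing comparison (assumptions noted in the text above).
import time
import torch
from transformers import AutoTokenizer, IBertModel

tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = IBertModel.from_pretrained("kssteven/ibert-roberta-base").eval()
inputs = tokenizer("Timing a single forward pass.", return_tensors="pt")

def set_quant_mode(model, enabled):
    # QuantLinear, QuantAct, IntLayerNorm, etc. gate their forward on a
    # quant_mode attribute; flip it everywhere it appears.
    for module in model.modules():
        if hasattr(module, "quant_mode"):
            module.quant_mode = enabled

for enabled in (False, True):
    set_quant_mode(model, enabled)
    with torch.no_grad():
        model(**inputs)  # warm-up
        start = time.perf_counter()
        for _ in range(10):
            model(**inputs)
    elapsed = (time.perf_counter() - start) / 10
    print(f"quant_mode={enabled}: {elapsed * 1000:.1f} ms per forward pass")
```

If I understand the transformers implementation correctly, quant_mode = True simulates the integer-only arithmetic with regular floating-point ops plus extra scaling-factor bookkeeping, so a comparison like this is expected to show overhead rather than a deployment speedup.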
