
Can the CPU be used for inference? #1

Open
luoling1993 opened this issue Mar 18, 2021 · 1 comment
Labels: question (Further information is requested)

Comments

@luoling1993

Excellent work!

Can the CPU be used for inference?
And how much faster is it than the baseline?

luoling1993 added the question label on Mar 18, 2021
@kssteven418 (Owner)

Thanks for your interest!
I should first mention that this PyTorch implementation of I-BERT only searches for the integer parameters (i.e., it performs quantization-aware training) that minimize the accuracy degradation relative to the full-precision counterpart.
As far as I know, PyTorch does not support integer operations (aside from its own quantization library, whose functionality is quite limited), so the current PyTorch implementation does not achieve any latency reduction on real hardware by itself.
To deploy I-BERT on a GPU or CPU and actually obtain a speedup, you additionally need to export the integer parameters obtained from this implementation, along with the model architecture, to a framework that supports deployment on integer processing units; TVM and TensorRT are two such examples.

Hope this answers your question!
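For concreteness, here is a minimal sketch of what that export step might start from: loading the quantization-aware-trained checkpoint and collecting the learned quantization parameters that would then be handed to a framework such as TVM or TensorRT. The file name and the key-name filters below are hypothetical, not the actual I-BERT names, and the checkpoint is assumed to hold a plain state dict.

```python
import torch

# Hypothetical checkpoint produced by quantization-aware training;
# assumed to contain a plain state dict (unwrap it first if the file
# stores a larger dict, e.g. {"model": state_dict, ...}).
state_dict = torch.load("ibert_qat_checkpoint.pt", map_location="cpu")

# Gather the tensors that parameterize the integer model (e.g. scaling
# factors). The "scaling_factor" / "scale" substrings are illustrative
# placeholders; the real key names depend on the implementation.
quant_params = {
    name: tensor
    for name, tensor in state_dict.items()
    if "scaling_factor" in name or "scale" in name
}

for name, tensor in quant_params.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```

From there, these tensors, together with the model graph, would be re-expressed using the target framework's integer operators to get the actual hardware speedup.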
