GPU Optimization #34
For GPU you can use the […]; here's an example implementation of this library for BERT — you can follow this guide and make suitable changes for T5. In addition to this you also need to implement […].
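For reference, here is a minimal sketch of what "making suitable changes for T5" could look like: export just the T5 encoder to ONNX via a small wrapper and run it on GPU with the `CUDAExecutionProvider`. This is not fastT5's actual export code (fastT5 also exports the decoder and past-key-value graphs); the model name, output path, and shapes below are illustrative assumptions.

```python
# Hedged sketch, not fastT5's export path: encoder-only T5 export + GPU inference.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, T5EncoderModel

class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so torch.onnx.export sees a simple output."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state

model_name = "t5-small"  # assumption: any T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = EncoderWrapper(T5EncoderModel.from_pretrained(model_name)).eval()

sample = tokenizer("translate English to German: hello", return_tensors="pt")
torch.onnx.export(
    encoder,
    (sample["input_ids"], sample["attention_mask"]),
    "t5_encoder.onnx",  # hypothetical output path
    input_names=["input_ids", "attention_mask"],
    output_names=["hidden_states"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"},
                  "hidden_states": {0: "batch", 1: "seq"}},
    opset_version=13,
)

# Requires the onnxruntime-gpu package; falls back to CPU if CUDA is unavailable.
sess = ort.InferenceSession(
    "t5_encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
out = sess.run(None, {"input_ids": sample["input_ids"].numpy(),
                      "attention_mask": sample["attention_mask"].numpy()})
```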
I would also check out this recent demo that NVIDIA did of TensorRT, which involves converting to ONNX as an intermediate step. They run the tests on GPU: https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/README.md
ONNX Runtime supports GPU quantization through the TensorRT provider (now embedded by default in the GPU version of the PyPI package, no need for a custom compilation). However, it only supports PTQ, meaning there is a 2-3 point accuracy cost (vs. QAT or dynamic quantization, which are usually close to non-quantized accuracy). Quantization brings a 2x speedup, which you can add to the 1.3x speedup from switching from ORT to TRT, so it is quite significant on base/large models (not yet benchmarked on distilled models). Hopefully, QAT is also doable, but it requires some work per model (modifying the attention part to add QDQ nodes). You can see some examples here: ELS-RD/transformer-deploy#29 — for now only for Albert, Electra, Bert, Roberta and Distilbert. I will probably add support for Deberta V1 and V2, T5 and Bart as a next step.
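To make the "GPU quantization through the TensorRT provider" point concrete, here is a hedged sketch of enabling reduced precision on the TensorRT execution provider. The option names (`trt_fp16_enable`, `trt_int8_enable`, `trt_int8_calibration_table_name`) follow the ONNX Runtime TensorRT EP documentation in recent onnxruntime-gpu releases; verify them against your installed version. The INT8 path is PTQ, i.e. the 2-3 point accuracy cost described above, and needs calibration data.

```python
# Hedged sketch: TensorRT execution provider with reduced precision (PTQ).
import onnxruntime as ort

trt_options = {
    "trt_fp16_enable": True,   # FP16 kernels, usually close to full-precision accuracy
    "trt_int8_enable": True,   # INT8 PTQ, the 2-3 point accuracy cost discussed above
    # "trt_int8_calibration_table_name": "calibration.flatbuffers",  # hypothetical calibration table
}

sess = ort.InferenceSession(
    "model.onnx",  # same hypothetical file name as in the snippet further down
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",  # fallback for ops TensorRT cannot handle
    ],
)
```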
I really appreciate the functionality that the fastT5 library offers! Like the original poster, I am looking to leverage the speedup from both ONNX Runtime and quantization that fastT5 offers, and deploy this on an Nvidia GPU. Do you have any pointers on how to accomplish this with a t5-large model? @Ki6an or @sam-writer, thanks!
I now understand @pommedeterresautee's comment: you do not need to convert to TRT format to use TRT. You can convert to ONNX format, then, per the ONNX Runtime docs, use the TRT execution provider:

```python
import onnxruntime as ort

# Set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'],
# with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession(
    'model.onnx',
    providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
```

This might be fast enough, because TRT gives a 1.3x speed boost. But if you want the 2x speed boost of quantization, you take an accuracy hit. In […]. Another consideration on GPU that isn't a factor on CPU is […]
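To sanity-check the kind of speedups quoted here (roughly 1.3x from TRT, 2x from quantization), a rough timing loop like the sketch below can be run against the same exported file. It assumes the graph takes only int64 token inputs of shape (batch, seq), which holds for the encoder-style exports discussed in this thread but not for every graph; the file name is the same hypothetical `model.onnx` as above.

```python
# Hedged sketch: crude latency comparison between the TRT-first and plain CUDA providers.
import time
import numpy as np
import onnxruntime as ort

def benchmark(path, providers, runs=100):
    sess = ort.InferenceSession(path, providers=providers)
    # Assumption: every graph input is an int64 token tensor of shape (batch, seq).
    feeds = {inp.name: np.ones((1, 128), dtype=np.int64) for inp in sess.get_inputs()}
    for _ in range(10):  # warm-up: TensorRT builds its engine on the first runs
        sess.run(None, feeds)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feeds)
    return (time.perf_counter() - start) / runs

for providers in (['TensorrtExecutionProvider', 'CUDAExecutionProvider'],
                  ['CUDAExecutionProvider']):
    print(providers[0], benchmark('model.onnx', providers))
```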
Here is an example from a branch of the ONNX library that demonstrates using […]
@sam-writer So, to get a performance improvement in T5 inference time on GPU right now, do you recommend this code?
Is there any example code of real use? I would like to improve the GPU inference time of a T5-base with […]
Thanks for sharing the repo, it is really helpful.
I'm exploring ways to do the optimization on GPU. I know it's not presently supported. Could you share some approach or references for implementing the optimization on GPU (Nvidia)?