Quantization of pretrained models using onnxruntime #6387
Unanswered
HemaSowjanyaMamidi asked this question in Other Q&A
Replies: 1 comment 11 replies
@HemaSowjanyaMamidi, the ORT TensorRT Execution Provider (EP) can run statically quantized models, but the official onnxruntime-gpu package does not. Please refer to this script for an end-to-end example with the TRT EP: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/E2E_example_model/e2e_tensorrt_resnet_example.py. Note that quantization support in the TRT EP is still in progress, so you may not see a performance gain from it yet.
Hi All,
I took a pretrained Keras model, converted it to an ONNX model, and then tried the following approaches to produce quantized versions of it. Two questions:
Does onnxruntime-gpu support quantized models?
Are all the operators being quantized properly?