Sparse-quantized model runs without VNNI acceleration #1

Open · ylep opened this issue Oct 23, 2023 · 5 comments

@ylep (Member) commented Oct 23, 2023

Describe the bug

Hi Dr @clementpoiret! Now that you have graduated 🎉 here is a technical issue to keep you busy 😉

On a workstation with AVX512 and VNNI CPU capabilities, I am getting the following message:

DeepSparse Optimization Status (minimal: AVX2 | partial: AVX512 | full: AVX512 VNNI): full
[nm_ort 7fda254c7280 >WARN<  is_supported_graph src/onnxruntime_neuralmagic/supported/ops.cc:134] Warning: Optimized runtime disabled - Detected dynamic input input dim 2. Set inputs to static shapes to enable optimal performance.

The performance is indeed worse than the non-sparse model (although I am not sure how it is counting CPU-time here w.r.t. HyperThreading):

  • 6 min 52 s wall-time / 40 min 56 s user CPU-time for segmentation=bagging_sq hardware=deepsparse
  • vs 4 min 16 s wall-time / 79 min 29 s user CPU-time for hardware=onnxruntime model=bagging_accurate hardware.engine_settings.execution_providers="['CPUExecutionProvider']"
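
As an aside regarding the "Set inputs to static shapes" hint in the warning above, this is a minimal, untested sketch of what a static-shape ONNX re-export could look like; the model variable, file name, opset and input dimensions are illustrative placeholders, not the actual HSF export code:

```python
# Illustrative sketch only: re-export with every input dimension fixed, so the
# DeepSparse optimized runtime is not disabled by a dynamic dim. The shape,
# file name and opset below are placeholders, not the real HSF values.
import torch

dummy = torch.randn(1, 1, 160, 214, 176)  # fully static example shape

torch.onnx.export(
    model,                      # trained nn.Module, assumed to be in scope
    dummy,
    "model_static.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    # no dynamic_axes argument: all dims stay static, as the warning requests
)
```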

Environment

  • OS: Ubuntu 22.04
  • Python: 3.10.12
  • HSF Version: 1.1.3
  • Relevant settings: segmentation=bagging_sq hardware=deepsparse
  • Versions of a few relevant dependencies:
deepsparse==1.5.3
onnx==1.12.0
onnxruntime==1.16.1
onnxruntime-gpu==1.16.1
sparsezoo==1.5.2
torch==2.1.0
torchio==0.18.92
@clementpoiret (Collaborator)

Dear Dr. Leprince,
It is, I believe, linked to: neuralmagic/sparseml#733

I'll have to check if they added support for TConv.
After that, I'll check if I can update the training code (and publish it on github) 👍

@ylep (Member, Author) commented Oct 23, 2023

Ohhh so this is a duplicate of clementpoiret#22, silly me ☹️. Feel free to close this issue, or the previous one, so that we have a single place for tracking the progress.

Anyway, thanks for the reply! In the meantime I will deploy the non-sparse models as the default in NeuroSpin.

@clementpoiret (Collaborator)

Np :) It's always a pleasure to read a message from Dr. Leprince 😁

Anyway, in all apps, I think sparse/optimized networks should always be optional, as they rely on very recent hardware that most users do not have...

@clementpoiret (Collaborator)

A little update on the issue.
I still have to test it, but I made an easy way to do Quantization-Aware Training and neural pruning using Intel(R) Neural Compressor. This should work out of the box:

https://github.com/clementpoiret/lightning-nc/
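
For context, the general QAT flow that the repo above automates looks roughly like the following. This is a minimal sketch using PyTorch's built-in torch.ao.quantization rather than the Intel Neural Compressor / lightning-nc code itself, and the toy model and backend choice are illustrative only:

```python
# Minimal QAT sketch with torch.ao.quantization (NOT the lightning-nc / INC code):
# fake-quant observers are inserted, the model is fine-tuned, then converted to int8.
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = nn.Sequential(
    nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 26 * 26, 10)
)
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 backend
prepare_qat(model, inplace=True)

# ... fine-tune for a few epochs with the usual training loop,
#     fake-quant observers active ...

model.eval()
quantized = convert(model)  # int8 model, ready for export/inference
```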

@clementpoiret (Collaborator) commented Nov 20, 2023

Also, to quote ONNXRuntime:

The quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). We can choose the signedness of the activations and the weights separately, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), etc. Let’s use U8U8 as a shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.

ONNX Runtime quantization on CPU can run U8U8, U8S8 and S8S8. S8S8 with QDQ is the default setting and balances performance and accuracy. It should be the first choice. Only in cases that the accuracy drops a lot, you can try U8U8. Note that S8S8 with QOperator will be slow on x86-64 CPUs and should be avoided in general. ONNX Runtime quantization on GPU only supports S8S8.

WHEN AND WHY DO I NEED TO TRY U8U8?
On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try reduce_range or the U8U8 format which doesn’t have saturation issues.

There is no such issue on other CPU architectures (x64 with VNNI and ARM).
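
To make the quoted recommendation concrete, here is a hedged sketch of S8S8/QDQ static quantization with onnxruntime.quantization; the file names and the calibration reader are placeholders, not the actual HSF pipeline:

```python
# Sketch of S8S8 (QDQ) static quantization with ONNX Runtime; paths and the
# calibration data below are placeholders for illustration.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few representative inputs for calibration (random here, for illustration)."""
    def __init__(self, n=8, shape=(1, 1, 160, 214, 176)):
        self._data = iter(
            {"input": np.random.rand(*shape).astype(np.float32)} for _ in range(n)
        )

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model_static.onnx",
    "model_static_int8.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,        # S8S8 with QDQ: the recommended default
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    # reduce_range=True,                 # fallback if saturation hurts accuracy
)
```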
