Sparse-quantized model runs without VNNI acceleration #1

Open · ylep opened this issue Oct 23, 2023 · 5 comments

@ylep (Member) commented Oct 23, 2023

Describe the bug

Hi Dr @clementpoiret! Now that you have graduated 🎉 here is a technical issue to keep you busy 😉

On a workstation with AVX512 and VNNI CPU capabilities, I am getting the following message:

DeepSparse Optimization Status (minimal: AVX2 | partial: AVX512 | full: AVX512 VNNI): full
[nm_ort 7fda254c7280 >WARN<  is_supported_graph src/onnxruntime_neuralmagic/supported/ops.cc:134] Warning: Optimized runtime disabled - Detected dynamic input input dim 2. Set inputs to static shapes to enable optimal performance.

The performance is indeed worse than the non-sparse model (although I am not sure how it is counting CPU-time here w.r.t. HyperThreading):

  • 6 min 52 s wall-time / 40 min 56 s user CPU-time for segmentation=bagging_sq hardware=deepsparse
  • vs 4 min 16 s wall-time / 79 min 29 s user CPU-time for hardware=onnxruntime model=bagging_accurate hardware.engine_settings.execution_providers="['CPUExecutionProvider']"
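
As an aside regarding the "Set inputs to static shapes" hint in the warning above, this is a minimal, untested sketch of what a static-shape ONNX re-export could look like; the model variable, file name, opset and input dimensions are illustrative placeholders, not the actual HSF export code:

```python
# Illustrative sketch only: re-export with every input dimension fixed, so the
# DeepSparse optimized runtime is not disabled by a dynamic dim. The shape,
# file name and opset below are placeholders, not the real HSF values.
import torch

dummy = torch.randn(1, 1, 160, 214, 176)  # fully static example shape

torch.onnx.export(
    model,                      # trained nn.Module, assumed to be in scope
    dummy,
    "model_static.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    # no dynamic_axes argument: all dims stay static, as the warning requests
)
```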

Environment

  • OS: Ubuntu 22.04
  • Python: 3.10.12
  • HSF Version: 1.1.3
  • Relevant settings: segmentation=bagging_sq hardware=deepsparse
  • Versions of a few relevant dependencies:
deepsparse==1.5.3
onnx==1.12.0
onnxruntime==1.16.1
onnxruntime-gpu==1.16.1
sparsezoo==1.5.2
torch==2.1.0
torchio==0.18.92
@clementpoiret (Collaborator)

Dear Dr. Leprince,
It is, I believe, linked to: neuralmagic/sparseml#733

I'll have to check if they added support for TConv.
After that, I'll check if I can update the training code (and publish it on github) 👍

@ylep (Member, Author) commented Oct 23, 2023

Ohhh so this is a duplicate of clementpoiret#22, silly me ☹️. Feel free to close this issue, or the previous one, so that we have a single place for tracking the progress.

Anyway, thanks for the reply! In the meantime I will deploy the non-sparse models as the default in NeuroSpin.

@clementpoiret (Collaborator)

Np :) It's always a pleasure to read a message from Dr. Leprince 😁

Anyway, in all apps, I think sparse/optimized networks should always be optional, as they rely on very recent hardware that most users do not have...

@clementpoiret (Collaborator)

A little update on the issue.
I still have to test it, but I made an easy way to do Quantization-Aware Training and neural pruning using Intel(R) Neural Compressor. This should work out of the box:

https://github.com/clementpoiret/lightning-nc/
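
For context, the general QAT flow that the repo above automates looks roughly like the following. This is a minimal sketch using PyTorch's built-in torch.ao.quantization rather than the Intel Neural Compressor / lightning-nc code itself, and the toy model and backend choice are illustrative only:

```python
# Minimal QAT sketch with torch.ao.quantization (NOT the lightning-nc / INC code):
# fake-quant observers are inserted, the model is fine-tuned, then converted to int8.
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = nn.Sequential(
    nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 26 * 26, 10)
)
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")  # x86 backend
prepare_qat(model, inplace=True)

# ... fine-tune for a few epochs with the usual training loop,
#     fake-quant observers active ...

model.eval()
quantized = convert(model)  # int8 model, ready for export/inference
```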

@clementpoiret (Collaborator) commented Nov 20, 2023

Also, to quote ONNXRuntime:

The quantized values are 8 bits wide and can be either signed (int8) or unsigned (uint8). We can choose the signedness of the activations and the weights separately, so the data format can be (activations: uint8, weights: uint8), (activations: uint8, weights: int8), etc. Let’s use U8U8 as a shorthand for (activations: uint8, weights: uint8), U8S8 for (activations: uint8, weights: int8), and similarly S8U8 and S8S8 for the remaining two formats.

ONNX Runtime quantization on CPU can run U8U8, U8S8 and S8S8. S8S8 with QDQ is the default setting and balances performance and accuracy. It should be the first choice. Only in cases that the accuracy drops a lot, you can try U8U8. Note that S8S8 with QOperator will be slow on x86-64 CPUs and should be avoided in general. ONNX Runtime quantization on GPU only supports S8S8.

WHEN AND WHY DO I NEED TO TRY U8U8?
On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try reduce_range or the U8U8 format which doesn’t have saturation issues.

There is no such issue on other CPU architectures (x64 with VNNI and ARM).
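
To make the quoted recommendation concrete, here is a hedged sketch of S8S8/QDQ static quantization with onnxruntime.quantization; the file names and the calibration reader are placeholders, not the actual HSF pipeline:

```python
# Sketch of S8S8 (QDQ) static quantization with ONNX Runtime; paths and the
# calibration data below are placeholders for illustration.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few representative inputs for calibration (random here, for illustration)."""
    def __init__(self, n=8, shape=(1, 1, 160, 214, 176)):
        self._data = iter(
            {"input": np.random.rand(*shape).astype(np.float32)} for _ in range(n)
        )

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model_static.onnx",
    "model_static_int8.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,        # S8S8 with QDQ: the recommended default
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    # reduce_range=True,                 # fallback if saturation hurts accuracy
)
```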
