[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics #26688
Conversation
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
You can commit the suggested changes from lintrunner.
yuslepukhin left a comment
Referenced: Raise the timeout for the iOS simulator job (#27045)
### Description
As title; it looks like the duration of the job is very close to the timeout.
### Motivation and Context
Reduce retry attempts for the iOS sim job. My own PR (#26688) keeps timing out this job leg.
Since this PR and #26838 were merged in a short time window, we may have hit an edge case where the CI for #26838 succeeded, but it was last run before this PR merged.* So, we still have a reference to the old `sconv.h` header name.
*This is my hypothesis. I do not know if the CI is extensive enough to cover this build configuration.
Oh! I think the reason is that NCHWC is only enabled on macOS right now. With the BF16 feature being turned on only on Linux, I think we may have to enable it in the Linux CI as well. We are a bit busy with the 1.24 release, so I will add it to the backlog and revisit later. Can you please propose a fix? FWIW, this is a PR that might interest you AND have the fix: #27099. That PR may not make it into 1.24 at this point, so if you could provide an isolated fix, I will ensure it gets cherry-picked into 1.24. Thanks!
Cool, np! Yeah, I did see #27099, and it prompted me to check why this wasn't caught earlier. I will raise a fix for it now.
### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in #26688, but the reference to the old name was still present in a new file added at around the same time in #26838. The CI doesn't include building for this configuration yet; it will be added after the 1.24 release.
### Motivation and Context
Fixes the failing mainline build on Arm64 Linux when `--enable_arm_neon_nchwc` is supplied.
### Testing
This now passes on Arm64 Linux:
`./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc`
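Concretely, the fix amounts to repointing the stale include at the renamed header. A sketch only; the name of the including file added in #26838 is not shown here:

```cpp
// Before: stale include left over from before the rename in #26688
// #include "sconv.h"

// After: reference the renamed header
#include "sconv_nchwc_kernel_neon.h"
```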
### Description
This PR cherry-picks the following changes for the 1.24.0 release.
### Cherry-picked Commits
| Commit | Commit Title | Author |
|---|---|---|
| 744e7fe | Add type definitions, registration, utilities for INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) | tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) | Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 - Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) | quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688) | Hariharan Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) | Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) | qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040) | Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider (#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64 (#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) | eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation (#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) | tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) | vraspar |
| 39f966e | Fix Doxygen documentation build error in onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) | Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) | Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) | Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility string from model metadata (#27015) | adrastogi |
Description
Motivation and approach taken:
Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current approach of `Im2Col + SGemm`. The Im2Col step wastefully materializes the convolution patches, and for a 3x3 filter each output pixel gathers only a 9-element patch, so `K` would be 9 for the SGemm; GEMMs are usually not optimized for such small `K` values. Hence, a dedicated kernel works much better.
Initially, I ported over the Winograd-based, NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that approach can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).
The depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it is now functionally correct and provides the performance boost we are seeking.
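For intuition, here is a minimal sketch of the style of NEON inner loop such a kernel can use. The function name, the row-pointer convention, and the interior-pixels-only simplification are my own illustrative assumptions, not the actual MLAS implementation:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Hypothetical sketch, not the actual MLAS kernel: computes one output row of
// a 3x3, stride-1 depthwise convolution for a single channel. Padding handling
// is omitted; the three row pointers are assumed to already point at the first
// input column that contributes to out[0].
void DepthwiseConv3x3Row(const float* row0,    // input row above
                         const float* row1,    // center input row
                         const float* row2,    // input row below
                         const float* filter,  // 9 filter taps, row-major
                         float* out,           // output row
                         size_t out_width) {   // assumed to be a multiple of 4
  const float* rows[3] = {row0, row1, row2};
  for (size_t x = 0; x + 4 <= out_width; x += 4) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int ky = 0; ky < 3; ++ky) {
      for (int kx = 0; kx < 3; ++kx) {
        // Unaligned load of 4 neighboring input pixels, multiplied by the
        // broadcast filter tap and accumulated; no patch matrix is ever built.
        float32x4_t in = vld1q_f32(rows[ky] + x + kx);
        acc = vmlaq_n_f32(acc, in, filter[ky * 3 + kx]);
      }
    }
    vst1q_f32(out + x, acc);
  }
}
```

Each of the nine filter taps stays resident in a register and four output pixels are produced per iteration, so the kernel reads each input value straight from the feature map rather than first copying it into a patch matrix, which is exactly where the `Im2Col + SGemm` path loses time at K = 9.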
Unit tests:
Depthwise convolution tests already exist in the codebase; I don't see a need for new ones at this point.
Kernel benchmarking:
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about 50% kernel latency improvement):
[benchmark screenshot: MLAS Conv benchmark results]
Motivation and Context
A key customer model had a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement using the customer-provided benchmarking setup.
For those interested, #26654 adds support for the same type of convolution variant, but it leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.