
Conversation


@hariharans29 hariharans29 commented Dec 1, 2025

Description

Motivation and approach taken:

Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current approach of Im2Col + SGemm: the Im2Col step wastefully extracts convolution patches, and for a 3x3 filter, K would be only 9 for the SGemm, while GEMMs are usually not optimized for such small K values. Hence, a dedicated kernel works much better.
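
For reference, the configuration this kernel targets can be expressed as a scalar sketch (this is not the NEON implementation in the PR; the function and parameter names here are illustrative only):

```cpp
#include <cstddef>
#include <vector>

// Scalar reference for a 3x3 depthwise convolution with stride 1, pad 1,
// dilation 1: each channel is convolved with its own 3x3 filter, so the
// output spatial size equals the input spatial size.
std::vector<float> DepthwiseConv3x3(const std::vector<float>& input,   // [C][H][W]
                                    const std::vector<float>& filter,  // [C][3][3]
                                    size_t C, size_t H, size_t W) {
  std::vector<float> output(C * H * W, 0.0f);
  for (size_t c = 0; c < C; ++c) {
    for (size_t oh = 0; oh < H; ++oh) {
      for (size_t ow = 0; ow < W; ++ow) {
        float acc = 0.0f;
        for (int kh = 0; kh < 3; ++kh) {
          for (int kw = 0; kw < 3; ++kw) {
            // pad = 1: input coordinates are shifted by -1 at the border.
            int ih = static_cast<int>(oh) + kh - 1;
            int iw = static_cast<int>(ow) + kw - 1;
            if (ih < 0 || iw < 0 || ih >= static_cast<int>(H) || iw >= static_cast<int>(W)) {
              continue;  // zero padding contributes nothing
            }
            acc += input[(c * H + ih) * W + iw] * filter[(c * 3 + kh) * 3 + kw];
          }
        }
        output[(c * H + oh) * W + ow] = acc;
      }
    }
  }
  return output;
}
```

Viewed this way, each output pixel is just nine multiply-accumulates, which is why routing it through a GEMM with K = 9 underutilizes GEMM micro-kernels, while fusing the taps directly with NEON fused-multiply-add intrinsics (e.g. `vfmaq_f32`) pays off.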

Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).
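
For context on the offline idea: the Winograd filter transform depends only on the filter weights, not on the input, so it can in principle be computed once at model load rather than on every call. A minimal sketch for the 1-D F(2,3) case (the G matrix is the standard one from Lavin and Gray's Winograd formulation; the function name is illustrative):

```cpp
#include <array>

// 1-D Winograd F(2,3): transforms a 3-tap filter g into 4 values U = G * g,
// where G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]].
// G is fixed by the algorithm and g is fixed by the trained weights, so U
// can be computed offline once and reused for every input tile.
std::array<float, 4> WinogradFilterTransformF23(const std::array<float, 3>& g) {
  return {
      g[0],                         // row [1,    0,    0  ]
      0.5f * (g[0] + g[1] + g[2]),  // row [1/2,  1/2,  1/2]
      0.5f * (g[0] - g[1] + g[2]),  // row [1/2, -1/2,  1/2]
      g[2],                         // row [0,    0,    1  ]
  };
}
```

Redoing this transform on every invocation turns a one-time cost into a steady per-call overhead, which is consistent with the performance observation above.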

The depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it is now functionally correct and provides the perf boost we are seeking.

Unit tests:
There are depthwise convolution tests already existing in the codebase. I don't see a need for new ones at this point.

Kernel benchmarking:
This is the kernel level perf improvement from MLAS Conv benchmarks (About 50% kernel latency improvements):

(Screenshot: MLAS Conv benchmark results)

Motivation and Context

A key customer model had a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement using the customer-provided benchmarking setup.

For those interested, #26654 adds support for the same convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.

@hariharans29 hariharans29 changed the title WIP: Conv Expt [DO NOT REVIEW] WIP: Conv Expt Dec 2, 2025
@hariharans29
Member Author

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

@hariharans29 hariharans29 changed the title [DO NOT REVIEW] WIP: Conv Expt [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics Dec 4, 2025
hariharans29 and others added 3 commits December 4, 2025 07:14
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@hariharans29 hariharans29 requested a review from Copilot December 5, 2025 07:08
Member

@yuslepukhin yuslepukhin left a comment


:shipit:

@hariharans29 hariharans29 enabled auto-merge (squash) January 16, 2026 00:19
@hariharans29 hariharans29 merged commit c03c419 into main Jan 16, 2026
111 of 114 checks passed
@hariharans29 hariharans29 deleted the hari/expt_conv branch January 16, 2026 19:44
hariharans29 added a commit that referenced this pull request Jan 17, 2026
### Description
As title - it looks like the duration of the job is very close to the
timeout

### Motivation and Context
Reduce retry attempts for the iOS sim job

My own PR - #26688 keeps timing out on this job leg
alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request Jan 20, 2026
…using NEON intrinsics (microsoft#26688)

alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request Jan 20, 2026
@Rohanjames1997
Contributor

Since this PR and #26838 were merged in a short time window, we may have hit an edge case where the CI for #26838 succeeded, but it was last run before this PR merged.*

So, we still have a reference to sconv.h lying around here which is failing the mainline build when run with --enable_arm_neon_nchwc

*This is my hypothesis. I do not know if the CI is extensive enough to cover --enable_arm_neon_nchwc

@hariharans29
Member Author

> Since this PR and #26838 were merged in a short time window, we may have hit an edge case where the CI for #26838 succeeded, but it was last run before this PR merged.*
>
> So, we still have a reference to sconv.h lying around here which is failing the mainline build when run with --enable_arm_neon_nchwc
>
> *This is my hypothesis. I do not know if the CI is extensive enough to cover --enable_arm_neon_nchwc

Oh! I think the reason is that NCHWC is only enabled on MacOS now. With the BF16 feature being turned on only on Linux, I think we may have to enable it in the Linux CI as well. We are a bit busy with the 1.24 release, so I will add it to the backlog and re-visit later.

Can you please propose a fix? FWIW - this is a PR that might interest you AND has the fix - #27099. That PR may not go into 1.24 at this point, so if you could please provide an isolated fix, I will ensure the isolated fix gets cherry-picked into 1.24. Thanks!

@Rohanjames1997
Contributor

Cool np!

Yeah, I did see #27099 and it prompted me to check why it wasn't caught earlier.

I will raise a fix for it now.

@Rohanjames1997
Contributor

#27101

tianleiwu pushed a commit that referenced this pull request Jan 21, 2026
…using NEON intrinsics (#26688)

(cherry picked from commit c03c419)
hariharans29 pushed a commit that referenced this pull request Jan 21, 2026
### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in #26688 but the
reference to the old name was still in a new file added at around the
same time in #26838.
The CI doesn't include building for this configuration yet - it will be
added after the 1.24 release.



### Motivation and Context
Fixes failing mainline build on Arm64 linux when
`--enable_arm_neon_nchwc` is supplied.


### Testing
This now passes on Arm64 linux
`./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`
tianleiwu pushed a commit that referenced this pull request Jan 22, 2026

(cherry picked from commit d7daa45)
tianleiwu pushed a commit that referenced this pull request Jan 22, 2026

(cherry picked from commit 347b990)
tianleiwu added a commit that referenced this pull request Jan 23, 2026
### Description
This PR cherry-picks the following changes for the 1.24.0 release.

### Cherry-picked Commits
| Commit | Commit Title | Author |
|---|---|---|
| 744e7fe | Add type definitions, registration, utilities for INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) | tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) | Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 - Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) | quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688) | Hariharan Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) | Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) | qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040) | Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider (#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64 (#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) | eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation (#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) | tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) | vraspar |
| 39f966e | Fix Doxygen documentation build error in onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) | Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) | Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) | Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility string from model metadata (#27015) | adrastogi |

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Signed-off-by: melkap01 <melike.kaptan@arm.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yuduo Wu <yuduow@qti.qualcomm.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: quic-tirupath <quic_tirupath@quicinc.com>
Co-authored-by: qti-mattsinc <mattsinc@qti.qualcomm.com>
Co-authored-by: Melike Kaptan <melike.kaptan@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: qti-vaiskv <vaiskv@qti.qualcomm.com>
Co-authored-by: Rohanjames1997 <rohan.james4@gmail.com>
Co-authored-by: eserscor <erscor@microsoft.com>
Co-authored-by: eserscor <247253654+eserscor@users.noreply.github.com>
Co-authored-by: Nick Eubanks <nieubank@microsoft.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
Co-authored-by: Rohanjames1997 <rohanjms@amazon.com>
@tianleiwu tianleiwu added cherry-picked Cherry-picked for a cherrypicks branch and removed release:1.24.0 labels Jan 23, 2026