[MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics #26688
Conversation
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
You can commit the suggested changes from lintrunner.
yuslepukhin left a comment
Referenced: Raise the timeout for the iOS simulator job (#27045)
### Description
As title; it looks like the duration of the job is very close to the timeout.
### Motivation and Context
Reduce retry attempts for the iOS sim job. My own PR (#26688) keeps timing out this job leg.
Since this PR and #26838 were merged in a short time window, we may have hit an edge case where the CI for #26838 succeeded, but it was last run before this PR merged.* So, we still have a reference to the old `sconv.h` header name.
*This is my hypothesis. I do not know if the CI is extensive enough to cover this build configuration.
Oh! I think the reason is that NCHWC is only enabled on macOS right now. With the BF16 feature being turned on only on Linux, I think we may have to enable it in the Linux CI as well. We are a bit busy with the 1.24 release, so I will add it to the backlog and revisit later. Can you please propose a fix? FWIW, this is a PR that might interest you AND have the fix: #27099. That PR may not make it into 1.24 at this point, so if you could provide an isolated fix, I will ensure it gets cherry-picked into 1.24. Thanks!
Cool, np! Yeah, I did see #27099, and it prompted me to check why this wasn't caught earlier. I will raise a fix for it now.
### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in #26688, but the reference to the old name was still present in a new file added at around the same time in #26838. The CI doesn't include building for this configuration yet; it will be added after the 1.24 release.
### Motivation and Context
Fixes the failing mainline build on Arm64 Linux when `--enable_arm_neon_nchwc` is supplied.
### Testing
This now passes on Arm64 Linux:
`./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc`
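Concretely, the fix amounts to repointing the stale include at the renamed header. A sketch only; the name of the including file added in #26838 is not shown here:

```cpp
// Before: stale include left over from before the rename in #26688
// #include "sconv.h"

// After: reference the renamed header
#include "sconv_nchwc_kernel_neon.h"
```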
### Description
This PR cherry-picks the following changes for the 1.24.0 release.
### Cherry-picked Commits
| Commit | Commit Title | Author |
|---|---|---|
| 744e7fe | Add type definitions, registration, utilities for INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) | tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) | Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 - Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) | quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688) | Hariharan Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) | Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) | qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040) | Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider (#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64 (#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) | eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation (#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) | tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) | vraspar |
| 39f966e | Fix Doxygen documentation build error in onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) | Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) | Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) | Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility string from model metadata (#27015) | adrastogi |
Description
Motivation and approach taken:
Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This performs significantly better than the current approach of `Im2Col + SGemm`. The Im2Col step wastefully materializes the convolution patches, and for a 3x3 filter each output pixel gathers only a 9-element patch, so `K` would be 9 for the SGemm; GEMMs are usually not optimized for such small `K` values. Hence, a dedicated kernel works much better.
Initially, I ported over the Winograd-based, NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline; that approach can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).
The depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it is now functionally correct and provides the performance boost we are seeking.
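For intuition, here is a minimal sketch of the style of NEON inner loop such a kernel can use. The function name, the row-pointer convention, and the interior-pixels-only simplification are my own illustrative assumptions, not the actual MLAS implementation:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Hypothetical sketch, not the actual MLAS kernel: computes one output row of
// a 3x3, stride-1 depthwise convolution for a single channel. Padding handling
// is omitted; the three row pointers are assumed to already point at the first
// input column that contributes to out[0].
void DepthwiseConv3x3Row(const float* row0,    // input row above
                         const float* row1,    // center input row
                         const float* row2,    // input row below
                         const float* filter,  // 9 filter taps, row-major
                         float* out,           // output row
                         size_t out_width) {   // assumed to be a multiple of 4
  const float* rows[3] = {row0, row1, row2};
  for (size_t x = 0; x + 4 <= out_width; x += 4) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int ky = 0; ky < 3; ++ky) {
      for (int kx = 0; kx < 3; ++kx) {
        // Unaligned load of 4 neighboring input pixels, multiplied by the
        // broadcast filter tap and accumulated; no patch matrix is ever built.
        float32x4_t in = vld1q_f32(rows[ky] + x + kx);
        acc = vmlaq_n_f32(acc, in, filter[ky * 3 + kx]);
      }
    }
    vst1q_f32(out + x, acc);
  }
}
```

Each of the nine filter taps stays resident in a register and four output pixels are produced per iteration, so the kernel reads each input value straight from the feature map rather than first copying it into a patch matrix, which is exactly where the `Im2Col + SGemm` path loses time at K = 9.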
Unit tests:
Depthwise convolution tests already exist in the codebase; I don't see a need for new ones at this point.
Kernel benchmarking:
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about 50% kernel latency improvement):
[benchmark screenshot: MLAS Conv benchmark results]
Motivation and Context
A key customer model had a few depthwise convolution operations, and this change provides a non-negligible ~3% throughput improvement using the customer-provided benchmarking setup.
For those interested, #26654 adds support for the same type of convolution variant, but it leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.