
Conversation

@milpuz01
Contributor

Overview

This PR adds ARM64 NEON assembly micro‑kernels for NCHW, depthwise, and pointwise convolution, wires them into the MLAS build, and adds shape‑based selection heuristics for NCHWC depthwise/pointwise to favor the asm kernels in safe cases (stride‑1 pointwise; wider depthwise outputs). The BF16 path is unchanged.

Key changes

  • cmake/onnxruntime_mlas.cmake
    • Add new AArch64 assembly sources for NCHW, depthwise, and pointwise conv to the MLAS build.
  • onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S
    • New vectorised NCHW convolution micro‑kernel.
  • onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S
    • New vectorised depthwise micro‑kernel (fast path for in‑bounds loads, slow path for padding).
  • onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S
    • New vectorised pointwise micro‑kernel (multi‑output reuse).
  • onnxruntime/core/mlas/lib/mlasi.h, onnxruntime/core/mlas/lib/platform.cpp
    • Declare/register new asm kernels and prefer them on ARM64.
  • onnxruntime/core/mlas/lib/snchwc.cpp
    • Heuristics: use the pointwise asm kernel when `StrideHeight == 1 && StrideWidth == 1` and `OutputThisIteration >= 4`; use the depthwise asm kernel when `OutputWidth >= 4` (see the sketch after this list).
  • onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp
    • Include fix for the conv kernel flags header.
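
A minimal sketch of the selection logic described above, assuming hypothetical function boundaries (the real checks presumably sit inline in snchwc.cpp's NCHWC algorithm classes; the predicate names below are illustrative, not the actual code):

```c++
// Hypothetical sketch; the field names follow the PR description, but the
// surrounding structure is an assumption, not the actual snchwc.cpp.
#if defined(MLAS_TARGET_ARM64)
bool UsePointwiseAsmKernel(size_t StrideHeight, size_t StrideWidth,
                           size_t OutputThisIteration) {
    // The asm kernel tiles 4 outputs and assumes a contiguous, unit-stride
    // output region, so everything else falls back to the GEMM path.
    return StrideHeight == 1 && StrideWidth == 1 && OutputThisIteration >= 4;
}

bool UseDepthwiseAsmKernel(size_t OutputWidth) {
    // The vectorised depthwise path needs enough output width to fill its
    // lanes; narrow rows stay on the existing intrinsics path.
    return OutputWidth >= 4;
}
#endif
```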

Performance

Numbers below are expressed as multipliers vs the non‑NCHWC baseline (same model and perf_test settings):

Baseline (no --enable_arm_neon_nchwc)

  • 8 cores: 1.00×
  • 16 cores: 1.00×

With --enable_arm_neon_nchwc (no asm additions/heuristics)

  • 8 cores: 1.18×
  • 16 cores: 1.24×

With this PR (asm kernels + heuristics)

  • 8 cores: 1.77×
  • 16 cores: 2.54×

Testing

  • ./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc
  • OMP_NUM_THREADS=8 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 8 ~/mobilenetv2-7.onnx

@aviralagrawal

Interesting contribution - thank you!

A few questions -

  1. Pointwise convolution is currently implemented via direct GEMM, which I assume is optimized. How does this kernel beat the performance of GEMM?
  2. Can you share a link to the MobileNet model that you used for performance benchmarking?
  3. How does it perform in single-threaded experiments? Afaik, the original NCHWc kernels in #25580 (NEON kernels for NCHWc Convolution and Pooling) suffered in the single-threaded setting but outperformed the default at thread counts > 8.

@milpuz01
Contributor Author

Hi @aviralagrawal, thank you very much for your prompt feedback.

> 1. Pointwise convolution is currently implemented via direct GEMM, which I assume is optimized. How does this kernel beat the performance of GEMM?

Compared to the direct GEMM implementation of pointwise convolution, the asm kernel computes the 1x1 conv directly:

  • it explicitly tiles 4 outputs: up to 4 output positions are computed in parallel and filter loads are reused across those outputs, so with a single load we can accumulate into 4 outputs, whereas direct GEMM doesn't tile multiple outputs together (see the sketch below)
  • it fuses accumulate/bias/ReLU into the store path, instead of the separate passes used with direct GEMM
  • it unrolls the block size explicitly with 16 invocations to keep accumulators in registers and minimise loop overheads, reducing dispatch/parameter overhead and output read-modify-write passes compared to direct GEMM
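
To illustrate the first bullet, here is a minimal NEON intrinsics sketch of the 4-output tiling idea (hypothetical names and data layout; the PR implements this in hand-written asm, not intrinsics):

```c++
#include <arm_neon.h>

// Hypothetical sketch of 4-output tiling with filter-load reuse; the real
// asm kernel's register allocation and memory layout differ.
void PointwiseTile4(const float* input, const float* filter,
                    float32x4_t acc[4], size_t input_stride)
{
    float32x4_t f = vld1q_f32(filter);  // one filter load ...
    for (int o = 0; o < 4; ++o) {
        // ... amortized across four output positions via fused multiply-add.
        float32x4_t x = vdupq_n_f32(input[o * input_stride]);
        acc[o] = vfmaq_f32(acc[o], f, x);
    }
}
```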

As usual there are trade-offs. Direct GEMM will be faster when the output count is small, because the asm kernel then drops to a single-output path that has less ILP and cannot reuse filter loads; when the stride is non-unit or the output region is non-contiguous (hence the heuristics checking stride width and height); and for very large K/M, where GEMM blocking can make better use of caches than a fixed 4-output tile.

This is best illustrated if we extract the pointwise convolutions from the MobileNet run above: on average the asm implementation is 1.07x faster, and the significant speed-ups occur when the number of channels is high and K/M are small (in the image those are the H and W dimensions). In convolution-heavy networks the dominant convolutions are the ones with a large number of channels and low height and width, so we see visible performance improvements, as the optimisations in this PR are weighted in that direction.

[image: per-shape speed-up of the asm pointwise kernel vs. direct GEMM]
> 2. Can you share a link to the MobileNet model that you used for performance benchmarking?

For benchmarking we used the model from: https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx

> 3. How does it perform in single-threaded experiments? Afaik, the original NCHWc kernels in #25580 (NEON kernels for NCHWc Convolution and Pooling) suffered in the single-threaded setting but outperformed the default at thread counts > 8.

Running OMP_NUM_THREADS=1 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 1 ~/mobilenetv2-7.onnx on Graviton 4, a binary built with --enable_arm_neon_nchwc runs at 0.89x the speed of a build without that flag (i.e., slower), while with this PR it is actually 1.25x faster than the baseline.

@Rohanjames1997
Contributor

Thanks @milpuz01 for the detailed description & comment!

A couple questions from my side:

  1. Is there a reason why ConvNchwcFloatKernel was not optimized? Afaik, it is not very different from ConvNchwFloatKernel; the x86 asm implementations of these two kernels differ only slightly too. It is a much heavier kernel than Pointwise and Depthwise, and many larger Conv models stress this kernel. An example of this type of model is in this comment: NEON kernels for NCHWc Convolution and Pooling #25580 (comment).

  2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.

@milpuz01
Contributor Author

Hi @Rohanjames1997, thank you very much for your comments.

> 1. Is there a reason why ConvNchwcFloatKernel was not optimized?

No particular reason; mostly because the focus for this PR was on the MobileNet model, plus lack of bandwidth. Thank you for sharing the model where ConvNchwcFloatKernel is invoked. We will take a look at optimising it too, but I would suggest that we add that optimisation in a follow-up PR so that we do not overload this PR with too many changes to review.

> 2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.

Yes, I think that is a great idea, and it would be interesting to hear from @hariharans29 too what other testing we should do to try to make these kernels the default. As you can see above, this change is not going to accelerate all possible pointwise convolutions, for example, but on average it will show improvements, so if we could agree on a set of performance targets we could use that to drive the decision.

Also, thank you for your code review comments; I will address them in a separate commit.

@hariharans29
Member

hariharans29 commented Jan 23, 2026

> Hi @Rohanjames1997, thank you very much for your comments.
>
> > 1. Is there a reason why ConvNchwcFloatKernel was not optimized?
>
> No particular reason; mostly because the focus for this PR was on the MobileNet model, plus lack of bandwidth. Thank you for sharing the model where ConvNchwcFloatKernel is invoked. We will take a look at optimising it too, but I would suggest that we add that optimisation in a follow-up PR so that we do not overload this PR with too many changes to review.
>
> > 2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.
>
> Yes, I think that is a great idea, and it would be interesting to hear from @hariharans29 too what other testing we should do to try to make these kernels the default. As you can see above, this change is not going to accelerate all possible pointwise convolutions, for example, but on average it will show improvements, so if we could agree on a set of performance targets we could use that to drive the decision.
>
> Also, thank you for your code review comments; I will address them in a separate commit.

Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.

Thanks for this contribution, we will review it shortly!

@milpuz01
Contributor Author

> Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.

Thanks @hariharans29. I agree with erring on the side of caution. If this PR goes through and makes it into a main release, is it possible to add a note that we would like to make --enable_arm_neon_nchwc the default in future releases, so that we can try to get some feedback via that route too? Thanks again.

@hariharans29
Member

hariharans29 commented Jan 26, 2026

> > Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.
>
> Thanks @hariharans29. I agree with erring on the side of caution. If this PR goes through and makes it into a main release, is it possible to add a note that we would like to make --enable_arm_neon_nchwc the default in future releases, so that we can try to get some feedback via that route too? Thanks again.

Thanks @milpuz01. The PR should go through in main eventually, but I don't think it will make 1.24.0, unfortunately, as the release branch is cut and the bar for taking in new code at this point is critical bug fixes and urgent customer asks only. I will try to take this in for 1.24.1 when it happens, and sure, I will add a note about considering making it default in one of the future releases. Ultimately, though, as discussed in the comment #27099 (comment), I expect the NchwcFloatKernel needs optimizations before considering that.

Contributor

Copilot AI left a comment


Pull request overview

Adds new AArch64 NEON assembly micro-kernels for NCHW, depthwise NCHWc, and pointwise NCHWc convolution, integrates them into the MLAS build, and updates NCHWc kernel-selection heuristics to prefer the asm kernels in selected shapes.

Changes:

  • Add new AArch64 .S convolution micro-kernels (NCHW, depthwise NCHWc, pointwise NCHWc) and wire them into the MLAS build.
  • Update ARM64 platform init and NCHWc execution heuristics to select asm kernels for pointwise (stride-1, larger tiles) and depthwise (wider outputs).
  • Remove the old intrinsics wrapper for the NCHW float kernel in the NCHWc NEON source file.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| cmake/onnxruntime_mlas.cmake | Adds new AArch64 asm sources to the ARM NEON NCHWc MLAS build setup. |
| onnxruntime/core/mlas/lib/snchwc.cpp | Adds ARM64 heuristics to prefer asm depthwise/pointwise kernels in “safe” cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the old NCHW float kernel wrapper implementation from the NCHWc NEON source file. |
| onnxruntime/core/mlas/lib/platform.cpp | Switches the ARM64 NCHW conv kernel default to asm; updates commentary around kernel choices. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new asm kernel entry points for ARM64 NEON NCHWc. |
| onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S | Adds the new NCHW convolution asm micro-kernel. |
| onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S | Adds the new depthwise NCHWc asm micro-kernel (fast/slow path for padding). |
| onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S | Adds the new pointwise NCHWc asm micro-kernel (multi-output reuse). |


titaiwangms and others added 11 commits February 4, 2026 09:57
Fix microsoft#27125 

It does fix the build issue on Linux, but I am not entirely sure whether
this is the optimal fix.
### Description
Models with corresponding Olive recipes are deprecated.


### Motivation and Context
Olive and olive-recipes are the entry point for model optimization. We want onnxruntime to be only for runtime, so we are deprecating examples that are already present in olive-recipes.
…t#27134)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23.

Commits:
  • dec55b7 Bump main to v4.17.23 (#6088)
  • 19c9251 fix: setCacheHas JSDoc return type should be boolean (#6071)
  • b5e6729 jsdoc: Add -0 and BigInt zeros to _.compact falsey values list (#6062)
  • edadd45 Prevent prototype pollution on baseUnset function
  • 4879a7a doc: fix autoLink function, conversion of source links (#6056)
  • 9648f69 chore: remove yarn.lock file (#6053)
  • dfa407d ci: remove legacy configuration files (#6052)
  • 156e196 feat: add renovate setup (#6039)
  • 933e106 ci: add pipeline for Bun (#6023)
  • 072a807 docs: update links related to Open JS Foundation (#5968)
  • Additional commits viewable in the compare view: https://github.com/lodash/lodash/compare/4.17.21...4.17.23


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=lodash&package-manager=npm_and_yarn&previous-version=4.17.21&new-version=4.17.23)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

### Description
Enables the file mapping of weights as well as the overall context bin. This feature is currently only enabled for ARM64 Windows devices.

### Motivation and Context
Currently, when reading the context bin, ORT allocates a large buffer on
the heap. Assuming the same model is used, each ORT session will
allocate a buffer for the context bin. This is incredibly wasteful when
large models are used. Instead, Windows file mapping can be leveraged to map
the context bin, then every time a context needs to be created with the
context bin, the pointer to the context bin can be retrieved and used
instead of some pre-allocated buffer, thus making QNN EP more
memory-efficient. In the case of multiple ORT sessions, the context bin
will only be loaded once for all sessions, increasing memory efficiency
and overall initialization performance. This is very useful regarding
the use of LLMs going forward.
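
As background, here is a generic Win32 file-mapping pattern of the kind described (an illustrative sketch only, not the QNN EP's actual code):

```c++
#include <windows.h>

// Illustrative Win32 pattern: map the context binary once and hand out a
// read-only view instead of copying it into a per-session heap buffer.
const void* MapContextBin(const wchar_t* path, HANDLE* mapping_out) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    CloseHandle(file);  // the mapping object keeps the file alive
    if (mapping == nullptr) return nullptr;
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view == nullptr) { CloseHandle(mapping); return nullptr; }
    *mapping_out = mapping;  // caller later: UnmapViewOfFile + CloseHandle
    return view;
}
```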

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
…spec (microsoft#27164)

I missed the operator since it didn't have the corresponding tests at
the time.
With onnx/onnx#7618, the disabled test should be
able to pass.

---

This pull request updates the ONNX Runtime CPU execution provider to add
support for the `LpNormalization` operator for opset version 22, in
addition to clarifying and correcting the registration for earlier
versions. It also updates the backend test filters to reflect this new
support.

**ONNX Operator Kernel Registration:**

* Added new kernel registrations for `LpNormalization` with opset version 22 for both `float` and `double` data types in `cpu_execution_provider.cc`.
* Updated the registration for `LpNormalization` for opset versions 1 through 21 to use the correct versioned kernel macro, ensuring correct kernel selection and compatibility (a sketch of the pattern follows below).
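
A hedged sketch of the registration pattern described above (the exact macro arguments in `cpu_execution_provider.cc` may differ):

```c++
// Hedged sketch, not the literal diff: versioned registration covering
// opsets 1-21, plus a new entry starting at opset 22.
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kOnnxDomain, 1, 21, float, LpNormalization);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kOnnxDomain, 22, float, LpNormalization);
```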

**Test Filters Update:**

* Updated `onnx_backend_test_series_filters.jsonc` to remove the exclusion of `test_l1normalization*`, `test_lpnormalization*`, and `test_l2normalization*` now that `LpNormalization` opset 22 is implemented, and added a TODO comment referencing ONNX 1.21 for a known zero-norm issue.
…ft#27151)

### Description
Previously, `MatMulReadFnSource()` used duplicated code to read data from the two inputs `a` and `b`. This patch implements another overload of `MatMulReadFnSource()` that reads data from only one input, to reduce duplicated code and get ready for further use.
…crosoft#27179)

## Problem Description
The `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` test
was exhibiting flaky behavior (failure rate ~2-20%) with numerical
mismatches.
Investigation revealed a **race condition** in the
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step within
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328).

When the batch size `M > 1`,
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328)
attempted to parallelize the LUT generation over the batch dimension
using `MlasTrySimpleParallel`. However, the underlying
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
implementation (specifically shared usage of `lut_scales`/`lut_biases`
or internal buffers) is not thread-safe for concurrent execution on the
same destination buffers or related state. This led to corruption of the
Look-Up Tables or scales, causing random output errors.

## Solution
This PR modifies
[onnxruntime/core/mlas/lib/qlutgemm.cpp](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/qlutgemm.cpp)
to **serialize the `GenerateLUT` loop**.
Instead of using `MlasTrySimpleParallel`, we now use a simple `for` loop
to process each row of the batch sequentially.
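
A minimal sketch of the change, with hypothetical parameter names (the real `GenerateLUT` signature differs):

```c++
// Before: parallel over the batch dimension, racy because GenerateLUT
// shares lut_scales / lut_biases state across lanes.
//
//   MlasTrySimpleParallel(ThreadPool, M, [&](ptrdiff_t m) {
//       GenerateLUT(A + m * lda, LutBuffer, lut_scales, lut_biases, K);
//   });

// After: serialize the lightweight LUT-generation step.
for (size_t m = 0; m < M; ++m) {
    GenerateLUT(A + m * lda, LutBuffer, lut_scales, lut_biases, K);
}
```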

**Performance Impact:**
The
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step is computationally lightweight compared to the subsequent
[TMACComputeGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L505)
matrix multiplication. Serializing this setup step has negligible impact
on overall inference latency (micro-benchmarks showed no measurable
regression), but effectively eliminates the race condition.

## Verification
* **Reproduction:** The issue was reliably reproduced by running
`MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` in a loop
(failing ~1 in 5 times).
* **Verification:** After applying the fix, the same test passed **50/50
iterations** consistently.
* **Regression Testing:** Standard `MatMulNBitsLutGemm` tests (including
`BlkLen64` and `M=1` cases) continue to pass.
Bumps [tar](https://github.com/isaacs/node-tar) to 7.5.7 and updates
ancestor dependency [cmake-js](https://github.com/cmake-js/cmake-js).
These dependencies need to be updated together.

Updates `tar` from 6.2.1 to 7.5.7
Changelog (sourced from tar's CHANGELOG.md):

7.5
  • Added `zstd` compression support.
  • Consistent TOCTOU behavior in sync t.list
  • Only read from ustar block if not specified in Pax
  • Fix sync tar.list when file size reduces while reading
  • Sanitize absolute linkpaths properly
  • Prevent writing hardlink entries to the archive ahead of their file target

7.4
  • Deprecate `onentry` in favor of `onReadEntry` for clarity.

7.3
  • Add `onWriteEntry` option

7.2
  • DRY the command definitions into a single `makeCommand` method, and update the type signatures to more appropriately infer the return type from the options and arguments provided.

7.1
  • Update minipass to v7.1.0
  • Update the type definitions of `write()` and `end()` methods on `Unpack` and `Parser` classes to be compatible with the NodeJS.WritableStream type in the latest versions of `@types/node`.

7.0
  • Drop support for node <18
  • Rewrite in TypeScript, provide ESM and CommonJS hybrid interface
  • Add tree-shake friendly exports, like `import('tar/create')` and `import('tar/read-entry')`, to get individual functions or classes.
  • Add `chmod` option that defaults to false, and deprecate `noChmod`. That is, reverse the default option regarding explicitly setting file system modes to match tar entry settings.
  • Add `processUmask` option to avoid having to call `process.umask()` when `chmod: true` (or `noChmod: false`) is set.

... (truncated)
Commits:
  • 4a37eb9 7.5.7
  • f4a7aa9 fix: properly sanitize hard links containing ..
  • 394ece6 7.5.6
  • 7d4cc17 fix race puting a Link ahead of its target File
  • 26ab904 7.5.5
  • e9a1ddb fix: do not prevent valid linkpaths within archive
  • 911c886 7.5.4
  • 3b1abfa normalize out unicode ligatures
  • a43478c remove some unused files
  • 970c58f update deps
  • Additional commits viewable in the compare view: https://github.com/isaacs/node-tar/compare/v6.2.1...v7.5.7

Maintainer changes: this version was pushed to npm by isaacs (https://www.npmjs.com/~isaacs), a new releaser for tar since your current version.

Updates `cmake-js` from 7.2.1 to 8.0.0
Release notes (sourced from cmake-js's releases):

v8.0.0
This is a small but breaking change. It now requires Node.js 20 or later, due to increased requirements of updated dependencies. With the increased minimum, it now uses the builtin fetch, which further reduces the install size!
Full Changelog: https://github.com/cmake-js/cmake-js/compare/v7.4.0...v8.0.0
Changelog (sourced from cmake-js's changelog.md):

v8.0.0 - 27/01/26
  • feat: require nodejs 20 or later
  • feat: update deprecated dependencies

v7.4.0 - 14/11/25
  • feat(windows): support msvc 2026 (Thanks to @Norgerkaj)

v7.3.1 - 17/04/25
  • fix(windows): support windows arm64 (Thanks to @jaycex)
  • fix(windows): support newer visual studio installations

v7.3.0 - 15/01/24
  • feat(windows): replace custom libnode.def generation with version from node-api-headers
  • fix: support for vs2015 with nodejs 18 and older (#317)
  • fix(windows): always remove Path if PATH is also defined (#319)
  • fix: Cmake arguments got converted to numbers (#314)
  • fix: update node-api-headers
  • chore: update dependencies
Commits:
  • a2c3713 chore: v8.0.0
  • 33c03a9 chore: fix ci
  • 3bf03be chore: fix ci
  • ab5e651 chore(deps): bump actions/checkout from 5 to 6 (#358)
  • 818bece fix: replace npmlog with simple inline logger
  • 0b3a840 feat!: replace axios with fetch
  • c429d9e feat!: require nodejs 20
  • a5fe3c2 v7.4.0
  • 4ab302a feat(windows): add visual studio 2026 support (#357)
  • 2d0abc4 chore: fix readme typo (#353)
  • Additional commits viewable in the compare view: https://github.com/cmake-js/cmake-js/compare/v7.2.1...v8.0.0


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Adds a C/C++ API named `GetTensorElementTypeAndShapeDataReference` that returns an OrtValue tensor's shape and type without allocating a new buffer for the shape data.



### Motivation and Context
This new API function can be used instead of `OrtApi::GetTypeInfo()` or `OrtApi::GetTensorTypeAndShape` to decrease the number of heap allocations and thus improve inference latency for plugin EP kernels that frequently retrieve tensor shapes during inference (e.g., the WebGPU plugin EP).
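
A hedged usage sketch; the parameter list below is an assumption inferred from the description (element type plus a non-owning pointer to the shape data), not the actual header:

```c++
// Hypothetical signature; consult the real onnxruntime_c_api.h.
// Assumes `api` is the OrtApi* and `value` is a tensor OrtValue*.
ONNXTensorElementDataType elem_type;
const int64_t* shape_data = nullptr;  // non-owning view, no new allocation
size_t shape_len = 0;
Ort::ThrowOnError(api->GetTensorElementTypeAndShapeDataReference(
    value, &elem_type, &shape_data, &shape_len));
```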
tianleiwu and others added 10 commits February 4, 2026 09:57
…microsoft#27157)

Replaces the deprecated pkg_resources library with importlib.metadata to
fix ModuleNotFoundError.
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23 (Dependabot description identical to the earlier lodash bump above).

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oft#27106)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23 (Dependabot description identical to the earlier lodash bump above).

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…27195)

### Description
Fixes C++ documentation generation by replacing `<` and `>` with `[` and
`]`. Angle brackets are mistaken as html tags.

Successful run:
https://github.com/microsoft/onnxruntime/actions/runs/21456738258

### Motivation and Context
Allow C++ document generation to succeed.
…rosoft#27174)

## Problem Description
The `MatMulNBitsLutGemm` test suite, specifically
`Float32_2Bits_Symmetric_256x256_BlkLen64`, was exhibiting intermittent
failures (flakiness).
The failure manifested as numerical mismatches exceeding the tolerance,
suggesting non-deterministic behavior in the kernel execution.

## Root Cause Analysis
The issue was traced to the usage of `_mm256_i32gather_ps` in
`sqnbitgemm_lut_kernel_avx2.cpp`.
While the gather indices were technically calculating addresses within
the bounds of the allocated buffer, gather instructions on certain AVX2
hardware implementations can exhibit non-deterministic behavior or
subtle performance/prefetching artifacts when operating on specific
stride patterns (in this case, gathering with a stride of 4 floats).

## Solution
This PR replaces the `_mm256_i32gather_ps` instruction with a sequence
of **contiguous loads (`_mm256_loadu_ps`) followed by deterministic
shuffles**.

### How it works:
1. **Contiguous Load**: We load 4 contiguous vectors of 8 float elements
using `_mm256_loadu_ps`. This is always memory-safe and deterministic.
2. **Deterministic Shuffle**: We apply a verified sequence of `unpack` and `permutevar8x32` instructions to rearrange these 32 linearly loaded elements into the exact same stride-4 layout that the gather instruction produced (a sketch of one such sequence follows below).
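
For illustration, a hedged sketch of one possible load+shuffle sequence (this one uses `permutevar8x32` plus blends; the PR's exact unpack-based sequence differs):

```c++
#include <immintrin.h>

// Hypothetical illustration: emulate
//   _mm256_i32gather_ps(base, _mm256_setr_epi32(0,4,8,12,16,20,24,28), 4)
// with four contiguous loads plus deterministic shuffles.
static inline __m256 GatherStride4Emulated(const float* base) {
    __m256 v0 = _mm256_loadu_ps(base + 0);    // base[0..7]
    __m256 v1 = _mm256_loadu_ps(base + 8);    // base[8..15]
    __m256 v2 = _mm256_loadu_ps(base + 16);   // base[16..23]
    __m256 v3 = _mm256_loadu_ps(base + 24);   // base[24..31]
    // Replicate each vector's lanes 0 and 4 across every lane pair.
    __m256i idx = _mm256_setr_epi32(0, 4, 0, 4, 0, 4, 0, 4);
    __m256 p0 = _mm256_permutevar8x32_ps(v0, idx);
    __m256 p1 = _mm256_permutevar8x32_ps(v1, idx);
    __m256 p2 = _mm256_permutevar8x32_ps(v2, idx);
    __m256 p3 = _mm256_permutevar8x32_ps(v3, idx);
    __m256 lo = _mm256_blend_ps(p0, p1, 0b00001100);  // {b0,b4,b8,b12,...}
    __m256 hi = _mm256_blend_ps(p2, p3, 0b11000000);  // {...,b16,b20,b24,b28}
    return _mm256_blend_ps(lo, hi, 0b11110000);       // stride-4 gather layout
}
```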

### Benefits:
* **Stability**: Eliminates the hardware-dependent non-determinism of
gather.
* **Safety**: Usage of `loadu` guarantees we only touch memory within
the explicit range of the 32 elements we intend to load.
* **Correctness**: The shuffle logic was verified against the reference
gather behavior using a C++ reproduction script to ensure bit-exact
layout equivalence.

### Performance

Micro-benchmark on MatMulNBitsLutGemm (256x256, BlkLen=64).
Original (Gather): ~55.55 us
Fixed (Load+Shuffle): ~57.79 us
Delta: +2.24 us (~4% slower)

The slight performance regression is expected because replacing a single
hardware gather instruction with a sequence of loadu, unpack, and
permute instructions adds instruction count overhead. However, this is a
necessary tradeoff to ensure deterministic behavior and memory safety
across all AVX2 implementations.

## Verification
* **Tests**: All 9 tests in `MatMulNBitsLutGemm` passed successfully
(including the previously flaky `BlkLen64` case).
…osoft#27120)

Description
Conditionally disable linking of cpuinfo for
onnxruntime_runtime_path_test_shared_library on targets, where cpuinfo
is not supported.

Motivation and Context
Recent changes enabling onnxruntime_autoep_test and related shared
library tests on non-Windows platforms exposed a transitive dependency
issue. cpuinfo was being linked unconditionally on Linux, leading to
linker failures on ppc64le (cannot find -lcpuinfo).

Solution
Add CPUINFO_SUPPORTED guards to exclude cpuinfo from the link list while
preserving existing behavior.
From code review, the logic of the interleaved NEON kernel is not correct:

1.  **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:

    ```c++
    size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
    std::vector<float> sin_data(table_len);
    std::vector<float> cos_data(table_len);
    ```

For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.

2.  **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i / 2`:

    ```c++
    __m256 sin_val = _mm256_loadu_ps(sin_data + i / 2);
    __m256 cos_val = _mm256_loadu_ps(cos_data + i / 2);
    ```

This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**

3.  **NEON (fp16) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i`:

    ```c++
    // Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
    //...
    // Inside loop, for next iteration:
    sin_val = MlasLoadFloat16x8(sin + i + 16); 
    ```

    This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**

### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```

Before applying the fix, the test failed:
```
[  FAILED  ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
  Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, test passed.

### Summary

The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.

The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
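
A minimal sketch of the consistent indexing, with a hypothetical loop shape (the real kernel pipelines its loads across iterations):

```c++
// Hypothetical loop shape; only the i/2 table indexing is the point.
// For interleaved RoPE the sin/cos tables hold rotary_emb_dim / 2 entries,
// so 16 interleaved fp16 elements consume 8 table entries.
for (size_t i = 0; i + 16 <= rotary_emb_dim; i += 16) {
    auto sin_val = MlasLoadFloat16x8(sin + i / 2);
    auto cos_val = MlasLoadFloat16x8(cos + i / 2);
    // ... rotate input[i .. i+15] using these eight (sin, cos) pairs ...
}
```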

---------

Co-authored-by: Copilot <[email protected]>
…microsoft#27207)

### Description

This PR renames the following existing names for MemoryInfo:

- `WebGPU_Buffer` -> `WebGPU_Buf`
- `WebNN_Tensor` -> `WebNN_Ten`

### Motivation and Context

The `OrtMemoryInfo` uses a `std::string` to store the name. Modern C++
compilers use "small string optimization" (SSO) to avoid an extra
memory allocation if the string is small enough.

While different compilers may have different implementations, the
following test program can be used to find the exact limit for a given
compiler:

```c++
#include <string>
#include <cstdio>

int main() {
  std::string webgpu0 = "WebGPU_Buf";
  std::string webgpu1 = "WebGPU_Buff";
  std::string webgpu2 = "WebGPU_Buffe";
  std::string webgpu3 = "WebGPU_Buffer";

  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu0.c_str(), (void*)&webgpu0, (void*)webgpu0.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu1.c_str(), (void*)&webgpu1, (void*)webgpu1.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu2.c_str(), (void*)&webgpu2, (void*)webgpu2.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu3.c_str(), (void*)&webgpu3, (void*)webgpu3.data());

  return 0;
}
```

Using Emscripten (targeting wasm32), the runtime result looks like this:
```
=========== WebGPU_Buf
 string address: 0x10db0
 data address  : 0x10db0

=========== WebGPU_Buff
 string address: 0x10da4
 data address  : 0x10dc8

=========== WebGPU_Buffe
 string address: 0x10d98
 data address  : 0x10de0

=========== WebGPU_Buffer
 string address: 0x10d8c
 data address  : 0x10df8
```

This shows that the string needs to be no more than 10 bytes (excluding
the trailing '\0') to enable SSO.
