
Conversation

@milpuz01
Contributor

Overview

This PR adds ARM64 NEON assembly micro‑kernels for NCHW, depthwise, and pointwise convolution, wires them into the MLAS build, and adds shape‑based selection heuristics for NCHWC depthwise/pointwise to favor the asm kernels in safe cases (stride‑1 pointwise; wider depthwise outputs). The BF16 path is unchanged.

Key changes

  • cmake/onnxruntime_mlas.cmake
    • Add new AArch64 assembly sources for NCHW, depthwise, and pointwise conv to the MLAS build.
  • onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S
    • New vectorised NCHW convolution micro‑kernel.
  • onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S
    • New vectorised depthwise micro‑kernel (fast path for in‑bounds loads, slow path for padding).
  • onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S
    • New vectorised pointwise micro‑kernel (multi‑output reuse).
  • onnxruntime/core/mlas/lib/mlasi.h, onnxruntime/core/mlas/lib/platform.cpp
    • Declare/register new asm kernels and prefer them on ARM64.
  • onnxruntime/core/mlas/lib/snchwc.cpp
    • Heuristics: use the pointwise asm kernel when `StrideHeight == 1 && StrideWidth == 1` and `OutputThisIteration >= 4`; use the depthwise asm kernel when `OutputWidth >= 4` (see the sketch after this list).
  • onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp
    • Include fix for the conv kernel flags header.
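
A minimal sketch of the selection logic described above, assuming hypothetical function boundaries (the real checks presumably sit inline in snchwc.cpp's NCHWC algorithm classes; the predicate names below are illustrative, not the actual code):

```c++
// Hypothetical sketch; the field names follow the PR description, but the
// surrounding structure is an assumption, not the actual snchwc.cpp.
#if defined(MLAS_TARGET_ARM64)
bool UsePointwiseAsmKernel(size_t StrideHeight, size_t StrideWidth,
                           size_t OutputThisIteration) {
    // The asm kernel tiles 4 outputs and assumes a contiguous, unit-stride
    // output region, so everything else falls back to the GEMM path.
    return StrideHeight == 1 && StrideWidth == 1 && OutputThisIteration >= 4;
}

bool UseDepthwiseAsmKernel(size_t OutputWidth) {
    // The vectorised depthwise path needs enough output width to fill its
    // lanes; narrow rows stay on the existing intrinsics path.
    return OutputWidth >= 4;
}
#endif
```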

Performance

Numbers below are expressed as multipliers vs the non‑NCHWC baseline (same model and perf_test settings):

Baseline (no --enable_arm_neon_nchwc)

  • 8 cores: 1.00×
  • 16 cores: 1.00×

With --enable_arm_neon_nchwc (no asm additions/heuristics)

  • 8 cores: 1.18×
  • 16 cores: 1.24×

With this PR (asm kernels + heuristics)

  • 8 cores: 1.77×
  • 16 cores: 2.54×

Testing

  • ./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc
  • OMP_NUM_THREADS=8 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 8 ~/mobilenetv2-7.onnx

@aviralagrawal

Interesting contribution - thank you!

A few questions -

  1. Pointwise convolution is currently implemented via direct GEMM, which I assume is optimized. How does this kernel beat the performance of GEMM?
  2. Can you share a link to the MobileNet model that you used for performance benchmarking?
  3. How does it perform in single-threaded experiments? Afaik, the original NCHWc kernels in #25580 (NEON kernels for NCHWc Convolution and Pooling) suffered in the single-threaded setting but outperformed the default at thread counts > 8.

@milpuz01
Contributor Author

Hi @aviralagrawal, thank you very much for your prompt feedback.

> 1. Pointwise convolution is currently implemented via direct GEMM, which I assume is optimized. How does this kernel beat the performance of GEMM?

Compared to the direct GEMM implementation of pointwise convolution, the asm kernel computes the 1x1 conv directly:

  • it explicitly tiles 4 outputs: up to 4 output positions are computed in parallel and filter loads are reused across those outputs, so with a single load we can accumulate into 4 outputs, whereas direct GEMM doesn't tile multiple outputs together (see the sketch below)
  • it fuses accumulate/bias/ReLU into the store path, instead of the separate passes used with direct GEMM
  • it unrolls the block size explicitly with 16 invocations to keep accumulators in registers and minimise loop overheads, reducing dispatch/parameter overhead and output read-modify-write passes compared to direct GEMM
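
To illustrate the first bullet, here is a minimal NEON intrinsics sketch of the 4-output tiling idea (hypothetical names and data layout; the PR implements this in hand-written asm, not intrinsics):

```c++
#include <arm_neon.h>

// Hypothetical sketch of 4-output tiling with filter-load reuse; the real
// asm kernel's register allocation and memory layout differ.
void PointwiseTile4(const float* input, const float* filter,
                    float32x4_t acc[4], size_t input_stride)
{
    float32x4_t f = vld1q_f32(filter);  // one filter load ...
    for (int o = 0; o < 4; ++o) {
        // ... amortized across four output positions via fused multiply-add.
        float32x4_t x = vdupq_n_f32(input[o * input_stride]);
        acc[o] = vfmaq_f32(acc[o], f, x);
    }
}
```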

As usual there are trade-offs. Direct GEMM will be faster when the output count is small, because the asm kernel then drops to a single-output path that has less ILP and cannot reuse filter loads; when the stride is non-unit or the output region is non-contiguous (hence the heuristics checking stride width and height); and for very large K/M, where GEMM blocking can make better use of caches than a fixed 4-output tile.

This is best illustrated if we extract the pointwise convolutions from the MobileNet run above: on average the asm implementation is 1.07x faster, and the significant speed-ups occur when the number of channels is high and K/M are small (in the image those are the H and W dimensions). In convolution-heavy networks the dominant convolutions are the ones with a large number of channels and low height and width, so we see visible performance improvements, as the optimisations in this PR are weighted in that direction.

[image: per-shape speed-up of the asm pointwise kernel vs. direct GEMM]
> 2. Can you share a link to the MobileNet model that you used for performance benchmarking?

For benchmarking we used the model from: https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx

> 3. How does it perform in single-threaded experiments? Afaik, the original NCHWc kernels in #25580 (NEON kernels for NCHWc Convolution and Pooling) suffered in the single-threaded setting but outperformed the default at thread counts > 8.

Running OMP_NUM_THREADS=1 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 1 ~/mobilenetv2-7.onnx on Graviton 4, a binary built with --enable_arm_neon_nchwc runs at 0.89x the speed of a build without that flag (i.e., slower), while with this PR it is actually 1.25x faster than the baseline.

@Rohanjames1997
Contributor

Thanks @milpuz01 for the detailed description & comment!

A couple questions from my side:

  1. Is there a reason why ConvNchwcFloatKernel was not optimized? Afaik, it is not very different from ConvNchwFloatKernel; the x86 asm implementations of these two kernels differ only slightly too. It is a much heavier kernel than Pointwise and Depthwise, and many larger Conv models stress this kernel. An example of this type of model is in this comment: NEON kernels for NCHWc Convolution and Pooling #25580 (comment).

  2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.

@milpuz01
Contributor Author

Hi @Rohanjames1997, thank you very much for your comments.

> 1. Is there a reason why ConvNchwcFloatKernel was not optimized?

No particular reason; mostly because the focus for this PR was on the MobileNet model, plus lack of bandwidth. Thank you for sharing the model where ConvNchwcFloatKernel is invoked. We will take a look at optimising it too, but I would suggest that we add that optimisation in a follow-up PR so that we do not overload this PR with too many changes to review.

> 2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.

Yes, I think that is a great idea, and it would be interesting to hear from @hariharans29 too what other testing we should do to try to make these kernels the default. As you can see above, this change is not going to accelerate all possible pointwise convolutions, for example, but on average it will show improvements, so if we could agree on a set of performance targets we could use that to drive the decision.

Also, thank you for your code review comments; I will address them in a separate commit.

@hariharans29
Member

hariharans29 commented Jan 23, 2026

> Hi @Rohanjames1997, thank you very much for your comments.
>
> > 1. Is there a reason why ConvNchwcFloatKernel was not optimized?
>
> No particular reason; mostly because the focus for this PR was on the MobileNet model, plus lack of bandwidth. Thank you for sharing the model where ConvNchwcFloatKernel is invoked. We will take a look at optimising it too, but I would suggest that we add that optimisation in a follow-up PR so that we do not overload this PR with too many changes to review.
>
> > 2. Can we switch the default path of Fp32 Conv on Arm64 to use these new kernels (effectively voiding --enable_arm_neon_nchwc like it was before)? Asking because this PR improves upon the single-threaded performance as well. I'd love to hear your thoughts, but it would also be wise to hear from @hariharans29 before implementing.
>
> Yes, I think that is a great idea, and it would be interesting to hear from @hariharans29 too what other testing we should do to try to make these kernels the default. As you can see above, this change is not going to accelerate all possible pointwise convolutions, for example, but on average it will show improvements, so if we could agree on a set of performance targets we could use that to drive the decision.
>
> Also, thank you for your code review comments; I will address them in a separate commit.

Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.

Thanks for this contribution, we will review it shortly!

@milpuz01
Contributor Author

> Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.

Thanks @hariharans29. I agree with erring on the side of caution. If this PR goes through and makes it into a main release, is it possible to add a note that we would like to make --enable_arm_neon_nchwc the default in future releases, so that we can try to get some feedback via that route too? Thanks again.

@hariharans29
Member

hariharans29 commented Jan 26, 2026

> > Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on ARM on whether they see speed-ups for their models with this feature. It would provide greater confidence and a strong data point for turning it on by default.
>
> Thanks @hariharans29. I agree with erring on the side of caution. If this PR goes through and makes it into a main release, is it possible to add a note that we would like to make --enable_arm_neon_nchwc the default in future releases, so that we can try to get some feedback via that route too? Thanks again.

Thanks @milpuz01. The PR should go through in main eventually, but I don't think it will make 1.24.0, unfortunately, as the release branch is cut and the bar for taking in new code at this point is critical bug fixes and urgent customer asks only. I will try to take this in for 1.24.1 when it happens, and sure, I will add a note about considering making it default in one of the future releases. Ultimately, though, as discussed in the comment #27099 (comment), I expect the NchwcFloatKernel needs optimizations before considering that.

Contributor

Copilot AI left a comment


Pull request overview

Adds new AArch64 NEON assembly micro-kernels for NCHW, depthwise NCHWc, and pointwise NCHWc convolution, integrates them into the MLAS build, and updates NCHWc kernel-selection heuristics to prefer the asm kernels in selected shapes.

Changes:

  • Add new AArch64 .S convolution micro-kernels (NCHW, depthwise NCHWc, pointwise NCHWc) and wire them into the MLAS build.
  • Update ARM64 platform init and NCHWc execution heuristics to select asm kernels for pointwise (stride-1, larger tiles) and depthwise (wider outputs).
  • Remove the old intrinsics wrapper for the NCHW float kernel in the NCHWc NEON source file.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| cmake/onnxruntime_mlas.cmake | Adds new AArch64 asm sources to the ARM NEON NCHWc MLAS build setup. |
| onnxruntime/core/mlas/lib/snchwc.cpp | Adds ARM64 heuristics to prefer asm depthwise/pointwise kernels in “safe” cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the old NCHW float kernel wrapper implementation from the NCHWc NEON source file. |
| onnxruntime/core/mlas/lib/platform.cpp | Switches the ARM64 NCHW conv kernel default to asm; updates commentary around kernel choices. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new asm kernel entry points for ARM64 NEON NCHWc. |
| onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S | Adds the new NCHW convolution asm micro-kernel. |
| onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S | Adds the new depthwise NCHWc asm micro-kernel (fast/slow path for padding). |
| onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S | Adds the new pointwise NCHWc asm micro-kernel (multi-output reuse). |


titaiwangms and others added 11 commits February 4, 2026 09:57
Fix microsoft#27125 

It does fix the build issue on Linux, but I am not entirely sure whether
this is the optimal fix.
### Description
Models with corresponding Olive recipes are deprecated.


### Motivation and Context
Olive and olive-recipes are the entry point for model optimization. We want onnxruntime to be only for runtime, so we are deprecating examples that are already present in olive-recipes.
…t#27134)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23.

Commits:
  • dec55b7 Bump main to v4.17.23 (#6088)
  • 19c9251 fix: setCacheHas JSDoc return type should be boolean (#6071)
  • b5e6729 jsdoc: Add -0 and BigInt zeros to _.compact falsey values list (#6062)
  • edadd45 Prevent prototype pollution on baseUnset function
  • 4879a7a doc: fix autoLink function, conversion of source links (#6056)
  • 9648f69 chore: remove yarn.lock file (#6053)
  • dfa407d ci: remove legacy configuration files (#6052)
  • 156e196 feat: add renovate setup (#6039)
  • 933e106 ci: add pipeline for Bun (#6023)
  • 072a807 docs: update links related to Open JS Foundation (#5968)
  • Additional commits viewable in the compare view: https://github.com/lodash/lodash/compare/4.17.21...4.17.23


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=lodash&package-manager=npm_and_yarn&previous-version=4.17.21&new-version=4.17.23)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
)

### Description
Enables the file mapping of weights as well as the overall context bin. This feature is currently only enabled for ARM64 Windows devices.

### Motivation and Context
Currently, when reading the context bin, ORT allocates a large buffer on
the heap. Assuming the same model is used, each ORT session will
allocate a buffer for the context bin. This is incredibly wasteful when
large models are used. Instead, Windows file mapping can be leveraged to map
the context bin, then every time a context needs to be created with the
context bin, the pointer to the context bin can be retrieved and used
instead of some pre-allocated buffer, thus making QNN EP more
memory-efficient. In the case of multiple ORT sessions, the context bin
will only be loaded once for all sessions, increasing memory efficiency
and overall initialization performance. This is very useful regarding
the use of LLMs going forward.
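
As background, here is a generic Win32 file-mapping pattern of the kind described (an illustrative sketch only, not the QNN EP's actual code):

```c++
#include <windows.h>

// Illustrative Win32 pattern: map the context binary once and hand out a
// read-only view instead of copying it into a per-session heap buffer.
const void* MapContextBin(const wchar_t* path, HANDLE* mapping_out) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    CloseHandle(file);  // the mapping object keeps the file alive
    if (mapping == nullptr) return nullptr;
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view == nullptr) { CloseHandle(mapping); return nullptr; }
    *mapping_out = mapping;  // caller later: UnmapViewOfFile + CloseHandle
    return view;
}
```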

---------

Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
…spec (microsoft#27164)

I missed the operator since it didn't have the corresponding tests at
the time.
With onnx/onnx#7618, the disabled test should be
able to pass.

---

This pull request updates the ONNX Runtime CPU execution provider to add
support for the `LpNormalization` operator for opset version 22, in
addition to clarifying and correcting the registration for earlier
versions. It also updates the backend test filters to reflect this new
support.

**ONNX Operator Kernel Registration:**

* Added new kernel registrations for `LpNormalization` with opset version 22 for both `float` and `double` data types in `cpu_execution_provider.cc`.
* Updated the registration for `LpNormalization` for opset versions 1 through 21 to use the correct versioned kernel macro, ensuring correct kernel selection and compatibility (a sketch of the pattern follows below).
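
A hedged sketch of the registration pattern described above (the exact macro arguments in `cpu_execution_provider.cc` may differ):

```c++
// Hedged sketch, not the literal diff: versioned registration covering
// opsets 1-21, plus a new entry starting at opset 22.
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kOnnxDomain, 1, 21, float, LpNormalization);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(
    kCpuExecutionProvider, kOnnxDomain, 22, float, LpNormalization);
```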

**Test Filters Update:**

* Updated `onnx_backend_test_series_filters.jsonc` to remove the exclusion of `test_l1normalization*`, `test_lpnormalization*`, and `test_l2normalization*` now that `LpNormalization` opset 22 is implemented, and added a TODO comment referencing ONNX 1.21 for a known zero-norm issue.
…ft#27151)

### Description
Previously, `MatMulReadFnSource()` used duplicated code to read data from the two inputs `a` and `b`. This patch implements another overload of `MatMulReadFnSource()` that reads data from only one input, to reduce duplicated code and get ready for further use.
…crosoft#27179)

## Problem Description
The `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` test
was exhibiting flaky behavior (failure rate ~2-20%) with numerical
mismatches.
Investigation revealed a **race condition** in the
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step within
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328).

When the batch size `M > 1`,
[MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328)
attempted to parallelize the LUT generation over the batch dimension
using `MlasTrySimpleParallel`. However, the underlying
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
implementation (specifically shared usage of `lut_scales`/`lut_biases`
or internal buffers) is not thread-safe for concurrent execution on the
same destination buffers or related state. This led to corruption of the
Look-Up Tables or scales, causing random output errors.

## Solution
This PR modifies
[onnxruntime/core/mlas/lib/qlutgemm.cpp](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/qlutgemm.cpp)
to **serialize the `GenerateLUT` loop**.
Instead of using `MlasTrySimpleParallel`, we now use a simple `for` loop
to process each row of the batch sequentially.
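
A minimal sketch of the change, with hypothetical parameter names (the real `GenerateLUT` signature differs):

```c++
// Before: parallel over the batch dimension, racy because GenerateLUT
// shares lut_scales / lut_biases state across lanes.
//
//   MlasTrySimpleParallel(ThreadPool, M, [&](ptrdiff_t m) {
//       GenerateLUT(A + m * lda, LutBuffer, lut_scales, lut_biases, K);
//   });

// After: serialize the lightweight LUT-generation step.
for (size_t m = 0; m < M; ++m) {
    GenerateLUT(A + m * lda, LutBuffer, lut_scales, lut_biases, K);
}
```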

**Performance Impact:**
The
[GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326)
step is computationally lightweight compared to the subsequent
[TMACComputeGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L505)
matrix multiplication. Serializing this setup step has negligible impact
on overall inference latency (micro-benchmarks showed no measurable
regression), but effectively eliminates the race condition.

## Verification
* **Reproduction:** The issue was reliably reproduced by running
`MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` in a loop
(failing ~1 in 5 times).
* **Verification:** After applying the fix, the same test passed **50/50
iterations** consistently.
* **Regression Testing:** Standard `MatMulNBitsLutGemm` tests (including
`BlkLen64` and `M=1` cases) continue to pass.
Bumps [tar](https://github.com/isaacs/node-tar) to 7.5.7 and updates
ancestor dependency [cmake-js](https://github.com/cmake-js/cmake-js).
These dependencies need to be updated together.

Updates `tar` from 6.2.1 to 7.5.7
Changelog (sourced from tar's CHANGELOG.md):

7.5
  • Added `zstd` compression support.
  • Consistent TOCTOU behavior in sync t.list
  • Only read from ustar block if not specified in Pax
  • Fix sync tar.list when file size reduces while reading
  • Sanitize absolute linkpaths properly
  • Prevent writing hardlink entries to the archive ahead of their file target

7.4
  • Deprecate `onentry` in favor of `onReadEntry` for clarity.

7.3
  • Add `onWriteEntry` option

7.2
  • DRY the command definitions into a single `makeCommand` method, and update the type signatures to more appropriately infer the return type from the options and arguments provided.

7.1
  • Update minipass to v7.1.0
  • Update the type definitions of `write()` and `end()` methods on `Unpack` and `Parser` classes to be compatible with the NodeJS.WritableStream type in the latest versions of `@types/node`.

7.0
  • Drop support for node <18
  • Rewrite in TypeScript, provide ESM and CommonJS hybrid interface
  • Add tree-shake friendly exports, like `import('tar/create')` and `import('tar/read-entry')`, to get individual functions or classes.
  • Add `chmod` option that defaults to false, and deprecate `noChmod`. That is, reverse the default option regarding explicitly setting file system modes to match tar entry settings.
  • Add `processUmask` option to avoid having to call `process.umask()` when `chmod: true` (or `noChmod: false`) is set.

... (truncated)
Commits:
  • 4a37eb9 7.5.7
  • f4a7aa9 fix: properly sanitize hard links containing ..
  • 394ece6 7.5.6
  • 7d4cc17 fix race puting a Link ahead of its target File
  • 26ab904 7.5.5
  • e9a1ddb fix: do not prevent valid linkpaths within archive
  • 911c886 7.5.4
  • 3b1abfa normalize out unicode ligatures
  • a43478c remove some unused files
  • 970c58f update deps
  • Additional commits viewable in the compare view: https://github.com/isaacs/node-tar/compare/v6.2.1...v7.5.7

Maintainer changes: this version was pushed to npm by isaacs (https://www.npmjs.com/~isaacs), a new releaser for tar since your current version.

Updates `cmake-js` from 7.2.1 to 8.0.0
Release notes (sourced from cmake-js's releases):

v8.0.0
This is a small but breaking change. It now requires Node.js 20 or later, due to increased requirements of updated dependencies. With the increased minimum, it now uses the builtin fetch, which further reduces the install size!
Full Changelog: https://github.com/cmake-js/cmake-js/compare/v7.4.0...v8.0.0
Changelog (sourced from cmake-js's changelog.md):

v8.0.0 - 27/01/26
  • feat: require nodejs 20 or later
  • feat: update deprecated dependencies

v7.4.0 - 14/11/25
  • feat(windows): support msvc 2026 (Thanks to @Norgerkaj)

v7.3.1 - 17/04/25
  • fix(windows): support windows arm64 (Thanks to @jaycex)
  • fix(windows): support newer visual studio installations

v7.3.0 - 15/01/24
  • feat(windows): replace custom libnode.def generation with version from node-api-headers
  • fix: support for vs2015 with nodejs 18 and older (#317)
  • fix(windows): always remove Path if PATH is also defined (#319)
  • fix: Cmake arguments got converted to numbers (#314)
  • fix: update node-api-headers
  • chore: update dependencies
Commits:
  • a2c3713 chore: v8.0.0
  • 33c03a9 chore: fix ci
  • 3bf03be chore: fix ci
  • ab5e651 chore(deps): bump actions/checkout from 5 to 6 (#358)
  • 818bece fix: replace npmlog with simple inline logger
  • 0b3a840 feat!: replace axios with fetch
  • c429d9e feat!: require nodejs 20
  • a5fe3c2 v7.4.0
  • 4ab302a feat(windows): add visual studio 2026 support (#357)
  • 2d0abc4 chore: fix readme typo (#353)
  • Additional commits viewable in the compare view: https://github.com/cmake-js/cmake-js/compare/v7.2.1...v8.0.0


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
Adds a C/C++ API named `GetTensorElementTypeAndShapeDataReference` that returns an OrtValue tensor's shape and type without allocating a new buffer for the shape data.



### Motivation and Context
This new API function can be used instead of `OrtApi::GetTypeInfo()` or `OrtApi::GetTensorTypeAndShape` to decrease the number of heap allocations and thus improve inference latency for plugin EP kernels that frequently retrieve tensor shapes during inference (e.g., the WebGPU plugin EP).
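
A hedged usage sketch; the parameter list below is an assumption inferred from the description (element type plus a non-owning pointer to the shape data), not the actual header:

```c++
// Hypothetical signature; consult the real onnxruntime_c_api.h.
// Assumes `api` is the OrtApi* and `value` is a tensor OrtValue*.
ONNXTensorElementDataType elem_type;
const int64_t* shape_data = nullptr;  // non-owning view, no new allocation
size_t shape_len = 0;
Ort::ThrowOnError(api->GetTensorElementTypeAndShapeDataReference(
    value, &elem_type, &shape_data, &shape_len));
```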
tianleiwu and others added 10 commits February 4, 2026 09:57
…microsoft#27157)

Replaces the deprecated pkg_resources library with importlib.metadata to
fix ModuleNotFoundError.
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23 (Dependabot description identical to the earlier lodash bump above).

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oft#27106)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23 (Dependabot description identical to the earlier lodash bump above).

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…27195)

### Description
Fixes C++ documentation generation by replacing `<` and `>` with `[` and
`]`. Angle brackets are mistaken as html tags.

Successful run:
https://github.com/microsoft/onnxruntime/actions/runs/21456738258

### Motivation and Context
Allow C++ document generation to succeed.
…rosoft#27174)

## Problem Description
The `MatMulNBitsLutGemm` test suite, specifically
`Float32_2Bits_Symmetric_256x256_BlkLen64`, was exhibiting intermittent
failures (flakiness).
The failure manifested as numerical mismatches exceeding the tolerance,
suggesting non-deterministic behavior in the kernel execution.

## Root Cause Analysis
The issue was traced to the usage of `_mm256_i32gather_ps` in
`sqnbitgemm_lut_kernel_avx2.cpp`.
While the gather indices were technically calculating addresses within
the bounds of the allocated buffer, gather instructions on certain AVX2
hardware implementations can exhibit non-deterministic behavior or
subtle performance/prefetching artifacts when operating on specific
stride patterns (in this case, gathering with a stride of 4 floats).

## Solution
This PR replaces the `_mm256_i32gather_ps` instruction with a sequence
of **contiguous loads (`_mm256_loadu_ps`) followed by deterministic
shuffles**.

### How it works:
1. **Contiguous Load**: We load 4 contiguous vectors of 8 float elements
using `_mm256_loadu_ps`. This is always memory-safe and deterministic.
2. **Deterministic Shuffle**: We apply a verified sequence of `unpack` and `permutevar8x32` instructions to rearrange these 32 linearly loaded elements into the exact same stride-4 layout that the gather instruction produced (a sketch of one such sequence follows below).
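
For illustration, a hedged sketch of one possible load+shuffle sequence (this one uses `permutevar8x32` plus blends; the PR's exact unpack-based sequence differs):

```c++
#include <immintrin.h>

// Hypothetical illustration: emulate
//   _mm256_i32gather_ps(base, _mm256_setr_epi32(0,4,8,12,16,20,24,28), 4)
// with four contiguous loads plus deterministic shuffles.
static inline __m256 GatherStride4Emulated(const float* base) {
    __m256 v0 = _mm256_loadu_ps(base + 0);    // base[0..7]
    __m256 v1 = _mm256_loadu_ps(base + 8);    // base[8..15]
    __m256 v2 = _mm256_loadu_ps(base + 16);   // base[16..23]
    __m256 v3 = _mm256_loadu_ps(base + 24);   // base[24..31]
    // Replicate each vector's lanes 0 and 4 across every lane pair.
    __m256i idx = _mm256_setr_epi32(0, 4, 0, 4, 0, 4, 0, 4);
    __m256 p0 = _mm256_permutevar8x32_ps(v0, idx);
    __m256 p1 = _mm256_permutevar8x32_ps(v1, idx);
    __m256 p2 = _mm256_permutevar8x32_ps(v2, idx);
    __m256 p3 = _mm256_permutevar8x32_ps(v3, idx);
    __m256 lo = _mm256_blend_ps(p0, p1, 0b00001100);  // {b0,b4,b8,b12,...}
    __m256 hi = _mm256_blend_ps(p2, p3, 0b11000000);  // {...,b16,b20,b24,b28}
    return _mm256_blend_ps(lo, hi, 0b11110000);       // stride-4 gather layout
}
```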

### Benefits:
* **Stability**: Eliminates the hardware-dependent non-determinism of
gather.
* **Safety**: Usage of `loadu` guarantees we only touch memory within
the explicit range of the 32 elements we intend to load.
* **Correctness**: The shuffle logic was verified against the reference
gather behavior using a C++ reproduction script to ensure bit-exact
layout equivalence.

### Performance

Micro-benchmark on MatMulNBitsLutGemm (256x256, BlkLen=64).
Original (Gather): ~55.55 us
Fixed (Load+Shuffle): ~57.79 us
Delta: +2.24 us (~4% slower)

The slight performance regression is expected because replacing a single
hardware gather instruction with a sequence of loadu, unpack, and
permute instructions adds instruction count overhead. However, this is a
necessary tradeoff to ensure deterministic behavior and memory safety
across all AVX2 implementations.

## Verification
* **Tests**: All 9 tests in `MatMulNBitsLutGemm` passed successfully
(including the previously flaky `BlkLen64` case).
…osoft#27120)

Description
Conditionally disable linking of cpuinfo for
onnxruntime_runtime_path_test_shared_library on targets, where cpuinfo
is not supported.

Motivation and Context
Recent changes enabling onnxruntime_autoep_test and related shared
library tests on non-Windows platforms exposed a transitive dependency
issue. cpuinfo was being linked unconditionally on Linux, leading to
linker failures on ppc64le (cannot find -lcpuinfo).

Solution
Add CPUINFO_SUPPORTED guards to exclude cpuinfo from the link list while
preserving existing behavior.
From code review, the logic of the interleaved NEON kernel is not correct:

1.  **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:

    ```c++
    size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
    std::vector<float> sin_data(table_len);
    std::vector<float> cos_data(table_len);
    ```

For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.

2.  **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i / 2`:

    ```c++
    __m256 sin_val = _mm256_loadu_ps(sin_data + i / 2);
    __m256 cos_val = _mm256_loadu_ps(cos_data + i / 2);
    ```

This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**

3.  **NEON (fp16) Kernel Logic (`interleaved = true`):**
    This kernel loads the `sin`/`cos` data using an index of `i`:

    ```c++
    // Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
    //...
    // Inside loop, for next iteration:
    sin_val = MlasLoadFloat16x8(sin + i + 16); 
    ```

    This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**

### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```

Before applying the fix, the test failed:
```
[  FAILED  ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
  Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, test passed.

### Summary

The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.

The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
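
A minimal sketch of the consistent indexing, with a hypothetical loop shape (the real kernel pipelines its loads across iterations):

```c++
// Hypothetical loop shape; only the i/2 table indexing is the point.
// For interleaved RoPE the sin/cos tables hold rotary_emb_dim / 2 entries,
// so 16 interleaved fp16 elements consume 8 table entries.
for (size_t i = 0; i + 16 <= rotary_emb_dim; i += 16) {
    auto sin_val = MlasLoadFloat16x8(sin + i / 2);
    auto cos_val = MlasLoadFloat16x8(cos + i / 2);
    // ... rotate input[i .. i+15] using these eight (sin, cos) pairs ...
}
```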

---------

Co-authored-by: Copilot <[email protected]>
…microsoft#27207)

### Description

This PR renames the following existing names for MemoryInfo:

- `WebGPU_Buffer` -> `WebGPU_Buf`
- `WebNN_Tensor` -> `WebNN_Ten`

### Motivation and Context

The `OrtMemoryInfo` uses a `std::string` to store the name. Modern C++
compilers use "small string optimization" (SSO) to avoid an extra
memory allocation if the string is small enough.

While different compilers may have different implementations, the
following test program can be used to find the exact limit for a given
compiler:

```c++
#include <string>
#include <cstdio>

int main() {
  std::string webgpu0 = "WebGPU_Buf";
  std::string webgpu1 = "WebGPU_Buff";
  std::string webgpu2 = "WebGPU_Buffe";
  std::string webgpu3 = "WebGPU_Buffer";

  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu0.c_str(), (void*)&webgpu0, (void*)webgpu0.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu1.c_str(), (void*)&webgpu1, (void*)webgpu1.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu2.c_str(), (void*)&webgpu2, (void*)webgpu2.data());
  printf("=========== %s\n string address: %p\n data address  : %p\n\n", webgpu3.c_str(), (void*)&webgpu3, (void*)webgpu3.data());

  return 0;
}
```

Using Emscripten (targeting wasm32), the runtime result looks like this:
```
=========== WebGPU_Buf
 string address: 0x10db0
 data address  : 0x10db0

=========== WebGPU_Buff
 string address: 0x10da4
 data address  : 0x10dc8

=========== WebGPU_Buffe
 string address: 0x10d98
 data address  : 0x10de0

=========== WebGPU_Buffer
 string address: 0x10d8c
 data address  : 0x10df8
```

This shows that the string needs to be no more than 10 bytes (excluding
the trailing '\0') to enable SSO.
