mlas/arm64: add NEON conv asm kernels and tune NCHWC kernel selection #27099
Conversation
Signed-off-by: Milos Puzovic <[email protected]>
Interesting contribution - thank you! A few questions -
Hi @aviralagrawal, thank you very much for your prompt feedback.
Compared to a direct GEMM implementation of pointwise convolution, the asm kernel computes the 1x1 conv directly:
As usual there are trade-offs. Direct GEMM would be faster when the output count is small, because the asm kernel then drops to its single-output path, which has less ILP and cannot reuse filter loads, and for non-unit strides and non-contiguous output regions, which is why the heuristics check the stride width and height. GEMM is also preferable for very large K/M, where GEMM blocking can make better use of the caches than a fixed 4-output tile. This is best illustrated by extracting the pointwise convolutions from the MobileNet model we ran: on average the asm implementation is 1.07x faster, and the significant speed-ups come when the number of channels is high and K/M are small (in the image those are the H and W dimensions). In convolution-heavy networks the dominant convolutions are the ones with a large number of channels and low height and width, so we see visible performance improvements, as the optimisations in this PR are weighted in that direction.
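To make the trade-off concrete, here is a minimal sketch of the kind of shape-based check described above; the function and threshold names (`PreferAsmPointwiseKernel`, `kMaxGemmFriendlyK`) are hypothetical and do not reproduce the actual MLAS heuristic.
```c++
// Hedged sketch: decide whether to dispatch a 4-output-tile asm pointwise
// kernel or fall back to the GEMM-based path. Names and thresholds are
// illustrative only; the real heuristic lives in snchwc.cpp.
#include <cstddef>

struct PointwiseShape {
    size_t InputChannels;   // K
    size_t OutputChannels;  // M
    size_t OutputHeight;
    size_t OutputWidth;
    size_t StrideHeight;
    size_t StrideWidth;
};

bool PreferAsmPointwiseKernel(const PointwiseShape& s) {
    const size_t OutputCount = s.OutputHeight * s.OutputWidth;

    // Non-unit strides produce non-contiguous output regions: keep GEMM.
    if (s.StrideHeight != 1 || s.StrideWidth != 1) {
        return false;
    }
    // With few outputs the asm kernel drops to its single-output path
    // (less ILP, no filter-load reuse), so GEMM wins.
    if (OutputCount < 4) {
        return false;
    }
    // Very large K/M: GEMM blocking uses the caches better than a fixed
    // 4-output tile (threshold is a placeholder, not a tuned value).
    const size_t kMaxGemmFriendlyK = 1024;
    if (s.InputChannels > kMaxGemmFriendlyK && s.OutputChannels > kMaxGemmFriendlyK) {
        return false;
    }
    return true;  // High channel count, small H/W: the asm tile tends to win.
}
```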
For benchmarking we used the model from: https://github.com/onnx/models/blob/main/validated/vision/classification/mobilenet/model/mobilenetv2-7.onnx
Running
Thanks @milpuz01 for the detailed description & comment! A couple questions from my side:
Hi @Rohanjames1997, thank you very much for your comments.
No particular reason; mostly the focus for this PR was on the MobileNet model, plus lack of bandwidth. Thank you for sharing the model where
Yes, I think that is a great idea, and it would be interesting to hear from @hariharans29 too what other testing we should do to try to make these kernels the default. As you can see above, this change is not going to accelerate every possible pointwise convolution, but on average it shows improvements, so if we could agree on a set of performance targets we could use that to drive the decision. Also, thank you for your code review comments; I will address them in a separate commit.
Unfortunately, I don't have a comprehensive list of performance targets to be met to make the feature default. Since the performance testing may not include all possible Conv shapes, I would like to err on the side of caution and at least provide one release-timeline heads-up to users before considering making the feature default. I would also encourage you to open a discussion to solicit feedback from other ORT users on Arm on whether they see a speed-up for their models with this feature. It would provide greater confidence and a strong data point to turn it on by default. Thanks for this contribution, we will review it shortly!
Signed-off-by: Milos Puzovic <[email protected]>
Thanks @hariharans29. I agree with erring on the side of caution. If this PR goes through and it is in the main release, is it possible to add a note that we would like to make
Thanks @milpuz01. The PR should go through in main eventually, but I don't think it will go in 1.24.0 unfortunately, as the release branch is cut and the bar to take in new code at this point is critical bug fixes and urgent customer asks only. I will try to take this in for 1.24.1 when it happens, and sure, I will add a note about considering making it default in one of the future releases. Ultimately, as discussed in the comment #27099 (comment), I expect the NchwcFloatKernel needs optimizations before considering that.
Pull request overview
Adds new AArch64 NEON assembly micro-kernels for NCHW, depthwise NCHWc, and pointwise NCHWc convolution, integrates them into the MLAS build, and updates NCHWc kernel-selection heuristics to prefer the asm kernels in selected shapes.
Changes:
- Add new AArch64 `.S` convolution micro-kernels (NCHW, depthwise NCHWc, pointwise NCHWc) and wire them into the MLAS build.
- Update ARM64 platform init and NCHWc execution heuristics to select the asm kernels for pointwise (stride-1, larger tiles) and depthwise (wider outputs) convolutions.
- Remove the old intrinsics wrapper for the NCHW float kernel in the NCHWc NEON source file.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| cmake/onnxruntime_mlas.cmake | Adds new AArch64 asm sources to the ARM NEON NCHWc MLAS build setup. |
| onnxruntime/core/mlas/lib/snchwc.cpp | Adds ARM64 heuristics to prefer asm depthwise/pointwise kernels in “safe” cases. |
| onnxruntime/core/mlas/lib/sconv_nchwc_kernel_neon.cpp | Removes the old NCHW float kernel wrapper implementation from the NCHWc NEON source file. |
| onnxruntime/core/mlas/lib/platform.cpp | Switches ARM64 NCHW conv kernel default to asm; updates commentary around kernel choices. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new asm kernel entry points for ARM64 NEON NCHWc. |
| onnxruntime/core/mlas/lib/aarch64/SconvKernelNeon.S | Adds new NCHW convolution asm micro-kernel. |
| onnxruntime/core/mlas/lib/aarch64/SconvDepthwiseKernelNeon.S | Adds new depthwise NCHWc asm micro-kernel (fast/slow path for padding). |
| onnxruntime/core/mlas/lib/aarch64/SconvPointwiseKernelNeon.S | Adds new pointwise NCHWc asm micro-kernel (multi-output reuse). |
Fix microsoft#27125. It fixes the build issue on Linux, but I am not entirely sure whether this is the optimal fix.
### Description Models with corresponding Olive recipes are deprecated. ### Motivation and Context Olive and olive-recipes are the entry point for model optimization. We want onnxruntime to be only the runtime, so we are deprecating examples that are already present in olive-recipes.
…t#27134) Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description Enables file mapping of the weights as well as the overall context bin. This feature is currently only enabled for ARM64 Windows devices. ### Motivation and Context Currently, when reading the context bin, ORT allocates a large buffer on the heap. Assuming the same model is used, each ORT session will allocate a buffer for the context bin. This is incredibly wasteful when large models are used. Instead, Windows file mapping can be leveraged to map the context bin; then, every time a context needs to be created with the context bin, the pointer to the mapped context bin can be retrieved and used instead of a pre-allocated buffer, making the QNN EP more memory-efficient. In the case of multiple ORT sessions, the context bin will only be loaded once for all sessions, improving memory efficiency and overall initialization performance. This is very useful for LLM use cases going forward. --------- Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
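For illustration, here is a generic sketch of the Win32 file-mapping pattern the description refers to (map the file once instead of reading it into a per-session heap buffer). This is not the QNN EP's actual code, and error handling is minimal.
```c++
// Hedged sketch: map a file read-only on Windows so multiple consumers can
// share one OS-backed view instead of each allocating a heap copy.
#include <windows.h>
#include <cstdint>

struct MappedFile {
    HANDLE file = INVALID_HANDLE_VALUE;
    HANDLE mapping = nullptr;
    const uint8_t* data = nullptr;
    uint64_t size = 0;
};

bool MapContextBin(const wchar_t* path, MappedFile& out) {
    out.file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (out.file == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER file_size{};
    if (!GetFileSizeEx(out.file, &file_size)) return false;
    out.size = static_cast<uint64_t>(file_size.QuadPart);

    out.mapping = CreateFileMappingW(out.file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (out.mapping == nullptr) return false;

    out.data = static_cast<const uint8_t*>(
        MapViewOfFile(out.mapping, FILE_MAP_READ, 0, 0, 0));
    return out.data != nullptr;
}

void UnmapContextBin(MappedFile& m) {
    if (m.data) UnmapViewOfFile(m.data);
    if (m.mapping) CloseHandle(m.mapping);
    if (m.file != INVALID_HANDLE_VALUE) CloseHandle(m.file);
    m = MappedFile{};
}
```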
…spec (microsoft#27164) I missed the operator since it didn't have the corresponding tests at the time. With onnx/onnx#7618, the disabled test should be able to pass. --- This pull request updates the ONNX Runtime CPU execution provider to add support for the `LpNormalization` operator for opset version 22, in addition to clarifying and correcting the registration for earlier versions. It also updates the backend test filters to reflect this new support. **ONNX Operator Kernel Registration:** * Added new kernel registrations for `LpNormalization` with opset version 22 for both `float` and `double` data types in `cpu_execution_provider.cc`. [[1]](diffhunk://#diff-054ffdd679ada14ebb4b1db27a60b2881e2db48f9dc3f0b948c784cdcdaf4908R1328-R1329) [[2]](diffhunk://#diff-054ffdd679ada14ebb4b1db27a60b2881e2db48f9dc3f0b948c784cdcdaf4908R3389-R3392) * Updated the registration for `LpNormalization` for opset versions 1 through 21 to use the correct versioned kernel macro, ensuring correct kernel selection and compatibility. [[1]](diffhunk://#diff-054ffdd679ada14ebb4b1db27a60b2881e2db48f9dc3f0b948c784cdcdaf4908L197-R198) [[2]](diffhunk://#diff-054ffdd679ada14ebb4b1db27a60b2881e2db48f9dc3f0b948c784cdcdaf4908L1731-R1735) **Test Filters Update:** * Updated `onnx_backend_test_series_filters.jsonc` to remove the exclusion of `test_l1normalization*`, `test_lpnormalization*`, and `test_l2normalization*` now that `LpNormalization` opset 22 is implemented, and added a TODO comment referencing ONNX 1.21 for a known zero-norm issue. [[1]](diffhunk://#diff-abc0f78c2314f9e7648c8081125d0ce9f33b12399520d92d811d73e3c795ed59R32-R33) [[2]](diffhunk://#diff-abc0f78c2314f9e7648c8081125d0ce9f33b12399520d92d811d73e3c795ed59L42) [[3]](diffhunk://#diff-abc0f78c2314f9e7648c8081125d0ce9f33b12399520d92d811d73e3c795ed59L70-L71)
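As a reminder of the operator semantics (independent of the ORT kernel registration above), here is a scalar sketch of LpNormalization for p=1 and p=2 along the last axis of a 2-D tensor; the function name and layout are illustrative only.
```c++
// Hedged sketch of LpNormalization semantics: y = x / ||x||_p along an axis,
// shown for a row-major 2-D tensor normalized along its last axis.
#include <cmath>
#include <cstddef>
#include <vector>

void LpNormalizeRows(const std::vector<float>& x, size_t rows, size_t cols,
                     int p, std::vector<float>& y) {
    y.resize(rows * cols);
    for (size_t r = 0; r < rows; ++r) {
        double norm = 0.0;
        for (size_t c = 0; c < cols; ++c) {
            const double v = x[r * cols + c];
            norm += (p == 1) ? std::abs(v) : v * v;
        }
        if (p == 2) norm = std::sqrt(norm);
        // The zero-norm edge case mentioned in the test-filter comment is
        // deliberately left unhandled in this sketch.
        for (size_t c = 0; c < cols; ++c) {
            y[r * cols + c] = static_cast<float>(x[r * cols + c] / norm);
        }
    }
}
```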
…ft#27151) ### Description Previously, `MatMulReadFnSource()` used duplicated code to read data from the two inputs `a` and `b`. This patch implements another overload of `MatMulReadFnSource()` that reads data from only one input, reducing duplication and getting ready for further use.
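The refactoring pattern described above can be sketched as follows; `EmitMatMulReadFn` and its parameters are hypothetical stand-ins, not the actual WebGPU EP helper, and the emitted text is only WGSL-like pseudocode.
```c++
// Hedged sketch of the deduplication idea: one parameterized helper emits the
// shader read function for a single input and is called once per input,
// instead of duplicating the string-building code for `a` and `b`.
#include <sstream>
#include <string>

std::string EmitMatMulReadFn(const std::string& input_name,
                             const std::string& element_type) {
    std::ostringstream ss;
    ss << "fn read_" << input_name << "(row: u32, col: u32) -> "
       << element_type << " {\n"
       << "  return " << input_name << "[row * uniforms." << input_name
       << "_stride + col];\n"
       << "}\n";
    return ss.str();
}

// Usage: the two-input variant simply composes the single-input one.
std::string EmitMatMulReadFns() {
    return EmitMatMulReadFn("a", "f32") + EmitMatMulReadFn("b", "f32");
}
```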
…crosoft#27179) ## Problem Description The `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` test was exhibiting flaky behavior (failure rate ~2-20%) with numerical mismatches. Investigation revealed a **race condition** in the [GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326) step within [MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328). When the batch size `M > 1`, [MlasLutGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/inc/mlas_qnbit.h#L328) attempted to parallelize the LUT generation over the batch dimension using `MlasTrySimpleParallel`. However, the underlying [GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326) implementation (specifically shared usage of `lut_scales`/`lut_biases` or internal buffers) is not thread-safe for concurrent execution on the same destination buffers or related state. This led to corruption of the Look-Up Tables or scales, causing random output errors. ## Solution This PR modifies [onnxruntime/core/mlas/lib/qlutgemm.cpp](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/qlutgemm.cpp) to **serialize the [GenerateLUT](file:///home/tlwu/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#324-355) loop**. Instead of using `MlasTrySimpleParallel`, we now use a simple `for` loop to process each row of the batch sequentially. **Performance Impact:** The [GenerateLUT](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L326) step is computationally lightweight compared to the subsequent [TMACComputeGemm](https://github.com/microsoft/onnxruntime/blob/38dfc91f38fe53da9eaf7e9fb9b158904eb3cd5b/onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp#L505) matrix multiplication. Serializing this setup step has negligible impact on overall inference latency (micro-benchmarks showed no measurable regression), but effectively eliminates the race condition. ## Verification * **Reproduction:** The issue was reliably reproduced by running `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` in a loop (failing ~1 in 5 times). * **Verification:** After applying the fix, the same test passed **50/50 iterations** consistently. * **Regression Testing:** Standard `MatMulNBitsLutGemm` tests (including `BlkLen64` and `M=1` cases) continue to pass.
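The shape of the fix can be sketched as below; `GenerateLUTForRow`, `TrySimpleParallel`, and `LutGemmWorkspace` are illustrative stand-ins rather than the exact types and helpers used in qlutgemm.cpp.
```c++
// Hedged sketch: replace per-batch-row parallel dispatch of the LUT
// generation with a sequential loop, because the generation step shares
// buffers/state that are not safe to touch concurrently.
#include <cstddef>
#include <functional>

// Illustrative stand-ins for the real MLAS types/helpers.
struct LutGemmWorkspace;
void GenerateLUTForRow(LutGemmWorkspace& ws, std::size_t row);  // not thread-safe across rows
void TrySimpleParallel(std::size_t iterations,
                       const std::function<void(std::size_t)>& work);

void PrepareLuts(LutGemmWorkspace& ws, std::size_t M) {
    // Before (racy): parallelize over the batch dimension.
    // TrySimpleParallel(M, [&](std::size_t m) { GenerateLUTForRow(ws, m); });

    // After (fix): serialize the lightweight setup step; the expensive GEMM
    // compute that follows can still be parallelized.
    for (std::size_t m = 0; m < M; ++m) {
        GenerateLUTForRow(ws, m);
    }
}
```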
See related issues: microsoft#26889
Bumps [tar](https://github.com/isaacs/node-tar) from 6.2.1 to 7.5.7 and updates ancestor dependency [cmake-js](https://github.com/cmake-js/cmake-js) from 7.2.1 to 8.0.0. These dependencies need to be updated together. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description Adds a C/C++ API named `GetTensorElementTypeAndShapeDataReference` that returns an OrtValue tensor's shape and type without allocating a new buffer for the shape data. ### Motivation and Context This new API function can be used instead of `OrtApi::GetTypeInfo()` or `OrtApi::GetTensorTypeAndShape` to decrease the number of heap allocations and thus improve inference latency for plugin EP kernels that frequently retrieve tensor shapes during inference (e.g., the WebGPU plugin EP).
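For context, this is the existing C API pattern (allocate an `OrtTensorTypeAndShapeInfo`, copy the dims out, release it) whose overhead the new accessor is described as avoiding; the new function's exact signature is not reproduced here, and status/error handling is omitted for brevity.
```c++
// Hedged sketch of the existing query path using stable C API calls.
// Returned OrtStatus* values should normally be checked and released.
#include <onnxruntime_c_api.h>
#include <cstdint>
#include <vector>

void QueryShapeExisting(const OrtApi* api, const OrtValue* value,
                        ONNXTensorElementDataType& elem_type,
                        std::vector<int64_t>& dims) {
    OrtTensorTypeAndShapeInfo* info = nullptr;
    api->GetTensorTypeAndShape(value, &info);          // heap allocation per call
    api->GetTensorElementType(info, &elem_type);
    size_t dim_count = 0;
    api->GetDimensionsCount(info, &dim_count);
    dims.resize(dim_count);
    api->GetDimensions(info, dims.data(), dim_count);  // copies the shape out
    api->ReleaseTensorTypeAndShapeInfo(info);
}
```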
…microsoft#27157) Replaces the deprecated pkg_resources library with importlib.metadata to fix ModuleNotFoundError.
Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oft#27106) Bumps [lodash](https://github.com/lodash/lodash) from 4.17.21 to 4.17.23. Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…27195) ### Description Fixes C++ documentation generation by replacing `<` and `>` with `[` and `]`. Angle brackets are mistaken as html tags. Successful run: https://github.com/microsoft/onnxruntime/actions/runs/21456738258 ### Motivation and Context Allow C++ document generation to succeed.
…rosoft#27174) ## Problem Description The `MatMulNBitsLutGemm` test suite, specifically `Float32_2Bits_Symmetric_256x256_BlkLen64`, was observing intermittent failures (flakiness). The failure manifested as numerical mismatches exceeding the tolerance, suggesting non-deterministic behavior in the kernel execution. ## Root Cause Analysis The issue was traced to the usage of `_mm256_i32gather_ps` in sqnbitgemm_lut_kernel_avx2.cpp While the gather indices were technically calculating addresses within the bounds of the allocated buffer, gather instructions on certain AVX2 hardware implementations can exhibit non-deterministic behavior or subtle performance/prefetching artifacts when operating on specific stride patterns (in this case, gathering with a stride of 4 floats). ## Solution This PR replaces the `_mm256_i32gather_ps` instruction with a sequence of **contiguous loads (`_mm256_loadu_ps`) followed by deterministic shuffles**. ### How it works: 1. **Contiguous Load**: We load 4 contiguous vectors of 8 floats elements using `_mm256_loadu_ps`. This is always memory-safe and deterministic. 2. **Deterministic Shuffle**: We apply a verified sequence of `unpack` and `permutevar8x32` instructions to rearrange these 32 linearly loaded elements into the exact same stride-4 layout that the gather instruction produced. ### Benefits: * **Stability**: Eliminates the hardware-dependent non-determinism of gather. * **Safety**: Usage of `loadu` guarantees we only touch memory within the explicit range of the 32 elements we intend to load. * **Correctness**: The shuffle logic was verified against the reference gather behavior using a C++ reproduction script to ensure bit-exact layout equivalence. ### Performance Micro-benchmark on MatMulNBitsLutGemm (256x256, BlkLen=64). Original (Gather): ~55.55 us Fixed (Load+Shuffle): ~57.79 us Delta: +2.24 us (~4% slower) The slight performance regression is expected because replacing a single hardware gather instruction with a sequence of loadu, unpack, and permute instructions adds instruction count overhead. However, this is a necessary tradeoff to ensure deterministic behavior and memory safety across all AVX2 implementations. ## Verification * **Tests**: All 9 tests in `MatMulNBitsLutGemm` passed successfully (including the previously flaky `BlkLen64` case).
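To illustrate the load-plus-shuffle idea (not the exact instruction sequence in the patch), here is a self-contained sketch that reproduces a stride-4 gather of 8 floats from 32 contiguous elements using `_mm256_loadu_ps`, `_mm256_shuffle_ps`, `_mm256_permutevar8x32_ps`, and a blend, verified against a scalar reference.
```c++
// Hedged sketch: emulate a stride-4 float gather (indices 0,4,8,...,28) with
// contiguous loads and deterministic shuffles. Compile with -mavx2.
#include <immintrin.h>
#include <cstdio>

static __m256 GatherStride4(const float* src) {
    const __m256 v0 = _mm256_loadu_ps(src + 0);    // f0..f7
    const __m256 v1 = _mm256_loadu_ps(src + 8);    // f8..f15
    const __m256 v2 = _mm256_loadu_ps(src + 16);   // f16..f23
    const __m256 v3 = _mm256_loadu_ps(src + 24);   // f24..f31

    // Per 128-bit lane, pick element 0 of each source:
    // s01 = {f0, f0, f8, f8, f4, f4, f12, f12}
    const __m256 s01 = _mm256_shuffle_ps(v0, v1, _MM_SHUFFLE(0, 0, 0, 0));
    // s23 = {f16, f16, f24, f24, f20, f20, f28, f28}
    const __m256 s23 = _mm256_shuffle_ps(v2, v3, _MM_SHUFFLE(0, 0, 0, 0));

    // Cross-lane permute to order each half as {x, x+4, x+8, x+12}.
    const __m256i idx = _mm256_setr_epi32(0, 4, 2, 6, 0, 4, 2, 6);
    const __m256 lo = _mm256_permutevar8x32_ps(s01, idx);  // {f0, f4, f8, f12, ...}
    const __m256 hi = _mm256_permutevar8x32_ps(s23, idx);  // {f16, f20, f24, f28, ...}

    // Take the low half from lo and the high half from hi.
    return _mm256_blend_ps(lo, hi, 0xF0);
}

int main() {
    float src[32];
    for (int i = 0; i < 32; ++i) src[i] = static_cast<float>(i);

    float out[8];
    _mm256_storeu_ps(out, GatherStride4(src));

    for (int i = 0; i < 8; ++i) {
        const float expected = src[4 * i];  // scalar reference gather
        if (out[i] != expected) {
            std::printf("mismatch at %d: %f vs %f\n", i, out[i], expected);
            return 1;
        }
    }
    std::printf("stride-4 gather via load+shuffle verified\n");
    return 0;
}
```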
…osoft#27120) ### Description Conditionally disable linking of cpuinfo for onnxruntime_runtime_path_test_shared_library on targets where cpuinfo is not supported. ### Motivation and Context Recent changes enabling onnxruntime_autoep_test and related shared library tests on non-Windows platforms exposed a transitive dependency issue: cpuinfo was being linked unconditionally on Linux, leading to linker failures on ppc64le (cannot find -lcpuinfo). ### Solution Add CPUINFO_SUPPORTED guards to exclude cpuinfo from the link list while preserving existing behavior.
The logic of the interleaved NEON kernel is not correct, based on code review:
1. **Test Code Logic:**
The test code `test_rope.h` allocates the `sin` and `cos` tables based
on the `interleaved` flag:
```c++
size_t table_len = interleaved ? rotary_emb_dim / 2 : rotary_emb_dim;
std::vector<float> sin_data(table_len);
std::vector<float> cos_data(table_len);
```
For the `interleaved = true` case, the test creates `sin` and `cos`
tables of length `rotary_emb_dim / 2`.
2. **AVX2 (fp32) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i / 2`:
```c++
__m256 sin_val = _mm256_loadu_ps(sin_data + i / 2);
__m256 cos_val = _mm256_loadu_ps(cos_data + i / 2);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim / 2`.
**Conclusion: The AVX2 (fp32) kernel is consistent with the test code.**
3. **NEON (fp16) Kernel Logic (`interleaved = true`):**
This kernel loads the `sin`/`cos` data using an index of `i`:
```c++
// Enters loop with sin_val = MlasLoadFloat16x8(sin + i);
//...
// Inside loop, for next iteration:
sin_val = MlasLoadFloat16x8(sin + i + 16);
```
This logic expects a `sin`/`cos` table of length `rotary_emb_dim`.
**Conclusion: The NEON (fp16) kernel is NOT consistent with the test
code.**
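For reference, here is a scalar sketch of interleaved RoPE consistent with the fallback behavior described above: each of the rotary_emb_dim / 2 pairs (x[2i], x[2i+1]) is rotated by the i-th sin/cos entry, which is why the tables only need rotary_emb_dim / 2 entries. This is a generic reference in fp32, not the MLAS kernel itself.
```c++
// Hedged scalar reference for interleaved RoPE; sin_data/cos_data hold
// rotary_emb_dim / 2 entries.
#include <cstddef>

void RopeInterleavedRef(const float* input, const float* sin_data,
                        const float* cos_data, size_t rotary_emb_dim,
                        float* output) {
    for (size_t i = 0; i < rotary_emb_dim / 2; ++i) {
        const float x0 = input[2 * i];
        const float x1 = input[2 * i + 1];
        const float s = sin_data[i];
        const float c = cos_data[i];
        output[2 * i]     = x0 * c - x1 * s;  // rotated "real" component
        output[2 * i + 1] = x0 * s + x1 * c;  // rotated "imaginary" component
    }
}
```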
### Regression Test
```
cmake --build build/Linux/Release --config Release --target onnxruntime_mlas_test && ./build/Linux/Release/onnxruntime_mlas_test --gtest_filter=NeonFp16RoPE*
```
Before applying the fix, the test failed:
```
[ FAILED ] NeonFp16RoPE.ShortExecute (13 ms)
onnxruntime/onnxruntime/test/mlas/unittest/test_rope_neon_fp16.cpp:66: Failure
Value of: CloseEnough(output_impl[i].ToFloat(), output_ref[i].ToFloat())
Actual: false
Expected: true
Expected bits: 19491 (16.546875) Actual bits: 56596 (-325) @[16], rotary_emb_dim=24, interleaved=true
```
After applying the fix, the test passed.
### Summary
The `RopeKernel_Avx2_fp32_Impl<true>` kernel correctly aligns with the
test code (and the fallback implementation) by expecting a `sin`/`cos`
table of length `rotary_emb_dim / 2`.
The `RopeKernel_Fp16_Impl<true>` (NEON) kernel incorrectly expects a
table of length `rotary_emb_dim`. When run against the provided test,
the NEON kernel will read past the end of the `sin_data` and `cos_data`
vectors.
---------
Co-authored-by: Copilot <[email protected]>
…microsoft#27207) ### Description This PR renames the following existing names for MemoryInfo: - `WebGPU_Buffer` -> `WebGPU_Buf` - `WebNN_Tensor` -> `WebNN_Ten` ### Motivation and Context `OrtMemoryInfo` uses a `std::string` to store the name. Modern C++ compilers use "small string optimization" (SSO) to avoid an extra memory allocation if the string is small enough. While different compilers may have different implementations, the following test program is used to find what the exact limit is for a given compiler:
```c++
#include <string>
#include <cstdio>

int main() {
  std::string webgpu0 = "WebGPU_Buf";
  std::string webgpu1 = "WebGPU_Buff";
  std::string webgpu2 = "WebGPU_Buffe";
  std::string webgpu3 = "WebGPU_Buffer";
  printf("=========== %s\n string address: %p\n data address : %p\n\n", webgpu0.c_str(), (void*)&webgpu0, (void*)webgpu0.data());
  printf("=========== %s\n string address: %p\n data address : %p\n\n", webgpu1.c_str(), (void*)&webgpu1, (void*)webgpu1.data());
  printf("=========== %s\n string address: %p\n data address : %p\n\n", webgpu2.c_str(), (void*)&webgpu2, (void*)webgpu2.data());
  printf("=========== %s\n string address: %p\n data address : %p\n\n", webgpu3.c_str(), (void*)&webgpu3, (void*)webgpu3.data());
  return 0;
}
```
While using emscripten (targeting wasm32), the runtime result is:
```
=========== WebGPU_Buf
 string address: 0x10db0
 data address : 0x10db0

=========== WebGPU_Buff
 string address: 0x10da4
 data address : 0x10dc8

=========== WebGPU_Buffe
 string address: 0x10d98
 data address : 0x10de0

=========== WebGPU_Buffer
 string address: 0x10d8c
 data address : 0x10df8
```
This shows that the string needs to be no more than 10 bytes (excluding the '\0' at the end) to enable SSO.
Signed-off-by: Milos Puzovic <[email protected]>
…ck spill Signed-off-by: Milos Puzovic <[email protected]>

Overview
This PR adds ARM64 NEON assembly micro‑kernels for NCHW, depthwise, and pointwise convolution, wires them into the MLAS build, and adds shape‑based selection heuristics for NCHWC depthwise/pointwise to favor the asm kernels in safe cases (stride‑1 pointwise; wider depthwise outputs). The BF16 path is unchanged.
Key changes
Performance
Numbers below are expressed as multipliers vs the non‑NCHWC baseline (same model and perf_test settings), across three configurations: baseline (no `--enable_arm_neon_nchwc`), with `--enable_arm_neon_nchwc` but without the asm additions/heuristics, and with this PR (asm kernels + heuristics).
Testing
```
./build.sh --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync --skip_tests --enable_pybind --build_wheel --enable_arm_neon_nchwc
OMP_NUM_THREADS=8 ./build/Linux/Release/onnxruntime_perf_test -I -m times -r 1000 --x 8 ~/mobilenetv2-7.onnx
```