Conversation

@tianleiwu (Contributor) commented on Jan 21, 2026

Description

This PR cherry-picks the following changes for the 1.24.0 release.

Cherry-picked Commits

| Commit | Commit Title | Author |
|--------|--------------|--------|
| 744e7fe | Add type definitions, registration, utilities for INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) | tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) | Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 - Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) | quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688) | Hariharan Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) | Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) | qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040) | Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider (#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64 (#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) | eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation (#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) | tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) | vraspar |
| 39f966e | Fix Doxygen documentation build error in onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) | Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) | Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) | Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility string from model metadata (#27015) | adrastogi |

vraspar and others added 28 commits January 21, 2026 12:44
…26824)

### Description
This is the first PR towards adding 2-bit support to ORT.

I will create follow-up PRs to:
- Add CPU operators (Cast, QDQ, non-compute ops)

### Motivation and Context
- ONNX recently added 2 bit support
onnx/onnx#7446

(cherry picked from commit 744e7fe)
### Description
- The QNN NPU backend supports the BFloat16 dtype for many operators.
- QNN EP adds a new session option, "htp_bf16_enable", that lets users request processing of the Float32 graph in BFloat16 precision (see the sketch after this list).
- When the user specifies "htp_bf16_enable", the QNN EP lowers the incoming Float32 ORT graph into a BFloat16 QNN graph.
  - The ORT CPU fallback still receives Float32 partitions.
- The lowered QNN graph still accepts float32 inputs, outputs and constant initializers. The QNN EP inserts Cast operators to perform the necessary precision switch.
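A minimal, hypothetical sketch of opting in from the C++ API, assuming "htp_bf16_enable" is passed as a QNN EP provider option alongside the usual backend path (the exact plumbing may differ in the released build):

```cpp
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_bf16");
  Ort::SessionOptions session_options;

  // Assumed provider options: "backend_path" selects the HTP backend,
  // "htp_bf16_enable" is the new option described in this PR.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "QnnHtp.dll"},
      {"htp_bf16_enable", "1"},
  };
  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Unsupported partitions still fall back to the CPU EP in Float32.
  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```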

### Motivation and Context
- This enables running accuracy-sensitive float32 models in bfloat16 precision on the Qualcomm NPU accelerator, improving inference time relative to float32 computation.

---------

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
(cherry picked from commit 530a1fb)
…TMAC) (#26695)

### Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the [T-MAC paper](https://arxiv.org/abs/2407.00088) and [T-MAC repository](https://github.com/microsoft/T-MAC), to speed up low-bit LLM inference.

Unlike the existing quant-dequant methods, the LUT-based method directly
supports mixed-precision-GEMM without dequantization. It uses bit-wise
table lookup to eliminate multiplications and reduce additions required
in matrix multiplication.

<img width="1910" height="759" alt="image"
src="https://github.com/user-attachments/assets/3e3f2ced-eba4-4d4e-a63c-fec479943202"
/>

This PR:
- Adds the `mlas.use_lut_gemm` session option, allowing use of LUT GEMM inside MatMulNBits when it is available (2-bit, BlkLen multiple of 32, K multiple of 32, N multiple of 128, AVX2 present).
- Introduces LUT packing + kernel config cache (packs bitplanes, scales,
ZP) and the main `MlasLUTGemm` entry that generates per-row LUTs and
calls the AVX2 kernel.
- Implements AVX2 LUT generation `GenerateLUT_avx2` and GEMM compute
`TMACComputeGemm_avx2` and wires dispatch in MLAS platform init.
- Updates MatMulNBits PrePack/Compute to use LUT packing/compute when
opted-in; keeps existing quant-dequant path as fallback.
- Extends Python quant bindings with 2-bit QDQ helper for parity with
the new path.
- Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric
quant and multiple shapes/block sizes.

### Main components

- `MlasInitLUTGemmKernelConfig`: config for the LUT kernels
- `MlasLUTGemmPackQuantBData`: pre-packing of quantized weights
- `MlasLUTPackScalesAndZeroPoints`: pre-packing of quantized scales and zero points
- `MlasLUTGemm`: main entry point
- `GenerateLUT_avx2`: LUT construction from activations
- `TMACComputeGemm_avx2`: AVX2 LUT GEMM kernel
- Session option: `mlas.use_lut_gemm`

### How to test
- MLAS LUT GEMM unit tests: see `test_sqlutgemm.cpp`
- Run MatMulNBits models with the session option `mlas.use_lut_gemm=1` on AVX2 machines; expect fallback to the existing path if availability checks fail (see the sketch below).
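A minimal, hypothetical sketch of opting in from the C++ API; it assumes the option is exposed as a session config entry (like other "mlas." options) and relies on ORT falling back to the existing quant-dequant path when the availability checks fail:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "lut_gemm");
  Ort::SessionOptions session_options;

  // Opt in to the experimental LUT GEMM path for 2-bit MatMulNBits.
  session_options.AddConfigEntry("mlas.use_lut_gemm", "1");

  Ort::Session session(env, ORT_TSTR("matmulnbits_model.onnx"), session_options);
  return 0;
}
```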

### Perf
Focus of this PR is functional + kernel bring-up; perf to be reported
separately once broader profiling is done.

### Future Work
- Support MLFloat16 (FP16 scales and zero points)
- Add a NEON kernel for ARM.
- Add kernels for 4-bit weights and BitNet kernels.
- Broader batch (N>1) support and additional shape coverage.

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>
(cherry picked from commit 8e050d1)
…QuickGelu CPU kernel (#26753)

### Description

The `Silu` activation is essentially the same as `QuickGelu` with the scaling factor (`alpha`) set to 1. In customer models containing `Silu`, the graph optimizer suite correctly fuses the nodes into a QuickGelu with alpha = 1. This change optimizes the implementation of QuickGelu when alpha = 1 by avoiding the scaling and vectorizing the subsequent elementwise multiplication (see the sketch below).
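A minimal scalar sketch of the idea (not the actual vectorized MLAS kernel): QuickGelu computes `x * sigmoid(alpha * x)`, and when alpha == 1 (i.e. Silu) the extra multiply by alpha can be skipped:

```cpp
#include <cmath>
#include <cstddef>

// Scalar illustration only; the real kernel vectorizes the elementwise work.
void QuickGeluScalar(const float* x, float* y, std::size_t n, float alpha) {
  const bool is_silu = (alpha == 1.0f);
  for (std::size_t i = 0; i < n; ++i) {
    const float scaled = is_silu ? x[i] : alpha * x[i];  // skip scaling when alpha == 1
    y[i] = x[i] / (1.0f + std::exp(-scaled));            // x * sigmoid(scaled)
  }
}
```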

**Tests:**
There are already tests for QuickGelu with alpha = 1, so no new tests are necessary
(https://github.com/microsoft/onnxruntime/blob/f98c756b45b81520c6e2a09c370575a013f02cce/onnxruntime/test/contrib_ops/activation_op_test.cc#L126)

**Performance improvements measured:**
Gives about 2.5% throughput boost for a customer model that has a lot of
Silu activations.

### Motivation and Context
Some low hanging fruit perf improvements that give instant easy perf
wins

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 2d2ba6b)
### Description
- Added dedicated ConcatOpBuilder to handle tensors with 0-dimension
- Added unit tests for testing 0-dim inputs to Concat
- Removed existing logic for concat from base op builder

### Motivation and Context
- Currently, 0-dim inputs for Concat are not handled in QNN EP

(cherry picked from commit 1c02b79)
## Motivation

`ClipQuantFusion` in `clip_quantizelinear.cc` calls `graph_utils::RemoveNode()` without first checking `graph_utils::CanRemoveNode()`. When a Clip node has min/max inputs from DequantizeLinear nodes (instead of initializers), it has multiple input edges, and `RemoveNode()` throws an exception:

```
 [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: graph_utils.cc:650 bool onnxruntime::graph_utils::RemoveNode(onnxruntime::Graph&, onnxruntime::Node&) Should be unreachable if CanRemoveNodeAndMergeEdges is in sync with the logic here.
```

## Fix:
Added `CanRemoveNode()` check to `ClipQuantFusion::SatisfyCondition()`
to skip nodes that cannot be safely removed.

## Test:
Added `ClipQuantFusion_MultipleInputEdges` test that creates a Clip node
with min from a DQ node (2 input edges) and verifies the optimizer
doesn't crash.

## Note:
This PR is a duplicate of #26923 which GitHub is preventing us from
merging. Credit goes to original author @qti-yuduo.

---------

Co-authored-by: Yuduo Wu <yuduow@qti.qualcomm.com>
(cherry picked from commit cc2b01b)
…N HTP (#26959)

## Motivation:
QNN HTP was rejecting quantized BatchNorm models where parameters
(scale, mean, var) come through DequantizeLinear nodes with per-channel
INT8 quantization. This pattern is common in quantized models from
quantization tools.

## Changes:

- Helpers to resolve BatchNorm params through DQ nodes to their
underlying initializers
- Support per-channel dequantization for BatchNorm parameters
- Support input datatype of UFIXED_POINT_16
- Add unit test covering this QDQ params configuration

(cherry picked from commit bbd3850)
### Description
- Adds API functions to get information about the subgraphs/nodes
assigned to the EPs in the session.
- `Session_GetEpGraphAssignmentInfo`: Returns a list of "subgraphs",
each with information about the assigned EP and nodes.
- Note: App must enable session configuration
`"session.record_ep_graph_assignment_info"` to signal ORT to collect
this information. If not enabled, API returns empty results.
- `EpAssignedSubgraph_GetEpName`: Returns the name of the EP to which
the subgraph is assigned
  - `EpAssignedSubgraph_GetNodes`: Returns a list of assigned nodes
  - `EpAssignedNode_GetName`: Returns the assigned node's name
  - `EpAssignedNode_GetDomain`: Returns the assigned node's domain
- `EpAssignedNode_GetOperatorType`: Returns the assigned node's operator
type
- Also adds C++ and Python bindings

#### Structure of returned information
The API returns a list of "subgraphs". Each subgraph has the following
information:
- Subgraph info:
- EP name: The name of the execution provider to which this subgraph is
assigned.
- nodes: Name and operator type of each node. Ex: `[{"multiply", "Mul"},
...]`

Python example program (taken from unit tests):
```python
    def test_get_graph_provider_assignment_info(self):
        """
        Tests querying for information about the nodes assigned to the CPU EP.
        """

        # Create session options that enables recording EP graph partitioning info.
        session_options = onnxrt.SessionOptions()
        session_options.add_session_config_entry("session.record_ep_graph_assignment_info", "1")

        session = onnxrt.InferenceSession(get_name("add_mul_add.onnx"), sess_options=session_options)

        # Query session for information on each subgraph assigned to an EP.
        ep_subgraphs = session.get_provider_graph_assignment_info()

        # Check that all 3 nodes are assigned to CPU EP (each in its own subgraph)
        self.assertEqual(len(ep_subgraphs), 3)
        for ep_subgraph in ep_subgraphs:
            self.assertEqual(ep_subgraph.ep_name, "CPUExecutionProvider")
            self.assertEqual(len(ep_subgraph.get_nodes()), 1)

        # Serialize each node to an identifier (concatenates operator type and node name)
        node_ids: list[str] = [f"{n.op_type}/{n.name}" for s in ep_subgraphs for n in s.get_nodes()]

        # Should have 1 Mul and 2 Adds.
        self.assertEqual(len(node_ids), 3)
        self.assertIn("Add/add_0", node_ids)
        self.assertIn("Add/add_1", node_ids)
        self.assertIn("Mul/mul_0", node_ids)
```

C++ program (taken from unit test):
```c++
  // Check the ep graph partitioning (Mul on plugin EP, others on CPU EP).
  // Model has 3 subgraphs (in no particular order):
  // - Subgraph 1: Add assigned to CPU EP.
  // - Subgraph 2: Mul assigned to plugin EP.
  // - Subgraph 3: Add assigned to CPU EP.
  std::vector<Ort::ConstEpAssignedSubgraph> ep_subgraphs = session.GetEpGraphAssignmentInfo();
  ASSERT_EQ(ep_subgraphs.size(), 3);

  for (Ort::ConstEpAssignedSubgraph ep_subgraph : ep_subgraphs) {
    std::string ep_name = ep_subgraph.EpName();
    ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name || ep_name == kCpuExecutionProvider);

    const std::vector<Ort::ConstEpAssignedNode> ep_nodes = ep_subgraph.GetNodes();
    ASSERT_GE(ep_nodes.size(), 1);  // All of these subgraphs just have one node.

    if (ep_name == kCpuExecutionProvider) {
      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();

      ASSERT_EQ(op_type, "Add");
      ASSERT_TRUE(node_name == "add_0" || node_name == "add_1");
    } else {
      ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name);

      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();
      ASSERT_EQ(op_type, "Mul");
      ASSERT_EQ(node_name, "mul_0");
    }
  }
```

### Motivation and Context

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit d8f0318)
…27007)

### Description
This PR refines OpenVINO EP backend execution and input validation,
improving reshape handling, symbolic vs dynamic dimension checks, and
execution consistency. It also adds explicit support for
stateful/KV-cache inference by introducing cache index tracking,
validation, and reset logic across backend, context, and interface
layers, with corresponding test updates.

---------

Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
(cherry picked from commit b912b18)
### Description
Add support for translating the MatMulNBits contrib op to QNN's FullyConnected operation with INT4 BlockQuantized weights.

Implementation details:
 - Translate MatMulNBits to FullyConnected in OpBuilder
 - Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding
params in QNN

Testing:
 - Added new unit tests for MNB -> QNN-GPU
 - Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution
flow
   - LlaMA3.2 1B
   - Qwen2.5
   - DeepSeek-R1-Qwen 1.5b
   - Phi3.5-mini-instruct

### Motivation and Context
LLMs that go through the INT4 quantization pass in Olive produce a model with MatMulNBits contrib ops. To run these ops via QNN-EP, MatMulNBits is translated to the QNN FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
(cherry picked from commit ba11af4)
…using NEON intrinsics (#26688)

### Description

**Motivation and approach taken:**

Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This does significantly better than the current approach of `Im2Col + SGemm`: the Im2Col step wastefully extracts convolution patches, and for a 3x3 filter K would only be 9 for the SGemm, while GEMMs are usually not optimized for such small `K` values. Hence, a dedicated kernel works much better.

Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline, and that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).

The current depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it now seems to be functionally correct and provides the perf boost we are seeking.

**Unit tests:**
Depthwise convolution tests already exist in the codebase. I don't see a need for new ones at this point.

**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about a 50% kernel latency improvement):

<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>

### Motivation and Context
A key customer model had a few depthwise convolution operations, and this change provides a **non-negligible ~3% throughput improvement** using the customer-provided benchmarking setup.

For those interested, #26654 adds support for the same type of convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit c03c419)
…#26060)

### Description
Small change allowing QNN Preprocess to accept a Mul node (with A=B) instead of a Pow node (with Y=2) for LayerNorm fusion.

(cherry picked from commit e7dfd69)
**Key changes**

This PR improves the performance of dynamic QGEMMs by implementing tiling and threading across operations.

The changes introduce thread-local buffers for reusing memory during inference and use them in dynamic quantized MatMul operations backed by KleidiAI kernels.

It also updates the KleidiAI version to 1.15.0.

**Example performance**

single thread :
<img width="2100" height="900"
alt="ort_ops_compare_encoder_1_2025-10-02_17-21-32_vs_encoder_1_2025-10-02_16-54-55"
src="https://github.com/user-attachments/assets/c23c808d-5fab-4995-997e-a57a66a23d68"
/>

2 threads :
<img width="2100" height="900"
alt="ort_ops_compare_encoder_2_2025-10-02_17-21-47_vs_encoder_2_2025-10-02_16-55-13"
src="https://github.com/user-attachments/assets/31a0eb7a-7ff4-40c9-9425-b70231f131e8"
/>

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
(cherry picked from commit 4013dc1)
…6988)

### Description
Enable device stream override using RunOptions for a particular run.
The stream is restored after Run() completes.

### Motivation and Context
GPU interop requirements.
When enabled, inference runs with the specified stream, with proper synchronization against imported external synchronization facilities.

(cherry picked from commit 9f06181)
Supported operations with QMX: SGEMM, QGEMM, Convolution

(cherry picked from commit cfccd64)
### Description
Use the descriptor struct in the external resource handle.

We were copying most fields, but the setup is a little more intuitive
when the descriptor is used directly.

### Motivation and Context

(cherry picked from commit 29d9b2f)
### Description
Add support for the QuickGELU operator in the QNN provider:
- Implement QuickGeluOpBuilder to handle QuickGELU operations
- Add registration for QuickGELU in op_builder_factory
- Add comprehensive tests for CPU and HTP backends
- Support both float and quantized (QDQ) versions

### Motivation and Context
- QNN doesn't have a direct operator that maps to QuickGelu, so it is decomposed as x * sigmoid(alpha * x), allowing the whole model to run on HTP and improving inference time.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 9d108d0)
## Description

This PR adds a BF16 (bfloat16) pointwise convolution kernel for ARM64
NCHWc format, leveraging the existing SBGEMM infrastructure. When the
`mlas.enable_gemm_fastmath_arm64_bfloat16` session option is enabled on
supported ARM64 Linux hardware, Pointwise Conv is rerouted to use this
BF16 implementation. This is an opt-in feature, similar to how BF16
matmul is opt-in.
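A minimal sketch of opting in from the C++ API (equivalent to the `-C` flag used with onnxruntime_perf_test below); the config key is the existing session config entry reused by this PR:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "bf16_nchwc_conv");
  Ort::SessionOptions session_options;

  // Existing opt-in fastmath key; with this PR it also reroutes pointwise
  // NCHWc convolutions through the BF16 SBGEMM path on supported ARM64 hardware.
  session_options.AddConfigEntry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1");

  Ort::Session session(env, ORT_TSTR("mobilenet.onnx"), session_options);
  return 0;
}
```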

Added a bool ZeroMode field to `MLAS_SBGEMM_DATA_PARAMS` (default `true`
for backward compatibility) to enable per-batch control over output
accumulation. This mirrors the beta parameter in FP32's `MlasGemmBatch`
and is required for Pointwise convolutions with >128 input channels,
where multiple GEMM calls must accumulate into the same output buffer.

## Motivation and Context

The existing `mlas.enable_gemm_fastmath_arm64_bfloat16` session option
accelerates MatMul operations on ARM64 processors with BF16 support, but
convolution operations did not benefit from this optimization. Pointwise
convolutions (1x1 kernels) are essentially batched matrix
multiplications.

This change extends the BF16 fastmath optimization to pointwise NCHWc
convolutions, reusing the same session option. The implementation
mirrors the FP32 pointwise kernel structure while delegating the actual
computation to SBGEMM, ensuring correctness and maintainability.

## Performance improvement
Measured a 15-20% gain on Mobilenet inference on an AWS Graviton4
instance.

Before (FP32)
```
/build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|0" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 559.154
```

After (BF16)
```
./build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|1" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 651.221

```
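For reference, 651.221 / 559.154 ≈ 1.165, i.e. roughly a 16% throughput gain, consistent with the 15-20% range stated above.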

(cherry picked from commit 6d34aba)
…stom ops (#27050)

### Description

The two newly added APIs, `CreateCustomOpDomains()` and `GetNumCustomOpDomains`, are used when running inference on a model that contains EP-specific custom operations.

 Workflow:
1. The EP implements these functions to supply a list of
`OrtCustomOpDomain` instances.
2. The application either 1) calls
`SessionOptionsAppendExecutionProvider_V2()` with an `OrtEpDevice`
containing
     the plugin EP's factory or 2) enables auto ep selection.
3. ORT then either 1) appends the provided OrtCustomOpDomains to the session options (in `SessionOptionsAppendExecutionProvider_V2()`) or 2) registers the OrtCustomOpDomains from the selected EP devices.

As a result, any session created from these session options will have
these custom op domains registered
in ORT, ensuring that the custom ops are properly recognized and
validated when the model is loaded.

Plugin EPs can provide two types of custom ops:
   1. A full OrtCustomOp with a concrete kernel implementation
      - This Example EP demonstrates this approach.
- In GetCapability(), it calls EpGraphSupportInfo_AddSingleNode() to
inform ORT
that the custom node should NOT be fused or compiled. Instead, ORT
should invoke
        the custom node's Compute() function at runtime.

   2. A "placeholder" OrtCustomOp with an empty kernel implementation
- A compile-based Plugin EP can supply an OrtCustomOp whose
CustomKernel::Compute()
does nothing. The purpose is to satisfy model validation during model
loading by
        registering the custom op as a valid operator in the session.
- In GetCapability(), the EP should call
EpGraphSupportInfo_AddNodesToFuse() to
notify ORT that this custom node should be fused and compiled by the EP.
- In Compile(), the EP executes its compiled bits to perform inference
for
      the fused custom node.

### Motivation and Context

Currently, the provider-bridge TRT RTX EP and TRT EP support registering a custom op domain list in the session options so that they can run models containing TRT-specific custom ops.

This PR adds the same feature for plugin EP.

(cherry picked from commit 36017ad)
### Description

Adds new pipeline for CUDA 13 Nuget builds

### Motivation and Context

The artifacts from this pipeline will be used by the release pipeline to
publish the nuget packages to our public feed.

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: eserscor <247253654+eserscor@users.noreply.github.com>
(cherry picked from commit 50a03e4)
### Description

This PR mainly modifies the following:

- Update the `Graph_GetGraphView()` implementation.
- Make sure EpGraph maintains the min/max node index, so that when querying a node outside that range, it can return null.
- Provide an option to create an EpGraph that contains its parent node when the graph is the subgraph of a control flow op.

#### Update Graph_GetGraphView() implementation
In some cases, e.g. when the model has a node whose output is consumed by multiple nodes, calling the current implementation of `Graph_GetGraphView()` to get a subgraph returns an incorrect `OrtGraph`.

- Original graph:

<img width="414" height="356" alt="image"
src="https://github.com/user-attachments/assets/739c092d-0880-4f6e-9351-e08e0e141b35"
/>

- Incorrect graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph. `topk_indices` is the output of `TopK` and shouldn't be added as a graph input as shown in the graph below. The API implementation has an issue handling this case: if we feed this subgraph into the TRT parser, it fails to parse the graph.

<img width="349" height="341" alt="image"
src="https://github.com/user-attachments/assets/1306e22c-7c5d-45a2-bc18-6864fa2966ba"
/>

- Correct graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph. `topk_indices` is now not added as a graph input; instead, it is added as a graph output, which is expected because `Mod` is in another subgraph that consumes it, so this subgraph has to expose `topk_indices` as a graph output.

<img width="413" height="350" alt="image"
src="https://github.com/user-attachments/assets/b9135690-a341-41b2-9495-184030ab5cff"
/>

### Motivation and Context

(cherry picked from commit a0d4439)
### Description

This PR fixes a bug in `im2col` related to `pads` in some dimensions.

### Motivation and Context

(cherry picked from commit 34bb209)
### Description
Add support for the FusedMatMul operator in the QNN execution provider.
 FusedMatMul is a contrib operator in the Microsoft domain that performs
a fused matrix multiplication with optional bias addition and
activation.

Implementation details:
- Added FusedMatMulOpBuilder class that decomposes FusedMatMul into:
  1. MatMul operation
  2. Optional bias addition
  3. Optional activation (Relu, Sigmoid, Tanh, Gelu)
- Handles various attributes: transA, transB, alpha, and activation
- Supports higher rank tensors and different data types

Added comprehensive tests:
- Basic functionality tests with various configurations
- Tests for both CPU and HTP backends
- QDQ (Quantize-Dequantize) tests for 8-bit and 16-bit precision

### Motivation and Context
Since QNN HTP doesn't support FusedMatMul directly, it is decomposed into QNN HTP-supported operators to improve the inference time of customer models containing the FusedMatMul operator.

(cherry picked from commit 46e8d45)
### Description
This test seems to be flaky and fails the `Linux QNN CI Pipeline`. Disabling this test until I figure out the root cause of the inaccuracy.

### Motivation and Context

(cherry picked from commit 5e7e7a3)
# Fix Doxygen documentation build errors from recent PRs

Fixes multiple Doxygen errors introduced by recent API additions that
cause the nightly documentation build to fail (`WARN_AS_ERROR=YES`).

## Root Cause Analysis

| Error | File | Line | Introduced By | Commit | Fix |
|-------|------|------|---------------|--------|-----|
| Duplicate `\addtogroup Global` | onnxruntime_c_api.h | 973 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Remove redundant group markers |
| Unresolved `::SetSessionLogSeverityLevel()` | onnxruntime_c_api.h | 1065 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::SetSessionLogSeverityLevel` |
| Unresolved `::RunOptionsSetRunLogSeverityLevel()` | onnxruntime_c_api.h | 1066 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::RunOptionsSetRunLogSeverityLevel` |
| `<ep_name>` interpreted as HTML | onnxruntime_c_api.h | 1119 | PR #26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[in] importer` not found | onnxruntime_c_api.h | 7982 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` (macro expands to `input`) |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8025 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8091 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| Unresolved `::CreateLoopKernel()` | onnxruntime_ep_c_api.h | 667 | PR #26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateLoopKernel` |
| Unresolved `::CreateScanKernel()` | onnxruntime_ep_c_api.h | 710 | PR #26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateScanKernel` |
| `<ep_name>` interpreted as HTML | onnxruntime_ep_c_api.h | 1434 | PR #26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[out] out` not found | onnxruntime_ep_c_api.h | 1440 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `\param[out] config_entries` |

## Summary by PR

| PR | Issues |
|----|--------|
| **#26828** (c54be3c) - OrtExternalResourceImporter API for D3D12 | Duplicate Doxygen group, incorrect `\param` names for `ORT_CLASS_RELEASE` macros |
| **#26927** (1ed8fd9) - Control flow kernels API | `::Method()` syntax unresolvable by Doxygen |
| **#26971** (3874516) - CreateEnvWithOptions API | `::Method()` syntax, `<ep_name>` HTML interpretation, incorrect param name |

## Technical Details

### `ORT_CLASS_RELEASE` Macro Issue

The `ORT_CLASS_RELEASE(X)` macro at line 164 expands to:
```cpp
void(ORT_API_CALL * Release##X)(_Frees_ptr_opt_ Ort##X * input)
```

The parameter is always named `input`, but the documentation in PR
#26828 used semantic names like `importer` and `handle`. Doxygen
validates `\param` names against actual parameter names in the expanded
code.

### Doxygen Link Resolution

Doxygen 1.9.8 cannot resolve `::MethodName()` as a link to a method. The
correct syntax is to qualify with the struct name: `OrtApi::MethodName`.
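A hypothetical before/after illustration of the two most common fixes (struct-qualified links and escaped angle brackets); the actual comment text in the headers differs:

```cpp
// Before: Doxygen 1.9.8 cannot resolve the bare ::Method() reference, and
// <ep_name> is treated as an unknown HTML tag under WARN_AS_ERROR=YES.
/** \brief See ::SetSessionLogSeverityLevel. Entries may be prefixed with <ep_name>. */

// After: qualify the link with the owning struct and escape the angle brackets.
/** \brief See OrtApi::SetSessionLogSeverityLevel. Entries may be prefixed with \<ep_name\>. */
```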

## Testing

Verified locally with Doxygen 1.9.8 (matches CI configuration).

(cherry picked from commit 39f966e)
### Description
Fixes a build error in the "dump node inputs and outputs" build option.

(cherry picked from commit 8a7a797)
### Description
Fix GPU JAR testing

### Motivation and Context
Testing JAR for GPU was missing libcustom_library.so on Linux.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 01f40e6)
@hariharans29 (Member) commented:

Can you please also take in 347b990?

hariharans29 and others added 5 commits January 22, 2026 10:54
…ng (#27026)

### Description
As title

### Motivation and Context
Keep CI check happy

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit b6ed7f3)
### Description
As title - it looks like the duration of the job is very close to the
timeout

### Motivation and Context
Reduce retry attempts for the iOS simulator job.

My own PR (#26688) keeps timing out on this job leg.

(cherry picked from commit d7daa45)
### Description

upgrade emsdk to 4.0.23 from 4.0.21

### Motivation and Context

This version fixes a problem that breaks the build under Windows when using emscan-deps.bat.

(cherry picked from commit 7e1d818)
…el metadata (#27015)

### Description
This change proposes a new helper ORT API for callers that need to
extract the model compatibility string from a precompiled model.

### Motivation and Context
See #25749 for more background on the model compatibility concept and
infrastructure; #25841 provides a related helper API for an application
to call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without creating a session (which some callers may prefer to avoid) or taking a dependency on a separate library to parse the model's protobuf (which, again, callers may prefer to avoid).

This change proposes a separate helper API which can be used to retrieve
the compatibility info string, thereby avoiding session creation or an
external dependency. This does incur some redundant work in that the model protobuf will be parsed again during session creation, but for some callers this tradeoff may be acceptable.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
(cherry picked from commit f481b17)
### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in #26688 but the
reference to the old name was still in a new file added at around the
same time in #26838.
The CI doesn't include building for this configuration yet - it will be
added after the 1.24 release.

### Motivation and Context
Fixes failing mainline build on Arm64 linux when
`--enable_arm_neon_nchwc` is supplied.

### Testing
This now passes on Arm64 linux
`./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`

(cherry picked from commit 347b990)
@hariharans29 (Member) left a comment:

LGTM for my side. Thanks.

@adrianlizarraga (Contributor) left a comment:

Primarily checked QNN EP PRs, OpenVINO EP PR, #26781, and #27015

@tianleiwu tianleiwu merged commit fe30e5c into rel-1.24.0 Jan 23, 2026
88 of 92 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.0_cherrypick_round_1 branch January 23, 2026 00:34