Conversation

@tianleiwu (Contributor) commented on Jan 21, 2026

Description

This PR cherry-picks the following changes for the 1.24.0 release.

Cherry-picked Commits

| Commit | Commit Title | Author |
|--------|--------------|--------|
| 744e7fe | Add type definitions, registration, utilities for INT2/UINT2 support (#26824) | vraspar |
| 530a1fb | [QNN EP] Add BFloat16 dtype support in QNN EP (#26987) | tirupath-qti |
| 8e050d1 | Implement new experimental lookup-based matrix multiplication method(TMAC) (#26695) | vraspar |
| 2d2ba6b | [MLAS/CPU EP] Improve performance of Silu activation path within the QuickGelu CPU kernel (#26753) | Hariharan Seshadri |
| 1c02b79 | [QNN EP] Add support for handling 0-dimension for Concat Op (#27000) | Ashwath Shankarnarayan |
| cc2b01b | Fix ClipQuantFusion crash when Clip has multiple input edges (#27016) | Edward Chen |
| bbd3850 | [QNN EP] Support quantized BatchNorm with per-channel DQ params on QNN HTP (#26959) | qti-yuduo |
| d8f0318 | Add API to get ep graph partitioning info (#26781) | Adrian Lizarraga |
| b912b18 | [OVEP] OpenVINO EP Features and bug-fixes for ORT-1.24 - Follow up (#27007) | Preetha Veeramalai |
| ba11af4 | [QNN-EP] Add MatMulNBits translation for GPU (#26340) | quic-tirupath |
| c03c419 | [MLAS/NEON] Add dedicated kernel for depthwise convolution for ARM64 using NEON intrinsics (#26688) | Hariharan Seshadri |
| e7dfd69 | [QNN-EP] Support alternate Layernorm fusion pattern in QNN preprocess (#26060) | qti-mattsinc |
| 4013dc1 | Implement multithreading in qgemm_kleidi (#26301) | Melike Kaptan |
| 9f06181 | [CXX] Enable users to specify custom OrtSyncStream via RunOptions (#26988) | Dmitri Smirnov |
| cfccd64 | Added support for QMX kernels in MLAS (#26849) | qti-vaiskv |
| 29d9b2f | Tweak external resource importer handle structs (#27040) | Scott McKay |
| 9d108d0 | [QNN EP] Add QuickGELU operator support for QNN provider (#27034) | tirupath-qti |
| b35688f | Add INT2 and UINT2 support for QDQ, transpose and cast ops (#27022) | vraspar |
| 6d34aba | Introducing BF16 Pointwise NCHWc Convolution for Arm64 (#26838) | Rohanjames1997 |
| 36017ad | [EP ABI] Add CreateCustomOpDomains() API for plugin EP to register custom ops (#27050) | Chi Lo |
| 50a03e4 | Add a new pipeline for CUDA 13 nuget builds (#27023) | eserscor |
| a0d4439 | [EP ABI] Update Graph_GetGraphView() implementation (#26711) | Chi Lo |
| 34bb209 | [webgpu] Fix a bug for im2col (#27069) | Wenqin Yang |
| 46e8d45 | [QNN EP] Add FusedMatMul operator support (#27044) | tirupath-qti |
| 5e7e7a3 | Disable Float32_2Bits_Asymmetric_256x256 test (#27046) | vraspar |
| 39f966e | Fix Doxygen documentation build error in onnxruntime_c_api.h (#27083) | Nick Eubanks |
| 8a7a797 | Print tensor for new packed type of 2 bits (#27064) | Tianlei Wu |
| 01f40e6 | Fix GPU JAR testing on Linux (#27011) | eserscor |
| b6ed7f3 | Fix warning around ununsed code in QNN Android Emulator builds by clang (#27026) | Hariharan Seshadri |
| d7daa45 | Raise the timeout for the ios simulator job (#27045) | Hariharan Seshadri |
| 7e1d818 | upgrade emsdk to 4.0.23 (#27029) | Yulong Wang |
| 347b990 | Fix failing mainline build on Arm64 linux (#27101) | Rohanjames1997 |
| f481b17 | Add dedicated API to support extracting compatibility string from model metadata (#27015) | adrastogi |

vraspar and others added 28 commits January 21, 2026 12:44
…26824)

### Description
This is the first PR towards adding 2-bit support to ORT.

I will create follow-up PRs to:
- Add CPU operators (Cast, QDQ, non-compute ops)

### Motivation and Context
- ONNX recently added 2 bit support
onnx/onnx#7446

(cherry picked from commit 744e7fe)
### Description
- The QNN NPU backend supports the BFloat16 dtype for many operators.
- QNN EP adds a new session option, "htp_bf16_enable", that lets users request processing of the Float32 graph in BFloat16 precision (see the sketch after this list).
- When the user specifies "htp_bf16_enable", the QNN EP lowers the incoming Float32 ORT graph into a BFloat16 QNN graph.
  - The ORT CPU fallback still receives Float32 partitions.
- The lowered QNN graph still accepts float32 inputs, outputs and constant initializers. The QNN EP inserts Cast operators to perform the necessary precision switch.
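A minimal, hypothetical sketch of opting in from the C++ API, assuming "htp_bf16_enable" is passed as a QNN EP provider option alongside the usual backend path (the exact plumbing may differ in the released build):

```cpp
#include <onnxruntime_cxx_api.h>
#include <string>
#include <unordered_map>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn_bf16");
  Ort::SessionOptions session_options;

  // Assumed provider options: "backend_path" selects the HTP backend,
  // "htp_bf16_enable" is the new option described in this PR.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "QnnHtp.dll"},
      {"htp_bf16_enable", "1"},
  };
  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Unsupported partitions still fall back to the CPU EP in Float32.
  Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
  return 0;
}
```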

### Motivation and Context
- This enables running accuracy-sensitive float32 models in bfloat16 precision on the Qualcomm NPU accelerator, improving inference time relative to float32 computation.

---------

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
(cherry picked from commit 530a1fb)
…TMAC) (#26695)

### Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the [T-MAC paper](https://arxiv.org/abs/2407.00088) and [T-MAC repository](https://github.com/microsoft/T-MAC), to speed up low-bit LLM inference.

Unlike the existing quant-dequant methods, the LUT-based method directly
supports mixed-precision-GEMM without dequantization. It uses bit-wise
table lookup to eliminate multiplications and reduce additions required
in matrix multiplication.

<img width="1910" height="759" alt="image"
src="https://github.com/user-attachments/assets/3e3f2ced-eba4-4d4e-a63c-fec479943202"
/>

This PR:
- Adds the `mlas.use_lut_gemm` session option, allowing use of LUT GEMM inside MatMulNBits when it is available (2-bit, BlkLen multiple of 32, K multiple of 32, N multiple of 128, AVX2 present).
- Introduces LUT packing + kernel config cache (packs bitplanes, scales,
ZP) and the main `MlasLUTGemm` entry that generates per-row LUTs and
calls the AVX2 kernel.
- Implements AVX2 LUT generation `GenerateLUT_avx2` and GEMM compute
`TMACComputeGemm_avx2` and wires dispatch in MLAS platform init.
- Updates MatMulNBits PrePack/Compute to use LUT packing/compute when
opted-in; keeps existing quant-dequant path as fallback.
- Extends Python quant bindings with 2-bit QDQ helper for parity with
the new path.
- Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric
quant and multiple shapes/block sizes.

### Main components

- `MlasInitLUTGemmKernelConfig`: config for the LUT kernels
- `MlasLUTGemmPackQuantBData`: pre-packing of quantized weights
- `MlasLUTPackScalesAndZeroPoints`: pre-packing of quantized scales and zero points
- `MlasLUTGemm`: main entry point
- `GenerateLUT_avx2`: LUT construction from activations
- `TMACComputeGemm_avx2`: AVX2 LUT GEMM kernel
- Session option: `mlas.use_lut_gemm`

### How to test
- MLAS LUT GEMM unit tests: see `test_sqlutgemm.cpp`
- Run MatMulNBits models with the session option `mlas.use_lut_gemm=1` on AVX2 machines; expect fallback to the existing path if availability checks fail (see the sketch below).
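A minimal, hypothetical sketch of opting in from the C++ API; it assumes the option is exposed as a session config entry (like other "mlas." options) and relies on ORT falling back to the existing quant-dequant path when the availability checks fail:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "lut_gemm");
  Ort::SessionOptions session_options;

  // Opt in to the experimental LUT GEMM path for 2-bit MatMulNBits.
  session_options.AddConfigEntry("mlas.use_lut_gemm", "1");

  Ort::Session session(env, ORT_TSTR("matmulnbits_model.onnx"), session_options);
  return 0;
}
```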

### Perf
Focus of this PR is functional + kernel bring-up; perf to be reported
separately once broader profiling is done.

### Future Work
- Support MLFloat16 (FP16 scales and zero points)
- Add a NEON kernel for ARM.
- Add kernels for 4-bit weights and BitNet kernels.
- Broader batch (N>1) support and additional shape coverage.

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: Liqun Fu <liqun.fu@microsoft.com>
Co-authored-by: carzh <wolfivyaura@gmail.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: carzh <carolinezhu@microsoft.com>
Co-authored-by: Vrajang Parikh <vrparikh@microsoft.com>
(cherry picked from commit 8e050d1)
…QuickGelu CPU kernel (#26753)

### Description

The `Silu` activation is essentially the same as `QuickGelu` with the scaling factor (`alpha`) set to 1. In customer models containing `Silu`, the graph optimizer suite correctly fuses the nodes into a QuickGelu with alpha = 1. This change optimizes the implementation of QuickGelu when alpha = 1 by avoiding the scaling and vectorizing the subsequent elementwise multiplication (see the sketch below).
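A minimal scalar sketch of the idea (not the actual vectorized MLAS kernel): QuickGelu computes `x * sigmoid(alpha * x)`, and when alpha == 1 (i.e. Silu) the extra multiply by alpha can be skipped:

```cpp
#include <cmath>
#include <cstddef>

// Scalar illustration only; the real kernel vectorizes the elementwise work.
void QuickGeluScalar(const float* x, float* y, std::size_t n, float alpha) {
  const bool is_silu = (alpha == 1.0f);
  for (std::size_t i = 0; i < n; ++i) {
    const float scaled = is_silu ? x[i] : alpha * x[i];  // skip scaling when alpha == 1
    y[i] = x[i] / (1.0f + std::exp(-scaled));            // x * sigmoid(scaled)
  }
}
```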

**Tests:**
There are already tests for QuickGelu with alpha = 1, so no new tests are necessary
(https://github.com/microsoft/onnxruntime/blob/f98c756b45b81520c6e2a09c370575a013f02cce/onnxruntime/test/contrib_ops/activation_op_test.cc#L126)

**Performance improvements measured:**
Gives about 2.5% throughput boost for a customer model that has a lot of
Silu activations.

### Motivation and Context
Some low hanging fruit perf improvements that give instant easy perf
wins

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 2d2ba6b)
### Description
- Added dedicated ConcatOpBuilder to handle tensors with 0-dimension
- Added unit tests for testing 0-dim inputs to Concat
- Removed existing logic for concat from base op builder

### Motivation and Context
- Currently, 0-dim inputs for Concat are not handled in QNN EP

(cherry picked from commit 1c02b79)
## Motivation

`ClipQuantFusion` in `clip_quantizelinear.cc` calls `graph_utils::RemoveNode()` without first checking `graph_utils::CanRemoveNode()`. When a Clip node has min/max inputs from DequantizeLinear nodes (instead of initializers), it has multiple input edges, and `RemoveNode()` throws an exception:

```
 [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: graph_utils.cc:650 bool onnxruntime::graph_utils::RemoveNode(onnxruntime::Graph&, onnxruntime::Node&) Should be unreachable if CanRemoveNodeAndMergeEdges is in sync with the logic here.
```

## Fix:
Added `CanRemoveNode()` check to `ClipQuantFusion::SatisfyCondition()`
to skip nodes that cannot be safely removed.

## Test:
Added `ClipQuantFusion_MultipleInputEdges` test that creates a Clip node
with min from a DQ node (2 input edges) and verifies the optimizer
doesn't crash.

## Note:
This PR is a duplicate of #26923 which GitHub is preventing us from
merging. Credit goes to original author @qti-yuduo.

---------

Co-authored-by: Yuduo Wu <yuduow@qti.qualcomm.com>
(cherry picked from commit cc2b01b)
…N HTP (#26959)

## Motivation:
QNN HTP was rejecting quantized BatchNorm models where parameters
(scale, mean, var) come through DequantizeLinear nodes with per-channel
INT8 quantization. This pattern is common in quantized models from
quantization tools.

## Changes:

- Helpers to resolve BatchNorm params through DQ nodes to their
underlying initializers
- Support per-channel dequantization for BatchNorm parameters
- Support input datatype of UFIXED_POINT_16
- Add unit test covering this QDQ params configuration

(cherry picked from commit bbd3850)
### Description
- Adds API functions to get information about the subgraphs/nodes
assigned to the EPs in the session.
- `Session_GetEpGraphAssignmentInfo`: Returns a list of "subgraphs",
each with information about the assigned EP and nodes.
- Note: App must enable session configuration
`"session.record_ep_graph_assignment_info"` to signal ORT to collect
this information. If not enabled, API returns empty results.
- `EpAssignedSubgraph_GetEpName`: Returns the name of the EP to which
the subgraph is assigned
  - `EpAssignedSubgraph_GetNodes`: Returns a list of assigned nodes
  - `EpAssignedNode_GetName`: Returns the assigned node's name
  - `EpAssignedNode_GetDomain`: Returns the assigned node's domain
- `EpAssignedNode_GetOperatorType`: Returns the assigned node's operator
type
- Also adds C++ and Python bindings

#### Structure of returned information
The API returns a list of "subgraphs". Each subgraph has the following
information:
- Subgraph info:
- EP name: The name of the execution provider to which this subgraph is
assigned.
- nodes: Name and operator type of each node. Ex: `[{"multiply", "Mul"},
...]`

Python example program (taken from unit tests):
```python
    def test_get_graph_provider_assignment_info(self):
        """
        Tests querying for information about the nodes assigned to the CPU EP.
        """

        # Create session options that enables recording EP graph partitioning info.
        session_options = onnxrt.SessionOptions()
        session_options.add_session_config_entry("session.record_ep_graph_assignment_info", "1")

        session = onnxrt.InferenceSession(get_name("add_mul_add.onnx"), sess_options=session_options)

        # Query session for information on each subgraph assigned to an EP.
        ep_subgraphs = session.get_provider_graph_assignment_info()

        # Check that all 3 nodes are assigned to CPU EP (each in its own subgraph)
        self.assertEqual(len(ep_subgraphs), 3)
        for ep_subgraph in ep_subgraphs:
            self.assertEqual(ep_subgraph.ep_name, "CPUExecutionProvider")
            self.assertEqual(len(ep_subgraph.get_nodes()), 1)

        # Serialize each node to an identifier (concatenates operator type and node name)
        node_ids: list[str] = [f"{n.op_type}/{n.name}" for s in ep_subgraphs for n in s.get_nodes()]

        # Should have 1 Mul and 2 Adds.
        self.assertEqual(len(node_ids), 3)
        self.assertIn("Add/add_0", node_ids)
        self.assertIn("Add/add_1", node_ids)
        self.assertIn("Mul/mul_0", node_ids)
```

C++ program (taken from unit test):
```c++
  // Check the ep graph partitioning (Mul on plugin EP, others on CPU EP).
  // Model has 3 subgraphs (in no particular order):
  // - Subgraph 1: Add assigned to CPU EP.
  // - Subgraph 2: Mul assigned to plugin EP.
  // - Subgraph 3: Add assigned to CPU EP.
  std::vector<Ort::ConstEpAssignedSubgraph> ep_subgraphs = session.GetEpGraphAssignmentInfo();
  ASSERT_EQ(ep_subgraphs.size(), 3);

  for (Ort::ConstEpAssignedSubgraph ep_subgraph : ep_subgraphs) {
    std::string ep_name = ep_subgraph.EpName();
    ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name || ep_name == kCpuExecutionProvider);

    const std::vector<Ort::ConstEpAssignedNode> ep_nodes = ep_subgraph.GetNodes();
    ASSERT_GE(ep_nodes.size(), 1);  // All of these subgraphs just have one node.

    if (ep_name == kCpuExecutionProvider) {
      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();

      ASSERT_EQ(op_type, "Add");
      ASSERT_TRUE(node_name == "add_0" || node_name == "add_1");
    } else {
      ASSERT_TRUE(ep_name == Utils::example_ep_info.ep_name);

      std::string op_type = ep_nodes[0].OpType();
      std::string node_name = ep_nodes[0].Name();
      ASSERT_EQ(op_type, "Mul");
      ASSERT_EQ(node_name, "mul_0");
    }
  }
```

### Motivation and Context

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit d8f0318)
…27007)

### Description
This PR refines OpenVINO EP backend execution and input validation,
improving reshape handling, symbolic vs dynamic dimension checks, and
execution consistency. It also adds explicit support for
stateful/KV-cache inference by introducing cache index tracking,
validation, and reset logic across backend, context, and interface
layers, with corresponding test updates.

---------

Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: Yaru Du <yaru.du@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: Dvoretckii, Mikhail <mikhail.dvoretckii@intel.com>
Co-authored-by: Pallavi Gupta <pallavi.gupta@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jianhui Dai <jianhui.j.dai@intel.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Fei Chen <feich@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Akupadhye <aupadhye@qti.qualcomm.com>
Co-authored-by: Wang Ning <ning4.wang@intel.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: George Wu <jywu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Wanming Lin <wanming.lin@intel.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
Co-authored-by: Jie Chen <jie.a.chen@intel.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: quic-hungjuiw <quic_hungjuiw@quicinc.com>
Co-authored-by: Ian Hunter <ianfhunter@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Jeff Kilpatrick <jkilpatrick@qti.qualcomm.com>
Co-authored-by: Jeff Kilpatrick <jkilpat@qti.qualcomm.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Nenad Banfic <46795300+nenad1002@users.noreply.github.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
(cherry picked from commit b912b18)
### Description
Add support for translating the MatMulNBits contrib op to QNN's FullyConnected operation with INT4 BlockQuantized weights.

Implementation details:
 - Translate MatMulNBits to FullyConnected in OpBuilder
 - Support QNN_QUANTIZATION_ENCODING_BLOCK for INT4 weights
- Pass INT4 weights and quant params as BlockQuantization encoding
params in QNN

Testing:
 - Added new unit tests for MNB -> QNN-GPU
 - Validated all OnnxRuntime tests
- Validated the following LLMs through Olive and ORT-GenAI execution
flow
   - LlaMA3.2 1B
   - Qwen2.5
   - DeepSeek-R1-Qwen 1.5b
   - Phi3.5-mini-instruct

### Motivation and Context
LLMs that go through the INT4 quantization pass in Olive produce a model with MatMulNBits contrib ops. To run these ops via QNN-EP, MatMulNBits is translated to the QNN FullyConnected op with INT4 weights.

---------

Co-authored-by: tirupath-qti <tirupath@qti.qualcomm.com>
(cherry picked from commit ba11af4)
…using NEON intrinsics (#26688)

### Description

**Motivation and approach taken:**

Add a dedicated depthwise convolution kernel for the most common depthwise convolution configuration (3x3 filter, stride = 1, pad <= 1, dilation = 1) using NEON intrinsics. This does significantly better than the current approach of `Im2Col + SGemm`: the Im2Col step wastefully extracts convolution patches, and for a 3x3 filter K would only be 9 for the SGemm, while GEMMs are usually not optimized for such small `K` values. Hence, a dedicated kernel works much better.

Initially, I ported over the Winograd-based NEON-accelerated depthwise convolution kernel from PyTorch, but I found that its performance is not very good. Its poor performance is probably due to applying the Winograd transformation to the filter repeatedly. A better approach may be to transform the filter offline, and that can be considered later (I reverted the PyTorch Winograd implementation in this commit: 2820a84).

The current depthwise kernel added in this PR was authored by GPT5.1-Codex; with some minor bug fixes, it now seems to be functionally correct and provides the perf boost we are seeking.

**Unit tests:**
Depthwise convolution tests already exist in the codebase. I don't see a need for new ones at this point.

**Kernel benchmarking:**
This is the kernel-level perf improvement from the MLAS Conv benchmarks (about a 50% kernel latency improvement):

<img width="1055" height="90" alt="image"
src="https://github.com/user-attachments/assets/ead9eb83-2d62-4157-a065-70c67c8c7517"
/>

### Motivation and Context
A key customer model had a few depthwise convolution operations, and this change provides a **non-negligible ~3% throughput improvement** using the customer-provided benchmarking setup.

For those interested, #26654 adds support for the same type of convolution variant but leverages SME1/SME2 through KleidiAI. This PR is conceptually the same but targets NEON-only platforms.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit c03c419)
…#26060)

### Description
Small change allowing QNN Preprocess to accept a Mul node (with A=B) instead of a Pow node (with Y=2) for LayerNorm fusion.

(cherry picked from commit e7dfd69)
**Key changes**

This PR improves the performance of dynamic QGEMMs by implementing tiling and threading across operations.

The changes introduce thread-local buffers for reusing memory during inference and use them in dynamic quantized MatMul operations backed by KleidiAI kernels.

It also updates the KleidiAI version to 1.15.0.

**Example performance**

single thread :
<img width="2100" height="900"
alt="ort_ops_compare_encoder_1_2025-10-02_17-21-32_vs_encoder_1_2025-10-02_16-54-55"
src="https://github.com/user-attachments/assets/c23c808d-5fab-4995-997e-a57a66a23d68"
/>

2 threads :
<img width="2100" height="900"
alt="ort_ops_compare_encoder_2_2025-10-02_17-21-47_vs_encoder_2_2025-10-02_16-55-13"
src="https://github.com/user-attachments/assets/31a0eb7a-7ff4-40c9-9425-b70231f131e8"
/>

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
(cherry picked from commit 4013dc1)
…6988)

### Description
Enable device stream override using RunOptions for a particular run.
The stream is restored after Run() completes.

### Motivation and Context
GPU interop requirements.
When enabled, inference runs with the specified stream, with proper synchronization against imported external synchronization facilities.

(cherry picked from commit 9f06181)
Supported operations with QMX: SGEMM, QGEMM, Convolution

(cherry picked from commit cfccd64)
### Description
Use the descriptor struct in the external resource handle.

We were copying most fields, but the setup is a little more intuitive
when the descriptor is used directly.

### Motivation and Context

(cherry picked from commit 29d9b2f)
### Description
Add support for the QuickGELU operator in the QNN provider:
- Implement QuickGeluOpBuilder to handle QuickGELU operations
- Add registration for QuickGELU in op_builder_factory
- Add comprehensive tests for CPU and HTP backends
- Support both float and quantized (QDQ) versions

### Motivation and Context
- QNN doesn't have a direct operator that maps to QuickGelu, so it is decomposed as x * sigmoid(alpha * x), allowing the whole model to run on HTP and improving inference time.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 9d108d0)
## Description

This PR adds a BF16 (bfloat16) pointwise convolution kernel for ARM64
NCHWc format, leveraging the existing SBGEMM infrastructure. When the
`mlas.enable_gemm_fastmath_arm64_bfloat16` session option is enabled on
supported ARM64 Linux hardware, Pointwise Conv is rerouted to use this
BF16 implementation. This is an opt-in feature, similar to how BF16
matmul is opt-in.
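A minimal sketch of opting in from the C++ API (equivalent to the `-C` flag used with onnxruntime_perf_test below); the config key is the existing session config entry reused by this PR:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "bf16_nchwc_conv");
  Ort::SessionOptions session_options;

  // Existing opt-in fastmath key; with this PR it also reroutes pointwise
  // NCHWc convolutions through the BF16 SBGEMM path on supported ARM64 hardware.
  session_options.AddConfigEntry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1");

  Ort::Session session(env, ORT_TSTR("mobilenet.onnx"), session_options);
  return 0;
}
```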

Added a bool ZeroMode field to `MLAS_SBGEMM_DATA_PARAMS` (default `true`
for backward compatibility) to enable per-batch control over output
accumulation. This mirrors the beta parameter in FP32's `MlasGemmBatch`
and is required for Pointwise convolutions with >128 input channels,
where multiple GEMM calls must accumulate into the same output buffer.

## Motivation and Context

The existing `mlas.enable_gemm_fastmath_arm64_bfloat16` session option
accelerates MatMul operations on ARM64 processors with BF16 support, but
convolution operations did not benefit from this optimization. Pointwise
convolutions (1x1 kernels) are essentially batched matrix
multiplications.

This change extends the BF16 fastmath optimization to pointwise NCHWc
convolutions, reusing the same session option. The implementation
mirrors the FP32 pointwise kernel structure while delegating the actual
computation to SBGEMM, ensuring correctness and maintainability.

## Performance improvement
Measured a 15-20% gain on Mobilenet inference on an AWS Graviton4
instance.

Before (FP32)
```
/build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|0" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 559.154
```

After (BF16)
```
./build/Linux/Release/onnxruntime_perf_test -C "mlas.enable_gemm_fastmath_arm64_bfloat16|1" -x 32 -I -m times -r 2000 ~/scripts/mobilenet.onnx

Number of inferences per second: 651.221

```
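For reference, 651.221 / 559.154 ≈ 1.165, i.e. roughly a 16% throughput gain, consistent with the 15-20% range stated above.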

(cherry picked from commit 6d34aba)
…stom ops (#27050)

### Description

The two newly added APIs, `CreateCustomOpDomains()` and `GetNumCustomOpDomains`, are used when running inference on a model that contains EP-specific custom operations.

 Workflow:
1. The EP implements these functions to supply a list of
`OrtCustomOpDomain` instances.
2. The application either 1) calls
`SessionOptionsAppendExecutionProvider_V2()` with an `OrtEpDevice`
containing
     the plugin EP's factory or 2) enables auto ep selection.
3. ORT then either 1) appends the provided OrtCustomOpDomains to the session options (in `SessionOptionsAppendExecutionProvider_V2()`) or 2) registers the OrtCustomOpDomains from the selected EP devices.

As a result, any session created from these session options will have
these custom op domains registered
in ORT, ensuring that the custom ops are properly recognized and
validated when the model is loaded.

Plugin EPs can provide two types of custom ops:
   1. A full OrtCustomOp with a concrete kernel implementation
      - This Example EP demonstrates this approach.
- In GetCapability(), it calls EpGraphSupportInfo_AddSingleNode() to
inform ORT
that the custom node should NOT be fused or compiled. Instead, ORT
should invoke
        the custom node's Compute() function at runtime.

   2. A "placeholder" OrtCustomOp with an empty kernel implementation
- A compile-based Plugin EP can supply an OrtCustomOp whose
CustomKernel::Compute()
does nothing. The purpose is to satisfy model validation during model
loading by
        registering the custom op as a valid operator in the session.
- In GetCapability(), the EP should call
EpGraphSupportInfo_AddNodesToFuse() to
notify ORT that this custom node should be fused and compiled by the EP.
- In Compile(), the EP executes its compiled bits to perform inference
for
      the fused custom node.

### Motivation and Context

Currently, the provider-bridge TRT RTX EP and TRT EP support registering a custom op domain list in the session options so that they can run models containing TRT-specific custom ops.

This PR adds the same feature for plugin EP.

(cherry picked from commit 36017ad)
### Description

Adds new pipeline for CUDA 13 Nuget builds

### Motivation and Context

The artifacts from this pipeline will be used by the release pipeline to
publish the nuget packages to our public feed.

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: eserscor <247253654+eserscor@users.noreply.github.com>
(cherry picked from commit 50a03e4)
### Description

This PR mainly modifies the following:

- Update the `Graph_GetGraphView()` implementation.
- Make sure EpGraph maintains the min/max node index, so that when querying a node outside that range, it can return null.
- Provide an option to create an EpGraph that contains its parent node when the graph is the subgraph of a control flow op.

#### Update Graph_GetGraphView() implementation
In some cases, e.g. when the model has a node whose output is consumed by multiple nodes, calling the current implementation of `Graph_GetGraphView()` to get a subgraph returns an incorrect `OrtGraph`.

- Original graph:

<img width="414" height="356" alt="image"
src="https://github.com/user-attachments/assets/739c092d-0880-4f6e-9351-e08e0e141b35"
/>

- Incorrect graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph. `topk_indices` is the output of `TopK` and shouldn't be added as a graph input as shown in the graph below. The API implementation has an issue handling this case: if we feed this subgraph into the TRT parser, it fails to parse the graph.

<img width="349" height="341" alt="image"
src="https://github.com/user-attachments/assets/1306e22c-7c5d-45a2-bc18-6864fa2966ba"
/>

- Correct graph after calling `Graph_GetGraphView()` to get the
subgraph:

  It includes three of the nodes from the original graph. `topk_indices` is now not added as a graph input; instead, it is added as a graph output, which is expected because `Mod` is in another subgraph that consumes it, so this subgraph has to expose `topk_indices` as a graph output.

<img width="413" height="350" alt="image"
src="https://github.com/user-attachments/assets/b9135690-a341-41b2-9495-184030ab5cff"
/>

### Motivation and Context

(cherry picked from commit a0d4439)
### Description

This PR fixes a bug in `im2col` related to `pads` in some dimensions.

### Motivation and Context

(cherry picked from commit 34bb209)
### Description
Add support for the FusedMatMul operator in the QNN execution provider.
 FusedMatMul is a contrib operator in the Microsoft domain that performs
a fused matrix multiplication with optional bias addition and
activation.

Implementation details:
- Added FusedMatMulOpBuilder class that decomposes FusedMatMul into:
  1. MatMul operation
  2. Optional bias addition
  3. Optional activation (Relu, Sigmoid, Tanh, Gelu)
- Handles various attributes: transA, transB, alpha, and activation
- Supports higher rank tensors and different data types

Added comprehensive tests:
- Basic functionality tests with various configurations
- Tests for both CPU and HTP backends
- QDQ (Quantize-Dequantize) tests for 8-bit and 16-bit precision

### Motivation and Context
Since QNN HTP doesn't support FusedMatMul directly, it is decomposed into QNN HTP-supported operators to improve the inference time of customer models containing the FusedMatMul operator.

(cherry picked from commit 46e8d45)
### Description
This test seems to be flaky and fails the `Linux QNN CI Pipeline`. Disabling this test until I figure out the root cause of the inaccuracy.

### Motivation and Context

(cherry picked from commit 5e7e7a3)
# Fix Doxygen documentation build errors from recent PRs

Fixes multiple Doxygen errors introduced by recent API additions that
cause the nightly documentation build to fail (`WARN_AS_ERROR=YES`).

## Root Cause Analysis

| Error | File | Line | Introduced By | Commit | Fix |
|-------|------|------|---------------|--------|-----|
| Duplicate `\addtogroup Global` | onnxruntime_c_api.h | 973 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Remove redundant group markers |
| Unresolved `::SetSessionLogSeverityLevel()` | onnxruntime_c_api.h | 1065 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::SetSessionLogSeverityLevel` |
| Unresolved `::RunOptionsSetRunLogSeverityLevel()` | onnxruntime_c_api.h | 1066 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `OrtApi::RunOptionsSetRunLogSeverityLevel` |
| `<ep_name>` interpreted as HTML | onnxruntime_c_api.h | 1119 | PR #26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[in] importer` not found | onnxruntime_c_api.h | 7982 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` (macro expands to `input`) |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8025 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| `\param[in] handle` not found | onnxruntime_c_api.h | 8091 | PR #26828 - OrtExternalResourceImporter API | c54be3c | Use `\param[in] input` |
| Unresolved `::CreateLoopKernel()` | onnxruntime_ep_c_api.h | 667 | PR #26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateLoopKernel` |
| Unresolved `::CreateScanKernel()` | onnxruntime_ep_c_api.h | 710 | PR #26927 - Control flow kernels API | 1ed8fd9 | Use `OrtEpApi::CreateScanKernel` |
| `<ep_name>` interpreted as HTML | onnxruntime_ep_c_api.h | 1434 | PR #26971 - CreateEnvWithOptions API | 3874516 | Escape as `\<ep_name\>` |
| `\param[out] out` not found | onnxruntime_ep_c_api.h | 1440 | PR #26971 - CreateEnvWithOptions API | 3874516 | Use `\param[out] config_entries` |

## Summary by PR

| PR | Issues |
|----|--------|
| **#26828** (c54be3c) - OrtExternalResourceImporter API for D3D12 | Duplicate Doxygen group, incorrect `\param` names for `ORT_CLASS_RELEASE` macros |
| **#26927** (1ed8fd9) - Control flow kernels API | `::Method()` syntax unresolvable by Doxygen |
| **#26971** (3874516) - CreateEnvWithOptions API | `::Method()` syntax, `<ep_name>` HTML interpretation, incorrect param name |

## Technical Details

### `ORT_CLASS_RELEASE` Macro Issue

The `ORT_CLASS_RELEASE(X)` macro at line 164 expands to:
```cpp
void(ORT_API_CALL * Release##X)(_Frees_ptr_opt_ Ort##X * input)
```

The parameter is always named `input`, but the documentation in PR
#26828 used semantic names like `importer` and `handle`. Doxygen
validates `\param` names against actual parameter names in the expanded
code.

### Doxygen Link Resolution

Doxygen 1.9.8 cannot resolve `::MethodName()` as a link to a method. The
correct syntax is to qualify with the struct name: `OrtApi::MethodName`.
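A hypothetical before/after illustration of the two most common fixes (struct-qualified links and escaped angle brackets); the actual comment text in the headers differs:

```cpp
// Before: Doxygen 1.9.8 cannot resolve the bare ::Method() reference, and
// <ep_name> is treated as an unknown HTML tag under WARN_AS_ERROR=YES.
/** \brief See ::SetSessionLogSeverityLevel. Entries may be prefixed with <ep_name>. */

// After: qualify the link with the owning struct and escape the angle brackets.
/** \brief See OrtApi::SetSessionLogSeverityLevel. Entries may be prefixed with \<ep_name\>. */
```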

## Testing

Verified locally with Doxygen 1.9.8 (matches CI configuration).

(cherry picked from commit 39f966e)
### Description
Fixes a build error in the "dump node inputs and outputs" build option.

(cherry picked from commit 8a7a797)
### Description
Fix GPU JAR testing

### Motivation and Context
Testing JAR for GPU was missing libcustom_library.so on Linux.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 01f40e6)
@hariharans29 (Member) commented:

Can you please also take in 347b990?

hariharans29 and others added 5 commits January 22, 2026 10:54
…ng (#27026)

### Description
As title

### Motivation and Context
Keep CI check happy

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit b6ed7f3)
### Description
As title - it looks like the duration of the job is very close to the
timeout

### Motivation and Context
Reduce retry attempts for the iOS simulator job.

My own PR (#26688) keeps timing out on this job leg.

(cherry picked from commit d7daa45)
### Description

upgrade emsdk to 4.0.23 from 4.0.21

### Motivation and Context

This version fixes a problem that breaks the build under Windows when using emscan-deps.bat.

(cherry picked from commit 7e1d818)
…el metadata (#27015)

### Description
This change proposes a new helper ORT API for callers that need to
extract the model compatibility string from a precompiled model.

### Motivation and Context
See #25749 for more background on the model compatibility concept and
infrastructure; #25841 provides a related helper API for an application
to call to do a validation check using the compatibility info string.
However, there is no direct way to get to the model metadata without creating a session (which some callers may prefer to avoid) or taking a dependency on a separate library to parse the model's protobuf (which, again, callers may prefer to avoid).

This change proposes a separate helper API which can be used to retrieve
the compatibility info string, thereby avoiding session creation or an
external dependency. This does incur some redundant work in that the model protobuf will be parsed again during session creation, but for some callers this tradeoff may be acceptable.

---------

Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: adrastogi <8368026+adrastogi@users.noreply.github.com>
(cherry picked from commit f481b17)
### Description
`sconv.h` was renamed to `sconv_nchwc_kernel_neon.h` in #26688 but the
reference to the old name was still in a new file added at around the
same time in #26838.
The CI doesn't include building for this configuration yet - it will be
added after the 1.24 release.

### Motivation and Context
Fixes failing mainline build on Arm64 linux when
`--enable_arm_neon_nchwc` is supplied.

### Testing
This now passes on Arm64 linux
`./build.sh --config Release --build_shared_lib --parallel
--compile_no_warning_as_error --skip_submodule_sync --skip_tests
--enable_pybind --build_wheel --enable_arm_neon_nchwc`

(cherry picked from commit 347b990)
@hariharans29 (Member) left a comment:

LGTM for my side. Thanks.

@adrianlizarraga (Contributor) left a comment:

Primarily checked QNN EP PRs, OpenVINO EP PR, #26781, and #27015

@tianleiwu tianleiwu merged commit fe30e5c into rel-1.24.0 Jan 23, 2026
88 of 92 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.24.0_cherrypick_round_1 branch January 23, 2026 00:34