
Conversation

Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

quic-muchhsu and others added 18 commits August 20, 2025 09:36
### Description
Add qnn support for mod op when fmod = 0.

### Motivation and Context
QNN doesn't natively support the Mod op. This PR allows QNN to process the Mod op for the fmod=0 case.
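As a sketch of the semantics involved (illustrative only; `onnx_mod` is a hypothetical helper, not the QNN or ORT implementation): with fmod=0 the result follows Python's `%` operator and takes the sign of the divisor, while fmod=1 follows C's `fmod` and takes the sign of the dividend.

```python
import math

# Sketch of ONNX Mod semantics (illustrative, not the QNN implementation):
# fmod=0 follows Python's "%" (result takes the sign of the divisor),
# fmod=1 follows C's fmod (result takes the sign of the dividend).
def onnx_mod(a, b, fmod=0):
    if fmod:
        return math.fmod(a, b)
    return a - b * math.floor(a / b)  # same as Python's a % b for b != 0

print(onnx_mod(-7, 3))           # 2    -> sign of the divisor
print(onnx_mod(-7, 3, fmod=1))   # -1.0 -> sign of the dividend
```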

---------

Signed-off-by: Mu-Chein Hsu <[email protected]>
### Description
In PoolOpBuilder:
- Revise the shape check to use ORT macros.
- Fix the function invocation for 5D cases.

### Motivation and Context
Refer to microsoft#25778.
The pool builder incorrectly invoked a function that calculates a 4D shape on 5D input, even though the function originally expected 3D cases only. Moreover, the check used `assert` to validate the shape, which has no effect in Release or RelWithDebInfo builds.
### Description
Add QNN EP support for the ThresholdedRelu op.

### Motivation and Context
ThresholdedRelu wasn't previously supported.

Signed-off-by: Mu-Chein Hsu <[email protected]>
…ob (microsoft#25794)

### Description
<!-- Describe your changes. -->

Set iOS simulator runtime version to 18.5 in mac.yml iphone_simulator
job.

This job uses Xcode 16.4. According to this table, the corresponding
simulator SDK version is 18.5.

https://github.com/actions/runner-images/blob/da7977bf2699f44e70b7d3c3352dedb0da38db9c/images/macos/macos-15-arm64-Readme.md?plain=1#L181

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Address intermittent CI build timeouts.
### Description

Add a new API `Graph_GetModelMetadata`

### Motivation and Context
The VitisAI EP converts ONNX IR to another IR suitable for AMD AI compilers.
The metadata in an OrtModel contains much important information produced by other tools, e.g. Olive.

This API could also be used by many other execution providers that need to access the same information.
…osoft#25562)

### Description
<!-- Describe your changes. -->
Add a HardSwish operator, defined as x * HardSigmoid(x).
Add bf16 support for HardSigmoid.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
HardSwish is currently implemented as HardSigmoid followed by Mul in the CUDA EP.
A fused HardSwish should take roughly half the time of the two-kernel version.
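The fused formula can be sketched in plain Python (a minimal illustration; `hard_sigmoid` and `hard_swish` are hypothetical helpers, not the CUDA kernels):

```python
# Sketch of the fused op (illustrative, not the CUDA kernel).
# ONNX HardSigmoid(x) = clip(alpha * x + beta, 0, 1); HardSwish uses
# alpha = 1/6, beta = 0.5 and multiplies the result by x.
def hard_sigmoid(x, alpha=0.2, beta=0.5):
    return min(max(alpha * x + beta, 0.0), 1.0)

def hard_swish(x):
    return x * hard_sigmoid(x, alpha=1.0 / 6.0, beta=0.5)

print(hard_swish(3.0))   # 3.0   (HardSigmoid saturates at 1)
print(hard_swish(1.5))   # 1.125 (1.5 * (1.5/6 + 0.5))
```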

---------

Co-authored-by: kaiyu <[email protected]>
Co-authored-by: Copilot <[email protected]>
### Description

Fix build break caused by warning C4702: unreachable code.

```
onnxruntime\contrib_ops\webgpu\quantization\matmul_nbits.cc(95,1): error C2220: the following warning is treated
 as an error [C:\code\o3\build_main\Debug\onnxruntime_providers_webgpu.vcxproj]
onnxruntime\contrib_ops\webgpu\quantization\matmul_nbits.cc(95,1): warning C4702: unreachable code [C:\code\o3\b
uild_main\Debug\onnxruntime_providers_webgpu.vcxproj]
```

It seems the CI pipeline does not catch this warning.
### Description

Add a build flag to enable/disable mixed gemm cutlass kernel.

To disable the kernel, append the following at the end of the build
command line:
`--cmake_extra_defines onnxruntime_USE_FPA_INTB_GEMM=OFF`

### Motivation and Context

The FpA IntB GEMM kernels take a long time to compile. With this option, developers
can speed up the build, especially on build machines with limited memory.
* Implements `GetEPContextNodes()`
* Enables usage of `AddExternalInitializersFromFilesInMemory` for models
that have to be communicated as byte stream but are larger than 2GB
* Adds EP context unit tests for files, byte streams, and both embed modes

NOTE: For large models (> 2GB), `embed_mode=0` must be used;
`embed_mode=1` fails due to protobuf limitations.

---------

Co-authored-by: Maximilian Müller <[email protected]>
### Description

upgrade WGSL Template to v0.1.15

Changes:
- fs-eire/wgsl-template#21
…rosoft#25800)

This reconfiguration is done so that tensors are NOT allocated with an exact
matching size. If that strategy is used, a tensor allocation will always trigger a new
allocation in the arena instead of reusing memory, since the memory size has
to match exactly.
This became a big problem with ORT GenAI, since the arena grew constantly
when prompting with different prompt lengths, and no arena shrinkage was
triggered to return older tensors. @skottmckay I am happy to be educated
on a better usage of the allocators.
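The growth problem described above can be sketched with a toy model (an illustration only, not ORT's arena implementation): if free blocks are reused only on an exact size match, varying request sizes defeat reuse entirely.

```python
# Toy model (not ORT's arena): free blocks are only reused when the
# requested size matches exactly, as described above.
class ExactMatchArena:
    def __init__(self):
        self.free = []   # sizes of freed blocks available for reuse
        self.total = 0   # total bytes ever requested from the system

    def alloc(self, size):
        if size in self.free:
            self.free.remove(size)   # exact match -> reuse
        else:
            self.total += size       # no match -> the arena grows
        return size

    def release(self, size):
        self.free.append(size)

arena = ExactMatchArena()
# Prompts of different lengths -> different tensor sizes every time.
for n in [128, 256, 384, 512, 640]:
    arena.release(arena.alloc(n))

print(arena.total)  # 1920: nothing is ever reused, the arena grows on every call
```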

Issues with this:
Since the arena is no longer used for workspace allocations (it uses
reserve), it will likely not be possible in the future to allocate on a
stream and immediately free the memory after an enqueue call. That could
have enabled workspace sharing in a multi-model pipeline very nicely.

@chilo-ms, can you help merge this?
### Description
<!-- Describe your changes. -->
This PR provides C++ interfaces for the following:

Env
===
CopyTensors()

CreateSharedAllocator
GetSharedAllocator
ReleaseSharedAllocator
CreateAndRegisterAllocatorV2

RegisterAllocator
UnregisterAllocator

EpDevice
========
EpDevice_MemoryInfo
CreateSyncStreamForEpDevice

MemoryInfo
==========
CreateMemoryInfo_V2
MemoryInfoGetName
MemoryInfoGetId
MemoryInfoGetMemType
MemoryInfoGetType
MemoryInfoGetDeviceMemType
MemoryInfoGetVendorId

Session
=======
SessionGetInputName
SessionGetOutputName

SessionGetMemoryInfoForInputs
SessionGetMemoryInfoForOutputs
SessionGetEpDeviceForInputs

SyncStream
==========
SyncStream_GetHandle
ReleaseSyncStream

OrtArenaCfg
===========
CreateArenaCfgV2

TRT
===
CreateTensorRTProviderOptions and V2
UpdateTensorRTProviderOptions

SessionOptions
==============
OrtSessionOptionsAppendExecutionProvider_CPU

Prepacked container
===================

CUDA Options V2
===============
OrtCUDAProviderOptionsV2
CreateCUDAProviderOptions

GetCUDAProviderOptionsByName
UpdateCUDAProviderOptionsWithValue
UpdateCUDAProviderOptions
GetCUDAProviderOptionsAsString

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Provide a way to write exception-safe code.
### Description
Added the header `<cstdint>` to `semver.h`.


### Motivation and Context
Fixes compilation on Linux systems, preventing the error:
```
/xxx/onnxruntime/core/common/semver.h:18:3: error: »uint32_t« does not name a type
   18 |   uint32_t major{};
   19 |   uint32_t minor{};
   20 |   uint32_t patch{};
```
…y info (microsoft#25749)

### Description
This pull request introduces a new mechanism for validating compiled
model compatibility with execution providers (EPs) in ONNX Runtime. It
adds infrastructure for EPs to generate and store compatibility
information in model metadata, and for the runtime to enforce
compatibility checks during session initialization.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
The APIs proposed in this PR address two requirements:

1. Apps that have an already pre-compiled model on device need a way to
determine if the pre-compiled model is still valid (given the EPs /
drivers / etc. on the system).
2. Apps may have many different pre-compiled versions of a model stored
on a remote server, and want to figure out which of those models they
should download for the device where they are running.

### Testing
Validated that the new suite of tests passes cleanly.
Created a private build of this ORT and the AMD Vitis EP. I stepped
through the core logic (the EP doesn't have this support wired up yet,
so there is no compatibility info written out) and, for regression
purposes, confirmed I could compile and run inferences through ResNet.

---------

Co-authored-by: Aditya Rastogi <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
<!-- Describe your changes. -->

Disable cpuinfo for ARM64EC builds. There's an error when linking to
cpuinfo built for ARM64EC when using `--use_vcpkg`.

This issue was exposed by a recent change (microsoft#25228) but cpuinfo was
actually not being used before for ARM64EC. The macros here don't
properly account for ARM64EC:

https://github.com/microsoft/onnxruntime/blob/e6d3e085cb0bb96da7c3458b97316ecca234b37a/onnxruntime/core/common/cpuid_arch_definition.h#L8-L14

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix a packaging pipeline failure. Revert to the old behavior of not
calling cpuinfo from the CPUIDInfo ctor for ARM64EC.

This PR is just a workaround. The cpuinfo link issue needs more
investigation.
### Description
Put the flash decoding shader into three template files.


### Motivation and Context
Moving to templates will improve code readability.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k August 25, 2025 04:48
@ankitm3k ankitm3k merged commit e812aea into ovep-develop Aug 25, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_25082025 branch August 25, 2025 05:39