forked from microsoft/onnxruntime
Sync with Microsoft ONNX Runtime - 05/08/2025 #769
Merged
Conversation
…icrosoft#25590)

### Description
Use the session id to track sessions with LogSessionCreation. If Run is called from different threads, the calls can be differentiated by thread id, since Run is not async.

Co-authored-by: hualxie <[email protected]>
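For illustration, a minimal Python sketch of the scenario this change targets: because `Run` is synchronous, each concurrent call can be attributed to the thread that issued it. This uses the standard onnxruntime Python API; the model path and input name are placeholders, not part of the PR.

```python
import threading
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # placeholder model path

def run_inference(x):
    # Run is synchronous, so the native thread id uniquely identifies
    # this in-flight call; ORT verbose logs can pair it with the session id.
    print(f"Run issued from thread {threading.get_ident()}")
    return sess.run(None, {"input": x})  # "input" name is an assumption

threads = [
    threading.Thread(target=run_inference,
                     args=(np.zeros((1, 3), dtype=np.float32),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```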
…#25602)

### Description
Add a new API to VitisAI to save the graph as a string.

### Motivation and Context
To support an in-memory flow.

Co-authored-by: yifei <[email protected]>
Add support for bfloat16 in the MoE and qMoE CUDA ops.
### Description
Disable Turing GPU EP devices.

### Motivation and Context
Turing will not be supported in this release. @chilo-ms @jywu-msft
### Description
- Add unit tests for LPBQ fusions on MatMul and Gemm nodes.

### Motivation and Context
- These unit tests guard against future regressions in the LPBQ fusions.
### Description
We have a big packaging pipeline that builds the nuget/java/nodejs packages and then runs their tests. This PR splits the tests into a dedicated pipeline and refactors the code to use Maven to download dependencies instead of direct HTTP fetches. The new approach allows us to use Azure DevOps artifacts as an internal mirror to meet network isolation requirements. This PR also enables the WebGPU and CoreML EP tests for the Java package on macOS, and updates tools/python/run_packaging_pipelines.py to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
microsoft#25589)

### Description
Cache opSupportLimits in the WebNN backend to avoid querying it from the lower layer each time, improving performance. Also update the trace event in data transfer.

### Motivation and Context
In the current implementation, every ensureTensor call that checks an input/output tensor invokes the MLContext.opSupportLimits API to query supported-op capability from Chromium, and this call becomes a hotspot. Querying the API once at session creation and caching the result avoids the frequent lower-level calls.
…g HTP. (microsoft#25605)

### Description
Lower Gemm with a 2D bias to FC + ElementwiseAdd when targeting HTP.

### Motivation and Context
This change allows Gemm with a 2D bias to stay on HTP instead of falling back to CPU.

Signed-off-by: Mu-Chein Hsu <[email protected]>
This introduces a new LayoutProgram to pre-process the input matrix A, converting it to a layout that is more efficient for the SubgroupMatrixLoad operation on Intel GPUs.
…microsoft#25584)

Bug fix for QNN EP when multiple consumers of the same Cast node are inserted for Gather indices. Now the node is conditionally skipped when it has already been added.
### Description
- Translate ONNX ScatterElements to QNN's ScatterElements op.
- Handle the unsupported reduction value "min".
- Add unit tests to verify ScatterElements support on HTP.

Co-authored-by: Tirupathi Reddy T <[email protected]>
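As a reference for the semantics being mapped, here is a compact numpy sketch of ONNX ScatterElements, including the "min" reduction that needed special handling. This is an illustrative reference implementation, not the QNN EP code.

```python
import numpy as np

def scatter_elements(data, indices, updates, axis=0, reduction="none"):
    # For each position in `indices`, replace the coordinate along `axis`
    # with the index value and write the corresponding update there.
    out = data.copy()
    for pos in np.ndindex(indices.shape):
        target = list(pos)
        target[axis] = indices[pos]
        t = tuple(target)
        if reduction == "none":
            out[t] = updates[pos]
        elif reduction == "add":
            out[t] += updates[pos]
        elif reduction == "min":   # the reduction QNN lacks natively
            out[t] = min(out[t], updates[pos])
        elif reduction == "max":
            out[t] = max(out[t], updates[pos])
    return out
```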
### Description
- Pre-allocate memory for the HTP context params list during context creation when VTCM backup buffer sharing is enabled, to avoid memory issues caused by vector resizing/re-allocation.
- Handle the case where new binary contexts need to be processed.
…gth (microsoft#25594)

### Description
microsoft#25372 added sliding-window support for Group Query Attention, disabling Flash Attention since it does not yet support sliding windows. This PR adds a check for the sliding window and applies Flash Attention when the window size exceeds the KV cache length or total sequence length.

### Motivation and Context
See above.
### Description
This pull request introduces a new compiler flag check for `-Warray-bounds` and addresses a Clang-specific warning in the `MlasQ4Int8TileGemmKernelBlkLen32Avx2` function. The warning is raised in unreachable code, but Clang still triggers it, so the change is added to keep Clang happy.

Fixes microsoft#23180
…5575)

### Description
This PR implements the GatherBlockQuantized operator for the CUDA EP with 4-bit and 8-bit data support.

### Motivation and Context
The GatherBlockQuantized operator is essential for MoE models' expert selection, especially when the model has been statically quantized.

Co-authored-by: Xiaoyan Hu <[email protected]>
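A hedged numpy sketch of what a block-quantized gather computes for the 8-bit case: gather rows first, then dequantize per block along the last axis. The shapes, 1-D indices, and default zero point here are illustrative assumptions, not the op schema.

```python
import numpy as np

def gather_block_quantized(qdata, scales, indices, block_size, zero_point=0):
    # qdata:  (rows, cols) int8 quantized values
    # scales: (rows, cols // block_size) per-block scales
    # indices: 1-D gather indices along axis 0 (assumption for brevity)
    g = qdata[indices].astype(np.float32)   # gather along axis 0
    s = scales[indices]
    rows, cols = g.shape
    g = g.reshape(rows, cols // block_size, block_size)
    deq = (g - zero_point) * s[..., None]   # per-block dequantization
    return deq.reshape(rows, cols)
```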
…ft#25626)

### Description
Move the transfer of weights to memory to the end of Graph::Resolve(). Modify Inject so it copies data into the TensorProto according to the C API docs.

### Motivation and Context
TypeAndShape inference runs as part of `Resolve()` and is unable to inspect and load initializers that point to OrtValues at that time. We therefore move the TensorProto-to-OrtValue conversion to the end of `Resolve()`.

References: microsoft#25579
This pull request introduces significant updates to ONNX Runtime's handling of quantized Mixture-of-Experts (MoE) operations. The changes include adjustments to tensor type constraints, new kernel definitions, and a new `QMoE` operator for CPU execution. These updates enhance support for quantized MoE operations and improve validation of input tensors and scales.

### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and `fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in `docs/OperatorKernels.md`.

### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions:
* Added the `MoEBaseCPU` class in `onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared functionality for MoE operations, including input validation and scale checking.
* Implemented the `QMoE` operator in `onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with support for quantized tensor types and activation types.

### CUDA and Graph Updates:
* Updated type constraints for `T2` in the CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use `T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.

These changes collectively improve the framework's ability to handle quantized MoE operations efficiently while ensuring robust validation of input tensors and scales.

Co-authored-by: Kunal Vaishnavi <[email protected]> Co-authored-by: Tianlei Wu <[email protected]>
…crosoft#25617)

### Description
Use cross-compilation to build the x86_64 target for WebGPU. This is required for the incoming Dawn upgrade (microsoft#25461): the latest Dawn needs at least Clang v16 to compile, but the macOS images available in CI pipelines only go up to v15.2. This PR depends on microsoft#25615.
### Weight Shape Update
Make sure the shape reflects the actual memory layout. The weight is stored in column-major order.

### Add support for SwiGLU activation attributes
Add the spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by introducing a few new attributes. For reference, see the [Triton kernel implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).

#### New Attributes for SwiGLU
* **`swiglu_fusion`**:
  * `0`: Not fused; two separate GEMMs (FC1 and FC3).
  * `1`: Fused GEMMs using **interleaved** format (g and l are interleaved per row).
  * `2`: Fused GEMMs using **non-interleaved** (concatenated) format.
* **`swiglu_limit`**: Clamp threshold applied to `g` and `l`.
* **`activation_alpha`**: Scalar multiplier applied to `g` before the sigmoid.
* **`activation_beta`**: Added to `l` before the final output computation.

### SwiGLU Activation Function
The SwiGLU function is defined as:
```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```
* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants

### Fusion Behavior
* When `swiglu_fusion = 0`: two GEMMs are computed independently; FC1 computes `g`, FC3 computes `l`.
* When `swiglu_fusion = 1`: `g` and `l` are computed in a **single fused GEMM** (FC1), and the output is **interleaved** per row as `gate, linear, gate, linear, ...`.
* When `swiglu_fusion = 2`: `g` and `l` are computed in a single GEMM (FC1), and the output is **concatenated** per row: `[g | l]`.

### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default SwiGLU limit. Update test_moe_cuda.py so the reference implementation uses the same logic.

### Remaining Work
The main purpose of this PR is to update the spec rather than implement it. Note that the MoE/qMoE ops and tests still use hard-coded parameters; they will be changed later to read from these attributes. Column-wise symmetric quantization is used for qMoE. We will add more quantization details when we add support for block-wise quantization soon.
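To make the spec concrete, here is a small numpy reference of the activation and the two fused layouts, following the formula above. The function names and default attribute values are illustrative assumptions, not the ORT kernel's API.

```python
import numpy as np

def swiglu(g, l, alpha=1.702, beta=1.0, limit=7.0):
    # Clamp as in the spec: g is clamped one-sided, l two-sided.
    G = np.minimum(g, limit)
    L = np.maximum(np.minimum(l, limit), -limit)
    sigmoid = 1.0 / (1.0 + np.exp(-alpha * G))
    return G * sigmoid * (L + beta)

def apply_swiglu_fused(fc1_out, swiglu_fusion, **kw):
    # Split the fused FC1 output into gate (g) and linear (l) parts
    # according to the swiglu_fusion layout described above.
    if swiglu_fusion == 1:        # interleaved: gate, linear, gate, linear, ...
        g, l = fc1_out[..., 0::2], fc1_out[..., 1::2]
    elif swiglu_fusion == 2:      # concatenated: [g | l]
        half = fc1_out.shape[-1] // 2
        g, l = fc1_out[..., :half], fc1_out[..., half:]
    else:
        raise ValueError("swiglu_fusion == 0 uses separate FC1/FC3 outputs")
    return swiglu(g, l, **kw)
```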
…and change vendor_id (microsoft#25625)

This fixes CreateIExecutionProvider for the MIGraphX EP when calling CreateExecutionProviderFactory, using OrtMIGraphXProviderOptions instead of ProviderOptions. It also changes the vendor_id so that OrderDevices in provider_policy_context.cc will default to the DML EP when ep_policy is set to GPU. Will update pending more changes to the MIGraphX EP.

Co-authored-by: ozhang <[email protected]>
Update the operator spec to support block quantization in qMoE. The implementation will come later.
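Since the implementation is deferred, here is a hedged numpy sketch of one plausible block-wise symmetric quantization scheme, consistent with the column-wise symmetric scheme mentioned for qMoE above; the final spec may differ.

```python
import numpy as np

def quantize_blockwise_symmetric(w, block_size, bits=4):
    # Assumed scheme: symmetric per-block scales along the last axis.
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit
    rows, cols = w.shape
    blocks = w.reshape(rows, cols // block_size, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)    # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)
```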
### Description
Add support for the GatherNd op in QNN EP:
- Added a new op builder for the GatherNd op.
- Added unit tests for GatherNd, including QDQ tests.
- Disabled two ORT tests because QNN CPU does not support negative indices.

### Motivation and Context
Currently the GatherNd op is not supported in QNN EP and hence falls back to ORT CPU. A reference sketch of the op's semantics follows below.
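For reference, a minimal numpy model of ONNX GatherND with batch_dims=0: each innermost vector of `indices` addresses a slice of `data`. Negative indices, which QNN CPU rejects, would need normalizing first.

```python
import numpy as np

def gather_nd(data, indices):
    # indices: (..., k) where k <= data.ndim; each length-k vector selects
    # the slice data[idx[0], ..., idx[k-1]].
    k = indices.shape[-1]
    out_shape = indices.shape[:-1] + data.shape[k:]
    flat = [data[tuple(idx)] for idx in indices.reshape(-1, k)]
    return np.array(flat).reshape(out_shape)
```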
### Description
- Enable 2-bit MatMulNBits; it falls back to ComputeBUnpacked (dequantizes to fp32).
- Adapt the quantize script to enable 2 bits.
- Add 2-bit unit tests.
- [Blockwise quantization for 2 bits is already implemented](https://github.com/microsoft/onnxruntime/blob/b9575476e94daa9c6578aba92d8f04324dd15815/onnxruntime/core/mlas/lib/q4_dq.cpp#L407).

### Motivation and Context
- Working on enabling BitNet and low-bit LLMs.

Co-authored-by: Hector Li <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
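A hedged sketch of the math on the fp32 fallback path: unpack four 2-bit values per byte, then apply per-block scale and zero point. The packing order shown here is an assumption; MLAS's actual layout may differ.

```python
import numpy as np

def dequant_2bit(packed, scales, zero_points, block_size):
    # packed: (rows, cols // 4) uint8, four 2-bit weights per byte
    # scales, zero_points: (rows, cols // block_size) per-block parameters
    vals = np.stack([(packed >> (2 * i)) & 0x3 for i in range(4)], axis=-1)
    vals = vals.reshape(packed.shape[0], -1).astype(np.float32)  # (rows, cols)
    n_blocks = vals.shape[1] // block_size
    vals = vals.reshape(packed.shape[0], n_blocks, block_size)
    deq = (vals - zero_points[..., None]) * scales[..., None]
    return deq.reshape(packed.shape[0], -1)
```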
…#25349)

### Motivation and Context
Android devices (like the S24) don't seem to allow fp16 in uniforms, so the WebGPU EP has to manually pass an fp32 in the uniform and convert it to fp16 before use.
### Description
Currently the subgroup feature does not work correctly for the WebGPU EP on web; see microsoft#25595. Until the bug is fixed upstream (https://issues.chromium.org/issues/435879324), use this workaround to enable subgroup support.
ankitm3k approved these changes on Aug 5, 2025.
Description
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.