
Conversation

Jaswanth51

Description

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

xieofxie and others added 29 commits July 31, 2025 16:05
…icrosoft#25590)

### Description
<!-- Describe your changes. -->

Use the session id to track sessions with LogSessionCreation.

If we call Run from different threads, we can differentiate the calls by
thread id, given that Run is not async.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: hualxie <[email protected]>
…#25602)

### Description
Add new API to VitisAI to save graph as a string

### Motivation and Context
to support in-memory flow

---------

Co-authored-by: yifei <[email protected]>
Add support for bfloat16 in MoE and qMoE CUDA ops.
### Description
<!-- Describe your changes. -->

1. Disable Turing GPU EP devices

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Turing will not be supported in this release

@chilo-ms @jywu-msft
### Description
 - Add unit tests for LPBQ fusions on MatMul and Gemm nodes

### Motivation and Context
- This commit adds unit tests to guard against future regressions in LPBQ fusions
### Description
We have a big packaging pipeline that builds the nuget/java/nodejs packages
and then runs their tests. This PR splits the tests into a dedicated pipeline
and refactors the code to use Maven to download dependencies instead of
direct HTTP fetches. The new approach allows us to use Azure DevOps
artifacts as an internal mirror to meet network isolation requirements.
This PR also enables WebGPU and CoreML EP tests for the Java package on macOS.

This PR also updates tools/python/run_packaging_pipelines.py slightly
to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
microsoft#25589)

### Description
Cache opSupportLimits in the WebNN backend and avoid querying it from the
lower layer each time, to improve performance. Also update the trace event
in data transfer.



### Motivation and Context
In the current implementation, each time the ensureTensor API is called to
check an input/output tensor, the MLContext.opSupportLimits API is called to
query op support capability from Chromium, and this call becomes a hotspot.
Calling the API once when the session is created and caching the result
avoids the frequent lower-level API calls.
…g HTP. (microsoft#25605)

### Description
Lower Gemm with 2d bias to FC + ElementwiseAdd when targeting HTP.

### Motivation and Context
This change allows Gemm with a 2D bias to stay on HTP instead of falling back to CPU.
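
For intuition, here is a minimal numpy sketch of the decomposition; the shapes and the `x @ W.T` convention for FC are illustrative assumptions, not the QNN implementation:

```python
import numpy as np

# FC is assumed to compute x @ W.T without a bias, so a Gemm with a 2-D bias C
# is expressed as FC followed by an elementwise add of C.
x = np.random.rand(4, 8).astype(np.float32)    # input
W = np.random.rand(16, 8).astype(np.float32)   # weight (out_features, in_features)
C = np.random.rand(4, 16).astype(np.float32)   # 2-D bias that FC cannot consume directly

gemm_out = x @ W.T + C       # original Gemm with 2-D bias
fc_out = x @ W.T             # step 1: FC without bias
lowered = fc_out + C         # step 2: ElementwiseAdd with the 2-D bias
assert np.allclose(gemm_out, lowered)
```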

---------

Signed-off-by: Mu-Chein Hsu <[email protected]>
This introduces a new LayoutProgram to pre-process the input matrix A,
converting it to a layout that is more efficient for the
SubgroupMatrixLoad operation on Intel GPUs.
…microsoft#25584)

Bug fix for QNN EP when multiple consumers of the same Cast node are inserted for Gather indices. Now conditionally skips adding the node when it has already been added.
### Description
 - Translate ONNX ScatterElements as QNN's ScatterElements Op
 - Handle unsupported reduction value i.e., "min"
 - Add unit tests to verify ScatterElements Op support on HTP

---------

Co-authored-by: Tirupathi Reddy T <[email protected]>
### Description
- Pre-allocated memory for HTP context params list during context creation when VTCM backup buffer sharing is enabled. This is done to avoid memory-based issues due to vector resizing/re-allocation.
 - Handle case where new binary contexts need to be processed
…gth (microsoft#25594)

### Description
<!-- Describe your changes. -->
microsoft#25372 adds sliding window support for Group Query Attention, disabling
Flash Attention as it's not yet supported.

This PR adds a check for the sliding window and applies Flash Attention
when the window size exceeds the KV cache length or total sequence
length.
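
A rough sketch of the eligibility check described above; the function name, signature, and the `<= 0` convention for "no window" are illustrative assumptions, not the actual dispatch code:

```python
def can_use_flash_attention(sliding_window: int,
                            kv_cache_length: int,
                            total_sequence_length: int) -> bool:
    """A window at least as large as the KV cache or the total sequence has no
    effect, so Flash Attention remains applicable."""
    if sliding_window <= 0:  # no sliding window configured
        return True
    return (sliding_window >= kv_cache_length
            or sliding_window >= total_sequence_length)
```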

### Motivation and Context
See above.
### Description
<!-- Describe your changes. -->


This pull request introduces a new compiler flag check for
`-Warray-bounds` and addresses a Clang-specific warning in the
`MlasQ4Int8TileGemmKernelBlkLen32Avx2` function.

The warning is generated in unreachable code, yet Clang still triggers it.
Add the change anyway to keep Clang happy.

Fixes microsoft#23180
…5575)

### Description
This PR implements the GatherBlockQuantized operator for the CUDA EP with
4-bit and 8-bit data support.


### Motivation and Context
The GatherBlockQuantized operator is essential for MoE models' expert
selection, especially when the model has been statically quantized.
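
For reference, a minimal numpy sketch of the gather-and-dequantize semantics; the signature, the unpacked int8 storage, and symmetric quantization without zero points are simplifying assumptions, and the CUDA kernel fuses these steps over packed data:

```python
import numpy as np

def gather_block_quantized(data_q, scales, indices, block_size):
    """Dequantize along the last axis (one scale per block of block_size
    elements), then gather rows by index."""
    per_element_scales = np.repeat(scales, block_size, axis=-1)
    dequantized = data_q.astype(np.float32) * per_element_scales
    return dequantized[indices]

rows, cols, block_size = 8, 16, 4
data_q = np.random.randint(-8, 8, size=(rows, cols)).astype(np.int8)  # 4-bit values, stored unpacked
scales = np.random.rand(rows, cols // block_size).astype(np.float32)
indices = np.array([0, 3, 5])
print(gather_block_quantized(data_q, scales, indices, block_size).shape)  # (3, 16)
```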

---------

Co-authored-by: Xiaoyan Hu <[email protected]>
…ft#25626)

### Description
<!-- Describe your changes. -->
Move the step that moves weights into memory (TensorProto to OrtValue
conversion) to the end of Graph::Resolve().
Modify Inject so it copies data into the TensorProto according to the C API
docs.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
TypeAndShape inference runs as part of `Resolve()` and is unable to
inspect and load initializers that point to OrtValues at that time.
We therefore move the TensorProto to OrtValue conversion to the end of
`Resolve()`.

References: microsoft#25579
This pull request introduces significant updates to the ONNX Runtime's
handling of quantized Mixture-of-Experts (MoE) operations. The changes
include adjustments to tensor type constraints, the addition of new
kernel definitions, and the implementation of a new `QMoE` operator for
CPU execution. These updates aim to enhance support for quantized MoE
operations and improve validation mechanisms for input tensors and
scales.

### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and
`fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in
`docs/OperatorKernels.md`.
[[1]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135R565)
[[2]](diffhunk://#diff-a44f0272e7668a044f15119b6efb44d562b873a7bee23c6b753b2c47d7697135L960-R961)

### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in
CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
[[1]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R109)
[[2]](diffhunk://#diff-fd949b2a9885f634c37c2048da9e35d227ed20adf1d7baf5de488f304a78bde9R275)
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions:
* Added `MoEBaseCPU` class in
`onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared
functionality for MoE operations, including input validation and scale
checking.
* Implemented the `QMoE` operator in
`onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with
support for quantized tensor types and activation types.

### CUDA and Graph Updates:
* Updated type constraints for `T2` in CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use
`T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.
[[1]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1443-R1443)
[[2]](diffhunk://#diff-81f57d9adc2cce94f85a2949a895b7ff82efcc13d05e23ee6567661f0fecb7c0L1452-R1452)

These changes collectively improve the framework's ability to handle
quantized MoE operations efficiently while ensuring robust validation
for input tensors and scales.

---------

Co-authored-by: Kunal Vaishnavi <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
…crosoft#25617)

### Description

Use cross-compilation to build the x86_64 target for WebGPU. This is required
for the incoming Dawn upgrade
(microsoft#25461). The latest Dawn needs
at least Clang v16 to compile, but the macOS toolchains available in the CI
pipelines only go up to v15.2.

This PR depends on microsoft#25615.



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Weight Shape Update
Make sure the shape reflects the actual memory layout. The weight is stored
in column-major order.

### Add support for SwiGLU activation attributes
Add spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by
introducing a few new attributes. For reference, see the [Triton kernel
implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).


#### New Attributes for SwiGLU

* **`swiglu_fusion`**:

  * `0`: Not fused — two separate GEMMs (FC1 and FC3).
  * `1`: Fused GEMMs using **interleaved** format (g and l are interleaved per row).
  * `2`: Fused GEMMs using **non-interleaved** (concatenated) format.

* **`swiglu_limit`**: Clamp threshold applied to `g` and `l`.

* **`activation_alpha`**: Scalar multiplier applied to `g` before
sigmoid.

* **`activation_beta`**: Added to `l` before the final output
computation.

---

### SwiGLU Activation Function

The SwiGLU function is defined as:

```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```

* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants
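
For reference, a direct numpy transcription of the formulas above; the constants in the usage line are arbitrary example values, not the operator's defaults:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu(x, W, b, V, c, alpha, beta, limit):
    g = x @ W + b
    l = x @ V + c
    G = np.minimum(g, limit)
    L = np.maximum(np.minimum(l, limit), -limit)
    return G * sigmoid(alpha * G) * (L + beta)

x = np.random.randn(2, 4)
W, V = np.random.randn(4, 8), np.random.randn(4, 8)
out = swiglu(x, W, np.zeros(8), V, np.zeros(8), alpha=1.702, beta=1.0, limit=7.0)
```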

---

### Fusion Behavior

* When `swiglu_fusion = 0`:

  * Two GEMMs are computed independently.
  * FC1 → computes `g`, FC3 → computes `l`.

* When `swiglu_fusion = 1`:

  * `g` and `l` are computed in a **single fused GEMM** (FC1).
  * Output is **interleaved** per row as: `gate, linear, gate, linear, ...`.

* When `swiglu_fusion = 2`:

  * `g` and `l` are computed in a single GEMM (FC1).
  * Output is **concatenated** per row: `[g | l]`.
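
To illustrate the two fused layouts, a small numpy sketch of how `g` and `l` could be recovered from a fused FC1 output; the helper name is hypothetical and the fused GEMM itself is omitted:

```python
import numpy as np

def split_gate_linear(fc1_out, swiglu_fusion):
    """Recover g (gate) and l (linear) from a fused FC1 output."""
    if swiglu_fusion == 1:   # interleaved per row: gate, linear, gate, linear, ...
        return fc1_out[..., 0::2], fc1_out[..., 1::2]
    if swiglu_fusion == 2:   # concatenated per row: [g | l]
        half = fc1_out.shape[-1] // 2
        return fc1_out[..., :half], fc1_out[..., half:]
    raise ValueError("swiglu_fusion = 0 computes g and l with two separate GEMMs (FC1 and FC3)")
```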

### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default swiglu limit.
Update test_moe_cuda.py so its reference implementation uses the same logic.

### Remaining Works
The main purpose of this PR is to update spec instead of implementing
them.
Note that MoE/qMoE ops and tests still use hard-coded parameters and
will be changed later to read from those attributes.

Column-wise symmetric quantization is used for qMoE. We will add more
quantization details when we add support for block-wise quantization
soon.
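
For reference, a minimal numpy sketch of column-wise symmetric quantization (one scale per column, zero point fixed at 0); this is illustrative only and is not the qMoE packing code:

```python
import numpy as np

def columnwise_symmetric_quantize(W, bits=8):
    """One scale per column, zero point fixed at 0 (symmetric)."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(W).max(axis=0) / qmax
    scales = np.where(scales == 0, 1.0, scales)      # guard against all-zero columns
    Wq = np.clip(np.round(W / scales), -qmax - 1, qmax).astype(np.int8)
    return Wq, scales.astype(np.float32)

W = np.random.randn(64, 32).astype(np.float32)
Wq, scales = columnwise_symmetric_quantize(W)
W_hat = Wq.astype(np.float32) * scales               # dequantized reconstruction
```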
…and change vendor_id (microsoft#25625)

This fixes CreateIExecutionProvider for the MIGraphX EP when calling
CreateExecutionProviderFactory, using OrtMIGraphXProviderOptions instead
of ProviderOptions.
It also changes the vendor_id so that OrderDevices in
provider_policy_context.cc will default to the DML EP when ep_policy is set
to GPU. Will update pending further changes to the MIGraphX EP.

Co-authored-by: ozhang <[email protected]>
Update operator spec to support block quantization in qMoE.
Implementation will come later.
- Added a new op builder for the GatherNd op
- Added unit tests for GatherNd, including QDQ tests
- Disabled two tests in the ORT test suite as QNN CPU does not support negative indices

### Description
Adding support for GatherNd op in QNN EP



### Motivation and Context
Currently GatherNd op is not supported in QNN EP and hence falls back to
ORT CPU.
### Description
- Enable 2-bit MatMulNBits
- Falls back to ComputeBUnpacked (dequantizes to fp32)
- Also adapts the quantize script to enable 2 bits
- Adds 2-bit unit tests
- [Blockwise quantization for 2 bits is already implemented](https://github.com/microsoft/onnxruntime/blob/b9575476e94daa9c6578aba92d8f04324dd15815/onnxruntime/core/mlas/lib/q4_dq.cpp#L407); a packing sketch follows this list
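
A minimal sketch of 2-bit packing and unpacking in numpy, assuming four values per byte with the lowest-order value in the lowest bits; the actual MLAS layout may differ:

```python
import numpy as np

def pack_2bit(values):
    """Pack four unsigned 2-bit values per byte; length must be a multiple of 4."""
    v = (np.asarray(values, dtype=np.uint8) & 0x3).reshape(-1, 4)
    return (v[:, 0] | (v[:, 1] << 2) | (v[:, 2] << 4) | (v[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed):
    """Inverse of pack_2bit; dequantization to fp32 would then apply per-block scales."""
    p = np.asarray(packed, dtype=np.uint8)
    return np.stack([(p >> s) & 0x3 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

vals = np.random.randint(0, 4, size=16)
assert np.array_equal(unpack_2bit(pack_2bit(vals)), vals)
```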

### Motivation and Context
- Working on enabling BitNet and other low-bit LLMs

---------

Co-authored-by: Hector Li <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#25349)

### Motivation and Context
Android devices (like the S24) don't seem to allow fp16 in uniforms, so the
WebGPU EP has to manually pass an fp32 in the uniform and convert it to fp16
before use.
### Description

Currently the subgroup feature is not working correctly for WebGPU EP on
web.

See microsoft#25595.

Until the bug is fixed upstream
(https://issues.chromium.org/issues/435879324), use this workaround to
enable subgroup support.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k August 5, 2025 08:01
@ankitm3k ankitm3k merged commit 71f8877 into ovep-develop Aug 5, 2025
6 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_05082025 branch August 5, 2025 08:53