forked from microsoft/onnxruntime
Sync with Microsoft ONNX Runtime - 05/08/2025 #769
Merged
Conversation
…icrosoft#25590)

### Description
Use the session id to track sessions with LogSessionCreation. If Run is called from different threads, the calls can be differentiated by thread id, since Run is not async.

Co-authored-by: hualxie <[email protected]>
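For illustration, a minimal Python sketch of the scenario this change targets: because `Run` is synchronous, each concurrent call can be attributed to the thread that issued it. This uses the standard onnxruntime Python API; the model path and input name are placeholders, not part of the PR.

```python
import threading
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # placeholder model path

def run_inference(x):
    # Run is synchronous, so the native thread id uniquely identifies
    # this in-flight call; ORT verbose logs can pair it with the session id.
    print(f"Run issued from thread {threading.get_ident()}")
    return sess.run(None, {"input": x})  # "input" name is an assumption

threads = [
    threading.Thread(target=run_inference,
                     args=(np.zeros((1, 3), dtype=np.float32),))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```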
…#25602)

### Description
Add a new API to VitisAI to save the graph as a string.

### Motivation and Context
To support an in-memory flow.

Co-authored-by: yifei <[email protected]>
Add support for bfloat16 in the MoE and qMoE CUDA ops.
### Description
Disable Turing GPU EP devices.

### Motivation and Context
Turing will not be supported in this release. @chilo-ms @jywu-msft
### Description
- Add unit tests for LPBQ fusions on MatMul and Gemm nodes.

### Motivation and Context
- These unit tests guard against future regressions in the LPBQ fusions.
### Description
We have a big packaging pipeline that builds the nuget/java/nodejs packages and then runs their tests. This PR splits the tests into a dedicated pipeline and refactors the code to use Maven to download dependencies instead of direct HTTP fetches. The new approach allows us to use Azure DevOps artifacts as an internal mirror to meet network isolation requirements. This PR also enables the WebGPU and CoreML EP tests for the Java package on macOS, and updates tools/python/run_packaging_pipelines.py to add support for RC releases.

### Motivation and Context
Make the packaging pipelines smaller and easier to use.
microsoft#25589)

### Description
Cache opSupportLimits in the WebNN backend to avoid querying it from the lower layer each time, improving performance. Also update the trace event in data transfer.

### Motivation and Context
In the current implementation, every ensureTensor call that checks an input/output tensor invokes the MLContext.opSupportLimits API to query supported-op capability from Chromium, and this call becomes a hotspot. Querying the API once at session creation and caching the result avoids the frequent lower-level calls.
…g HTP. (microsoft#25605)

### Description
Lower Gemm with a 2D bias to FC + ElementwiseAdd when targeting HTP.

### Motivation and Context
This change allows Gemm with a 2D bias to stay on HTP instead of falling back to CPU.

Signed-off-by: Mu-Chein Hsu <[email protected]>
This introduces a new LayoutProgram to pre-process the input matrix A, converting it to a layout that is more efficient for the SubgroupMatrixLoad operation on Intel GPUs.
…microsoft#25584)

Bug fix for QNN EP when multiple consumers of the same Cast node are inserted for Gather indices. Now the node is conditionally skipped when it has already been added.
### Description
- Translate ONNX ScatterElements to QNN's ScatterElements op.
- Handle the unsupported reduction value "min".
- Add unit tests to verify ScatterElements support on HTP.

Co-authored-by: Tirupathi Reddy T <[email protected]>
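As a reference for the semantics being mapped, here is a compact numpy sketch of ONNX ScatterElements, including the "min" reduction that needed special handling. This is an illustrative reference implementation, not the QNN EP code.

```python
import numpy as np

def scatter_elements(data, indices, updates, axis=0, reduction="none"):
    # For each position in `indices`, replace the coordinate along `axis`
    # with the index value and write the corresponding update there.
    out = data.copy()
    for pos in np.ndindex(indices.shape):
        target = list(pos)
        target[axis] = indices[pos]
        t = tuple(target)
        if reduction == "none":
            out[t] = updates[pos]
        elif reduction == "add":
            out[t] += updates[pos]
        elif reduction == "min":   # the reduction QNN lacks natively
            out[t] = min(out[t], updates[pos])
        elif reduction == "max":
            out[t] = max(out[t], updates[pos])
    return out
```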
### Description
- Pre-allocate memory for the HTP context params list during context creation when VTCM backup buffer sharing is enabled, to avoid memory issues caused by vector resizing/re-allocation.
- Handle the case where new binary contexts need to be processed.
…gth (microsoft#25594)

### Description
microsoft#25372 added sliding-window support for Group Query Attention, disabling Flash Attention since it does not yet support sliding windows. This PR adds a check for the sliding window and applies Flash Attention when the window size exceeds the KV cache length or total sequence length.

### Motivation and Context
See above.
### Description
This pull request introduces a new compiler flag check for `-Warray-bounds` and addresses a Clang-specific warning in the `MlasQ4Int8TileGemmKernelBlkLen32Avx2` function. The warning is raised in unreachable code, but Clang still triggers it, so the change is added to keep Clang happy.

Fixes microsoft#23180
…5575)

### Description
This PR implements the GatherBlockQuantized operator for the CUDA EP with 4-bit and 8-bit data support.

### Motivation and Context
The GatherBlockQuantized operator is essential for MoE models' expert selection, especially when the model has been statically quantized.

Co-authored-by: Xiaoyan Hu <[email protected]>
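A hedged numpy sketch of what a block-quantized gather computes for the 8-bit case: gather rows first, then dequantize per block along the last axis. The shapes, 1-D indices, and default zero point here are illustrative assumptions, not the op schema.

```python
import numpy as np

def gather_block_quantized(qdata, scales, indices, block_size, zero_point=0):
    # qdata:  (rows, cols) int8 quantized values
    # scales: (rows, cols // block_size) per-block scales
    # indices: 1-D gather indices along axis 0 (assumption for brevity)
    g = qdata[indices].astype(np.float32)   # gather along axis 0
    s = scales[indices]
    rows, cols = g.shape
    g = g.reshape(rows, cols // block_size, block_size)
    deq = (g - zero_point) * s[..., None]   # per-block dequantization
    return deq.reshape(rows, cols)
```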
…ft#25626)

### Description
Move the transfer of weights to memory to the end of Graph::Resolve(). Modify Inject so it copies data into the TensorProto according to the C API docs.

### Motivation and Context
TypeAndShape inference runs as part of `Resolve()` and is unable to inspect and load initializers that point to OrtValues at that time. We therefore move the TensorProto-to-OrtValue conversion to the end of `Resolve()`.

References: microsoft#25579
This pull request introduces significant updates to ONNX Runtime's handling of quantized Mixture-of-Experts (MoE) operations. The changes include adjustments to tensor type constraints, new kernel definitions, and a new `QMoE` operator for CPU execution. These updates enhance support for quantized MoE operations and improve validation of input tensors and scales.

### Documentation Updates:
* Updated tensor type constraints for `fc1_scales`, `fc2_scales`, and `fc3_scales` in `docs/ContribOperators.md` to use `T2` instead of `T`.
* Added descriptions for the new `QMoE` operator in `docs/OperatorKernels.md`.

### Operator Enhancements:
* Introduced a new `QMoE` operator for quantized Mixture-of-Experts in CPU kernels (`onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc`).
* Registered the `QMoE` operator in the kernel registry.

### Codebase Additions:
* Added the `MoEBaseCPU` class in `onnxruntime/contrib_ops/cpu/moe/moe_base_cpu.h` to provide shared functionality for MoE operations, including input validation and scale checking.
* Implemented the `QMoE` operator in `onnxruntime/contrib_ops/cpu/quantization/moe_quantization_cpu.h` with support for quantized tensor types and activation types.

### CUDA and Graph Updates:
* Updated type constraints for `T2` in the CUDA implementation of `QMoE`.
* Adjusted schema definitions for `fc1_scales` and `fc2_scales` to use `T2` in `onnxruntime/core/graph/contrib_ops/contrib_defs.cc`.

These changes collectively improve the framework's ability to handle quantized MoE operations efficiently while ensuring robust validation of input tensors and scales.

Co-authored-by: Kunal Vaishnavi <[email protected]> Co-authored-by: Tianlei Wu <[email protected]>
…crosoft#25617)

### Description
Use cross-compilation to build the x86_64 target for WebGPU. This is required for the incoming Dawn upgrade (microsoft#25461): the latest Dawn needs at least Clang v16 to compile, but the macOS images available in CI pipelines only go up to v15.2. This PR depends on microsoft#25615.
### Weight Shape Update
Make sure the shape reflects the actual memory layout. The weight is stored in column-major order.

### Add support for SwiGLU activation attributes
Add the spec for the new activation type SwiGLU (Swish-Gated Linear Unit) by introducing a few new attributes. For reference, see the [Triton kernel implementation](https://github.com/triton-lang/triton/blob/main/python/triton_kernels/triton_kernels/swiglu.py).

#### New Attributes for SwiGLU
* **`swiglu_fusion`**:
  * `0`: Not fused; two separate GEMMs (FC1 and FC3).
  * `1`: Fused GEMMs using **interleaved** format (g and l are interleaved per row).
  * `2`: Fused GEMMs using **non-interleaved** (concatenated) format.
* **`swiglu_limit`**: Clamp threshold applied to `g` and `l`.
* **`activation_alpha`**: Scalar multiplier applied to `g` before the sigmoid.
* **`activation_beta`**: Added to `l` before the final output computation.

### SwiGLU Activation Function
The SwiGLU function is defined as:
```
g = xW + b
l = xV + c
G = min(g, limit)
L = max(min(l, limit), -limit)
swiglu = G * sigmoid(alpha * G) * (L + beta)
```
* `x`: Input
* `W`, `V`: Weight matrices
* `b`, `c`: Bias vectors
* `alpha`, `beta`, `limit`: Float constants

### Fusion Behavior
* When `swiglu_fusion = 0`: two GEMMs are computed independently; FC1 computes `g`, FC3 computes `l`.
* When `swiglu_fusion = 1`: `g` and `l` are computed in a **single fused GEMM** (FC1), and the output is **interleaved** per row as `gate, linear, gate, linear, ...`.
* When `swiglu_fusion = 2`: `g` and `l` are computed in a single GEMM (FC1), and the output is **concatenated** per row: `[g | l]`.

### Implement swiglu_limit for CUDA
Update the CUDA kernel to use the default SwiGLU limit. Update test_moe_cuda.py so the reference implementation uses the same logic.

### Remaining Work
The main purpose of this PR is to update the spec rather than implement it. Note that the MoE/qMoE ops and tests still use hard-coded parameters; they will be changed later to read from these attributes. Column-wise symmetric quantization is used for qMoE. We will add more quantization details when we add support for block-wise quantization soon.
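To make the spec concrete, here is a small numpy reference of the activation and the two fused layouts, following the formula above. The function names and default attribute values are illustrative assumptions, not the ORT kernel's API.

```python
import numpy as np

def swiglu(g, l, alpha=1.702, beta=1.0, limit=7.0):
    # Clamp as in the spec: g is clamped one-sided, l two-sided.
    G = np.minimum(g, limit)
    L = np.maximum(np.minimum(l, limit), -limit)
    sigmoid = 1.0 / (1.0 + np.exp(-alpha * G))
    return G * sigmoid * (L + beta)

def apply_swiglu_fused(fc1_out, swiglu_fusion, **kw):
    # Split the fused FC1 output into gate (g) and linear (l) parts
    # according to the swiglu_fusion layout described above.
    if swiglu_fusion == 1:        # interleaved: gate, linear, gate, linear, ...
        g, l = fc1_out[..., 0::2], fc1_out[..., 1::2]
    elif swiglu_fusion == 2:      # concatenated: [g | l]
        half = fc1_out.shape[-1] // 2
        g, l = fc1_out[..., :half], fc1_out[..., half:]
    else:
        raise ValueError("swiglu_fusion == 0 uses separate FC1/FC3 outputs")
    return swiglu(g, l, **kw)
```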
…and change vendor_id (microsoft#25625)

This fixes CreateIExecutionProvider for the MIGraphX EP when calling CreateExecutionProviderFactory, using OrtMIGraphXProviderOptions instead of ProviderOptions. It also changes the vendor_id so that OrderDevices in provider_policy_context.cc will default to the DML EP when ep_policy is set to GPU. Will update pending more changes to the MIGraphX EP.

Co-authored-by: ozhang <[email protected]>
Update the operator spec to support block quantization in qMoE. The implementation will come later.
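Since the implementation is deferred, here is a hedged numpy sketch of one plausible block-wise symmetric quantization scheme, consistent with the column-wise symmetric scheme mentioned for qMoE above; the final spec may differ.

```python
import numpy as np

def quantize_blockwise_symmetric(w, block_size, bits=4):
    # Assumed scheme: symmetric per-block scales along the last axis.
    qmax = 2 ** (bits - 1) - 1                     # e.g. 7 for 4-bit
    rows, cols = w.shape
    blocks = w.reshape(rows, cols // block_size, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)    # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(-1)
```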
### Description
Add support for the GatherNd op in QNN EP:
- Added a new op builder for the GatherNd op.
- Added unit tests for GatherNd, including QDQ tests.
- Disabled two ORT tests because QNN CPU does not support negative indices.

### Motivation and Context
Currently the GatherNd op is not supported in QNN EP and hence falls back to ORT CPU. A reference sketch of the op's semantics follows below.
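For reference, a minimal numpy model of ONNX GatherND with batch_dims=0: each innermost vector of `indices` addresses a slice of `data`. Negative indices, which QNN CPU rejects, would need normalizing first.

```python
import numpy as np

def gather_nd(data, indices):
    # indices: (..., k) where k <= data.ndim; each length-k vector selects
    # the slice data[idx[0], ..., idx[k-1]].
    k = indices.shape[-1]
    out_shape = indices.shape[:-1] + data.shape[k:]
    flat = [data[tuple(idx)] for idx in indices.reshape(-1, k)]
    return np.array(flat).reshape(out_shape)
```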
### Description
- Enable 2-bit MatMulNBits; it falls back to ComputeBUnpacked (dequantizes to fp32).
- Adapt the quantize script to enable 2 bits.
- Add 2-bit unit tests.
- [Blockwise quantization for 2 bits is already implemented](https://github.com/microsoft/onnxruntime/blob/b9575476e94daa9c6578aba92d8f04324dd15815/onnxruntime/core/mlas/lib/q4_dq.cpp#L407).

### Motivation and Context
- Working on enabling BitNet and low-bit LLMs.

Co-authored-by: Hector Li <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
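A hedged sketch of the math on the fp32 fallback path: unpack four 2-bit values per byte, then apply per-block scale and zero point. The packing order shown here is an assumption; MLAS's actual layout may differ.

```python
import numpy as np

def dequant_2bit(packed, scales, zero_points, block_size):
    # packed: (rows, cols // 4) uint8, four 2-bit weights per byte
    # scales, zero_points: (rows, cols // block_size) per-block parameters
    vals = np.stack([(packed >> (2 * i)) & 0x3 for i in range(4)], axis=-1)
    vals = vals.reshape(packed.shape[0], -1).astype(np.float32)  # (rows, cols)
    n_blocks = vals.shape[1] // block_size
    vals = vals.reshape(packed.shape[0], n_blocks, block_size)
    deq = (vals - zero_points[..., None]) * scales[..., None]
    return deq.reshape(packed.shape[0], -1)
```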
…#25349)

### Motivation and Context
Android devices (like the S24) don't seem to allow fp16 in uniforms, so the WebGPU EP has to manually pass an fp32 in the uniform and convert it to fp16 before use.
### Description
Currently the subgroup feature does not work correctly for the WebGPU EP on web; see microsoft#25595. Until the bug is fixed upstream (https://issues.chromium.org/issues/435879324), use this workaround to enable subgroup support.
ankitm3k approved these changes on Aug 5, 2025.
Description
Synchronizing the intel/onnxruntime ovep-develop branch with the latest changes from the microsoft/onnxruntime master branch.