Bf16 implementations of NCHWc Conv kernels #26980
base: main
Conversation
Co-authored-by: Copilot <[email protected]>
This PR is in a pretty crude form. So before investing any effort in manually reviewing it, can I get a Copilot review? @hariharans29
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
This PR introduces BF16 (bfloat16) implementations of the NCHWc, NCHW, and Depthwise convolution kernels on ARM NEON platforms, building on previous work. The implementations use SBGEMM (bfloat16 GEMM with single-precision accumulation) and achieve a 33% performance improvement on models that make heavy use of the NCHWc kernels.
Changes:
- Added `ZeroMode` parameter to SBGEMM operations to support output accumulation mode
- Implemented BF16 convolution kernels for NCHW/NCHWc formats using an im2col + SBGEMM approach
- Registered new BF16 kernel implementations in platform initialization
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/sbgemm.h | Added InitialZeroMode parameter to control zero-mode behavior in SBGEMM operations |
| onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp | New file implementing BF16 depthwise, NCHWc, NCHW, and pointwise convolution kernels |
| onnxruntime/core/mlas/lib/platform.cpp | Registered BF16 kernel function pointers for NEON platform |
| onnxruntime/core/mlas/lib/mlasi.h | Declared new BF16 convolution kernel function signatures |
| onnxruntime/core/mlas/inc/mlas.h | Added ZeroMode field to SBGEMM data parameters structure |
| cmake/onnxruntime_mlas.cmake | Added sbconv_kernel_neon.cpp to build with BF16 compilation flags |
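For context on the BF16 data type behind these files: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE fp32 value, so a simple (truncating) fp32-to-bf16 conversion is just a 16-bit shift. A minimal illustrative sketch follows; note that production conversions typically round to nearest rather than truncate, and these helper names are hypothetical:

```cpp
#include <cstdint>
#include <cstring>

// Truncating fp32 -> bf16: keep the high 16 bits of the fp32 bit pattern.
uint16_t FloatToBf16Truncate(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> fp32 is exact: place the 16 bits in the high half, zero the rest.
float Bf16ToFloat(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Values whose mantissa fits in 7 bits (e.g. 1.0f, 1.5f) round-trip exactly; others lose low mantissa bits, which is the precision trade-off these kernels accept for speed.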
```cpp
// BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
// NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
// NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize
```
Copilot AI · Jan 16, 2026
Duplicate comment block at lines 121-124 and lines 126-128. Remove one of the duplicate comment blocks.
Suggested change (keep a single copy of the comment block):
```cpp
// BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
// NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
// NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize
```
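The NCHW vs. NCHWc split described in the comment above can be sketched as a per-kernel-position accumulation loop. This is an illustrative sketch, not the actual MLAS kernel; it uses the ZeroMode idea from this PR (first call initializes the output, later calls accumulate), and all names (`GemmRow`, `ConvNchwcLike`) are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// One row of a GEMM: c(1xN) = a(1xK) * b(KxN), accumulating into c
// unless zeroMode is set.
void GemmRow(size_t N, size_t K, const float* a, const float* b,
             float* c, bool zeroMode) {
    for (size_t n = 0; n < N; ++n) {
        float sum = zeroMode ? 0.0f : c[n];
        for (size_t k = 0; k < K; ++k)
            sum += a[k] * b[k * N + n];
        c[n] = sum;
    }
}

// NCHWc-style loop: one small GEMM per kernel position with K = blockSize.
// ZeroMode is true only for the first position, so later positions add on
// top of the partial result instead of overwriting it.
std::vector<float> ConvNchwcLike(
    const std::vector<std::vector<float>>& filterPerKpos,
    const std::vector<std::vector<float>>& colsPerKpos,
    size_t N, size_t blockSize) {
    std::vector<float> out(N, 0.0f);
    for (size_t kpos = 0; kpos < filterPerKpos.size(); ++kpos)
        GemmRow(N, blockSize, filterPerKpos[kpos].data(),
                colsPerKpos[kpos].data(), out.data(),
                /*zeroMode=*/kpos == 0);
    return out;
}
```

The NCHW path, by contrast, folds all kernel positions into one GEMM with K = KernelSize, so no accumulation loop is needed.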
```cpp
this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
```
Copilot AI · Jan 16, 2026
Duplicate assignment to ConvNchwFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.
Suggested change:
```diff
- this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
  this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
  this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
  this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
  // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
```
```cpp
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
```
Copilot AI · Jan 16, 2026
Duplicate assignment to ConvNchwcFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.
Suggested change:
```diff
- this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
+ // this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
  this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
+ // this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
  this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
  // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
  this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
+ // this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
```
```cpp
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
```
Copilot AI · Jan 16, 2026
Duplicate assignment to ConvPointwiseFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.
Suggested change:
```cpp
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
```
Description
This is a follow-up to #26838; please review this PR only after the former is merged.
Motivation and Context
This PR introduces BF16 implementations of the Depthwise, NCHWc, and NCHW Conv kernels.
Performance Improvement
These kernels, together with the BF16 Pointwise Conv kernels, provide a 33% speedup on models (other than MobileNet) that stress the NCHWc kernels.