Conversation

@Rohanjames1997 (Contributor)

Description

This is a follow-up to #26838; please review this PR only after that one is merged.

Motivation and Context

This PR introduces BF16 implementations of the Depthwise, NCHWc, and NCHW Conv kernels.

Performance Improvement

These kernels, along with the BF16 Pointwise Conv kernels, provide a 33% speedup on models (other than Mobilenet) that stress the NCHWc kernels.

Rohanjames1997 and others added 27 commits on December 4, 2025 at 20:05
@Rohanjames1997 (Contributor, Author)

This PR is still in pretty rough shape, so before investing any effort in a manual review, can I get a Copilot review? @hariharans29

hariharans29 requested a review from Copilot on January 16, 2026 at 21:02
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR introduces BF16 (bfloat16) implementations for NCHWc, NCHW, and Depthwise convolution kernels on ARM NEON platforms, building on previous work. The implementations use SBGEMM (Single-precision BFLOAT16 GEMM) operations and achieve a 33% performance improvement on models that heavily use NCHWc kernels.
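
As a rough mental model of that design, consider the toy sketch below. It is purely illustrative and self-contained (bf16 values are stored as the upper 16 bits of an IEEE float, and the GEMM is a naive triple loop); it is not the MLAS kernel code in this PR, and the helper names are invented for the example. The NCHWc path issues one small GEMM per kernel position and accumulates into the same float output, with a ZeroMode-style flag selecting overwrite versus accumulate:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Decode a bf16 value stored in a uint16_t (the high 16 bits of an IEEE float).
static float Bf16ToFloat(uint16_t v) {
    uint32_t bits = static_cast<uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Naive bf16 x bf16 -> float GEMM. ZeroMode=true overwrites C; false accumulates into it.
static void TinyBf16Gemm(size_t M, size_t N, size_t K,
                         const uint16_t* A, const uint16_t* B, float* C, bool ZeroMode) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = ZeroMode ? 0.0f : C[m * N + n];
            for (size_t k = 0; k < K; ++k) {
                acc += Bf16ToFloat(A[m * K + k]) * Bf16ToFloat(B[k * N + n]);
            }
            C[m * N + n] = acc;
        }
    }
}

// NCHWc-style driver: one GEMM per kernel position, K = BlockSize each time.
// Im2colA[kpos] holds the OutputCount x BlockSize input patch for that kernel position,
// FilterB[kpos] holds the BlockSize x BlockSize filter slice for that position.
static void NchwcConvSketch(const std::vector<std::vector<uint16_t>>& Im2colA,
                            const std::vector<std::vector<uint16_t>>& FilterB,
                            float* Output, size_t OutputCount, size_t BlockSize) {
    for (size_t kpos = 0; kpos < Im2colA.size(); ++kpos) {
        TinyBf16Gemm(OutputCount, BlockSize, BlockSize,
                     Im2colA[kpos].data(), FilterB[kpos].data(), Output,
                     /*ZeroMode=*/kpos == 0);  // first position overwrites, the rest accumulate
    }
}

The NCHW variant would instead be a single GEMM with K = KernelSize, since each kernel position contributes only one input channel.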

Changes:

  • Added ZeroMode parameter to SBGEMM operations to support output accumulation mode
  • Implemented BF16 convolution kernels for NCHW/NCHWc formats using im2col + SBGEMM approach
  • Registered new BF16 kernel implementations in platform initialization

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • onnxruntime/core/mlas/lib/sbgemm.h: Added an InitialZeroMode parameter to control zero-mode behavior in SBGEMM operations (see the sketch after this list)
  • onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp: New file implementing BF16 depthwise, NCHWc, NCHW, and pointwise convolution kernels
  • onnxruntime/core/mlas/lib/platform.cpp: Registered BF16 kernel function pointers for the NEON platform
  • onnxruntime/core/mlas/lib/mlasi.h: Declared new BF16 convolution kernel function signatures
  • onnxruntime/core/mlas/inc/mlas.h: Added a ZeroMode field to the SBGEMM data parameters structure
  • cmake/onnxruntime_mlas.cmake: Added sbconv_kernel_neon.cpp to the build with BF16 compilation flags
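
On the sbgemm.h change: the sketch below illustrates why a first-panel zero-mode knob is useful in a K-blocked GEMM driver. It is an assumption-laden illustration, not the actual sbgemm.h code. Only the very first K panel of a call may overwrite C; exposing that first-panel behavior as InitialZeroMode lets a convolution caller chain several GEMM calls into one output buffer (passing false on every call after the first).

#include <algorithm>
#include <cstddef>

// Illustrative K-blocked GEMM driver (plain float for brevity, not the MLAS code).
static void BlockedGemmDriverSketch(size_t M, size_t N, size_t K, size_t PanelK,
                                    const float* A, const float* B, float* C,
                                    bool InitialZeroMode /* caller-chosen, e.g. kpos == 0 */) {
    for (size_t k0 = 0; k0 < K; k0 += PanelK) {
        const size_t kc = std::min(PanelK, K - k0);
        // Only the first panel honors the caller's InitialZeroMode; every later panel
        // must accumulate, or it would discard the partial sums already written to C.
        const bool zeroMode = (k0 == 0) && InitialZeroMode;
        for (size_t m = 0; m < M; ++m) {
            for (size_t n = 0; n < N; ++n) {
                float acc = zeroMode ? 0.0f : C[m * N + n];
                for (size_t k = 0; k < kc; ++k) {
                    acc += A[m * K + k0 + k] * B[(k0 + k) * N + n];
                }
                C[m * N + n] = acc;
            }
        }
    }
}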


Comment on lines +126 to +128
// BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
// NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
// NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize

Copilot AI Jan 16, 2026

Duplicate comment block at lines 121-124 and lines 126-128. Remove one of the duplicate comment blocks.

Suggested change (remove the duplicated lines):
- // BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
- // NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
- // NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize

Comment on lines 569 to 575
this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvNchwFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change:
- this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
- this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
- this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
- // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
+ this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
+ this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
+ this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
+ // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;

Comment on lines +570 to +576
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvNchwcFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change:
- this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
- this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
- // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
+ // this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
+ this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
+ // this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
+ this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
+ // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
+ this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
+ // this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;

Comment on the following lines:
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvPointwiseFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
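
If the intent is indeed to prefer the BF16 kernels on hardware that supports them, one way to make that explicit (rather than relying on the second assignment overwriting the first) is to branch on a capability check. This is only a sketch: HasNeonBf16Support() is a hypothetical stand-in for whatever BF16 feature detection this codebase already uses, and the kernel names are the ones visible in this diff.

// Sketch only: register the BF16 kernels behind an explicit capability check.
// HasNeonBf16Support() is a hypothetical placeholder, not an existing MLAS API.
if (HasNeonBf16Support()) {
    this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
    this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
    this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
} else {
    this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
    this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
    this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
}
// Depthwise keeps the float kernel until the BF16 variant is enabled.
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;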
