Conversation

@Rohanjames1997 (Contributor)

Description

This is a follow-up to #26838; please review this PR only after that one is merged.

Motivation and Context

This PR introduces BF16 implementations of the Depthwise, NCHWc, and NCHW Conv kernels.

Performance Improvement

These kernels, along with the BF16 Pointwise Conv kernels, provide a 33% speedup on models (other than Mobilenet) that stress the NCHWc kernels.

Rohanjames1997 and others added 27 commits on December 4, 2025 at 20:05
@Rohanjames1997 (Contributor, Author)

This PR is still in pretty rough shape, so before investing any effort in a manual review, can I get a Copilot review? @hariharans29

hariharans29 requested a review from Copilot on January 16, 2026 at 21:02
@hariharans29 (Member)

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR introduces BF16 (bfloat16) implementations for NCHWc, NCHW, and Depthwise convolution kernels on ARM NEON platforms, building on previous work. The implementations use SBGEMM (Single-precision BFLOAT16 GEMM) operations and achieve a 33% performance improvement on models that heavily use NCHWc kernels.
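
As a rough mental model of that design, consider the toy sketch below. It is purely illustrative and self-contained (bf16 values are stored as the upper 16 bits of an IEEE float, and the GEMM is a naive triple loop); it is not the MLAS kernel code in this PR, and the helper names are invented for the example. The NCHWc path issues one small GEMM per kernel position and accumulates into the same float output, with a ZeroMode-style flag selecting overwrite versus accumulate:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Decode a bf16 value stored in a uint16_t (the high 16 bits of an IEEE float).
static float Bf16ToFloat(uint16_t v) {
    uint32_t bits = static_cast<uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Naive bf16 x bf16 -> float GEMM. ZeroMode=true overwrites C; false accumulates into it.
static void TinyBf16Gemm(size_t M, size_t N, size_t K,
                         const uint16_t* A, const uint16_t* B, float* C, bool ZeroMode) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = ZeroMode ? 0.0f : C[m * N + n];
            for (size_t k = 0; k < K; ++k) {
                acc += Bf16ToFloat(A[m * K + k]) * Bf16ToFloat(B[k * N + n]);
            }
            C[m * N + n] = acc;
        }
    }
}

// NCHWc-style driver: one GEMM per kernel position, K = BlockSize each time.
// Im2colA[kpos] holds the OutputCount x BlockSize input patch for that kernel position,
// FilterB[kpos] holds the BlockSize x BlockSize filter slice for that position.
static void NchwcConvSketch(const std::vector<std::vector<uint16_t>>& Im2colA,
                            const std::vector<std::vector<uint16_t>>& FilterB,
                            float* Output, size_t OutputCount, size_t BlockSize) {
    for (size_t kpos = 0; kpos < Im2colA.size(); ++kpos) {
        TinyBf16Gemm(OutputCount, BlockSize, BlockSize,
                     Im2colA[kpos].data(), FilterB[kpos].data(), Output,
                     /*ZeroMode=*/kpos == 0);  // first position overwrites, the rest accumulate
    }
}

The NCHW variant would instead be a single GEMM with K = KernelSize, since each kernel position contributes only one input channel.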

Changes:

  • Added ZeroMode parameter to SBGEMM operations to support output accumulation mode
  • Implemented BF16 convolution kernels for NCHW/NCHWc formats using im2col + SBGEMM approach
  • Registered new BF16 kernel implementations in platform initialization

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • onnxruntime/core/mlas/lib/sbgemm.h: Added an InitialZeroMode parameter to control zero-mode behavior in SBGEMM operations (see the sketch after this list)
  • onnxruntime/core/mlas/lib/sbconv_kernel_neon.cpp: New file implementing BF16 depthwise, NCHWc, NCHW, and pointwise convolution kernels
  • onnxruntime/core/mlas/lib/platform.cpp: Registered BF16 kernel function pointers for the NEON platform
  • onnxruntime/core/mlas/lib/mlasi.h: Declared new BF16 convolution kernel function signatures
  • onnxruntime/core/mlas/inc/mlas.h: Added a ZeroMode field to the SBGEMM data parameters structure
  • cmake/onnxruntime_mlas.cmake: Added sbconv_kernel_neon.cpp to the build with BF16 compilation flags
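
On the sbgemm.h change: the sketch below illustrates why a first-panel zero-mode knob is useful in a K-blocked GEMM driver. It is an assumption-laden illustration, not the actual sbgemm.h code. Only the very first K panel of a call may overwrite C; exposing that first-panel behavior as InitialZeroMode lets a convolution caller chain several GEMM calls into one output buffer (passing false on every call after the first).

#include <algorithm>
#include <cstddef>

// Illustrative K-blocked GEMM driver (plain float for brevity, not the MLAS code).
static void BlockedGemmDriverSketch(size_t M, size_t N, size_t K, size_t PanelK,
                                    const float* A, const float* B, float* C,
                                    bool InitialZeroMode /* caller-chosen, e.g. kpos == 0 */) {
    for (size_t k0 = 0; k0 < K; k0 += PanelK) {
        const size_t kc = std::min(PanelK, K - k0);
        // Only the first panel honors the caller's InitialZeroMode; every later panel
        // must accumulate, or it would discard the partial sums already written to C.
        const bool zeroMode = (k0 == 0) && InitialZeroMode;
        for (size_t m = 0; m < M; ++m) {
            for (size_t n = 0; n < N; ++n) {
                float acc = zeroMode ? 0.0f : C[m * N + n];
                for (size_t k = 0; k < kc; ++k) {
                    acc += A[m * K + k0 + k] * B[(k0 + k) * N + n];
                }
                C[m * N + n] = acc;
            }
        }
    }
}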


Comment on lines +126 to +128
// BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
// NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
// NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize

Copilot AI Jan 16, 2026

Duplicate comment block at lines 121-124 and lines 126-128. Remove one of the duplicate comment blocks.

Suggested change (remove the duplicated lines):
- // BF16 NCHW/NCHWc Convolution Kernel using im2col + SBGEMM.
- // NCHW: 1 input channel per kernel position, single GEMM with K=KernelSize
- // NCHWc: BlockSize input channels per kernel position, loop over kpos with K=BlockSize

Comment on lines 569 to 575
this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvNchwFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change:
- this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
- this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
- this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
- // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
+ this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
+ this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
+ this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
+ // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;

Comment on lines +570 to +576
this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvNchwcFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change:
- this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
- this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
- this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
- // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
- this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
+ // this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
+ this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
+ // this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
+ this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
+ // this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
+ this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
+ // this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;

Comment on the following lines:
this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;
// this->ConvDepthwiseFloatKernel = MlasConvDepthwiseBf16KernelNeon;
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;

Copilot AI Jan 16, 2026

Duplicate assignment to ConvPointwiseFloatKernel. The second assignment overwrites the first. If the intent is to replace the float kernel with the BF16 kernel, remove the first assignment. Otherwise, clarify the initialization logic.

Suggested change
this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
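
If the intent is indeed to prefer the BF16 kernels on hardware that supports them, one way to make that explicit (rather than relying on the second assignment overwriting the first) is to branch on a capability check. This is only a sketch: HasNeonBf16Support() is a hypothetical stand-in for whatever BF16 feature detection this codebase already uses, and the kernel names are the ones visible in this diff.

// Sketch only: register the BF16 kernels behind an explicit capability check.
// HasNeonBf16Support() is a hypothetical placeholder, not an existing MLAS API.
if (HasNeonBf16Support()) {
    this->ConvNchwFloatKernel = MlasConvNchwBf16KernelNeon;
    this->ConvNchwcFloatKernel = MlasConvNchwcBf16KernelNeon;
    this->ConvPointwiseFloatKernel = MlasConvPointwiseBf16KernelNeon;
} else {
    this->ConvNchwFloatKernel = MlasConvNchwFloatKernelNeon;
    this->ConvNchwcFloatKernel = MlasConvNchwcFloatKernelNeon;
    this->ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelNeon;
}
// Depthwise keeps the float kernel until the BF16 variant is enabled.
this->ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelNeon;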
