
Conversation


tianleiwu (Contributor) commented Jan 30, 2026

This PR resolves flakiness and accuracy issues in the MatMulNBitsLutGemm operator.

Root Cause Analysis

The MatMulNBitsLutGemm operator exhibited non-deterministic flakiness and numerical accuracy issues. This analysis covers the root causes addressed by the changes.

Identified Root Causes

1. Data Race in LutGemmPackQuantBData

  • Issue: The weight packing loop was parallelized across output features ($N$). Since T-MAC packs multiple features into a single byte, concurrent updates to the same byte caused bit-level corruption.
  • Fix: Serialized the sub-byte accumulation phase of the weight packing process (a minimal sketch of the failure mode follows below).
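
To make the race concrete, here is a minimal sketch, assuming a simplified layout in which two adjacent features share one packed byte. The function and variable names are illustrative, not the actual MLAS packing code:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: two adjacent features share one packed byte, so a parallel loop
// over n would have two threads read-modify-write the same destination byte.
// Keeping this phase serial (the fix) means no two iterations touch the same byte.
void PackTwoFeaturesPerByteSketch(const uint8_t* quant_b, uint8_t* packed,
                                  size_t N, size_t K) {
  // Assumes `packed` is zero-initialized by the caller.
  for (size_t n = 0; n < N; ++n) {
    for (size_t k = 0; k < K; ++k) {
      const uint8_t nibble = quant_b[n * K + k] & 0x0F;
      uint8_t& dst = packed[(n / 2) * K + k];  // byte shared by features n and n+1
      dst |= (n % 2 == 0) ? nibble : static_cast<uint8_t>(nibble << 4);  // read-modify-write
    }
  }
}
```

With the outer loop parallelized across n, iterations n and n+1 both read-modify-write the same byte, which is exactly the bit-level corruption described above.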

2. Thread-Safety in Global Configuration Map

  • Issue: tmac_kernel_configs (a static std::unordered_map) was accessed concurrently. Map insertions or rehashing during initialization could invalidate references held by other threads.
  • Fix: Added std::mutex protection and modified the parameter getter to return by value (see the sketch below).
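
A minimal sketch of the fixed access pattern, assuming a hypothetical config struct; TmacKernelConfigSketch, its fields, and the key type are placeholders, and only the map name tmac_kernel_configs comes from the source:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative stand-ins; the real key and config types live in qlutgemm.cpp.
struct TmacKernelConfigSketch {
  int bm = 0;
  int kfactor = 0;
};

static std::unordered_map<std::string, TmacKernelConfigSketch> tmac_kernel_configs;
static std::mutex tmac_kernel_configs_mutex;

// Returning by value under the lock means callers never hold a reference into
// the map, so a concurrent insertion (which may rehash) cannot invalidate
// what they are reading.
TmacKernelConfigSketch GetTmacKernelConfigSketch(const std::string& key) {
  std::lock_guard<std::mutex> lock(tmac_kernel_configs_mutex);
  return tmac_kernel_configs[key];  // default-constructs an entry if missing, still under the lock
}
```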

3. Tiling Dimension Mismatch and Buffer Safety

  • Issue: The orchestrator used batch size ($M$) for kernel configuration, while weights are tiled by features ($N$). Additionally, the kernel lacked clamping for partial tiles, leading to potential overruns.
  • Fix: Synchronized tiling logic by using $N$ for initialization, passing TotalN for parameter retrieval, and implementing explicit clamping and tail-case handling in the AVX2 kernel (see the sketch below).
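
The clamping can be illustrated with a short sketch; ForEachTileAlongN and TileN are hypothetical names, and only TotalN appears in the PR:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative tail handling: clamp the tile width so the last, partial tile
// along N never indexes past TotalN.
void ForEachTileAlongN(size_t TotalN, size_t TileN) {
  for (size_t n = 0; n < TotalN; n += TileN) {
    const size_t n_width = std::min(TileN, TotalN - n);  // partial tile at the tail
    // ... process output columns [n, n + n_width) only ...
    (void)n_width;
  }
}
```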

Verification Results

  • MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256 passed 100 consecutive iterations.
  • The full MatMul2Bits suite passed all 10 tests at the standard 0.15f tolerance.

tianleiwu requested a review from vraspar on January 31, 2026 00:28

Copilot AI left a comment


Pull request overview

This PR fixes critical bugs in the MatMulNBitsLutGemm (T-MAC) operator that caused intermittent failures and numerical accuracy issues for multi-row activations. The fixes address scale indexing errors, race conditions in parallel processing, and buffer allocation issues.

Changes:

  • Fixed incorrect LUT scale indexing from kk / (ActK * 4) to kk / ActK in the AVX2 kernel (a sketch follows this list)
  • Serialized activation loop to eliminate race conditions in multi-row processing
  • Added tail handling for matrices where dimensions are not multiples of 32
  • Corrected buffer size calculations and added explicit zero-initialization
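
As a rough illustration of the first bullet, assuming one LUT scale per ActK-sized block along K; LookUpLutScaleSketch and lut_scales are placeholder names, not the kernel's actual identifiers:

```cpp
#include <cstddef>

// Illustrative only: with one LUT scale per ActK-sized block along K, the scale
// for position kk is lut_scales[kk / ActK]. A divisor of ActK * 4 would reuse
// one block's scale across four blocks and skew the accumulation.
inline float LookUpLutScaleSketch(const float* lut_scales, size_t kk, size_t ActK) {
  return lut_scales[kk / ActK];
}
```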

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Reviewed files:

  • onnxruntime/test/contrib_ops/matmul_2bits_test.cc: Increased tolerance to 1.0f for the Batch32 asymmetric test to account for T-MAC's lossy quantization
  • onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp: Fixed the scale indexing bug, added tail-case handling, and added explicit buffer initialization
  • onnxruntime/core/mlas/lib/qlutgemm.cpp: Corrected the buffer size calculation, serialized the activation loop, and added explicit zero-initialization


tianleiwu force-pushed the tlwu/fix_lut_gemm branch 2 times, most recently from b82fa9e to ecc7081, on January 31, 2026 05:40
tianleiwu changed the title from [MLAS] Fix Lut GEMM to [MLAS] Fix Lut GEMM Flakiness and Accuracy on Jan 31, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


