
Conversation


tianleiwu (Contributor) commented Jan 30, 2026

This PR resolves flakiness and accuracy issues in the MatMulNBitsLutGemm operator.

Root Cause Analysis

The MatMulNBitsLutGemm operator exhibited non-deterministic flakiness and numerical accuracy issues. This analysis covers the root causes addressed by the changes.

Identified Root Causes

1. Data Race in LutGemmPackQuantBData

  • Issue: The weight packing loop was parallelized across output features ($N$). Since T-MAC packs multiple features into a single byte, concurrent updates to the same byte caused bit-level corruption.
  • Fix: Serialized the sub-byte accumulation phase of the weight packing process (a minimal sketch of the failure mode follows below).
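
To make the race concrete, here is a minimal sketch, assuming a simplified layout in which two adjacent features share one packed byte. The function and variable names are illustrative, not the actual MLAS packing code:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: two adjacent features share one packed byte, so a parallel loop
// over n would have two threads read-modify-write the same destination byte.
// Keeping this phase serial (the fix) means no two iterations touch the same byte.
void PackTwoFeaturesPerByteSketch(const uint8_t* quant_b, uint8_t* packed,
                                  size_t N, size_t K) {
  // Assumes `packed` is zero-initialized by the caller.
  for (size_t n = 0; n < N; ++n) {
    for (size_t k = 0; k < K; ++k) {
      const uint8_t nibble = quant_b[n * K + k] & 0x0F;
      uint8_t& dst = packed[(n / 2) * K + k];  // byte shared by features n and n+1
      dst |= (n % 2 == 0) ? nibble : static_cast<uint8_t>(nibble << 4);  // read-modify-write
    }
  }
}
```

With the outer loop parallelized across n, iterations n and n+1 both read-modify-write the same byte, which is exactly the bit-level corruption described above.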

2. Thread-Safety in Global Configuration Map

  • Issue: tmac_kernel_configs (a static std::unordered_map) was accessed concurrently. Map insertions or rehashing during initialization could invalidate references held by other threads.
  • Fix: Added std::mutex protection and modified the parameter getter to return by value (see the sketch below).
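
A minimal sketch of the fixed access pattern, assuming a hypothetical config struct; TmacKernelConfigSketch, its fields, and the key type are placeholders, and only the map name tmac_kernel_configs comes from the source:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative stand-ins; the real key and config types live in qlutgemm.cpp.
struct TmacKernelConfigSketch {
  int bm = 0;
  int kfactor = 0;
};

static std::unordered_map<std::string, TmacKernelConfigSketch> tmac_kernel_configs;
static std::mutex tmac_kernel_configs_mutex;

// Returning by value under the lock means callers never hold a reference into
// the map, so a concurrent insertion (which may rehash) cannot invalidate
// what they are reading.
TmacKernelConfigSketch GetTmacKernelConfigSketch(const std::string& key) {
  std::lock_guard<std::mutex> lock(tmac_kernel_configs_mutex);
  return tmac_kernel_configs[key];  // default-constructs an entry if missing, still under the lock
}
```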

3. Tiling Dimension Mismatch and Buffer Safety

  • Issue: The orchestrator used batch size ($M$) for kernel configuration, while weights are tiled by features ($N$). Additionally, the kernel lacked clamping for partial tiles, leading to potential overruns.
  • Fix: Synchronized tiling logic by using $N$ for initialization, passing TotalN for parameter retrieval, and implementing explicit clamping and tail-case handling in the AVX2 kernel (see the sketch below).
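
The clamping can be illustrated with a short sketch; ForEachTileAlongN and TileN are hypothetical names, and only TotalN appears in the PR:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative tail handling: clamp the tile width so the last, partial tile
// along N never indexes past TotalN.
void ForEachTileAlongN(size_t TotalN, size_t TileN) {
  for (size_t n = 0; n < TotalN; n += TileN) {
    const size_t n_width = std::min(TileN, TotalN - n);  // partial tile at the tail
    // ... process output columns [n, n + n_width) only ...
    (void)n_width;
  }
}
```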

Verification Results

  • MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256 passed 100 consecutive iterations.
  • The full MatMul2Bits suite passed all 10 tests at the standard 0.15f tolerance.

tianleiwu requested a review from vraspar on January 31, 2026 00:28

Copilot AI left a comment


Pull request overview

This PR fixes critical bugs in the MatMulNBitsLutGemm (T-MAC) operator that caused intermittent failures and numerical accuracy issues for multi-row activations. The fixes address scale indexing errors, race conditions in parallel processing, and buffer allocation issues.

Changes:

  • Fixed incorrect LUT scale indexing from kk / (ActK * 4) to kk / ActK in the AVX2 kernel (a sketch follows this list)
  • Serialized activation loop to eliminate race conditions in multi-row processing
  • Added tail handling for matrices where dimensions are not multiples of 32
  • Corrected buffer size calculations and added explicit zero-initialization
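
As a rough illustration of the first bullet, assuming one LUT scale per ActK-sized block along K; LookUpLutScaleSketch and lut_scales are placeholder names, not the kernel's actual identifiers:

```cpp
#include <cstddef>

// Illustrative only: with one LUT scale per ActK-sized block along K, the scale
// for position kk is lut_scales[kk / ActK]. A divisor of ActK * 4 would reuse
// one block's scale across four blocks and skew the accumulation.
inline float LookUpLutScaleSketch(const float* lut_scales, size_t kk, size_t ActK) {
  return lut_scales[kk / ActK];
}
```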

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Reviewed files:

  • onnxruntime/test/contrib_ops/matmul_2bits_test.cc: Increased tolerance to 1.0f for the Batch32 asymmetric test to account for T-MAC's lossy quantization
  • onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp: Fixed the scale indexing bug, added tail-case handling, and added explicit buffer initialization
  • onnxruntime/core/mlas/lib/qlutgemm.cpp: Corrected the buffer size calculation, serialized the activation loop, and added explicit zero-initialization


tianleiwu force-pushed the tlwu/fix_lut_gemm branch 2 times, most recently from b82fa9e to ecc7081, on January 31, 2026 05:40
tianleiwu changed the title from [MLAS] Fix Lut GEMM to [MLAS] Fix Lut GEMM Flakiness and Accuracy on Jan 31, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


