-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[MLAS] Fix Lut GEMM Flakiness and Accuracy #27216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes critical bugs in the MatMulNBitsLutGemm (T-MAC) operator that caused intermittent failures and numerical accuracy issues for multi-row activations. The fixes address scale indexing errors, race conditions in parallel processing, and buffer allocation issues.
Changes:
- Fixed incorrect LUT scale indexing from
kk / (ActK * 4)tokk / ActKin AVX2 kernel - Serialized activation loop to eliminate race conditions in multi-row processing
- Added tail handling for matrices where dimensions are not multiples of 32
- Corrected buffer size calculations and added explicit zero-initialization
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/contrib_ops/matmul_2bits_test.cc | Increased tolerance to 1.0f for Batch32 asymmetric test to account for T-MAC's lossy quantization |
| onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp | Fixed scale indexing bug, added tail case handling, and added explicit buffer initialization |
| onnxruntime/core/mlas/lib/qlutgemm.cpp | Corrected buffer size calculation, serialized activation loop, and added explicit zero-initialization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b82fa9e to
ecc7081
Compare
ecc7081 to
2d0cc15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This PR resolves flakiness and accuracy issues in the
MatMulNBitsLutGemmoperator.Root Cause Analysis
The
MatMulNBitsLutGemmoperator exhibited non-deterministic flakiness and numerical accuracy issues. This analysis covers the root causes addressed by the changes.Identified Root Causes
1. Data Race in LutGemmPackQuantBData
2. Thread-Safety in Global Configuration Map
tmac_kernel_configs(a staticstd::unordered_map) was accessed concurrently. Map insertions or rehashing during initialization could invalidate references held by other threads.std::mutexprotection and modified the parameter getter to return by value.3. Tiling Dimension Mismatch and Buffer Safety
TotalNfor parameter retrieval, and implementing explicit clamping and tail-case handling in the AVX2 kernel.Verification Results
MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256passed 100 consecutive iterations.