
Conversation

@vraspar (Contributor) commented Dec 1, 2025

Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the T-MAC paper (https://arxiv.org/abs/2407.00088) and the T-MAC repository (https://github.com/microsoft/T-MAC), to speed up low-bit LLM inference.

Unlike the existing quant-dequant methods, the LUT-based method directly supports mixed-precision GEMM without dequantization. It uses bit-wise table lookups to eliminate multiplications and reduce the number of additions required in matrix multiplication.

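To make the trick concrete, here is a minimal, self-contained scalar sketch of the table lookup for one 1-bit plane of the 2-bit weights. It is illustrative only: the real kernel builds the LUTs once per activation row (GenerateLUT_avx2) and reuses them across all N output columns, combines the two bit planes with the appropriate power-of-two weighting, applies per-block scales/zero points, and vectorizes the 16-entry lookup with AVX2.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative scalar sketch: dot product of a float activation row with one
// 1-bit plane of the weights, using table lookups instead of multiplications.
// For each group of g = 4 activations, all 2^4 = 16 possible partial sums are
// precomputed once; a column's 4 packed weight bits then select one entry.
float LutDotOneBitPlane(const std::vector<float>& a,         // activations, length K (multiple of 4)
                        const std::vector<uint8_t>& bits) {  // K/4 bytes; low nibble holds 4 weight bits
    float acc = 0.0f;
    for (size_t k = 0; k < a.size(); k += 4) {
        // Build the 16-entry LUT for this group of 4 activations.
        std::array<float, 16> lut{};
        for (uint32_t idx = 0; idx < 16; ++idx) {
            float s = 0.0f;
            for (uint32_t j = 0; j < 4; ++j) {
                if ((idx >> j) & 1) s += a[k + j];
            }
            lut[idx] = s;
        }
        // One lookup + one add replace four multiply-adds.
        acc += lut[bits[k / 4] & 0xF];
    }
    return acc;
}
```

The savings come from reuse: a LUT built for one activation group serves every one of the N weight columns, so the per-column inner loop is lookups and additions only.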

This PR:

  • Adds an mlas.use_lut_gemm session option allowing use of LUT GEMM inside MatMulNBits when it is available (2-bit weights, BlkLen a multiple of 32, K a multiple of 32, N a multiple of 128, AVX2 present); a sketch of these checks follows this list.
  • Introduces LUT packing + a kernel config cache (packs bitplanes, scales, ZP) and the main MlasLUTGemm entry that generates per-row LUTs and calls the AVX2 kernel.
  • Implements AVX2 LUT generation (GenerateLUT_avx2) and GEMM compute (TMACComputeGemm_avx2), and wires dispatch into MLAS platform init.
  • Updates MatMulNBits PrePack/Compute to use LUT packing/compute when opted in; keeps the existing quant-dequant path as a fallback.
  • Extends the Python quantization bindings with a 2-bit QDQ helper for parity with the new path.
  • Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric quantization and multiple shapes/block sizes.
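A hypothetical helper mirroring the availability conditions in the first bullet (the real MLAS check may be named and structured differently):

```cpp
#include <cstddef>

// Hypothetical availability check for the LUT GEMM path; conditions taken
// from the PR description above.
bool LutGemmIsAvailable(size_t BlkBitWidth, size_t BlkLen, size_t K, size_t N, bool HasAvx2) {
    return BlkBitWidth == 2 &&
           BlkLen % 32 == 0 &&
           K % 32 == 0 &&
           N % 128 == 0 &&
           HasAvx2;
}
```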

Main components:

  • MlasInitLUTGemmKernelConfig: Config for LUT kernels
  • MlasLUTGemmPackQuantBData: Pre-packing of quantized weights
  • MlasLUTPackScalesAndZeroPoints: Pre-packing of quantized scales and zero points
  • MlasLUTGemm: Main entry point
  • GenerateLUT_avx2: LUT construction from activations
  • TMACComputeGemm_avx2: AVX2 LUT GEMM kernel
  • Session option: mlas.use_lut_gemm

How to test

  • MLAS LUT GEMM unit tests: see test_sqlutgemm.cpp
  • Run MatMulNBits models with the session option mlas.use_lut_gemm=1 on AVX2 machines (see the snippet below); expect a fallback to the existing path if the availability checks fail.
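For reference, opting in from the C++ API would look like the sketch below; the model path is a placeholder, and when the availability checks fail, execution simply falls back to the existing path:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "lut_gemm_demo"};
    Ort::SessionOptions options;
    // Opt in to the experimental LUT GEMM path for MatMulNBits.
    options.AddConfigEntry("mlas.use_lut_gemm", "1");
    Ort::Session session{env, ORT_TSTR("model.onnx"), options};  // placeholder model path
    return 0;
}
```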

Perf

The focus of this PR is functional correctness and kernel bring-up; performance numbers will be reported separately once broader profiling is done.

Future Work

  • Support MLFloat16 (FP16 scales and zero points).
  • Add a NEON kernel for ARM.
  • Add kernels for 4-bit weights and BitNet kernels.
  • Broader batch (N>1) support and additional shape coverage.

liqunfu and others added 30 commits January 29, 2025 19:11
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
…as kernel not implemented for fp32. Also, I need to write the packing logic for the scales as well.
…ssert issue with the data shuffling in prepack
@vraspar vraspar marked this pull request as ready for review January 3, 2026 03:26
@vraspar vraspar requested a review from edgchen1 January 3, 2026 03:30
}

// Create a temporary threadpool for parallel packing
// This is used during model load time to speed up weight prepacking
Contributor

What is the overhead like for creating a new threadpool in each call to PrePack()?

I wonder if we should make an existing threadpool available to this code. Perhaps we can pass in the threadpool from SessionState. Something to consider, and maybe for a future PR.

Contributor Author

I agree, passing a thread pool to PrePack would be clean. I am planning a second PR improving the prepacking logic in general; I will include this change there. :)

auto scale_ptr = scales ? scales->DataRaw() : nullptr;
packed_b_ = IAllocator::MakeUniquePtr<void>(alloc, packed_b_size_, true);
MlasQNBitGemmPackQuantBData(N_, K_, nbits_, block_size_, compute_type_, qptr, packed_b_.get(), scale_ptr,
has_zp_input_, nullptr, threadpool_ptr);
Member

IIUC, the usage of the threadpool in the existing non-LUT path seems like a new addition. Is that intentional (and does it come with appropriate tests)?

Contributor Author

Initially, I thought the tests in test_sqnbitgemm.cpp would suffice since they already exercise this path with a thread pool. I have changed it so that only the LUT path uses the thread pool for now.

Once we add tests, I think it would be beneficial to use the thread pool for prepacking on the other paths as well.

Contributor

Closing this comment for now to merge, as discussed offline.

}

// Conditional pragma unroll for compiler compatibility
#if defined(__INTEL_COMPILER) || defined(__clang__)
Member

Why is this compiler dependent? Is this implementation taken from the T-MAC library as-is?
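For context: #pragma unroll is recognized by Clang and the Intel compiler, GCC spells the hint #pragma GCC unroll N, and MSVC has no direct equivalent, so guards of this shape are common. A sketch of the usual pattern, with a hypothetical macro name:

```cpp
// Hypothetical portability macro: expands to the compiler's loop-unroll hint
// where one exists, and to nothing otherwise.
#if defined(__INTEL_COMPILER) || defined(__clang__)
#define LUT_UNROLL _Pragma("unroll")
#elif defined(__GNUC__)
#define LUT_UNROLL _Pragma("GCC unroll 4")
#else
#define LUT_UNROLL
#endif
```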

// Each iteration processes one row of the activation matrix
// TODO(vraspar): Ideally we have to do block parallelism here

MlasTrySimpleParallel(
Member

If M == 1, can we parallelize on N?
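A minimal sketch of what that could look like, reusing the MlasTrySimpleParallel helper from the excerpt above; the tile width, ComputeTile callee, and surrounding variables are hypothetical:

```cpp
// Hypothetical: when M == 1, tile the N output columns and let each work item
// compute one tile. NTile = 128 matches the N-multiple-of-128 availability check.
const size_t NTile = 128;
const ptrdiff_t TileCount = static_cast<ptrdiff_t>(N / NTile);
MlasTrySimpleParallel(ThreadPool, TileCount, [&](ptrdiff_t Tid) {
    const size_t NStart = static_cast<size_t>(Tid) * NTile;
    ComputeTile(NStart, NStart + NTile);  // hypothetical per-tile kernel call
});
```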

}

size_t n_div = 0;
switch (BlkBitWidth) {
Member

Why have this switch if BlkBitWidth is guaranteed to be 2 at this stage?

Contributor Author

I decided to keep it generalized for when we add int4 kernels.

@jambayk jambayk merged commit 8e050d1 into main Jan 15, 2026
90 checks passed
@jambayk jambayk deleted the vraspar/lut-gemm branch January 15, 2026 18:56
* @brief Parameters for TMAC kernel
*/
struct MlasTMACKernelParams {
size_t g;
Member

A brief comment describing what each config field is and what it is used for would help.
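For illustration, the commented form could look like this; only g and ngroups_per_elem are visible in this diff, and the descriptions are a reading of the T-MAC scheme rather than verified documentation:

```cpp
/**
 * @brief Parameters for TMAC kernel
 */
struct MlasTMACKernelParams {
    size_t g;                 // activations/weights grouped per LUT index; each LUT holds 2^g partial sums
    size_t ngroups_per_elem;  // number of g-sized bit groups packed into one byte of packed B data
    // (remaining fields omitted here; each would carry a similar one-line note)
};
```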

)
{
const MlasTMACKernelParams& tmac_params = MlasGetLutGemmKernelParams(N, K, BlkBitWidth, BlkLen, HasZeroPoint);
const size_t PackedQuantBDataSize = (N * BlkBitWidth) * (K / tmac_params.g / tmac_params.ngroups_per_elem);
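As a sanity check on this formula, under the T-MAC defaults of g = 4 with ngroups_per_elem = 2 (an assumption; the values this PR picks per shape may differ): PackedQuantBDataSize = (N * 2) * (K / 4 / 2) = N * K / 4 bytes, i.e. exactly 2 bits per weight with no padding overhead.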
Member

Is there an alignment requirement for the packed weights?

assert(bm % mgroup == 0);
assert(bm % bits == 0);

std::unique_ptr<uint8_t[]> buf(new uint8_t[N * bits * (K / g)]);
@hariharans29 (Member) Jan 16, 2026

For the purposes of what is being done here, a standard RAII container like std::vector would do; do we really need a unique_ptr here?
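The suggested RAII form would be a one-line change:

```cpp
// std::vector frees its storage on scope exit and value-initializes the bytes.
std::vector<uint8_t> buf(N * bits * (K / g));
```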

alex-spacemit pushed a commit to spacemit-com/onnxruntime that referenced this pull request Jan 20, 2026
…TMAC) (microsoft#26695)

tianleiwu pushed a commit that referenced this pull request Jan 21, 2026
…TMAC) (#26695)
