Implement new experimental lookup-based matrix multiplication method(TMAC) #26695

vraspar · 2025-12-01T21:47:02Z

Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method for 2-bit MatMulNBits on x64 AVX2, inspired by the T-MAC paper and the T-MAC repository, to speed up low-bit LLM inference.

Unlike the existing quant-dequant methods, the LUT-based method supports mixed-precision GEMM directly, without dequantization. It uses bit-wise table lookups to eliminate multiplications and reduce the number of additions required in matrix multiplication.
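As a rough illustration of the idea (a scalar sketch, not the AVX2 kernel in this PR), the snippet below computes a dot product between float activations and 2-bit weights using table lookups. It groups activations in fours, precomputes a 16-entry table of partial sums per group, and then uses one 4-bit index per weight bit-plane to fetch a partial sum. The group size of 4, the function names, and the plain unsigned weight encoding (values 0..3, no scales or zero points) are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build one 16-entry table per group of 4 activations.
// Entry `idx` holds the sum of the activations whose bit is set in `idx`.
std::vector<float> BuildLutSketch(const std::vector<float>& a) {
    assert(a.size() % 4 == 0);
    std::vector<float> lut(a.size() / 4 * 16, 0.0f);
    for (std::size_t g = 0; g < a.size() / 4; ++g) {
        for (unsigned idx = 0; idx < 16; ++idx) {
            float sum = 0.0f;
            for (unsigned j = 0; j < 4; ++j) {
                if (idx & (1u << j)) sum += a[g * 4 + j];
            }
            lut[g * 16 + idx] = sum;
        }
    }
    return lut;
}

// Dot product of activations with 2-bit weights (values 0..3) via lookups.
// Each weight is bit0 + 2 * bit1, so the result is
// sum_g LUT[idx of plane 0] + 2 * sum_g LUT[idx of plane 1].
float LutDotSketch(const std::vector<float>& lut, const std::vector<std::uint8_t>& w) {
    assert(w.size() % 4 == 0 && lut.size() == w.size() / 4 * 16);
    float plane_sum[2] = {0.0f, 0.0f};
    for (std::size_t g = 0; g < w.size() / 4; ++g) {
        for (unsigned b = 0; b < 2; ++b) {
            unsigned idx = 0;  // 4-bit index: bit j is bit b of weight j in this group
            for (unsigned j = 0; j < 4; ++j) {
                idx |= ((w[g * 4 + j] >> b) & 1u) << j;
            }
            plane_sum[b] += lut[g * 16 + idx];
        }
    }
    return plane_sum[0] + 2.0f * plane_sum[1];
}

// Reference: ordinary multiply-accumulate over the same inputs.
float DirectDotSketch(const std::vector<float>& a, const std::vector<std::uint8_t>& w) {
    assert(a.size() == w.size());
    float acc = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        acc += a[k] * static_cast<float>(w[k]);
    }
    return acc;
}
```

In this sketch the per-weight multiplications are replaced by one lookup per group and bit-plane, with a single scaling per plane at the end. The index is computed on the fly here for readability; a real kernel would presumably read it from prepacked weights.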

This PR:

Adds the mlas.use_lut_gemm session option, which allows MatMulNBits to use LUT GEMM when it is available (2-bit weights, BlkLen a multiple of 32, K a multiple of 32, N a multiple of 128, AVX2 present).
Introduces LUT packing and a kernel config cache (packs bit-planes, scales, and zero points), plus the main MlasLUTGemm entry point, which generates per-row LUTs and calls the AVX2 kernel.
Implements AVX2 LUT generation (GenerateLUT_avx2) and GEMM compute (TMACComputeGemm_avx2), and wires up dispatch in MLAS platform init.
Updates MatMulNBits PrePack/Compute to use LUT packing/compute when opted in; keeps the existing quant-dequant path as a fallback.
Extends the Python quant bindings with a 2-bit QDQ helper for parity with the new path.
Adds MLAS unit tests covering LUT GEMM across symmetric/asymmetric quantization and multiple shapes and block sizes.
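The opt-in and fallback behavior described in the list above can be sketched as a small selection function. This is only an illustration of the listed conditions: the type and function names below are hypothetical and are not the MLAS or MatMulNBits API, and the real checks may include conditions not mentioned in the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, used only for this sketch.
enum class MatMulPathSketch { kLutGemm, kQuantDequant };

struct LutGemmShapeSketch {
    std::size_t n;        // output columns
    std::size_t k;        // reduction dimension
    std::size_t blk_len;  // quantization block length
    std::size_t nbits;    // weight bit width
};

// Mirrors the availability conditions listed in the PR description:
// 2-bit, BlkLen multiple of 32, K multiple of 32, N multiple of 128, AVX2 present.
bool IsLutGemmAvailableSketch(const LutGemmShapeSketch& s, bool has_avx2) {
    return has_avx2 &&
           s.nbits == 2 &&
           s.blk_len % 32 == 0 &&
           s.k % 32 == 0 &&
           s.n % 128 == 0;
}

// LUT GEMM is used only when the user opted in AND the checks pass;
// otherwise the existing quant-dequant path is kept as the fallback.
MatMulPathSketch SelectPathSketch(const LutGemmShapeSketch& s,
                                  bool use_lut_gemm_option,
                                  bool has_avx2) {
    if (use_lut_gemm_option && IsLutGemmAvailableSketch(s, has_avx2)) {
        return MatMulPathSketch::kLutGemm;
    }
    return MatMulPathSketch::kQuantDequant;
}
```

For example, a 2-bit weight with N=128, K=64, BlkLen=32 would select the LUT path when the option is set on an AVX2 machine, while the same shape with N=100 would fall back.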

Main components:

MlasInitLUTGemmKernelConfig: config for LUT kernels
MlasLUTGemmPackQuantBData: prepacking of quantized weights
MlasLUTPackScalesAndZeroPoints: prepacking of quantized scales and zero points
MlasLUTGemm: main entry point
GenerateLUT_avx2: LUT construction from activations
TMACComputeGemm_avx2: AVX2 LUT GEMM kernel
Session option: mlas.use_lut_gemm
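To make the "bit-plane" packing idea more concrete, here is a minimal sketch of splitting 2-bit weights into two bit-planes, where each group of 4 weights becomes one 4-bit LUT index per plane. The layout, the group size, and the function names are assumptions for illustration only; this is not the layout produced by MlasLUTGemmPackQuantBData, and scales and zero points are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split 2-bit weights (one value 0..3 per byte) into two bit-planes.
// planes[b][g] is the 4-bit index formed from bit b of the 4 weights in group g.
std::vector<std::vector<std::uint8_t>> PackBitPlanesSketch(const std::vector<std::uint8_t>& w) {
    assert(w.size() % 4 == 0);
    const std::size_t groups = w.size() / 4;
    std::vector<std::vector<std::uint8_t>> planes(2, std::vector<std::uint8_t>(groups, 0));
    for (std::size_t g = 0; g < groups; ++g) {
        for (unsigned b = 0; b < 2; ++b) {
            unsigned idx = 0;
            for (unsigned j = 0; j < 4; ++j) {
                assert(w[g * 4 + j] < 4);  // only 2-bit values are valid
                idx |= ((w[g * 4 + j] >> b) & 1u) << j;
            }
            planes[b][g] = static_cast<std::uint8_t>(idx);
        }
    }
    return planes;
}

// Inverse of the packing above, useful for checking that it is lossless.
std::vector<std::uint8_t> UnpackBitPlanesSketch(const std::vector<std::vector<std::uint8_t>>& planes) {
    assert(planes.size() == 2 && planes[0].size() == planes[1].size());
    std::vector<std::uint8_t> w(planes[0].size() * 4, 0);
    for (std::size_t g = 0; g < planes[0].size(); ++g) {
        for (unsigned j = 0; j < 4; ++j) {
            const unsigned bit0 = (planes[0][g] >> j) & 1u;
            const unsigned bit1 = (planes[1][g] >> j) & 1u;
            w[g * 4 + j] = static_cast<std::uint8_t>(bit0 | (bit1 << 1));
        }
    }
    return w;
}
```

Doing this once at prepack time means the compute loop can use each stored index directly as a table offset instead of extracting bits per call.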

How to test

MLAS LUT GEMM unit tests: see test_sqlutgemm.cpp
Run MatMulNBits models with the session option mlas.use_lut_gemm=1 on AVX2 machines; expect a fallback to the existing path if the availability checks fail.

Perf

Focus of this PR is functional + kernel bring-up; perf to be reported separately once broader profiling is done.

Future Work

Support MLFloat16 (FP16 scales and zero points).
Add a NEON kernel for ARM.
Add kernels for 4-bit weights and BitNet kernels.
Broader batch (N>1) support and additional shape coverage.

Signed-off-by: Liqun Fu <[email protected]>

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

onnxruntime/test/mlas/unittest/test_sqlutgemm.cpp

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

onnxruntime/python/onnxruntime_pybind_quant.cc

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

edgchen1 · 2026-01-06T19:26:38Z

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

  }
+
+  // Create a temporary threadpool for parallel packing
+  // This is used during model load time to speed up weight prepacking


What is the overhead of creating a new threadpool in each call to PrePack()?

I wonder if we should make an existing threadpool available to this code. Perhaps we can pass in the threadpool from SessionState. Something to consider, maybe for a future PR.

vraspar · 2026-01-07

I agree, passing the thread pool to PrePack would be cleaner. I am planning a second PR to improve the prepacking logic in general, and I will include this change in it :)

onnxruntime/core/mlas/inc/mlas_qnbit.h

onnxruntime/python/onnxruntime_pybind_quant.cc

onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp

onnxruntime/core/mlas/lib/platform.cpp

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits_impl.cc

cmake/onnxruntime_mlas.cmake

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

hariharans29 · 2026-01-11T23:07:13Z

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

+      auto scale_ptr = scales ? scales->DataRaw() : nullptr;
+      packed_b_ = IAllocator::MakeUniquePtr<void>(alloc, packed_b_size_, true);
+      MlasQNBitGemmPackQuantBData(N_, K_, nbits_, block_size_, compute_type_, qptr, packed_b_.get(), scale_ptr,
+                                  has_zp_input_, nullptr, threadpool_ptr);


IIUC, the use of the threadpool in the existing non-LUT path seems to be a new addition. Is that intentional (and does it come with appropriate tests)?

vraspar · 2026-01-12

Initially, I thought the tests in test_sqnbitgemm.cpp would suffice, since they already exercise it with a thread pool. I have changed the code to use the thread pool only for the LUT path for now.

Once we add tests, I think it might be beneficial to use the thread pool for prepacking in the other paths as well.

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc

onnxruntime/python/onnxruntime_pybind_quant.cc

liqunfu and others added 30 commits January 29, 2025 19:11

init code structure for matmul 2 bits

5484560

Signed-off-by: Liqun Fu <[email protected]>

add and pass q4dq tests for q2bit - rename file and test name later

8c1cfe1

Signed-off-by: Liqun Fu <[email protected]>

some fixes

f6f22e3

Signed-off-by: Liqun Fu <[email protected]>

add apis to neon and other avxs

3e1a951

Signed-off-by: Liqun Fu <[email protected]>

fix neon build

0130061

Signed-off-by: Liqun Fu <[email protected]>

disable 2bit test

b4aad01

Signed-off-by: Liqun Fu <[email protected]>

2 bit quantize to support model builder

ff531cb

Signed-off-by: Liqun Fu <[email protected]>

Merge remote-tracking branch 'msft/main' into carzh/bitnet-reverse-la…

6849ea2

…st-commit

fix compile errors

e85431e

resolve build failure update

9642740

2 bits check

892222a

fixed bug causing int8 tests to fail

07b7f3f

Merge remote-tracking branch 'origin/main' into carzh/bitnet-reverse-…

5fb2edd

…last-commit-new

lintrunner

493ebd1

prepack wip -- not prepacking b data because dispatch to check for ml…

b4b143f

…as kernel not implemented for fp32. Also, I need to write the packing logic for the scales as well.

fixed dispatch issue, added acc level 4 tests, and now running into a…

534b8e6

…ssert issue with the data shuffling in prepack

deep sigh

70d6588

builds somehow

ad2572b

update

b312815

udpate

bfeac34

Implement Pre Packing of qweight for tmac

a5de108

Implement Pre packing for Scales and zero points

7ff8218

Transform zero points before interleaving

6d8e8ec

Initial implementation of tmac kernel config

5d19daf

Move pre packing scales and zp code to qlutgemm and use tmac_params

c600056

update

5cf99e6

bug fixes

f9a9b47

Fix bug in scale unpacking

5687e5e

Fix issues with TMAC GEMM kernels and remove hard coded variables

6f08418

Fix bug in LUT table generation

6191aad

Merge remote-tracking branch 'origin/main' into vraspar/lut-gemm

5d8a6ee

github-actions bot reviewed Dec 16, 2025

github-advanced-security bot found potential problems Dec 16, 2025

onnxruntime/test/mlas/unittest/test_sqlutgemm.cpp (fixed)

vraspar added 5 commits December 15, 2025 17:19

revert graph_transform_test.cc

b1fcda1

Clean up: revert unchanged files

3eb22b0

Apply linting and clean up

f61c3d8

Add headers, update binding, and general clean up + linting

bebcb64

Fix zero point test cases

6a2e822

jambayk reviewed Jan 2, 2026

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc (outdated, resolved)

jambayk reviewed Jan 2, 2026

onnxruntime/python/onnxruntime_pybind_quant.cc (resolved)

vraspar added 4 commits January 2, 2026 13:46

Refactor ComputeBPackedLUT to remove unused parameters and simplify f…

a19b2f6

…unction signature

Merge remote-tracking branch 'origin/main' into vraspar/lut-gemm

26678b2

Fix compiler warnings

e5f80cb

Improve error handling in TMACComputeGemm_avx2 for batch size and sca…

b518ce9

…le group size validation

vraspar marked this pull request as ready for review January 3, 2026 03:26

vraspar requested a review from edgchen1 January 3, 2026 03:30

edgchen1 reviewed Jan 6, 2026

Apply feedback and use PrePacking

f94e51e

vraspar added the release:1.24.0 label Jan 8, 2026

vraspar added 2 commits January 8, 2026 11:24

update platform.cpp

7b708ad

use MLAS_THROW_EX for qlutgemm.cpp

58e93ec

hariharans29 reviewed Jan 11, 2026

cmake/onnxruntime_mlas.cmake (resolved)

hariharans29 reviewed Jan 11, 2026

cmake/onnxruntime_mlas.cmake (resolved)

hariharans29 reviewed Jan 11, 2026

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc (outdated, resolved)

hariharans29 reviewed Jan 11, 2026

onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc (resolved)

hariharans29 reviewed Jan 11, 2026

onnxruntime/python/onnxruntime_pybind_quant.cc (resolved)

Add LUT GEMM 2-bit tests and fix Python quantization reference implem…

469cde7

…entation

vraspar requested review from edgchen1 and hariharans29 January 13, 2026 19:09

edgchen1 Jan 6, 2026

vraspar Jan 7, 2026

hariharans29 Jan 11, 2026

vraspar Jan 12, 2026

8 participants
