Vulkan MMQ Integer Dot Refactor and K-Quant support #16536
base: master
Conversation
Interesting. How is the performance for the legacy quants? Having the values decoded to 8b in shared memory would allow for using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned then I guess that's fine.
It's a ~10% improvement for Intel, a little less so on AMD and Nvidia.
Yeah, I gave that a try when I first created this shader and didn't find a good way to use coopmat. I plan to take another look, but I guess I'd create a separate shader for it. There wasn't a good way to add k-quants to the structure it had.
@jeffbolznv I'm trying to investigate the low performance of q2_k with Nvidia Nsight Graphics, but it's giving me some weird results. Additionally, I get something like 12.81 TFLOPS on a normal run, but 14.90 TFLOPS if I disable FP16. The hotspots otherwise are the integer dot math. Any clue what is going on?
This could be register spilling to shared memory. Might be worth trying a smaller tile size to not be so close to the register limit. What is the relative performance of Q2_K and Q4_0, in the old and new paths?
From memory, it's something like 10-14 TFLOPS for the scalar float16 path and around 24 TFLOPS for the q4_0 integer dot one.
These are the actual values. Edit: they also improve significantly with fp16 disabled, which is odd.
This is not just a testing fluke, it also increases pp512 of a model that uses the mmq shader.
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Edit2: This is due to the accumulator type, so it may be a cache effect from the 32-bit sums.
This heavily refactors the caching structure of the MMQ shader and also makes it more modular, to work with other kinds of quants.
Instead of expanding the quants to 8-bit integers while loading into shared memory, the quant structs are now copied through shared memory into registers and only reshaped into 8-bit integers immediately before the integer dot operation. This saves both shared memory and registers.
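The idea above can be sketched in host-side C++ (the struct and function names are illustrative, not the shader's actual identifiers; this mirrors a q4_0-style block of 32 4-bit values with one scale, and a minimal late-unpack integer dot):

```cpp
#include <array>
#include <cstdint>

// Hypothetical q4_0-style block: 32 4-bit quants sharing one fp32 scale.
struct BlockQ4 {
    float d;                     // per-block scale
    std::array<uint8_t, 16> qs;  // 32 packed 4-bit quants
};

// Unpack raw nibbles to signed 8-bit integers only at the point of use,
// instead of storing the expanded int8 form in shared memory.
std::array<int8_t, 32> unpack_q4(const BlockQ4 &b) {
    std::array<int8_t, 32> out{};
    for (int i = 0; i < 16; ++i) {
        out[i]      = int8_t(b.qs[i] & 0x0F) - 8;  // low nibble, offset by -8
        out[i + 16] = int8_t(b.qs[i] >> 4)   - 8;  // high nibble, offset by -8
    }
    return out;
}

// Integer dot against 8-bit activations, accumulated in int32 and scaled
// back to float once per block (in the analogy of an int8 dot pipeline).
float dot_q4_q8(const BlockQ4 &a, const std::array<int8_t, 32> &act, float act_scale) {
    const auto qa = unpack_q4(a);
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) {
        acc += int32_t(qa[i]) * int32_t(act[i]);
    }
    return a.d * act_scale * float(acc);
}
```

In the shader the unpack step would live in registers right before the integer dot instruction; only the packed struct passes through shared memory.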
TODO:
Q2_K performance is not good yet. Mapping the 256-wide quant structure to 32-wide Q8_1 structures is hard to do efficiently, so I'm still looking for the best approach. @jeffbolznv Let me know if you see any obvious issues with the implementation.
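To illustrate why the mapping is awkward, here is a simplified C++ sketch in the spirit of a q2_K-style superblock (not the exact ggml bit layout; struct and function names are made up for illustration). A 256-value superblock carries 16 sub-block scales/mins, so each 32-wide Q8_1-sized slice straddles two 16-value sub-blocks and needs two different scale pairs:

```cpp
#include <array>
#include <cstdint>

// Simplified 256-value superblock: 2-bit quants, 16 packed 4-bit
// scale/min pairs, and two block-wide factors d and dmin.
struct SuperBlock {
    float d, dmin;
    std::array<uint8_t, 16> scales;  // low nibble: scale, high nibble: min
    std::array<uint8_t, 64> qs;      // 256 x 2-bit quants, 4 per byte
};

// Extract 32-wide slice s (0..7) as raw 2-bit values plus the per-sub-block
// scales/mins an integer dot against a 32-wide Q8_1 block would need.
// Note each slice covers TWO 16-value sub-blocks, hence two scale pairs.
void extract_slice(const SuperBlock &b, int s,
                   std::array<int8_t, 32> &q,
                   std::array<float, 2> &scale,
                   std::array<float, 2> &min) {
    for (int i = 0; i < 32; ++i) {
        const int j = s * 32 + i;                       // index in superblock
        q[i] = int8_t((b.qs[j / 4] >> (2 * (j % 4))) & 3);
    }
    for (int k = 0; k < 2; ++k) {
        const uint8_t sc = b.scales[s * 2 + k];
        scale[k] = b.d    * float(sc & 0x0F);
        min[k]   = b.dmin * float(sc >> 4);
    }
}
```

Because the scale changes mid-slice, a single fused multiply per 32-wide dot is not enough; the kernel has to either split each dot in two or carry the minimums separately, which is the efficiency problem described above.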