Vulkan MMQ Integer Dot Refactor and K-Quant support #16536
base: master
Conversation
Interesting. How is the performance for the legacy quants? Having the values decoded to 8b in shared memory would allow for using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned then I guess that's fine.
It's a ~10% improvement for Intel, a little less so on AMD and Nvidia.
Yeah, I gave that a try when I first created this shader and didn't find a good way to use coopmat. I plan to take another look, but I guess I'd create a separate shader for it. There wasn't a good way to add k-quants to the structure it had.
@jeffbolznv I'm trying to investigate the low performance of q2_k with Nvidia Nsight Graphics, but it's giving me some weird results. Additionally, I get something like 12.81 TFLOPS on a normal run, but 14.90 TFLOPS if I disable FP16. The hotspots otherwise are the integer dot math. Any clue what is going on?
This could be register spilling to shared memory. Might be worth trying a smaller tile size to not be so close to the register limit. What is the relative performance of Q2_K and Q4_0, in the old and new paths?
From memory, it's something like 10-14 TFLOPS for the scalar float16 path and around 24 TFLOPS for the q4_0 integer dot one.
These are the actual values. Edit: they also improve significantly with fp16 disabled, which is odd.
This is not just a testing fluke, it also increases pp512 of a model that uses the mmq shader.
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Edit2: This is due to the accumulator type, so it may be a cache effect from the 32-bit sums.
This heavily refactors the caching structure of the MMQ shader and also makes it more modular, to work with other kinds of quants.
Instead of expanding the quants to 8-bit integers while loading into shared memory, the quant structs are now copied through shared memory into registers and only reshaped into 8-bit integers immediately before the integer dot operation. This saves both shared memory and registers.
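The idea above can be sketched in host-side C++ (the struct and function names are illustrative, not the shader's actual identifiers; this mirrors a q4_0-style block of 32 4-bit values with one scale, and a minimal late-unpack integer dot):

```cpp
#include <array>
#include <cstdint>

// Hypothetical q4_0-style block: 32 4-bit quants sharing one fp32 scale.
struct BlockQ4 {
    float d;                     // per-block scale
    std::array<uint8_t, 16> qs;  // 32 packed 4-bit quants
};

// Unpack raw nibbles to signed 8-bit integers only at the point of use,
// instead of storing the expanded int8 form in shared memory.
std::array<int8_t, 32> unpack_q4(const BlockQ4 &b) {
    std::array<int8_t, 32> out{};
    for (int i = 0; i < 16; ++i) {
        out[i]      = int8_t(b.qs[i] & 0x0F) - 8;  // low nibble, offset by -8
        out[i + 16] = int8_t(b.qs[i] >> 4)   - 8;  // high nibble, offset by -8
    }
    return out;
}

// Integer dot against 8-bit activations, accumulated in int32 and scaled
// back to float once per block (in the analogy of an int8 dot pipeline).
float dot_q4_q8(const BlockQ4 &a, const std::array<int8_t, 32> &act, float act_scale) {
    const auto qa = unpack_q4(a);
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) {
        acc += int32_t(qa[i]) * int32_t(act[i]);
    }
    return a.d * act_scale * float(acc);
}
```

In the shader the unpack step would live in registers right before the integer dot instruction; only the packed struct passes through shared memory.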
TODO:
Q2_K performance is not good yet. Mapping the 256-wide quant structure to 32-wide Q8_1 structures is hard to do efficiently, so I'm still looking for the best approach. @jeffbolznv Let me know if you see any obvious issues with the implementation.
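To illustrate why the mapping is awkward, here is a simplified C++ sketch in the spirit of a q2_K-style superblock (not the exact ggml bit layout; struct and function names are made up for illustration). A 256-value superblock carries 16 sub-block scales/mins, so each 32-wide Q8_1-sized slice straddles two 16-value sub-blocks and needs two different scale pairs:

```cpp
#include <array>
#include <cstdint>

// Simplified 256-value superblock: 2-bit quants, 16 packed 4-bit
// scale/min pairs, and two block-wide factors d and dmin.
struct SuperBlock {
    float d, dmin;
    std::array<uint8_t, 16> scales;  // low nibble: scale, high nibble: min
    std::array<uint8_t, 64> qs;      // 256 x 2-bit quants, 4 per byte
};

// Extract 32-wide slice s (0..7) as raw 2-bit values plus the per-sub-block
// scales/mins an integer dot against a 32-wide Q8_1 block would need.
// Note each slice covers TWO 16-value sub-blocks, hence two scale pairs.
void extract_slice(const SuperBlock &b, int s,
                   std::array<int8_t, 32> &q,
                   std::array<float, 2> &scale,
                   std::array<float, 2> &min) {
    for (int i = 0; i < 32; ++i) {
        const int j = s * 32 + i;                       // index in superblock
        q[i] = int8_t((b.qs[j / 4] >> (2 * (j % 4))) & 3);
    }
    for (int k = 0; k < 2; ++k) {
        const uint8_t sc = b.scales[s * 2 + k];
        scale[k] = b.d    * float(sc & 0x0F);
        min[k]   = b.dmin * float(sc >> 4);
    }
}
```

Because the scale changes mid-slice, a single fused multiply per 32-wide dot is not enough; the kernel has to either split each dot in two or carry the minimums separately, which is the efficiency problem described above.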