Misc. bug: Vulkan backend with 7900XTX has severe performance dropoff at some batch sizes #10966

Closed
Mushoz opened this issue Dec 24, 2024 · 7 comments · Fixed by #10991
@Mushoz
Mushoz commented Dec 24, 2024

Name and Version

[docker@a242c844efbf ~]$ llama-cli-vulkan --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
version: 4384 (14b699e)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Problem description & steps to reproduce

llama-batched-bench-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8,16 -pps
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
build: 4384 (14b699e) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12

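|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.578 |   324.39 |    3.838 |    33.35 |    5.416 |   118.17 |
|   512 |    128 |    2 |    768 |    1.555 |   329.33 |   31.047 |     8.25 |   32.602 |    23.56 |
|   512 |    128 |    4 |   1024 |    1.570 |   326.11 |   33.209 |    15.42 |   34.779 |    29.44 |
|   512 |    128 |    8 |   1536 |    1.571 |   325.94 |   37.241 |    27.50 |   38.812 |    39.58 |
|   512 |    128 |   16 |   2560 |    1.575 |   325.05 |   28.106 |    72.87 |   29.681 |    86.25 |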

I understand that scaling at some batch sizes might be less than ideal, but at worst I would expect a small regression from the overhead of batched processing when no scaling can be achieved at all. Right now, batch sizes 2 and 4 in particular show a massive performance loss: token generation throughput (S_TG) falls from 33.35 t/s at B=1 to 8.25 t/s at B=2 and 15.42 t/s at B=4. Can anything be done to improve this? Poor batched performance unfortunately makes speculative decoding on the Vulkan backend unusable.
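To put numbers on the drop-off, here is a minimal sketch that expresses each batch size's aggregate token generation throughput as a fraction of the B=1 figure. The rows are copied from the benchmark output above; the helper name `relative_throughput` is only illustrative and not part of llama.cpp.

```python
# Rows copied from the benchmark output: (B, T_TG seconds, S_TG tokens/s).
rows = [
    (1, 3.838, 33.35),
    (2, 31.047, 8.25),
    (4, 33.209, 15.42),
    (8, 37.241, 27.50),
    (16, 28.106, 72.87),
]

baseline = rows[0][2]  # aggregate TG throughput at B=1


def relative_throughput(rows, baseline):
    """Return {B: S_TG / baseline} for every benchmark row."""
    return {b: s_tg / baseline for b, _, s_tg in rows}


rel = relative_throughput(rows, baseline)
for b, ratio in rel.items():
    print(f"B={b:2d}: {ratio:.2f}x of the B=1 throughput")
```

Any batch size whose ratio is below 1.0 produces fewer tokens per second in total than a single unbatched sequence, which is the regression this issue is about.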

First Bad Commit

No response

Relevant log output

No response

@ggerganov
Owner

It's very difficult to implement efficient small-batch kernels. For speculative decoding, your best bet is to increase the minimum draft size and keep the draft probability high:

--draft-max 16 --draft-min 8 --draft-p-min 0.9

This should still give you some speed-up for low-entropy generations.

@Mushoz
Author

Mushoz commented Dec 24, 2024

Given that token generation at batch size 8 is still slower than at batch size 1, --draft-min should be even higher than 8, right? Especially since it's unlikely that every token in a long draft sequence will be accepted.

I understand that it's very difficult to optimize small-batch kernels, and that performance can actually go down compared to the non-batched case due to overhead, but a roughly 4x drop in performance going from non-batched to batch size 2 sounds like a bug or a major bottleneck somewhere, right?
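The break-even reasoning can be sketched roughly as follows. This is a back-of-the-envelope model, not how llama.cpp schedules speculative decoding: it assumes that verifying a draft of b tokens costs about as much as one decode step at batch size b in the llama-batched-bench run above, and it ignores the cost of the draft model entirely. The function names are hypothetical.

```python
TG = 128  # tokens generated per sequence in the benchmark (-ntg 128)

# T_TG in seconds per batch size B, copied from the benchmark output.
t_tg = {1: 3.838, 2: 31.047, 4: 33.209, 8: 37.241, 16: 28.106}


def step_time(b):
    """Wall time in seconds of one decode step that processes b tokens."""
    return t_tg[b] / TG


def break_even_tokens(b):
    """Tokens a single batched step must yield to match B=1 decoding."""
    return step_time(b) / step_time(1)


for b in sorted(t_tg):
    print(f"B={b:2d}: step {step_time(b) * 1e3:6.1f} ms, "
          f"break-even {break_even_tokens(b):.1f} tokens per step")
```

Under these assumptions, a step at batch size 8 would have to yield more than 8 tokens to match unbatched decoding, which it cannot, while a step at batch size 16 would need a little over 7 accepted tokens.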

@Mushoz
Author

Mushoz commented Dec 24, 2024

Just confirmed my own thoughts. Using:

llama-speculative-simple-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -md /models/Qwen2.5-Coder-0.5B-Instruct-IQ4_XS.gguf -ngl 99 -ngld 99 -p "Write a minesweeper game using html, js and css. Do not give any explanations. Only output the code." --draft-p-min 0.9 --draft 16 --draft-min 8

I am seeing just over 30 tokens/sec:

decoded 1325 tokens in 43.903 seconds, speed: 30.180 t/s

The quality of the draft looked good as expected from the settings used:

n_draft   = 16
n_predict = 1325
n_drafted = 655
n_accept  = 631
accept    = 96.336%

But a plain llama-cli run, invoked as follows:

llama-cli-vulkan -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf -ngl 99 -p "Write a minesweeper game using html, js and css. Do not give any explanations. Only output the code."

runs at just over 33 tokens/sec:

llama_perf_context_print: eval time = 28566.51 ms / 956 runs ( 29.88 ms per token, 33.47 tokens per second)
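As a quick consistency check, the sketch below recomputes the throughput and acceptance rate from the logged numbers. All values are copied from the logs above. Note that the two runs generated different numbers of tokens (1325 vs. 956), so the comparison is indicative rather than exact.

```python
# Speculative run: "decoded 1325 tokens in 43.903 seconds"
spec_tokens, spec_seconds = 1325, 43.903
# Draft statistics from the same run.
n_drafted, n_accept = 655, 631
# Plain llama-cli run: "eval time = 28566.51 ms / 956 runs"
plain_runs, plain_ms = 956, 28566.51

spec_tps = spec_tokens / spec_seconds
plain_tps = plain_runs / (plain_ms / 1000.0)
accept_rate = n_accept / n_drafted

print(f"speculative: {spec_tps:.2f} t/s")
print(f"plain:       {plain_tps:.2f} t/s")
print(f"acceptance:  {accept_rate:.3%}")
print(f"speculative / plain: {spec_tps / plain_tps:.2f}")
```

Even with about 96% of drafted tokens accepted, the speculative run comes out slower than plain decoding, which matches the batched benchmark results.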

@jeffbolznv
Collaborator

The Vulkan backend has two paths for matrix multiplication - a matrix-vector multiply for when N=1, and a matrix-matrix multiply for N>1. The matrix-matrix multiply is really aimed at larger matrices, and doesn't do well with small N. We should be able to do better by adapting the matrix-vector multiply to be able to do a few vectors at a time. I can look into this soon, but we should probably let #10846 land first to avoid conflicts.
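The idea of a matrix-vector multiply that handles a few vectors at a time can be illustrated with a small pure-Python sketch. This is not the Vulkan shader code and not the change made in the linked PR; the function names and the `max_cols` limit are hypothetical. The assumption is that small-N multiplication is dominated by reading the weights, so each weight row is read once and reused for every vector in the group.

```python
def matvec_multi(weights, vectors):
    """Multiply `weights` (rows x k) by a small group of length-k vectors.

    Each weight row is visited once and reused for every vector in the
    group. Returns a rows x len(vectors) result.
    """
    out = [[0.0] * len(vectors) for _ in weights]
    for r, row in enumerate(weights):        # single pass over the weights
        for c, vec in enumerate(vectors):    # reuse the row for each vector
            out[r][c] = sum(w * x for w, x in zip(row, vec))
    return out


def matmul_small_n(weights, vectors, max_cols=4):
    """Hypothetical small-N path: process N vectors in groups of max_cols."""
    out = [[] for _ in weights]
    for start in range(0, len(vectors), max_cols):
        chunk = matvec_multi(weights, vectors[start:start + max_cols])
        for r, vals in enumerate(chunk):
            out[r].extend(vals)
    return out


W = [[1.0, 2.0], [3.0, 4.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 input vectors
result = matmul_small_n(W, X, max_cols=2)
print(result)
```

A real backend would switch between such a path and the general matrix-matrix kernel based on N; where that crossover lies is hardware dependent and would have to be measured.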

@Mushoz
Author

Mushoz commented Dec 24, 2024

Awesome, thanks for your explanation! Let me know when you start working on this, I'd be happy to help benchmark certain setups to see what works best. I know it's only applicable to the 7900xtx, but maybe it's useful data for you.

@jeffbolznv
Collaborator

Hi @Mushoz, please give #10991 a try when you get a chance.

@Mushoz
Author

Mushoz commented Dec 28, 2024

@jeffbolznv I will give feedback in that PR to keep the discussion in one place.
