
Research: Performance differences between Metal (macOS) and Vulkan (Linux) #10982

asahilina opened this issue Dec 26, 2024 · 5 comments

@asahilina

I'm one of the developers for the Asahi Linux GPU drivers, which provide accelerated Vulkan and OpenGL support on Apple Silicon platforms. I'm interested in improving the performance of llama.cpp on our drivers with the Vulkan backend.

As things stand today, macOS is significantly faster on a quick test with llama-bench, with default settings (tested on an M2 Max 64GB):

Linux:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M2 Max (G14C B1) (Honeykrisp) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders................................Done!
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         pp512 |         92.16 ± 0.08 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         21.93 ± 0.02 |

build: 9ba399dfa7f1 (4391)

macOS:

./build/bin/llama-bench -m /Volumes/Untitled/mistral-7b-v0.1.Q4_K_M.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS,RPC |       8 |         pp512 |        580.26 ± 8.82 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Metal,BLAS,RPC |       8 |         tg128 |         61.18 ± 0.41 |

build: 9ba399df (4391)

(I also tested a larger 70B model which failed to load due to failing to allocate memory on Linux, but that's obviously a separate issue that's easy to debug. Probably just a hardcoded alloc size limit in the driver we can raise, since we recently refactored a bunch of stuff to handle >4G buffers properly.)

Of course, we'd like to improve the driver where possible to make things faster. However, since I know nothing about how LLMs are implemented under the hood, or about the state of the llama.cpp Metal and Vulkan backends, I would like to ask for help figuring out the perf issues and analyzing whether llama.cpp itself could also be part of the root cause.

Would you be able to help us out? I'm curious about these things:

  • The state of the Metal vs. Vulkan backends, and whether any perf differences could be expected on the same hardware based on that alone (are the shaders and the way the workload is run essentially identical, or are there major differences?).
  • How to get more information on how the work is scheduled on both Metal and Vulkan (block sizes, shared memory allocations, and things like that), so we can identify if there are any differences or choices at the llama.cpp level that could explain perf differences.
  • How to run smaller micro-benchmarks. To work out driver and shader compiler issues, ideally we'd want to narrow it down to single shaders / compute launches, and measure the performance individually.
  • General info on what to expect and where we should dig deeper. Are things usually memory-bandwidth-bound (I understand that's the case for LLMs)? Or is it likely we'll run into ALU-bound shaders? Is there any heavy synchronization involved, or are we mostly dealing with large compute launches that stand alone? Is cache performance critical, and could differences in data layout or processing order matter, if any?
@jeffbolznv
Collaborator

Hi, welcome.

So this is using the Honeykrisp driver, right? What's the state of the shader compiler there? Do you expect it to be generating reasonably optimal code?

> The state of the Metal vs. Vulkan backends, and whether any perf differences could be expected on the same hardware based on that alone (are the shaders and the way the workload is run essentially identical, or are there major differences?).

The backends are separate and there's no guarantee that things are implemented the same way between Vulkan and Metal. For the benchmarks you're looking at, the Vulkan shaders involved are pretty well-tuned, so I don't expect that to be the issue at the source level.

> How to run smaller micro-benchmarks. To work out driver and shader compiler issues, ideally we'd want to narrow it down to single shaders / compute launches, and measure the performance individually.

Please try running test-backend-ops perf. That runs directed tests of some shaders. The MUL_MAT Q4_K tests are the most relevant for the model you're using.
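Something like this should work, assuming the same CMake build tree as the llama-bench runs above (the -o flag filters to a single op):

```sh
./build/bin/test-backend-ops perf -o MUL_MAT
```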

> General info on what to expect and where we should dig deeper. Are things usually memory-bandwidth-bound (I understand that's the case for LLMs)? Or is it likely we'll run into ALU-bound shaders? Is there any heavy synchronization involved, or are we mostly dealing with large compute launches that stand alone? Is cache performance critical, and could differences in data layout or processing order matter, if any?

Most time will be spent in mul_mat_vec_q4_k.comp (for token generation) and mul_mm.comp (for prompt processing). I'm surprised token generation is 3x slower in Vulkan; I suspect an issue with the shader compiler code generation. Prompt processing in the Metal backend uses simdgroup matrices; you'd need to support cooperative matrix in Vulkan to get access to those, and I don't think you'll be able to get competitive performance without it. pp512 is maybe 3x faster with coopmat than without on other platforms (depends on model, GPU, etc.).
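For the coopmat side, a quick way to see what a driver exposes is to enumerate the VK_KHR_cooperative_matrix properties. A minimal C sketch (not llama.cpp code; assumes the instance and physical device already exist with the extension enabled):

```c
// Minimal sketch (not llama.cpp code): enumerate the cooperative matrix
// shapes a driver advertises via VK_KHR_cooperative_matrix. Assumes the
// instance and physical device already exist with the extension enabled;
// error handling omitted for brevity.
#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

void print_coopmat_support(VkInstance instance, VkPhysicalDevice dev) {
    PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR get_props =
        (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)
        vkGetInstanceProcAddr(instance,
            "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");
    if (!get_props) {
        printf("VK_KHR_cooperative_matrix not available\n");
        return;
    }

    uint32_t count = 0;
    get_props(dev, &count, NULL);

    VkCooperativeMatrixPropertiesKHR *props = calloc(count, sizeof(*props));
    for (uint32_t i = 0; i < count; i++)
        props[i].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;
    get_props(dev, &count, props);

    // each entry is one supported MxNxK shape / type combination
    for (uint32_t i = 0; i < count; i++)
        printf("coopmat: M=%u N=%u K=%u scope=%u\n",
               props[i].MSize, props[i].NSize, props[i].KSize,
               (unsigned)props[i].scope);
    free(props);
}
```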

> Is there any heavy synchronization involved, or are we mostly dealing with large compute launches that stand alone?

There's a pipeline barrier between almost every dispatch, see ggml_vk_sync_buffers. Some dispatches are quite small.
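Conceptually it's a compute-to-compute memory barrier between dispatches, something like this C sketch (illustrative only, not the actual ggml_vk_sync_buffers implementation):

```c
// Illustrative only (not the actual ggml_vk_sync_buffers code): a
// compute-to-compute memory barrier recorded between two dispatches,
// so the second dispatch sees the first one's buffer writes.
#include <vulkan/vulkan.h>

void sync_between_dispatches(VkCommandBuffer cmd) {
    VkMemoryBarrier barrier = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT,
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // wait for the prior dispatch
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // before the next one starts
        0,                                     // dependency flags
        1, &barrier,                           // one global memory barrier
        0, NULL, 0, NULL);                     // no buffer/image barriers
}
```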

IMO first step would be to compare MUL_MAT Q4_K performance with n==1 (an existing test in test-backend-ops) between Metal and Vulkan. This is the mul_mat_vec_q4_k.comp shader.

@asahilina
Author

asahilina commented Dec 26, 2024

> So this is using the Honeykrisp driver, right? What's the state of the shader compiler there? Do you expect it to be generating reasonably optimal code?

@alyssarosenzweig can probably chime in with more specifics. The compiler itself is shared with the GL driver and has several years of development at this point (and uses lots of shared Mesa infrastructure), so it's not particularly new even though Honeykrisp is.

> For the benchmarks you're looking at, the Vulkan shaders involved are pretty well-tuned, so I don't expect that to be the issue at the source level.

I think I tried a llama.cpp build from a few months ago and it was noticeably slower, which makes me think there's probably still low-hanging fruit on this side? (unless major optimizations happened recently and no more are expected). If you think it might be insightful, I can try to bisect the performance improvement to see if it was something interesting / unexpected.

> Please try running test-backend-ops perf.

Ah, that's what I was looking for, thanks! I'll do some comparisons with Metal. A priori, one of the copy types reports 277.44 GB/s, which tells me we don't have some kind of pathological system-level memory bandwidth bottleneck (M2 Max memory BW is 400 GB/s, and I assume the CPY tests do read+write, with the cache probably helping out here?). That's something I wanted to verify before worrying about shaders, since it wouldn't be unthinkable that we have some issue with fabric/memory perf states, but that doesn't seem to be the case (or at least nothing major).

Having single-shader tests like this is very helpful, since we can outright dump the shader assembly (and pipeline config) from macOS and Linux and compare (and even manually bisect differences). Much nicer than testing games... ^^;;

> pp512 is maybe 3x faster with coopmat than without on other platforms

That explains a big factor then, we don't have coopmat wired up yet. I'll look into what it would take to add that.

@jeffbolznv
Collaborator

> I think I tried a llama.cpp build from a few months ago and it was noticeably slower, which makes me think there's probably still low-hanging fruit on this side? (unless major optimizations happened recently and no more are expected).

A lot has happened in the last few months. The Vulkan path is generally within about 10% of the CUDA path for token generation, at least on my system (RTX 4070, using drivers from https://developer.nvidia.com/vulkan-driver). There are some knobs, like those in #10846, that might help things a bit on Apple hardware.

> Ah, that's what I was looking for, thanks! I'll do some comparisons with Metal. A priori, one of the copy types reports 277.44 GB/s, which tells me we don't have some kind of pathological system-level memory bandwidth bottleneck (M2 Max memory BW is 400 GB/s, and I assume the CPY tests do read+write, with the cache probably helping out here?).

Depending on the test, 277/400 may be quite good. And the bandwidth-limited shaders aren't the majority of the time in these models.

@asahilina
Author

> A lot has happened in the last few months. The Vulkan path is generally within about 10% of the CUDA path for token generation, at least on my system

Was it known to be significantly slower a few months ago on that hardware? What I'm wondering is whether it's possible some smaller change had an outsized perf impact on our platform, and whether bisecting it could lead us somewhere.

The previous version I was using was b3873 (from October 3), and that one gives these numbers:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         pp512 |         34.99 ± 0.01 |
| llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | Vulkan     |  99 |         tg128 |         11.20 ± 0.03 |

So unless something major happened to the Vulkan backend that would explain a 2-3x performance improvement in the last 2-3 months, maybe it's worth bisecting that and seeing how it happened?
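If that sounds useful, I could drive it with git bisect's custom terms, something like this (assuming release tags matching the build numbers above):

```sh
git bisect start --term-old=slow --term-new=fast
git bisect slow b3873   # Oct 3 build: tg128 ~11 t/s
git bisect fast b4391   # current build: tg128 ~22 t/s
# at each step: rebuild, run llama-bench, then mark
#   git bisect slow    or    git bisect fast
```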

> Depending on the test, 277/400 may be quite good. And the bandwidth-limited shaders aren't the majority of the time in these models.

It's a copy test, so that's 277 GB/s read + 277 GB/s write ≈ 554 GB/s, more than the 400 GB/s of DRAM bandwidth, right? Which I was guessing means the working set is small enough that a significant chunk of it fits in the cache hierarchy, which is why it's faster than DRAM bandwidth.

@0cc4m
Collaborator

0cc4m commented Dec 27, 2024

Quite a bit of shader and backend optimization has happened on the Vulkan backend over the last few months, but it's most optimized for Nvidia and AMD hardware. No performance tuning has happened for Apple hardware yet, and UMA buffer handling can probably be improved. But it's good to hear that performance is increasing significantly.
