Research: Performance differences between Metal (macOS) and Vulkan (Linux) #10982
Hi, welcome. So this is using the Honeykrisp driver, right? What's the state of the shader compiler there? Do you expect it to be generating reasonably optimal code?
The backends are separate and there's no guarantee that things are implemented the same way between Vulkan and Metal. For the benchmarks you're looking at the Vulkan shaders involved are pretty well-tuned so I don't expect that to be the issue at the source level.
Please try running
Most time will be spent in mul_mat_vec_q4_k.comp (for token generation) and mul_mm.comp (for prompt processing). I'm surprised token generation is 3x slower in Vulkan; I suspect an issue with the shader compiler's code generation. Prompt processing in the Metal backend uses simdgroup matrices. You'd need to support cooperative matrix in Vulkan to get access to the equivalent, and I don't think you'll be able to get competitive performance without it: pp512 is maybe 3x faster with coopmat than without on other platforms (it depends on the model, GPU, etc.).
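As a starting point for the coopmat question, here is a minimal sketch (an editor's illustration, not llama.cpp code; it assumes an already-created `instance` and `phys` device and Vulkan headers recent enough to declare VK_KHR_cooperative_matrix) that enumerates the cooperative-matrix shapes a driver advertises:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Enumerate the matrix shapes/types the driver exposes through
// VK_KHR_cooperative_matrix. An empty list (or a missing entry point)
// means shaders cannot use coopmat and must fall back to scalar paths.
void dump_coopmat_shapes(VkInstance instance, VkPhysicalDevice phys) {
    auto get_props = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)
        vkGetInstanceProcAddr(instance,
            "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");
    if (!get_props) {
        std::printf("VK_KHR_cooperative_matrix not available\n");
        return;
    }

    uint32_t count = 0;
    get_props(phys, &count, nullptr);
    std::vector<VkCooperativeMatrixPropertiesKHR> props(
        count, {VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR});
    get_props(phys, &count, props.data());

    for (const auto &p : props) {
        std::printf("MxNxK = %ux%ux%u, A/B/C/result types = %d/%d/%d/%d, scope = %d\n",
                    p.MSize, p.NSize, p.KSize,
                    p.AType, p.BType, p.CType, p.ResultType, p.scope);
    }
}
```

If this reports nothing on Honeykrisp, that alone would explain much of the pp512 gap described above, independent of shader-compiler quality.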
There's a pipeline barrier between almost every dispatch; see ggml_vk_sync_buffers (sketched below). Some dispatches are quite small. IMO the first step would be to compare MUL_MAT Q4_K performance with n==1 (an existing test in test-backend-ops) between Metal and Vulkan. This is the mul_mat_vec_q4_k.comp shader.
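For readers unfamiliar with the pattern being described, the per-dispatch synchronization amounts to roughly the following (an illustrative sketch of a full compute-to-compute barrier, not the actual ggml_vk_sync_buffers implementation):

```cpp
#include <vulkan/vulkan.h>

// A global memory barrier that serializes consecutive compute dispatches:
// the next dispatch cannot start until the previous one's shader writes
// are visible. Emitting this between almost every dispatch is costly when
// the dispatches themselves are small.
void sync_between_dispatches(VkCommandBuffer cmd) {
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // prior dispatch's writes
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // next dispatch's reads

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // source stage
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // destination stage
        0,             // no dependency flags
        1, &barrier,   // one global memory barrier
        0, nullptr,    // no buffer barriers
        0, nullptr);   // no image barriers
}
```

For the suggested comparison, test-backend-ops can benchmark individual ops in isolation (something like `test-backend-ops perf -o MUL_MAT`; exact flags from memory, check the tool's help output).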
@alyssarosenzweig can probably chime in with more specifics. The compiler itself is shared with the GL driver and has several years of development at this point (and uses lots of shared Mesa infrastructure), so it's not particularly new even though Honeykrisp is.
I think I tried a llama.cpp build from a few months ago and it was noticeably slower, which makes me think there's probably still low-hanging fruit on this side (unless major optimizations happened recently and no more are expected). If you think it might be insightful, I can try to bisect the performance improvement to see if it was something interesting or unexpected.
Ah, that's what I was looking for, thanks! I'll do some comparisons with Metal. A priori, one of the copy types reports 277 GB/s (against a theoretical ~400 GB/s). Having single-shader tests like this is very helpful, since we can outright dump the shader assembly (and pipeline config) from macOS and Linux and compare (and even manually bisect differences). Much nicer than testing games... ^^;;
That explains a big factor then: we don't have coopmat wired up yet. I'll look into what it would take to add that.
A lot has happened in the last few months. The Vulkan path is generally within about 10% of the CUDA path for token generation, at least on my system (RTX 4070, using drivers from https://developer.nvidia.com/vulkan-driver). There are some knobs like in #10846 that might help things a bit on Apple hardware.
Depending on the test, 277/400 may be quite good. And the bandwidth-limited shaders don't account for the majority of the time in these models.
Was it known to be significantly slower a few months ago on that hardware? What I'm wondering is whether it's possible some smaller change had an outsized perf impact on our platform, and whether bisecting it could lead us somewhere. The previous version I was using was b3873 (from October 3), and that one gives these numbers:
So unless something major happened to the Vulkan backend that would explain a 2-3x performance improvement in the last 2-3 months, maybe it's worth bisecting that and seeing how it happened?
It's a copy test, so 277 read + 277 write comes to over 500 GB/s, more than the ~400 GB/s of DRAM bandwidth, right? That's why I was guessing the working set is small enough that a significant chunk fits in the cache hierarchy, which would explain it being faster than DRAM bandwidth.
Quite a bit of shader and backend optimization has happened on the Vulkan backend over the last few months, but it's most optimized for Nvidia and AMD hardware. No performance tuning has happened for Apple hardware, and UMA buffer handling can probably be improved. But it's good to hear that performance is increasing significantly.
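On the UMA point: here is a rough sketch of the direct-allocation idea on unified-memory GPUs, where a single memory type can be both device-local and host-visible, so staging copies can be skipped entirely (an editor's illustration with a hypothetical helper name, not the ggml-vulkan code):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// On a UMA GPU (like Apple Silicon), look for a memory type that is
// DEVICE_LOCAL, HOST_VISIBLE, and HOST_COHERENT at once, so tensor data
// can be written directly without a separate staging buffer and copy.
int32_t find_uma_memory_type(VkPhysicalDevice phys, uint32_t type_bits) {
    VkPhysicalDeviceMemoryProperties mem;
    vkGetPhysicalDeviceMemoryProperties(phys, &mem);

    const VkMemoryPropertyFlags want =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

    for (uint32_t i = 0; i < mem.memoryTypeCount; i++) {
        if ((type_bits & (1u << i)) &&
            (mem.memoryTypes[i].propertyFlags & want) == want) {
            return (int32_t)i;  // direct-access path, no staging needed
        }
    }
    return -1;  // no such type: fall back to a staging-copy path
}
```

Whether the backend can actually take such a path on Asahi depends on what memory types Honeykrisp advertises, which would be worth checking.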
I'm one of the developers for the Asahi Linux GPU drivers, which provide accelerated Vulkan and OpenGL support on Apple Silicon platforms. I'm interested in improving the performance of llama.cpp on our drivers with the Vulkan backend.
As things stand today, macOS is significantly faster on a quick test with `llama-bench` at default settings (tested on an M2 Max 64GB):

Linux:

macOS:
(I also tested a larger 70B model, which failed to load because a memory allocation failed on Linux, but that's obviously a separate issue that's easy to debug. It's probably just a hardcoded allocation size limit in the driver that we can raise, since we recently refactored a bunch of stuff to handle >4G buffers properly.)
Of course, we'd like to improve the driver where possible to make things faster. However, since I know nothing about how LLMs are implemented under the hood, or about the state of the llama.cpp Metal and Vulkan backends, I'd like to ask for help figuring out the perf issues and analyzing whether llama.cpp itself could also be part of the root cause.
Would you be able to help us out? I'm curious about these things: