-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SYCL: Fixes for building SYCL backend for AMD GPUs #10851
base: master
Are you sure you want to change the base?
Conversation
- cmake has a hardcoded `onemkl` that will cause linker failure for AMD path and needs to be changed - docfix: oneMKL compile needs `-DHIP_TARGETS` not `-DHIPTARGETS` - docfix: modified cmake build parameters to pass in more linker flags, otherwise linking will fail (might be slight overkill, but without adding `-DCMAKE_SHARED_LINKER_FLAGS` it fails
Just as an FYI, while I am able to build/compile and successfully run |
@@ -75,7 +75,7 @@ else() | |||
message(ERROR "Can't enable SYCL hip backend, GGML_SYCL_DEVICE_ARCH has not been set.") | |||
endif() | |||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=amdgcn-amd-amdhsa") | |||
target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemkl) | |||
target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemath_blas_rocblas) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed exactly? onemath_blas_rocblas
is a target used for compile-time dispatching but the AMD backend uses the runtime dispatching (see #10584).
If you did this to take into account the renaming of oneMKL Interface to oneMath this would need to become onemath
but:
- This is a larger change that should be made in a separate PR. Codeplay is planning to work on that, probably early 2025 FYI.
- The
onemkl
target should still be available but deprecated, I'm not 100% sure this works as expected though.
If there are any issues it may be easier to stick to a slightly older version of oneMKL Interface for now: uxlfoundation/oneMath@c00154c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the onemkl is there it throws an error despite the AMD compiled version of oneMKL being in the LD_LIBRARY_PATH. I can't explain why it worked necessarily, DLL-hell is not my strong suit, I can only say that after building the Codeplay onAPI linked in the SYCL.md doc, no matter what I did cmake would fail complaining about onemkl until I tracked down the linking here and modified it.
Have you gotten this to work successfully without this? I'm using the 2025.0.0 / rocm-6.1.0-linux version from the codeplay site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. We can either try to understand this if you paste the error log or we can try to reproduce the issue on our side once we have time. Most of us are on holiday so this will take some time. We have not tested on AMD devices recently.
In any case I'd rather make sure we understand what is happening and what is the proper solution before merging this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I have a bunch of year-end stuff as well, we can revisit in the new year or when #10584 makes some progress, not much point building successfully if it doesn't run anyway. 😅 Happy holidays!
|
||
# Build LLAMA with rocBLAS acceleration through SYCL | ||
|
||
## AMD | ||
# Use FP32, FP16 is not supported | ||
# Find your GGML_SYCL_DEVICE_ARCH with rocminfo, under the key 'Name:' | ||
GGML_SYCL_DEVICE_ARCH=gfx90a # Example architecture | ||
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx | ||
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_SHARED_LINKER_FLAGS="-L$MKL_ROCBLAS -L/opt/intel/oneapi/mkl/latest/lib/intel64 -lonemath_blas_rocblas -Wl,--no-as-needed -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding CMAKE_SHARED_LINKER_FLAGS
should not be needed, I'm worried this could hide a deeper issue. The CMake targets should add all the flags needed. Can you describe what was the error without this change and how do you set up your environment?
Also note that all the mkl
flags are referring to the Intel oneMKL product which runs only on Intel devices. This is different from oneMath (previously known as oneMKL Interface) which can dispatch to other SYCL devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another thing where maybe it shouldn't be needed, and maybe hints at some bigger problem, but until I started stuffing in the additional linker flags, it would fail. While maybe not optimal, these were close to the minimum number of changes required for me to successfully build the SYCL backend for my RDNA3 / gfx1100 GPU, and for me, following the instructions/using the code as it is in HEAD 100% failed for me.
This is obviously a very niche codepath (and like I mentioned, even after building reveals that the code itself has problems running), so I if someone else has built this recently and it works, feel free to close out the PR, but I went through the instructions very carefully and I'd expect others to see the same errors I did.
(Personally, I was just curious to see what the inference speed would be after doing some recent llama-bench
testing and being surprised that tg128 perf on the hipified ROCm backend was actually slower than Vulkan backend in recent builds and was curious to see how the SYCL backend would compare.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a side note, how does it compare? I'm happy to hear Vulkan tg performance is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, sadly after compiling SYCL successfully, I end up getting errors with gemm_batch
when trying to benchmark: #10850
But, here's the results for Vulkan vs ROCm backends. While Vulkan predictably loses pp512, it manages to beat ROCm on tg128. The W7900 has 864.0 GB/s of theoretical MBW, so the Vulkan inferences at 49.0% of that while the the ROCm backend is at 40.8%. (Just as a point of comparison, my 3090 with the CUDA backend gets 69.1% of MBW for tg128)
❯ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-
Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders..........................Done!
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 99 | pp512 | 1331.45 ± 101.65 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 99 | tg128 | 92.41 ± 0.13 |
build: 4ddd199f (4334)
❯ build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Pro W7900, compute capability 11.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 2552.75 ± 15.61 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | tg128 | 77.00 ± 0.49 |
build: 4ddd199f (4334)
onemkl
that will cause linker failure for AMD path and needs to be changed-DHIP_TARGETS
not-DHIPTARGETS
-DCMAKE_SHARED_LINKER_FLAGS
it failsMake sure to read the contributing guidelines before submitting a PR