SYCL: Fixes for building SYCL backend for AMD GPUs #10851

lhl · 2024-12-16T07:52:54Z

cmake has a hardcoded onemkl that will cause linker failure for AMD path and needs to be changed
docfix: oneMKL compile needs -DHIP_TARGETS not -DHIPTARGETS
docfix: modified cmake build parameters to pass in more linker flags, otherwise linking will fail (might be slight overkill, but without adding -DCMAKE_SHARED_LINKER_FLAGS it fails

Make sure to read the contributing guidelines before submitting a PR

- cmake has a hardcoded `onemkl` that will cause linker failure for AMD path and needs to be changed - docfix: oneMKL compile needs `-DHIP_TARGETS` not `-DHIPTARGETS` - docfix: modified cmake build parameters to pass in more linker flags, otherwise linking will fail (might be slight overkill, but without adding `-DCMAKE_SHARED_LINKER_FLAGS` it fails

lhl · 2024-12-16T08:11:56Z

Just as an FYI, while I am able to build/compile and successfully run llama-cli for a basic test w/ the SYCL backend on gfx1100 (Radeon Pro W7900 , RDNA3), things don't seem to run perfectly. Tracking post-build issues separately here: #10850

Rbiessy · 2024-12-16T15:21:42Z

ggml/src/ggml-sycl/CMakeLists.txt

@@ -75,7 +75,7 @@ else()
            message(ERROR "Can't enable SYCL hip backend, GGML_SYCL_DEVICE_ARCH has not been set.")
        endif()
        set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=amdgcn-amd-amdhsa")
-        target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemkl)
+        target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemath_blas_rocblas)


Why is this needed exactly? onemath_blas_rocblas is a target used for compile-time dispatching but the AMD backend uses the runtime dispatching (see #10584).
If you did this to take into account the renaming of oneMKL Interface to oneMath this would need to become onemath but:

This is a larger change that should be made in a separate PR. Codeplay is planning to work on that, probably early 2025 FYI.

The onemkl target should still be available but deprecated, I'm not 100% sure this works as expected though.

If there are any issues it may be easier to stick to a slightly older version of oneMKL Interface for now: uxlfoundation/oneMath@c00154c

When the onemkl is there it throws an error despite the AMD compiled version of oneMKL being in the LD_LIBRARY_PATH. I can't explain why it worked necessarily, DLL-hell is not my strong suit, I can only say that after building the Codeplay onAPI linked in the SYCL.md doc, no matter what I did cmake would fail complaining about onemkl until I tracked down the linking here and modified it.

Have you gotten this to work successfully without this? I'm using the 2025.0.0 / rocm-6.1.0-linux version from the codeplay site.

I see, thanks. We can either try to understand this if you paste the error log or we can try to reproduce the issue on our side once we have time. Most of us are on holiday so this will take some time. We have not tested on AMD devices recently.
In any case I'd rather make sure we understand what is happening and what is the proper solution before merging this!

Makes sense, I have a bunch of year-end stuff as well, we can revisit in the new year or when #10584 makes some progress, not much point building successfully if it doesn't run anyway. 😅 Happy holidays!

Rbiessy · 2024-12-16T15:23:54Z

docs/backend/SYCL.md


 # Build LLAMA with rocBLAS acceleration through SYCL

 ## AMD
 # Use FP32, FP16 is not supported
 # Find your GGML_SYCL_DEVICE_ARCH with rocminfo, under the key 'Name:'
 GGML_SYCL_DEVICE_ARCH=gfx90a # Example architecture
-cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
+cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_SHARED_LINKER_FLAGS="-L$MKL_ROCBLAS -L/opt/intel/oneapi/mkl/latest/lib/intel64 -lonemath_blas_rocblas -Wl,--no-as-needed -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core"


Adding CMAKE_SHARED_LINKER_FLAGS should not be needed, I'm worried this could hide a deeper issue. The CMake targets should add all the flags needed. Can you describe what was the error without this change and how do you set up your environment?

Also note that all the mkl flags are referring to the Intel oneMKL product which runs only on Intel devices. This is different from oneMath (previously known as oneMKL Interface) which can dispatch to other SYCL devices.

This is another thing where maybe it shouldn't be needed, and maybe hints at some bigger problem, but until I started stuffing in the additional linker flags, it would fail. While maybe not optimal, these were close to the minimum number of changes required for me to successfully build the SYCL backend for my RDNA3 / gfx1100 GPU, and for me, following the instructions/using the code as it is in HEAD 100% failed for me.

This is obviously a very niche codepath (and like I mentioned, even after building reveals that the code itself has problems running), so I if someone else has built this recently and it works, feel free to close out the PR, but I went through the instructions very carefully and I'd expect others to see the same errors I did.

(Personally, I was just curious to see what the inference speed would be after doing some recent llama-bench testing and being surprised that tg128 perf on the hipified ROCm backend was actually slower than Vulkan backend in recent builds and was curious to see how the SYCL backend would compare.)

On a side note, how does it compare? I'm happy to hear Vulkan tg performance is good.

So, sadly after compiling SYCL successfully, I end up getting errors with gemm_batch when trying to benchmark: #10850

But, here's the results for Vulkan vs ROCm backends. While Vulkan predictably loses pp512, it manages to beat ROCm on tg128. The W7900 has 864.0 GB/s of theoretical MBW, so the Vulkan inferences at 49.0% of that while the the ROCm backend is at 40.8%. (Just as a point of comparison, my 3090 with the CUDA backend gets 69.1% of MBW for tg128)

❯ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct- Q4_K_M.gguf ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = AMD Radeon Pro W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: | ggml_vulkan: Compiling shaders..........................Done! | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 99 | pp512 | 1331.45 ± 101.65 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 99 | tg128 | 92.41 ± 0.13 | build: 4ddd199f (4334) ❯ build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon Pro W7900, compute capability 11.0, VMM: no | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | pp512 | 2552.75 ± 15.61 | | llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | ROCm | 99 | tg128 | 77.00 ± 0.49 | build: 4ddd199f (4334)

github-actions bot added documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Dec 16, 2024

Rbiessy requested changes Dec 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SYCL: Fixes for building SYCL backend for AMD GPUs #10851

SYCL: Fixes for building SYCL backend for AMD GPUs #10851

lhl commented Dec 16, 2024

lhl commented Dec 16, 2024

Rbiessy Dec 16, 2024

lhl Dec 17, 2024

Rbiessy Dec 17, 2024

lhl Dec 18, 2024

Rbiessy Dec 16, 2024 •

edited

Loading

lhl Dec 17, 2024

0cc4m Dec 18, 2024

lhl Dec 18, 2024

SYCL: Fixes for building SYCL backend for AMD GPUs #10851

Are you sure you want to change the base?

SYCL: Fixes for building SYCL backend for AMD GPUs #10851

Conversation

lhl commented Dec 16, 2024

lhl commented Dec 16, 2024

Rbiessy Dec 16, 2024

Choose a reason for hiding this comment

lhl Dec 17, 2024

Choose a reason for hiding this comment

Rbiessy Dec 17, 2024

Choose a reason for hiding this comment

lhl Dec 18, 2024

Choose a reason for hiding this comment

Rbiessy Dec 16, 2024 • edited Loading

Choose a reason for hiding this comment

lhl Dec 17, 2024

Choose a reason for hiding this comment

0cc4m Dec 18, 2024

Choose a reason for hiding this comment

lhl Dec 18, 2024

Choose a reason for hiding this comment

Rbiessy Dec 16, 2024 •

edited

Loading