SYCL: Fixes for building SYCL backend for AMD GPUs #10851

Open · wants to merge 1 commit into master
Conversation

@lhl commented Dec 16, 2024

  • cmake hardcodes an `onemkl` link target that causes a linker failure on the AMD path and needs to be changed
  • docfix: the oneMKL compile needs `-DHIP_TARGETS`, not `-DHIPTARGETS` (see the sketch after this list)
  • docfix: modified the cmake build parameters to pass in more linker flags, otherwise linking fails (might be slight overkill, but without adding `-DCMAKE_SHARED_LINKER_FLAGS` it fails)
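For the second point, a rough sketch of what the oneMKL Interface (oneMath) configure step looks like with the corrected flag; the backend options and the `gfx1100` target below are illustrative assumptions, not lines from this PR:

```bash
# Illustrative oneMKL Interface / oneMath configure for the rocBLAS backend.
# -DHIP_TARGETS (not -DHIPTARGETS) selects the AMD GPU architecture to build for.
cmake -B buildWithrocBLAS \
  -DCMAKE_CXX_COMPILER=icpx \
  -DENABLE_MKLGPU_BACKEND=OFF \
  -DENABLE_MKLCPU_BACKEND=OFF \
  -DENABLE_ROCBLAS_BACKEND=ON \
  -DHIP_TARGETS=gfx1100 \
  -DTARGET_DOMAINS=blas
cmake --build buildWithrocBLAS -j
```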

@github-actions bot added the `documentation`, `ggml`, and `SYCL` labels on Dec 16, 2024
@lhl (Author) commented Dec 16, 2024

Just as an FYI: while I am able to build and successfully run llama-cli for a basic test with the SYCL backend on gfx1100 (Radeon Pro W7900, RDNA3), things don't seem to run perfectly. Tracking post-build issues separately here: #10850
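The basic test was along these lines; the model path and flags here are illustrative rather than a transcript:

```bash
# Minimal smoke test of the SYCL build (example model path)
./build/bin/llama-cli -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 -p "Hello, world"
```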

```diff
@@ -75,7 +75,7 @@ else()
     message(ERROR "Can't enable SYCL hip backend, GGML_SYCL_DEVICE_ARCH has not been set.")
 endif()
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=amdgcn-amd-amdhsa")
-target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemkl)
+target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemath_blas_rocblas)
```
Collaborator commented:
Why is this needed exactly? `onemath_blas_rocblas` is a target used for compile-time dispatching, but the AMD backend uses runtime dispatching (see #10584).
If you did this to take into account the renaming of oneMKL Interface to oneMath, the target would need to become `onemath`, but:

  1. This is a larger change that should be made in a separate PR. Codeplay is planning to work on that, probably early 2025, FYI.
  2. The `onemkl` target should still be available but deprecated; I'm not 100% sure this works as expected, though.

If there are any issues it may be easier to stick to a slightly older version of oneMKL Interface for now: uxlfoundation/oneMath@c00154c
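To illustrate the distinction, a sketch of the two linking modes (assumptions: a Linux build, with `$ONEMATH_DIR` as a placeholder for the oneMath install prefix; not taken from this PR):

```bash
# Runtime dispatching: link one generic library; the BLAS backend
# (rocBLAS, cuBLAS, ...) is selected from the SYCL queue's device at run time.
icpx -fsycl app.cpp -L$ONEMATH_DIR/lib -lonemkl

# Compile-time dispatching: link a backend-specific library directly,
# e.g. the rocBLAS one this diff names.
icpx -fsycl app.cpp -L$ONEMATH_DIR/lib -lonemath_blas_rocblas
```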

@lhl (Author) replied:
When `onemkl` is there, linking throws an error even though the AMD-compiled version of oneMKL is in the `LD_LIBRARY_PATH`. I can't fully explain why the change works (DLL hell is not my strong suit); I can only say that after building the Codeplay oneAPI toolchain linked in the SYCL.md doc, cmake failed complaining about `onemkl` no matter what I did, until I tracked down the linking here and modified it.

Have you gotten this to work without the change? I'm using the 2025.0.0 / rocm-6.1.0-linux version from the Codeplay site.
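For reference, the environment setup looked roughly like this; the paths are examples rather than exact values from my machine:

```bash
# Assumed setup before configuring llama.cpp: oneAPI on PATH, and the
# AMD-built oneMath libraries visible to the dynamic linker. $MKL_ROCBLAS
# matches the variable used in the cmake invocation further down.
source /opt/intel/oneapi/setvars.sh
export MKL_ROCBLAS=/path/to/oneMath/buildWithrocBLAS/lib
export LD_LIBRARY_PATH=$MKL_ROCBLAS:$LD_LIBRARY_PATH
```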

Collaborator replied:

I see, thanks. We can either try to understand this if you paste the error log, or we can try to reproduce the issue on our side once we have time. Most of us are on holiday, so this will take a while; we have not tested on AMD devices recently.
In any case, I'd rather make sure we understand what is happening and what the proper solution is before merging this!

@lhl (Author) replied:

Makes sense. I have a bunch of year-end stuff as well; we can revisit in the new year or when #10584 makes some progress. Not much point in building successfully if it doesn't run anyway. 😅 Happy holidays!


```diff
 # Build LLAMA with rocBLAS acceleration through SYCL

 ## AMD
 # Use FP32, FP16 is not supported
 # Find your GGML_SYCL_DEVICE_ARCH with rocminfo, under the key 'Name:'
 GGML_SYCL_DEVICE_ARCH=gfx90a # Example architecture
-cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
+cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_SHARED_LINKER_FLAGS="-L$MKL_ROCBLAS -L/opt/intel/oneapi/mkl/latest/lib/intel64 -lonemath_blas_rocblas -Wl,--no-as-needed -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core"
```
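For completeness, the build step that follows is the standard llama.cpp one and is unchanged by this diff:

```bash
cmake --build build --config Release -j
```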
@Rbiessy (Collaborator) commented Dec 16, 2024:

Adding `CMAKE_SHARED_LINKER_FLAGS` should not be needed; I'm worried this could hide a deeper issue. The CMake targets should add all the flags needed. Can you describe what the error was without this change and how you set up your environment?

Also note that all the mkl flags are referring to the Intel oneMKL product which runs only on Intel devices. This is different from oneMath (previously known as oneMKL Interface) which can dispatch to other SYCL devices.
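One way to check which of the two actually got linked into a build (an illustrative diagnostic, assuming a Linux build tree):

```bash
# List the MKL/oneMath shared libraries the binary resolved against
ldd build/bin/llama-cli | grep -i -E 'mkl|onemath'
```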

@lhl (Author) replied:

This is another thing that maybe shouldn't be needed and may hint at a bigger problem, but until I started adding the extra linker flags, the build failed. While perhaps not optimal, these were close to the minimum changes required for me to successfully build the SYCL backend for my RDNA3 / gfx1100 GPU; following the instructions and using the code as it is at HEAD failed for me every time.

This is obviously a very niche codepath (and, as I mentioned, even a successful build reveals that the code itself has problems running), so if someone else has built this recently and it works, feel free to close the PR. But I went through the instructions very carefully and I'd expect others to hit the same errors I did.

(Personally, I was just curious about the inference speed: in some recent llama-bench testing I was surprised that tg128 performance on the hipified ROCm backend was actually slower than the Vulkan backend in recent builds, and I wanted to see how the SYCL backend would compare.)

Collaborator replied:

On a side note, how does it compare? I'm happy to hear Vulkan tg performance is good.

@lhl (Author) replied:

So, sadly, after compiling SYCL successfully I end up getting errors with `gemm_batch` when trying to benchmark: #10850

But here are the results for the Vulkan vs ROCm backends. While Vulkan predictably loses on pp512, it manages to beat ROCm on tg128. The W7900 has 864.0 GB/s of theoretical memory bandwidth (MBW), so Vulkan inferences at 49.0% of that while the ROCm backend is at 40.8%. (Just as a point of comparison, my 3090 with the CUDA backend gets 69.1% of MBW for tg128.)
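Those utilization figures are just tg128 throughput times model size over peak bandwidth; a quick way to reproduce the arithmetic (not part of the original benchmark output):

```bash
# tg128 (t/s) x model size (GB) / theoretical memory bandwidth (864.0 GB/s)
python3 -c "print(f'Vulkan: {92.41*4.58/864.0:.1%}  ROCm: {77.00*4.58/864.0:.1%}')"
```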

```
❯ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders..........................Done!
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |         pp512 |     1331.45 ± 101.65 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         92.41 ± 0.13 |

build: 4ddd199f (4334)
```

```
❯ build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Pro W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         pp512 |      2552.75 ± 15.61 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         tg128 |         77.00 ± 0.49 |

build: 4ddd199f (4334)
```
