SYCL: Fixes for building SYCL backend for AMD GPUs #10851

Open · wants to merge 1 commit into base: master
12 changes: 7 additions & 5 deletions docs/backend/SYCL.md
@@ -237,7 +237,7 @@ cmake --build buildWithCublas --config Release
git clone https://github.com/oneapi-src/oneMKL
cd oneMKL
# Find your HIPTARGET with rocminfo, under the key 'Name:'
-cmake -B buildWithrocBLAS -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_ROCBLAS_BACKEND=ON -DHIPTARGETS=${HIPTARGET} -DTARGET_DOMAINS=blas
+cmake -B buildWithrocBLAS -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_ROCBLAS_BACKEND=ON -DHIP_TARGETS=${HIPTARGET} -DTARGET_DOMAINS=blas
cmake --build buildWithrocBLAS --config Release
```
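As a side note for readers: the `Name:` lookup mentioned in the comment above can be scripted. A minimal sketch, using a captured `rocminfo` sample (the real command only exists on machines with ROCm installed, where you would pipe `rocminfo` directly):

```shell
# Extract the gfx target from rocminfo output.
# A captured sample is used below; on a real system, replace it with: rocminfo
sample='Agent 2
  Name:                    gfx1100
  Marketing Name:          AMD Radeon Pro W7900'
HIPTARGET=$(printf '%s\n' "$sample" | awk '$1 == "Name:" && $2 ~ /^gfx/ {print $2; exit}')
echo "HIPTARGET=$HIPTARGET"
```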

@@ -330,17 +330,19 @@ cmake --build build --config Release -j -v

```sh
# Export relevant ENV variables
-export LD_LIBRARY_PATH=/path/to/oneMKL/buildWithrocBLAS/lib:$LD_LIBRARY_PATH
-export LIBRARY_PATH=/path/to/oneMKL/buildWithrocBLAS/lib:$LIBRARY_PATH
-export CPLUS_INCLUDE_DIR=/path/to/oneMKL/buildWithrocBLAS/include:$CPLUS_INCLUDE_DIR
+export MKL_ROCBLAS=/path/to/oneMKL/buildWithrocBLAS/lib
+export LD_LIBRARY_PATH=$MKL_ROCBLAS:$LD_LIBRARY_PATH
+export LIBRARY_PATH=$MKL_ROCBLAS:$LIBRARY_PATH
+export CPLUS_INCLUDE_DIR=$MKL_ROCBLAS:$CPLUS_INCLUDE_DIR

# Build LLAMA with rocBLAS acceleration through SYCL

## AMD
# Use FP32, FP16 is not supported
# Find your GGML_SYCL_DEVICE_ARCH with rocminfo, under the key 'Name:'
GGML_SYCL_DEVICE_ARCH=gfx90a # Example architecture
-cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
+cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_TARGET=AMD -DGGML_SYCL_DEVICE_ARCH=${GGML_SYCL_DEVICE_ARCH} -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_SHARED_LINKER_FLAGS="-L$MKL_ROCBLAS -L/opt/intel/oneapi/mkl/latest/lib/intel64 -lonemath_blas_rocblas -Wl,--no-as-needed -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core"
@Rbiessy (Collaborator) commented on Dec 16, 2024:
Adding CMAKE_SHARED_LINKER_FLAGS should not be needed; I'm worried this could hide a deeper issue. The CMake targets should add all the flags needed. Can you describe what the error was without this change, and how you set up your environment?

Also note that all the mkl flags are referring to the Intel oneMKL product which runs only on Intel devices. This is different from oneMath (previously known as oneMKL Interface) which can dispatch to other SYCL devices.

Author:

This is another thing that maybe shouldn't be needed, and maybe hints at some bigger problem, but until I started adding the extra linker flags, the build would fail. While perhaps not optimal, these were close to the minimum changes required for me to successfully build the SYCL backend for my RDNA3 / gfx1100 GPU; following the instructions and using the code as it is at HEAD failed 100% of the time for me.

This is obviously a very niche codepath (and as I mentioned, even a successful build reveals that the code itself has problems running), so if someone else has built this recently and it works, feel free to close out the PR. But I went through the instructions very carefully and I'd expect others to see the same errors I did.

(Personally, I was just curious what the inference speed would be: some recent llama-bench testing surprised me by showing that tg128 performance on the hipified ROCm backend was actually slower than the Vulkan backend in recent builds, and I wanted to see how the SYCL backend would compare.)

Collaborator:

On a side note, how does it compare? I'm happy to hear Vulkan tg performance is good.

Author:

So, sadly after compiling SYCL successfully, I end up getting errors with gemm_batch when trying to benchmark: #10850

But here are the results for the Vulkan vs. ROCm backends. While Vulkan predictably loses on pp512, it manages to beat ROCm on tg128. The W7900 has 864.0 GB/s of theoretical MBW, so Vulkan inferences at 49.0% of that while the ROCm backend is at 40.8%. (As a point of comparison, my 3090 with the CUDA backend gets 69.1% of MBW for tg128.)

```
❯ VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro W7900 (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Compiling shaders..........................Done!
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |         pp512 |     1331.45 ± 101.65 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         92.41 ± 0.13 |

build: 4ddd199f (4334)

❯ build/bin/llama-bench -m /models/gguf/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Pro W7900, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         pp512 |      2552.75 ± 15.61 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | ROCm       |  99 |         tg128 |         77.00 ± 0.49 |

build: 4ddd199f (4334)
```
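The bandwidth-utilization figures quoted in the comment can be reproduced from the tg128 numbers. A rough sketch, assuming one full read of the model weights per generated token and treating the reported 4.58 GiB model size as GB against the W7900's 864 GB/s theoretical peak (a common back-of-the-envelope model, not an exact measurement):

```shell
# tokens/s * model size (GB) / peak bandwidth (GB/s) = fraction of peak used
vulkan_pct=$(awk 'BEGIN { printf "%.1f", 92.41 * 4.58 / 864.0 * 100 }')
rocm_pct=$(awk 'BEGIN { printf "%.1f", 77.00 * 4.58 / 864.0 * 100 }')
echo "Vulkan: ${vulkan_pct}% of peak, ROCm: ${rocm_pct}% of peak"
# prints: Vulkan: 49.0% of peak, ROCm: 40.8% of peak
```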



# build all binary
cmake --build build --config Release -j -v
2 changes: 1 addition & 1 deletion ggml/src/ggml-sycl/CMakeLists.txt
@@ -75,7 +75,7 @@ else()
message(ERROR "Can't enable SYCL hip backend, GGML_SYCL_DEVICE_ARCH has not been set.")
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=amdgcn-amd-amdhsa")
-target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemkl)
+target_link_libraries(ggml-sycl PRIVATE sycl pthread m dl onemath_blas_rocblas)
Collaborator:

Why is this needed exactly? onemath_blas_rocblas is a target used for compile-time dispatching, but the AMD backend uses runtime dispatching (see #10584).
If you did this to take into account the renaming of oneMKL Interface to oneMath, this would need to become onemath, but:

  1. This is a larger change that should be made in a separate PR. Codeplay is planning to work on that, probably early 2025 FYI.
  2. The onemkl target should still be available but deprecated, I'm not 100% sure this works as expected though.

If there are any issues it may be easier to stick to a slightly older version of oneMKL Interface for now: uxlfoundation/oneMath@c00154c

Author:

When the onemkl target is there, it throws an error despite the AMD-compiled version of oneMKL being in the LD_LIBRARY_PATH. I can't necessarily explain why this worked (DLL hell is not my strong suit); I can only say that after building the Codeplay oneAPI toolchain linked in the SYCL.md doc, no matter what I did, cmake would fail complaining about onemkl until I tracked down the linking here and modified it.

Have you gotten this to work successfully without this change? I'm using the 2025.0.0 / rocm-6.1.0-linux version from the Codeplay site.

Collaborator:

I see, thanks. We can either try to understand this if you paste the error log, or we can try to reproduce the issue on our side once we have time. Most of us are on holiday, so this will take some time; we have not tested on AMD devices recently.
In any case, I'd rather make sure we understand what is happening and what the proper solution is before merging this!

Author:

Makes sense. I have a bunch of year-end stuff as well; we can revisit in the new year or when #10584 makes some progress. Not much point building successfully if it doesn't run anyway. 😅 Happy holidays!

endif()

if (GGML_SYCL_DEVICE_ARCH)