build failure unable to find library -lhsakmt #181

Open
cb88 opened this issue Dec 17, 2024 · 13 comments
@cb88

cb88 commented Dec 17, 2024

ld.lld: error: unable to find library -lhsakmt
make[2]: Leaving directory '/home/user/rocm_sdk_builder/builddir/016_03_llvm_project_openmp'
[ 51%] Built target Utils.cpp-gfx906.bc
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

This is after running ./babs.sh -up, ./babs.sh --clean, and ./babs.sh -b.

git rev 84faa05. I was attempting to test on my MI60 but haven't been able to get a clean build on Arch Linux.

@lamikr
Owner

lamikr commented Dec 18, 2024

Have you earlier been able to build the "016_03_llvm_project_openmp" project? I know that some people have used Arch Linux earlier. Do you have multiple versions of the library? What do you see if you run:

cd /opt
find -name libhsakmt.so

I have

./rocm_sdk_612/lib64/libhsakmt.so
./rocm_sdk_612/lib/libhsakmt.so

All libhsakmt* files in the lib directory are symlinks into lib64.

ls -la /opt/rocm_sdk_612/lib/libhsakmt.*
lrwxrwxrwx 1 lamikr lamikr 35 Nov 12 00:13 /opt/rocm_sdk_612/lib/libhsakmt.a -> /opt/rocm_sdk_612/lib64/libhsakmt.a
lrwxrwxrwx 1 lamikr lamikr 36 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so -> /opt/rocm_sdk_612/lib/libhsakmt.so.1*
lrwxrwxrwx 1 lamikr lamikr 40 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so.1 -> /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6*
lrwxrwxrwx 1 lamikr lamikr 42 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6 -> /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6*
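If the layout differs, a quick way to spot a dangling link in that chain is to resolve each entry (a small helper of my own, not part of babs.sh; the default path is the /opt/rocm_sdk_612 layout shown above):

```shell
# List each libhsakmt.* entry in a lib directory and flag dangling symlinks.
check_hsakmt_links() {
  dir="${1:-/opt/rocm_sdk_612/lib}"   # default path from the listing above
  for f in "$dir"/libhsakmt.*; do
    if [ -e "$f" ]; then
      # Symlink (or file) resolves to a real target.
      echo "OK $f -> $(readlink -f "$f")"
    else
      # Dangling symlink or unmatched glob: the linker would not find this.
      echo "BROKEN $f"
    fi
  done
}
```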

@lamikr
Owner

lamikr commented Dec 18, 2024

And let's check that all ldd dependencies are found. What does this show:

ldd /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6
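To surface only the problems, you can filter the ldd output for unresolved entries (a helper name of my own, not part of the SDK; it prints nothing when every dependency resolves):

```shell
# Print only the shared-library dependencies the dynamic linker cannot find.
missing_deps() {
  # "|| true" because grep finding no matches is the good case here.
  ldd "$1" | grep "not found" || true
}
# e.g.: missing_deps /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6
```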

@cb88
Author

cb88 commented Dec 18, 2024

After the recent posts in the other ticket it's building and appears to be progressing further... I will update here with the results once it completes (or not).

@lamikr
Owner

lamikr commented Dec 18, 2024

Thanks for letting me know; it would be nice to know what caused that break. So you have a Radeon VII to test the gfx906?

@cb88
Author

cb88 commented Dec 18, 2024

I have 2x MI60 (or 32 GB MI50, whichever it really is). The build stopped a while ago, and I reran ./babs.sh -b; it failed here:

adding 'torchvision-0.20.0a0+324eea9.dist-info/LICENSE'
adding 'torchvision-0.20.0a0+324eea9.dist-info/METADATA'
adding 'torchvision-0.20.0a0+324eea9.dist-info/WHEEL'
adding 'torchvision-0.20.0a0+324eea9.dist-info/top_level.txt'
adding 'torchvision-0.20.0a0+324eea9.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
corrupted size vs. prev_size in fastbins
./build_rocm.sh: line 18: 3102616 Aborted (core dumped) ROCM_PATH=${install_dir_prefix_rocm} FORCE_CUDA=1 TORCHVISION_USE_NVJPEG=0 TORCHVISION_USE_VIDEO_CODEC=0 CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} python setup.py bdist_wheel
build failed: pytorch_vision
error in build cmd: ./build_rocm.sh /opt/rocm_sdk_612
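"corrupted size vs. prev_size in fastbins" is glibc's malloc detecting heap corruption, which usually happened some time before the abort. One generic way to make the abort land closer to the real bug is to re-run the failing step with glibc's heap-tampering checks enabled (the wrapper name is mine; MALLOC_CHECK_ and MALLOC_PERTURB_ are standard glibc tunables):

```shell
# Run a command with glibc heap checking and allocation poisoning enabled,
# so heap corruption aborts nearer the write that caused it.
with_heap_checks() {
  MALLOC_CHECK_=3 MALLOC_PERTURB_=153 "$@"
}
# e.g.: with_heap_checks python setup.py bdist_wheel
```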

@lamikr
Owner

lamikr commented Dec 18, 2024

Hmm... not really sure what is going on. In theory you should now be able to run some pytorch benchmark tests, as the build has passed pytorch and is now trying to build pytorch vision.

So are you able to test with

source /opt/rocm_sdk_612/bin/env_rocm.sh
cd /opt/rocm_sdk_612/benchmarks
./run_and_save_benchmarks.sh

If you are on the master branch, can you run these commands one more time to verify everything is up to date, and then restart the pytorch vision build from clean:

./babs.sh -up
./babs.sh -ca
./babs.sh --clean binfo/core/039_03_pytorch_vision.binfo
./babs.sh -b

I have myself started a clean build on Fedora 40 with gfx906 as the only target, but I need to wait until morning to see the results.

@cb88
Author

cb88 commented Dec 18, 2024

[cb88@M31-AR0 ~]$ cat /opt/rocm_sdk_612/benchmarks/bench.txt
Timestamp for benchmark results: 20241218_133446
Saving to file: 20241218_133446_cpu_vs_gpu_simple.txt
Benchmarking CPU and GPUs
Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-de7055040
Device: AMD EPYC 7352 24-Core Processor
CPU time: 35.332 sec
Device: AMD Radeon Graphics
GPU time: 0.604 sec
Benchmark ready

Saving to file: 20241218_133446_pytorch_dot_products.txt
Pytorch version: 2.4.1
dot product calculation test
tensor([[[ 0.8124, 0.2179, -0.4919, -0.4980, -0.6716, 1.2153, -0.0119,
-0.9560],
[-0.7172, 0.4881, 0.9783, -0.3172, -0.0765, 1.5946, -0.1057,
0.1876],
[ 0.8850, 0.3325, -0.6169, -0.5590, -0.7152, 1.3886, -0.0615,
-1.1245]],

    [[ 0.2982, -0.1511,  0.2687, -0.8882,  0.1656,  0.1409, -1.0829,
       0.6578],
     [-0.2719,  0.9328, -0.8428, -0.5765, -0.2355,  0.1816, -0.3346,
      -0.5164],
     [ 0.8432,  0.4674, -0.1435,  0.2439, -0.3148,  1.1532, -0.3879,
      -0.1294]]], device='cuda:0')

Benchmarking cuda and cpu with Default, Math, Flash Attention amd Memory pytorch backends
Device: AMD Radeon Graphics / cuda:0
Default benchmark:
3205.060 microseconds, 0.0032050598703790454 sec
SDPBackend.MATH benchmark:
3212.746 microseconds, 0.0032127462700009346 sec
SDPBackend.FLASH_ATTENTION benchmark:
SDPBackend.FLASH_ATTENTION cuda:0 is not supported. See warnings for reasons.
SDPBackend.EFFICIENT_ATTENTION benchmark:
SDPBackend.EFFICIENT_ATTENTION cuda:0 is not supported. See warnings for reasons.
Device: AMD EPYC 7352 24-Core Processor / cpu
Default benchmark:
3844997.412 microseconds, 3.844997411943041 sec
SDPBackend.MATH benchmark:
3642490.409 microseconds, 3.6424904089653865 sec
SDPBackend.FLASH_ATTENTION benchmark:
3828689.283 microseconds, 3.8286892829928547 sec
SDPBackend.EFFICIENT_ATTENTION benchmark:
SDPBackend.EFFICIENT_ATTENTION cpu is not supported. See warnings for reasons.
Summary

Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-de7055040
CPU: AMD EPYC 7352 24-Core Processor
Problem parameters:
Sequence-length: 512
Batch-size: 32
Heads: 16
Embed_dimension: 16
Datatype: torch.float16
Device: AMD Radeon Graphics / cuda:0
Default: 3205.060 ms
SDPBackend.MATH: 3212.746 ms
SDPBackend.FLASH_ATTENTION: -1.000 ms
SDPBackend.EFFICIENT_ATTENTION: -1.000 ms

Device: AMD EPYC 7352 24-Core Processor / cpu
Default: 3844997.412 ms
SDPBackend.MATH: 3642490.409 ms
SDPBackend.FLASH_ATTENTION: 3828689.283 ms
SDPBackend.EFFICIENT_ATTENTION: -1.000 ms

@cb88
Author

cb88 commented Dec 18, 2024

[cb88@M31-AR0 opt]$ find -name libhsakmt.so
./rocm_sdk_612/lib64/libhsakmt.so
./rocm_sdk_612/lib/libhsakmt.so
./rocm/lib/libhsakmt.so

/opt/rocm is the binary install from Arch.

[cb88@M31-AR0 opt]$ ls -la /opt/rocm_sdk_612/lib/libhsakmt.*
lrwxrwxrwx 1 cb88 cb88 35 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.a -> /opt/rocm_sdk_612/lib64/libhsakmt.a
lrwxrwxrwx 1 cb88 cb88 36 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so -> /opt/rocm_sdk_612/lib/libhsakmt.so.1
lrwxrwxrwx 1 cb88 cb88 40 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so.1 -> /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6
lrwxrwxrwx 1 cb88 cb88 42 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6 -> /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6

@lamikr
Owner

lamikr commented Dec 18, 2024

So it seems that the original problem with the missing symbol in rocBLAS is now solved for you as well, and pytorch is able to use rocBLAS with the MATH backend. Llama.cpp, which was also failing earlier for Said-akbar, should probably now work too if you build it with

./babs.sh -ca binfo/extra/ai_tools.blist
./babs.sh -b binfo/extra/ai_tools.blist

and then run

cd /opt/rocm_sdk_612/docs/examples/llm/llama_cpp/
./run_llama_benchmark.sh

There is still this second problem with flash-attention that needs to be solved. And at the moment I have no idea why the pytorch vision build fails for you.

@cb88
Author

cb88 commented Dec 18, 2024

./babs.sh -b binfo/extra/ai_tools.blist ran for a bit, then...

-- Found Python: /opt/rocm_sdk_612/bin/python (found version "3.11.9") found components: Interpreter Development.Module Development.SABIModule
-- Found python matching: /opt/rocm_sdk_612/bin/python.
CMake Error at cmake/utils.cmake:37 (message):
Failed to locate torch path: corrupted size vs. prev_size in fastbins

Call Stack (most recent call first):
cmake/utils.cmake:45 (run_python)
CMakeLists.txt:70 (append_cmake_prefix_path)

-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 483, in <module>
setup(
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
self.run_command(cmd)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/wheel/_bdist_wheel.py", line 387, in run
self.run_command("build")
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 235, in run
super().run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 197, in build_extensions
self.configure(ext)
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 177, in configure
subprocess.check_call(
File "/opt/rocm_sdk_612/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/home/cb88/rocm_sdk_builder/src_projects/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=rocm', '-DVLLM_PYTHON_EXECUTABLE=/opt/rocm_sdk_612/bin/python', '-DVLLM_PYTHON_PATH=/home/cb88/rocm_sdk_builder/src_projects/vllm:/opt/rocm_sdk_612/lib/python311.zip:/opt/rocm_sdk_612/lib/python3.11:/opt/rocm_sdk_612/lib/python3.11/lib-dynload:/opt/rocm_sdk_612/lib/python3.11/site-packages', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=8']' returned non-zero exit status 1.
corrupted size vs. prev_size in fastbins
./build_rocm.sh: line 22: 3139439 Aborted (core dumped) python setup.py bdist_wheel
build failed: vllm
error in build cmd: ./build_rocm.sh /opt/rocm_sdk_612 gfx906

@lamikr
Owner

lamikr commented Dec 18, 2024

Hmm, the vllm build, which runs before the llama.cpp build, seems to fail with the same type of error ("corrupted size vs. prev_size in fastbins") as pytorch vision.

How about if you just build llama.cpp:

./babs.sh -b binfo/extra/llama_cpp.binfo

@cb88
Author

cb88 commented Dec 18, 2024

Llama.cpp builds and runs successfully.

@cb88
Author

cb88 commented Dec 18, 2024

Something is still not right with it though...

llama_kv_cache_init: ROCm0 KV buffer size = 4000.00 MiB
ggml_cuda_host_malloc: failed to allocate 156000.00 MiB of pinned memory: out of memory
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 163577856032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: failed to create context with model
main: error: unable to load model

It didn't matter if I passed -ngl 1 or 999; it still tried to allocate a huge buffer and failed. This is a small model that koboldcpp can load fully in VRAM.
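For what it's worth, the failed 163577856032-byte allocation is exactly the 156000 MiB reported on the pinned-memory line, give or take a 32-byte header, which is easy to check. Since llama.cpp sizes the KV cache from the context length, a request that large suggests the context ended up far bigger than intended, though that is only a guess:

```shell
# 156000 MiB expressed in bytes matches the failed buffer size from the log
# (163577856032) minus a small 32-byte overhead.
echo $(( 156000 * 1024 * 1024 ))   # 163577856000
```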
