build failure unable to find library -lhsakmt #181

Open
cb88 opened this issue Dec 17, 2024 · 13 comments
@cb88

cb88 commented Dec 17, 2024

ld.lld: error: unable to find library -lhsakmt
make[2]: Leaving directory '/home/user/rocm_sdk_builder/builddir/016_03_llvm_project_openmp'
[ 51%] Built target Utils.cpp-gfx906.bc
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

This is after running ./babs.sh -up, ./babs.sh --clean, and ./babs.sh -b.

git rev 84faa05. I was attempting to test on my MI60 but haven't been able to get a clean build on Arch Linux.

@lamikr
Owner

lamikr commented Dec 18, 2024

Have you earlier been able to build the "016_03_llvm_project_openmp" project? I know that some people have used Arch Linux earlier. Do you have multiple versions of the library? What do you see if you run:

cd /opt
find -name libhsakmt.so

I have

./rocm_sdk_612/lib64/libhsakmt.so
./rocm_sdk_612/lib/libhsakmt.so

All libhsakmt* files in the lib directory are symlinks into lib64.

ls -la /opt/rocm_sdk_612/lib/libhsakmt.*
lrwxrwxrwx 1 lamikr lamikr 35 Nov 12 00:13 /opt/rocm_sdk_612/lib/libhsakmt.a -> /opt/rocm_sdk_612/lib64/libhsakmt.a
lrwxrwxrwx 1 lamikr lamikr 36 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so -> /opt/rocm_sdk_612/lib/libhsakmt.so.1*
lrwxrwxrwx 1 lamikr lamikr 40 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so.1 -> /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6*
lrwxrwxrwx 1 lamikr lamikr 42 Nov 12 00:12 /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6 -> /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6*
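If the layout differs, a quick way to spot a dangling link in that chain is to resolve each entry (a small helper of my own, not part of babs.sh; the default path is the /opt/rocm_sdk_612 layout shown above):

```shell
# List each libhsakmt.* entry in a lib directory and flag dangling symlinks.
check_hsakmt_links() {
  dir="${1:-/opt/rocm_sdk_612/lib}"   # default path from the listing above
  for f in "$dir"/libhsakmt.*; do
    if [ -e "$f" ]; then
      # Symlink (or file) resolves to a real target.
      echo "OK $f -> $(readlink -f "$f")"
    else
      # Dangling symlink or unmatched glob: the linker would not find this.
      echo "BROKEN $f"
    fi
  done
}
```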

@lamikr
Owner

lamikr commented Dec 18, 2024

And let's check that all ldd dependencies are found. What does this show:

ldd /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6
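To surface only the problems, you can filter the ldd output for unresolved entries (a helper name of my own, not part of the SDK; it prints nothing when every dependency resolves):

```shell
# Print only the shared-library dependencies the dynamic linker cannot find.
missing_deps() {
  # "|| true" because grep finding no matches is the good case here.
  ldd "$1" | grep "not found" || true
}
# e.g.: missing_deps /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6
```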

@cb88
Author

cb88 commented Dec 18, 2024

After the recent posts in the other ticket it's building and appears to be progressing further... I will update here with the results once it completes (or not).

@lamikr
Owner

lamikr commented Dec 18, 2024

Thanks for letting me know; it would be nice to know what caused that break. So you have a Radeon VII to test the gfx906?

@cb88
Author

cb88 commented Dec 18, 2024

I have 2x MI60 (or 32 GB MI50, whichever it really is). The build stopped a while ago, and I reran ./babs.sh -b; it failed here:

adding 'torchvision-0.20.0a0+324eea9.dist-info/LICENSE'
adding 'torchvision-0.20.0a0+324eea9.dist-info/METADATA'
adding 'torchvision-0.20.0a0+324eea9.dist-info/WHEEL'
adding 'torchvision-0.20.0a0+324eea9.dist-info/top_level.txt'
adding 'torchvision-0.20.0a0+324eea9.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
corrupted size vs. prev_size in fastbins
./build_rocm.sh: line 18: 3102616 Aborted (core dumped) ROCM_PATH=${install_dir_prefix_rocm} FORCE_CUDA=1 TORCHVISION_USE_NVJPEG=0 TORCHVISION_USE_VIDEO_CODEC=0 CC=${CMAKE_C_COMPILER} CXX=${CMAKE_CXX_COMPILER} python setup.py bdist_wheel
build failed: pytorch_vision
error in build cmd: ./build_rocm.sh /opt/rocm_sdk_612
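"corrupted size vs. prev_size in fastbins" is glibc's malloc detecting heap corruption, which usually happened some time before the abort. One generic way to make the abort land closer to the real bug is to re-run the failing step with glibc's heap-tampering checks enabled (the wrapper name is mine; MALLOC_CHECK_ and MALLOC_PERTURB_ are standard glibc tunables):

```shell
# Run a command with glibc heap checking and allocation poisoning enabled,
# so heap corruption aborts nearer the write that caused it.
with_heap_checks() {
  MALLOC_CHECK_=3 MALLOC_PERTURB_=153 "$@"
}
# e.g.: with_heap_checks python setup.py bdist_wheel
```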

@lamikr
Owner

lamikr commented Dec 18, 2024

Hmm... not really sure what is going on. In theory you should now be able to run some pytorch benchmark tests, as the build has passed pytorch and is now trying to build pytorch vision.

So are you able to test with

source /opt/rocm_sdk_612/bin/env_rocm.sh
cd /opt/rocm_sdk_612/benchmarks
./run_and_save_benchmarks.sh

If you are on the master branch, can you run these commands one more time to verify everything is up to date, and then restart the pytorch vision build from clean:

./babs.sh -up
./babs.sh -ca
./babs.sh --clean binfo/core/039_03_pytorch_vision.binfo
./babs.sh -b

I have myself started a clean build on Fedora 40 with gfx906 as the only target, but I need to wait until morning to see the results.

@cb88
Author

cb88 commented Dec 18, 2024

[cb88@M31-AR0 ~]$ cat /opt/rocm_sdk_612/benchmarks/bench.txt
Timestamp for benchmark results: 20241218_133446
Saving to file: 20241218_133446_cpu_vs_gpu_simple.txt
Benchmarking CPU and GPUs
Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-de7055040
Device: AMD EPYC 7352 24-Core Processor
CPU time: 35.332 sec
Device: AMD Radeon Graphics
GPU time: 0.604 sec
Benchmark ready

Saving to file: 20241218_133446_pytorch_dot_products.txt
Pytorch version: 2.4.1
dot product calculation test
tensor([[[ 0.8124, 0.2179, -0.4919, -0.4980, -0.6716, 1.2153, -0.0119,
-0.9560],
[-0.7172, 0.4881, 0.9783, -0.3172, -0.0765, 1.5946, -0.1057,
0.1876],
[ 0.8850, 0.3325, -0.6169, -0.5590, -0.7152, 1.3886, -0.0615,
-1.1245]],

    [[ 0.2982, -0.1511,  0.2687, -0.8882,  0.1656,  0.1409, -1.0829,
       0.6578],
     [-0.2719,  0.9328, -0.8428, -0.5765, -0.2355,  0.1816, -0.3346,
      -0.5164],
     [ 0.8432,  0.4674, -0.1435,  0.2439, -0.3148,  1.1532, -0.3879,
      -0.1294]]], device='cuda:0')

Benchmarking cuda and cpu with Default, Math, Flash Attention amd Memory pytorch backends
Device: AMD Radeon Graphics / cuda:0
Default benchmark:
3205.060 microseconds, 0.0032050598703790454 sec
SDPBackend.MATH benchmark:
3212.746 microseconds, 0.0032127462700009346 sec
SDPBackend.FLASH_ATTENTION benchmark:
SDPBackend.FLASH_ATTENTION cuda:0 is not supported. See warnings for reasons.
SDPBackend.EFFICIENT_ATTENTION benchmark:
SDPBackend.EFFICIENT_ATTENTION cuda:0 is not supported. See warnings for reasons.
Device: AMD EPYC 7352 24-Core Processor / cpu
Default benchmark:
3844997.412 microseconds, 3.844997411943041 sec
SDPBackend.MATH benchmark:
3642490.409 microseconds, 3.6424904089653865 sec
SDPBackend.FLASH_ATTENTION benchmark:
3828689.283 microseconds, 3.8286892829928547 sec
SDPBackend.EFFICIENT_ATTENTION benchmark:
SDPBackend.EFFICIENT_ATTENTION cpu is not supported. See warnings for reasons.
Summary

Pytorch version: 2.4.1
ROCM HIP version: 6.1.40093-de7055040
CPU: AMD EPYC 7352 24-Core Processor
Problem parameters:
Sequence-length: 512
Batch-size: 32
Heads: 16
Embed_dimension: 16
Datatype: torch.float16
Device: AMD Radeon Graphics / cuda:0
Default: 3205.060 ms
SDPBackend.MATH: 3212.746 ms
SDPBackend.FLASH_ATTENTION: -1.000 ms
SDPBackend.EFFICIENT_ATTENTION: -1.000 ms

Device: AMD EPYC 7352 24-Core Processor / cpu
Default: 3844997.412 ms
SDPBackend.MATH: 3642490.409 ms
SDPBackend.FLASH_ATTENTION: 3828689.283 ms
SDPBackend.EFFICIENT_ATTENTION: -1.000 ms

@cb88
Author

cb88 commented Dec 18, 2024

[cb88@M31-AR0 opt]$ find -name libhsakmt.so
./rocm_sdk_612/lib64/libhsakmt.so
./rocm_sdk_612/lib/libhsakmt.so
./rocm/lib/libhsakmt.so

/opt/rocm is the binary install from Arch.

[cb88@M31-AR0 opt]$ ls -la /opt/rocm_sdk_612/lib/libhsakmt.*
lrwxrwxrwx 1 cb88 cb88 35 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.a -> /opt/rocm_sdk_612/lib64/libhsakmt.a
lrwxrwxrwx 1 cb88 cb88 36 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so -> /opt/rocm_sdk_612/lib/libhsakmt.so.1
lrwxrwxrwx 1 cb88 cb88 40 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so.1 -> /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6
lrwxrwxrwx 1 cb88 cb88 42 Dec 11 12:28 /opt/rocm_sdk_612/lib/libhsakmt.so.1.0.6 -> /opt/rocm_sdk_612/lib64/libhsakmt.so.1.0.6

@lamikr
Owner

lamikr commented Dec 18, 2024

So it seems that the original problem with the missing symbol in rocBLAS is now solved for you as well, and pytorch is able to use rocBLAS with the MATH backend. Llama.cpp, which was also failing earlier for Said-akbar, should probably now work too if you build it with

./babs.sh -ca binfo/extra/ai_tools.blist
./babs.sh -b binfo/extra/ai_tools.blist

and then run

cd /opt/rocm_sdk_612/docs/examples/llm/llama_cpp/
./run_llama_benchmark.sh

There is still this second problem with flash-attention that needs to be solved. And at the moment I have no idea why the pytorch vision build fails for you.

@cb88
Author

cb88 commented Dec 18, 2024

./babs.sh -b binfo/extra/ai_tools.blist ran for a bit, then...

-- Found Python: /opt/rocm_sdk_612/bin/python (found version "3.11.9") found components: Interpreter Development.Module Development.SABIModule
-- Found python matching: /opt/rocm_sdk_612/bin/python.
CMake Error at cmake/utils.cmake:37 (message):
Failed to locate torch path: corrupted size vs. prev_size in fastbins

Call Stack (most recent call first):
cmake/utils.cmake:45 (run_python)
CMakeLists.txt:70 (append_cmake_prefix_path)

-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 483, in <module>
setup(
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
self.run_command(cmd)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/wheel/_bdist_wheel.py", line 387, in run
self.run_command("build")
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 235, in run
super().run()
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 197, in build_extensions
self.configure(ext)
File "/home/cb88/rocm_sdk_builder/src_projects/vllm/setup.py", line 177, in configure
subprocess.check_call(
File "/opt/rocm_sdk_612/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '/home/cb88/rocm_sdk_builder/src_projects/vllm', '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=rocm', '-DVLLM_PYTHON_EXECUTABLE=/opt/rocm_sdk_612/bin/python', '-DVLLM_PYTHON_PATH=/home/cb88/rocm_sdk_builder/src_projects/vllm:/opt/rocm_sdk_612/lib/python311.zip:/opt/rocm_sdk_612/lib/python3.11:/opt/rocm_sdk_612/lib/python3.11/lib-dynload:/opt/rocm_sdk_612/lib/python3.11/site-packages', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile', '-DCMAKE_JOB_POOLS:STRING=compile=8']' returned non-zero exit status 1.
corrupted size vs. prev_size in fastbins
./build_rocm.sh: line 22: 3139439 Aborted (core dumped) python setup.py bdist_wheel
build failed: vllm
error in build cmd: ./build_rocm.sh /opt/rocm_sdk_612 gfx906

@lamikr
Owner

lamikr commented Dec 18, 2024

Hmm, the vllm build, which runs before the llama.cpp build, seems to fail with the same type of error ("corrupted size vs. prev_size in fastbins") as pytorch vision.

How about if you just build llama.cpp:

./babs.sh -b binfo/extra/llama_cpp.binfo

@cb88
Author

cb88 commented Dec 18, 2024

Llama.cpp builds and runs successfully.

@cb88
Author

cb88 commented Dec 18, 2024

Something is still not right with it though...

llama_kv_cache_init: ROCm0 KV buffer size = 4000.00 MiB
ggml_cuda_host_malloc: failed to allocate 156000.00 MiB of pinned memory: out of memory
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 163577856032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: failed to create context with model
main: error: unable to load model

It didn't matter if I passed -ngl 1 or 999; it still tried to allocate a huge buffer and failed. This is a small model that koboldcpp can load fully in VRAM.
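For what it's worth, the failed 163577856032-byte allocation is exactly the 156000 MiB reported on the pinned-memory line, give or take a 32-byte header, which is easy to check. Since llama.cpp sizes the KV cache from the context length, a request that large suggests the context ended up far bigger than intended, though that is only a guess:

```shell
# 156000 MiB expressed in bytes matches the failed buffer size from the log
# (163577856032) minus a small 32-byte overhead.
echo $(( 156000 * 1024 * 1024 ))   # 163577856000
```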
