Use a generic OpenCL ICD loader. #417

maleadt · 2024-04-12T11:42:25Z

Fixes #406. The ICD loader we ship as part of oneAPI_Support_jll, required by oneMKL, apparently doesn't support PVC. Here, I switch to Khronos' generic ICD loader:

julia> using oneAPI

julia> oneAPI.versioninfo()
Binary dependencies:
- NEO: 24.9.28717+0
- libigc: 1.0.16238+0
- gmmlib: 22.3.17+0
- SPIRV_LLVM_Translator_unified: 0.3.0+0
- SPIRV_Tools: 2023.2.0+0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

1 driver:
- 00000000-0000-0000-17c5-858d0103702d (v1.3.28717, API v1.3.0)

16 devices:
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550

julia> oneArray(rand(Float32, 10, 10)) * oneArray(rand(Float32, 10, 10))
10×10 oneArray{Float32, 2, oneAPI.oneL0.DeviceBuffer}:
 3.19613  2.97887  2.28815   2.74149  1.57031  2.77789  2.67408  2.77693   2.62745  3.46989
 2.57461  2.44688  2.10964   2.6676   2.08912  1.9301   2.9007   1.87115   2.34614  3.43717
 1.40868  1.32582  0.996341  1.50965  1.17594  1.09587  1.40743  0.91676   1.25728  1.6185
 3.03337  2.81118  2.16232   2.80838  1.82865  2.81073  2.65468  2.32454   2.58824  3.28149
 1.43007  1.57745  1.13201   1.34756  1.27351  1.49425  1.75484  0.838633  1.7749   1.4279
 2.36413  2.39614  1.57379   2.37182  1.76427  2.14909  2.76529  1.26664   2.51001  2.37393
 4.23767  3.66995  2.72545   3.38257  2.74463  3.47646  3.34504  3.70197   3.35509  4.60171
 3.26022  3.68492  2.21753   3.24749  2.4025   3.21383  2.5462   2.85667   2.73826  3.29256
 2.55168  2.78619  1.95469   2.51261  1.72095  2.42254  2.95073  1.67603   2.60132  2.92868
 3.78765  3.45718  2.45423   3.32748  2.74399  3.36122  3.4286   3.10559   3.18935  4.2261

maleadt · 2024-04-12T12:53:40Z

The consistent new failure in oneMKL tests is very suspicious.

One potential reason for the crash on free is that libOpenCL.so seems to be double-loaded:

$ LD_LIBRARY_PATH=/tmp/x/lib LD_DEBUG=libs julia --project -e 'using oneAPI; oneArray(rand(Float32, 10, 10)) * oneArray(rand(Float32, 10, 10))' |& grep 'calling init' | grep libOpenCL
    239903:	calling init: /home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so

    239903:	find library=libOpenCL.so [0]; searching
    239903:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libOpenCL.so
    239903:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/./libOpenCL.so
    239903:	  trying file=/home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/liblibOpenCL.so
    239903:	calling init: /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libOpenCL.so

i.e. both the version from OpenCL_jll.jl is loaded (which Julia dlopens), and the one from oneAPI_Support_jll is also loaded (presumably as a dependency). I'm not sure why, but loading a single library twice may very well explain the crash.

However, building a version of oneAPI_Support_jll without libOpenCL.so (without anything from intel-opencl-rt for that matter) throws us back to the old error where MKL fails to do anything becuase... it can't find libOpenCL (despite the fact that it's been loaded!!):

    240605:	calling init: /home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so

    240605:	find library=libOpenCL.so [0]; searching
    240605:	 search path=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib		(RUNPATH from file /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/liboneapi_support.so)
    240605:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libOpenCL.so
    240605:	 search path=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/.		(RPATH from file /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libpi_level_zero.so)
    240605:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/./libOpenCL.so
    240605:	 search cache=/etc/ld.so.cache
    240605:	 search path=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/lib:/usr/lib		(system search path)
    240605:	  trying file=/lib/x86_64-linux-gnu/libOpenCL.so
    240605:	  trying file=/usr/lib/x86_64-linux-gnu/libOpenCL.so
    240605:	  trying file=/lib/libOpenCL.so
    240605:	  trying file=/usr/lib/libOpenCL.so

Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.

maleadt · 2024-04-12T13:12:10Z

SONAMEs are identical:

$ readelf -d /home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [libOpenCL.so.1]
$ readelf -d /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libOpenCL.so | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [libOpenCL.so.1]

I really don't get why MKL (or rather pi_opencl) would want to load a second copy of libOpenCL, resulting in crashes when it does, or in failures to load OpenCL if it doesn't. SYCL_PI_TRACE also isn't helpful, e.g., in the case where oneAPI_Support_jll is shipped without libopencl.so and thus only the generic one can be loaded:

$ SYCL_PI_TRACE=1 julia --project -e 'using oneAPI; oneArray(rand(Float32, 10, 10)) * oneArray(rand(Float32, 10, 10))'
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1550
SYCL_PI_TRACE[all]:   platform: Intel(R) Level-Zero
SYCL_PI_TRACE[all]:   device: Intel(R) Data Center GPU Max 1550
Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.

How does that make sense; libOpenCL is loaded, SYCL_PI_TRACE shows that it's detected, yet MKL complains?

maleadt · 2024-04-12T13:53:52Z

Okay, so apparently MKL's loading of libOpenCL.so ignores the fact that we've already loaded a copy, but does work when first setting LD_LIBRARY_PATH to OpenCL_jll's artifact dir. That's not workable, however, it may not be terrible, because the USM-related crash at exit also happens when only having a single libOpenCL.so loaded (by setting LD_LIBRARY_PATH manually).

So, for the purpose of getting PVC working, we could:

use Khronos' ICD loader by having oneAPI_Support_jll depend on OpenCL_jll
figure out why that crashes at exit
either figure out a way to have MKL's loading of libOpenCL.so pick up the already-loaded copy, or, ship a dummy loader in the same directory (as it seems like MKL doesn't use the copy it loads anyway)

maleadt · 2024-04-12T14:46:19Z

either figure out a way to have MKL's loading of libOpenCL.so pick up the already-loaded copy,

From LD_DEBUG=all:

    260528:	file=/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so [0];  dynamically loaded by /home/sdp/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/bin/../lib/julia/libjulia-internal.so.1.10 [0]
...
    260528:	file=libOpenCL.so [0];  dynamically loaded by /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4 [0]
    260528:	find library=libOpenCL.so [0]; searching
    260528:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libOpenCL.so
    260528:	  trying file=/home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/./libOpenCL.so

Yet I can't replicate this by loading libmkl_sycl_blas.so in isolation:

# loading only MKL -> it searches for OpenCL
$ LD_DEBUG=libs julia -e 'using Libdl; Libdl.dlopen("./libmkl_sycl_blas.so")' |& grep -i OpenCL.so
    263413:	find library=libOpenCL.so.1 [0]; searching
    263413:	calling init: /lib/x86_64-linux-gnu/libOpenCL.so.1

# loading oneAPI.jl first, which dlopen's libOpenCL.so
$ LD_DEBUG=libs julia -e 'using oneAPI; using Libdl; Libdl.dlopen("./libmkl_sycl_blas.so")' |& grep -i OpenCL.so
    262486:	calling init: /home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so

ship a dummy loader in the same directory

It won't be possible to make this an actual dummy library, as symbols are looked up in the loaded libOpenCL.so.

Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.

maleadt · 2024-04-12T15:07:01Z

The reason I can't replicate this is probably because the lazy load actually only happens during sgemm. Here's the dlopens traced in GDB:

# during __init__ of OpenCL_jll
Thread 1 "julia" hit Breakpoint 1, ___dlopen (file=file@entry=0x7fffffff70c0 "/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so", mode=mode@entry=9) at ./dlfcn/dlopen.c:77
77	in ./dlfcn/dlopen.c
$105 = 0x7fffffff70c0 "/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so"
(gdb) bt
#0  ___dlopen (file=file@entry=0x7fffffff70c0 "/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so", mode=mode@entry=9) at ./dlfcn/dlopen.c:77
#1  0x00007ffff71d6717 in ijl_dlopen (filename=filename@entry=0x7fffffff70c0 "/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so", flags=flags@entry=68)
    at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/dlload.c:200
#2  0x00007ffff71d6816 in ijl_load_dynamic_library (modname=0x7ffff070a370 "/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so", flags=68, throw_err=1)
    at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/dlload.c:365
#3  0x00007fffe19b79c8 in julia_#dlopen#3_51543 () at libdl.jl:117
#4  0x00007fffe267347a in julia_dlopen_51536 () at libdl.jl:116
#5  0x00007fffe2673533 in jfptr_dlopen_51537 () from /home/sdp/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/lib/julia/sys.so
#6  0x00007ffff71b6a0e in _jl_invoke (world=<optimized out>, mfunc=0x7fffe32787d0 <jl_system_image_data+2728080>, nargs=2, args=0x7fffffff92b0, F=0x7fffe3278560 <jl_system_image_data+2727456>)
    at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
#7  ijl_apply_generic (F=<optimized out>, args=0x7fffffff92b0, nargs=<optimized out>) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#8  0x00007ffcb4a51273 in macro expansion () at /home/sdp/.julia/packages/JLLWrappers/pG9bm/src/products/library_generators.jl:63
#9  japi1___init___354 () at /home/sdp/.julia/dev/OpenCL_jll/src/wrappers/x86_64-linux-gnu.jl:8
#10 0x00007ffff71b6a0e in _jl_invoke (world=<optimized out>, mfunc=0x7ffcb4a551b0 <jl_system_image_data+5168>, nargs=0, args=0x7fffffff9480, F=0x7ffcb4a54e70 <jl_system_image_data+4336>)
    at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
#11 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7fffffff9480, nargs=nargs@entry=0) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#12 0x00007ffff71f1115 in jl_apply (nargs=1, args=0x7fffffff9478) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982
#13 jl_module_run_initializer (m=0x7ffcb4a546d0 <jl_system_image_data+2384>) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:76
#14 0x00007fffe21c2d7d in julia_run_module_init_80431 () at loading.jl:1134
#15 0x00007fffe101d9cf in julia_register_restored_modules_80415 () at loading.jl:1122
#16 0x00007fffe293d5bc in julia__include_from_serialized_80388 () at loading.jl:1067
#17 0x00007fffe1627dfa in julia__tryrequire_from_serialized_80639 () at loading.jl:1481

# during sgemm in libmkl_sycl_blas
Thread 1 "julia" hit Breakpoint 1, ___dlopen (file=0x7fffffffad08 "libOpenCL.so", mode=257) at ./dlfcn/dlopen.c:77
77	in ./dlfcn/dlopen.c
$250 = 0x7fffffffad08 "libOpenCL.so"
(gdb) bt
#0  ___dlopen (file=0x7fffffffad08 "libOpenCL.so", mode=257) at ./dlfcn/dlopen.c:77
#1  0x00007ffc9b99dbf8 in mkl_cl_load_lib () from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#2  0x00007ffc994395fd in oneapi::mkl::gpu::mkl_gpu_map_l0_to_cl(int*, _ze_device_handle_t*, _cl_device_id**, _cl_context**) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#3  0x00007ffc994364bb in oneapi::mkl::gpu::add_arch_info(sycl::_V1::queue*, oneapi::mkl::gpu::mkl_gpu_device_info_t*) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#4  0x00007ffc99438ed7 in oneapi::mkl::gpu::get_device_info_with_arch(sycl::_V1::queue*, oneapi::mkl::gpu::mkl_gpu_device_info_t*) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#5  0x00007ffc9acf1a6a in oneapi::mkl::gpu::mkl_blas_gpu_sgemm_driver_sycl(int*, sycl::_V1::queue*, oneapi::mkl::gpu::blas_arg_usm_t*, mkl_gpu_event_list_t*) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#6  0x00007ffc9acdc93b in oneapi::mkl::gpu::sgemm_sycl_internal(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, MKL_TRANSPOSE, long, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, oneapi::mkl::blas::compute_mode, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, long, long, long) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#7  0x00007ffc9acdaf5d in oneapi::mkl::gpu::sgemm_sycl(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, MKL_TRANSPOSE, long, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, oneapi::mkl::blas::compute_mode, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, long, long, long) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#8  0x00007ffc9b71ad22 in oneapi::mkl::blas::sgemm(sycl::_V1::queue&, MKL_LAYOUT, oneapi::mkl::transpose, oneapi::mkl::transpose, long, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, oneapi::mkl::blas::compute_mode, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#9  0x00007ffc9b6a2f2f in oneapi::mkl::blas::column_major::gemm(sycl::_V1::queue&, oneapi::mkl::transpose, oneapi::mkl::transpose, long, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, oneapi::mkl::blas::compute_mode, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) ()
   from /home/sdp/.julia/artifacts/c834e5913bd3a3923a5029184fba0ad0af2d08e6/lib/libmkl_sycl_blas.so.4
#10 0x00007ffcb43a06a4 in onemklSgemm (device_queue=0x7ffc98c8ed30, transa=<optimized out>, transb=<optimized out>, m=0, n=4858932581021935980, k=<optimized out>, alpha=<optimized out>, a=0xff001ffffffe0000,
    lda=10, b=0xff001ffffffd0000, ldb=10, beta=-7.39849727e-21, c=0xff001ffffffc0000, ldc=10) at /workspace/srcdir/oneAPI.jl/deps/src/onemkl.cpp:682

mode=9, as done by OpenCL_jll, is RTLD_LAZY|RTLD_DEEPBIND (matching JLLWrappers).
mode=257, as done by MKL, is RTLD_LAZY|RTLD_GLOBAL.

maleadt · 2024-04-12T15:24:22Z

I guess my mental model of how libdl works was wrong:

dlopen("/home/sdp/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so", RTLD_LAZY | RTLD_DEEPBIND) == dlopen("libOpenCL.so", RTLD_LAZY | RTLD_GLOBAL) = false

i.e. loading the same library, once via a full path, once via a path that can be looked up in LD_LIBRARY_PATH, loads the library multiple times.

This does actually work fine if MKL would dlopen using the library's SONAME, i.e., dlopen("libOpenCL.so.1"), but I guess that's not something we could get fixed... If that even would work given the nature of the plugin interface.

Potential solutions:

set LD_LIBRARY_PATH in the JLL's init: doesn't work, only read during startup
patch libmkl: probably not EULA-allowed?
add a symlink to ../opencl_jlls_treesha/lib/libopencl.so: doesn't work with multiple depots, and BB doesn't seem to support shipping such symlinks
ask MKL to first dlopen by SONAME

pengtu · 2024-04-12T17:13:33Z

This is an attempt to solve #406. The ICD loader we ship as part of oneAPI_Support_jll, required by oneMKL, apparently doesn't support PVC (?). Here, I switch to Khronos' generic ICD loader, which should also allow making the oneAPI_Support_jll more lightweight.

The good news is that it works:

julia> using oneAPI

julia> oneAPI.versioninfo()
Binary dependencies:
- NEO: 24.9.28717+0
- libigc: 1.0.16238+0
- gmmlib: 22.3.17+0
- SPIRV_LLVM_Translator_unified: 0.3.0+0
- SPIRV_Tools: 2023.2.0+0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

1 driver:
- 00000000-0000-0000-17c5-858d0103702d (v1.3.28717, API v1.3.0)

16 devices:
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550

julia> oneArray(rand(Float32, 10, 10)) * oneArray(rand(Float32, 10, 10))
10×10 oneArray{Float32, 2, oneAPI.oneL0.DeviceBuffer}:
 3.19613  2.97887  2.28815   2.74149  1.57031  2.77789  2.67408  2.77693   2.62745  3.46989
 2.57461  2.44688  2.10964   2.6676   2.08912  1.9301   2.9007   1.87115   2.34614  3.43717
 1.40868  1.32582  0.996341  1.50965  1.17594  1.09587  1.40743  0.91676   1.25728  1.6185
 3.03337  2.81118  2.16232   2.80838  1.82865  2.81073  2.65468  2.32454   2.58824  3.28149
 1.43007  1.57745  1.13201   1.34756  1.27351  1.49425  1.75484  0.838633  1.7749   1.4279
 2.36413  2.39614  1.57379   2.37182  1.76427  2.14909  2.76529  1.26664   2.51001  2.37393
 4.23767  3.66995  2.72545   3.38257  2.74463  3.47646  3.34504  3.70197   3.35509  4.60171
 3.26022  3.68492  2.21753   3.24749  2.4025   3.21383  2.5462   2.85667   2.73826  3.29256
 2.55168  2.78619  1.95469   2.51261  1.72095  2.42254  2.95073  1.67603   2.60132  2.92868
 3.78765  3.45718  2.45423   3.32748  2.74399  3.36122  3.4286   3.10559   3.18935  4.2261

... the bad news is that it crashes MKL on process teardown:

julia> exit()

[227689] signal (11.128): Segmentation fault
in expression starting at REPL[4]:1
__x86_indirect_thunk_rax at /home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
unknown function (ip: 0x5fffffffffffffff)
_Z13USMFreeHelperP20ur_context_handle_t_Pvb at /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libpi_level_zero.so (unknown line)
Allocations: 10328633 (Pool: 10317672; Big: 10961); GC: 13
Segmentation fault (core dumped)

GDB reveals:

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffcb2b3f9b4 in USMFreeHelper(ur_context_handle_t_*, void*, bool) () from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libpi_level_zero.so
#2  0x00007ffcb2b3f7cd in urUSMFree () from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libpi_level_zero.so
#3  0x00007ffcb2b47ee6 in piextUSMFree () from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libpi_level_zero.so
#4  0x00007ffc71239cfb in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)95, _pi_context*, void*>(_pi_context*, void*) const ()
   from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libsycl.so.7
#5  0x00007ffc71233303 in sycl::_V1::detail::usm::free(void*, sycl::_V1::context const&, sycl::_V1::detail::code_location const&) ()
   from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libsycl.so.7
#6  0x00007ffc99e6a3c3 in oneapi::mkl::gpu::zero_pool_free_buffers() () from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4
#7  0x00007ffc995fb5d8 in mkl_sycl_destructor () from /home/sdp/.julia/artifacts/9ea79ba1e382480323ed972cffadb2b5a703b3eb/lib/libmkl_sycl_blas.so.4
#8  0x00007ffff7dca495 in __run_exit_handlers (status=0, listp=0x7ffff7f9f838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#9  0x00007ffff7dca610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#10 0x0000000000401090 in main (argc=<optimized out>, argv=<optimized out>) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/cli/loader_exe.c:61

@pengtu You looked into this last time we encountered a similar issue; any thoughts?

Last time it was the oneMKL's gpu_buffer was dangling after Julia has released its oneMKL handle.

https://github.com/JuliaGPU/oneAPI.jl/blob/master/deps/src/onemkl.cpp#L4334

This time, it looks like that we need to add:
oneapi::mkl::gpu::zero_pool_free_buffers()

maleadt · 2024-04-15T08:49:44Z

Filed a bug upstream: uxlfoundation/oneMath#472

maleadt · 2024-04-15T09:57:24Z

MWE of the oneMKL failure:

using oneAPI, LinearAlgebra, Test

function main(p)
    println("Running test with $p batches...")
    m = 15
    n = 10
    elty = Float32

    A = [rand(elty,n,n) for i = 1:p]
    A = [A[i]' * A[i] + I for i = 1:p]
    B = [rand(elty,n,p) for i = 1:p]
    d_A = oneMatrix{elty}[]
    d_B = oneMatrix{elty}[]
    for i in 1:p
        push!(d_A, oneMatrix(A[i]))
        push!(d_B, oneMatrix(B[i]))
    end

    oneMKL.potrf_batched!(d_A)
    oneMKL.potrs_batched!(d_A, d_B)
    for i = 1:p
        LAPACK.potrf!('L', A[i])
        LAPACK.potrs!('L', A[i], B[i])
        if !(B[i] ≈ collect(d_B[i]))
            println("Error in batch $i")
            return false
        end
    end
    return true
end

main(5)
main(10)

This often fails during both the p=5 and p=10 test, and worse, doing two tests like this results in a hard crash during process teardown:

Running test with 5 batches...
Error in batch 4
Running test with 10 batches...
Error in batch 9

[539768] signal (11.1): Segmentation fault
in expression starting at none:0
NEO::Drm::hasPageFaultSupport() const at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmAllocation::bindBO(NEO::BufferObject*, NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::Gen12LpFamily>::processResidency(std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&, unsigned int) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::Gen12LpFamily>::flushInternal(NEO::BatchBuffer const&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::Gen12LpFamily>::flush(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
NEO::CommandStreamReceiver::submitBatchBuffer(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueImp::submitBatchBuffer(unsigned long, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&, void*, bool) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)18>::executeCommandListsRegular(L0::CommandQueueHw<(GFXCORE_FAMILY)18>::CommandListExecutionContext&, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)18>::executeCommandLists(unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, bool, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
L0::zeCommandQueueExecuteCommandLists(_ze_command_queue_handle_t*, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*) at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
ur_queue_handle_t_::executeCommandList(std::__detail::_Node_iterator<std::pair<_ze_command_list_handle_t* const, pi_command_list_info_t>, false, false>, bool, bool) at /home/tim/.julia/artifacts/b5c94438cbe90f256f82c5eb04f5be5928683beb/lib/libpi_level_zero.so (unknown line)
ur_queue_handle_t_::executeAllOpenCommandLists() at /home/tim/.julia/artifacts/b5c94438cbe90f256f82c5eb04f5be5928683beb/lib/libpi_level_zero.so (unknown line)
urQueueRelease at /home/tim/.julia/artifacts/b5c94438cbe90f256f82c5eb04f5be5928683beb/lib/libpi_level_zero.so (unknown line)
piQueueRelease at /home/tim/.julia/artifacts/b5c94438cbe90f256f82c5eb04f5be5928683beb/lib/libpi_level_zero.so (unknown line)
sycl::_V1::detail::queue_impl::~queue_impl() at /home/tim/.julia/artifacts/b5c94438cbe90f256f82c5eb04f5be5928683beb/lib/libsycl.so.7 (unknown line)
_M_release at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:161 [inlined]
~__shared_count at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:712 [inlined]
~__shared_ptr at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:1151 [inlined]
~queue at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/sycl/queue.hpp:81 [inlined]
~syclQueue_st at /workspace/srcdir/oneAPI.jl/deps/src/sycl.hpp:19 [inlined]
syclQueueDestroy at /workspace/srcdir/oneAPI.jl/deps/src/sycl.cpp:60
syclQueueDestroy at /home/tim/Julia/pkg/oneAPI/lib/support/liboneapi_support.jl:58 [inlined]
#7 at /home/tim/Julia/pkg/oneAPI/lib/sycl/SYCL.jl:74
unknown function (ip: 0x7f07cad2e8c5)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:454
ijl_atexit_hook at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:732
main at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7f07d156f249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 12947355 (Pool: 12931348; Big: 16007); GC: 20

It's disturbing that switching ICD loaders causes oneMKL to fail like that, both in terms of correctness and these segfaults. FWIW, it's not related to libOpenCL.so getting loaded twice, as I've ruled that out by forcibly loading only a single libOpenCL.so through LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/home/tim/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib LD_DEBUG=libs jl --project wip.jl |& grep libOpenCL
    539472:	calling init: /home/tim/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so
    539472:	find library=libOpenCL.so [0]; searching
    539472:	  trying file=/home/tim/.julia/artifacts/113cdfb400110abbe4d925555e393c8501a9b71c/lib/libOpenCL.so

@pengtu Back to you, I guess...

pengtu · 2024-04-15T14:24:05Z

@maleadt: Looks like that we cannot call mkl_sycl_destructor in syclQueueDestroy because it also invoke ~queue, so we are calling ~queue twice. I will back out the mkl_sycl_destructor call.

Please rerun both the MKL matmul and the new p(5), p(10) tests and let me know the results. I will ask the MKL team depending on the results.

maleadt · 2024-04-15T20:20:00Z

Rebased, but crashes locally with the same error:

[960736] signal (11.1): Segmentation fault
in expression starting at none:0
_ZNK3NEO3Drm19hasPageFaultSupportEv at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN3NEO13DrmAllocation6bindBOEPNS_12BufferObjectEPNS_9OsContextEjPSt6vectorIS2_SaIS2_EEb at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN3NEO24DrmCommandStreamReceiverINS_13Gen12LpFamilyEE16processResidencyERKSt6vectorIPNS_18GraphicsAllocationESaIS5_EEj at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN3NEO24DrmCommandStreamReceiverINS_13Gen12LpFamilyEE13flushInternalERKNS_11BatchBufferERKSt6vectorIPNS_18GraphicsAllocationESaIS8_EE at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN3NEO24DrmCommandStreamReceiverINS_13Gen12LpFamilyEE5flushERNS_11BatchBufferERSt6vectorIPNS_18GraphicsAllocationESaIS7_EE at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN3NEO21CommandStreamReceiver17submitBatchBufferERNS_11BatchBufferERSt6vectorIPNS_18GraphicsAllocationESaIS5_EE at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN2L015CommandQueueImp17submitBatchBufferEmRSt6vectorIPN3NEO18GraphicsAllocationESaIS4_EEPvb at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN2L014CommandQueueHwIL14GFXCORE_FAMILY18EE26executeCommandListsRegularERNS2_27CommandListExecutionContextEjPP25_ze_command_list_handle_tP18_ze_fence_handle_tP18_ze_event_handle_tjPSB_ at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN2L014CommandQueueHwIL14GFXCORE_FAMILY18EE19executeCommandListsEjPP25_ze_command_list_handle_tP18_ze_fence_handle_tbP18_ze_event_handle_tjPS9_ at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN2L033zeCommandQueueExecuteCommandListsEP26_ze_command_queue_handle_tjPP25_ze_command_list_handle_tP18_ze_fence_handle_t at /home/tim/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1 (unknown line)
_ZN18ur_queue_handle_t_18executeCommandListENSt8__detail14_Node_iteratorISt4pairIKP25_ze_command_list_handle_t22pi_command_list_info_tELb0ELb0EEEbb at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libpi_level_zero.so (unknown line)
_ZN18ur_queue_handle_t_26executeAllOpenCommandListsEv at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libpi_level_zero.so (unknown line)
urQueueRelease at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libpi_level_zero.so (unknown line)
piQueueRelease at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libpi_level_zero.so (unknown line)
_ZN4sycl3_V16detail10queue_implD2Ev at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libsycl.so.7 (unknown line)
_M_release at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:161 [inlined]
~__shared_count at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:712 [inlined]
~__shared_ptr at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:1151 [inlined]
~queue at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/sycl/queue.hpp:81 [inlined]
~syclQueue_st at /workspace/srcdir/oneAPI.jl/deps/src/sycl.hpp:19 [inlined]
syclQueueDestroy at /workspace/srcdir/oneAPI.jl/deps/src/sycl.cpp:60
syclQueueDestroy at /home/tim/Julia/pkg/oneAPI/lib/support/liboneapi_support.jl:58 [inlined]
#7 at /home/tim/Julia/pkg/oneAPI/lib/sycl/SYCL.jl:74
unknown function (ip: 0x7fac300ad8d5)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:454
ijl_atexit_hook at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:732
main at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7fac3142f249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 12947981 (Pool: 12931979; Big: 16002); GC: 18

Did it not do that on your system?

I also came across the following:

Running test with 5 batches...
Error in batch 4
Running test with 10 batches...
Error in batch 9
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)

[960203] signal (6.-6): Aborted
in expression starting at none:0
unknown function (ip: 0x7f52dca4fe2c)
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__verbose_terminate_handler at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
__terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
__clang_call_terminate at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libsycl.so.7 (unknown line)
_ZN4sycl3_V16detail10queue_implD2Ev at /home/tim/.julia/artifacts/afc975f89da04ea8d5e4e31e34b9c0f1c8f3a5da/lib/libsycl.so.7 (unknown line)
_M_release at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:161 [inlined]
~__shared_count at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:712 [inlined]
~__shared_ptr at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:1151 [inlined]
~queue at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/sycl/queue.hpp:81 [inlined]
~syclQueue_st at /workspace/srcdir/oneAPI.jl/deps/src/sycl.hpp:19 [inlined]
syclQueueDestroy at /workspace/srcdir/oneAPI.jl/deps/src/sycl.cpp:60
syclQueueDestroy at /home/tim/Julia/pkg/oneAPI/lib/support/liboneapi_support.jl:58 [inlined]
#7 at /home/tim/Julia/pkg/oneAPI/lib/sycl/SYCL.jl:74
unknown function (ip: 0x7f52d612e8a5)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
run_finalizer at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gc.c:454
ijl_atexit_hook at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/init.c:299
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:732
main at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7f52dc9ec249)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 12947991 (Pool: 12931988; Big: 16003); GC: 19

maleadt · 2024-04-15T20:26:56Z

I verified in GDB that mkl_sycl_destructor isn't called.

Using a debug/asserts build of NEO/IGC doesn't reveal anything.

pengtu · 2024-04-17T23:23:55Z

The stack trace in GDB shows that SYCL runtime dies in the SYCL queue destructor trying to execute all the pending open commands.

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0x00007ffc16e0828e in NEO::Drm::getVirtualMemoryAddressSpace(unsigned int) const () from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#1  0x00007ffc16e09b68 in NEO::changeBufferObjectBinding(NEO::Drm*, NEO::OsContext*, unsigned int, NEO::BufferObject*, bool) () from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#2  0x00007ffc16e0a1a1 in NEO::Drm::bindBufferObject(NEO::OsContext*, unsigned int, NEO::BufferObject*) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#3  0x00007ffc16df5732 in NEO::BufferObject::bind(NEO::OsContext*, unsigned int) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#4  0x00007ffc16df23a5 in NEO::DrmAllocation::bindBO(NEO::BufferObject*, NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#5  0x00007ffc16e83f57 in NEO::DrmMemoryOperationsHandlerBind::makeResidentWithinOsContext(NEO::OsContext*, ArrayRef<NEO::GraphicsAllocation*>, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#6  0x00007ffc16e8417d in NEO::DrmMemoryOperationsHandlerBind::mergeWithResidencyContainer(NEO::OsContext*, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#7  0x00007ffc16c5fd7d in NEO::DrmCommandStreamReceiver<NEO::XeHpcCoreFamily>::flush(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#8  0x00007ffc16c6cc3f in NEO::CommandStreamReceiver::submitBatchBuffer(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#9  0x00007ffc16886fb2 in L0::CommandQueueImp::submitBatchBuffer(unsigned long, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&, void*, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#10 0x00007ffc16969640 in L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::executeCommandListsRegular(L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::CommandListExecutionContext&, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#11 0x00007ffc1696bb47 in L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::executeCommandLists(unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, bool, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#12 0x00007ffc1687a51d in L0::zeCommandQueueExecuteCommandLists(_ze_command_queue_handle_t*, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
(gdb) up
#13 0x00007ffc2bd5fb9b in ur_queue_handle_t_::executeCommandList(std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<_ze_command_list_handle_t*, ur_command_list_info_t>, void*>*> >, bool, bool) () from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_level_zero.so
(gdb) bt
#0  0x00007ffc16e0828e in NEO::Drm::getVirtualMemoryAddressSpace(unsigned int) const ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#1  0x00007ffc16e09b68 in NEO::changeBufferObjectBinding(NEO::Drm*, NEO::OsContext*, unsigned int, NEO::BufferObject*, bool) () from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#2  0x00007ffc16e0a1a1 in NEO::Drm::bindBufferObject(NEO::OsContext*, unsigned int, NEO::BufferObject*) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#3  0x00007ffc16df5732 in NEO::BufferObject::bind(NEO::OsContext*, unsigned int) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#4  0x00007ffc16df23a5 in NEO::DrmAllocation::bindBO(NEO::BufferObject*, NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#5  0x00007ffc16e83f57 in NEO::DrmMemoryOperationsHandlerBind::makeResidentWithinOsContext(NEO::OsContext*, ArrayRef<NEO::GraphicsAllocation*>, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#6  0x00007ffc16e8417d in NEO::DrmMemoryOperationsHandlerBind::mergeWithResidencyContainer(NEO::OsContext*, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#7  0x00007ffc16c5fd7d in NEO::DrmCommandStreamReceiver<NEO::XeHpcCoreFamily>::flush(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#8  0x00007ffc16c6cc3f in NEO::CommandStreamReceiver::submitBatchBuffer(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#9  0x00007ffc16886fb2 in L0::CommandQueueImp::submitBatchBuffer(unsigned long, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&, void*, bool) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#10 0x00007ffc16969640 in L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::executeCommandListsRegular(L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::CommandListExecutionContext&, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) ()
--Type <RET> for more, q to quit, c to continue without paging--
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#11 0x00007ffc1696bb47 in L0::CommandQueueHw<(GFXCORE_FAMILY)3080>::executeCommandLists(unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, bool, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#12 0x00007ffc1687a51d in L0::zeCommandQueueExecuteCommandLists(_ze_command_queue_handle_t*, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*) ()
   from /home/peng/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
#13 0x00007ffc2bd5fb9b in ur_queue_handle_t_::executeCommandList(std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<_ze_command_list_handle_t*, ur_command_list_info_t>, void*>*> >, bool, bool) () from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_level_zero.so
#14 0x00007ffc2bd5bc6c in ur_queue_handle_t_::executeAllOpenCommandLists() ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_level_zero.so
#15 0x00007ffc2bd5b558 in urQueueRelease ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_level_zero.so
#16 0x00007ffc2bd6771b in piQueueRelease ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_level_zero.so
#17 0x00007ffbe8c20177 in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)26, _pi_queue*>(_pi_queue*) const ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libsycl.so.7
#18 0x00007ffbe8c1f92c in sycl::_V1::detail::queue_impl::~queue_impl() ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libsycl.so.7
#19 0x00007ffc2c1579fb in syclQueueDestroy ()
   from /home/peng/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so
#20 0x00007fffe0ad3cf0 in ?? ()
#21 0x0000000000000004 in ?? ()
#22 0x0000000005833e80 in ?? ()
#23 0x00007ffc2c85a740 in jl_system_image_data ()
   from /home/peng/.julia/compiled/v1.10/oneAPI_Support_jll/25SX0_2vw3q.so
#24 0x0000000005833e80 in ?? ()
--Type <RET> for more, q to quit, c to continue without paging--
#25 0x00007ffff7d4e038 in ?? ()
#26 0x00007fffed5a2b60 in ?? ()
#27 0x00007fffffffdb80 in ?? ()
#28 0x00007fffe0ad3d36 in ?? ()
#29 0x0000000000007b0e in ?? ()
#30 0x00007fffffffdc68 in ?? ()
#31 0x00007fffffffdc40 in ?? ()
#32 0x00007ffff71b6a0e in _jl_invoke (world=<optimized out>, mfunc=0x7fffed558650, nargs=0, args=0x7fffec588080,
    F=0x16) at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
#33 ijl_apply_generic (F=<optimized out>, args=0x7fffec588080, nargs=<optimized out>)
    at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

pengtu · 2024-04-17T23:41:03Z

The program passes with SYCL_PI_LEVEL_ZERO_BATCH_SIZE=1 julia --project test.jl, so batching seems to be the problem. Our current way of forcing an oneMKL task into the execution apparently does not work in this case:

// This is a workaround to flush MKL submissions into Level-zero queue, using
// unspecified but guaranteed behavior of intel-sycl runtime. Once SYCL standard
// committee approves sycl::queue::flush() we will change the macro to use that
#define __FORCE_MKL_FLUSH__(cmd) \
            sycl::get_native<sycl::backend::ext_oneapi_level_zero>(cmd)

pengtu · 2024-04-18T18:16:21Z

@maleadt: Can oneAPI.jl set the environment variable SYCL_PI_LEVEL_ZERO_BATCH_SIZE=1 for this release? It only affects the SYCL kernel submissions. The only SYCL functions that oneAPI.jl calls are oneMKL functions that require eager dispatch so performance impact will be mostly positive. We can also remove the current workaround, which also reduces the oneMKL call overhead.

maleadt · 2024-04-18T18:51:43Z

Seems to work here too, great! Let's see what CI thinks.

maleadt · 2024-04-18T18:55:03Z

Note that on PVC there looks to be another MKL issue lurking:

Test Summary:    | Pass  Fail  Total   Time
level 1          |   79     1     80  21.2s
  T = Float32    |   15           15   3.0s
  T = ComplexF32 |   16     1     17   2.4s
    copy         |    1            1   0.2s
    axpy         |    1            1   0.4s
    axpby        |    1            1   0.2s
    rotate       |    2            2   0.2s
    reflect      |    2            2   0.1s
    scal         |    2            2   0.1s
    nrm2         |    1            1   0.1s
    iamax/iamin  |    2            2   0.2s
    swap         |    2            2   0.1s
    dot          |    2            2   0.4s
    asum         |          1      1   0.4s
  T = Float64    |   15           15   1.5s
  T = ComplexF64 |   17           17   2.0s
  T = Float16    |    7            7   1.1s
  T = ComplexF16 |    9            9  11.0s

I haven't investigated this yet.

maleadt mentioned this pull request Apr 12, 2024

PVC support #406

Closed

3 tasks

maleadt force-pushed the tb/generic_icd_loader branch from e6d25d9 to 0cd3ec0 Compare April 12, 2024 19:39

maleadt force-pushed the tb/generic_icd_loader branch from 1241b1d to aa9fda3 Compare April 15, 2024 09:35

maleadt force-pushed the tb/generic_icd_loader branch from aa9fda3 to 6d05f6c Compare April 15, 2024 17:58

maleadt force-pushed the tb/generic_icd_loader branch from 6d05f6c to a454010 Compare April 16, 2024 18:33

maleadt added 2 commits April 17, 2024 08:23

Use a generic OpenCL ICD loader.

396cd3c

Bump oneAPI_Support instead.

9e42bce

maleadt force-pushed the tb/generic_icd_loader branch from a454010 to 9e42bce Compare April 17, 2024 06:23

New workaround.

9490f1e

maleadt mentioned this pull request Apr 18, 2024

Remove oneMKL workaround #427

Closed

maleadt merged commit d369a07 into master Apr 18, 2024
1 check passed

maleadt deleted the tb/generic_icd_loader branch April 18, 2024 19:44

maleadt mentioned this pull request Jun 5, 2024

Multiplication of StridedMaybeAdjOrTransMat broken for certain matrix sizes #442

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a generic OpenCL ICD loader. #417

Use a generic OpenCL ICD loader. #417

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024 •

edited

Loading

pengtu commented Apr 12, 2024

maleadt commented Apr 15, 2024

maleadt commented Apr 15, 2024

pengtu commented Apr 15, 2024

maleadt commented Apr 15, 2024

maleadt commented Apr 15, 2024

pengtu commented Apr 17, 2024 •

edited by maleadt

Loading

pengtu commented Apr 17, 2024 •

edited by maleadt

Loading

pengtu commented Apr 18, 2024

maleadt commented Apr 18, 2024

maleadt commented Apr 18, 2024

Use a generic OpenCL ICD loader. #417

Use a generic OpenCL ICD loader. #417

Conversation

maleadt commented Apr 12, 2024 • edited Loading

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024

maleadt commented Apr 12, 2024 • edited Loading

maleadt commented Apr 12, 2024 • edited Loading

maleadt commented Apr 12, 2024 • edited Loading

pengtu commented Apr 12, 2024

maleadt commented Apr 15, 2024

maleadt commented Apr 15, 2024

pengtu commented Apr 15, 2024

maleadt commented Apr 15, 2024

maleadt commented Apr 15, 2024

pengtu commented Apr 17, 2024 • edited by maleadt Loading

pengtu commented Apr 17, 2024 • edited by maleadt Loading

pengtu commented Apr 18, 2024

maleadt commented Apr 18, 2024

maleadt commented Apr 18, 2024

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024 •

edited

Loading

maleadt commented Apr 12, 2024 •

edited

Loading

pengtu commented Apr 17, 2024 •

edited by maleadt

Loading

pengtu commented Apr 17, 2024 •

edited by maleadt

Loading