Description
Does opa-psm2 support GDRDMA on the ROCm platform, and if so, what do I have to do to enable it?
I am using opa-psm2-PSM2_11.2.NCCL and the psm2-nccl plugin on the ROCm platform, with the environment set like this:
export PSM2_GPUDIRECT=1
export PSM2_CUDA=1
Running rccl-tests with these settings fails with the following error:
node37.219242 Unhandled error in TID Update: Bad address
[node37:219242] *** Process received signal ***
[node37:219242] Signal: Aborted (6)
[node37:219242] Signal code: (-6)
[node37:219242] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b7bc83a65d0]
[node37:219242] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b7bcdacd207]
[node37:219242] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b7bcdace8f8]
[node37:219242] [ 3] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x16054)[0x2b7decdf3054]
[node37:219242] [ 4] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x1660d)[0x2b7decdf360d]
[node37:219242] [ 5] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x37e7c)[0x2b7dece14e7c]
[node37:219242] [ 6] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3824c)[0x2b7dece1524c]
[node37:219242] [ 7] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34331)[0x2b7dece11331]
[node37:219242] [ 8] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x34898)[0x2b7dece11898]
[node37:219242] [ 9] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3b793)[0x2b7dece18793]
[node37:219242] [10] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x3c690)[0x2b7dece19690]
[node37:219242] [11] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x28c8f)[0x2b7dece05c8f]
[node37:219242] [12] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x2653c)[0x2b7dece0353c]
[node37:219242] [13] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(+0x23f8f)[0x2b7dece00f8f]
[node37:219242] [14] /home/fd/opa-psm2-install/usr/lib64/libpsm2.so.2(psm2_mq_ipeek+0x7c)[0x2b7decdfaeec]
[node37:219242] [15] /home/fd/psm2-nccl-master/librccl-net.so(psm2_nccl_test+0xb3)[0x2b7e23a027b3]
Debugging the core file with gdb gives:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/fd/rccl-tests-master/build/all_gather_perf --minbytes=2621'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x2b7f14a00700 (LWP 219289))]
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64 glibc-2.17-260.el7.x86_64 infinipath-psm-3.3-26_g604758e_open.2.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libibverbs-17.2-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-17.2-3.el7.x86_64 libstdc++-4.8.5-36.el7.x86_64 libuuid-2.23.2-59.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00002b7bcdacd207 in raise () from /lib64/libc.so.6
#1 0x00002b7bcdace8f8 in abort () from /lib64/libc.so.6
#2 0x00002b7decdf3054 in psmi_errhandler_psm (ep=ep@entry=0x0, err=err@entry=PSM2_INTERNAL_ERR, error_string=error_string@entry=0x2b7f149f9acc " Unhandled error in TID Update: Bad address\n", token=token@entry=0x2b7f149f9ac0)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:96
#3 0x00002b7decdf360d in psmi_handle_error (ep=0xfffffffffffffffe, error=PSM2_INTERNAL_ERR, buf=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_error.c:183
#4 0x00002b7dece14e7c in ips_tidcache_register (tidc=tidc@entry=0x2b7e303bf458, start=start@entry=47820958728192, length=131072, firstidx=firstidx@entry=0x2b7f149f9e4c)
at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:221
#5 0x00002b7dece1524c in ips_tidcache_acquire (tidc=tidc@entry=0x2b7e303bf458, buf=0x2b7e2f420000, length=length@entry=0x2b7f149f9ef0, tid_array=tid_array@entry=0x2b7e303bf734, tidcnt=tidcnt@entry=0x2b7f149f9ef4,
tidoff=tidoff@entry=0x2b7f149f9eec) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_tidcache.c:471
#6 0x00002b7dece11331 in ips_tid_recv_alloc_frag (nbytes_this=131072, tidrecvc=0x2b7e303bf650, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:1969
#7 ips_tid_recv_alloc (ptidrecvc=<optimized out>, nbytes_this=131072, getreq=<optimized out>, ipsaddr=0x2b7e30853210, protoexp=0x2b7e303bf440) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2135
#8 ips_tid_pendtids_timer_callback (timer=timer@entry=0x2b7e303bf610, current=current@entry=0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:2379
#9 0x00002b7dece11898 in ips_protoexp_tid_get_from_token (protoexp=0x2b7e303bf440, buf=0x2b7e2f420000, length=2097152, epaddr=0x2b7e30853210, remote_tok=1023, flags=<optimized out>, callback=0x2b7dece16b50 <ips_proto_mq_rv_complete_exp>,
context=0x2b7e301b9920) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_expected.c:587
#10 0x00002b7dece18793 in ips_proto_mq_rts_match_callback (req=0x2b7e301b9920, was_posted=was_posted@entry=1) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1152
#11 0x00002b7dece19690 in ips_proto_mq_handle_rts (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_mq.c:1536
#12 0x00002b7dece05c8f in ips_proto_process_packet (rcv_ev=0x2b7f149fa200) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_proto_help.h:555
#13 ips_recvhdrq_progress (recvq=0x2b7e301bfb98) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ips_recvhdrq.c:543
#14 0x00002b7dece0353c in ips_ptl_poll (ptl_gen=0x2b7e301b9e80, _ignored=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/ptl_ips/ptl.c:541
#15 0x00002b7dece00f8f in __psmi_poll_internal (ep=0x2b7e301b9ac0, poll_amsh=<optimized out>) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm.c:1071
#16 0x00002b7decdfaeec in psmi_mq_ipeek_inner (status_copy=<optimized out>, status=0x0, oreq=0x2b7f149fa438, mq=0x2b7e3010bf80) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1135
#17 _psm2_mq_ipeek (mq=0x2b7e3010bf80, oreq=0x2b7f149fa438, status=0x0) at /home/fd/opa-psm2-PSM2_11.2.NCCL/psm_mq.c:1174
#18 0x00002b7e23a027b3 in psm2_nccl_test () from /home/fd/psm2-nccl-master/librccl-net.so
#19 0x00002b7bc88fee2d in ncclNetTest (request=0x3586a, done=0x2b7f149fa4f4, size=0x2b7f149fa4cc) at /home/fd/rccl-dtk-21.10/src/include/net.h:29
#20 netRecvProxy (args=<optimized out>) at /home/fd/rccl-dtk-21.10/src/transport/net.cc:516
#21 0x00002b7bc8916de4 in progressOps (state=<optimized out>, opsPtr=<optimized out>, idle=<optimized out>, comm=<optimized out>) at /home/fd/rccl-dtk-21.10/src/proxy.cc:342
#22 persistentThread (comm=0x2b7e30000c00) at /home/fd/rccl-dtk-21.10/src/proxy.cc:440
#23 0x00002b7bc839edd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00002b7bcdb94ead in clone () from /lib64/libc.so.6
(gdb)
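To make the numbers in the backtrace easier to read, here is a small sketch that only does arithmetic on values copied from the frames above. It shows that the `start` passed to `ips_tidcache_register` (frame #4, printed in decimal) is the same address as `buf` in frames #5 and #9, and that the failing 131072-byte registration is one chunk of the 2097152-byte receive. The variable names are mine and purely illustrative. The script does not tell whether that address is host or GPU memory; that part is my assumption, not something the trace proves.

```python
import errno
import os

# Values copied verbatim from the gdb backtrace above.
tid_start = 47820958728192   # frame #4: start (gdb prints it in decimal)
recv_buf = 0x2B7E2F420000    # frames #5 and #9: buf
chunk_len = 131072           # frames #4 and #6: length / nbytes_this
total_len = 2097152          # frame #9: length of the whole receive

# The failing TID registration begins exactly at the receive buffer.
same_address = tid_start == recv_buf

# Number of chunk_len-sized pieces in the full receive.
chunks = total_len // chunk_len

# The errno whose message text matches the "Bad address" in the log.
errno_text = os.strerror(errno.EFAULT)

print("start =", hex(tid_start), "matches buf:", same_address)
print("chunk =", chunk_len // 1024, "KiB,", chunks, "chunks in",
      total_len // (1024 * 1024), "MiB")
print("EFAULT ->", errno_text)
```

So the abort happens while registering the very beginning of the receive buffer, before any later chunk is attempted.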