Description
I get a segmentation fault when I run the unit tests.
The libfabric.so version is v1.11.0.
$ mpirun -np 2 -H node1,node2 -x HFI_UNIT=0 ./nccl_message_transfer
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1033: NET/OFI Selected Provider is psm2
INFO: Function: main Line: 69: NET/OFI Process rank 0 started. NCCLNet device used on node13 is AWS Libfabric.
INFO: Function: main Line: 73: NET/OFI Received 1 network devices
[node13:203154] *** Process received signal ***
[node13:203154] Signal: Segmentation fault (11)
[node13:203154] Signal code: Address not mapped (1)
[node13:203154] Failing at address: (nil)
[node13:203154] [ 0] /public/home/fd/fabric-install/lib/libfabric.so.1(+0x11e1c5)[0x2b01dac9a1c5]
[node13:203154] [ 1] /lib64/libpthread.so.0(+0xf5d0)[0x2b01dbd955d0]
[node13:203154] [ 2] /lib64/libc.so.6(+0x8c3b1)[0x2b01dc62a3b1]
[node13:203154] [ 3] /lib64/libc.so.6(__strdup+0xe)[0x2b01dc62a0be]
[node13:203154] [ 4] /public/home/fd/ofi-install/lib/librccl-net.so.0(+0x1dab)[0x2b01d9b8adab]
[node13:203154] [ 5] ./nccl_message_transfer[0x4013ef]
[node13:203154] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b01dc5c03d5]
[node13:203154] [ 7] ./nccl_message_transfer[0x402023]
[node13:203154] *** End of error message ***
In function ofi_getProperties, line 1198, nic_info->device_attr->name is NULL, so the strdup call on it dereferences a null pointer and dumps core.
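
One possible way to avoid the crash is to check the pointer chain before calling strdup. The following is only a minimal sketch, not the plugin's actual code: the struct types and the helper device_name_or_fallback are hypothetical stand-ins written for this example, and the "unknown" fallback string is an assumption.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup under strict C standards */
#include <stdlib.h>
#include <string.h>

/* Stand-in types for this sketch only; the real definitions come from libfabric. */
struct device_attr_sketch { char *name; };
struct nic_sketch { struct device_attr_sketch *device_attr; };

/* Hypothetical helper: return a heap-allocated copy of the device name, or a
 * copy of `fallback` when any pointer along the chain is NULL.
 * The caller owns the result and must free() it. */
static char *device_name_or_fallback(const struct nic_sketch *nic,
                                     const char *fallback)
{
    const char *name = NULL;

    /* Walk the chain defensively; a provider may leave any of these unset. */
    if (nic != NULL && nic->device_attr != NULL)
        name = nic->device_attr->name;

    /* Passing NULL to strdup is undefined behavior, so substitute the fallback. */
    return strdup(name != NULL ? name : fallback);
}
```

With a guard like this, a provider that leaves the name unset (as psm2 appears to do here) would yield the fallback string instead of a crash.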