Unit Tests Segmentation fault #1

@flyingdown

Description

I get a segmentation fault when I run the unit tests.
The libfabric version (libfabric.so) is v1.11.0.

$ mpirun -np 2 -H node1,node2 -x HFI_UNIT=0 ./nccl_message_transfer
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1033: NET/OFI Selected Provider is psm2
INFO: Function: main Line: 69: NET/OFI Process rank 0 started. NCCLNet device used on node13 is AWS Libfabric.
INFO: Function: main Line: 73: NET/OFI Received 1 network devices
[node13:203154] *** Process received signal ***
[node13:203154] Signal: Segmentation fault (11)
[node13:203154] Signal code: Address not mapped (1)
[node13:203154] Failing at address: (nil)
[node13:203154] [ 0] /public/home/fd/fabric-install/lib/libfabric.so.1(+0x11e1c5)[0x2b01dac9a1c5]
[node13:203154] [ 1] /lib64/libpthread.so.0(+0xf5d0)[0x2b01dbd955d0]
[node13:203154] [ 2] /lib64/libc.so.6(+0x8c3b1)[0x2b01dc62a3b1]
[node13:203154] [ 3] /lib64/libc.so.6(__strdup+0xe)[0x2b01dc62a0be]
[node13:203154] [ 4] /public/home/fd/ofi-install/lib/librccl-net.so.0(+0x1dab)[0x2b01d9b8adab]
[node13:203154] [ 5] ./nccl_message_transfer[0x4013ef]
[node13:203154] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b01dc5c03d5]
[node13:203154] [ 7] ./nccl_message_transfer[0x402023]
[node13:203154] *** End of error message ***

In function ofi_getProperties, at line 1198, nic_info->device_attr->name is NULL. It is passed to strdup, which dereferences the null pointer and causes the core dump (frames 2 to 4 of the backtrace above).
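
One possible way to avoid the crash is to check the pointer chain before calling strdup and fall back to a placeholder name. The following is only a minimal sketch, not the plugin's actual code: the `demo_nic` and `demo_device_attr` structs are hypothetical stand-ins that model just the `device_attr->name` field mentioned above, and `dup_device_name` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the NIC structures; only the fields
 * relevant to this issue are modeled. */
struct demo_device_attr {
    char *name;
};

struct demo_nic {
    struct demo_device_attr *device_attr;
};

/*
 * Return a heap-allocated copy of the NIC device name. If the NIC,
 * its device attributes, or the name itself is missing, copy
 * `fallback` instead of handing NULL to strdup.
 * The caller owns the result and must free() it; NULL is returned
 * only if the allocation fails.
 */
static char *dup_device_name(const struct demo_nic *nic, const char *fallback)
{
    assert(fallback != NULL);

    if (nic == NULL || nic->device_attr == NULL ||
        nic->device_attr->name == NULL) {
        return strdup(fallback);
    }
    return strdup(nic->device_attr->name);
}
```

In ofi_getProperties, the fallback could be something like the selected provider's name (psm2 in the log above), assuming that string is available at that point. Whether a placeholder is acceptable or the function should return an error instead is a design decision for the plugin maintainers.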
