libpsm2 cannot find sysfs entry for hfi1_0 on rdma-core v24.0 #43

@bsmith94

libpsm2 looks for sysfs entries under /sys/class/infiniband/hfi1_x, where x is the unit number. As of rdma-core v24.0, the device is renamed according to its device type and its PCI bus and device numbers, in the style of "predictable interface names". The change is described at https://patchwork.kernel.org/cover/10870443/ .

On my host, the sysfs path for hfi1_0 is /sys/class/infiniband/opap129s. Thus, libpsm2 fails to find the hfi1_0 sysfs entry in hfi_sysfs_port_open.

The behavior can be observed by executing fi_info on a Debian sid/bullseye host with libfabric-bin and libpsm2-2 installed. The psm2 providers will not be listed in the output. Debug output indicates that no active psm2 device is found.

$ FI_LOG_LEVEL=debug fi_info
...
libfabric:psm2:core:psmx2_init_lib():236<info> PSM2 header version = (2, 1)
libfabric:psm2:core:psmx2_init_lib():238<info> PSM2 library version = (2, 1)
libfabric:psm2:core:psmx2_init_lib():241<info> PSM2 multi-ep feature enabled.
libfabric:psm2:core:psmx2_update_hfi_info():338<warn> Failed to read number of free contexts from HFI unit 0
libfabric:psm2:core:psmx2_update_hfi_info():379<info> hfi1 units: total 1, active 0; hfi1 contexts: total 0, free 0
libfabric:psm2:core:psmx2_update_hfi_info():390<info> Tx/Rx contexts: 0 in total, 0 available.
libfabric:psm2:core:psmx2_getinfo():436<info> no PSM2 device is active.
libfabric:core:core:fi_getinfo_():751<warn> fi_getinfo: provider psm2 returned -61 (No data available)
...

I have found two independent workarounds for this problem; either one is sufficient:

  1. Set HFI_SYSFS_PATH to the renamed device directory, e.g. HFI_SYSFS_PATH=/sys/class/infiniband/opap129s fi_info. The "129" portion of the value must match the PCI bus of the HFI card.
  2. Modify /lib/udev/rules.d/60-rdma-persistent-naming.rules to contain ACTION=="add", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_KERNEL", which keeps the kernel-assigned name (hfi1_0).
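For workaround 1, the renamed directory can be looked up rather than typed by hand. The following is a minimal sketch, not something libpsm2 provides: find_hfi1_sysfs is a hypothetical helper name, and it assumes that each entry under /sys/class/infiniband has a device/driver symlink that resolves to a directory named after the bound kernel driver (hfi1 here). I have not verified that assumption on every kernel version.

```shell
# Hypothetical helper: print every RDMA device directory bound to the hfi1 driver.
# Assumption: <dev>/device/driver is a symlink whose target's basename is the
# driver name, as is usual for PCI devices in sysfs.
find_hfi1_sysfs() {
    fhs_root="${1:-/sys/class/infiniband}"
    for fhs_dev in "$fhs_root"/*; do
        # Skips entries without a driver link, and the unexpanded glob
        # when the directory is missing or empty.
        [ -e "$fhs_dev/device/driver" ] || continue
        fhs_drv=$(basename "$(readlink -f "$fhs_dev/device/driver")")
        if [ "$fhs_drv" = "hfi1" ]; then
            echo "$fhs_dev"
        fi
    done
    return 0
}
```

With that function defined, HFI_SYSFS_PATH="$(find_hfi1_sysfs | head -n 1)" fi_info selects the first hfi1 device regardless of how it was renamed. Matching on the bound driver instead of the device name is the same idea a fix in libpsm2 itself could use.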

While workarounds exist, libpsm2 itself should handle the new default RDMA device naming scheme. opa_sysfs.c:sysfs_init() looks like the place to start.
