Describe the bug
When running an MPI application with NVIDIA HPC-X (which bundles UCX, Open MPI, etc.) inside a Singularity container, intra-node communication fails with the following message:
cma_ep.c:88 process_vm_readv(pid=1723503 {0x14745e5ac800,61928}-->{0x150dba573e00,61928}) returned -1: Bad address
==== backtrace (tid:1723498) ====
0 0x0000000000003803 uct_cma_ep_tx_error() /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/scopy/cma/cma_ep.c:85
...
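The failure is in UCX's CMA transport (the process_vm_readv call above). One diagnostic I can try is to exclude CMA so UCX falls back to other shared-memory transports; UCX_TLS and its "^" negation syntax are standard UCX options, but the mpirun/singularity line below is only illustrative (rank count and binary name are placeholders), not my exact job command:
# illustrative only: exclude the CMA transport so UCX picks a different shared-memory path
mpirun -np 2 -x UCX_TLS=^cma singularity exec app.sif ./app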
Setup and versions
rpm -q rdma-core: rdma-core-58mlnx43-1.58203.x86_64
MLNX_OFED version (ofed_info -s): OFED-internal-5.8-2.0.3
HW information from ibstat:
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.39.2048
Hardware version: 0
Node GUID: 0x0800380001c2ec98
System image GUID: 0x0800380001c2ec98
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 907
LMC: 0
SM lid: 34
Capability mask: 0xa651e848
Port GUID: 0x0800380001c2ec98
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.39.2048
Hardware version: 0
Node GUID: 0x0800380001c2cb4a
System image GUID: 0x0800380001c2cb4a
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 845
LMC: 0
SM lid: 34
Capability mask: 0xa651e848
Port GUID: 0x0800380001c2cb4a
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4123
Number of ports: 1
Firmware version: 20.39.2048
Hardware version: 0
Node GUID: 0x0800380001c2cb48
System image GUID: 0x0800380001c2cb48
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 843
LMC: 0
SM lid: 34
Capability mask: 0xa651e848
Port GUID: 0x0800380001c2cb48
Link layer: InfiniBand
CA 'mlx5_3'
CA type: MT4123
Number of ports: 1
Firmware version: 20.39.2048
Hardware version: 0
Node GUID: 0x0800380001c2ec9a
System image GUID: 0x0800380001c2ec9a
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 911
LMC: 0
SM lid: 34
Capability mask: 0xa651e848
Port GUID: 0x0800380001c2ec9a
Link layer: InfiniBand
For GPU related issues:
GPU type: A100
CUDA driver version: 530.30.02
Check if peer-direct is loaded (lsmod|grep nv_peer_mem) and/or gdrcopy (lsmod|grep gdrdrv):
gdrdrv 24576 0
nvidia 54976512 257 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
Steps to Reproduce
singularity build app.sif app.def
Container definition file: app.def.txt
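To see which UCX build and transports end up inside the image, the bundled ucx_info can be run through the container (assuming ucx_info is on the PATH inside app.sif; the image layout is an assumption):
# inspect the UCX bundled in the image: build configuration and recognized transports/devices
singularity exec app.sif ucx_info -v
singularity exec app.sif ucx_info -d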
Additional information (depending on the issue)
Open MPI version: 4.1.7
Output of ucx_info -d (to show transports and devices recognized by UCX):
Note: If I instead use an NVIDIA container base image where UCX has also been configured --with-knem, then the application runs fine, as KNEM is chosen to do zero-copy shared-memory transfers between processes on the same node. The reason I am not currently using such an HPC-X configuration with KNEM enabled is that I could not find one where … are also enabled.
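For reference, the --with-knem switch mentioned above is a UCX configure option; a minimal sketch of such a build (the install prefixes and CUDA path are assumptions, and this is not the exact HPC-X build recipe) would look like:
# illustrative only: UCX configured with KNEM (and CUDA) support
./contrib/configure-release --prefix=/opt/ucx --with-knem=/opt/knem --with-cuda=/usr/local/cuda
make -j8 && make install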