Intra-node communication fails when using Nvidia hpc-x in a Singularity container #10404

Open
mredenti opened this issue Jan 6, 2025 · 0 comments
mredenti commented Jan 6, 2025

Describe the bug

When running an MPI application with NVIDIA HPC-X (which bundles UCX, Open MPI, etc.) inside a Singularity container, intra-node communication fails with the following message:

cma_ep.c:88   process_vm_readv(pid=1723503 {0x14745e5ac800,61928}-->{0x150dba573e00,61928}) returned -1: Bad address
==== backtrace (tid:1723498) ====
 0 0x0000000000003803 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/scopy/cma/cma_ep.c:85
...

Steps to Reproduce

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2
#SBATCH --gres=gpu:4

module load openmpi/4.1.6--nvhpc--24.3

mpirun -np 8 singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2
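As a possible diagnostic (not part of the original reproducer), the cma transport can be excluded via the UCX_TLS environment variable so that intra-node traffic falls back to the sysv/posix shared-memory transports; passing it through mpirun's -x option is one way to get it into the container environment. This is a sketch only and has not been verified against this failure:

# Diagnostic sketch: UCX_TLS=^cma means "all transports except cma",
# so intra-node traffic should fall back to sysv/posix shared memory.
mpirun -np 8 -x UCX_TLS=^cma \
    singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2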
  • UCX version used
ucx_info -v 
# Library version: 1.17.0
# Library path: /opt/nvidia/hpc_sdk/Linux_x86_64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ucx/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 39c8f9b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.5.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx --with-bfd=/hpc/local/oss/binutils/2.37/redhat8

Setup and versions

  • OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/redhat-release
Red Hat Enterprise Linux release 8.7 (Ootpa)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core: rdma-core-58mlnx43-1.58203.x86_64
      • MLNX_OFED version (ofed_info -s): OFED-internal-5.8-2.0.3
    • HW information from ibstat or ibv_devinfo -vv command
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2ec98
        System image GUID: 0x0800380001c2ec98
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 907
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2ec98
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2cb4a
        System image GUID: 0x0800380001c2cb4a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 845
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2cb4a
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2cb48
        System image GUID: 0x0800380001c2cb48
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 843
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2cb48
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2ec9a
        System image GUID: 0x0800380001c2ec9a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 911
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2ec9a
                Link layer: InfiniBand
  • For GPU related issues:
    • GPU type: A100
    • CUDA:
      • Driver version: 530.30.02
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv:
lsmod|grep gdrdrv
gdrdrv                 24576  0
nvidia              54976512  257 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset

Additional information (depending on the issue)

  • OpenMPI version: 4.1.7
  • Output of ucx_info -d to show transports and devices recognized by UCX
Singularity> ucx_info -d | grep Transport
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: cuda_copy
#      Transport: cuda_ipc
#      Transport: gdr_copy
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
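To confirm which transport UCX actually selects for the intra-node endpoints at runtime (rather than just which transports are available), raising the UCX log level may help; this is an optional check, not something from the original report:

# Optional check: at info level UCX typically logs the endpoint transport
# configuration, which shows whether cma is being picked for intra-node traffic.
mpirun -np 8 -x UCX_LOG_LEVEL=info \
    singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2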

Note: If I instead use an NVIDIA container base image in which UCX has also been configured --with-knem, then the application runs fine, as knem is chosen for zero-copy shared-memory transfers between processes on the same node. The reason I am not currently using such an HPC-X configuration with knem enabled is that I could not find one where

#      Transport: cuda_copy
#      Transport: cuda_ipc
#      Transport: gdr_copy

are also enabled.
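For completeness, a minimal sketch of how UCX could be configured in a custom container image so that knem, CUDA, and GDRCopy support are all enabled at once; the install paths below are placeholders, not values from this report:

# Hypothetical UCX build inside the container image (sketch only):
#   --with-knem    enables the knem zero-copy shared-memory transport
#   --with-cuda    enables cuda_copy / cuda_ipc
#   --with-gdrcopy enables gdr_copy
# Adjust the paths to the image's actual knem and CUDA installations.
./configure --prefix=/opt/ucx \
            --with-knem=/opt/knem \
            --with-cuda=/usr/local/cuda \
            --with-gdrcopy
make -j"$(nproc)" && make install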

mredenti added the Bug label Jan 6, 2025