Intra-node communication fails when using Nvidia hpc-x in a Singularity container #10404

Open
mredenti opened this issue Jan 6, 2025 · 0 comments
mredenti commented Jan 6, 2025

Describe the bug

When running an MPI application with NVIDIA HPC-X (which bundles UCX, Open MPI, etc.) inside a Singularity container, intra-node communication fails with the following message:

cma_ep.c:88   process_vm_readv(pid=1723503 {0x14745e5ac800,61928}-->{0x150dba573e00,61928}) returned -1: Bad address
==== backtrace (tid:1723498) ====
 0 0x0000000000003803 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/scopy/cma/cma_ep.c:85
...

Steps to Reproduce

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2
#SBATCH --gres=gpu:4

module load openmpi/4.1.6--nvhpc--24.3

mpirun -np 8 singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2
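As a possible diagnostic (not part of the original reproducer), the cma transport can be excluded via the UCX_TLS environment variable so that intra-node traffic falls back to the sysv/posix shared-memory transports; passing it through mpirun's -x option is one way to get it into the container environment. This is a sketch only and has not been verified against this failure:

# Diagnostic sketch: UCX_TLS=^cma means "all transports except cma",
# so intra-node traffic should fall back to sysv/posix shared memory.
mpirun -np 8 -x UCX_TLS=^cma \
    singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2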
  • UCX version used
ucx_info -v 
# Library version: 1.17.0
# Library path: /opt/nvidia/hpc_sdk/Linux_x86_64/24.11/comm_libs/12.6/hpcx/hpcx-2.20/ucx/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 39c8f9b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.5.1/redhat8 --with-gdrcopy --prefix=/build-result/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx --with-bfd=/hpc/local/oss/binutils/2.37/redhat8

Setup and versions

  • OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/redhat-release
Red Hat Enterprise Linux release 8.7 (Ootpa)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core: rdma-core-58mlnx43-1.58203.x86_64
      • MLNX_OFED version (ofed_info -s): OFED-internal-5.8-2.0.3
    • HW information from ibstat or ibv_devinfo -vv command
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2ec98
        System image GUID: 0x0800380001c2ec98
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 907
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2ec98
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2cb4a
        System image GUID: 0x0800380001c2cb4a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 845
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2cb4a
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2cb48
        System image GUID: 0x0800380001c2cb48
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 843
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2cb48
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.39.2048
        Hardware version: 0
        Node GUID: 0x0800380001c2ec9a
        System image GUID: 0x0800380001c2ec9a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 911
                LMC: 0
                SM lid: 34
                Capability mask: 0xa651e848
                Port GUID: 0x0800380001c2ec9a
                Link layer: InfiniBand
  • For GPU related issues:
    • GPU type: A100
    • CUDA:
      • Driver version: 530.30.02
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv:
lsmod|grep gdrdrv
gdrdrv                 24576  0
nvidia              54976512  257 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset

Additional information (depending on the issue)

  • OpenMPI version: 4.1.7
  • Output of ucx_info -d to show transports and devices recognized by UCX
Singularity> ucx_info -d | grep Transport
#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: cuda_copy
#      Transport: cuda_ipc
#      Transport: gdr_copy
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
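To confirm which transport UCX actually selects for the intra-node endpoints at runtime (rather than just which transports are available), raising the UCX log level may help; this is an optional check, not something from the original report:

# Optional check: at info level UCX typically logs the endpoint transport
# configuration, which shows whether cma is being picked for intra-node traffic.
mpirun -np 8 -x UCX_LOG_LEVEL=info \
    singularity exec --nv --no-home app.sif Fall3d.x All Raikoke-2019.inp 2 2 2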

Note: If I instead use an NVIDIA container base image in which UCX has also been configured --with-knem, then the application runs fine, as knem is chosen for zero-copy shared-memory transfers between processes on the same node. The reason I am not currently using such an HPC-X configuration with knem enabled is that I could not find one where

#      Transport: cuda_copy
#      Transport: cuda_ipc
#      Transport: gdr_copy

are also enabled.
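For completeness, a minimal sketch of how UCX could be configured in a custom container image so that knem, CUDA, and GDRCopy support are all enabled at once; the install paths below are placeholders, not values from this report:

# Hypothetical UCX build inside the container image (sketch only):
#   --with-knem    enables the knem zero-copy shared-memory transport
#   --with-cuda    enables cuda_copy / cuda_ipc
#   --with-gdrcopy enables gdr_copy
# Adjust the paths to the image's actual knem and CUDA installations.
./configure --prefix=/opt/ucx \
            --with-knem=/opt/knem \
            --with-cuda=/usr/local/cuda \
            --with-gdrcopy
make -j"$(nproc)" && make install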

mredenti added the Bug label Jan 6, 2025