NVIDIA NVSHMEM 3.6.5 Release Notes

NVIDIA® NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.

The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.6.5 and earlier releases.

Key Features and Enhancements

The NVSHMEM release includes the following key features and enhancements:

Features
- Added configuration file support (similar to NCCL) for easy and repeatable environment variable management.
- Added experimental NVSHMEM LTO-IR (Link-Time Optimization IR) library build option for improved device code optimization.
- Added enhanced user buffer registration with preferred address support via nvshmemx_buffer_register_symmetric.
- Added error code return values for tile API calls to improve error handling.
- Added multi-NIC support for libfabric transport with round-robin NIC selection. The new environment variable NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE controls the maximum number of NICs used per PE.
- Improved version mismatch error messages to include detailed host and device library version information.
Bug Fixes
- Fixed IBGDA to activate CST optimizations on supported architectures.
- Fixed LLVM Dead Store Elimination removing WQE stores in bitcode library builds.
- Fixed NVSHMEM_TEAM_SHARED initialization for imbalanced P2P-connected groups of PEs in MNNVL setups.
- Fixed libfabric transport compatibility issues between build and runtime versions.
- Fixed PMIx bootstrap to handle empty environment variables correctly.
- Fixed host RMA off-stream transport capability check.

The NVSHMEM4Py 0.3.0 release includes the following:

- Added CuTe DSL support for NVSHMEM4Py with device-side bindings for RMA, collective, AMO, and memory operations.
- Added device-side construction of peer and multicast tensors for CuTe DSL.
- Added helper functions to simplify NVSHMEM/CuTe DSL usage.
- Made Numba-CUDA an optional dependency and bumped minimum version to 0.25.
- Fixed peer/multimem buffer tracking assumptions for parent buffer cleanup.

Compatibility

NVSHMEM 3.6.5 has been tested with the following:

CUDA Toolkit:
- 12.8
- 12.9
- 13.2
CPUs:
- x86 processors
- NVIDIA Grace™ processors
GPUs:
- NVIDIA Ampere A100
- NVIDIA Hopper™
- NVIDIA Blackwell
NCCL 2.29.7

Limitations

NVSHMEM is not compatible with the PMI client library on Cray systems, and must use the NVSHMEM internal PMI-2 client library.
- You can launch jobs with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.
- You can also set PMI-2 as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when you build NVSHMEM.
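
The Slurm launch described above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name `slurm_pmi2_launch` and the application path are hypothetical, and the helper only assembles the `srun` command line and environment from the settings named in this note without launching anything.

```python
import os


def slurm_pmi2_launch(app_argv, base_env=None):
    """Build an srun command and environment for the PMI-2 bootstrap.

    `app_argv` is the application command line (hypothetical here).
    Only --mpi=pmi2 and NVSHMEM_BOOTSTRAP_PMI=PMI-2 come from the
    release notes; everything else is illustrative.
    """
    # Copy the environment so the caller's mapping is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["NVSHMEM_BOOTSTRAP_PMI"] = "PMI-2"
    cmd = ["srun", "--mpi=pmi2", *app_argv]
    return cmd, env


# Example: pass the result to subprocess.run(cmd, env=env) on a Slurm system.
cmd, env = slurm_pmi2_launch(["./my_nvshmem_app"], base_env={})
```
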
The libfabric transport currently does not support VMM, so you must disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Systems with PCIe peer-to-peer communication must do one of the following:
- Provide InfiniBand to support NVSHMEM atomics API calls.
- Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and destination PEs. They do not ensure global ordering and visibility.
When built with GDRCopy, and when using InfiniBand on versions of the 460 driver prior to 460.106.00, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the 460 driver starting with version 460.106.00 and in drivers with version 470 and later.
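
As a rough aid for the driver limitation above, the following Python sketch checks whether a driver version string falls in the affected 460 range. The function name is hypothetical, and the check covers only the range this note states explicitly (460 branch, below 460.106.00). It says nothing about other driver branches.

```python
def in_affected_460_range(driver_version):
    """Return True if the version is a 460-branch driver older than 460.106.00.

    `driver_version` is a dotted string such as "460.32.03".
    Hypothetical helper: only the range named in the release notes is checked.
    """
    parts = tuple(int(p) for p in driver_version.split("."))
    # Pad to three components so "460.106" compares like "460.106.00".
    parts = parts + (0,) * (3 - len(parts))
    return parts[0] == 460 and parts < (460, 106, 0)


print(in_affected_460_range("460.32.03"))   # older 460 driver
print(in_affected_460_range("460.106.00"))  # first fixed 460 driver
```
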
IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
NVSHMEM is not supported on Grace with Ada L40 platforms.
NVSHMEM is not supported in virtualized environments (VM).
User buffers registered with nvshmemx_buffer_register_symmetric are not supported by the libfabric transport for GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
When registering Extended GPU memory (EGM) user buffers with nvshmemx_buffer_register_symmetric, the buffers on different PEs must belong to distinct CPU sockets within a node. You can achieve this by selecting GPUs on a different NUMA domain using the CUDA_VISIBLE_DEVICES environment variable.
When using the libfabric transport with NVSHMEM_LIBFABRIC_PROVIDER=EFA, you must ensure that the libfabric environment variable FI_EFA_ENABLE_SHM_TRANSFER is set to 0 before launching the application. Although NVSHMEM sets this variable during initialization, the EFA provider may ignore it if the launcher has already initialized the provider, for example when using mpirun.
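
The libfabric-related settings in this section can be collected before launch. The Python sketch below is a minimal illustration: the function name `libfabric_efa_env` is hypothetical, and it only prepares an environment mapping with the variables named in these notes. Whether a launcher forwards that environment to every PE depends on the launcher and is not covered here.

```python
import os


def libfabric_efa_env(base_env=None):
    """Return a copy of the environment with the EFA-related settings applied.

    Variable names and values are taken from the limitations listed above;
    the helper itself is illustrative and not part of NVSHMEM.
    """
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["NVSHMEM_LIBFABRIC_PROVIDER"] = "EFA"
    # The libfabric transport does not support VMM, so disable it.
    env["NVSHMEM_DISABLE_CUDA_VMM"] = "1"
    # Set before launch so the EFA provider sees it when it initializes.
    env["FI_EFA_ENABLE_SHM_TRANSFER"] = "0"
    return env


# Example: subprocess.run(launch_cmd, env=libfabric_efa_env()) with your own launch_cmd.
env = libfabric_efa_env({"PATH": "/usr/bin"})
```
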

Deprecated Features

C++11 support is deprecated. NVSHMEM 3.7 will switch to C++17 to align with the CUDA 13 toolchain.

Known Issues

The internal layout of RC-connected QPs changed starting in 3.5.21, causing ABI compatibility breakage when enabling IBGDA.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When you enable the UCX remote transport with NVSHMEM_REMOTE_TRANSPORT=UCX, you may observe a data mismatch when scaling to 32 or more PEs on the DGX-2 platform.
