Releases: NVIDIA/nvshmem
NVSHMEM 3.6.5-0
NVIDIA NVSHMEM 3.6.5 Release Notes
NVIDIA® NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.6.5 and earlier releases.
Key Features and Enhancements
The NVSHMEM release includes the following key features and enhancements:
- Features
- Added configuration file support (similar to NCCL) for easy and repeatable environment variable management.
- Added experimental NVSHMEM LTO-IR (Link-Time Optimization IR) library build option for improved device code optimization.
- Added enhanced user buffer registration with preferred address support via `nvshmemx_buffer_register_symmetric`.
- Added error code return values for tile API calls to improve error handling.
- Added multi-NIC support for the `libfabric` transport with round-robin NIC selection. The new environment variable `NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE` controls the maximum number of NICs used per PE.
- Improved version mismatch error messages to include detailed host and device library version information.
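The multi-NIC cap is an ordinary environment variable, so no code changes are needed. A minimal launch sketch follows; the value `2`, the launcher, and the application name are placeholders, not recommendations:

```shell
# Cap the libfabric transport at two NICs per PE (2 is an arbitrary example).
export NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE=2

# Launch as usual; srun and the application name are placeholders.
srun -n 8 ./my_nvshmem_app
```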
- Bug Fixes
- Fixed IBGDA to activate CST optimizations on supported architectures.
- Fixed LLVM Dead Store Elimination removing WQE stores in bitcode library builds.
- Fixed `NVSHMEM_TEAM_SHARED` initialization for imbalanced P2P-connected groups of PEs in MNNVL setups.
- Fixed `libfabric` transport compatibility issues between build and runtime versions.
- Fixed PMIx bootstrap to handle empty environment variables correctly.
- Fixed host RMA off-stream transport capability check.
The NVSHMEM4Py 0.3.0 release includes the following:
- Added CuTe DSL support for NVSHMEM4Py with device-side bindings for RMA, collective, AMO, and memory operations.
- Added device-side construction of peer and multicast tensors for CuTe DSL.
- Added helper functions to simplify NVSHMEM/CuTe DSL usage.
- Made Numba-CUDA an optional dependency and bumped minimum version to 0.25.
- Fixed peer/multimem buffer tracking assumptions for parent buffer cleanup.
Compatibility
NVSHMEM 3.6.5 has been tested with the following:
- CUDA Toolkit:
- 12.8
- 12.9
- 13.2
- CPUs:
- x86 processors
- NVIDIA Grace™ processors
- GPUs:
- NVIDIA Ampere A100
- NVIDIA Hopper™
- NVIDIA Blackwell
- NCCL 2.29.7
Limitations
- NVSHMEM is not compatible with the PMI client library on Cray systems, and must use the NVSHMEM internal PMI-2 client library.
  - You can launch jobs with the PMI bootstrap by specifying `--mpi=pmi2` to Slurm and `NVSHMEM_BOOTSTRAP_PMI=PMI-2`, or directly by using the MPI or SHMEM bootstraps.
  - You can also set PMI-2 as the default PMI by setting `NVSHMEM_DEFAULT_PMI2=1` when you build NVSHMEM.
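The Slurm launch path described above can be sketched as follows; the PE count and application name are placeholders:

```shell
# Select the PMI-2 bootstrap at run time and tell Slurm to provide PMI-2.
export NVSHMEM_BOOTSTRAP_PMI=PMI-2
srun --mpi=pmi2 -n 4 ./my_nvshmem_app
```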
- The `libfabric` transport currently does not support VMM, so you must disable VMM by setting `NVSHMEM_DISABLE_CUDA_VMM=1`.
- Systems with PCIe peer-to-peer communication must do one of the following:
- Provide InfiniBand to support NVSHMEM atomics API calls.
- Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
- `nvshmem_barrier*`, `nvshmem_quiet`, and `nvshmem_wait_until` only ensure ordering and visibility between the source and destination PEs. They do not ensure global ordering and visibility.
- When built with GDRCopy, and when using InfiniBand on versions of the 460 driver prior to 460.106.00, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the CUDA release 460 driver and in drivers with version 470 and later.
- IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
- NVSHMEM is not supported on Grace with Ada L40 platforms.
- NVSHMEM is not supported in virtualized environments (VM).
- User buffers registered with `nvshmemx_buffer_register_symmetric` lack support for the `libfabric` transport to perform GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
- When registering Extended GPU Memory (EGM) user buffers with `nvshmemx_buffer_register_symmetric`, the buffers on different PEs must belong to distinct CPU sockets within a node. You can achieve this by selecting GPUs on a different NUMA domain using the `CUDA_VISIBLE_DEVICES` environment variable.
- When using the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, you must ensure that the `libfabric` environment variable `FI_EFA_ENABLE_SHM_TRANSFER` is set to `0` before launching the application. While NVSHMEM sets this variable during initialization, it may be ignored by the EFA provider if it was already initialized by the launcher, for example when using `mpirun`.
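The EFA caveat above amounts to exporting the variable before the launcher can initialize `libfabric` itself. A sketch, assuming Open MPI's `mpirun` (the `-x` forwarding flag and application name stand in for your launcher's equivalent):

```shell
# Make sure the EFA provider sees this before libfabric is initialized
# by the launcher, not just by NVSHMEM's own init.
export NVSHMEM_LIBFABRIC_PROVIDER=EFA
export FI_EFA_ENABLE_SHM_TRANSFER=0
# -x forwards the variable to ranks under Open MPI; adjust for other launchers.
mpirun -np 4 -x FI_EFA_ENABLE_SHM_TRANSFER ./my_nvshmem_app
```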
Deprecated Features
- C++11 support is deprecated; NVSHMEM 3.7 will switch to C++17 to align with the CUDA 13 toolchain.
Known Issues
- The internal layout of RC-connected QPs changed starting in 3.5.21, causing ABI compatibility breakage when enabling IBGDA.
- Complex types, which are enabled by setting `NVSHMEM_COMPLEX_SUPPORT` at compile time, are not currently supported.
- When you enable UCX remote transport with `NVSHMEM_REMOTE_TRANSPORT=UCX`, you may observe a data mismatch when scaling to 32 PEs or more on the DGX-2 platform.
NVSHMEM 3.5.21-0
NVIDIA® NVSHMEM 3.5.21 Release Notes
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.5.21 and earlier releases.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
- Fixed a bug that was related to ABI compatibility breakage for the internal team structure.
NVSHMEM4Py release 0.2.2 includes the following:
- Removed an incorrect assumption that any NVSHMEM4Py-managed buffer will have at most one child buffer (peer or multicast).
Compatibility
NVSHMEM 3.5.21 has been tested with the following:
NCCL:
- 2.28.3
CUDA Toolkit:
- 12.4
- 12.9
- 13.0
- 13.1
CPUs:
- x86 and NVIDIA Grace™ processors
GPUs:
- NVIDIA Ampere A100
- NVIDIA Hopper™
- NVIDIA Blackwell
Limitations
Same as 3.5.19
Known Issues
- The internal layout of RC-connected QPs changed starting in 3.5.19, causing ABI compatibility breakage when enabling IBGDA.
NVSHMEM 3.5.19-1
NVIDIA® NVSHMEM 3.5.19 Release Notes
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.5.19 and earlier releases.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
- Added qpair-specific API calls (`nvshmemx_qp_*`) that provide RMA operations on specific queue pairs abstracted via `nvshmemx_qp_handle_t`.
- Added tile-granular RMA routines: `tile_put`, `tile_get`, and `tile_broadcast` API calls.
- Added LLVM bitcode library support for IBGDA.
- Added option to pass the CUDA device to `nvshmemx_init_attr` to set a device when using NVSHMEM.
- Added environment variable `NVSHMEM_MAX_PEER_STREAMS` to set the maximum number of CUDA streams per node.
- Renamed `tile_allreduce` to `tile_reduce` and `tile_reduce` to `tile_rooted_reduce` to align with other NVSHMEM collectives.
- Removed the static-only library `libnvshmem.a`. Link instead to `libnvshmem_host` and `libnvshmem_device`.
- Improved the EFA (`libfabric`) transport with multiple bug fixes and performance improvements for AWS environments.
- Updated the default number of used QPs from 4 to 8 for full bandwidth with data direct NICs.
- Changed the default `NVSHMEM_MAX_MEMORY_PER_GPU` from 128 GiB to 256 GiB.
- Improved `NVSHMEM_HCA_PREFIX` to accept `^` and updated the default value.
- Added CUTLASS support to tile API calls.
- Removed `realloc` and `alltoalls` declarations because these functions are not implemented in NVSHMEM.
- Updated the hydra installation script to install version 4.3.2.
- Improved error catching and reporting for initialization and synchronization routines.
- Fixed a race condition in barrier causing hangs on unordered networks.
- Fixed reduce test validation issues in cases when the PE count is not a power of 2.
- Fixed stream memory operations to use `cuStreamWriteValue` only for self-writes.
- Fixed the `nvshmem_calloc` implementation to account for the `count` argument.
- Fixed several minor bugs and memory leaks.
NVSHMEM4Py release 0.2.1 includes the following:
- Added NVSHMEM4Py device API calls. Using the Numba-CUDA DSL, you can write fused compute-comms kernels in Python. NVSHMEM4Py device API calls cover collectives, one-sided RMA, and atomic memory operations (AMOs).
- Added ability to allocate Fortran-memory-ordered arrays and tensors.
- Removed the requirement to explicitly set `LD_LIBRARY_PATH` to find NVSHMEM, by using cuda-pathfinder.
- Added support for multicast memory buffers and array/tensor wrappers.
- Numerous bug fixes and minor interoperability enhancements.
Compatibility
NVSHMEM 3.5.19 has been tested with the following:
CUDA Toolkit:
- 12.4
- 12.9
- 13.0
- 13.1
CPUs:
- x86 and NVIDIA Grace™ processors
GPUs:
- NVIDIA Ampere A100
- NVIDIA Hopper™
- NVIDIA Blackwell®
Limitations
- NVSHMEM is not compatible with the PMI client library on Cray systems, and must use the NVSHMEM internal PMI-2 client library.
  - You can launch jobs with the PMI bootstrap by specifying `--mpi=pmi2` to Slurm and `NVSHMEM_BOOTSTRAP_PMI=PMI-2`, or directly by using the MPI or SHMEM bootstraps.
  - You can also set PMI-2 as the default PMI by setting `NVSHMEM_DEFAULT_PMI2=1` when you build NVSHMEM.
- The `libfabric` transport currently does not support VMM, so you must disable VMM by setting `NVSHMEM_DISABLE_CUDA_VMM=1`.
- Systems with PCIe peer-to-peer communication must do one of the following:
- Provide InfiniBand to support NVSHMEM atomics API calls.
- Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
- `nvshmem_barrier*`, `nvshmem_quiet`, and `nvshmem_wait_until` only ensure ordering and visibility between the source and destination PEs. They do not ensure global ordering and visibility.
- When built with GDRCopy, and when using InfiniBand on versions of the 460 driver prior to 460.106.00, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the CUDA release 460 driver and in drivers with version 470 and later.
- IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
- NVSHMEM is not supported on Grace with Ada L40 platforms.
- NVSHMEM is not supported in virtualized environments (VM).
- User buffers registered with `nvshmemx_buffer_register_symmetric` lack support for the `libfabric` transport to perform GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
- When registering Extended GPU Memory (EGM) user buffers with `nvshmemx_buffer_register_symmetric`, the buffers on different PEs must belong to distinct CPU sockets within a node. This can be achieved by selecting GPUs on a different NUMA domain using the `CUDA_VISIBLE_DEVICES` environment variable.
- When using the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, you must ensure that the `libfabric` environment variable `FI_EFA_ENABLE_SHM_TRANSFER` is set to `0` before launching the application. While NVSHMEM sets this variable during initialization, it may be ignored by the EFA provider if it was already initialized by the launcher, for example when using `mpirun`.
- Due to the LLVM ecosystem's CUDA support matrix, the `libnvshmem_device.bc` bitcode library has only been qualified for use with CUDA toolkits with major version 12. Support for CUDA version 13 is experimental, and it is recommended that users build the NVSHMEM bitcode library from source with LLVM 22 for use with CUDA 13.
Deprecated Features
- Support for `libnvshmem.a` is now deprecated.
Known Issues
- Complex types, which are enabled by setting `NVSHMEM_COMPLEX_SUPPORT` at compile time, are not currently supported.
- When enabling the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, certain operations are experimental and may cause the application kernel to hang in the following operations:
  - Device-side `nvshmem_put`/`nvshmem_get` with `nvshmem_barrier`
  - Host-side `nvshmem_put_on_stream`/`nvshmem_get_on_stream`
- When you enable UCX remote transport with `NVSHMEM_REMOTE_TRANSPORT=UCX`, you may observe a data mismatch when scaling to 32 PEs or more on the DGX-2 platform.
NVSHMEM 3.4.5-0
NVIDIA® NVSHMEM 3.4.5 Release Notes
NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically-distributed memory.
The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.4.5 and earlier releases.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
- Added support for data direct NIC configurations in the IB transports. Added a new environment variable, `NVSHMEM_DISABLE_DATA_DIRECT`, to force-disable data direct NICs even when present.
- Added support for CPU-assisted IBGDA without the use of GDRCopy or the x86 regkey setting. Systems not supporting the other methods will automatically fall back to this new method. It enables the use of IBGDA on a broad range of systems without the need for administrator intervention.
- Added a new environment variable, `NVSHMEM_HCA_PREFIX`, to enable IB transports on systems which name their HCA devices in a non-standard way (for example, `ibp*` instead of `mlx5*`).
- Deprecated support for the combined `libnvshmem.a` host and device static library.
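The HCA-prefix workaround above is a single environment setting; a sketch, assuming Grace-style `ibp*` device names and a placeholder application:

```shell
# Tell the IB transports to match devices named ibp* instead of the
# default mlx5* naming scheme.
export NVSHMEM_HCA_PREFIX=ibp
./my_nvshmem_app   # placeholder application
```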
Compatibility
NVSHMEM 3.4.5 has been tested with the following:
CUDA Toolkit:
- 12.2
- 12.6
- 12.9
- 13.0
CPUs:
- x86 and NVIDIA Grace™ processors
GPUs:
- NVIDIA Ampere A100
- NVIDIA Hopper™
- NVIDIA Blackwell®
Limitations
- NVSHMEM is not compatible with the PMI client library on Cray systems, and must use the NVSHMEM internal PMI-2 client library.
  - You can launch jobs with the PMI bootstrap by specifying `--mpi=pmi2` to Slurm and `NVSHMEM_BOOTSTRAP_PMI=PMI-2`, or directly by using the MPI or SHMEM bootstraps.
  - You can also set PMI-2 as the default PMI by setting `NVSHMEM_DEFAULT_PMI2=1` when you build NVSHMEM.
- The `libfabric` transport does not support VMM yet, so you must disable VMM by setting `NVSHMEM_DISABLE_CUDA_VMM=1`.
- Systems with PCIe peer-to-peer communication require one of the following:
- InfiniBand to support NVSHMEM atomics APIs
- Using NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent
- `nvshmem_barrier*`, `nvshmem_quiet`, or `nvshmem_wait_until` only ensures ordering and visibility between the source and destination PEs and does not ensure global ordering and visibility.
- When built with GDRCopy, and when using InfiniBand on earlier versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in the CUDA release 460 driver and in release 470 and later.
- IBGDA does not work with CX-4 when the link layer is Ethernet (RoCE).
- NVSHMEM is not supported on Grace with Ada L40 platforms.
- NVSHMEM is not supported on virtualized environments (VM).
- User buffers registered with `nvshmemx_buffer_register_symmetric` lack support for the `libfabric` transport to perform GPU-GPU communication over remote networks (EFA, Slingshot, etc.).
- When registering Extended GPU Memory (EGM) user buffers with `nvshmemx_buffer_register_symmetric`, the buffers on different PEs must belong to distinct CPU sockets within a node. This can be achieved by selecting GPUs on a different NUMA domain using the `CUDA_VISIBLE_DEVICES` environment variable.
- When using the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, you must ensure that the `libfabric` environment variable `FI_EFA_ENABLE_SHM_TRANSFER` is set to `0` before launching the application. While NVSHMEM does set this variable during initialization, it can be ignored by the EFA provider if it was already initialized by the launcher, for example when using `mpirun`.
Deprecated Features
- Support for `libnvshmem.a` is now deprecated.
Known Issues
- Complex types, which are enabled by setting `NVSHMEM_COMPLEX_SUPPORT` at compile time, are not currently supported.
- When enabling the `libfabric` transport with `NVSHMEM_LIBFABRIC_PROVIDER=EFA`, certain operations are experimental and may cause the application kernel to hang in the following operations:
  - Device-side `nvshmem_put`/`nvshmem_get` with `nvshmem_barrier`
  - Host-side `nvshmem_put_on_stream`/`nvshmem_get_on_stream`
- When you enable UCX remote transport with `NVSHMEM_REMOTE_TRANSPORT=UCX`, you may observe a data mismatch when scaling to 32 PEs or more on the DGX-2 platform.