Releases: openucx/ucx
v1.19.1-rc1
1.19.1 (Sep 18, 2025)
Features:
UCP
- Do not require transport memory support if rendezvous protocol is not used
Build
- Added CUDA 13 support to the release pipeline
v1.19.0
1.19.0 (August 6, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
v1.19.0-rc2
1.19.0 (June 18, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
v1.19.0-rc1
1.19.0 (June 18, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
v1.18.1
1.18.1 (April 28, 2025)
Features:
CUDA
- Added config keys to update cuda_copy bandwidth for coherent platforms
- Improved cache invalidation of memory allocated using CUDA memory pool
AZP
- Added Ubuntu 24.04 to build and release pipeline
Bugfixes:
UCP
- Fixed assertion failure when maximum lane fragment is smaller than AM header
- Fixed potential active message user header use after free with protocol reconfiguration
CUDA
- Fixed registration of CUDA Fabric memory allocated by UCT
- Fixed VA recycling check of memory allocated using VMM and CUDA memory pool
RDMA CORE (IB, ROCE, etc.)
- Do not use ConnectX-8 SMI subdevices for communication
- Fixed remote access error by disabling ODP when the device supports DDP
- Fixed configuration logic by disabling DDP when AR is disabled
UCM
- Fixed crash with bistro hooks for CUDA 12.9 on amd64
v1.18.1 RC3
1.18.1-rc3 (April 17, 2025)
Bugfixes:
UCM
- Fixed crash with bistro hooks for CUDA 12.9 on amd64
v1.18.1 RC2
1.18.1-rc2 (April 9, 2025)
Features:
CUDA
- Added config keys to update cuda_copy bandwidth for coherent platforms
- Improved cache invalidation of memory allocated using CUDA memory pool
Bugfixes:
UCP
- Fixed assertion failure when maximum lane fragment is smaller than AM header
CUDA
- Fixed registration of CUDA Fabric memory allocated by UCT
- Fixed VA recycling check of memory allocated using VMM and CUDA memory pool
RDMA CORE (IB, ROCE, etc.)
- Do not use ConnectX-8 SMI subdevices for communication
- Fixed remote access error by disabling ODP when the device supports DDP
- Fixed configuration logic by disabling DDP when AR is disabled
v1.18.1 RC1
1.18.1-rc1 (February 20, 2025)
Features:
AZP
- Added Ubuntu 24.04 to build and release pipeline
Bugfixes:
UCP
- Fixed potential active message user header use after free with protocol reconfiguration
v1.18.0
1.18.0 (January 17, 2025)
Features:
UCP
- Enabled using CUDA staging buffers for pipeline protocols by default
- Added endpoint reconfiguration support for non-reused p2p scenarios
- Enabled non-cacheable memory domains, activated for gdr_copy
- Added user_data parameter to ucp_ep_query
- Added support for host memory pipeline through CUDA buffers for rendezvous protocol
- Added global VA infrastructure and memory region in absence of error handling
- Made protocol performance node names more informative
- Enforced always running on the same thread in single thread mode
- Multiple improvements in protocols selection infrastructure
- Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
- Allowed up to 64 endpoint lanes for systems with many transports or devices
- Added usage tracker to worker
- Improved various logging messages
RDMA CORE (IB, ROCE, etc.)
- Added environment variable to manage DC initiator capacity
- Added DC dcs_hybrid policy
- Reduced MLX5/DV stack size consumption
- Added ODP support for verbs and mlx5dv
- Added support of CUDA managed memory on IB when ODP is available
- Added support of Adaptive Routing on RoCE
- Enabled use of implicit ODP with relaxed ordering
- Improved GPU-Direct detection in IB transport
- Increased DC initiator default count to 32 for performance optimization
- Added ConnectX-8 device support with DDP
- Added support for subnet filter list for RoCE interfaces
- Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
- Added IB MLX5 as a separate UCX module with separate RPM sub-package
- Added initial support for GGA transport, for fast DPU memory access
- Set IB DevX atomic mode based on device capabilities
- Removed DC keepalive mechanism, since the keepalive is done on UCP layer
- Optimized cross-gVMI memory registration using indirect memory keys cache
- Improved various logging messages
CUDA
- Added multi-node NVLink support
- Added CUDA Fabric memory support with detection and allocation
- Improved gdr_copy latency estimations on AMD Milan systems
- Added check for gdr_copy runtime/build version mismatch
- Added handling missing IPC capability when unpacking keys
- Added caching for CUDA IPC memory pool import operation
- Added gdr_copy variables to optimize performance on Grace Hopper systems
- Improved CUDA IPC concurrency for a larger count of reachable peers
UCS
- Added support for wildcards in configuration parameter names
- Added ASAN protection to several internal data structures
- Reduced stack usage in topology detection code
- Improved bitmaps configuration parsing with wider bitfield
- Added options to set topology distance between devices
- Optimized VFS unix socket watch by using user private folder
- Added general IP subnet matching infrastructure
- Extended array data structure to support user-provided array copy routine
- Improved time units description
UCM
- Extended CUDA memory hooks to include memory mapping APIs
Tools
- Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
- Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
- Improved ucx_perftest uni-directional test with added fence
- Detailed ucx_perftest batch section of command-line documentation
Documentation
- Added a section regarding adaptive routing on RoCE
Architecture
- Added CPU Model for MI300A
- Added Fujitsu ARM specific values to ucx.conf
- Added AMD Turin support
- Added an optimized non-temporal memory copy implementation for AMD CPU
Build
- Improved compiler error reporting with added flag
- Improved coverity script to allow faster turnaround time
- Improved Intel Compiler detection and support
GO
- Added multi-send flag and user memh support in request params
Packaging
- Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
Bugfixes:
UCP
- Fixed stack overflow in exported rkey unpack
- Removed extra remote-cpu overhead from protocol estimation for zcopy
- Fixed performance estimation for rndv pipeline protocols
- Fixed ATP sending by picking the correct lane
- Fixed missing reg_id on memh creation
- Fixed repeated invalidations by retaining existing access flags
- Fixed abort reason propagation for rendezvous RTR mtype
- Do not check transport availability if it is disabled by UCX_TLS environment variable
- Fixed wrong flag being used for checking BCOPY capability
- Fixed sending too many ATPs for small messages
- Enforced 16 bits size for Active Messages identifiers
- Fixed unnecessary status check for emulated AMO
- Fixed more than one fragment sending in rendezvous pipeline
- Fixed crash by using biggest max frag across all lanes
- Fixed missing memory handle flags by copying from parent to child
- Fixed worker interface activate count
- Fixed flush requests by replacing ATP/flush lane map with lane indexes
- Fixed lost uct_flags when merging memory regions
UCT
- Fixed memory domain UCT flags description
RDMA CORE (IB, ROCE, etc.)
- Fixed FETCH_ADD remote access error for ODP/KSM case
- Fixed missing conditional compilation checks for DM
- Fixed IB MD allocation naming typo
- Fixed invalid GIDs filter in IB
- Fixed flags usage in MLX5 zcopy_post
- Do not limit ODP registration retries
- Fixed JUCX failures by considering the number of supported completion vectors
CUDA
- Fixed async memory handling using CUDA memory type on Grace
- Added rcache overhead in performance estimation
- Fixed gdr_copy performance regression by providing maximum estimation between get and put
- Fixed CUDA IPC reachability check
- Fixed crash in MPI_Finalize when CUDA context is destroyed
- Always require rcache by default for gdr_copy
- Fixed crash in gdr_copy cleanup when registration cache is disabled
- Fixed CUDA copy memory domain allocations
- Fixed multiple tests for gdr_copy transport
- Fixed race condition in CUDA IPC peer accessible cache
UCS
- Fixed a crash by using heap allocation to process expired timers in batch
- Fixed allocation issue on memtrack dump
- Fixed deletion of the monitored folder in VFS
- Fixed unsafe resize for DC initiator array
- Fixed function macro invocation to match C standard
- Fixed calling async handler on already released resource
- Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
- Fixed undeclared value error in timer conversion routine
- Fixed uninitialized value access in registration cache
UCM
- Fixed race condition in parsing proc maps
- Fixed mremap failure while parsing /proc/self/maps
ROCM
- Fixed ROCM interface reachability test
- Fixed memory domain fork test
TCP
- Always bind endpoint to interface
Tools
- Fixed buffer size potential overflow in ucx_perftest
- Fixed missing address when packing memory keys on ucx_perftest
- Fixed memory leak for endpoint report in ucx_info
- Fixed build without openmp in ucx_perftest
- Fixed UCT device override on server side on ucx_perftest
Build
- Fixed using correct ASAN version for running tests
Configuration
- Used POSIX Bourne shell syntax to check equality
- Fixed build failure by using proper flags in compiler.m4
- Fixed perftest MAD support default guessing
GO
- Added serialized thread mode to avoid subtle races between threads
- Fixed make distcheck
v1.18.0 RC3
1.18.0-rc3 (December 23, 2024)
Features:
UCP
- Enabled using CUDA staging buffers for pipeline protocols by default
- Added endpoint reconfiguration support for non-reused p2p scenarios
- Enabled non-cacheable memory domains, activated for gdr_copy
- Added user_data parameter to ucp_ep_query
- Added support for host memory pipeline through CUDA buffers for rendezvous protocol
- Added global VA infrastructure and memory region in absence of error handling
- Made protocol performance node names more informative
- Enforced always running on the same thread in single thread mode
- Multiple improvements in protocols selection infrastructure
- Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
- Allowed up to 64 endpoint lanes for systems with many transports or devices
- Added usage tracker to worker
- Improved various logging messages
RDMA CORE (IB, ROCE, etc.)
- Added environment variable to manage DC initiator capacity
- Added DC dcs_hybrid policy
- Reduced MLX5/DV stack size consumption
- Added ODP support for verbs and mlx5dv
- Added support of CUDA managed memory on IB when ODP is available
- Added support of Adaptive Routing on RoCE
- Enabled use of implicit ODP with relaxed ordering
- Improved GPU-Direct detection in IB transport
- Increased DC initiator default count to 32 for performance optimization
- Added ConnectX-8 device support with DDP
- Added support for subnet filter list for RoCE interfaces
- Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
- Added IB MLX5 as a separate UCX module with separate RPM sub-package
- Added initial support for GGA transport, for fast DPU memory access
- Set IB DevX atomic mode based on device capabilities
- Removed DC keepalive mechanism, since the keepalive is done on UCP layer
- Optimized cross-gVMI memory registration using indirect memory keys cache
- Improved various logging messages
CUDA
- Added multi-node NVLink support
- Added CUDA Fabric memory support with detection and allocation
- Improved gdr_copy latency estimations on AMD Milan systems
- Added check for gdr_copy runtime/build version mismatch
- Added handling missing IPC capability when unpacking keys
- Added caching for CUDA IPC memory pool import operation
- Added gdr_copy variables to optimize performance on Grace Hopper systems
- Improved CUDA IPC concurrency for a larger count of reachable peers
UCS
- Added support for wildcards in configuration parameter names
- Added ASAN protection to several internal data structures
- Reduced stack usage in topology detection code
- Improved bitmaps configuration parsing with wider bitfield
- Added options to set topology distance between devices
- Optimized VFS unix socket watch by using user private folder
- Added general IP subnet matching infrastructure
- Extended array data structure to support user-provided array copy routine
- Improved time units description
UCM
- Extended CUDA memory hooks to include memory mapping APIs
Tools
- Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
- Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
- Improved ucx_perftest uni-directional test with added fence
- Detailed ucx_perftest batch section of command-line documentation
Documentation
- Added a section regarding adaptive routing on RoCE
Architecture
- Added CPU Model for MI300A
- Added Fujitsu ARM specific values to ucx.conf
- Added AMD Turin support
- Added an optimized non-temporal memory copy implementation for AMD CPU
Build
- Improved compiler error reporting with added flag
- Improved coverity script to allow faster turnaround time
- Improved Intel Compiler detection and support
GO
- Added multi-send flag and user memh support in request params
Packaging
- Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
Bugfixes:
UCP
- Fixed stack overflow in exported rkey unpack
- Removed extra remote-cpu overhead from protocol estimation for zcopy
- Fixed performance estimation for rndv pipeline protocols
- Fixed ATP sending by picking the correct lane
- Fixed missing reg_id on memh creation
- Fixed repeated invalidations by retaining existing access flags
- Fixed abort reason propagation for rendezvous RTR mtype
- Do not check transport availability if it is disabled by UCX_TLS environment variable
- Fixed wrong flag being used for checking BCOPY capability
- Fixed sending too many ATPs for small messages
- Enforced 16 bits size for Active Messages identifiers
- Fixed unnecessary status check for emulated AMO
- Fixed more than one fragment sending in rendezvous pipeline
- Fixed crash by using biggest max frag across all lanes
- Fixed missing memory handle flags by copying from parent to child
- Fixed worker interface activate count
- Fixed flush requests by replacing ATP/flush lane map with lane indexes
- Fixed lost uct_flags when merging memory regions
UCT
- Fixed memory domain UCT flags description
RDMA CORE (IB, ROCE, etc.)
- Fixed FETCH_ADD remote access error for ODP/KSM case
- Fixed missing conditional compilation checks for DM
- Fixed IB MD allocation naming typo
- Fixed invalid GIDs filter in IB
- Fixed flags usage in MLX5 zcopy_post
- Do not limit ODP registration retries
- Fixed JUCX failures by considering the number of supported completion vectors
CUDA
- Fixed async memory handling using CUDA memory type on Grace
- Added rcache overhead in performance estimation
- Fixed gdr_copy performance regression by providing maximum estimation between get and put
- Fixed CUDA IPC reachability check
- Fixed crash in MPI_Finalize when CUDA context is destroyed
- Always require rcache by default for gdr_copy
- Fixed crash in gdr_copy cleanup when registration cache is disabled
- Fixed CUDA copy memory domain allocations
- Fixed multiple tests for gdr_copy transport
- Fixed race condition in CUDA IPC peer accessible cache
UCS
- Fixed a crash by using heap allocation to process expired timers in batch
- Fixed allocation issue on memtrack dump
- Fixed deletion of the monitored folder in VFS
- Fixed unsafe resize for DC initiator array
- Fixed function macro invocation to match C standard
- Fixed calling async handler on already released resource
- Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
- Fixed undeclared value error in timer conversion routine
- Fixed uninitialized value access in registration cache
UCM
- Fixed race condition in parsing proc maps
- Fixed mremap failure while parsing /proc/self/maps
ROCM
- Fixed ROCM interface reachability test
- Fixed memory domain fork test
TCP
- Always bind endpoint to interface
Tools
- Fixed buffer size potential overflow in ucx_perftest
- Fixed missing address when packing memory keys on ucx_perftest
- Fixed memory leak for endpoint report in ucx_info
- Fixed build without openmp in ucx_perftest
- Fixed UCT device override on server side on ucx_perftest
Build
- Fixed using correct ASAN version for running tests
Configuration
- Used POSIX Bourne shell syntax to check equality
- Fixed build failure by using proper flags in compiler.m4
- Fixed perftest MAD support default guessing
GO
- Added serialized thread mode to avoid subtle races between threads
- Fixed make distcheck