Releases: openucx/ucx

v1.19.1-rc1

21 Sep 13:36
41180bd
v1.19.1-rc1 Pre-release

1.19.1 (Sep 18, 2025)

Features:

UCP

  • Do not require transport memory support if rendezvous protocol is not used

Build

  • Added CUDA 13 support to the release pipeline

v1.19.0

06 Aug 12:23
e463614

1.19.0 (August 6, 2025)

Features:

UCP

  • Enabled multi-GPU support within a single process
  • Added dynamic selection between strong and weak fences in RMA flush operations
  • Improved endpoint reconfiguration capabilities
  • Added All2All lane selection for multi-NIC-GPU systems
  • Improved rkey debug info when config cache limit is reached
  • Improved UCP protocol selection based on available memory types
  • Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
  • Improved RNDV performance with device-local staging buffers
  • Enabled error handling for RMA get_offload protocols

UCT

  • Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

  • Added SRD transport support in EFA with reordering, AM, and control operations
  • Removed XGVMI BF2 support (umem)
  • Removed device memory indirect key
  • Fixed VFS objects for DCIs and pools
  • Added routing table cache to the reachability check
  • Fixed strict order usage in IB auxiliary rkeys
  • Improved various init logging messages

CUDA

  • Added multi-context support for remote key unpacking to CUDA IPC
  • Added context switching aware resource management to CUDA IPC
  • Use buffer ID to detect VA recycling in CUDA IPC
  • Added support for allocating CUDA memory on specific system devices
  • Added multi-device support in CUDA copy
  • Improved protocol lane selection for GPU memory operations
  • Relaxed CUDA context requirements in CUDA copy
  • Added deadlock prevention in CUDA copy
  • Added support for address range detection for VMM
  • Enabled memory attributes query after switching CUDA GPU
  • Added multi-GPU send tests for CUDA transports
  • Removed host-to-host performance estimation from CUDA copy transport
  • Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
  • Improved various init logging messages

ROCM

  • Added control parameters for IPC handle cache and signal pool size
  • Optimized ROCm memory type detection with caching

UCS

  • Removed compilation warnings

Tools

  • Added name filter option (-F 'str') to ucx_info for config and feature dumps
  • Improved ucx_info input validation

Bugfixes:

UCP

  • Made UCX_TLS=^ib disable all transports including auxiliary
  • Fixed send request status handling
  • Fixed performance degradation in RNDV by optimizing md cache updates
  • Fixed protocol selection when first lane is filtered out by fragment size
  • Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

  • Improved reliability of DC transport by adding DCI validation and separating connection logic
  • Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

  • Updated ROCm configuration for ROCm 6.3 compatibility
  • Fixed system device detection for CUDA async memory operations
  • Fixed legacy type detection during CUDA IPC mpack
  • Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

  • Use UCS function for counting leading zeros on x86 architecture
  • Fixed a compilation warning

Shared Memory

  • Fixed FIFO availability check for sm transport

Documentation

  • Fixed open-mpi clone instruction

Build

  • Fixed enum-int-mismatch warnings with GCC 15

v1.19.0-rc2

22 Jul 08:17
13ae265
v1.19.0-rc2 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

  • Enabled multi-GPU support within a single process
  • Added dynamic selection between strong and weak fences in RMA flush operations
  • Improved endpoint reconfiguration capabilities
  • Added All2All lane selection for multi-NIC-GPU systems
  • Improved rkey debug info when config cache limit is reached
  • Improved UCP protocol selection based on available memory types
  • Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
  • Improved RNDV performance with device-local staging buffers
  • Enabled error handling for RMA get_offload protocols

UCT

  • Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

  • Added SRD transport support in EFA with reordering, AM, and control operations
  • Removed XGVMI BF2 support (umem)
  • Removed device memory indirect key
  • Fixed VFS objects for DCIs and pools
  • Added routing table cache to the reachability check
  • Fixed strict order usage in IB auxiliary rkeys
  • Improved various init logging messages

CUDA

  • Added multi-context support for remote key unpacking to CUDA IPC
  • Added context switching aware resource management to CUDA IPC
  • Use buffer ID to detect VA recycling in CUDA IPC
  • Added support for allocating CUDA memory on specific system devices
  • Added multi-device support in CUDA copy
  • Improved protocol lane selection for GPU memory operations
  • Relaxed CUDA context requirements in CUDA copy
  • Added deadlock prevention in CUDA copy
  • Added support for address range detection for VMM
  • Enabled memory attributes query after switching CUDA GPU
  • Added multi-GPU send tests for CUDA transports
  • Removed host-to-host performance estimation from CUDA copy transport
  • Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
  • Improved various init logging messages

ROCM

  • Added control parameters for IPC handle cache and signal pool size
  • Optimized ROCm memory type detection with caching

UCS

  • Removed compilation warnings

Tools

  • Added name filter option (-F 'str') to ucx_info for config and feature dumps
  • Improved ucx_info input validation

Bugfixes:

UCP

  • Made UCX_TLS=^ib disable all transports including auxiliary
  • Fixed send request status handling
  • Fixed performance degradation in RNDV by optimizing md cache updates
  • Fixed protocol selection when first lane is filtered out by fragment size
  • Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

  • Improved reliability of DC transport by adding DCI validation and separating connection logic
  • Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

  • Updated ROCm configuration for ROCm 6.3 compatibility
  • Fixed system device detection for CUDA async memory operations
  • Fixed legacy type detection during CUDA IPC mpack
  • Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

  • Use UCS function for counting leading zeros on x86 architecture
  • Fixed a compilation warning

Shared Memory

  • Fixed FIFO availability check for sm transport

Documentation

  • Fixed open-mpi clone instruction

Build

  • Fixed enum-int-mismatch warnings with GCC 15

v1.19.0-rc1

24 Jun 12:22
71a4b63
v1.19.0-rc1 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

  • Enabled multi-GPU support within a single process
  • Added dynamic selection between strong and weak fences in RMA flush operations
  • Improved endpoint reconfiguration capabilities
  • Added All2All lane selection for multi-NIC-GPU systems
  • Improved rkey debug info when config cache limit is reached
  • Improved UCP protocol selection based on available memory types
  • Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
  • Improved RNDV performance with device-local staging buffers
  • Enabled error handling for RMA get_offload protocols

UCT

  • Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

  • Added SRD transport support in EFA with reordering, AM, and control operations
  • Removed XGVMI BF2 support (umem)
  • Removed device memory indirect key
  • Fixed VFS objects for DCIs and pools
  • Added routing table cache to the reachability check
  • Fixed strict order usage in IB auxiliary rkeys
  • Improved various init logging messages

CUDA

  • Added multi-context support for remote key unpacking to CUDA IPC
  • Added context switching aware resource management to CUDA IPC
  • Use buffer ID to detect VA recycling in CUDA IPC
  • Added support for allocating CUDA memory on specific system devices
  • Added multi-device support in CUDA copy
  • Improved protocol lane selection for GPU memory operations
  • Relaxed CUDA context requirements in CUDA copy
  • Added deadlock prevention in CUDA copy
  • Added support for address range detection for VMM
  • Enabled memory attributes query after switching CUDA GPU
  • Added multi-GPU send tests for CUDA transports
  • Removed host-to-host performance estimation from CUDA copy transport
  • Replaced cuCtxCreate by cuDevicePrimaryCtxRetain
  • Improved various init logging messages

ROCM

  • Added control parameters for IPC handle cache and signal pool size
  • Optimized ROCm memory type detection with caching

UCS

  • Removed compilation warnings

Tools

  • Added name filter option (-F 'str') to ucx_info for config and feature dumps
  • Improved ucx_info input validation

Bugfixes:

UCP

  • Made UCX_TLS=^ib disable all transports including auxiliary
  • Fixed send request status handling
  • Fixed performance degradation in RNDV by optimizing md cache updates
  • Fixed protocol selection when first lane is filtered out by fragment size
  • Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

  • Improved reliability of DC transport by adding DCI validation and separating connection logic
  • Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

  • Updated ROCm configuration for ROCm 6.3 compatibility
  • Fixed system device detection for CUDA async memory operations
  • Fixed legacy type detection during CUDA IPC mpack
  • Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

  • Use UCS function for counting leading zeros on x86 architecture
  • Fixed a compilation warning

Shared Memory

  • Fixed FIFO availability check for sm transport

Documentation

  • Fixed open-mpi clone instruction

Build

  • Fixed enum-int-mismatch warnings with GCC 15

v1.18.1

28 Apr 16:20
d9aa565

1.18.1 (April 28, 2025)

Features:

CUDA

  • Added config keys to update cuda_copy bandwidth for coherent platforms
  • Improved cache invalidation of memory allocated using CUDA memory pool

AZP

  • Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

  • Fixed assertion failure when maximum lane fragment is smaller than AM header
  • Fixed potential active message user header use after free with protocol reconfiguration

CUDA

  • Fixed registration of CUDA Fabric memory allocated by UCT
  • Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

  • Do not use ConnectX-8 SMI subdevices for communication
  • Fixed remote access error by disabling ODP when the device supports DDP
  • Fixed configuration logic by disabling DDP when AR is disabled

UCM

  • Fixed crash with bistro hooks for CUDA 12.9 on amd64

v1.18.1 RC3

17 Apr 17:02
938ffcd
v1.18.1 RC3 Pre-release

1.18.1-rc3 (April 17, 2025)

Bugfixes:

UCM

  • Fixed crash with bistro hooks for CUDA 12.9 on amd64

v1.18.1 RC2

09 Apr 16:12
81baeb1
v1.18.1 RC2 Pre-release

1.18.1-rc2 (April 9, 2025)

Features:

CUDA

  • Added config keys to update cuda_copy bandwidth for coherent platforms
  • Improved cache invalidation of memory allocated using CUDA memory pool

Bugfixes:

UCP

  • Fixed assertion failure when maximum lane fragment is smaller than AM header

CUDA

  • Fixed registration of CUDA Fabric memory allocated by UCT
  • Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

  • Do not use ConnectX-8 SMI subdevices for communication
  • Fixed remote access error by disabling ODP when the device supports DDP
  • Fixed configuration logic by disabling DDP when AR is disabled

v1.18.1 RC1

21 Feb 22:58
3ed7241
v1.18.1 RC1 Pre-release

1.18.1-rc1 (February 20, 2025)

Features:

AZP

  • Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

  • Fixed potential active message user header use after free with protocol reconfiguration

v1.18.0

21 Jan 09:39
693d028

1.18.0 (January 17, 2025)

Features:

UCP

  • Enabled using CUDA staging buffers for pipeline protocols by default
  • Added endpoint reconfiguration support for non-reused p2p scenarios
  • Enabled non-cacheable memory domains, activated for gdr_copy
  • Added user_data parameter to ucp_ep_query
  • Added support for host memory pipeline through CUDA buffers for rendezvous protocol
  • Added global VA infrastructure and memory region in absence of error handling
  • Made protocol performance node names more informative
  • Enforced always running on the same thread in single thread mode
  • Multiple improvements in protocols selection infrastructure
  • Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
  • Allowed up to 64 endpoint lanes for systems with many transports or devices
  • Added usage tracker to worker
  • Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

  • Added environment variable to manage DC initiator capacity
  • Added DC dcs_hybrid policy
  • Reduced MLX5/DV stack size consumption
  • Added ODP support for verbs and mlx5dv
  • Added support of CUDA managed memory on IB when ODP is available
  • Added support of Adaptive Routing on RoCE
  • Enabled use of implicit ODP with relaxed ordering
  • Improved GPU-Direct detection in IB transport
  • Increased DC initiator default count to 32 for performance optimization
  • Added ConnectX-8 device support with DDP
  • Added support for subnet filter list for RoCE interfaces
  • Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
  • Added IB MLX5 as a separate UCX module with separate RPM sub-package
  • Added initial support for GGA transport, for fast DPU memory access
  • Set IB DevX atomic mode based on device capabilities
  • Removed DC keepalive mechanism, since the keepalive is done on UCP layer
  • Optimized cross-gVMI memory registration using indirect memory keys cache
  • Improved various logging messages
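As a rough illustration of what a subnet filter list for RoCE interfaces does conceptually, the sketch below keeps only the interface addresses that fall inside one of the allowed subnets. It uses Python's standard `ipaddress` module, not UCX code. The function name `filter_by_subnets` and the sample addresses are assumptions made for this example; the actual UCX configuration variable and matching rules are not shown here.

```python
import ipaddress

def filter_by_subnets(addresses, subnets):
    """Return the addresses that belong to at least one allowed subnet.

    Hypothetical helper for illustration only; not part of UCX.
    """
    # strict=False tolerates subnets written with host bits set.
    networks = [ipaddress.ip_network(s, strict=False) for s in subnets]
    kept = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed-version lists are handled without extra checks.
        if any(ip in net for net in networks):
            kept.append(addr)
    return kept

allowed = ["192.168.10.0/24", "fe80::/10"]
candidates = ["192.168.10.7", "192.168.20.7", "fe80::1", "10.0.0.1"]
print(filter_by_subnets(candidates, allowed))
# prints ['192.168.10.7', 'fe80::1']
```

Addresses outside every listed subnet are simply dropped, which is the behaviour one would expect from an allow-list style filter.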

CUDA

  • Added multi-node NVLink support
  • Added CUDA Fabric memory support with detection and allocation
  • Improved gdr_copy latency estimations on AMD Milan systems
  • Added check for gdr_copy runtime/build version mismatch
  • Added handling missing IPC capability when unpacking keys
  • Added caching for CUDA IPC memory pool import operation
  • Added gdr_copy variables to optimize performance on Grace Hopper systems
  • Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

  • Added support for wildcards in configuration parameter names
  • Added ASAN protection to several internal data structures
  • Reduced stack usage in topology detection code
  • Improved bitmaps configuration parsing with wider bitfield
  • Added options to set topology distance between devices
  • Optimized VFS unix socket watch by using user private folder
  • Added general IP subnet matching infrastructure
  • Extended array data structure to support user-provided array copy routine
  • Improved time units description
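To make the idea of wildcards in configuration parameter names concrete, here is a minimal sketch of how one override pattern could update several parameters at once. It uses shell-style matching from Python's `fnmatch` module as a stand-in; the exact wildcard syntax UCX accepts is not stated in these notes. The `DEMO_*` parameter names and the `apply_overrides` helper are hypothetical and do not correspond to real UCX variables.

```python
import fnmatch

def apply_overrides(config, overrides):
    """Return a copy of config with wildcard overrides applied.

    Hypothetical helper for illustration only; not part of UCX.
    """
    result = dict(config)
    for pattern, value in overrides.items():
        for name in config:
            # fnmatchcase keeps matching case-sensitive on every platform.
            if fnmatch.fnmatchcase(name, pattern):
                result[name] = value
    return result

# Hypothetical parameter names, chosen only for the example.
config = {
    "DEMO_RC_QUEUE_LEN": 64,
    "DEMO_UD_QUEUE_LEN": 64,
    "DEMO_LOG_LEVEL": "warn",
}
updated = apply_overrides(config, {"DEMO_*_QUEUE_LEN": 256})
print(updated)
```

A single pattern updates both queue-length entries and leaves the unrelated key untouched, which is the convenience wildcard names are meant to provide.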

UCM

  • Extended CUDA memory hooks to include memory mapping APIs

Tools

  • Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
  • Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
  • Improved ucx_perftest uni-directional test with added fence
  • Detailed ucx_perftest batch section of command-line documentation

Documentation

  • Added a section regarding adaptive routing on RoCE

Architecture

  • Added CPU Model for MI300A
  • Added Fujitsu ARM specific values to ucx.conf
  • Added AMD Turin support
  • Added an optimized non-temporal memory copy implementation for AMD CPU

Build

  • Improved compiler error reporting with added flag
  • Improved coverity script to allow faster turnaround time
  • Improved Intel Compiler detection and support

GO

  • Added multi-send flag and user memh support in request params

Packaging

  • Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

  • Fixed stack overflow in exported rkey unpack
  • Removed extra remote-cpu overhead from protocol estimation for zcopy
  • Fixed performance estimation for rndv pipeline protocols
  • Fixed ATP sending by picking the correct lane
  • Fixed missing reg_id on memh creation
  • Fixed repeated invalidations by retaining existing access flags
  • Fixed abort reason propagation for rendezvous RTR mtype
  • Do not check transport availability if it is disabled by UCX_TLS environment variable
  • Fixed wrong flag being used for checking BCOPY capability
  • Fixed sending too many ATPs for small messages
  • Enforced 16-bit size for Active Message identifiers
  • Fixed unnecessary status check for emulated AMO
  • Fixed more than one fragment sending in rendezvous pipeline
  • Fixed crash by using biggest max frag across all lanes
  • Fixed missing memory handle flags by copying from parent to child
  • Fixed worker interface activate count
  • Fixed flush requests by replacing ATP/flush lane map with lane indexes
  • Fixed lost uct_flags when merging memory regions

UCT

  • Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

  • Fixed FETCH_ADD remote access error for ODP/KSM case
  • Fixed missing conditional compilation checks for DM
  • Fixed IB MD allocation naming typo
  • Fixed invalid GIDs filter in IB
  • Fixed flags usage in MLX5 zcopy_post
  • Do not limit ODP registration retries
  • Fixed JUCX failures by considering the number of supported completion vectors

CUDA

  • Fixed async memory handling using CUDA memory type on Grace
  • Added rcache overhead in performance estimation
  • Fixed gdr_copy performance regression by providing maximum estimation between get and put
  • Fixed CUDA IPC reachability check
  • Fixed crash in MPI_Finalize when CUDA context is destroyed
  • Always require rcache by default for gdr_copy
  • Fixed crash in gdr_copy cleanup when registration cache is disabled
  • Fixed CUDA copy memory domain allocations
  • Fixed multiple tests for gdr_copy transport
  • Fixed race condition in CUDA IPC peer accessible cache

UCS

  • Fixed a crash by using heap allocation to process expired timers in batch
  • Fixed allocation issue on memtrack dump
  • Fixed deletion of the monitored folder in VFS
  • Fixed unsafe resize for DC initiator array
  • Fixed function macro invocation to match C standard
  • Fixed calling async handler on already released resource
  • Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
  • Fixed undeclared value error in timer conversion routine
  • Fixed uninitialized value access in registration cache

UCM

  • Fixed race condition in parsing proc maps
  • Fixed mremap failure while parsing /proc/self/maps

ROCM

  • Fixed ROCM interface reachability test
  • Fixed memory domain fork test

TCP

  • Always bind endpoint to interface

Tools

  • Fixed buffer size potential overflow in ucx_perftest
  • Fixed missing address when packing memory keys on ucx_perftest
  • Fixed memory leak for endpoint report in ucx_info
  • Fixed build without openmp in ucx_perftest
  • Fixed UCT device override on server side on ucx_perftest

Build

  • Fixed using correct ASAN version for running tests

Configuration

  • Used POSIX Bourne shell syntax to check equality
  • Fixed build failure by using proper flags in compiler.m4
  • Fixed perftest MAD support default guessing

GO

  • Added serialized thread mode to avoid subtle races between threads
  • Fixed make distcheck

v1.18.0 RC3

23 Dec 17:06
9ce35d0
v1.18.0 RC3 Pre-release

1.18.0-rc3 (December 23, 2024)

Features:

UCP

  • Enabled using CUDA staging buffers for pipeline protocols by default
  • Added endpoint reconfiguration support for non-reused p2p scenarios
  • Enabled non-cacheable memory domains, activated for gdr_copy
  • Added user_data parameter to ucp_ep_query
  • Added support for host memory pipeline through CUDA buffers for rendezvous protocol
  • Added global VA infrastructure and memory region in absence of error handling
  • Made protocol performance node names more informative
  • Enforced always running on the same thread in single thread mode
  • Multiple improvements in protocols selection infrastructure
  • Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
  • Allowed up to 64 endpoint lanes for systems with many transports or devices
  • Added usage tracker to worker
  • Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

  • Added environment variable to manage DC initiator capacity
  • Added DC dcs_hybrid policy
  • Reduced MLX5/DV stack size consumption
  • Added ODP support for verbs and mlx5dv
  • Added support of CUDA managed memory on IB when ODP is available
  • Added support of Adaptive Routing on RoCE
  • Enabled use of implicit ODP with relaxed ordering
  • Improved GPU-Direct detection in IB transport
  • Increased DC initiator default count to 32 for performance optimization
  • Added ConnectX-8 device support with DDP
  • Added support for subnet filter list for RoCE interfaces
  • Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
  • Added IB MLX5 as a separate UCX module with separate RPM sub-package
  • Added initial support for GGA transport, for fast DPU memory access
  • Set IB DevX atomic mode based on device capabilities
  • Removed DC keepalive mechanism, since the keepalive is done on UCP layer
  • Optimized cross-gVMI memory registration using indirect memory keys cache
  • Improved various logging messages

CUDA

  • Added multi-node NVLink support
  • Added CUDA Fabric memory support with detection and allocation
  • Improved gdr_copy latency estimations on AMD Milan systems
  • Added check for gdr_copy runtime/build version mismatch
  • Added handling missing IPC capability when unpacking keys
  • Added caching for CUDA IPC memory pool import operation
  • Added gdr_copy variables to optimize performance on Grace Hopper systems
  • Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

  • Added support for wildcards in configuration parameter names
  • Added ASAN protection to several internal data structures
  • Reduced stack usage in topology detection code
  • Improved bitmaps configuration parsing with wider bitfield
  • Added options to set topology distance between devices
  • Optimized VFS unix socket watch by using user private folder
  • Added general IP subnet matching infrastructure
  • Extended array data structure to support user-provided array copy routine
  • Improved time units description

UCM

  • Extended CUDA memory hooks to include memory mapping APIs

Tools

  • Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
  • Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
  • Improved ucx_perftest uni-directional test with added fence
  • Detailed ucx_perftest batch section of command-line documentation

Documentation

  • Added a section regarding adaptive routing on RoCE

Architecture

  • Added CPU Model for MI300A
  • Added Fujitsu ARM specific values to ucx.conf
  • Added AMD Turin support
  • Added an optimized non-temporal memory copy implementation for AMD CPU

Build

  • Improved compiler error reporting with added flag
  • Improved coverity script to allow faster turnaround time
  • Improved Intel Compiler detection and support

GO

  • Added multi-send flag and user memh support in request params

Packaging

  • Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

  • Fixed stack overflow in exported rkey unpack
  • Removed extra remote-cpu overhead from protocol estimation for zcopy
  • Fixed performance estimation for rndv pipeline protocols
  • Fixed ATP sending by picking the correct lane
  • Fixed missing reg_id on memh creation
  • Fixed repeated invalidations by retaining existing access flags
  • Fixed abort reason propagation for rendezvous RTR mtype
  • Do not check transport availability if it is disabled by UCX_TLS environment variable
  • Fixed wrong flag being used for checking BCOPY capability
  • Fixed sending too many ATPs for small messages
  • Enforced 16-bit size for Active Message identifiers
  • Fixed unnecessary status check for emulated AMO
  • Fixed more than one fragment sending in rendezvous pipeline
  • Fixed crash by using biggest max frag across all lanes
  • Fixed missing memory handle flags by copying from parent to child
  • Fixed worker interface activate count
  • Fixed flush requests by replacing ATP/flush lane map with lane indexes
  • Fixed lost uct_flags when merging memory regions

UCT

  • Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

  • Fixed FETCH_ADD remote access error for ODP/KSM case
  • Fixed missing conditional compilation checks for DM
  • Fixed IB MD allocation naming typo
  • Fixed invalid GIDs filter in IB
  • Fixed flags usage in MLX5 zcopy_post
  • Do not limit ODP registration retries
  • Fixed JUCX failures by considering the number of supported completion vectors

CUDA

  • Fixed async memory handling using CUDA memory type on Grace
  • Added rcache overhead in performance estimation
  • Fixed gdr_copy performance regression by providing maximum estimation between get and put
  • Fixed CUDA IPC reachability check
  • Fixed crash in MPI_Finalize when CUDA context is destroyed
  • Always require rcache by default for gdr_copy
  • Fixed crash in gdr_copy cleanup when registration cache is disabled
  • Fixed CUDA copy memory domain allocations
  • Fixed multiple tests for gdr_copy transport
  • Fixed race condition in CUDA IPC peer accessible cache

UCS

  • Fixed a crash by using heap allocation to process expired timers in batch
  • Fixed allocation issue on memtrack dump
  • Fixed deletion of the monitored folder in VFS
  • Fixed unsafe resize for DC initiator array
  • Fixed function macro invocation to match C standard
  • Fixed calling async handler on already released resource
  • Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
  • Fixed undeclared value error in timer conversion routine
  • Fixed uninitialized value access in registration cache

UCM

  • Fixed race condition in parsing proc maps
  • Fixed mremap failure while parsing /proc/self/maps

ROCM

  • Fixed ROCM interface reachability test
  • Fixed memory domain fork test

TCP

  • Always bind endpoint to interface

Tools

  • Fixed buffer size potential overflow in ucx_perftest
  • Fixed missing address when packing memory keys on ucx_perftest
  • Fixed memory leak for endpoint report in ucx_info
  • Fixed build without openmp in ucx_perftest
  • Fixed UCT device override on server side on ucx_perftest

Build

  • Fixed using correct ASAN version for running tests

Configuration

  • Used POSIX Bourne shell syntax to check equality
  • Fixed build failure by using proper flags in compiler.m4
  • Fixed perftest MAD support default guessing

GO

  • Added serialized thread mode to avoid subtle races between threads
  • Fixed make distcheck