Releases · openucx/ucx

21 Sep 13:36

amastbaum

v1.19.1-rc1

41180bd

v1.19.1-rc1 Pre-release

1.19.1 (Sep 18, 2025)

Features:

UCP

Do not require transport memory support if rendezvous protocol is not used

Build

Added CUDA 13 support to the release pipeline

06 Aug 12:23

amastbaum

v1.19.0

e463614

v1.19.0 Latest

1.19.0 (August 6, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation
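
The name filter above can be driven from a script. The following is a minimal sketch, not a documented recipe: it assumes that `-F` can be combined with `-c` (the ucx_info option that prints the configuration), and the filter string `RNDV` is only an illustrative value. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_ucx_info_cmd(name_filter, show_config=True):
    """Return an argv list for ucx_info with the -F name filter applied."""
    cmd = ["ucx_info"]
    if show_config:
        cmd.append("-c")  # -c: print the UCX configuration
    # -F 'str': name filter described in the 1.19.0 notes.
    # Combining it with -c is an assumption of this sketch.
    cmd += ["-F", name_filter]
    return cmd


def run_ucx_info(name_filter):
    """Run ucx_info if it is installed; return its stdout, or None otherwise."""
    if shutil.which("ucx_info") is None:
        return None
    result = subprocess.run(build_ucx_info_cmd(name_filter),
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    print(build_ucx_info_cmd("RNDV"))  # "RNDV" is an illustrative filter
    output = run_ucx_info("RNDV")
    if output is not None:
        print(output)
```

Building the argument list separately from running it keeps the sketch usable on machines where UCX is not installed.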

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag
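
The `UCX_TLS=^ib` setting mentioned above is an environment variable, so it can be prepared for a child process before launching a UCX application. A minimal sketch follows; the helper name is hypothetical, and the comma-separated form for excluding several transports is an assumption about the list syntax.

```python
import os


def ucx_env_excluding(transports, base_env=None):
    """Return a copy of the environment with UCX_TLS excluding the given transports."""
    env = dict(os.environ if base_env is None else base_env)
    # A leading '^' negates the list: use everything except these transports.
    env["UCX_TLS"] = "^" + ",".join(transports)
    return env


if __name__ == "__main__":
    env = ucx_env_excluding(["ib"])
    print(env["UCX_TLS"])
    # Pass env=env to subprocess.run([...]) when launching a UCX application.
```

Returning a copy rather than mutating `os.environ` keeps the setting scoped to the process being launched.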

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

Assets 23

ucx-1.19.0-1.el7.src.rpm
sha256:9f35542c06b4355ff5a3fb92c6803a97a1c188e86ab20422ce3dc3e4c277068c
3.24 MB, 2025-08-06T12:08:08Z

ucx-1.19.0-1.el8.src.rpm
sha256:a95393c5eabb4b6ebabbfa399e5a0c4464509f2780129e934facf0f73b8bd508
3.28 MB, 2025-08-06T12:08:02Z

ucx-1.19.0-centos7-mofed5-cuda11-x86_64.tar.bz2
sha256:7dba3d8c1eaa3c5c91591caf313f6206afb133ec115b77a48672dc9f8e4a8856
7.46 MB, 2025-08-06T12:14:10Z

ucx-1.19.0-centos7-mofed5-cuda12-x86_64.tar.bz2
sha256:54c03428564188901360a832d9f3cff79addd80623de320b28f36e150d2a93ce
7.47 MB, 2025-08-06T12:14:17Z

ucx-1.19.0-centos8-mofed5-cuda11-aarch64.tar.bz2
sha256:35370106151cd67fdd902ae6bb0fc7c94a55d6fe5f3ba9a62b59141b08dcd2db
8.72 MB, 2025-08-06T12:13:35Z

ucx-1.19.0-centos8-mofed5-cuda11-x86_64.tar.bz2
sha256:4a5a277cc2625a09018be1f351c4ab26991907e2ce5cdd79902374c5b5260a30
9.34 MB, 2025-08-06T12:15:06Z

ucx-1.19.0-ubuntu16.04-mofed5-cuda11-x86_64.tar.bz2
sha256:570be0b15369cfb63d0bc8d2fa6c2ace1275a4db233b64045d39c57d143a6df9
1.63 MB, 2025-08-06T12:17:42Z

ucx-1.19.0-ubuntu18.04-mofed5-cuda11-aarch64.tar.bz2
sha256:a8853242fa883ae060326e52ba8e5e7bb473b5a3d94e85b5bdc9fa89d83c48ec
1.46 MB, 2025-08-06T12:14:51Z

ucx-1.19.0-ubuntu18.04-mofed5-cuda11-x86_64.tar.bz2
sha256:086e10d6260af1ba95f6ac08d6734e297276b0d5a16a8a50f67a1f24e1e8837a
1.56 MB, 2025-08-06T12:15:42Z

ucx-1.19.0-ubuntu18.04-mofed5-cuda12-x86_64.tar.bz2
sha256:23b0a06c522011b26cca8546b385e0d42c0b310c463637dd97ecbc1b074245ce
1.56 MB, 2025-08-06T12:15:51Z

Source code (zip)
2025-08-05T17:10:05Z

Source code (tar.gz)
2025-08-05T17:10:05Z
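
The sha256 values listed with the v1.19.0 assets can be used to check a download. The sketch below is one way to do that with the Python standard library. The function names are hypothetical, and it assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published):
    """Compare a file against a 'sha256:<hex>' string as shown in the asset list."""
    expected = published.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of_file(path) == expected
```

For example, `matches_published_digest("ucx-1.19.0-1.el7.src.rpm", "sha256:9f35542c06b4355ff5a3fb92c6803a97a1c188e86ab20422ce3dc3e4c277068c")` should return True for an intact copy of that asset.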

22 Jul 08:17

amastbaum

v1.19.0-rc2

13ae265

v1.19.0-rc2 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

24 Jun 12:22

amastbaum

v1.19.0-rc1

71a4b63

v1.19.0-rc1 Pre-release

1.19.0 (June 18, 2025)

Features:

UCP

Enabled multi-GPU support within a single process
Added dynamic selection between strong and weak fences in RMA flush operations
Improved endpoint reconfiguration capabilities
Added All2All lane selection for multi-NIC-GPU systems
Improved rkey debug info when config cache limit is reached
Improved UCP protocol selection based on available memory types
Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
Improved RNDV performance with device-local staging buffers
Enabled error handling for RMA get_offload protocols

UCT

Defined uct_rkey_unpack_v2 API to support passing sys-dev

RDMA CORE (IB, ROCE, etc.)

Added SRD transport support in EFA with reordering, AM, and control operations
Removed XGVMI BF2 support (umem)
Removed device memory indirect key
Fixed VFS objects for DCIs and pools
Added routing table cache to the reachability check
Fixed strict order usage in IB auxiliary rkeys
Improved various init logging messages

CUDA

Added multi-context support for remote key unpacking to CUDA IPC
Added context switching aware resource management to CUDA IPC
Use buffer ID to detect VA recycling in CUDA IPC
Added support for allocating CUDA memory on specific system devices
Added multi-device support in CUDA copy
Improved protocol lane selection for GPU memory operations
Relaxed CUDA context requirements in CUDA copy
Added deadlock prevention in CUDA copy
Added support for address range detection for VMM
Enabled memory attributes query after switching CUDA GPU
Added multi-GPU send tests for CUDA transports
Removed host-to-host performance estimation from CUDA copy transport
Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
Improved various init logging messages

ROCM

Added control parameters for IPC handle cache and signal pool size
Optimized ROCm memory type detection with caching

UCS

Removed compilation warnings

Tools

Added name filter option (-F 'str') to ucx_info for config and feature dumps
Improved ucx_info input validation

Bugfixes:

UCP

Made UCX_TLS=^ib disable all transports including auxiliary
Fixed send request status handling
Fixed performance degradation in RNDV by optimizing md cache updates
Fixed protocol selection when first lane is filtered out by fragment size
Fixed rkey selection by using memory registration flag

UCT

RDMA CORE (IB, ROCE, etc.)

Improved reliability of DC transport by adding DCI validation and separating connection logic
Fixed segfault in DC fence operation

GPU (CUDA, ROCM)

Updated ROCm configuration for ROCm 6.3 compatibility
Fixed system device detection for CUDA async memory operations
Fixed legacy type detection during CUDA IPC mpack
Fixed CUDA IPC RMA operations by using correct context for local buffers

UCS

Use UCS function for counting leading zeros on x86 architecture
Fixed a compilation warning

Shared Memory

Fixed FIFO availability check for sm transport

Documentation

Fixed open-mpi clone instruction

Build

Fixed enum-int-mismatch warnings with GCC 15

28 Apr 16:20

tvegas1

v1.18.1

d9aa565

1.18.1 (April 28, 2025)

Features:

CUDA

Added config keys to update cuda_copy bandwidth for coherent platforms
Improved cache invalidation of memory allocated using CUDA memory pool

AZP

Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

Fixed assertion failure when maximum lane fragment is smaller than AM header
Fixed potential active message user header use after free with protocol reconfiguration

CUDA

Fixed registration of CUDA Fabric memory allocated by UCT
Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

Do not use ConnectX-8 SMI subdevices for communication
Fixed remote access error by disabling ODP when the device supports DDP
Fixed configuration logic by disabling DDP when AR is disabled

UCM

Fixed crash with bistro hooks for CUDA 12.9 on amd64

17 Apr 17:02

tvegas1

v1.18.1-rc3

938ffcd

v1.18.1 RC3 Pre-release

1.18.1-rc3 (April 17, 2025)

Bugfixes:

UCM

Fixed crash with bistro hooks for CUDA 12.9 on amd64

09 Apr 16:12

tvegas1

v1.18.1-rc2

81baeb1

v1.18.1 RC2 Pre-release

1.18.1-rc2 (April 9, 2025)

Features:

CUDA

Added config keys to update cuda_copy bandwidth for coherent platforms
Improved cache invalidation of memory allocated using CUDA memory pool

Bugfixes:

UCP

Fixed assertion failure when maximum lane fragment is smaller than AM header

CUDA

Fixed registration of CUDA Fabric memory allocated by UCT
Fixed VA recycling check of memory allocated using VMM and CUDA memory pool

RDMA CORE (IB, ROCE, etc.)

Do not use ConnectX-8 SMI subdevices for communication
Fixed remote access error by disabling ODP when the device supports DDP
Fixed configuration logic by disabling DDP when AR is disabled

21 Feb 22:58

tvegas1

v1.18.1-rc1

3ed7241

v1.18.1 RC1 Pre-release

1.18.1-rc1 (February 20, 2025)

Features:

AZP

Added Ubuntu 24.04 to build and release pipeline

Bugfixes:

UCP

Fixed potential active message user header use after free with protocol reconfiguration

21 Jan 09:39

tvegas1

v1.18.0

693d028

1.18.0 (January 17, 2025)

Features:

UCP

Enabled using CUDA staging buffers for pipeline protocols by default
Added endpoint reconfiguration support for non-reused p2p scenarios
Enabled non-cacheable memory domains, activated for gdr_copy
Added user_data parameter to ucp_ep_query
Added support for host memory pipeline through CUDA buffers for rendezvous protocol
Added global VA infrastructure and memory region in absence of error handling
Made protocol performance node names more informative
Enforced always running on the same thread in single thread mode
Multiple improvements in protocols selection infrastructure
Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
Allowed up to 64 endpoint lanes for systems with many transports or devices
Added usage tracker to worker
Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

Added environment variable to manage DC initiator capacity
Added DC dcs_hybrid policy
Reduced MLX5/DV stack size consumption
Added ODP support for verbs and mlx5dv
Added support for CUDA managed memory on IB when ODP is available
Added support for Adaptive Routing on RoCE
Enabled use of implicit ODP with relaxed ordering
Improved GPU-Direct detection in IB transport
Increased DC initiator default count to 32 for performance optimization
Added ConnectX-8 device support with DDP
Added support for subnet filter list for RoCE interfaces
Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
Added IB MLX5 as a separate UCX module with separate RPM sub-package
Added initial support for GGA transport, for fast DPU memory access
Set IB DevX atomic mode based on device capabilities
Removed DC keepalive mechanism, since the keepalive is done on UCP layer
Optimized cross-gVMI memory registration using indirect memory keys cache
Improved various logging messages
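
The subnet filter list for RoCE interfaces mentioned above rests on a simple idea: test whether an interface address falls inside any of a set of subnets. The sketch below illustrates that idea only. It is not UCX's implementation, the function name is hypothetical, and the notes do not show how the filter list is configured.

```python
import ipaddress


def address_allowed(address, subnets):
    """Return True if the address falls inside any subnet in the filter list."""
    ip = ipaddress.ip_address(address)
    # strict=False accepts entries whose host bits are set, e.g. 192.168.1.5/24.
    networks = [ipaddress.ip_network(s, strict=False) for s in subnets]
    # Compare only networks of the same IP version as the address.
    return any(ip.version == net.version and ip in net for net in networks)


if __name__ == "__main__":
    allowed = ["192.168.1.0/24", "fe80::/10"]  # illustrative subnets
    for candidate in ("192.168.1.10", "10.0.0.1", "fe80::1"):
        print(candidate, address_allowed(candidate, allowed))
```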

CUDA

Added multi-node NVlink support
Added CUDA Fabric memory support with detection and allocation
Improved gdr_copy latency estimations on AMD Milan systems
Added check for gdr_copy runtime/build version mismatch
Added handling of missing IPC capability when unpacking keys
Added caching for CUDA IPC memory pool import operation
Added gdr_copy variables to optimize performance on Grace Hopper systems
Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

Added support for wildcards in configuration parameter names
Added ASAN protection to several internal data structures
Reduced stack usage in topology detection code
Improved bitmaps configuration parsing with wider bitfield
Added options to set topology distance between devices
Optimized VFS unix socket watch by using user private folder
Added general IP subnet matching infrastructure
Extended array data structure to support user-provided array copy routine
Improved time units description
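
Wildcards in configuration parameter names, listed above, let one setting apply to several parameters at once. The following sketch shows the general technique with glob-style matching. That the wildcards are glob-style is an assumption, and the `EXAMPLE_*` names are hypothetical placeholders rather than real UCX parameters.

```python
from fnmatch import fnmatchcase


def expand_wildcard_settings(settings, known_names):
    """Apply settings whose keys may contain '*' to every matching known name."""
    resolved = {}
    for pattern, value in settings.items():
        for name in known_names:
            # fnmatchcase does case-sensitive glob matching ('*', '?', '[...]').
            if fnmatchcase(name, pattern):
                resolved[name] = value
    return resolved


if __name__ == "__main__":
    # Hypothetical parameter names, used only for illustration.
    names = ["EXAMPLE_A_QUEUE_LEN", "EXAMPLE_B_QUEUE_LEN", "EXAMPLE_A_TIMEOUT"]
    print(expand_wildcard_settings({"EXAMPLE_*_QUEUE_LEN": "256"}, names))
```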

UCM

Extended CUDA memory hooks to include memory mapping APIs

Tools

Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
Improved ucx_perftest uni-directional test with added fence
Detailed ucx_perftest batch section of command-line documentation

Documentation

Added a section regarding adaptive routing on RoCE

Architecture

Added CPU Model for MI300A
Added Fujitsu ARM specific values to ucx.conf
Added AMD Turin support
Added an optimized non-temporal memory copy implementation for AMD CPU

Build

Improved compiler error reporting with added flag
Improved coverity script to allow faster turnaround time
Improved Intel Compiler detection and support

GO

Added multi-send flag and user memh support in request params

Packaging

Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

Fixed stack overflow in exported rkey unpack
Removed extra remote-cpu overhead from protocol estimation for zcopy
Fixed performance estimation for rndv pipeline protocols
Fixed ATP sending by picking the correct lane
Fixed missing reg_id on memh creation
Fixed repeated invalidations by retaining existing access flags
Fixed abort reason propagation for rendezvous RTR mtype
Do not check transport availability if it is disabled by UCX_TLS environment variable
Fixed wrong flag being used for checking BCOPY capability
Fixed sending too many ATPs for small messages
Enforced 16-bit size for Active Message identifiers
Fixed unnecessary status check for emulated AMO
Fixed more than one fragment sending in rendezvous pipeline
Fixed crash by using biggest max frag across all lanes
Fixed missing memory handle flags by copying from parent to child
Fixed worker interface activate count
Fixed flush requests by replacing ATP/flush lane map with lane indexes
Fixed lost uct_flags when merging memory regions

UCT

Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

Fixed FETCH_ADD remote access error for ODP/KSM case
Fixed missing conditional compilation checks for DM
Fixed IB MD allocation naming typo
Fixed invalid GIDs filter in IB
Fixed flags usage in MLX5 zcopy_post
Do not limit ODP registration retries
Fixed JUCX failures by considering the number of supported completion vectors

CUDA

Fixed async memory handling using CUDA memory type on Grace
Added rcache overhead in performance estimation
Fixed gdr_copy performance regression by providing maximum estimation between get and put
Fixed CUDA IPC reachability check
Fixed crash in MPI_Finalize when CUDA context is destroyed
Always require rcache by default for gdr_copy
Fixed crash in gdr_copy cleanup when registration cache is disabled
Fixed CUDA copy memory domain allocations
Fixed multiple tests for gdr_copy transport
Fixed race condition in CUDA IPC peer accessible cache

UCS

Fixed a crash by using heap allocation to process expired timers in batch
Fixed allocation issue on memtrack dump
Fixed deletion of the monitored folder in VFS
Fixed unsafe resize for DC initiator array
Fixed function macro invocation to match C standard
Fixed calling async handler on already released resource
Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
Fixed undeclared value error in timer conversion routine
Fixed uninitialized value access in registration cache

UCM

Fixed race condition in parsing proc maps
Fixed mremap failure while parsing /proc/self/maps

ROCM

Fixed ROCM interface reachability test
Fixed memory domain fork test

TCP

Always bind endpoint to interface

Tools

Fixed buffer size potential overflow in ucx_perftest
Fixed missing address when packing memory keys on ucx_perftest
Fixed memory leak for endpoint report in ucx_info
Fixed build without openmp in ucx_perftest
Fixed UCT device override on server side on ucx_perftest

Build

Fixed using correct ASAN version for running tests

Configuration

Used POSIX Bourne shell syntax to check equality
Fixed build failure by using proper flags in compiler.m4
Fixed perftest MAD support default guessing

GO

Added serialized thread mode to avoid subtle races between threads
Fixed make distcheck

23 Dec 17:06

tvegas1

v1.18.0-rc3

9ce35d0

v1.18.0 RC3 Pre-release

1.18.0-rc3 (December 23, 2024)

Features:

UCP

Enabled using CUDA staging buffers for pipeline protocols by default
Added endpoint reconfiguration support for non-reused p2p scenarios
Enabled non-cacheable memory domains, activated for gdr_copy
Added user_data parameter to ucp_ep_query
Added support for host memory pipeline through CUDA buffers for rendezvous protocol
Added global VA infrastructure and memory region in absence of error handling
Made protocol performance node names more informative
Enforced always running on the same thread in single thread mode
Multiple improvements in protocols selection infrastructure
Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
Allowed up to 64 endpoint lanes for systems with many transports or devices
Added usage tracker to worker
Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

Added environment variable to manage DC initiator capacity
Added DC dcs_hybrid policy
Reduced MLX5/DV stack size consumption
Added ODP support for verbs and mlx5dv
Added support for CUDA managed memory on IB when ODP is available
Added support for Adaptive Routing on RoCE
Enabled use of implicit ODP with relaxed ordering
Improved GPU-Direct detection in IB transport
Increased DC initiator default count to 32 for performance optimization
Added ConnectX-8 device support with DDP
Added support for subnet filter list for RoCE interfaces
Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
Added IB MLX5 as a separate UCX module with separate RPM sub-package
Added initial support for GGA transport, for fast DPU memory access
Set IB DevX atomic mode based on device capabilities
Removed DC keepalive mechanism, since the keepalive is done on UCP layer
Optimized cross-gVMI memory registration using indirect memory keys cache
Improved various logging messages

CUDA

Added multi-node NVlink support
Added CUDA Fabric memory support with detection and allocation
Improved gdr_copy latency estimations on AMD Milan systems
Added check for gdr_copy runtime/build version mismatch
Added handling of missing IPC capability when unpacking keys
Added caching for CUDA IPC memory pool import operation
Added gdr_copy variables to optimize performance on Grace Hopper systems
Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

Added support for wildcards in configuration parameter names
Added ASAN protection to several internal data structures
Reduced stack usage in topology detection code
Improved bitmaps configuration parsing with wider bitfield
Added options to set topology distance between devices
Optimized VFS unix socket watch by using user private folder
Added general IP subnet matching infrastructure
Extended array data structure to support user-provided array copy routine
Improved time units description

UCM

Extended CUDA memory hooks to include memory mapping APIs

Tools

Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
Improved ucx_perftest uni-directional test with added fence
Detailed ucx_perftest batch section of command-line documentation

Documentation

Added a section regarding adaptive routing on RoCE

Architecture

Added CPU Model for MI300A
Added Fujitsu ARM specific values to ucx.conf
Added AMD Turin support
Added an optimized non-temporal memory copy implementation for AMD CPU

Build

Improved compiler error reporting with added flag
Improved coverity script to allow faster turnaround time
Improved Intel Compiler detection and support

GO

Added multi-send flag and user memh support in request params

Packaging

Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

Fixed stack overflow in exported rkey unpack
Removed extra remote-cpu overhead from protocol estimation for zcopy
Fixed performance estimation for rndv pipeline protocols
Fixed ATP sending by picking the correct lane
Fixed missing reg_id on memh creation
Fixed repeated invalidations by retaining existing access flags
Fixed abort reason propagation for rendezvous RTR mtype
Do not check transport availability if it is disabled by UCX_TLS environment variable
Fixed wrong flag being used for checking BCOPY capability
Fixed sending too many ATPs for small messages
Enforced 16-bit size for Active Message identifiers
Fixed unnecessary status check for emulated AMO
Fixed more than one fragment sending in rendezvous pipeline
Fixed crash by using biggest max frag across all lanes
Fixed missing memory handle flags by copying from parent to child
Fixed worker interface activate count
Fixed flush requests by replacing ATP/flush lane map with lane indexes
Fixed lost uct_flags when merging memory regions

UCT

Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

Fixed FETCH_ADD remote access error for ODP/KSM case
Fixed missing conditional compilation checks for DM
Fixed IB MD allocation naming typo
Fixed invalid GIDs filter in IB
Fixed flags usage in MLX5 zcopy_post
Do not limit ODP registration retries
Fixed JUCX failures by considering the number of supported completion vectors

CUDA

Fixed async memory handling using CUDA memory type on Grace
Added rcache overhead in performance estimation
Fixed gdr_copy performance regression by providing maximum estimation between get and put
Fixed CUDA IPC reachability check
Fixed crash in MPI_Finalize when CUDA context is destroyed
Always require rcache by default for gdr_copy
Fixed crash in gdr_copy cleanup when registration cache is disabled
Fixed CUDA copy memory domain allocations
Fixed multiple tests for gdr_copy transport
Fixed race condition in CUDA IPC peer accessible cache

UCS

Fixed a crash by using heap allocation to process expired timers in batch
Fixed allocation issue on memtrack dump
Fixed deletion of the monitored folder in VFS
Fixed unsafe resize for DC initiator array
Fixed function macro invocation to match C standard
Fixed calling async handler on already released resource
Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
Fixed undeclared value error in timer conversion routine
Fixed uninitialized value access in registration cache

UCM

Fixed race condition in parsing proc maps
Fixed mremap failure while parsing /proc/self/maps

ROCM

Fixed ROCM interface reachability test
Fixed memory domain fork test

TCP

Always bind endpoint to interface

Tools

Fixed buffer size potential overflow in ucx_perftest
Fixed missing address when packing memory keys on ucx_perftest
Fixed memory leak for endpoint report in ucx_info
Fixed build without openmp in ucx_perftest
Fixed UCT device override on server side on ucx_perftest

Build

Fixed using correct ASAN version for running tests

Configuration

Used POSIX Bourne shell syntax to check equality
Fixed build failure by using proper flags in compiler.m4
Fixed perftest MAD support default guessing

GO

Added serialized thread mode to avoid subtle races between threads
Fixed make distcheck
