
UCT/CUDA_IPC: Use buffer id to detect VA recycling #10405

Open · Akshay-Venkatesh wants to merge 4 commits into master
Conversation

Akshay-Venkatesh (Contributor)

Why/What?

Export handles are used today for the VA recycling check, but they are not guaranteed to be unique for newer memory allocators such as VMM and CUDA memory pools. We need to switch to using the buffer_id attribute to perform this check.
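
As a rough illustration of what the buffer_id-based check relies on (a minimal sketch, not the PR code; the helper name is hypothetical):

```c
#include <cuda.h>

/* Sketch: CU_POINTER_ATTRIBUTE_BUFFER_ID is unique per allocation within a
 * process, so if a VA range is freed and a new allocation lands on the same
 * address, the id changes even though the pointer value does not. */
static int va_was_recycled(CUdeviceptr ptr, unsigned long long cached_buffer_id)
{
    unsigned long long buffer_id = 0;

    if (cuPointerGetAttribute(&buffer_id, CU_POINTER_ATTRIBUTE_BUFFER_ID,
                              ptr) != CUDA_SUCCESS) {
        return -1; /* attribute query failed */
    }

    return buffer_id != cached_buffer_id;
}
```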

@Akshay-Venkatesh (Contributor, Author)

@yosefe, @tvegas1 We got internal reports of VA recycling not being detected. This PR aims to address the issue for newer allocators. If the changes look good, is there any option to backport this to 1.18?

src/uct/cuda/cuda_ipc/cuda_ipc_md.h (resolved review thread)
arg1 = (const void *)&key->ph;
arg2 = (const void *)&region->key.ph;
#endif
if (memcmp(arg1, arg2, cmp_size) == 0) {
Contributor

Does it make sense to update the test to cover this behavior?

Comment on lines 496 to 504
#if HAVE_CUDA_FABRIC
cmp_size = sizeof(key->ph.buffer_id);
arg1 = (const void *)&key->ph.buffer_id;
arg2 = (const void *)&region->key.ph.buffer_id;
#else
cmp_size = sizeof(key->ph);
arg1 = (const void *)&key->ph;
arg2 = (const void *)&region->key.ph;
#endif
Contributor

  1. need to fix indent
  2. any reason to not compare buffer_id also for legacy, regardless of HAVE_CUDA_FABRIC?

Akshay-Venkatesh (Contributor, Author)

> any reason to not compare buffer_id also for legacy, regardless of HAVE_CUDA_FABRIC?

I had this to begin with, but key->ph is of type CUipcMemHandle when HAVE_CUDA_FABRIC is not defined, so we would need to change the type if we want to also include buffer_id.
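
Roughly, the two configurations look like the following (field names inferred from the diffs in this PR, not a verbatim copy of cuda_ipc_md.h; the typedef name is illustrative):

```c
#if HAVE_CUDA_FABRIC
/* packed handle carries the allocator type, the per-allocation buffer id
 * and, for mempool-backed memory, the exported pool */
typedef struct {
    int                handle_type; /* e.g. UCT_CUDA_IPC_KEY_HANDLE_TYPE_MEMPOOL */
    unsigned long long buffer_id;   /* CU_POINTER_ATTRIBUTE_BUFFER_ID value */
    CUmemoryPool       pool;
    /* ... exported handle data ... */
} uct_cuda_ipc_ph_t;
#else
/* legacy path: ph is just the CUDA IPC handle, with no buffer_id field */
typedef CUipcMemHandle uct_cuda_ipc_ph_t;
#endif
```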

src/uct/cuda/cuda_ipc/cuda_ipc_md.h (resolved review thread)
@tvegas1 (Contributor) commented Jan 7, 2025

> Export handles are used today for the VA recycling check, but they are not guaranteed to be unique for newer memory allocators such as VMM and CUDA memory pools. We need to switch to using the buffer_id attribute to perform this check.

This PR solves the case where CUDA reuses a pointer value (alloc/free/alloc) and we failed to detect that it is actually not the same allocation instance, right?
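
A minimal sketch of that scenario, assuming the driver hands back the same VA after free+alloc (illustrative only, the function name is hypothetical):

```c
#include <cuda.h>
#include <stdio.h>

/* Sketch of the alloc/free/alloc scenario: the VA can be reused while the
 * buffer id changes, since the id refers to the allocation instance. */
static void show_va_recycling(void)
{
    CUdeviceptr p1, p2;
    unsigned long long id1, id2;

    cuMemAlloc(&p1, 1 << 20);
    cuPointerGetAttribute(&id1, CU_POINTER_ATTRIBUTE_BUFFER_ID, p1);
    cuMemFree(p1);
    cuMemAlloc(&p2, 1 << 20);
    cuPointerGetAttribute(&id2, CU_POINTER_ATTRIBUTE_BUFFER_ID, p2);

    /* p2 may equal p1 (VA recycled), but id2 != id1: a cache keyed only on
     * the export handle may not notice this for VMM/mempool allocations */
    printf("same VA: %d, same buffer id: %d\n", p1 == p2, id1 == id2);
    cuMemFree(p2);
}
```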

@@ -124,7 +124,7 @@ static ucs_status_t uct_cuda_ipc_close_memhandle(uct_cuda_ipc_cache_region_t *re
                                           (CUdeviceptr)region->mapped_addr, region->key.b_len));
         }
     } else if (region->key.ph.handle_type == UCT_CUDA_IPC_KEY_HANDLE_TYPE_MEMPOOL) {
-        return UCT_CUDADRV_FUNC_LOG_WARN(cuMemPoolDestroy(region->key.ph.pool));
+        return UCT_CUDADRV_FUNC_LOG_WARN(cuMemFree((CUdeviceptr)region->mapped_addr));
Contributor

I did not understand why this change is needed. Were we previously erroneously freeing the mem pool?

Akshay-Venkatesh (Contributor, Author)

@tvegas1 It's not erroneous but excessive. If a suballocation from the same mempool is sent over in the future, we would benefit from reusing the already imported mempool, but uct_cuda_ipc_close_memhandle runs when VA recycling is detected. This change addresses the problem by deferring all imported mempool releases to the point of UCP context destruction. The drawback of this approach, as with the previous IPC handling, is that if the exporter explicitly destroys the mempool, the importer keeps a reference to it until context destruction. This needs to be handled separately for legacy memory and for newer allocations like VMM and mallocAsync in a different PR.
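
A minimal sketch of that deferral, with hypothetical helper names (the real release would hook into the cuda_ipc cache/MD teardown rather than a global list):

```c
#include <cuda.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: instead of destroying an imported mempool when a
 * cached region is evicted on VA recycling, remember it and destroy it only
 * when the context is torn down. */
typedef struct imported_pool {
    CUmemoryPool          pool;
    struct imported_pool *next;
} imported_pool_t;

static imported_pool_t *imported_pools; /* owned by the context */

static void remember_imported_pool(CUmemoryPool pool)
{
    imported_pool_t *elem = malloc(sizeof(*elem));

    if (elem == NULL) {
        return;
    }

    elem->pool     = pool;
    elem->next     = imported_pools;
    imported_pools = elem;
}

static void release_imported_pools(void)
{
    imported_pool_t *elem;

    while ((elem = imported_pools) != NULL) {
        imported_pools = elem->next;
        /* drops the importer-side reference taken at import time */
        cuMemPoolDestroy(elem->pool);
        free(elem);
    }
}
```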
