[TRTRTX EP] Implement OrtExternalResourceImporter for D3D12-CUDA interop #26829
Introduces the OrtExternalResourceImporter API, enabling execution providers to import D3D12 shared resources and timeline fences for zero-copy GPU-to-GPU data sharing with ORT inference.

Public API additions:
- OrtExternalResourceImporter capability object
- OrtExternalMemoryHandle for imported D3D12 allocations
- OrtExternalSemaphoreHandle for imported D3D12 timeline fences
- SessionGetEpDeviceForOutputs to query output EP device placement
- RunOptions_SetSyncStream to associate a sync stream for async execution

EP Plugin API:
- OrtExternalResourceImporterImpl interface for EP implementations
- OrtEpFactory::CreateExternalResourceImporterForDevice extension

Design:
- No GPU virtual addresses in the public API
- EP-agnostic design allows any EP to implement import
- Capability discovery with explicit ORT_NOT_IMPLEMENTED
- Follows existing patterns (Allocator, DataTransfer, SyncStream)

Includes an example_plugin_ep mock implementation and autoep tests.
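The capability-discovery item in the design list can be pictured with a small mock. This is only a sketch under stated assumptions: `MockEpFactory`, `MockImporter`, `HandleType`, `StatusCode`, and `CanUseZeroCopy` are hypothetical stand-ins invented for illustration and are not the ONNX Runtime types or signatures. It shows a caller probing for an importer, treating a "not implemented" status as a normal answer, and only then asking whether a given handle type can be imported.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration; NOT the ONNX Runtime API.
enum class StatusCode { Ok, NotImplemented };
enum class HandleType { D3D12Resource, D3D12Heap, Unknown };

struct Status {
  StatusCode code;
  std::string message;
};

// Mock importer that advertises which memory handle types it accepts.
struct MockImporter {
  bool CanImportMemory(HandleType t) const {
    return t == HandleType::D3D12Resource || t == HandleType::D3D12Heap;
  }
};

// Mock factory: a device either yields an importer or reports NotImplemented.
struct MockEpFactory {
  bool supports_import;

  Status CreateImporter(std::optional<MockImporter>& out) const {
    if (!supports_import) {
      out.reset();
      return {StatusCode::NotImplemented, "external resource import not supported"};
    }
    out = MockImporter{};
    return {StatusCode::Ok, ""};
  }
};

// Caller-side pattern: probe first; a false result means "use a copy path".
inline bool CanUseZeroCopy(const MockEpFactory& factory, HandleType t) {
  std::optional<MockImporter> importer;
  const Status s = factory.CreateImporter(importer);
  if (s.code == StatusCode::NotImplemented) return false;
  assert(importer.has_value());  // Ok status implies an importer was produced
  return importer->CanImportMemory(t);
}
```

With this shape, an EP that lacks import support is distinguishable from one that supports import but rejects a particular handle type, and neither case needs to be treated as a hard error by the caller.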
This commit adds the OrtExternalResourceImporter implementation for the NvTensorRtRtx execution provider, enabling zero-copy D3D12 to CUDA memory sharing and GPU synchronization.

Implementation:
- NvTrtRtxExternalResourceImporterImpl: Full implementation of the OrtExternalResourceImporter interface using CUDA Driver APIs
- Memory import: cuImportExternalMemory for D3D12_RESOURCE and D3D12_HEAP
- Semaphore import: cuImportExternalSemaphore for D3D12_FENCE
- Tensor creation: CreateTensorFromMemory wraps imported CUDA device pointers
- Synchronization: WaitSemaphore/SignalSemaphore using cuWaitExternalSemaphoresAsync/cuSignalExternalSemaphoresAsync

Tests (nv_external_resource_importer_test.cc):
- CreateExternalResourceImporter: Basic importer creation
- CanImportMemoryCapabilities: D3D12 Resource/Heap capability queries
- CanImportSemaphoreCapabilities: D3D12 Fence capability queries
- ImportD3D12SharedResource: Memory import validation
- CreateTensorFromImportedMemory: Tensor creation with CUDA device ptr verification
- ImportD3D12Fence: Semaphore import validation
- WaitAndSignalSemaphore: Bidirectional D3D12-CUDA sync
- FullInferenceWithExternalMemory: E2E test with ReLU model verifying D3D12 upload -> CUDA inference -> D3D12 readback pipeline
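The bidirectional wait/signal ordering exercised by the tests above can be modelled on the CPU without any GPU. The following is a sketch only: `TimelineFenceModel` and `RunHandoff` are hypothetical names, no D3D12 or CUDA calls are made, and the model assumes fence values only ever increase. It illustrates the handoff order (upload signalled, inference waits on it, inference signalled, readback waits on it), not the driver behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// CPU-side model of a timeline fence. Hypothetical; not a D3D12 or CUDA type.
class TimelineFenceModel {
 public:
  // Raise the fence to `value`. Lower values are ignored, on the assumption
  // that timeline values only increase.
  void Signal(std::uint64_t value) {
    std::uint64_t current = value_.load();
    while (current < value && !value_.compare_exchange_weak(current, value)) {
      // compare_exchange_weak refreshed `current`; the loop re-checks it.
    }
  }

  // True once the fence has reached at least `value` (a wait would return).
  bool Reached(std::uint64_t value) const { return value_.load() >= value; }

  std::uint64_t Completed() const { return value_.load(); }

 private:
  std::atomic<std::uint64_t> value_{0};
};

// One upload -> inference -> readback handoff. Each consumer step is only
// valid after the fence value it waits on has been reached.
inline std::uint64_t RunHandoff(TimelineFenceModel& fence) {
  const std::uint64_t upload_done = fence.Completed() + 1;
  const std::uint64_t inference_done = upload_done + 1;

  fence.Signal(upload_done);               // producer side: upload finished
  assert(fence.Reached(upload_done));      // inference may start here
  fence.Signal(inference_done);            // consumer side: inference finished
  assert(fence.Reached(inference_done));   // readback may start here
  return inference_done;
}
```

Using one fence with two increasing values per frame keeps both directions of the handoff on a single timeline, which is one common way to arrange such a pipeline; the tests in this commit may structure it differently.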
onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc
onnxruntime/test/providers/nv_tensorrt_rtx/nv_external_resource_importer_test.cc
- Deleted the sync_stream member from the OrtRunOptions structure.
- Removed the RunOptions_SetSyncStream API and its implementation.
- Updated related C++ API and example implementations to reflect the removal of sync stream functionality.
- Adjusted tests to remove references to RunOptions_SetSyncStream.
- Introduced new structures for external memory and semaphore handles to improve resource management.
- Ensured backward compatibility by checking EP version support for external resource import.
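The backward-compatibility item above describes gating a newer entry point on the version an EP reports. A minimal sketch of that pattern follows, assuming a versioned function table; `MockPluginFactory`, `kFirstVersionWithImporter` (and its value), `FakeCreateImporter`, and `TryCreateImporter` are all hypothetical and do not reflect the actual ONNX Runtime structures or version numbers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical first API version that carries the importer entry point.
constexpr std::uint32_t kFirstVersionWithImporter = 24;

enum class ImportResult { Created, Failed, NotImplemented };

// Mock of a versioned plugin function table. A field added in a later
// version must not be read unless the plugin reports a new enough version.
struct MockPluginFactory {
  std::uint32_t version_supported;
  bool (*create_importer)();  // meaningful only for new enough versions
};

// Stand-in for a plugin's importer creation routine.
inline bool FakeCreateImporter() { return true; }

inline ImportResult TryCreateImporter(const MockPluginFactory& factory) {
  // Older plugins were built before the field existed: do not touch it.
  if (factory.version_supported < kFirstVersionWithImporter) {
    return ImportResult::NotImplemented;
  }
  // Newer plugins may still leave the optional entry point unset.
  if (factory.create_importer == nullptr) {
    return ImportResult::NotImplemented;
  }
  return factory.create_importer() ? ImportResult::Created : ImportResult::Failed;
}
```

Checking the version before the pointer matters because, for a plugin compiled against an older header, the memory where the new field would live is not guaranteed to be initialized.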
…l resource structs
    OrtStatus* status = impl.ort_api.CreateTensorWithDataAsOrtValue(
        memory_info,
        data_ptr,
        handle->size_bytes - tensor_desc->offset_bytes,
    }

    // Calculate the data pointer with tensor offset
    void* data_ptr = reinterpret_cast<void*>(handle->mapped_ptr + tensor_desc->offset_bytes);
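The two snippets under review add `offset_bytes` to the mapped pointer and subtract it from `size_bytes`. With unsigned sizes, an offset larger than the allocation would wrap the subtraction around to a very large value. A minimal sketch of guarding that arithmetic is shown here, assuming hypothetical `ImportedRegion` and `TensorView` structs and a hypothetical `MakeTensorView` helper; it is not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical description of an imported allocation.
struct ImportedRegion {
  std::uintptr_t mapped_ptr;  // base device address of the mapping
  std::size_t size_bytes;     // total size of the mapping
};

// Hypothetical result: where the tensor data starts and how much room is left.
struct TensorView {
  std::uintptr_t data_ptr;
  std::size_t available_bytes;
};

// Returns a view only if [offset, offset + required) lies inside the region.
inline std::optional<TensorView> MakeTensorView(const ImportedRegion& region,
                                                std::size_t offset_bytes,
                                                std::size_t required_bytes) {
  // Reject offsets past the end before subtracting, to avoid unsigned wrap.
  if (offset_bytes > region.size_bytes) return std::nullopt;
  const std::size_t available = region.size_bytes - offset_bytes;
  assert(available <= region.size_bytes);
  // The tensor's own byte size must fit in what remains after the offset.
  if (required_bytes > available) return std::nullopt;
  return TensorView{region.mapped_ptr + offset_bytes, available};
}
```

Comparing `required_bytes` against the remaining space, rather than adding it to the offset, avoids a second overflow opportunity in the check itself.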
[ Repeating comment from https://github.com//issues/26821 ]
- Added `ep_interop_api.h` to define the Interop API for external resource importers.
- Implemented functions for creating and managing external resource importers, including memory and semaphore import capabilities.
- Updated `onnxruntime_c_api.cc` to integrate the new Interop API, replacing previous external resource importer implementations.
- Modified `ort_apis.h` to declare the new Interop API functions.
- Refactored tests in `test_external_resource_importer.cc` to utilize the new Interop API for external resource importer operations.
Additions in #26948, otherwise the concept looks good to me.
Description
This commit adds the OrtExternalResourceImporter implementation for the NvTensorRtRtx execution provider, enabling zero-copy D3D12 to CUDA memory sharing and GPU synchronization.
Implementation:
Tests (nv_external_resource_importer_test.cc):
Motivation and Context
#26821