@nieubank
Contributor

Description

This commit adds the OrtExternalResourceImporter implementation for the NvTensorRtRtx execution provider, enabling zero-copy D3D12 to CUDA memory sharing and GPU synchronization.

Implementation:

  • NvTrtRtxExternalResourceImporterImpl: Full implementation of the OrtExternalResourceImporter interface using CUDA Driver APIs
  • Memory import: cuImportExternalMemory for D3D12_RESOURCE and D3D12_HEAP
  • Semaphore import: cuImportExternalSemaphore for D3D12_FENCE
  • Tensor creation: CreateTensorFromMemory wraps imported CUDA device pointers
  • Synchronization: WaitSemaphore/SignalSemaphore using cuWaitExternalSemaphoresAsync/cuSignalExternalSemaphoresAsync

Tests (nv_external_resource_importer_test.cc):

  • CreateExternalResourceImporter: Basic importer creation
  • CanImportMemoryCapabilities: D3D12 Resource/Heap capability queries
  • CanImportSemaphoreCapabilities: D3D12 Fence capability queries
  • ImportD3D12SharedResource: Memory import validation
  • CreateTensorFromImportedMemory: Tensor creation with CUDA device ptr verification
  • ImportD3D12Fence: Semaphore import validation
  • WaitAndSignalSemaphore: Bidirectional D3D12-CUDA sync
  • FullInferenceWithExternalMemory: E2E test with ReLU model verifying D3D12 upload -> CUDA inference -> D3D12 readback pipeline

Motivation and Context

#26821

Introduces the OrtExternalResourceImporter API enabling execution providers
to import D3D12 shared resources and timeline fences for zero-copy GPU-to-GPU
data sharing with ORT inference.

Public API additions:
- OrtExternalResourceImporter capability object
- OrtExternalMemoryHandle for imported D3D12 allocations
- OrtExternalSemaphoreHandle for imported D3D12 timeline fences
- SessionGetEpDeviceForOutputs to query output EP device placement
- RunOptions_SetSyncStream to associate sync stream for async execution

EP Plugin API:
- OrtExternalResourceImporterImpl interface for EP implementations
- OrtEpFactory::CreateExternalResourceImporterForDevice extension

Design:
- No GPU virtual addresses in public API
- EP-agnostic design allows any EP to implement import
- Capability discovery with explicit ORT_NOT_IMPLEMENTED
- Follows existing patterns (Allocator, DataTransfer, SyncStream)

Includes example_plugin_ep mock implementation and autoep tests.
@nieubank requested a review from skottmckay on December 18, 2025, 21:05
- Deleted the sync_stream member from OrtRunOptions structure.
- Removed the RunOptions_SetSyncStream API and its implementation.
- Updated related C++ API and example implementations to reflect the removal of sync stream functionality.
- Adjusted tests to remove references to RunOptions_SetSyncStream.
- Introduced new structures for external memory and semaphore handles to improve resource management.
- Ensured backward compatibility by checking EP version support for external resource import.
OrtStatus* status = impl.ort_api.CreateTensorWithDataAsOrtValue(
    memory_info,
    data_ptr,
    handle->size_bytes - tensor_desc->offset_bytes,
Member

handle->size_bytes - tensor_desc->offset_bytes,

Should we check for overflow?
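
One way to address this, as a minimal sketch only: guard the subtraction before passing the length to CreateTensorWithDataAsOrtValue. The helper name `RemainingBytes` is hypothetical and not part of the PR, and the sketch assumes both fields are unsigned (`size_t`), in which case an offset larger than the size would wrap around to a very large value rather than go negative.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical helper, not part of the PR. Returns the number of bytes that
// remain after `offset_bytes`, or std::nullopt when the offset lies past the
// end of the imported allocation. Assumes both values are unsigned, so an
// unchecked subtraction would wrap around instead of failing.
inline std::optional<std::size_t> RemainingBytes(std::size_t size_bytes,
                                                 std::size_t offset_bytes) {
  if (offset_bytes > size_bytes) {
    return std::nullopt;
  }
  return size_bytes - offset_bytes;
}
```

A caller could return an invalid-argument status when the result is empty and otherwise pass the contained value as the buffer length.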

}

// Calculate the data pointer with tensor offset
void* data_ptr = reinterpret_cast<void*>(handle->mapped_ptr + tensor_desc->offset_bytes);
Member

handle->mapped_ptr + tensor_desc->offset_bytes

Suggest a check that this does not bring us outside tensor memory.
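
A possible shape for such a check, sketched under assumptions: the function name `TensorFitsInAllocation` and the `tensor_bytes` parameter (the byte size implied by the tensor's shape and element type) are hypothetical and do not come from the PR. The test is run on sizes before any pointer arithmetic, and it is written as two comparisons so that no addition can overflow.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the PR. Returns true when the byte range
// [offset_bytes, offset_bytes + tensor_bytes) lies entirely inside an
// imported allocation of `size_bytes`. The second comparison subtracts
// instead of adding, so the check itself cannot overflow.
inline bool TensorFitsInAllocation(std::size_t size_bytes,
                                   std::size_t offset_bytes,
                                   std::size_t tensor_bytes) {
  return offset_bytes <= size_bytes &&
         tensor_bytes <= size_bytes - offset_bytes;
}
```

Only when this returns true would the data pointer be computed as mapped_ptr plus the offset; otherwise the import could be rejected with an error status.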

@praneshgo
Contributor

praneshgo commented Dec 29, 2025

[ Repeating comment from https://github.com//issues/26821 ]
Setting up the CUDA context with CIG params, the ID3D12Device, and the ID3D12CommandQueue brings significant performance benefits (for reference, #26543 has a CUContext setup in nv_provider_factory.cc:SetupCigContextImpl). On NV GPUs, this allows graphics and compute work to execute simultaneously on the GPU, which maximizes utilization and avoids unnecessary context-switch penalties.
Ideally we would set up this context here as well; not doing so would mean missing out on these critical performance improvements.

- Added `ep_interop_api.h` to define the Interop API for external resource importers.
- Implemented functions for creating and managing external resource importers, including memory and semaphore import capabilities.
- Updated `onnxruntime_c_api.cc` to integrate the new Interop API, replacing previous external resource importer implementations.
- Modified `ort_apis.h` to declare the new Interop API functions.
- Refactored tests in `test_external_resource_importer.cc` to utilize the new Interop API for external resource importer operations.
@gedoensmax
Contributor

Additions in #26948, otherwise the concept looks good to me.

Base automatically changed from nieubank/ext_importer to main January 9, 2026 08:04