Cross adapter memory and synchronization #2443

Agrael1 · 2024-10-04T15:38:41Z

Hello,

I would like to propose a new Vulkan extension. However, I would like to get some validation first, before diving into the paperwork.

I would like to introduce cross-device memory and semaphore handles. On the user side, the mechanism would be the same as creating a shared handle. However, since it is cross-device, there are some significant considerations to be made. Let me present the rough outline.

Memory:

Memory could be device-local and host-visible (ReBAR), which ensures the best write performance. The idea is the same as the cross-adapter swapchain implementation, which copies images to the adapter that the output monitor is plugged into.
Only linear buffers may be created, since tiling may differ between devices. Perhaps tiling compatibility could be checked, in which case textures could also be created.
The memory is read-write on the producer adapter and read-only on other devices, which map it as if it were their local memory; this benefits security.
The handle is just a VkDeviceMemory, but it can be re-exported as a Win32 handle or fd (is that possible?).
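
The access asymmetry in the memory outline (read-write for the producer, read-only for everyone else) can be illustrated with a CPU-side analogy. The sketch below is not Vulkan code and not the proposed extension: it assumes a temporary file as a stand-in for the exported handle, and the helper names `make_shared_views` and `can_write` are hypothetical. It only shows one backing store mapped twice with different access rights.

```python
import mmap
import tempfile

SIZE = 4096  # arbitrary size for the illustration


def make_shared_views(size=SIZE):
    """Return (backing_file, producer_view, consumer_view) over the same bytes."""
    backing = tempfile.TemporaryFile()  # stands in for the exported handle
    backing.truncate(size)              # reserve the bytes before mapping
    # Producer side: read-write mapping of the shared bytes.
    producer = mmap.mmap(backing.fileno(), size, access=mmap.ACCESS_WRITE)
    # Consumer side: read-only mapping of the very same bytes.
    consumer = mmap.mmap(backing.fileno(), size, access=mmap.ACCESS_READ)
    return backing, producer, consumer


def can_write(view):
    """Try a one-byte write; a read-only mapping rejects it with TypeError."""
    try:
        view[0:1] = b"\x00"
        return True
    except TypeError:
        return False
```

In this model the producer writes with `producer[0:5] = b"frame"`, the consumer observes the same bytes through `consumer[0:5]`, and `can_write(consumer)` reports that the consumer cannot modify them. A real cross-device path would additionally have to deal with bus access, caching, and tiling, none of which this analogy covers.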

Semaphore:

Can be imported across adapters.
Uses system memory, like syncfd handles.
Timeline semaphores will also be available; the value will be shared across devices.
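
To make the "value shared across devices" idea concrete, here is a minimal CPU-side model of timeline semaphore semantics: one monotonically increasing 64-bit counter that one party signals and another waits on. This is only a sketch under stated assumptions: the class `TimelineSemaphoreModel` and the function `run_two_devices` are hypothetical names, threads stand in for devices, and nothing here is the Vulkan API or the proposed extension.

```python
import threading


class TimelineSemaphoreModel:
    """CPU-side model of a timeline semaphore: one monotonic counter."""

    MAX_VALUE = 2**64 - 1  # timeline values are unsigned 64-bit in Vulkan

    def __init__(self, initial_value=0):
        self._value = initial_value
        self._cond = threading.Condition()

    def signal(self, value):
        """Raise the counter; a host signal must strictly increase the value."""
        with self._cond:
            if not self._value < value <= self.MAX_VALUE:
                raise ValueError("timeline value must strictly increase")
            self._value = value
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        """Block until counter >= value; return False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= value, timeout)

    def counter(self):
        """Return the current counter value."""
        with self._cond:
            return self._value


def run_two_devices():
    """One thread ('consumer device') waits for value 2; the caller signals."""
    sem = TimelineSemaphoreModel()
    seen = []

    def consumer():
        sem.wait(2)                 # wait for the producer's second signal
        seen.append(sem.counter())

    worker = threading.Thread(target=consumer)
    worker.start()
    sem.signal(1)                   # producer finishes step 1
    sem.signal(2)                   # producer finishes step 2
    worker.join()
    return seen
```

The point of the model is that both sides only ever agree on a single number, which is why backing it with system memory (as the outline suggests) is plausible; how each driver would observe that number efficiently is left open here.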

Of course this is not entirely complete, and I would like to work on the details. I am aware of external_memory, but it requires readback memory (host visible, host coherent, host cached) and a wild amount of hacks. As far as I know, only syncfd semaphores are capable of cross-adapter support. The proposal would optimize the data paths and make synchronization easier.

I would like to hear your feedback!

cubanismo · 2024-10-16T01:06:10Z

If you only care about Linux, save yourself a ton of time and use the existing dma-buf and sync FD extensions (Looks like you've already found the latter. dma-buf is defined in VK_EXT_external_memory_dma_buf and the very complementary VK_EXT_image_drm_format_modifier). If you only care about Linux and Windows, you could use dma-buf on Linux and D3D-allocated objects on Windows. The Windows side would be a little hacky, but you could get something working.
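
One way the "tiling compatibility could be checked" idea maps onto DRM format modifiers is to intersect the modifier lists that each device reports for a format and fall back to linear when nothing else matches. The sketch below assumes those lists have already been queried (in Vulkan they would come from `VkDrmFormatModifierPropertiesListEXT`); the function name `pick_common_modifier` and the modifier values used in the usage note are hypothetical placeholders, apart from `DRM_FORMAT_MOD_LINEAR`, which is 0.

```python
DRM_FORMAT_MOD_LINEAR = 0  # linear layout modifier, as defined in drm_fourcc.h


def pick_common_modifier(producer_mods, consumer_mods):
    """Pick a layout both devices support, preferring the producer's order.

    Returns a tiled modifier when one is shared, the linear modifier when it
    is the only common choice, and None when there is no overlap at all.
    """
    consumer_set = set(consumer_mods)
    common = [m for m in producer_mods if m in consumer_set]
    tiled = [m for m in common if m != DRM_FORMAT_MOD_LINEAR]
    if tiled:
        return tiled[0]
    if DRM_FORMAT_MOD_LINEAR in common:
        return DRM_FORMAT_MOD_LINEAR
    return None
```

For example, with placeholder values, `pick_common_modifier([0x11, 0x22, 0], [0x22, 0])` selects `0x22`, while two devices that share only the linear modifier end up with `DRM_FORMAT_MOD_LINEAR`.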

If you want to solve this generally, you'd have to be willing to champion a lot of work and expound on the relevant use cases that make it worth all the driver vendors' time to do their part of the work. You've correctly identified some of the pieces needed, but not all of them (E.g., you can't use VkDeviceMemory handles, because they're not dispatchable, and hence their scope is that of the device that created them. You'd need some new type of handle, support for that in the loader somehow, support in every driver you'd want to use it in, support in the validation layers, etc.) The design has been roughly scoped before, and is a huge effort from a spec, loader, and driver perspective.

Agrael1 · 2024-10-16T07:41:53Z

Well, the Windows solution is something that came to my mind as well.
Question: can the existing handle types, like Win32 and regular FD, be reused for such an extension? The extension would then only require the memory descriptor to carry a single flag, a cross-device flag. Similar to syncfd, this could be done for Win32.

About vendors' time: they have everything in place already, since D3D12 supports such actions. The current scope is problematic, since D3D12 does not allow sharing device memory under protected sessions, and the use case is medical equipment.

About validation: if the design uses Win32 handles and FDs, validation stays the same, since the imported type is opaque and must be used the same way as existing semaphores and memory.

I'd like to try writing the spec, since the cross-adapter topic is becoming more relevant as everyone progresses with moving to Vulkan. Big industries will eventually want to use more than one device, without being bound by CPU waits, and in a secure context. I do acknowledge the amount of work to be done, though.

TomOlson · 2024-10-16T18:04:58Z

@Agrael1 the WG discussed this today. I've assigned it to the chair of the System Integration TSG, which owns external memory / synchronization and platform interfaces in general. The TSG will discuss and see if there is interest in engaging or if they have any advice. As @cubanismo (who is a TSG member) said, solving the general problem would be a huge amount of work, both on the spec side and in terms of persuading driver / platform / CTS / validation project leads of the importance of supporting it.

One thing I would suggest is to explain the use case and constraints in more detail, to help the TSG understand why doing all this work is necessary. How does the use case drive the need for cross-device memory and synchronization? If it's just a matter of leveraging multiple adapters, why aren't device groups sufficient? Et cetera...

Agrael1 · 2024-10-16T19:29:19Z

A device group requires SLI/CrossFire to work. All the memory becomes shared between devices. This reduces the overall capacity of the system and brings in vendor-specific requirements.

There are currently 2 concrete use cases that I know of.

Live production software: pretty niche, but still one of the consumers. There is an application that drives real-time broadcast, for video walls for example. There are 2 or more devices onboard. One is used specifically for presentation to the video wall over encoded streams (SDI/NDI) and performs color space corrections and encoding. The other device does decoding and live assembly. Those 2 devices must communicate, since the presentation device is not capable of large amounts of computation aside from stream encoding, and the decoding device either can't perform presentation or would be overloaded by it, which would make live broadcasting unviable. The proposed change would make communication between video cards easier and would not require identical cards, which in turn reduces the cost of building systems.
Medical device: there is a new generation of devices that have functions similar to MRI, but are local and real-time (I can't really share any more details). Such a device uses 2 graphics cards: a dedicated compute card for real-time information filtering and decoding, and a second card for 3D rendering. Throughput is huge, so you can't just sacrifice memory on either card. Performance timings and security are both critical for medical software.

The point: tasks with higher computational demands often require dedicated devices to run smoothly. As shown, dedicated adapters are not always the same model and may not be compatible with device groups. (I have a 3090 to run computations and an A-series card to drive a stereo display.)

For regular graphics development, this may open up possibilities for heterogeneous device acceleration, with dedicated nodes that may read specific parts of each other's memory.

Overall, such an extension is less convenient than device groups, but better suited to dedicating devices to specific work.

oddhack added the System Integration label Oct 16, 2024

TomOlson assigned linyaa-kiwi Oct 16, 2024
