Avoid calling useResource on resources in argument buffers #2402

Open
js6i wants to merge 3 commits into base: main

Conversation

@js6i (Collaborator) commented Dec 3, 2024

This PR implements execution barriers with Metal fences and puts all resources in a residency set, to avoid having to call useResource on every resource in bound argument buffers. That makes it possible to efficiently run programs that use descriptor indexing with large descriptor tables.
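
For context, a minimal sketch of the residency-set side, assuming the Metal residency set API (MTLResidencySetDescriptor, newResidencySetWithDescriptor:error:, addResidencySet:, addAllocation:, commit); the surrounding variable names are illustrative, not the actual MoltenVK code:

    // Illustrative only: attach a residency set to the queue so encoders no
    // longer need per-resource useResource: calls for argument-buffer contents.
    NSError* err = nil;
    MTLResidencySetDescriptor* rsDesc = [MTLResidencySetDescriptor new];
    rsDesc.label = @"Device residency set";
    id<MTLResidencySet> residencySet = [mtlDevice newResidencySetWithDescriptor: rsDesc error: &err];
    [mtlCommandQueue addResidencySet: residencySet];

    // As buffers and textures are created, make them resident:
    [residencySet addAllocation: mtlBuffer];
    [residencySet commit];    // additions take effect on commit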

Consider a pipeline executing some render passes with a couple of vertex-to-fragment barriers:

1 2   3 4   5 6
v v B v v B v v
f f B f f B f f

Here v and f symbolize the vertex and fragment stages of a render pass, and B stands for the barrier.
In this example, stages v1 and v2 need to run before f3..6, and v1..4 before f5 and f6.

To implement this I maintain a set of fences that will be waited on before each stage, and updated after it. Here's a diagram with the fences a and b placed before the stage symbol when waited on, and after when updated:

1  2     3   4     5  6
va va B avb avb B av av
f  f  B af  af  B bf bf

Here v1 updates fence a, v4 waits for a and updates b, f4 waits for a, etc.

Note that the synchronization is a little stronger than the original - v3..6 are forced to execute after v1 and v2. This is for practical reasons - I want to keep a constant, limited set of fences active, only wait for one fence per stage pair, and only update one fence per stage.
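
As a rough sketch of how one such wait/update pair maps onto Metal (the fence containers and encoder variable are made up; waitForFence:beforeStages: and updateFence:afterStages: are the real MTLRenderCommandEncoder calls):

    // Illustrative only: each stage waits on the fences currently active for it
    // before its work runs, and publishes its own completion afterwards.
    for (auto fence : vertexWaitFences)
        [mtlRenderEncoder waitForFence: fence beforeStages: MTLRenderStageVertex];
    for (auto fence : fragmentWaitFences)
        [mtlRenderEncoder waitForFence: fence beforeStages: MTLRenderStageFragment];
    [mtlRenderEncoder updateFence: vertexUpdateFence afterStages: MTLRenderStageVertex];
    [mtlRenderEncoder updateFence: fragmentUpdateFence afterStages: MTLRenderStageFragment];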

There are some things that could be improved here:

  • Keep the number of fences in flight more limited and reuse them, at the potential cost of incurring extra synchronization.
  • Don't add so many release handlers. I am quite defensive with retain/release here, but doing any less caused use-after-free errors. I think it should be possible to do better though, or at least batch the releases in a single handler (see the sketch after this list).
  • I think the fences should be assigned per queue, not per device, and I'm a bit worried about using fences across queues. I don't think we want to rely on which queue we'll be executing on to encode, though.
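
On the second point, a purely illustrative shape for batching the releases, assuming the retained fences are collected into a per-command-buffer vector:

    // Illustrative only: release all fences retained for this command buffer in
    // one completion handler instead of adding one handler per fence.
    auto* retainedFences = new std::vector<id<MTLFence>>(std::move(_cmdBufFences));
    [mtlCommandBuffer addCompletedHandler: ^(id<MTLCommandBuffer> cb) {
        for (auto fence : *retainedFences) { [fence release]; }
        delete retainedFences;
    }];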

Comment on lines 4795 to 4796
@synchronized (_physicalDevice->getMTLDevice()) {
for (auto fence: _activeBarriers[stage]) {

Contributor

Vulkan barriers run in submission order, so the fact that this is on MVKDevice (and requires synchronization) worries me.
Have you tested what happens if, e.g., you encode command buffers in immediate mode and then submit them in the opposite order you encoded them in? Yes, it won't crash thanks to the @synchronized, but the fact that this is in a place that requires synchronization at all means that two threads could fight over the _activeBarriers list and probably do unexpected (but non-crashy) things.

Also, any reason you're retaining and releasing all the fences? Don't they live as long as the MVKDevice (which according to Vulkan should outlive any active work on it)?

Collaborator Author

Right, that's a good point about keeping the fences there, in addition to the multiple queue problem.

Maybe I could avoid the requirement to encode only after submit (which would let us keep fences on MVKQueue) by keeping most fences local to the command buffer and doing some boundary trick to synchronize between submissions on the queue. Not sure what that trick is yet.

The fences are currently only supposed to live as long as the last command buffer that uses them. When one gets removed from all wait/update slots, the only references left are those attached to the command buffer. It sure is more retaining and releasing than I originally expected, so I might just pull the trigger and keep a fixed number of reusable fences.

Contributor

One possibility is to make sure the last group in a submission always updates a known fence, and then always start new submissions by waiting on that fence:

 1   2     3   4     5   6
avb avb B bvc bvc B cva cva
 f   f  B bf  bf  B cf  cf

(And if you go the reusable fence route, just have everyone use the same array of fences. Always start at index 0, and update index 0 at the end of a submission. Note that fences in Metal, like barriers in Vulkan, also work in submission order, so the worst that could happen using the same fences across multiple encoders at once is more synchronization than you wanted, but assuming you don't mix fences for different pipeline stages, I don't think that will be a big issue.)
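
In code form, that boundary pattern might look roughly like this (the fence array and encoder names are invented for illustration):

    // Illustrative only: every submission waits on fence 0 up front and updates
    // it at the end, so cross-submission ordering rides on Metal's
    // submission-order guarantee for fences.
    [firstRenderEncoder waitForFence: _fences[0] beforeStages: MTLRenderStageVertex];
    // ... encode this submission's own barriers with the remaining fences ...
    [lastRenderEncoder updateFence: _fences[0] afterStages: MTLRenderStageFragment];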

@billhollings (Contributor)

Since there are a few design and implementation points under discussion, I've moved this to WIP.

@billhollings changed the title from "Avoid calling useResource on resources in argument buffers" to "WIP: Avoid calling useResource on resources in argument buffers" on Dec 10, 2024
Comment on lines 4811 to 4812
// Initialize fences for execution barriers
for (auto &stage: _barrierFences) for (auto &fence: stage) fence = [_physicalDevice->getMTLDevice() newFence];

Contributor

Could you give the fences labels like [fence setLabel:[NSString stringWithFormat:@"%s Fence %d", stageName(stage), idx]]? Would be very convenient for debugging.

Collaborator Author

Sure, pushed it.
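
For reference, the labeled initialization might look roughly like this (the stage/fence count constants and the stageName helper are illustrative, not the actual MoltenVK names):

    // Illustrative only: create and label the per-stage barrier fences.
    for (uint32_t stage = 0; stage < kBarrierStageCount; stage++) {
        for (uint32_t idx = 0; idx < kBarrierFenceCount; idx++) {
            id<MTLFence> fence = [_physicalDevice->getMTLDevice() newFence];
            [fence setLabel: [NSString stringWithFormat: @"%s Fence %d", stageName(stage), idx]];
            _barrierFences[stage][idx] = fence;
        }
    }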

@js6i (Collaborator Author) commented Dec 17, 2024

Note that I removed the host stage; I don't think it needs to be explicit, but there should probably be some waits in applyMemoryBarrier and applyBufferMemoryBarrier before synchronizeResource (hence still WIP).
I don't think pullFromDevice needs any, as its callers require the client to sync with the device in some other way, which I think is sufficient?

@etang-cw (Contributor)

Note that I removed the host stage, I don't think it needs to be explicit

My understanding is that Metal guarantees memory coherency once you're able to observe that an operation has completed (e.g. through a shared event or by checking the completed status of a command buffer), so I think this is correct, since you'd need to do the same even with the host memory barrier in Vulkan.

Some old Metal docs:

Similarly, after the MTLDevice object executes a MTLCommandBuffer object, the host CPU is only guaranteed to observe any changes the MTLDevice object makes to the storage allocation of any resource referenced by that command buffer if the command buffer has completed execution (that is, the status property of the MTLCommandBuffer object is MTLCommandBufferStatusCompleted).
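
So on the host side, observing completion before reading shared storage should be enough; an illustrative sketch (names invented):

    // Illustrative only: the host reads shared-storage contents only after it
    // has observed that the command buffer completed, which is the point at
    // which Metal guarantees coherency.
    [mtlCommandBuffer waitUntilCompleted];    // or observe an MTLSharedEvent signal
    if (mtlCommandBuffer.status == MTLCommandBufferStatusCompleted) {
        memcpy(hostDst, mtlSharedBuffer.contents, copySize);
    }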

@js6i (Collaborator Author) commented Dec 19, 2024

Alright, my concern with synchronizeResource memory barriers seems moot, as it's only relevant on non-Apple devices, which don't support residency sets anyway.

@js6i changed the title from "WIP: Avoid calling useResource on resources in argument buffers" to "Avoid calling useResource on resources in argument buffers" on Dec 19, 2024
@billhollings (Contributor)

@js6i I see you've removed the WIP tag. Is this PR ready for overall review and merging?

@js6i (Collaborator Author) commented Dec 31, 2024

@js6i I see you've removed the WIP tag. Is this PR ready for overall review and merging?

Yes, I meant to submit it for review.
