Fix and Reenable Ring Allgather Cuda Ipc Test #5429
Merged
PR #5325 disabled this test after removing the barrier in the IPC exchangeHandles.
It was hard to reason through, but I realized that the get zcpy protocol aligns nicely with the way the ring allgather algorithm works here. In each iteration we 1) signal that the current buffer is ready to be read with a get, 2) run the matmul on it on the same stream after signaling, and 3) get the next buffer on the next stream. On the next j iteration, that buffer is ready, and we're back at step 1.
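As a rough illustration of that three-step loop, here is a minimal Python sketch that simulates the get-based ring with one thread per rank. Everything in it is hypothetical: ring_allgather_get, the ready events standing in for the IPC readiness signals, and the weighted sum standing in for the matmul. Each thread runs its steps serially, so the overlap of the matmul with the next get on a separate CUDA stream is not modeled; the sketch only shows the ordering of signal, compute, and get.

```python
import threading


def ring_allgather_get(shards, weight):
    """Hypothetical simulation of a get-based (pull) ring allgather.

    shards[r] is rank r's local shard. Returns (bufs, results) where
    bufs[r] is rank r's fully gathered buffer and results[r][s] is the
    stand-in "matmul" (a weighted sum) of slot s computed on rank r.
    """
    n = len(shards)
    # One output slot per peer on every rank; each slot is written once.
    bufs = [[None] * n for _ in range(n)]
    # ready[r][s] plays the role of the signal "rank r's slot s can be read".
    ready = [[threading.Event() for _ in range(n)] for _ in range(n)]
    results = [[None] * n for _ in range(n)]

    def rank_loop(r):
        bufs[r][r] = list(shards[r])
        left = (r - 1) % n
        for j in range(n):
            slot = (r - j) % n
            # 1) Signal that the current buffer is ready to be read by a peer.
            ready[r][slot].set()
            # 2) Compute on the current buffer (stand-in for the matmul).
            results[r][slot] = sum(weight * x for x in bufs[r][slot])
            # 3) Get the next buffer from the left neighbor, which holds
            #    that slot as its current buffer at the same iteration j.
            if j < n - 1:
                nxt = (r - j - 1) % n
                ready[left][nxt].wait()
                bufs[r][nxt] = list(bufs[left][nxt])

    threads = [threading.Thread(target=rank_loop, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs, results
```

For example, `ring_allgather_get([[1, 2], [3, 4], [5, 6], [7, 8]], weight=2)` leaves every rank with all four shards. The only wait is the reader blocking on the readiness signal before its get; because each slot is written exactly once, the sending side never has to wait before moving on.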
The semantics of the get protocol allow us to remove the sendWait and recvWait steps from the algorithm loop. I think the put protocol could do something similar, but the algorithm would need to be rewritten to work with the current and previous buffers in the ring instead of the current and next buffers. For now, I just skip the test when the put protocol is enabled.
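For contrast, a put-based loop organized around the current and previous buffers might look like the Python sketch below. This is only an assumption about one possible rewrite, not code from this PR; ring_allgather_put, the arrived events, and the weighted-sum stand-in for the matmul are all made up for illustration. Each rank pushes its current slot into its right neighbor's buffer and, before computing, waits for the slot its left neighbor pushed during the previous iteration.

```python
import threading


def ring_allgather_put(shards, weight):
    """Hypothetical simulation of a put-based (push) ring allgather.

    Same inputs and outputs as a pull-based version: bufs[r] is rank r's
    gathered buffer, results[r][s] the weighted-sum stand-in for a matmul.
    """
    n = len(shards)
    # One output slot per peer on every rank; each slot is written once.
    bufs = [[None] * n for _ in range(n)]
    # arrived[r][s] is set once a peer has finished writing rank r's slot s.
    arrived = [[threading.Event() for _ in range(n)] for _ in range(n)]
    results = [[None] * n for _ in range(n)]

    def rank_loop(r):
        bufs[r][r] = list(shards[r])
        right = (r + 1) % n
        for j in range(n):
            slot = (r - j) % n
            if j > 0:
                # The current slot was put by the left neighbor during the
                # previous iteration: a receive-side wait on that write.
                arrived[r][slot].wait()
            # Push the current buffer into the right neighbor's matching
            # slot; on the last iteration that slot is the neighbor's own.
            if j < n - 1:
                bufs[right][slot] = list(bufs[r][slot])
                arrived[right][slot].set()
            # Compute on the current buffer (stand-in for the matmul).
            results[r][slot] = sum(weight * x for x in bufs[r][slot])

    threads = [threading.Thread(target=rank_loop, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs, results
```

In this shape the data consumed at iteration j was produced by the neighbor at iteration j - 1, which is why the loop ends up reasoning about the previous buffer rather than the next one.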