Skip to content

Conversation

@nsarka
Copy link
Member

@nsarka nsarka commented Oct 24, 2025

PR #5325 disabled it after removing the barrier in ipc exchangeHandles.

It was hard to reason through, but I realized that the get zcpy protocol aligns nicely with the way the ring allgather algorithm works here. We 1) signal that the current buffer is ready to be get, and 2) matmul it on the same stream after signaling, and 3) get the next buffer on the next stream. On the next j iteration, the buffer is ready and we're back at step 1.

The semantics of the get protocol allows the removal of the sendWait and recvWait steps in the algorithm loop. I think the put protocol can do something similar, but the algorithm would need to be rewritten so that it's working with the current and previous buffers in the ring, instead of current and the next buffers. For now I just skipped the test if the put protocol is enabled.

@nsarka nsarka self-assigned this Oct 24, 2025
@nsarka nsarka force-pushed the nsarka/fix-ipc-ag-test branch from a69357e to bbe2c4a Compare October 24, 2025 20:16
@nsarka nsarka changed the title Reenable Ring Allgather Cuda Ipc Test Fix and Reenable Ring Allgather Cuda Ipc Test Oct 24, 2025
@nsarka
Copy link
Member Author

nsarka commented Oct 24, 2025

!test

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks. Let's just make sure CI passes

@nsarka nsarka merged commit cbb1b31 into NVIDIA:main Oct 27, 2025
62 of 67 checks passed
samnordmann added a commit that referenced this pull request Oct 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants