coll/han: Add alltoall algorithm #12387
base: main
Conversation
Some motivating data collected on 2K ranks (32 hpc7g):
LGTM, just two minor comments
@lrbison On a high level - I am a bit bothered by the fact that we are implementing point-to-point communication directly in HAN. It breaches the encapsulation in HAN, where other collective algorithms reuse other coll modules. In other words, HAN implicitly composes hierarchical collectives out of basic collectives.

Related to encapsulation, I have been thinking about the benefits of this new algorithm. I think the biggest innovation is separating inter-node from intra-node comms, which significantly reduces network load. The intra-node offload is only a secondary concern, and we might get away with PML instead of smsc. IMO the new XHC module is a better home for the smsc stuff, and later we can adopt XHC in HAN.
@wenduwan Yes, I understand your concern about the encapsulation. The problem is that alltoall is not easily/efficiently composed of smaller operations. By the nature of ALL to all, it is difficult to hierarchically subdivide the problem. I did think of three ways to subdivide:
In both of those cases above the thing limiting performance is actually the local communication. Finally, there is one more case:
Looks like CI has a failure. I will investigate, and also re-try the test including xpmem on my own cluster.
The failure is a symptom of the OB1 PML code's recent problems in INTER communicators. Unrelated to this change. See Issue 12407. Ignoring that unrelated problem for a moment, I think this change is ready for review again.
About smsc - I remember that it does not always work for accelerators, so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be on accelerators while others are not. I was debating with myself whether we should leave smsc out of this PR - would that cause a dramatic performance penalty? In other words, is smsc essential to achieve a reasonable/significant improvement over the current alltoall?
Yes. The primary advantage of this PR is that we avoid an extra copy and lots of extra synchronization by exposing all the send buffers to all other on-node ranks via mapped memory from SMSC module (and specifically, an SMSC module that implements mapped memory, which is currently only XPMEM)
Hm, good point. Technically we could achieve this with a local-to-node collective as long as the fallback implemented the same off-node pattern, but in the short-term a comm-wide allreduce would probably be needed. Let me experiment with this.
I have a different take on this. In my opinion the greatest value proposition of this PR is that we avoid the expensive cost associated with high network incast and outcast. In the context of HAN, every collective will benefit from smsc, but I doubt that is the foremost advantage. I imagine we can claim a bigger win if we can devise a reusable smsc pattern for allreduce, bcast, etc. too.
I realized I just sidestepped all intermediary datatypes and their conversion in this PR which is going to cause lots of failures. I'm marking this as draft while I fix that. I will:
Force-pushed from 2319b3c to 06fa035.
I have completed a significant update to this PR. The changes corrected glaring issues for DDT and device memory, while largely retaining the original performance.
I'm marking the PR ready for review again. Please take a look.
I collected some performance numbers on one of our platforms for the alltoall_using_smsc algorithm, and compared to whatever the default is when we exclude han ( I think it is tuned, but I didn't verify). The setup is 4 nodes, 48 processes per node (192 total), Slingshot11 with libfabric 1.20.1, and XPMEM. I collected numbers when using mtl/ofi (in which case even intra-node communication will go through the cxi
I think the numbers look good overall for the new algorithm, but I can't yet make sense of why btl/ofi performs better than mtl/ofi for this particular scenario (where the bulk of the intra-node communication would probably go through the smsc/xpmem component in the han algorithm).
Increase coll:han:get_algorithm verbosity level from 1 to 30, to avoid flooding terminal at any verbosity level. Thirty seems to be used for most of the other han dynamic selection prints. Signed-off-by: Luke Robison <[email protected]>
This will allow HAN collectives to check for and use SMSC methods to direct-map peer memory during on-node communication. Signed-off-by: Luke Robison <[email protected]>
I have addressed all outstanding comments. Thank you for the testing by @edgargabriel and @hppritcha! I made only small configuration-parameter related changes in my latest push:
Please take a look. I'm eager to merge this as it has been open a while, and I have a han alltoallv PR coming soon which builds on it.
Add Alltoall algorithm to coll/han. Each rank on one host is assigned a single partner on a remote host and vice versa. Then the rank collects all the data its partner will need to receive from its host and sends it in one large send, and likewise receives its data in one large recv, then cycles to the next host. This algorithm is only selected when the SMSC component has the ability to direct-map peer memory, which currently only the XPMEM module provides. Signed-off-by: Luke Robison <[email protected]>
This most recent push fixed two issues that primarily affected push-mode, which is only used when sendbuf is non-contiguous or is on device memory:
Add two Alltoall algorithms to coll/han. Both algorithms use the same
communication pattern. Each rank on one host is assigned a single
partner on a remote host and vice versa. Then the rank collects all
the data its partner will need to receive from its host, and sends it
in one large send, and likewise receives its data in one large recv,
then cycles to the next host.
The two algorithms are:
and each rank has a copy of all local data. Only recommended for
small message sizes.
direct-map local memory before copying into a packed send buffer.
Currently only the XPMEM-based smsc module supports this operation.