coll/han: Add alltoall algorithm #12387
base: main
Conversation
Some motivating data collected on 2K ranks (32 hpc7g):
LGTM, just two minor comments
@lrbison On a high level - I am a bit bothered by the fact that we are implementing point-to-point communication directly in HAN. It breaches the encapsulation in HAN, where other collective algorithms reuse other coll modules. In other words, HAN implicitly composes hierarchical collectives out of basic collectives.

Related to encapsulation, I have been thinking about the benefits of this new algorithm. I think the biggest innovation is separating inter-node from intra-node comms, which significantly reduces network load. The intra-node offload is only a secondary concern, and we might get away with PML instead of smsc. IMO the new XHC module is a better home for the smsc stuff, and later we can adopt XHC in HAN.
@wenduwan Yes, I understand your concern about the encapsulation. The problem is that alltoall is not easily/efficiently composed of smaller operations. By the nature of ALL to all, it is difficult to hierarchically subdivide the problem. I did think of three ways to subdivide:
In both of those cases above the thing limiting performance is actually the local communication. Finally, there is one more case:
Looks like CI has a failure. I will investigate, and also re-try the test including xpmem on my own cluster.
The failure is a symptom of the OB1 PML code's recent problems in INTER communicators. Unrelated to this change. See Issue 12407. Ignoring that unrelated problem for a moment, I think this change is ready for review again.
About smsc - I remember that it does not always work for accelerators, so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be on accelerators while others are not. I was debating with myself whether we should leave smsc out of this PR - would that cause a dramatic performance penalty? In other words, is smsc essential to achieve a reasonable/significant improvement over the current alltoall?
Yes. The primary advantage of this PR is that we avoid an extra copy and lots of extra synchronization by exposing all the send buffers to all other on-node ranks via mapped memory from SMSC module (and specifically, an SMSC module that implements mapped memory, which is currently only XPMEM)
Hm, good point. Technically we could achieve this with a local-to-node collective as long as the fallback implemented the same off-node pattern, but in the short-term a comm-wide allreduce would probably be needed. Let me experiment with this.
I have a different take on this. In my opinion the greatest value proposition of this PR is that we avoid the expensive cost associated with high network incast and outcast. In the context of HAN, every collective will benefit from smsc, but I doubt that is the foremost advantage. I imagine we can claim a bigger win if we can devise a reusable smsc pattern for allreduce, bcast, etc. too.
I realized I just sidestepped all intermediary datatypes and their conversion in this PR which is going to cause lots of failures. I'm marking this as draft while I fix that. I will:
Force-pushed from 2319b3c to 06fa035.
I have completed a significant update to this PR. The changes corrected glaring issues for DDT and device memory, while largely retaining the original performance.
I'm marking the PR ready for review again. Please take a look.
I collected some performance numbers on one of our platforms for the alltoall_using_smsc algorithm, and compared to whatever the default is when we exclude han ( I think it is tuned, but I didn't verify). The setup is 4 nodes, 48 processes per node (192 total), Slingshot11 with libfabric 1.20.1, and XPMEM. I collected numbers when using mtl/ofi (in which case even intra-node communication will go through the cxi
I think the numbers look good overall for the new algorithm, but I can't yet make sense of why btl/ofi performs better than mtl/ofi for this particular scenario (where the bulk of the intra-node communication would probably go through the smsc/xpmem component in the han algorithm).
Increase coll:han:get_algorithm verbosity level from 1 to 30, to avoid flooding terminal at any verbosity level. Thirty seems to be used for most of the other han dynamic selection prints. Signed-off-by: Luke Robison <[email protected]>
This will allow HAN collectives to check for and use SMSC methods to direct-map peer memory during on-node communication. Signed-off-by: Luke Robison <[email protected]>
I have addressed all outstanding comments. Thank you for the testing by @edgargabriel and @hppritcha! I made only small configuration-parameter related changes in my latest push:
Please take a look. I'm eager to merge this as it has been open a while, and I have a han alltoallv PR coming soon which builds on it.
Add Alltoall algorithm to coll/han. Each rank on one host is assigned a single partner on a remote host and vice versa. Then the rank collects all the data its partner will need to receive from its host and sends it in one large send, and likewise receives its data in one large recv, then cycles to the next host. This algorithm is only selected when the SMSC component has the ability to direct-map peer memory, which currently only the XPMEM module provides. Signed-off-by: Luke Robison <[email protected]>
This most recent push fixed two issues that primarily affected push-mode, which is only used when sendbuf is non-contiguous or is on device memory:
Add two Alltoall algorithms to coll/han. Both algorithms use the same
communication pattern. Each rank on one host is assigned a single
partner on a remote host and vice versa. Then the rank collects all
the data its partner will need to receive from its host, and sends it
in one large send, and likewise receives its data in one large recv,
then cycles to the next host.
The two algorithms are:
and each rank has a copy of all local data. Only recommended for
small message sizes.
direct-map local memory before copying into a packed send buffer.
Currently only the XPMEM-based smsc module supports this operation.