mca/coll: Add any radix k for alltoall bruck algorithm #12453

Open · wants to merge 1 commit into base: main

Conversation

@jiaxiyan (Contributor) commented Apr 5, 2024

This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.
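
For context, a minimal standalone sketch of the radix-k generalization (illustrative only, not the PR's code; all names are assumptions): in phase j, every data block whose j-th base-k digit equals d is forwarded to the rank d*k^j positions away, so the existing algorithm is the k=2 case.

/* Hypothetical sketch: enumerate the radix-k Bruck send schedule for a
 * communicator of size P, as seen from rank 0.  Not the PR's code.
 * Build: cc bruck_schedule.c -o bruck_schedule && ./bruck_schedule 8 4 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long P = (argc > 1) ? atol(argv[1]) : 8;   /* communicator size */
    long k = (argc > 2) ? atol(argv[2]) : 2;   /* radix             */
    long rank = 0;                             /* viewpoint rank    */
    long phase = 0;

    for (long kpow = 1; kpow < P; kpow *= k, phase++) {
        for (long d = 1; d < k && d * kpow < P; d++) {
            long peer = (rank + d * kpow) % P;
            printf("phase %ld, digit %ld -> send to rank %ld, blocks:", phase, d, peer);
            for (long b = 0; b < P; b++) {
                if ((b / kpow) % k == d)       /* phase-th base-k digit of b */
                    printf(" %ld", b);
            }
            printf("\n");
        }
    }
    return 0;
}

With k=2 this reproduces the schedule of the existing ompi_coll_base_alltoall_intra_bruck (send the blocks whose j-th bit is set to the rank 2^j away); a larger k uses more rounds, each moving fewer blocks, but less total data overall.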

@jiaxiyan jiaxiyan marked this pull request as draft April 5, 2024 23:05
@jiaxiyan jiaxiyan marked this pull request as ready for review April 8, 2024 18:55
@jiaxiyan jiaxiyan requested a review from lrbison April 8, 2024 18:55
@lrbison (Contributor) left a comment

Looks good @jiaxiyan. A few small suggestions.

Have you checked that the default faninout will be 2 for the dynamic rules, so as not to change performance?

Thanks

@bosilca (Member) commented Apr 29, 2024

I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.

@lrbison (Contributor) commented May 7, 2024

@bosilca I don't follow this comment:

when the upper and lower levels of the communication stack both try to generate multi-rail traffic, things can't go well.

Are you suggesting that collectives shouldn't be issuing PML isend/irecv calls directly?

@bosilca (Member) commented May 8, 2024

No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes)?

@wenduwan (Contributor) commented:

@jiaxiyan Please fix the build.

@jiaxiyan (Contributor, Author) commented:

We decided to keep the blocking sendrecv because its performance is slightly better than the non-blocking version (see the test data below).
We need to wait for each isend to complete before we can copy the new data back from recvbuf to tmpbuf, so the non-blocking version does not show much improvement.

OSU MPI All-to-All Latency (on 16 hpc7g instances, 1024 ranks; message size in bytes, latency in microseconds)

ompi_coll_base_sendrecv

Message size k=2 k = 4 k = 8 k = 16 k = 32
1 146.11 201.24 290.62 426.96 847.91
2 141.23 201.72 295.76 431.24 844.11
4 146.65 207.2 290.8 438.15 847.54
8 154.93 211.39 299.4 439.36 859.98
16 280.06 222.51 319.33 445.43 860.43
32 348.79 474.72 355.11 500.69 872.44
64 544.38 603.89 779.21 576.34 916.67
128 1059.73 932.32 1054.84 1323.35 968.53
256 21981.76 1822.93 1765.59 1875.43 2212.38
512 24050.4 3830.67 3719 3146.81 3036.46
1024 33873.93 8328.25 7612.07 6758.61 5348.24
2048 59072.64 18545.69 15535.72 14898.76 10101.14
4096 80879.25 37798.9 33115.21 27867.95 19722.63
8192 137532.69 80270.04 63793.19 61073.62 38873.77

isend/irecv

Message size k=2 k = 4 k = 8 k = 16 k = 32
1 138.96 204.74 291.22 428.89 847.15
2 141.24 205.62 290.97 430.1 848.62
4 145 204.63 295.42 438.76 851.13
8 156.82 211.84 298.93 444.47 865.3
16 279.29 223.89 324.6 449.07 862.43
32 350.09 479.04 357.18 502.31 880.38
64 540.79 601.15 799.61 583.52 932.97
128 1032.12 958.54 1087.27 1326.93 975.33
256 21958.61 1850.11 1753.25 1856.64 2210.36
512 23963.49 3837.31 3731 3241.24 3051.9
1024 33949.96 8348.78 7598.88 6802.96 5308.44
2048 59115.35 18520.23 15578.96 14939.42 10035.18
4096 80790.02 37814.46 33030 27908.15 19750.56
8192 137384.46 80003.8 63834.89 61007.52 38989.87
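
To illustrate the dependency described above, here is a minimal sketch (assumed names, not the PR's code) of one round written with non-blocking point-to-point; even here, the copy back into tmpbuf cannot start until the receive completes, so the wait stays on the critical path of every round.

/* Hypothetical sketch: one Bruck round with non-blocking point-to-point.
 * The overlap window is limited to this single round, because the next
 * round's blocks depend on the data copied back into tmpbuf. */
#include <mpi.h>
#include <string.h>

static int exchange_round(void *sendblk, void *recvblk, size_t bytes,
                          int sendto, int recvfrom, char *tmpbuf,
                          MPI_Comm comm)
{
    MPI_Request reqs[2];
    int err;

    err = MPI_Irecv(recvblk, (int)bytes, MPI_BYTE, recvfrom, 0, comm, &reqs[0]);
    if (MPI_SUCCESS != err) return err;
    err = MPI_Isend(sendblk, (int)bytes, MPI_BYTE, sendto, 0, comm, &reqs[1]);
    if (MPI_SUCCESS != err) return err;

    /* The copy back into tmpbuf depends on the receive having completed. */
    err = MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (MPI_SUCCESS != err) return err;

    memcpy(tmpbuf, recvblk, bytes);
    return MPI_SUCCESS;
}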

@juntangc (Contributor) commented:

Bruck's method is designed to work in an upper_comm containing only the local/node leaders, where every rank can communicate directly with any other rank (in the upper comm) and every pair of ranks is equally distant (details in this paper: https://ieeexplore.ieee.org/document/642949).

Bruck's method optimizes latency at the cost of bandwidth, so you are not likely to get a performance improvement for mid/large messages simply by using a larger k while limiting ranks_per_node to 1. It should instead be used as the internode step of a node-aware all-to-all approach: gather and combine the data on each node at the local/node leaders, run the internode all-to-all among the leaders with Bruck's method, then scatter the data from the local/node leaders to the other ranks on the same node (a sketch of the required communicator setup follows below).
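
As a rough sketch of the communicator setup that scheme would need (assumed names, standard MPI calls only; the gather/scatter of the data and the leader-level Bruck exchange are omitted):

/* Hypothetical sketch: build the "upper" communicator of node leaders.
 * The internode Bruck alltoall would then run on leader_comm only. */
#include <mpi.h>

int build_leader_comm(MPI_Comm comm, MPI_Comm *node_comm, MPI_Comm *leader_comm)
{
    int node_rank;

    /* Group ranks that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm);
    MPI_Comm_rank(*node_comm, &node_rank);

    /* Rank 0 on each node becomes the node leader; all other ranks get
     * MPI_COMM_NULL for leader_comm. */
    MPI_Comm_split(comm, (0 == node_rank) ? 0 : MPI_UNDEFINED, 0, leader_comm);
    return MPI_SUCCESS;
}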

@jiaxiyan (Contributor, Author) commented:

@bosilca @lrbison Can you give this another review? Thanks!

* [03] [13] [23] [33] [43] [53]
* [04] [14] [24] [34] [44] [54]
* [05] [15] [25] [35] [45] [55]
* After local rotation
A Member left a review comment:

Why do you really need the local rotation? I understand it makes the code easier to maintain, but it has a significant and ultimately unnecessary cost, because in the end you are building the datatype by hand without taking advantage of its contiguity in memory.

The second remark is related to the cost of creating and committing the datatype. I'm almost certain that this cost is high, especially for the mid-range messages where the k-ary Bruck is supposed to behave best. The result is that you pay a high cost to prepare the datatype, and you end up with a non-contiguous datatype that leads to lower-performance communications (non-contiguous data usually forces copy-in/copy-out protocols). If instead of building the datatype you copy the data into a contiguous buffer, you avoid the cost of the datatype construction and communicate from contiguous to contiguous memory, with a better outcome. The only potential drawback is the extra local copy before the send (similar to a pack); see the sketch below.
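
A minimal sketch of that alternative, assuming a hypothetical pack_blocks helper (not code from this PR): the selected blocks are copied into a contiguous staging buffer and sent as MPI_BYTE, so no datatype is created or committed per round.

/* Hypothetical sketch: copy `nblocks` blocks of `blk_bytes` bytes, picked out
 * by `block_idx`, from tmpbuf into a contiguous staging buffer (a manual
 * "pack").  The return value is the number of bytes to send as MPI_BYTE. */
#include <stddef.h>
#include <string.h>

static size_t pack_blocks(const char *tmpbuf, char *staging,
                          const int *block_idx, int nblocks, size_t blk_bytes)
{
    size_t packed = 0;
    for (int i = 0; i < nblocks; i++) {
        memcpy(staging + packed, tmpbuf + (size_t)block_idx[i] * blk_bytes,
               blk_bytes);
        packed += blk_bytes;
    }
    return packed;
}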

@wenduwan (Contributor) commented Jun 18, 2024

@jiaxiyan Please update the comment section after removing the rotation step(s).


err = ompi_datatype_destroy(&new_ddt);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
/* Commit the new datatype */
err = ompi_datatype_commit(&new_ddt);
A Member left a review comment:

I think the real reason you are not seeing a benefit from the non-blocking communications is the cost of the datatype creation. During this time the posted communications cannot progress, which means bandwidth is wasted until all the datatypes are built and you reach the waitall; one common mitigation is sketched below.
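
A minimal sketch of that mitigation (an assumption for illustration, not code from this PR): poll the pending requests between expensive local setup steps so the progress engine can move data before the waitall.

/* Hypothetical sketch: a single non-blocking poll of the outstanding
 * requests.  Calling this between per-round datatype constructions gives
 * the MPI progress engine a chance to advance the posted communications. */
#include <mpi.h>

static void progress_while_building(MPI_Request *reqs, int nreqs)
{
    int done = 0;
    MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
    (void)done;   /* completion is still collected later by the waitall */
}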

A Contributor replied:

Sorry, I missed this comment earlier. We profiled the application and found that the datatype creation time was insignificant compared to the memcpy and wait time.

@burlen commented May 28, 2024

There should be some additions to the user guide accompanying this change set that explain the algorithm and the MCA tuning parameters; otherwise it will be practically useless to anyone but you. The capability for user tuning is important and should not be neglected.

@burlen commented May 28, 2024

Note: a recent paper described a similar algorithm. The authors derive a mathematical model for selecting k and found that a default value of k = sqrt(P), with P the communicator size, works well.
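
For illustration only (a trivial sketch, not part of this PR): with that heuristic a 1024-rank communicator would default to k = 32.

/* Hypothetical helper implementing the k = sqrt(P) default suggested above;
 * for P = 1024 this returns 32.  Link with -lm. */
#include <math.h>

static int default_radix(int comm_size)
{
    int k = (int)lround(sqrt((double)comm_size));
    return (k < 2) ? 2 : k;   /* radix must be at least 2 */
}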

@jiaxiyan (Contributor, Author) commented:

@bosilca I experimented with your proposal to pack/unpack the data into a contiguous buffer instead of creating a datatype in 1b7e254.
I profiled the run and saw that the extra local copy actually creates a lot of overhead and causes a regression for small message sizes compared to creating and committing the datatype (see the data below). So I think we should use the previous commit.

OSU MPI All-to-All Latency (on 16 hpc7g instances, 1024 ranks; message size in bytes, latency in microseconds)

Message size create datatype pack into contiguous buffer
1 146.11 197.28
2 141.23 196.42
4 146.65 197.07
8 154.93 195.45
16 280.06 209.76
32 348.79 309.00
64 544.38 535.37
128 1059.73 879.60
256 21981.76 23591.02
512 24050.4 3805.94
1024 33873.93 32402.23
2048 59072.64 53316.38
4096 80879.25 119598.57
8192 137532.69 158336.34

This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.

Signed-off-by: Jessie Yang <[email protected]>
@bosilca (Member) commented Jun 19, 2024

Thanks for taking a look. I have to say your results are concerning, because the extra overhead is terribly high, so high that I wonder how one small memcpy (for the 1-byte message, for example) could add 30ms. Would you mind sharing the code with me (maybe privately by email)?

@jiaxiyan (Contributor, Author) commented:

@bosilca Did you have a chance to look at 1b7e254?

@wenduwan (Contributor) commented Jul 9, 2024

@bosilca Since this PR is backward compatible and does not cause a performance regression for the default k=2 case, I wonder if we can merge it and work on improvements as a separate project? Thanks!

@bosilca (Member) commented Jul 9, 2024

From a code perspective this looks OK. But from a logical perspective I still question the need for this. If I look at the results posted earlier here, k=2 behaves well until we hit 128 bytes, after which it jumps up by roughly 20x, which is what justifies the use of a larger radix. Clearly that jump in performance is problematic, and it alone leads to the poor tuned performance. Where is that jump coming from?

@wenduwan (Contributor) commented:

@bosilca Makes sense. Jessie has rotated to another high-priority project. I will take a look at the latency jump.

@wenduwan (Contributor) commented:

@bosilca I did an analysis of the algorithm with EFA. The weird performance bumps at some message sizes are the result of two factors:

  • The number of send-recv rounds at each radix: a higher radix requires more rounds, because each round carries less data.
  • The datatype size (message size) exchanged in each round.

Below is an example on 16 nodes with 64 cores each. I chose:

  • ppn: 16, 32, 64
  • radix: 2, 4, 8, 16

128-byte MPI_Alltoall:
16 nodes x 64 ppn
Radix   Total rounds   Bytes per round
2       10             65,536
4       15             32,768
8       22             16,384 (rounds 1-21), 65,536 (round 22)
16      33             8,192 (rounds 1-30), 32,768 (rounds 31-33)

16 nodes x 32 ppn
Radix   Total rounds   Bytes per round
2       9              32,768
4       13             16,384 (rounds 1-12), 32,768 (round 13)
8       21             8,192
16      31             4,096 (rounds 1-30), 32,768 (round 31)

16 nodes x 16 ppn
Radix   Total rounds   Bytes per round
2       8              16,384
4       12             8,192
8       17             4,096 (rounds 1-14), 8,192 (rounds 15-17)
16      30             2,048

The non-linear latency we observed is the result of the network layer, i.e. libfabric + the EFA network. EFA uses a datagram protocol (SRD), which exposes an MTU of 8 KiB. For larger messages, libfabric has to switch to other mechanisms:

  • Segment the payload into <8 KiB messages and send them one by one; or
  • Use the EFA device's RDMA capability to initiate a one-sided operation. On this platform this is done via an RDMA read from the receiver side; note that this is a rendezvous process.

This means we cannot apply the theoretical latency formula in the Bruck paper, at least for EFA, due to the non-linear data transfer behavior.
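
The round counts and per-round sizes in the tables above can be reproduced with a small standalone calculation (a sketch under the same assumptions as the tables, not the PR's code): each full phase of a radix-k Bruck alltoall has k-1 rounds moving P/k blocks, and the final partial phase (when P is not a power of k) moves larger groups of blocks.

/* Hypothetical check of the 16 nodes x 64 ppn table above: count the rounds
 * of a radix-k Bruck alltoall on P = 1024 ranks and the bytes moved per
 * round for 128-byte blocks. */
#include <stdio.h>

int main(void)
{
    const long P = 1024;        /* 16 nodes x 64 ppn        */
    const long blk = 128;       /* bytes per alltoall block */
    const long radices[] = {2, 4, 8, 16};

    for (int i = 0; i < 4; i++) {
        long k = radices[i], rounds = 0, last_bytes = -1;
        printf("radix %ld:\n", k);
        for (long kpow = 1; kpow < P; kpow *= k) {
            for (long d = 1; d < k && d * kpow < P; d++) {
                long nblocks = 0;
                for (long b = 0; b < P; b++)
                    if ((b / kpow) % k == d)
                        nblocks++;          /* blocks whose digit matches d */
                rounds++;
                if (nblocks * blk != last_bytes) {
                    printf("  from round %ld: %ld bytes per round\n",
                           rounds, nblocks * blk);
                    last_bytes = nblocks * blk;
                }
            }
        }
        printf("  total rounds: %ld\n", rounds);
    }
    return 0;
}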
