mca/coll: Add any radix k for alltoall bruck algorithm #12453
Conversation
Looks good @jiaxiyan. A few small suggestions.
Have you checked that the default faninout will be 2 for the dynamic rules, so as not to change performance?
Thanks
I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.
@bosilca I don't follow this comment:
Are you suggesting that collectives shouldn't be making PML isend/irecv calls directly?
No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes)?
Force-pushed from fd2a6bb to 188f023.
@jiaxiyan Please fix the build.
We decided to keep the blocking sendrecv because the performance is slightly better than nonblocking. (See test data below.)

OSU MPI All-to-All Latency (16 hpc7g nodes, 1024 ranks): ompi_coll_base_sendrecv vs. isend/irecv (table omitted).
Bruck's method optimizes latency by sacrificing bandwidth, so you are unlikely to see an improvement for mid/large messages simply by using a larger k while limiting ranks_per_node to 1. It should instead be used in a node-aware approach, applying Bruck only to the internode exchange: an upper_comm containing only the local/node leaders, who can communicate with each other directly. The procedure would be: (1) the data on each node is gathered and combined by the local/node leader; (2) the leaders perform the internode all-to-all using Bruck's method; (3) the data is scattered from the local/node leaders to the other ranks on the same node. See the sketch below.
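A minimal sketch of that three-phase scheme, under stated assumptions: this is not this PR's code; the communicator splitting is assumed to have happened elsewhere, leaders are node_comm rank 0, every node has the same rank count, and the local reordering of blocks between phases is deliberately omitted so the communication skeleton stays visible.

```c
#include <mpi.h>
#include <stdlib.h>

static int node_aware_alltoall(const void *sbuf, void *rbuf, int count,
                               MPI_Datatype dtype, MPI_Comm comm,
                               MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int world_size, node_size, node_rank;
    MPI_Comm_size(comm, &world_size);
    MPI_Comm_size(node_comm, &node_size);
    MPI_Comm_rank(node_comm, &node_rank);

    MPI_Aint lb, ext;
    MPI_Type_get_extent(dtype, &lb, &ext);
    size_t per_rank = (size_t)count * ext * world_size;  /* bytes per rank */

    /* Phase 1: gather every local rank's full send buffer on the leader. */
    char *gathered = (0 == node_rank) ? malloc(per_rank * node_size) : NULL;
    MPI_Gather(sbuf, count * world_size, dtype,
               gathered, count * world_size, dtype, 0, node_comm);

    char *exchanged = NULL;
    if (0 == node_rank) {
        /* Phase 2: internode all-to-all among leaders only; each leader
         * pair exchanges node_size x node_size blocks. With few leaders
         * and combined payloads, this is where a k-ary Bruck would go.
         * (Real code must first group `gathered` by destination node.) */
        exchanged = malloc(per_rank * node_size);
        MPI_Alltoall(gathered, count * node_size * node_size, dtype,
                     exchanged, count * node_size * node_size, dtype,
                     leader_comm);
        free(gathered);
    }

    /* Phase 3: scatter each local rank's share back within the node. */
    MPI_Scatter(exchanged, count * world_size, dtype,
                rbuf, count * world_size, dtype, 0, node_comm);
    if (0 == node_rank) free(exchanged);
    return MPI_SUCCESS;
}
```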
* [03] [13] [23] [33] [43] [53]
* [04] [14] [24] [34] [44] [54]
* [05] [15] [25] [35] [45] [55]
* After local rotation
Why do you really need the local rotation? I understand it makes the code easier to maintain, but it has a significant and ultimately unnecessary cost, because in the end you are building the datatype by hand without taking advantage of its contiguity in memory.

The second remark is related to the cost of creating and committing the datatype. I'm almost certain this cost is expensive, especially for the mid-range messages where the k-ary Bruck is supposed to behave best. The result is that you pay a high price to prepare the datatype, and what you get is a non-contiguous datatype that leads to lower-performance communications (non-contiguous data usually forces copy-in/copy-out protocols). If, instead of building the datatype, you copy the data into a contiguous buffer, you avoid the cost of the datatype construction and communicate from contiguous to contiguous memory, with a better outcome. The only potential drawback is the extra local copy before the send (similar to a pack).
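A minimal sketch of that alternative (illustrative names only, not this PR's actual variables or helpers): instead of building and committing an indexed datatype over the phase's blocks, copy the selected blocks into a staging buffer and send nblocks * block_bytes bytes as MPI_BYTE; the receiver unpacks with the mirror loop.

```c
#include <stddef.h>
#include <string.h>

/* Pack the non-contiguous blocks selected for one Bruck phase into a
 * contiguous staging buffer, so the subsequent send is contiguous-to-
 * contiguous. `base` is the working buffer, `block_idx`/`nblocks` list
 * the blocks this phase sends, `block_bytes` is the size of one block. */
static void pack_blocks(const char *base, const int *block_idx, int nblocks,
                        size_t block_bytes, char *staging)
{
    for (int i = 0; i < nblocks; i++) {
        memcpy(staging + (size_t)i * block_bytes,
               base + (size_t)block_idx[i] * block_bytes,
               block_bytes);
    }
}
```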
@jiaxiyan Please update the comment section after removing the rotation step(s).
err = ompi_datatype_destroy(&new_ddt);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
/* Commit the new datatype */
err = ompi_datatype_commit(&new_ddt);
I think the real reason you are not seeing benefits from nonblocking communications is the cost of the datatype creation. During this time the posted communications cannot progress, which means the bandwidth is wasted until all the datatypes are built and you reach the waitall.
Sorry, I missed this comment earlier. We profiled the application and found that the datatype creation time was insignificant compared to the memcpy and wait time.
There should be some additions to the user guide accompanying this change set that explain the algorithm and the MCA tuning parameters. Otherwise it will be practically useless to anyone but you. The capability for user tuning is important and should not be neglected.
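For illustration only, a coll_tuned dynamic-rules file selecting the Bruck algorithm with a faninout of 4 for small messages might look like the sketch below. The collective and algorithm IDs, the per-line field layout, and all thresholds here are assumptions to be checked against the tuned component's documentation, not values taken from this PR; the file would be loaded with --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_dynamic_rules_filename <file>.

```
1            # number of collectives described in this file
3            # collective ID (assumed: 3 = alltoall)
1            # number of communicator-size rules
1024         # communicator size this rule applies to
2            # number of message-size rules
0    3 4 0   # from 0 bytes: algorithm 3 (modified bruck), faninout 4, segsize 0
8192 2 0 0   # from 8 KiB:   algorithm 2 (pairwise),       faninout 0, segsize 0
```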
Note: this recent paper describes a similar algorithm. They derive a mathematical model for the selection of k, and found that a default value of k = sqrt(P), with P the communicator size, works well.
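For context, the standard accounting for a k-ary Bruck exchange (textbook reasoning, not a claim about the paper's exact model) is:

$$\text{rounds} = \big\lceil \log_k P \big\rceil, \qquad \text{messages per rank} \approx (k-1)\,\big\lceil \log_k P \big\rceil .$$

With k = sqrt(P) this gives only 2 rounds at the cost of roughly 2(sqrt(P) - 1) messages per rank, so the choice of k trades round count (latency terms) against message count and injection pressure; the cited model balances the two.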
@bosilca I experimented with your proposal to pack/unpack the data into a contiguous buffer instead of creating a datatype in 1b7e254.

OSU MPI All-to-All Latency (16 hpc7g nodes, 1024 ranks) (table omitted).
This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.

Signed-off-by: Jessie Yang <[email protected]>
Thanks for taking a look. I have to say your results are concerning, because the extra overhead is terribly high, so high that I wonder how one small memcpy (for the 1-byte message, for example) could add 30 ms. Would you mind sharing the code with me (maybe privately by email)?
@bosilca Since this PR is backward compatible and does not cause a performance regression for the default k=2 case, I wonder if we can merge it and work on the improvements as a separate project? Thanks!
From a code perspective this looks OK, but I still question the need for it from a logical perspective. If I look at the results posted earlier here, k=2 behaves well until we hit 128 bytes, where latency jumps by 20x; that jump is what justifies the use of a larger radix. Clearly that jump in performance is problematic, and it alone explains tuned's poor showing. Where is that jump coming from?
@bosilca Makes sense. Jessie has rotated to another high-priority project. I will take a look at the latency jump.
@bosilca I did an analysis of the algorithm with EFA. The weird performance bumps at some message sizes are the result of 2 factors:

Below is an example on 16 nodes with 64 cores each. I chose:

The non-linear latency we observed is a result of the network layer, i.e. libfabric + the EFA network. EFA uses a datagram protocol (SRD), which exposes an MTU of 8 KiB. For larger messages, libfabric has to switch to other mechanisms:

This means we cannot apply the theoretical latency formula from the Bruck paper, at least for EFA, due to this non-linear data-transfer behavior.
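As a back-of-the-envelope illustration of why the bumps can track the MTU (assuming the classic Bruck data layout, where each send in a round carries roughly P/k blocks; the exact per-step payload in this PR may differ): for P = 1024 ranks, k = 2, and a per-rank block of m bytes,

$$\frac{P}{k}\,m = 512\,m > 8\,\mathrm{KiB} \iff m > 16\ \mathrm{bytes},$$

i.e. the provider leaves the single-MTU fast path already at very small per-rank message sizes, and each protocol switch shows up as a step in the latency curve.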