Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v5.0.x] opal/common/ofi: refactor NIC selection logic #12135

Merged
merged 3 commits into from
Feb 7, 2024

Conversation

wenduwan
Copy link
Contributor

This patch refactors the OFI NIC selection logic. It foremost improves the NIC search algorithm. Instead of searching for the closest NICs on the system, this patch directly compares the distances of the given providers and selects the nearest NIC.

This change also makes it explicit that if the process is unbound, or the distance cannot be reliably calculated, a provider will be selected in round-robin fashion.

Signed-off-by: Wenduo Wang [email protected]
(cherry picked from commit f5f3b93)

@github-actions github-actions bot added this to the v5.0.1 milestone Nov 28, 2023
@rhc54
Copy link
Contributor

rhc54 commented Nov 28, 2023

Just FWIW: I had time at SC to meet with some fabric folks as well as Brice to talk about this "distance" thing. Bottom line is that it is actually meaningless - and has been for years. The only thing you really care about is that the NIC be attached to your local PCIe bus (as opposed to the bus on another package). Beyond that, you really aren't measuring anything meaningful.

So the "nearest" NIC will always be the first NIC on the PCIe bus in your local package. Doesn't matter where the proc is bound - answer is always the same. For other NICs hanging off of that bus, you can split hairs and claim there is a "distance" measured by position down the bus - but the vendors all say that is pointless as it is so small to be meaningless.

If you only have one bus being shared by multiple packages (a very old architecture), then the answers don't change. The "nearest" NIC is still the first one on that PCIe bus, and distances to the other NICs on that bus can be considered identical.

All of which led me to question why we are wasting so much time (both people and compute cycles) beating on this question when the fabric vendors say "it makes no difference - just take a NIC on the local PCIe bus of your package"? I'm wondering if we aren't overthinking things?

@wenduwan
Copy link
Contributor Author

@rhc54
For this patch, it primarily fixes a bug in the previous algorithm, orthogonal to the validity of the distance metric. I have explained this in the previous PR, but in short, opal_common_ofi_select_provider previously did not find the closest provider from the provider list, but instead attempts to match the global minimum against the provider list(which is also inefficient), if there is not a match then it will do round robin. This refactoring changes that to compare the NICs within the provider list and find the closest one, breaking ties by round robin.

@rhc54
Copy link
Contributor

rhc54 commented Nov 28, 2023

I understand that - my point is that the entire code may be moot. According to the vendors I met with, all we need to do is (a) verify that the PCIe is local to our package (which it certainly will be minus a very old chip), and (b) round-robin the assignments. Everything else that we are doing is really just unnecessary complexity that gains us nothing.

So I suggest that the correct PR is to simply remove all this code and round-robin the assignments.

@wenduwan
Copy link
Contributor Author

I see your point but I respectfully disagree on disregarding the distance entirely - I look at the problem from a generic software engineering pov. There are 2 separate aspects to the NIC selection problem: 1) calculating a meaningful distance metric and 2) using that distance metric to determine the nearest NIC. I believe this change is in the right direction by separating out the get_provider_distance function - the distance might not be meaningful at the moment but that could change later, and at that point the logic will work and better.

@rhc54
Copy link
Contributor

rhc54 commented Nov 28, 2023

🤷‍♂️ It's your guys code, not mine, so I have no strong opinion on it. Again, FWIW: the results when run on various topologies yields the exact same distance for every proc to each NIC. So the round-robin code is what is going to be operating in all cases. Likelihood of someone partitioning the package to put multiple PCIe's in it, each with NICs on it so that the distance from proc to NIC depends on binding, is....awfully low, at least in my remaining lifetime 😄

@wenduwan
Copy link
Contributor Author

Right I attempted to calculate the distance metric by counting hwloc objects between the processor and NIC but that wasn't easy either.

An alternative would be for the application to ask for a specific NIC, assuming that the application has a perfect knowledge of the PCI topology. Naturally that means this NIC selection logic would be no-op.

@wenduwan wenduwan changed the title opal/common/ofi: refactor NIC selection logic [v5.0.x] opal/common/ofi: refactor NIC selection logic Nov 30, 2023
@wenduwan wenduwan marked this pull request as draft November 30, 2023 21:44
@wenduwan
Copy link
Contributor Author

I promised @jsquyres to summarize the nic selection logic before merging this PR

@wenduwan
Copy link
Contributor Author

wenduwan commented Dec 4, 2023

Need to pick #12144 once it's merged. There is a bug.

@janjust
Copy link
Contributor

janjust commented Dec 4, 2023

@wenduwan I wouldn't mind if you just include the commits from #12144 as part of this PR
Or alternatively just force-push those commits to here if that works.

@wenduwan wenduwan self-assigned this Dec 18, 2023
@janjust janjust modified the milestones: v5.0.1, v5.0.2 Jan 8, 2024
@janjust
Copy link
Contributor

janjust commented Jan 11, 2024

@jsquyres Hey could you take a look at #12144? Does the documentation around nic selection make sense?
If so, then @wenduwan will merge, and cherry-pick to this PR

This patch refactors the OFI NIC selection logic. It foremost improves
the NIC search algorithm. Instead of searching for the closest NICs on
the system, this patch directly compares the distances of the given
providers and selects the nearest NIC.

This change also makes it explicit that if the process is unbound, or
the distance cannot be reliably calculated, a provider will be selected
in round-robin fashion.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit f5f3b93)
This change fixes current round-robin selection logic:
- Only providers of the same type should be considered, i.e. providers that
match the head of the list. This deviates from the documented behavior.
- For unbound process the selection should be based on its local rank, i.e.
rank among processes on the same node. Currently only the first NIC will be
selected.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit b061f96)
The documentation needs an update to reflect latest implementation.
The original cpuset matching logic has been replaced with a new distance
calculation algorithm.
This change also clarifies the round-robin selection process when we need to
break a tie.

Signed-off-by: Wenduo Wang <[email protected]>
(cherry picked from commit 3aba0bb)
@wenduwan
Copy link
Contributor Author

Changes are included in latest push - all of them are backport. I suggest merging after 5.0.2 release.

@janjust
Copy link
Contributor

janjust commented Jan 23, 2024

@wenduwan so we can remove the draft tag from this PR?

@wenduwan wenduwan marked this pull request as ready for review January 23, 2024 16:51
@wenduwan
Copy link
Contributor Author

@janjust Thanks it's ready for re-review.

@wenduwan wenduwan merged commit fe8ba1b into open-mpi:v5.0.x Feb 7, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants