[v5.0.x] opal/common/ofi: refactor NIC selection logic #12135

Merged 3 commits on Feb 7, 2024

Commits on Jan 22, 2024

  1. opal/common/ofi: refactor NIC selection logic

    This patch refactors the OFI NIC selection logic. Its main improvement is
    to the NIC search algorithm: instead of searching for the closest NICs on
    the system, it directly compares the distances of the given providers and
    selects the nearest NIC.
    
    This change also makes it explicit that if the process is unbound, or
    the distance cannot be reliably calculated, a provider will be selected
    in round-robin fashion.
    
    Signed-off-by: Wenduo Wang <[email protected]>
    (cherry picked from commit f5f3b93)
    wenduwan committed Jan 22, 2024
    d29a19b
  2. opal/ofi: fix round-robin selection logic

    This change fixes the current round-robin selection logic:
    - Only providers of the same type should be considered, i.e. providers that
    match the head of the list. The current code deviates from the documented
    behavior here.
    - For an unbound process, the selection should be based on its local rank,
    i.e. its rank among processes on the same node. Currently only the first
    NIC is selected.
    
    Signed-off-by: Wenduo Wang <[email protected]>
    (cherry picked from commit b061f96)
    wenduwan committed Jan 22, 2024
    f9800fd
  3. opal/ofi: update nic selection function doc

    The documentation needs an update to reflect the latest implementation.
    The original cpuset matching logic has been replaced with a new distance
    calculation algorithm.
    This change also clarifies the round-robin selection process when we need to
    break a tie.
    
    Signed-off-by: Wenduo Wang <[email protected]>
    (cherry picked from commit 3aba0bb)
    wenduwan committed Jan 22, 2024
    1b1dd85
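
The selection behavior described across these three commits can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Open MPI code: the `nic_candidate` struct and the `pick_round_robin` and `pick_nic` functions are hypothetical names. The sketch also makes two assumptions that the commit messages do not spell out: that the nearest-NIC comparison only looks at providers matching the head of the list, and that a tie between equally near NICs is broken by the local rank.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one provider entry (not an Open MPI or libfabric type). */
struct nic_candidate {
    int type;            /* provider type tag */
    uint32_t distance;   /* distance from the process to the NIC */
    int distance_valid;  /* 0 when the distance could not be computed reliably */
};

/* Round-robin over the entries whose type matches the head of the list,
 * indexed by the local rank (rank among processes on the same node). */
size_t pick_round_robin(const struct nic_candidate *c, size_t n, uint32_t local_rank)
{
    size_t eligible = 0;

    assert(c != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].type == c[0].type) {
            eligible++;
        }
    }

    /* eligible >= 1 because the head always matches its own type. */
    size_t target = local_rank % eligible;
    for (size_t i = 0; i < n; i++) {
        if (c[i].type != c[0].type) {
            continue;
        }
        if (target == 0) {
            return i;
        }
        target--;
    }
    return 0; /* not reached */
}

/* Pick the nearest NIC by comparing the candidates' distances directly.
 * Falls back to round-robin when the process is unbound or a distance is
 * unreliable. Assumption: ties among nearest NICs are broken by local rank. */
size_t pick_nic(const struct nic_candidate *c, size_t n,
                int process_bound, uint32_t local_rank)
{
    uint32_t best = UINT32_MAX;
    size_t num_best = 0;

    assert(c != NULL && n > 0);
    if (!process_bound) {
        return pick_round_robin(c, n, local_rank);
    }

    for (size_t i = 0; i < n; i++) {
        if (c[i].type != c[0].type) {
            continue; /* assumption: only head-type providers compete */
        }
        if (!c[i].distance_valid) {
            return pick_round_robin(c, n, local_rank);
        }
        if (num_best == 0 || c[i].distance < best) {
            best = c[i].distance;
            num_best = 1;
        } else if (c[i].distance == best) {
            num_best++;
        }
    }

    /* Spread processes across the equally near NICs. */
    size_t target = local_rank % num_best;
    for (size_t i = 0; i < n; i++) {
        if (c[i].type != c[0].type || c[i].distance != best) {
            continue;
        }
        if (target == 0) {
            return i;
        }
        target--;
    }
    return 0; /* not reached */
}
```

In this sketch, both functions return an index into the candidate array. A bound process with two equally near NICs alternates between them as the local rank increases, while an unbound process cycles through every provider of the head's type instead of always landing on the first one.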