
opal/mca/ofi: select NIC closest to accelerator if requested #11716

Merged: 1 commit into open-mpi:main from accelerator_awareness on Aug 26, 2024

Conversation

@wenduwan (Contributor) commented on May 24, 2023

This patch introduces a new capability to select the NIC closest to the user-requested accelerator (PCI) device. The implementation should suit all accelerator types, i.e. cuda & rocm. This change addresses #11696.

In this patch, we introduce an overriding logic when an accelerator has been initialized: instead of selecting a NIC on the same package, we select a NIC closest to the accelerator (which might be on a different package).

To enable this feature, the application should set the MCA parameter OMPI_MCA_opal_common_ofi_accelerator_rank (default -1) to a non-negative integer, which represents the process rank (0-based) among the processes sharing the same accelerator.
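Each process must supply its own value for this parameter. A minimal sketch (not part of this PR) of one way an application could do that, assuming the parameter is picked up from its `OMPI_MCA_` environment form during `MPI_Init` and that consecutive node-local ranks share an accelerator:

```c
/* Hedged sketch: derive the per-accelerator rank from the node-local rank
 * and export it before MPI_Init.  OMPI_COMM_WORLD_LOCAL_RANK is set by
 * mpirun; ranks_per_gpu and the "consecutive ranks share a GPU" mapping
 * are assumptions made for this example only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = (NULL != lr) ? atoi(lr) : 0;
    int ranks_per_gpu = 2;                       /* example value */
    int accelerator_rank = local_rank % ranks_per_gpu;

    char value[16];
    snprintf(value, sizeof(value), "%d", accelerator_rank);
    setenv("OMPI_MCA_opal_common_ofi_accelerator_rank", value, 1);

    MPI_Init(&argc, &argv);
    /* ... application work ... */
    MPI_Finalize();
    return 0;
}
```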

The implementation depends on the following APIs (a minimal sketch of how they fit together follows the list):

  • accelerator.get_device_pci_attr: Retrieve the PCI info of the accelerator.
  • hwloc_get_pcidev_by_busid: Get the hwloc objects of the accelerator and the provider (NIC).
  • hwloc_get_common_ancestor_obj: Get the closest common ancestor hwloc object between the accelerator and the provider.
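For illustration only (not the PR's actual code), here is how those pieces could be combined. The PCI domain/bus/device/function arguments stand in for the values returned by `accelerator.get_device_pci_attr` for the GPU and reported by the OFI provider for the NIC, and the topology is assumed to have been loaded with PCI device discovery enabled:

```c
#include <hwloc.h>

/* Map both PCI addresses to hwloc objects and return their lowest common
 * ancestor, or NULL if either device is not present in the topology (in
 * which case the caller falls back to package-based selection). */
static hwloc_obj_t gpu_nic_common_ancestor(hwloc_topology_t topo,
        unsigned gpu_domain, unsigned gpu_bus, unsigned gpu_dev, unsigned gpu_func,
        unsigned nic_domain, unsigned nic_bus, unsigned nic_dev, unsigned nic_func)
{
    hwloc_obj_t gpu = hwloc_get_pcidev_by_busid(topo, gpu_domain, gpu_bus,
                                                gpu_dev, gpu_func);
    hwloc_obj_t nic = hwloc_get_pcidev_by_busid(topo, nic_domain, nic_bus,
                                                nic_dev, nic_func);
    if (NULL == gpu || NULL == nic) {
        return NULL;
    }
    return hwloc_get_common_ancestor_obj(topo, gpu, nic);
}
```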

The NIC selection logic can be summarized as follows (a sketch of the rule is given after this list):

  • Among the available NICs, find those closest to the accelerator device. Here we choose not to use pmix_device_distance_t or hwloc_distances_s for practical reasons: they are not computable on every platform, e.g. on AWS EC2 we cannot reliably get such values between the GPU and the NIC. Instead, device proximity is measured as distance(GPU, common ancestor) + distance(NIC, common ancestor); see https://www.open-mpi.org/projects/hwloc/doc/v2.9.1/a00359.php
  • When there is a tie, break it with a modulo: (local rank on the same accelerator) % (number of nearest providers). Note that we do not have a good way to calculate the local rank on the same accelerator, so we reuse the local rank on the same package as a proxy.
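A minimal sketch of the selection rule above (again, not the PR's actual code); the provider hwloc objects and a non-negative `accelerator_rank` are hypothetical inputs, and the distance is counted as parent hops from each device up to the common ancestor:

```c
#include <limits.h>
#include <hwloc.h>

/* Number of parent hops from obj up to ancestor (includes the ancestor,
 * excludes obj); -1 if ancestor is not on obj's parent chain. */
static int hops_to_ancestor(hwloc_obj_t obj, hwloc_obj_t ancestor)
{
    int hops = 0;
    while (NULL != obj && obj != ancestor) {
        obj = obj->parent;
        ++hops;
    }
    return (obj == ancestor) ? hops : -1;
}

/* Return the index of the chosen NIC, or -1 if none could be ranked. */
static int select_nic(hwloc_topology_t topo, hwloc_obj_t gpu,
                      hwloc_obj_t *nics, int num_nics, int accelerator_rank)
{
    int best = INT_MAX, nearest[64], num_nearest = 0;

    for (int i = 0; i < num_nics; ++i) {
        hwloc_obj_t anc = hwloc_get_common_ancestor_obj(topo, gpu, nics[i]);
        if (NULL == anc) {
            continue;
        }
        int dist = hops_to_ancestor(gpu, anc) + hops_to_ancestor(nics[i], anc);

        if (dist < best) {           /* strictly closer: restart the candidate list */
            best = dist;
            num_nearest = 0;
        }
        if (dist == best && num_nearest < 64) {
            nearest[num_nearest++] = i;
        }
    }
    /* Tie-break: stripe processes sharing the accelerator across the
     * equally-near NICs. */
    return (num_nearest > 0) ? nearest[accelerator_rank % num_nearest] : -1;
}
```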

@rhc54 (Contributor) commented on May 24, 2023

Please see #11696 (comment) for a suggested generalized approach to this. I believe you have done the single-GPU case described in that comment; you might want to consider the extension to multiple GPUs.

@wenduwan (Contributor, Author) commented on May 24, 2023

@rhc54 Thanks for the discussion on #11696.

You are right that this patch only addresses the single-GPU case. I don't have a good grasp of how multiple GPUs should be handled; I think it demands considerable design and implementation work.

@edgargabriel (Member) left a comment


looks good to me as far as I can tell.

(Resolved review thread on opal/mca/common/ofi/common_ofi.c, marked outdated.)
@wenduwan (Contributor, Author) commented

I need to rebase this change if #11689 is merged.

@rhc54 (Contributor) commented on Jul 3, 2024

Just a reminder from prior conversations about this topic. We know this algorithm isn't actually correct as the HWLOC "depth" does not correlate to communication distance. We do have a topic scheduled for offline discussion with vendors about this issue. However, in this case, this approach probably won't hurt as you'll just wind up doing a round-robin of the devices (which is what we've observed in the past, unless something has changed since the last time you tried this), and for now that's probably the best you can do anyway.

wenduwan marked this pull request as ready for review on July 3, 2024.
@wenduwan (Contributor, Author) commented on Jul 3, 2024

@rhc54 You are right. I figured that the depth attribute is not that helpful. Instead, I chose a more direct distance measure, based on two assumptions:

  • Any NIC and GPU should have at least one common ancestor
  • The number of objects between the NIC (or GPU) and the common ancestor can be reliably computed

Then I can use this imperfect metric, objects_between(GPU, common ancestor) + objects_between(NIC, common ancestor), to qualitatively compare GPU-NIC distances.

This is largely an experimental feature, so I'm adding a switch for it and leaving it disabled by default.

@rhc54 (Contributor) commented on Jul 3, 2024

Worth trying; I agree with the default switch, though. It's a rather difficult problem, and so far every attempt has failed to produce the desired results. We need a better understanding of the signal flow in the system. For example, intervening objects have no impact on message traversal times because the electronics in each device on the bus don't intercept/relay the signal; it's just a bus that they are all hanging off of, and distance along the bus is largely irrelevant given the speed of the electrons and the physical distances involved.

What does seem to matter is any intervening device that actually does an intercept/relay operation, e.g., to switch from one bus to another where injection into the target bus requires traffic coordination. So moving from the main memory bus to the PCI bus costs you something, but talking to anything on that PCI bus is the same as talking to anything else on that bus; it doesn't matter where on the PCI bus you sit.

Picking the right combination therefore seems to devolve into minimizing bus transitions (e.g., having the NIC on one PCI bus and the GPU on another is probably not good, unless you have a cross-device harness, in which case the two communicate without transferring across PCI) and balancing loads. We can compute the first; the second is less clear without making assumptions about how the application might use the devices.

Hope to get some of this clarified and quantified in upcoming months.

Commit message (7d20b86):

This patch introduces the capability to select the NIC closest to the
accelerator device. If the accelerator or NIC PCI information is not
available, fall back to selecting the NIC on the closest package.

To enable this feature, the application should set the MCA parameter
OMPI_MCA_opal_common_ofi_accelerator_rank (default -1) to a non-negative
integer, which represents the process rank (0-based) on the same
accelerator.

The distance between the accelerator device and the NIC is the number of
hwloc objects in between, which includes the lowest common ancestor on
the hwloc topology tree.

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan (Contributor, Author) commented

Chatted with @naughtont3 offline. I will merge this PR for now. New issues will be filed later based on additional testing.

wenduwan merged commit 7d20b86 into open-mpi:main on Aug 26, 2024
16 checks passed
wenduwan deleted the accelerator_awareness branch on August 26, 2024.