
Conversation

@smarterclayton commented Oct 24, 2025

The nvshmemi_get_devices_by_distance default initialization method in NVSHMEM does not work optimally for GPU configurations where 2 GPUs and 2 RDMA NICs share a PCIe bus, such as the x86-based GCP A3 Ultra H200 and A4 B200 instance types: https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus. GPU0 and GPU1 (on two independent processes) observe NIC0 and NIC1 on the same PCIe switch as equidistant, and the default configuration for DeepEP + NVSHMEM results in both GPUs being assigned NIC0 by nvshmemi_get_devices_by_distance. This halves the observed RDMA bandwidth in test_internode.py and in vLLM wide-EP (because only 4 of the 8 NICs are enabled).

The alternative is a static mapping between GPU host index (PE) and NIC index (HCA), but the NVSHMEMX_INIT_WITH_UNIQUEID initialization method bypasses setting mype_node and npes_node.

The nvshmemi_boot_handle.pg_rank for this initialization method is always 0 and nvshmemi_boot_handle.pg_size is always 2, and as a result mype_node and npes_node are set to 0 and 2 respectively for every initializing PE. This prevents NVSHMEM_ENABLE_NIC_PE_MAPPING=1 from selecting from a static list of devices by mype_node / local rank in transport.cpp#nvshmemi_setup_connections:

    selected_devices[0] =
      nvshmemi_state->mype_node % (tcurr->n_devices > 0
        ? tcurr->n_devices : 1);

To allow static assignment, we introduce a new DEEP_EP_DEVICE_TO_HCA_MAPPING environment variable, read during deep_ep.Buffer initialization, that accepts a comma-separated list of <cuda_device_id>:<HCA_name>:<HCA_port> entries and uses torch.cuda.current_device() to set NVSHMEM_HCA_LIST to the matching <HCA_name>:<HCA_port>, or errors if the current device is not listed.
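
For illustration, a minimal sketch of that resolution logic, assuming the format above; the helper name and exact error handling are illustrative, not the PR's actual code (reverse mapping of CUDA_VISIBLE_DEVICES is sketched further below):

    import os
    import torch

    def _select_hca_for_current_device() -> None:
        # Hypothetical helper: parse DEEP_EP_DEVICE_TO_HCA_MAPPING and export the
        # matching HCA as NVSHMEM_HCA_LIST for this rank's CUDA device.
        mapping = os.environ.get('DEEP_EP_DEVICE_TO_HCA_MAPPING')
        if not mapping:
            return
        device = torch.cuda.current_device()
        for entry in mapping.split(','):
            device_id, hca_name, hca_port = entry.strip().split(':')
            if int(device_id) == device:
                os.environ['NVSHMEM_HCA_LIST'] = f'{hca_name}:{hca_port}'
                return
        raise ValueError(f'DEEP_EP_DEVICE_TO_HCA_MAPPING has no entry for CUDA device {device}')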

We are proposing the change to DeepEP because the choice of initialization method determines how HCA to GPU binding can be achieved across multiple existing NVSHMEM versions. Because vLLM 0.11.0 uses CUDA_VISIBLE_DEVICES to associate GPU devices to each rank, we reverse-map the visible devices to recover the host device index for the current CUDA device.
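
A sketch of that reverse mapping, under the assumption that CUDA_VISIBLE_DEVICES contains integer ordinals (the helper name is illustrative):

    import os
    import torch

    def _host_device_index() -> int:
        # Hypothetical helper: translate the process-local CUDA device index back
        # to the host GPU index when CUDA_VISIBLE_DEVICES exposes a subset of GPUs.
        local_index = torch.cuda.current_device()
        visible = os.environ.get('CUDA_VISIBLE_DEVICES')
        if not visible:
            return local_index
        ordinals = [d.strip() for d in visible.split(',')]
        try:
            # Only integer ordinals can be reverse-mapped; UUID entries fall through.
            return int(ordinals[local_index])
        except ValueError:
            return local_index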

On GCP we would propose this configuration for all full-host workloads on A3U and A4 instance types (H200/B200 + x86 with a shared PCIe switch per 2 NICs / 2 GPUs):

DEEP_EP_DEVICE_TO_HCA_MAPPING=0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1

Tested with NVSHMEM 3.3.20, vLLM 0.11.0, and vLLM main @ 2025/10/23.

Co-Authored-By: Keon Jang [email protected]

@smarterclayton changed the title from "Allow NVSHMEM PE to NIC to be initialized by rank" to "Allow NVSHMEM PE to NIC mapping to be initialized by DeepEP rank" on Oct 24, 2025
smarterclayton and others added 2 commits October 29, 2025 14:52
The `nvshmemi_get_devices_by_distance` default initialization
method in NVSHMEM does not work optimally for GPU configurations
where 2 GPU and 2 RDMA NIC share a PCIe bus, such as the x86
based GCP A3 Ultra H200 and A4 B200 instance types:
https://cloud.google.com/compute/docs/gpus/gpu-network-bandwidth#h200-gpus.
GPU0 and GPU1 (on two independent processes) can observe NIC0 and
NIC1 on the same PCIe switch as equidistant, which results in both
GPUs leveraging NIC0 and halves the observed bandwidth for RDMA in
test_internode.py and in vLLM wide-EP.

The alternative is a static mapping between GPU host index (PE) and
NIC index (HCA), but the NVSHMEMX_INIT_WITH_UNIQUEID
initialization method bypasses setting `mype_node` and `npes_node`.
The `nvshmemi_boot_handle.pg_rank` for this initialization method
is always 0 and the `nvshmemi_boot_handle.pg_size` is always 2,
preventing NVSHMEM_ENABLE_NIC_PE_MAPPING from leveraging a static
list of devices in transport.cpp#nvshmemi_setup_connections:

    selected_devices[0] =
      nvshmemi_state->mype_node % (tcurr->n_devices > 0
        ? tcurr->n_devices : 1);

which evaluates with mype_node = 0 for every PE.

To allow static assignment, introduce a DEEP_EP_DEVICE_TO_HCA_MAPPING
environment variable during Buffer Python initialization that accepts
`<cuda_device_id>:<HCA_name>:<HCA_port>` entries and resolves
`torch.cuda.current_device()` to set NVSHMEM_HCA_LIST to the
appropriate value, or errors if the device is not listed.

Co-Authored-By: Keon Jang <[email protected]>
Signed-off-by: Clayton Coleman <[email protected]>
Map integer CUDA_VISIBLE_DEVICES values back to the host device
index used by DEEP_EP_DEVICE_TO_HCA_MAPPING.

Signed-off-by: Clayton Coleman <[email protected]>
@sphish (Collaborator) commented Nov 5, 2025

Why not just set the NVSHMEM environment variables directly?

@smarterclayton (Author) commented Nov 6, 2025

They have to be set per rank, so we would need to change all callers of deep_ep.Buffer to set those environment variables before invoking it, including test_low_latency and test_internode (which we use to validate network performance of the supporting RDMA clusters). Since we expect all consumers of DeepEP to need this mapping on our H200 and B200 VM families, we wanted to push the fix as low in the software stack as possible (below SGLang, vLLM, NeMo, etc.) so that those frameworks don't have to change how they call DeepEP. We considered changing NVSHMEM, but that would likely still require a change to Buffer initialization to pass the local rank.
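
For context, this is roughly what each caller would otherwise have to repeat per rank before constructing deep_ep.Buffer, shown as a sketch using the GCP mapping proposed above (the exact variable values here are assumptions for illustration, not code from this PR):

    import os
    import torch

    # Per-rank NVSHMEM configuration that every framework would need to set
    # before creating deep_ep.Buffer, using the A3U/A4 GPU-to-HCA layout above.
    hca_by_device = {i: f'mlx5_{i}:1' for i in range(8)}
    os.environ['NVSHMEM_ENABLE_NIC_PE_MAPPING'] = '1'
    os.environ['NVSHMEM_HCA_LIST'] = hca_by_device[torch.cuda.current_device()]
    # ... only now construct deep_ep.Buffer(...)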

@sphish (Collaborator) commented Nov 6, 2025

NVSHMEM stores local rank information, for example in mype_node in the code.
I believe the issue you’re experiencing is quite common — applications using NVSHMEM outside of DeepEP are likely to encounter it as well. It would be best to submit a PR to NVSHMEM to resolve this problem.

@sphish (Collaborator) commented Nov 6, 2025

@seth-howell Could you help have a look at this?

@keonjang-google commented

mype_node is 0 for all the PEs because DeepEP initializes NVSHMEM separately for each rank IIUC.

@seth-howell (Contributor) commented

@sphish
When working with multiple instances of NVSHMEM, there's no way for us to know which NIC is claimed by a process in a separate instance. The best way to do this currently is to set the NVSHMEM mapping environment variables.

> Why not just set the NVSHMEM environment variables directly?

It looks like the main benefit of using a DeepEP-scoped environment variable is that you only have to set one, rather than both.

> I believe the issue you’re experiencing is quite common — applications using NVSHMEM outside of DeepEP are likely to encounter it as well.

The topology detection only fails when both cases are true:

  1. The application launches multiple instances of NVSHMEM within a single application.
  2. There are two GPUs and two NICs at equal distance from each other in the PCIe hierarchy.

Any application not satisfying both conditions will run fine without any environment variables set.

The Google use case of DeepEP internode + the above hardware configuration is the first time we have seen this problem.

@smarterclayton

> They have to be set per rank, so we would need to change all callers of deep_ep.Buffer to set those environment variables before invoking, including test_low_latency

The low_latency kernels only create one instance of NVSHMEM, so the topology detection and assignment work correctly in that case, properly distributing one unique NIC per process. Please let me know if you've had a different experience.

This solution is correct and likely the easiest to implement. FWIW, I would recommend merging it to unblock Google.

Other possible solutions

DeepEP internode specific - requires changes in DeepEP
If the internode initialization code is modified to instantiate NVSHMEM once across all processes and use NVSHMEMX_TEAM_SAME_MYPE_NODE to determine when to send over RDMA, it will assign the NICs properly.
This is what the low_latency kernel is doing today.

More general, but potentially limited by environment - No required changes in DeepEP
I believe it was @keonjang-google who suggested we look into adding GPU index as a consideration during topology detection. This is valid in some environments, but may not work in others.
An application launched with Slurm's gres feature, or any application relying on CUDA_VISIBLE_DEVICES on a per-process basis, would enumerate a subset of devices in an arbitrary order to the application. At the very least, this would require querying the PCIe BDFs of GPUs not used in the current NVSHMEM instance and assigning them an ID independent of the CUDA enumeration. This would take some time to implement.
There are other edge cases (Kata containers, VMs) where the GPUs may not even be visible at the PCIe level.
