UCCL-EP: GPU-initiated Expert-parallel Communication

GPU-initiated expert-parallel communication (e.g., DeepEP) is key to efficient, large-scale EP. However, it cannot run on heterogeneous platforms due to the tight coupling between the GPU and the NIC (e.g., with IBGDA). UCCL-EP provides the same interface and functionality as DeepEP and enables GPU-initiated communication for MoE models across heterogeneous GPUs (e.g., NVIDIA, AMD) and NICs (e.g., EFA, Broadcom, CX7), with performance superior to the state of the art.

Prerequisite

We provide a script that installs the dependencies (tested on p5en, p6-b200, and AMD MI300X), assuming you are running inside a Python environment:

./install_deps.sh

Build on CUDA

You can directly build and install into your Python env:

python setup.py install

You can also use make to build and install (this may be deprecated in the future):

make -j install

Alternatively, you can build the uccl.ep wheel using Docker and then install it:

# Under uccl
bash build.sh cu12 ep --install

Note: the Docker-built uccl.ep wheel currently does not work on p6-b200; see #554.

Build on ROCm

You can directly build and install into your Python env:

python setup.py install

Alternatively, you can build the uccl.ep wheel for ROCm 7 using Docker and then install it:

# Under uccl
bash build.sh roc7 ep --install

Test import uccl.ep

python -c "import torch; import uccl.ep"

Note:

  • If you hit a CUDA error: invalid device function, GPU architecture auto-detection has likely failed. You can force the architecture by setting TORCH_CUDA_ARCH_LIST during compilation (e.g., gfx942 for MI300X/MI325X, gfx950 for MI355X).
  • If you hit any weird compilation errors, try python setup.py clean.

Example APIs

Dispatch and combine:

packed_recv_x, packed_recv_count, handle, event, hook = buffer.low_latency_dispatch(
    current_x,
    topk_idx,
    num_tokens,
    num_experts,
    use_fp8=dispatch_use_fp8,
    round_scale=round_scale,
    use_ue8m0=use_ue8m0,
    cumulative_local_expert_recv_stats=cumulative_local_expert_recv_stats,
    async_finish=not return_recv_hook,
    return_recv_hook=return_recv_hook,
)

combined_x, event, hook = buffer.low_latency_combine(
    simulated_gemm_x,
    topk_idx,
    topk_weights,
    handle,
    use_logfmt=use_logfmt,
    async_finish=not return_recv_hook,
    zero_copy=zero_copy,
    return_recv_hook=return_recv_hook,
    out=out,
)

Initialization and teardown:

proxies, workers = initialize_uccl(scratch, num_rdma_bytes, rank, num_ranks, group, args.num_experts)
destroy_uccl(proxies, workers)
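
Putting these together, the sketch below shows how the calls typically bracket the per-expert computation in an MoE layer. This is a minimal illustration, not part of the UCCL-EP API: buffer is assumed to be a DeepEP-compatible buffer backed by UCCL-EP, and run_expert_ffn is a placeholder for your grouped expert MLPs.

# Minimal sketch (assumptions noted above); initialize_uccl(...) is called once at startup
# and destroy_uccl(...) once at shutdown, as shown in the snippet above.
def moe_layer_forward(buffer, current_x, topk_idx, topk_weights, num_tokens, num_experts, run_expert_ffn):
    # Dispatch: route each token to its top-k experts across all ranks.
    recv_x, recv_count, handle, event, hook = buffer.low_latency_dispatch(
        current_x, topk_idx, num_tokens, num_experts,
        use_fp8=False,            # BF16 path to keep the sketch simple; see above for FP8
        async_finish=False,       # no compute/communication overlap in this sketch
        return_recv_hook=False,
    )
    # Per-expert computation on the tokens this rank received (placeholder).
    expert_out = run_expert_ffn(recv_x, recv_count)
    # Combine: weighted-reduce the expert outputs back to each token's owner rank.
    combined_x, _, _ = buffer.low_latency_combine(
        expert_out, topk_idx, topk_weights, handle,
        async_finish=False, return_recv_hook=False,
    )
    return combined_x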

Benchmark

In the ep folder, the benchmarks can be run with torchrun.

Intranode Test

torchrun --standalone --nproc_per_node=8 \
  bench/test_intranode.py --num-tokens 4096 \
  --hidden 7168 --num-topk 8 --num-experts 256

Internode Low Latency Test

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=<rank> \
  --master_addr=<ip> --master_port=12355 \
  bench/test_low_latency.py --num-tokens=128 \
  --hidden=7168 --num-topk=8 --num-experts=288

Internode Normal Mode (Throughput) Test

torchrun --nnodes=4 --nproc_per_node=8 --node_rank=<rank> \
  --master_addr=<ip> --master_port=12355 \
  bench/test_internode.py  --num-tokens=4096 \
  --hidden=7168 --num-topk=8 --num-experts=288 --test-ll-compatibility

Notes:

  • To avoid possible hangs, we suggest explicitly setting the environment variables NCCL_IB_GID_INDEX, UCCL_IB_GID_INDEX, NCCL_SOCKET_IFNAME, and UCCL_SOCKET_IFNAME:
    • UCCL_IB_GID_INDEX should be set to the same value you would use for NCCL_IB_GID_INDEX when running NCCL.
    • UCCL_SOCKET_IFNAME should be the network interface that serves the --master_addr you pass to torchrun.
  • For Broadcom Thor-2, we suggest setting UCCL_IB_MAX_INFLIGHT_BYTES=1572864 UCCL_IB_MAX_INFLIGHT_NORMAL=1 to enforce strict flow control, which avoids CQE error 12 (Transport Retry Counter Exceeded).
  • For the AMD Pollara AI NIC, we suggest setting UCCL_IB_MAX_INFLIGHT_BYTES=2097152 UCCL_IB_MAX_INFLIGHT_NORMAL=1.
  • Please refer to bench/baseline for running more baselines including Torch, NVSHMEM, and pplx-kernels on EFA.

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| UCCL_SOCKET_IFNAME | Bootstrapping interface | null |
| UCCL_IB_GID_INDEX | GID index in RDMA network | -1 |
| UCCL_IB_MAX_INFLIGHT_BYTES | Max inflight bytes per GPU/NIC | SIZE_MAX (no limit) |
| UCCL_IB_MAX_INFLIGHT_NORMAL | Max inflight writes per GPU/NIC in normal (high-throughput) mode | 8 |
| UCCL_IB_MAX_INFLIGHT_LOW_LATENCY | Max inflight writes per GPU/NIC in low-latency mode | 32 |
| UCCL_IB_SL | Service level in RDMA network | 3/8 (IB/EFA) |
| UCCL_IB_TC | Traffic class in RDMA network | 104/0 (IB/EFA) |
| UCCL_EP_ENABLE_AGGRESSIVE_ATOMIC | Use relaxed atomics with manual fences instead of acquire/release semantics (AMD only) | 0 (disabled) |

Results

Normal kernels with NVLink and RDMA forwarding

On p5en

We test normal kernels on 8x H200 + 16x 200Gb/s EFA with each GPU connected to two 200 Gb/s EFA RDMA network cards. We follow the DeepSeek-V3 pretraining configuration (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatch and BF16 combine).

| Type | Dispatch #EP | Bottleneck bandwidth & latency | Combine #EP | Bottleneck bandwidth & latency |
| --- | --- | --- | --- | --- |
| Intranode | 8 | 320 GB/s (NVLink), 500 µs | 8 | 319 GB/s (NVLink), 973 µs |
| Internode | 16 | 50 GB/s (RDMA), 1196 µs | 16 | 18 GB/s (RDMA), 6379 µs |
| Internode | 24 | 53 GB/s (RDMA), 1633 µs | 24 | 26 GB/s (RDMA), 6365 µs |
| Internode | 32 | 54 GB/s (RDMA), 2022 µs | 32 | 43 GB/s (RDMA), 4899 µs |

On p6-b200

We test normal kernels on 8x B200 + 8x 400Gb/s EFA with each GPU connected to a 400Gb/s EFA RDMA network card.

| Type | Dispatch #EP | Bottleneck bandwidth & latency | Combine #EP | Bottleneck bandwidth & latency |
| --- | --- | --- | --- | --- |
| Intranode | 8 | 280 GB/s (NVLink), 571 µs | 8 | 426 GB/s (NVLink), 727 µs |
| Internode | 16 | 53 GB/s (RDMA), 1141 µs | 16 | 60 GB/s (RDMA), 1965 µs |
| Internode | 24 | 53 GB/s (RDMA), 1637 µs | 24 | 59 GB/s (RDMA), 2887 µs |
| Internode | 32 | 53 GB/s (RDMA), 2072 µs | 32 | 57 GB/s (RDMA), 3724 µs |

On AMD MI300X with CX7 InfiniBand

| Type | FP8 Dispatch #EP | Bottleneck bandwidth | BF16 Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- | --- | --- |
| Intranode | 8 | 260 GB/s (xGMI) | 8 | 295 GB/s (xGMI) | 8 | 304 GB/s (xGMI) |
| Internode | 16 | 74 GB/s (RDMA) | 16 | 82 GB/s (RDMA) | 16 | 78 GB/s (RDMA) |
| Internode | 32 | 60 GB/s (RDMA) | 32 | 61 GB/s (RDMA) | 32 | 60 GB/s (RDMA) |
| Internode | 64 | 52 GB/s (RDMA) | 32 | 53 GB/s (RDMA) | 64 | 51 GB/s (RDMA) |

On AMD MI355X with Pollara AI NIC

| Type | FP8 Dispatch #EP | Bottleneck bandwidth | BF16 Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- | --- | --- |
| Intranode | 8 | 299 GB/s (xGMI) | 8 | 336 GB/s (xGMI) | 8 | 333 GB/s (xGMI) |
| Internode | 16 | 82 GB/s (RDMA) | 16 | 82 GB/s (RDMA) | 16 | 82 GB/s (RDMA) |
| Internode | 32 | 59 GB/s (RDMA) | 32 | 58 GB/s (RDMA) | 32 | 59 GB/s (RDMA) |
| Internode | 64 | 50 GB/s (RDMA) | 32 | 49 GB/s (RDMA) | 64 | 49 GB/s (RDMA) |

On AMD MI300X with Broadcom Thor2

| Type | FP8 Dispatch #EP | Bottleneck bandwidth | BF16 Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- | --- | --- |
| Internode | 16 | 71 GB/s (RDMA) | 16 | 81 GB/s (RDMA) | 16 | 45 GB/s (RDMA) |
| Internode | 32 | 49 GB/s (RDMA) | 32 | 55 GB/s (RDMA) | 32 | 50 GB/s (RDMA) |

Low-latency kernels with pure RDMA

On AMD MI300X with CX7 InfiniBand

| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
| --- | --- | --- | --- | --- | --- |
| 8 | 65 µs | 114 GB/s | 8 | 92 µs | 157 GB/s |
| 16 | 136 µs | 55 GB/s | 16 | 207 µs | 70 GB/s |
| 32 | 224 µs | 30 GB/s | 32 | 341 µs | 42 GB/s |

On p6-b200

We test low-latency kernels on 8x B200 + 8x 400Gb/s EFA.

| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
| --- | --- | --- | --- | --- | --- |
| 16 | 228 µs | 33 GB/s | 16 | 318 µs | 46 GB/s |
| 24 | 448 µs | 17 GB/s | 24 | 566 µs | 26 GB/s |
| 32 | 406 µs | 19 GB/s | 32 | 617 µs | 24 GB/s |

On p5en

We test low-latency kernels on 8x H200 + 16x 200Gb/s EFA, following a DeepSeek-V3 inference setting (128 tokens per batch, 7168 hidden, top-8 experts, FP8 dispatch / BF16 combine).

| Dispatch #EP | Latency | RDMA bandwidth | Combine #EP | Latency | RDMA bandwidth |
| --- | --- | --- | --- | --- | --- |
| 16 | 226 µs | 36 GB/s | 16 | 293 µs | 48 GB/s |
| 24 | 386 µs | 20 GB/s | 24 | 580 µs | 26 GB/s |
| 32 | 465 µs | 16 GB/s | 32 | 694 µs | 25 GB/s |