Commit 71d6529

minor mod on the text
1 parent 8f12ded commit 71d6529

1 file changed

Lines changed: 2 additions & 14 deletions

src/content/posts/uccl-ep-full.md

@@ -85,6 +85,8 @@ We have since evaluated UCCL-EP on a diverse set of platforms spanning NVIDIA an
 
 **Baselines:** NCCL/RCCL, DeepEP (NVIDIA-only), Perplexity Kernels ([PPLX](https://github.com/perplexityai/pplx-garden)), and CPU-assisted IBGDA. UCCL-EP uses 4 CPU proxy threads per GPU.
 
+Please reach out to us if you would like to improve and evaluate EP communication on your own platform!
+
 ---
 
 ## Microbenchmark Results
@@ -145,20 +147,6 @@ Following a DeepSeek-V3 inference setting (128 tokens, 7168 hidden, top-8 expert
 
 ---
 
-### On InfiniBand (NVIDIA CX7)
-
-On the Nebius testbed (H100 + CX7 InfiniBand), we compare UCCL-EP against both the original DeepEP and PPLX at EP32.
-
-In **LL mode**, UCCL-EP incurs slightly higher latency than DeepEP and PPLX due to the CPU proxy overhead on small messages. However, in **HT mode**, UCCL-EP achieves latency **within 5% of DeepEP** for dispatch while outperforming PPLX by **2.1x** (dispatch) and **1.6x** (combine). This shows that UCCL-EP preserves DeepEP-level performance on throughput-oriented workloads even without IBGDA.
-
-<div class="not-prose my-6 grid w-full grid-cols-2 items-start justify-items-center gap-5 [&_img]:!my-0 [&_img]:h-auto [&_img]:max-w-[400px] [&_img]:min-w-0 [&_img]:w-full">
-<img src="https://raw.githubusercontent.com/uccl-project/uccl-project.github.io/uccl-ep-full-blogpost/assets/uccl-ep-full/nebius_dispatch_ll_ht.png" alt="Nebius dispatch" width="400"/>
-<img src="https://raw.githubusercontent.com/uccl-project/uccl-project.github.io/uccl-ep-full-blogpost/assets/uccl-ep-full/nebius_combine_ll_ht.png" alt="Nebius combine" width="400"/>
-</div>
-<p align="center"><em>EP32 dispatch (left) and combine (right) comparison on H100 + CX7 InfiniBand. UCCL-EP matches DeepEP in HT mode and significantly outperforms PPLX.</em></p>
-
----
-
 ### On AMD GPUs
 
 UCCL-EP enables GPU-initiated token-level EP communication on AMD GPUs. We evaluate on MI300X with both CX7 InfiniBand (OCI) and Broadcom Thor-2 NICs (Vultr).