[P2P] Added congestion control support (Timely, Swift) #837

Open
andrzej-k wants to merge 1 commit into uccl-project:main from andrzej-k:ak_cc_mod

Contributor

@andrzej-k andrzej-k commented Mar 24, 2026

Description

The goal is to let RoCE EP and P2P use congestion control (CC) algorithms in addition to the currently supported flow control. This is useful in environments without PFC support.

This PR is the first step toward that. The overall plan:

  • Add Timely and Swift support in P2P. As part of that, move the CC algorithms (Timely, Swift, EQDS) out of collectives to a shared location.
  • Add Timely and Swift support in EP.
  • Add EQDS support in P2P first, then in EP.
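
For reference, the test runs in this description pick the algorithm through the UCCL_P2P_RDMA_CC environment variable (swift, timely, or unset for the existing flow-control-only path). Below is a minimal Python sketch of that kind of selection logic. It is not the implementation in this PR: the names CongestionControl and select_cc are hypothetical, and case-insensitive matching and raising on unknown values are assumptions.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class CongestionControl(Enum):
    """Hypothetical enum; only the string values appear in the PR."""
    NONE = "none"      # variable unset: flow control only
    TIMELY = "timely"
    SWIFT = "swift"


# Values accepted from the environment (assumption: matched case-insensitively).
_BY_NAME = {
    "timely": CongestionControl.TIMELY,
    "swift": CongestionControl.SWIFT,
}


def select_cc(env: Optional[Mapping[str, str]] = None) -> CongestionControl:
    """Map UCCL_P2P_RDMA_CC to an algorithm; unset or empty means no CC."""
    env = os.environ if env is None else env
    raw = env.get("UCCL_P2P_RDMA_CC", "").strip().lower()
    if not raw:
        return CongestionControl.NONE
    if raw not in _BY_NAME:
        # Assumption: reject unknown names instead of silently falling back.
        raise ValueError(f"unsupported UCCL_P2P_RDMA_CC value: {raw!r}")
    return _BY_NAME[raw]
```

Failing loudly on an unknown value is a design choice for the sketch: a typo in the variable would otherwise produce a benchmark run that silently measures the baseline path.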

Test results (two-node setup):

export UCCL_P2P_RDMA_CC=swift

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38717
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 41241
Connected to <IP>:41241 (fd=63)
Accepted connection fd=64 from <IP>:57828
[Client] Connected to <IP>:41241 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.03 Gbps |   1.00 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.32 Gbps |   3.54 GB/s  | 0.000019 s
[Client] 256.0 KB :  76.63 Gbps |   9.58 GB/s  | 0.000027 s
[Client]   1.0 MB : 139.22 Gbps |  17.40 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.99 Gbps |  23.12 GB/s  | 0.000453 s
[Client]  16.0 MB : 186.98 Gbps |  23.37 GB/s  | 0.000718 s
[Client] 100.0 MB : 186.41 Gbps |  23.30 GB/s  | 0.004500 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

export UCCL_P2P_RDMA_CC=timely

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0]
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38567
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 34689
Connected to <IP>:34689 (fd=59)
Accepted connection fd=64 from <IP>:36800
[Client] Connected to <IP>:34689 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.53 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.10 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.11 Gbps |   3.51 GB/s  | 0.000019 s
[Client] 256.0 KB :  75.48 Gbps |   9.44 GB/s  | 0.000028 s
[Client]   1.0 MB : 140.82 Gbps |  17.60 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.47 Gbps |  23.06 GB/s  | 0.000455 s
[Client]  16.0 MB : 187.26 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.34 Gbps |  23.29 GB/s  | 0.004502 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

unset UCCL_P2P_RDMA_CC

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1]
  [2]
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 46773
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 36325
Connected to <IP>:36325 (fd=63)
Accepted connection fd=64 from <IP>:35360
[Client] Connected to <IP>:36325 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.06 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.07 Gbps |   0.26 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.05 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.73 Gbps |   3.59 GB/s  | 0.000018 s
[Client] 256.0 KB :  76.76 Gbps |   9.59 GB/s  | 0.000027 s
[Client]   1.0 MB : 141.41 Gbps |  17.68 GB/s  | 0.000059 s
[Client]  10.0 MB : 185.05 Gbps |  23.13 GB/s  | 0.000453 s
[Client]  16.0 MB : 187.25 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.45 Gbps |  23.31 GB/s  | 0.004499 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed
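
To compare the three runs above without reading the tables side by side, the [Client] result lines can be parsed and each CC run expressed as a percentage difference from the baseline (UCCL_P2P_RDMA_CC unset). This is a rough sketch, not part of the PR: parse_gbps and relative_diff_pct are hypothetical helpers, and the sample lines are the 1.0 MB and 100.0 MB rows copied from the logs above.

```python
import re

# Matches lines such as:
# [Client]   1.0 MB : 139.22 Gbps |  17.40 GB/s  | 0.000060 s
LINE = re.compile(
    r"\[Client\]\s+(?P<size>[\d.]+)\s*(?P<unit>B|KB|MB)\s*:\s*"
    r"(?P<gbps>[\d.]+) Gbps\s*\|\s*[\d.]+ GB/s\s*\|\s*[\d.]+ s"
)


def parse_gbps(log: str) -> dict:
    """Return {message size label: throughput in Gbps} for each result line."""
    return {
        f"{m['size']} {m['unit']}": float(m["gbps"])
        for m in LINE.finditer(log)
    }


def relative_diff_pct(baseline: dict, candidate: dict) -> dict:
    """Percentage difference of candidate vs. baseline for shared sizes."""
    return {
        size: 100.0 * (candidate[size] - base) / base
        for size, base in baseline.items()
        if size in candidate
    }


# Rows copied from the three runs in this PR description.
BASELINE = """
[Client]   1.0 MB : 141.41 Gbps |  17.68 GB/s  | 0.000059 s
[Client] 100.0 MB : 186.45 Gbps |  23.31 GB/s  | 0.004499 s
"""
SWIFT = """
[Client]   1.0 MB : 139.22 Gbps |  17.40 GB/s  | 0.000060 s
[Client] 100.0 MB : 186.41 Gbps |  23.30 GB/s  | 0.004500 s
"""
TIMELY = """
[Client]   1.0 MB : 140.82 Gbps |  17.60 GB/s  | 0.000060 s
[Client] 100.0 MB : 186.34 Gbps |  23.29 GB/s  | 0.004502 s
"""

if __name__ == "__main__":
    base = parse_gbps(BASELINE)
    for name, log in (("swift", SWIFT), ("timely", TIMELY)):
        for size, pct in relative_diff_pct(base, parse_gbps(log)).items():
            print(f"{name:6s} {size:>8s}: {pct:+.2f}% vs. baseline")
```

For the two sizes sampled here, both algorithms land within 2% of the baseline. Each run uses only 10 iterations, so differences of this size may well be noise.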

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • I have run format.sh to follow the style guidelines.
  • I have run build.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

* Moved CC algos to shared location.
* In P2P added support for Timely and Swift.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
@andrzej-k andrzej-k changed the title from "Allow EP and P2P to use congestion control - part 1 (extract CC algos)" to "[P2P] Added congestion control support (Timely, Swift)" on Mar 30, 2026