feat: Support scaling num_experts beyond block size in dispatch_send kernel #9

Open
crgg1433 wants to merge 1 commit into perplexityai:main from crgg1433:optimize-dispatch-scaling

@crgg1433

Summary:

Removed the assert(num_experts <= NUM_THREADS) constraint in a2a_dispatch_send_kernel by replacing the single-pass prefix scan with a chunked two-level warp scan. This decouples num_experts from the block thread count, so the kernel scales to larger expert counts (e.g., >1024) and supports MoE models with many experts.
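To illustrate the idea, here is a minimal CPU reference model of a chunked two-level scan, written in Python rather than as the actual kernel code. It is a sketch under assumptions: the function names, the warp size of 32, the exclusive-scan output, and the running `carry` between chunks are all hypothetical choices made for this example, not taken from the diff.

```python
WARP_SIZE = 32  # assumed warp width


def warp_inclusive_scan(vals):
    """Emulate a shuffle-style (Hillis-Steele) inclusive scan over one warp."""
    vals = list(vals)
    offset = 1
    while offset < WARP_SIZE:
        prev = vals[:]  # every lane reads the values from the previous step
        for lane in range(offset, len(vals)):
            vals[lane] = prev[lane] + prev[lane - offset]
        offset *= 2
    return vals


def block_exclusive_scan(chunk, num_threads):
    """Two-level scan of one chunk holding at most num_threads values."""
    assert num_threads % WARP_SIZE == 0 and len(chunk) <= num_threads
    num_warps = num_threads // WARP_SIZE
    assert num_warps <= WARP_SIZE  # per-warp totals must fit in one warp
    padded = list(chunk) + [0] * (num_threads - len(chunk))

    # Level 1: each warp scans its own 32 values.
    warp_scans = [
        warp_inclusive_scan(padded[w * WARP_SIZE:(w + 1) * WARP_SIZE])
        for w in range(num_warps)
    ]
    # Level 2: one warp scans the per-warp totals.
    totals = [scan[-1] for scan in warp_scans]
    totals_scan = warp_inclusive_scan(totals + [0] * (WARP_SIZE - num_warps))
    warp_offsets = [0] + totals_scan[:num_warps - 1]

    inclusive = [
        warp_offsets[w] + v for w in range(num_warps) for v in warp_scans[w]
    ]
    exclusive = [inc - x for inc, x in zip(inclusive, padded)]
    return exclusive[:len(chunk)], inclusive[-1]


def chunked_exclusive_scan(tokens_per_expert, num_threads=1024):
    """Scan any number of experts by walking chunks of num_threads values."""
    offsets, carry = [], 0
    for start in range(0, len(tokens_per_expert), num_threads):
        chunk = tokens_per_expert[start:start + num_threads]
        local, chunk_total = block_exclusive_scan(chunk, num_threads)
        offsets.extend(carry + v for v in local)
        carry += chunk_total  # prefix carried into the next chunk
    return offsets, carry
```

In this model, `chunked_exclusive_scan` accepts inputs longer than `num_threads`, which is the property the removed assertion used to rule out; the real kernel may organize the loop and the carry differently.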


Purpose:

The original implementation performs a single-pass prefix scan over tokens_per_expert, assigning one expert per thread. This design assumes num_experts <= NUM_THREADS (up to 1024 on Hopper), an assumption that no longer holds once the number of experts exceeds the block size.

In practice, num_experts represents the total number of all-to-all participants. When expert tensor parallelism is enabled, this value becomes total_experts × TP degree, which can easily exceed 1024. The previous restriction therefore prevented support for models with larger expert counts or with experts sharded across tensor-parallel groups.
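As a quick worked example of that arithmetic, the sketch below computes the effective num_experts and how many NUM_THREADS-sized chunks a chunked scan would need. The helper name and the sample values (256 experts, TP degree 8) are hypothetical and chosen only for illustration; 1024 is the block size quoted above.

```python
NUM_THREADS = 1024  # block size quoted in the description above


def scan_chunks_needed(total_experts, tp_degree, num_threads=NUM_THREADS):
    """Return (effective num_experts, number of scan chunks required)."""
    num_experts = total_experts * tp_degree
    chunks = -(-num_experts // num_threads)  # ceiling division
    return num_experts, chunks


# Hypothetical configuration: 256 experts sharded with TP degree 8.
print(scan_chunks_needed(256, 8))  # (2048, 2)
```

With these sample values the effective count is 2048, which the old one-expert-per-thread scan could not cover in a single 1024-thread block but a chunked scan handles in two passes.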


Test:

tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-FP32] [2026-03-16 22:59:54.755] 6410 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp7me8ygg9/pplx_garden_parallel_init, world_size=2
[2026-03-16 22:59:54.757] 6409 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp7me8ygg9/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 22:59:56.322] 6410 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 22:59:56.322] 6409 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 22:59:56.557] 6409 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 22:59:56.579] 6410 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 22:59:56.811] 6409 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 22:59:56.814] 6410 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 22:59:57.345] 6409 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 22:59:57.345] 6410 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-FP32-MIXED] [2026-03-16 22:59:59.803] 6615 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmpcr8147l4/pplx_garden_parallel_init, world_size=2
[2026-03-16 22:59:59.805] 6614 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmpcr8147l4/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:01.406] 6615 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:01.406] 6614 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:01.643] 6614 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:01.665] 6615 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:01.903] 6615 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:01.906] 6614 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:02.432] 6615 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:00:02.432] 6614 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-FP32-NVL] [2026-03-16 23:00:04.905] 6819 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp6d7h803q/pplx_garden_parallel_init, world_size=2
[2026-03-16 23:00:04.906] 6820 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp6d7h803q/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:06.425] 6820 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:06.425] 6819 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:06.911] 6819 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (2)
[2026-03-16 23:00:06.911] 6820 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (2)
[2026-03-16 23:00:07.153] 6819 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:07.154] 6820 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:07.970] 6820 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:00:07.970] 6819 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-BF16] [2026-03-16 23:00:10.507] 7045 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp4hzm6rhd/pplx_garden_parallel_init, world_size=2
[2026-03-16 23:00:10.508] 7044 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp4hzm6rhd/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:12.042] 7045 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:12.042] 7044 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:12.277] 7044 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:12.298] 7045 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:12.535] 7044 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:12.536] 7045 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:13.064] 7044 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:00:13.065] 7045 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-BF16-PADDED] [2026-03-16 23:00:15.530] 7250 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp6ocvt3qi/pplx_garden_parallel_init, world_size=2
[2026-03-16 23:00:15.531] 7249 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp6ocvt3qi/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:17.085] 7250 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:17.085] 7249 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:17.324] 7249 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:17.345] 7250 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:17.577] 7250 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:17.582] 7249 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:18.111] 7249 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:00:18.111] 7250 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-NIC1-FP8] [2026-03-16 23:00:20.644] 7455 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp7yfkqa6s/pplx_garden_parallel_init, world_size=2
[2026-03-16 23:00:20.645] 7454 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp7yfkqa6s/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:22.278] 7455 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:22.279] 7454 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:22.511] 7454 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:22.532] 7455 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:00:22.762] 7455 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:22.762] 7454 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:23.302] 7455 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:00:23.302] 7454 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP4-NIC1-FP32] [2026-03-16 23:00:26.286] 7660 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp30f2m7cn/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:26.543] 7659 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp30f2m7cn/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:26.711] 7662 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmp30f2m7cn/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:26.776] 7661 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmp30f2m7cn/pplx_garden_parallel_init, world_size=4
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-03-16 23:00:29.106] 7659 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:00:29.106] 7660 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:29.106] 7661 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[2026-03-16 23:00:29.106] 7662 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:29.507] 7661 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:29.509] 7659 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:29.510] 7660 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:29.530] 7662 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:29.972] 7661 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:29.975] 7659 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:29.975] 7660 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:29.976] 7662 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:31.186] 7659 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:00:31.196] 7661 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:00:31.196] 7660 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:00:31.199] 7662 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP4-NIC2-BF16] [2026-03-16 23:00:34.612] 8087 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmp2qadoz15/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:34.622] 8084 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp2qadoz15/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:34.748] 8085 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp2qadoz15/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:34.748] 8086 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmp2qadoz15/pplx_garden_parallel_init, world_size=4
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-03-16 23:00:37.309] 8084 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:00:37.309] 8085 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:37.309] 8087 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[2026-03-16 23:00:37.309] 8086 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:37.745] 8085 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:37.754] 8084 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:37.756] 8086 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:37.768] 8087 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (1)
[2026-03-16 23:00:39.679] 8087 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:39.700] 8084 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:39.715] 8086 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:39.733] 8085 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:40.370] 8087 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
[2026-03-16 23:00:40.380] 8084 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:00:40.390] 8085 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:00:40.390] 8086 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP4-DP2-NIC1-BF16] [2026-03-16 23:00:43.626] 8510 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp8ohxqu0b/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:43.856] 8512 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmp8ohxqu0b/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:43.988] 8511 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmp8ohxqu0b/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:00:44.003] 8509 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp8ohxqu0b/pplx_garden_parallel_init, world_size=4
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-03-16 23:00:46.522] 8510 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:46.522] 8509 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:00:46.522] 8512 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[2026-03-16 23:00:46.522] 8511 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:00:47.133] 8510 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:00:47.138] 8509 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:00:47.290] 8512 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:00:47.292] 8511 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:00:47.707] 8509 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:47.708] 8510 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:47.710] 8511 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:47.711] 8512 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:00:49.081] 8509 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:00:49.092] 8511 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:00:49.092] 8512 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
[2026-03-16 23:00:49.092] 8510 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP8-BF16] [2026-03-16 23:00:53.429] 8929 INFO     pplx_garden.distributed.process_group [rank=7] Initializing global process group. device=cuda:7, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:53.858] 8922 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.031] 8925 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.318] 8926 INFO     pplx_garden.distributed.process_group [rank=4] Initializing global process group. device=cuda:4, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.410] 8923 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.551] 8928 INFO     pplx_garden.distributed.process_group [rank=6] Initializing global process group. device=cuda:6, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.553] 8927 INFO     pplx_garden.distributed.process_group [rank=5] Initializing global process group. device=cuda:5, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:00:54.738] 8924 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmph0n5x4bi/pplx_garden_parallel_init, world_size=8
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2026-03-16 23:00:58.303] 8922 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:00:58.303] 8925 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[2026-03-16 23:00:58.303] 8929 INFO     pplx_garden.distributed.process_group [rank=7] Initialized global process group.
[2026-03-16 23:00:58.303] 8928 INFO     pplx_garden.distributed.process_group [rank=6] Initialized global process group.
[2026-03-16 23:00:58.303] 8924 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[2026-03-16 23:00:58.303] 8926 INFO     pplx_garden.distributed.process_group [rank=4] Initialized global process group.
[2026-03-16 23:00:58.303] 8923 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:00:58.303] 8927 INFO     pplx_garden.distributed.process_group [rank=5] Initialized global process group.
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:00:58.982] 8925 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:58.985] 8927 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:58.989] 8926 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:58.990] 8923 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:58.994] 8928 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:59.003] 8924 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:59.004] 8922 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:00:59.012] 8929 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:01.167] 8923 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.203] 8924 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.233] 8928 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.234] 8929 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.242] 8927 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.258] 8925 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.285] 8926 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:01.517] 8922 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:02.834] 8922 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:01:02.864] 8928 INFO     pplx_garden.distributed.process_group [rank=6] Destroyed global process group.
[2026-03-16 23:01:02.904] 8923 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:01:02.915] 8924 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:01:02.925] 8925 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
[2026-03-16 23:01:02.935] 8926 INFO     pplx_garden.distributed.process_group [rank=4] Destroyed global process group.
[2026-03-16 23:01:02.945] 8927 INFO     pplx_garden.distributed.process_group [rank=5] Destroyed global process group.
[2026-03-16 23:01:02.945] 8929 INFO     pplx_garden.distributed.process_group [rank=7] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP8-FP8] [2026-03-16 23:01:07.746] 9842 INFO     pplx_garden.distributed.process_group [rank=7] Initializing global process group. device=cuda:7, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:08.397] 9840 INFO     pplx_garden.distributed.process_group [rank=5] Initializing global process group. device=cuda:5, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:08.445] 9841 INFO     pplx_garden.distributed.process_group [rank=6] Initializing global process group. device=cuda:6, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:08.573] 9837 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:08.666] 9838 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:08.683] 9839 INFO     pplx_garden.distributed.process_group [rank=4] Initializing global process group. device=cuda:4, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:09.016] 9835 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[2026-03-16 23:01:09.069] 9836 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmpr8ki2xcc/pplx_garden_parallel_init, world_size=8
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2026-03-16 23:01:12.700] 9840 INFO     pplx_garden.distributed.process_group [rank=5] Initialized global process group.
[2026-03-16 23:01:12.700] 9842 INFO     pplx_garden.distributed.process_group [rank=7] Initialized global process group.
[2026-03-16 23:01:12.700] 9835 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:01:12.700] 9838 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[2026-03-16 23:01:12.700] 9837 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[2026-03-16 23:01:12.700] 9839 INFO     pplx_garden.distributed.process_group [rank=4] Initialized global process group.
[2026-03-16 23:01:12.700] 9841 INFO     pplx_garden.distributed.process_group [rank=6] Initialized global process group.
[2026-03-16 23:01:12.700] 9836 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:01:13.359] 9837 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.367] 9838 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.368] 9840 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.368] 9836 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.372] 9839 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.382] 9835 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.382] 9841 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:13.386] 9842 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (8) + NVLink (1)
[2026-03-16 23:01:17.257] 9838 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.279] 9837 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.289] 9842 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.289] 9841 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.313] 9839 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.351] 9835 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:17.609] 9840 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:20.354] 9836 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:21.793] 9836 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:01:21.803] 9841 INFO     pplx_garden.distributed.process_group [rank=6] Destroyed global process group.
[2026-03-16 23:01:21.813] 9835 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:01:21.823] 9838 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
[2026-03-16 23:01:21.823] 9837 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:01:21.833] 9839 INFO     pplx_garden.distributed.process_group [rank=4] Destroyed global process group.
[2026-03-16 23:01:21.843] 9842 INFO     pplx_garden.distributed.process_group [rank=7] Destroyed global process group.
[2026-03-16 23:01:21.843] 9840 INFO     pplx_garden.distributed.process_group [rank=5] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP4-FP8-NVL2] [2026-03-16 23:01:25.850] 10750 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmpgeyr2uee/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:25.958] 10748 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmpgeyr2uee/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:25.970] 10749 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmpgeyr2uee/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:26.010] 10751 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmpgeyr2uee/pplx_garden_parallel_init, world_size=4
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-03-16 23:01:28.446] 10748 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:01:28.446] 10749 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:01:28.446] 10750 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[2026-03-16 23:01:28.446] 10751 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:01:29.220] 10748 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:29.227] 10749 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:29.370] 10750 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:29.372] 10751 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:31.352] 10748 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:31.359] 10749 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:31.391] 10750 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:31.463] 10751 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:33.034] 10748 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:01:33.044] 10749 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:01:33.052] 10750 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:01:33.054] 10751 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP4-DP2-NVL2] [2026-03-16 23:01:36.371] 11222 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmpiwixp19o/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:36.560] 11221 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmpiwixp19o/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:36.566] 11223 INFO     pplx_garden.distributed.process_group [rank=2] Initializing global process group. device=cuda:2, init_method=file:///tmp/tmpiwixp19o/pplx_garden_parallel_init, world_size=4
[2026-03-16 23:01:36.801] 11224 INFO     pplx_garden.distributed.process_group [rank=3] Initializing global process group. device=cuda:3, init_method=file:///tmp/tmpiwixp19o/pplx_garden_parallel_init, world_size=4
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-03-16 23:01:39.157] 11221 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[2026-03-16 23:01:39.157] 11222 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:01:39.157] 11224 INFO     pplx_garden.distributed.process_group [rank=3] Initialized global process group.
[2026-03-16 23:01:39.157] 11223 INFO     pplx_garden.distributed.process_group [rank=2] Initialized global process group.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:01:40.224] 11221 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:40.238] 11222 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:40.319] 11223 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:40.329] 11224 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (4) + NVLink (2)
[2026-03-16 23:01:40.760] 11223 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:40.761] 11224 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:40.764] 11221 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:40.766] 11222 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:42.246] 11221 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
[2026-03-16 23:01:42.250] 11222 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:01:42.255] 11223 INFO     pplx_garden.distributed.process_group [rank=2] Destroyed global process group.
[2026-03-16 23:01:42.261] 11224 INFO     pplx_garden.distributed.process_group [rank=3] Destroyed global process group.
PASSED
tests/p2p_all_to_all/test_p2p_all_to_all.py::test_p2p_all_to_all[TP2-EMPTY] [2026-03-16 23:01:44.989] 11682 INFO     pplx_garden.distributed.process_group [rank=0] Initializing global process group. device=cuda:0, init_method=file:///tmp/tmp14mj98yr/pplx_garden_parallel_init, world_size=2
[2026-03-16 23:01:45.009] 11683 INFO     pplx_garden.distributed.process_group [rank=1] Initializing global process group. device=cuda:1, init_method=file:///tmp/tmp14mj98yr/pplx_garden_parallel_init, world_size=2
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-03-16 23:01:46.548] 11683 INFO     pplx_garden.distributed.process_group [rank=1] Initialized global process group.
[2026-03-16 23:01:46.548] 11682 INFO     pplx_garden.distributed.process_group [rank=0] Initialized global process group.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-03-16 23:01:46.763] 11682 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:01:46.784] 11683 INFO     pplx_garden.kernels.p2p_all_to_all Setting up RDMA (2) + NVLink (1)
[2026-03-16 23:01:47.010] 11682 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:47.014] 11683 INFO     tests.p2p_all_to_all.test_p2p_all_to_all Stopping all-to-all
[2026-03-16 23:01:47.571] 11683 INFO     pplx_garden.distributed.process_group [rank=1] Destroyed global process group.
[2026-03-16 23:01:47.571] 11682 INFO     pplx_garden.distributed.process_group [rank=0] Destroyed global process group.
PASSED

======================================================================================================== warnings summary ========================================================================================================
../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1397
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1397: PytestConfigWarning: Unknown config option: asyncio_default_fixture_loop_scope

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================================================== 14 passed, 1 warning in 119.11s (0:01:59) ============================================================================================

…kernel

Replace the single-pass prefix scan with a chunked two-level warp scan
so that num_experts can scale independently of the block size.
This enables support for MoE models with larger expert counts.
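
The kernel diff itself is not reproduced on this page, so the following is only a minimal host-side sketch of what a chunked two-level warp scan could look like, simulated in plain Python rather than CUDA. The function names, the `num_threads` parameter, and the choice of an exclusive scan are assumptions made for illustration and are not taken from the PR. The list-based `warp_inclusive_scan` stands in for the shuffle-based steps a real warp would execute.

```python
from itertools import accumulate

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def warp_inclusive_scan(lanes):
    """Hillis-Steele inclusive scan over one warp's values (hypothetical helper)."""
    vals = list(lanes)
    offset = 1
    while offset < len(vals):
        # Each lane adds the pre-step value held by the lane `offset` below it.
        vals = [v + (vals[i - offset] if i >= offset else 0)
                for i, v in enumerate(vals)]
        offset *= 2
    return vals


def chunked_exclusive_scan(tokens_per_expert, num_threads=64):
    """Exclusive prefix sum over any number of experts, num_threads per pass."""
    assert num_threads % WARP_SIZE == 0
    out = []
    carry = 0  # running total contributed by all earlier chunks
    for start in range(0, len(tokens_per_expert), num_threads):
        chunk = tokens_per_expert[start:start + num_threads]
        # Level 1: every warp scans its own lanes independently.
        warps = [warp_inclusive_scan(chunk[w:w + WARP_SIZE])
                 for w in range(0, len(chunk), WARP_SIZE)]
        # Level 2: scan the per-warp totals to get each warp's base offset.
        totals = warp_inclusive_scan([w[-1] for w in warps])
        bases = [0] + totals[:-1]
        for w_idx, scanned in enumerate(warps):
            lo = w_idx * WARP_SIZE
            for lane, incl in enumerate(scanned):
                # Subtract the lane's own input to turn inclusive into exclusive.
                out.append(carry + bases[w_idx] + incl - chunk[lo + lane])
        carry += totals[-1]  # the next chunk starts from this total
    return out


# Usage: 2048 experts scanned by a simulated block of 1024 threads (two chunks).
counts = [(i * 7) % 5 for i in range(2048)]
reference = [0] + list(accumulate(counts))[:-1]
assert chunked_exclusive_scan(counts, num_threads=1024) == reference
```

The point of the sketch is that the block size only sets the chunk width: the `carry` value links one chunk to the next, so the number of experts is bounded by the loop rather than by the thread count.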
@crgg1433 crgg1433 marked this pull request as draft March 17, 2026 00:45
@crgg1433 crgg1433 marked this pull request as ready for review March 17, 2026 00:47
@crgg1433
Author

Hi @abcdabcd987,

This PR removes the hard num_experts <= NUM_THREADS constraint in the dispatch_send kernel by introducing a chunked two-level warp scan. The kernel can now handle expert counts above 1024, which the block-size limit previously ruled out.

Tests are green. Could you please take a look? Thanks!

@crgg1433
Author

crgg1433 commented Apr 1, 2026

Hi @abcdabcd987 @nandor, just following up on this PR. It would be great to get your feedback when you have time. Let me know if anything is blocking!
