feat: Support scaling num_experts beyond block size in dispatch_send kernel #9
Open
crgg1433 wants to merge 1 commit into perplexityai:main from
Conversation
…kernel

Replace the single-pass prefix scan with a chunked two-level warp scan, allowing num_experts to scale independently of the block size and enabling support for MoE models with larger expert counts.
Author
Hi @abcdabcd987, this PR removes the `assert(num_experts <= NUM_THREADS)` constraint in `a2a_dispatch_send_kernel`. Tests are green. Could you please take a look? Thanks!
Author
Hi @abcdabcd987 @nandor, just following up on this PR. It would be great to get your feedback when you have time. Let me know if anything is blocking!
Summary:
Removed the `assert(num_experts <= NUM_THREADS)` constraint in `a2a_dispatch_send_kernel` by replacing the single-pass prefix scan with a chunked two-level warp scan. This decouples `num_experts` from the block thread count, allowing the kernel to scale to larger expert counts (e.g., >1024) and enabling support for MoE with many experts.
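For illustration, here is a minimal self-contained sketch of the chunked two-level warp-scan idea: each chunk of `NUM_THREADS` counts is scanned in-block (a shuffle-based warp scan, then a scan over the per-warp totals), and a running carry links consecutive chunks so `num_experts` can grow past the block size. All identifiers and parameters here (`warp_inclusive_scan`, `block_exclusive_scan`, `scan_tokens_per_expert`, `NUM_THREADS = 256`, the single-block launch) are hypothetical stand-ins, not the PR's actual code.

```cuda
// Hypothetical sketch of a chunked two-level warp scan; not the PR's code.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_THREADS = 256;  // block size; num_experts may exceed this
constexpr int WARP_SIZE   = 32;
constexpr int NUM_WARPS   = NUM_THREADS / WARP_SIZE;

// Level 1: inclusive prefix scan within a warp using register shuffles.
__device__ int warp_inclusive_scan(int v) {
  for (int offset = 1; offset < WARP_SIZE; offset *= 2) {
    int n = __shfl_up_sync(0xffffffff, v, offset);
    if ((threadIdx.x % WARP_SIZE) >= offset) v += n;
  }
  return v;
}

// Level 2: scan the per-warp totals so the result covers the whole block.
// Returns this thread's exclusive prefix within the block.
__device__ int block_exclusive_scan(int v) {
  __shared__ int warp_sums[NUM_WARPS];
  const int lane = threadIdx.x % WARP_SIZE;
  const int warp = threadIdx.x / WARP_SIZE;

  int inc = warp_inclusive_scan(v);
  if (lane == WARP_SIZE - 1) warp_sums[warp] = inc;  // stash warp total
  __syncthreads();

  if (warp == 0) {  // first warp scans the warp totals
    int s = (lane < NUM_WARPS) ? warp_sums[lane] : 0;
    s = warp_inclusive_scan(s);
    if (lane < NUM_WARPS) warp_sums[lane] = s;
  }
  __syncthreads();

  int warp_prefix = (warp > 0) ? warp_sums[warp - 1] : 0;
  return warp_prefix + inc - v;  // inclusive -> exclusive
}

// Chunked driver: walk the counts NUM_THREADS at a time, carrying the
// running total across chunks so num_experts is decoupled from block size.
__global__ void scan_tokens_per_expert(const int* tokens_per_expert,
                                       int* expert_offsets, int num_experts) {
  __shared__ int carry;  // running total of all previous chunks
  if (threadIdx.x == 0) carry = 0;
  __syncthreads();

  for (int base = 0; base < num_experts; base += NUM_THREADS) {
    int e = base + threadIdx.x;
    int count = (e < num_experts) ? tokens_per_expert[e] : 0;  // pad with 0
    int prefix = block_exclusive_scan(count);
    if (e < num_experts) expert_offsets[e] = carry + prefix;
    __syncthreads();
    if (threadIdx.x == NUM_THREADS - 1) carry += prefix + count;  // chunk sum
    __syncthreads();
  }
}

int main() {
  const int num_experts = 2048;  // deliberately larger than NUM_THREADS
  static int h_counts[num_experts], h_offsets[num_experts];
  for (int i = 0; i < num_experts; ++i) h_counts[i] = i % 7;

  int *d_counts, *d_offsets;
  cudaMalloc(&d_counts, num_experts * sizeof(int));
  cudaMalloc(&d_offsets, num_experts * sizeof(int));
  cudaMemcpy(d_counts, h_counts, sizeof(h_counts), cudaMemcpyHostToDevice);

  // Single block for clarity; the scan itself is block-local.
  scan_tokens_per_expert<<<1, NUM_THREADS>>>(d_counts, d_offsets, num_experts);
  cudaMemcpy(h_offsets, d_offsets, sizeof(h_offsets), cudaMemcpyDeviceToHost);

  printf("offset[%d] = %d\n", num_experts - 1, h_offsets[num_experts - 1]);
  cudaFree(d_counts);
  cudaFree(d_offsets);
  return 0;
}
```

The carry update by the last thread is what removes the `num_experts <= NUM_THREADS` assumption: each iteration folds one chunk's total into the running offset before the next chunk is scanned.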
Purpose:
The original implementation performs a single-pass prefix scan over `tokens_per_expert`, assigning one expert per thread. This design assumes `num_experts <= NUM_THREADS` (up to 1024 on Hopper), which breaks when the number of experts exceeds the block size.

In practice, `num_experts` represents the total number of all-to-all participants. When expert tensor parallelism is enabled, this value becomes `total_experts × TP degree`, which can easily exceed 1024 (for example, 256 experts with a TP degree of 8 yields 2048 participants). The previous restriction therefore prevented support for models with larger expert counts or sharded experts across tensor-parallel groups.

Test: