Understanding PCIe topology is critical for multi-GPU inference on RTX PRO 6000 Blackwell systems. The path data takes between GPUs directly determines AllReduce latency, which dominates decode throughput for tensor-parallel inference.
- PCIe Topology Basics
- Direct-Attach vs Switch Topologies
- AMD Turin vs Genoa
- PCIe Switches
- nvidia-smi Topology Output
- Impact on Inference Performance
- Topology Selection Guide
In a multi-GPU system, each GPU connects to the CPU(s) through PCIe lanes. The path between any two GPUs determines the communication latency and bandwidth available for tensor-parallel operations like AllReduce.
Key concepts:
| Term | Meaning |
|---|---|
| Root Complex | The PCIe controller inside the CPU die. All PCIe lanes originate here. |
| Root Port | An individual x16 slot's connection point on the root complex. |
| PCIe Switch | An external chip that multiplexes multiple x16 downstream ports onto a single x16 (or wider) upstream link to the CPU. |
| P2P (Peer-to-Peer) | Direct GPU-to-GPU data transfer without bouncing through system RAM. |
| NUMA Node | A CPU socket and its directly-attached memory. Cross-NUMA transfers traverse the inter-socket fabric (xGMI on AMD). |
| PHB | PCIe Host Bridge -- GPUs sharing the same root complex but different root ports. |
| PIX | GPUs sharing the same PCIe switch. |
| SYS | GPUs separated by the inter-socket (cross-NUMA) fabric link. |
Each GPU connects directly to a CPU root port. With dual AMD EPYC CPUs, 128 PCIe Gen5 lanes (64 per socket) are available for GPUs, supporting 4 GPUs per socket at x16 each. (Each socket offers 128 lanes on its own, but in a dual-socket configuration a portion of those lanes is repurposed as xGMI inter-socket links.)
```
+---------+    xGMI (192-256 GB/s)    +---------+
|  CPU 0  |<==========================>|  CPU 1  |
| (NUMA0) |                            | (NUMA1) |
+----+----+                            +----+----+
  / / \ \                                / / \ \
x16 x16 x16 x16                    x16 x16 x16 x16
 |   |   |   |                      |   |   |   |
GPU0 GPU1 GPU2 GPU3            GPU4 GPU5 GPU6 GPU7
```
Same-NUMA P2P: ~0.36 us latency, ~56 GB/s unidirectional
Cross-NUMA P2P: ~0.44 us latency (Turin) / ~14 us (Genoa w/o P2P)
Advantages:
- 128 total PCIe lanes directly to CPUs and local RAM
- Best for KV cache offload to system RAM (no shared uplink bottleneck)
- Best for training workloads requiring heavy RAM-to-GPU throughput
- Simpler hardware (no switch boards to buy)
Disadvantages:
- Cross-NUMA GPU traffic traverses xGMI fabric
- Two CPUs cost more and consume more power
- NCCL may misdetect AMD xGMI bandwidth (requires tuning)
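The tuning that the last point alludes to is done through NCCL's environment variables. A hedged sketch of the kind of settings involved (`NCCL_MIN_NCHANNELS=8` appears in the benchmark table later in this document; `NCCL_P2P_LEVEL` is a documented NCCL knob, but these values are illustrative starting points, not measured optima for any specific machine):

```shell
# Illustrative NCCL tuning for dual-socket direct-attach EPYC systems.
# Verify the effect with nccl-tests (all_reduce_perf) on your own hardware.
export NCCL_P2P_LEVEL=SYS      # permit P2P even across the xGMI socket link
export NCCL_MIN_NCHANNELS=8    # more channels to better saturate the fabric
```

Run these in the shell that launches the inference server so they are inherited by the NCCL-using process.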
GPUs connect to PCIe switches, which provide full-bandwidth P2P between GPUs on the same switch without going through the CPU.
```
            +-------+
            |  CPU  |
            +---+---+
         x16 |     | x16
       +-----+     +-----+
       |                 |
  +----+----+       +----+----+
  |  Switch |       |  Switch |
  |  (100L) |       |  (100L) |
  +-+-+-+-+-+       +-+-+-+-+-+
   / |  |  \         / |  |  \
 x16 x16 x16 x16  x16 x16 x16 x16
  |   |   |   |    |   |   |   |
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
```
Same-switch P2P: ~0.45 us, ~56 GB/s (full x16 each, simultaneously)
Cross-switch P2P: traverses CPU root complex
Advantages:
- GPU-to-GPU P2P within a switch uses full x16 bandwidth each, simultaneously, without sharing
- ~100 ns cut-through latency on PCIe switches vs 1-2 us on CPU root port
- Single CPU reduces cost, power, and complexity
- Avoids cross-NUMA overhead entirely (single socket)
Disadvantages:
- Shared x16 uplink to CPU limits host RAM bandwidth (bottleneck for KV cache offload)
- Switch hardware cost (EUR 2,000-2,400 per 100-lane switch)
- Cross-switch traffic still routes through CPU
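The shared-uplink disadvantage is easy to quantify. A back-of-envelope sketch (the ~56 GB/s usable figure comes from the P2P measurements above; the per-GPU split assumes all GPUs behind one switch stream from host RAM simultaneously):

```python
# Host-RAM bandwidth per GPU: shared x16 Gen5 switch uplink vs direct attach.
UPLINK_GBPS = 56.0      # usable unidirectional bandwidth of one x16 Gen5 link
GPUS_PER_SWITCH = 4

# Switch topology: all GPUs behind one switch share its single x16 uplink.
shared_per_gpu = UPLINK_GBPS / GPUS_PER_SWITCH

# Direct-attach: every GPU has its own x16 link to the CPU/RAM.
direct_per_gpu = UPLINK_GBPS

print(f"switch topology: {shared_per_gpu:.0f} GB/s per GPU to host RAM")
print(f"direct-attach:   {direct_per_gpu:.0f} GB/s per GPU to host RAM")
```

This 4x gap is why the switch topology wins for GPU-to-GPU traffic but loses for KV cache offload and training.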
A variant of the switch topology: a single large 160-lane switch with 128 lanes for the GPUs plus a x32 uplink to the CPU/RAM.
- Still loses to Turin dual-CPU for memory-heavy workloads (32 shared lanes vs 128 direct lanes)
- Good for pure GPU-to-GPU inference where RAM bandwidth is not critical
The CPU generation makes a dramatic difference for cross-NUMA GPU communication.
| Feature | AMD EPYC Genoa (9004) | AMD EPYC Turin 9575F (9005) |
|---|---|---|
| xGMI Version | xGMI 3.0 | xGMI 3.0 (wider) |
| xGMI Links | 3x | 3x |
| xGMI Bandwidth | ~96 GB/s | 192-256 GB/s actual |
| Cross-NUMA P2P Latency | ~14 us (P2P disabled) | ~0.44 us (P2P enabled) |
| PCIe Lanes per Socket | 128 (Gen5) | 128 (Gen5) |
| DDR5 Speed | 4800 MHz | 6400 MHz |
| DIMM Channels per CPU | 12 | 12 |
Key finding: Turin's ultra-fast xGMI3 interconnect drastically lowers cross-socket latency. With proper NCCL tuning, Turin dual-CPU direct-attach setups reach parity with PCIe switch topologies for inference -- something that was impossible on Genoa.
"Direct-attach on Turin beats 160-lane switch for training and KV cache offload workloads." -- Community consensus
DRAM channel count and speed affect cross-NUMA bandwidth. Under-populated DRAM reduces available bandwidth:
| System | DIMM Config | xGMI Links | Cross-NUMA BW (bidir) |
|---|---|---|---|
| Festr (Turin) | 24x96 GB DDR5-6400, all 12 channels/CPU | 3x | ~99+ GB/s |
| orangezed (Genoa) | 10x48 GB DDR5-4800, 5 channels/CPU | 2x | ~64 GB/s |
Note: orangezed initially reported ~9 tok/s at 100K context for Kimi K2.5, but this was a measurement error — wall-clock time (including prefill) was divided by tokens generated. Actual decode-only throughput (from vLLM stats log) was 30-35 tok/s, comparable to other Genoa systems. The DRAM/xGMI differences do impact cross-NUMA bandwidth, but the effect on single-batch decode is smaller than initially believed.
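The measurement error in the note above is worth making concrete: when wall-clock time includes prefill, every generated token is charged a share of the prefill cost. A sketch with assumed numbers (the prefill time and token count here are hypothetical, chosen only to illustrate how a ~30 tok/s decode rate can masquerade as ~9 tok/s; they are not orangezed's actual measurements):

```python
# Why dividing wall-clock time by generated tokens understates decode speed.
prefill_s = 250.0     # assumed time to prefill a 100K-token prompt
decode_tok = 3000     # assumed number of tokens generated
decode_tps = 32.0     # true decode-only rate (as reported by engine stats)

decode_s = decode_tok / decode_tps
naive_tps = decode_tok / (prefill_s + decode_s)   # the flawed calculation

print(f"naive: {naive_tps:.1f} tok/s vs actual decode: {decode_tps} tok/s")
```

Always take decode throughput from the engine's own stats log (e.g. vLLM's), never from total wall-clock time.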
- Price: ~EUR 2,420 incl. VAT / EUR 2,000 ex. VAT
- Source: c-payne.com
- Connects to motherboard via 2x MCIO connectors using a PCIe Gen5 x16 port
- All downstream MCIO ports appear as separate PCIe devices
- Shared uplink cap: x16 Gen5 total to CPU when all devices talk to CPU simultaneously
- GPU-to-GPU traffic within the switch: x16 each, simultaneously, without sharing bandwidth
- A Gen4 x16 device occupies only the equivalent of Gen5 x8 on the uplink (the switch translates between link rates)
- Supports 2-partition mode and dual-CPU bridge mode
- Smaller 52-lane version available (cheaper); software-configurable partitions
- Fits 8 GPUs: 8 x 16 = 128 lanes for GPUs, 16 lanes for CPU uplink
- Excellent for inference (GPU P2P dominant)
- Bottlenecked for training (RAM-to-GPU throughput limited by x16 uplink)
- Found in ASUS ESC8000A-E13P servers (2 physical chips, each with 2 partitions)
- ACS must be disabled on both root ports AND switch downstream ports for full P2P performance — see detailed guide
- With ACS disabled: interconnect score 0.96 (4 GPU) / 0.90 (8 GPU) — significantly higher than Microchip switches
- Products: P5-SW104, P5-SW144 (Gen5)
- URL: guavasystems.us
- PCIe Gen6 switch expected June/July 2026
- 4 PCIe switches, 2 GPUs + 2 NICs per switch = 8 GPUs total
- Each GPU gets its own dedicated 800 Gbps NIC (~100 GB/s direct GPU-to-GPU across the Ethernet fabric)
- ConnectX-8 NICs have built-in PCIe switches
- 9th PCIe slot for DPU (ARM cores, can run GPU-only box without host CPU)
```
$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3
GPU0   X    NODE  NODE  NODE
GPU1  NODE   X    NODE  NODE
GPU2  NODE  NODE   X    NODE
GPU3  NODE  NODE  NODE   X
```
All GPUs show NODE -- connected via PCIe through the same NUMA node's root complex.
```
$ nvidia-smi topo -m
      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0   X   PHB  PHB  PHB  SYS  SYS  SYS  SYS
GPU1  PHB   X   PHB  PHB  SYS  SYS  SYS  SYS
GPU2  PHB  PHB   X   PHB  SYS  SYS  SYS  SYS
GPU3  PHB  PHB  PHB   X   SYS  SYS  SYS  SYS
GPU4  SYS  SYS  SYS  SYS   X   PHB  PHB  PHB
GPU5  SYS  SYS  SYS  SYS  PHB   X   PHB  PHB
GPU6  SYS  SYS  SYS  SYS  PHB  PHB   X   PHB
GPU7  SYS  SYS  SYS  SYS  PHB  PHB  PHB   X
```
- `PHB` = same NUMA node, different PCIe Host Bridge root ports
- `SYS` = cross-NUMA, traverses the xGMI fabric between CPU sockets
```
$ nvidia-smi topo -m
      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0   X   PIX  PIX  PIX  PHB  PHB  PHB  PHB
GPU1  PIX   X   PIX  PIX  PHB  PHB  PHB  PHB
GPU2  PIX  PIX   X   PIX  PHB  PHB  PHB  PHB
GPU3  PIX  PIX  PIX   X   PHB  PHB  PHB  PHB
GPU4  PHB  PHB  PHB  PHB   X   PIX  PIX  PIX
GPU5  PHB  PHB  PHB  PHB  PIX   X   PIX  PIX
GPU6  PHB  PHB  PHB  PHB  PIX  PIX   X   PIX
GPU7  PHB  PHB  PHB  PHB  PIX  PIX  PIX   X
```
- `PIX` = same PCIe switch (lowest latency P2P)
- `PHB` = different switches, routed through CPU root complex
| Code | Meaning | Typical Latency |
|---|---|---|
| `PIX` | Same PCIe switch | ~0.45 us |
| `PHB` | Same NUMA, PCIe Host Bridge | ~0.36-0.45 us |
| `NODE` | Same NUMA node | ~0.36-0.45 us |
| `SYS` | Cross-NUMA (xGMI fabric) | ~0.44 us (Turin), ~14 us (no P2P) |
Tensor-parallel inference performs AllReduce operations after every attention and MoE layer. For models like Qwen3.5-397B or Kimi K2.5, these are small messages (32-256 KB) where latency dominates over bandwidth. A 1 us reduction in AllReduce latency translates directly to faster token generation.
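The latency dominance can be sketched with the standard alpha-beta cost model for a ring AllReduce. The 0.45 us and 56 GB/s figures come from the measurements above; the ring model itself is a textbook approximation, not a measurement of these systems (real NCCL channel counts and vLLM's custom kernels change the constants):

```python
# Ring AllReduce cost model: T = 2(N-1)*alpha + (2(N-1)/N) * M/beta
# alpha = per-hop latency, beta = per-link bandwidth, M = message bytes.
def ring_allreduce_us(msg_bytes, n_gpus=8, alpha_s=0.45e-6, beta_Bps=56e9):
    lat = 2 * (n_gpus - 1) * alpha_s                        # latency term
    bw = 2 * (n_gpus - 1) / n_gpus * msg_bytes / beta_Bps   # bandwidth term
    return lat * 1e6, bw * 1e6  # microseconds

for kb in (32, 128, 256):
    lat_us, bw_us = ring_allreduce_us(kb * 1024)
    share = 100 * lat_us / (lat_us + bw_us)
    print(f"{kb:4d} KB: latency {lat_us:.1f} us + bandwidth {bw_us:.1f} us "
          f"({share:.0f}% latency)")
```

At 32 KB the fixed latency term is the large majority of the total, which is why shaving per-hop latency (switch P2P, Turin xGMI) moves decode throughput more than adding bandwidth.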
| Setup | Bus BW (GB/s) | Notes |
|---|---|---|
| luke (3x switches, single CPU) | 41.1 | Highest measured |
| Grimulkan (4x switches, single CPU) | 39.4 | With NCCL_MIN_NCHANNELS=8 |
| Festr (dual Turin, no switches) | 37.6 | After NCCL tuning |
| Festr (dual Turin, default NCCL) | 22.2 | Before tuning |
Expert Parallelism (EP) was tested and found to be consistently slower or equivalent on PCIe setups:
"EP is dead in the water for these setups" -- luke
EP requires massive inter-GPU bandwidth for routing tokens to experts. Without NVLink (~900 GB/s), PCIe (~56 GB/s unidirectional) cannot keep up.
Exception: luke achieved ~350 tok/s with EP=8 on Qwen3.5 using a custom allreduce kernel on PCIe switches, but this required significant custom patches.
| Use Case | Recommended Topology | Why |
|---|---|---|
| 4-GPU inference | Direct-attach (any CPU) | Simplest, no switches needed |
| 8-GPU inference (lowest latency) | PCIe switches, single CPU | Avoids cross-NUMA; switch P2P is fastest |
| 8-GPU inference (budget) | Dual Turin direct-attach | Reaches parity with switches after NCCL tuning |
| 8-GPU + KV cache offload to RAM | Dual Turin direct-attach | 128 direct PCIe lanes to RAM vs shared x16 uplink |
| 16-GPU inference | 4x switches, single CPU | Proven by Grimulkan; avoids dual-CPU complexity |
| Training / fine-tuning | Dual Turin direct-attach | RAM bandwidth critical; 128 lanes beats shared switch uplink |
PCIe switches with ACS enabled (the default) redirect P2P traffic through the CPU root port instead of routing it directly within the switch fabric. This can halve same-switch bandwidth and completely negate the advantage of PCIe switches.
ACS must be disabled on both root ports and switch downstream ports. See the ASUS ESC8000A-E13P guide for detailed instructions.
Note: The kernel parameter `pcie_acs_override=downstream,multifunction` requires a patched kernel and does NOT work on stock Ubuntu kernels. Use `setpci` at runtime instead.
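The runtime approach clears the ACS Control register (offset 0x6 into the ACS extended capability) on each affected bridge with `setpci`. A small helper sketch that turns a bridge list into the corresponding commands (the `ECAP_ACS+0x6.w=0000` form uses pciutils' named-capability syntax; the BDFs below are placeholders, find the real root ports and switch downstream ports with `lspci -tv`):

```python
# Generate `setpci` commands that clear all ACS control bits on a list of
# PCIe bridge BDFs. Run the printed commands as root. BDFs are examples only.
def acs_disable_cmds(bridge_bdfs):
    return [f"setpci -s {bdf} ECAP_ACS+0x6.w=0000" for bdf in bridge_bdfs]

# Hypothetical root port plus two switch downstream ports:
for cmd in acs_disable_cmds(["c0:01.1", "c1:00.0", "c1:01.0"]):
    print(cmd)
```

Note that the settings revert on reboot, so the commands belong in a boot-time service or script.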
On direct-attach topologies (NODE: GPUs connected through CPU root ports without a PCIe switch), the NVIDIA driver defaults to SysMem staging for GPU-to-GPU memory accesses issued from CUDA kernels. This makes the PCIe oneshot allreduce ~15x slower than expected.
This only affects custom CUDA kernels that do direct SM loads from remote GPU memory via IPC handles. NCCL is not affected (it uses its own SHM transport).
Fix: Create `/etc/modprobe.d/nvidia-p2p-override.conf`:

```
options nvidia NVreg_RegistryDwords="ForceP2P=0x11;RMForceP2PType=1;RMPcieP2PType=2;GrdmaPciTopoCheckOverride=1;EnableResizableBar=1"
```
Then reload the nvidia module or reboot. See PCIe Oneshot AllReduce — Critical Prerequisite for details.
For pure inference (no KV cache offload, no training), PCIe switches on a single CPU provide the lowest latency. For workloads that need system RAM bandwidth (KV cache offload, training), dual Turin direct-attach is superior. Genoa dual-CPU setups should strongly consider adding switches to compensate for slower xGMI.
If using Broadcom switches: Disabling ACS is mandatory for full performance. With ACS off, Broadcom PEX890xx switches achieve interconnect scores of 0.90-0.96, significantly outperforming other tested configurations.