Summary
The MoE combine_push (and the symmetric dispatch payload push) in the DeepSeek-V4 kernels is a serial, put-issue-bound task that cannot be parallelized across cores today because pypto has no in-kernel primitive to guarantee a rank's outbound TPUTs have landed at the peer before a subsequent cross-rank notify.
We need either:
- (A) an in-kernel put-fence / TPUT-quiesce op (per-rank drain of outbound TPUTs), usable mid-kernel between a multi-core push scope and a single-core barrier scope; and/or
- (B) formal support + validation for a multi-core pl.spmd cross-rank pld.tensor.put followed by a deps-ordered single-core notify/wait barrier, with the semantics that the deps edge implies remote-landing (not just task-completion).
Motivation / Use Case
combine_push (pypto-lib/models/deepseek/v4/combine.py:92-153) is one pl.at(CORE_GROUP) serial task: one scalar core builds ~48 scattered pld.tensor.put descriptors (~0.5 us each), then a cross-rank combine_done notify/wait barrier. Measured ~24 us on a2a3 EP=2, and during its window all 71 other cores are idle.
It is put-ISSUE-bound, not transfer-bound:
- 384 KB / 24 us = 16 GB/s ≈ 17% of the ~96 GB/s link (6× headroom) — a transfer-bound kernel would sit near peak.
- 24 us / 48 puts ≈ 0.5 us/put = one scalar core serially building scatter descriptors. Scattered dst_offsets=[r_route, 0] forbid coalescing, so each row is its own descriptor.
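The arithmetic behind the two bullets above can be checked with a short script. This is only a sanity check on the figures quoted in this issue; it assumes decimal units (384 KB = 384e3 bytes) and that the 24 us measurement covers all 48 puts.

```python
# Figures quoted in this issue (a2a3, EP=2). Decimal KB/GB are an assumption.
PAYLOAD_BYTES = 384e3     # bytes pushed per combine_push call
MEASURED_US = 24.0        # measured wall time of combine_push
NUM_PUTS = 48             # scattered pld.tensor.put descriptors per call
LINK_PEAK_GBPS = 96.0     # approximate link peak


def achieved_gbps(payload_bytes, elapsed_us):
    # 1 byte/us equals 1e-3 GB/s, so divide bytes-per-microsecond by 1e3.
    return payload_bytes / elapsed_us / 1e3


bw_gbps = achieved_gbps(PAYLOAD_BYTES, MEASURED_US)
utilization = bw_gbps / LINK_PEAK_GBPS
headroom = LINK_PEAK_GBPS / bw_gbps
per_put_us = MEASURED_US / NUM_PUTS

print(f"achieved bandwidth: {bw_gbps:.1f} GB/s")
print(f"link utilization:   {utilization:.0%} (headroom {headroom:.0f}x)")
print(f"cost per put:       {per_put_us:.2f} us")
```

A transfer-bound kernel would show utilization close to 100%; the low utilization together with a fixed per-put cost is what points at descriptor issue as the bottleneck.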
The hardware already supports parallel issue: each core's TPUT selects channelGroupIdx = get_block_idx() → its own SDMA SQ/CQ among kSdmaMaxChannel = 48 channels. Distributing the put-issue across the idle cores would cut combine_push to a ~5-6 us floor (transfer + barrier), saving ~16-18 us per call. Because it is on the fully-serial MoE tail, that converts 1:1 to wall-clock and multiplies across the model's MoE layers per decode step.
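A rough model of where the ~5-6 us floor comes from, as a sketch under stated assumptions rather than a measurement. It treats the ~0.5 us/put figure as a pure per-descriptor issue cost, assumes puts are spread evenly across issuing cores, and uses a hypothetical barrier cost range (`BARRIER_US_RANGE`) chosen only to be consistent with the floor quoted above.

```python
import math

NUM_PUTS = 48
PER_PUT_US = 0.5               # per-descriptor issue cost, from 24 us / 48 puts
PAYLOAD_BYTES = 384e3          # decimal KB assumed
LINK_PEAK_GBPS = 96.0
BARRIER_US_RANGE = (1.0, 2.0)  # ASSUMPTION: not measured, illustrative only


def issue_us(issuing_cores):
    # Each core serially issues its share of the puts.
    return math.ceil(NUM_PUTS / issuing_cores) * PER_PUT_US


def transfer_us():
    # Time to move the payload at link peak (1 GB/s == 1e3 bytes/us).
    return PAYLOAD_BYTES / (LINK_PEAK_GBPS * 1e3)


def bound(issuing_cores):
    # Which term dominates for a given number of issuing cores.
    return "issue" if issue_us(issuing_cores) > transfer_us() else "transfer"


for cores in (1, 48):
    print(f"{cores:2d} core(s): issue {issue_us(cores):.1f} us, "
          f"transfer {transfer_us():.1f} us -> {bound(cores)}-bound")

floor_lo = transfer_us() + BARRIER_US_RANGE[0]
floor_hi = transfer_us() + BARRIER_US_RANGE[1]
print(f"estimated floor: {floor_lo:.1f}-{floor_hi:.1f} us")
```

In this model a single issuing core is issue-bound, while one put per SDMA channel leaves the transfer time as the dominant term, which is the regime the proposal is trying to reach.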
Proposed API / Behavior
Option A — explicit in-kernel put-fence (preferred; minimal new surface):
with pl.spmd(EP_WORLD_SIZE * N_LOCAL_EXPERTS, name_hint="combine_push") as push_tid:
    bucket = pl.tile.get_block_idx()
    ...  # each core issues its own bucket's puts on its own SDMA channel
    pld.tensor.put(dst=routed_y_buf, peer=dst, src=recv_y_flat, ...)
with pl.at(level=pl.Level.CORE_GROUP, deps=[push_tid], name_hint="combine_done") as barrier_tid:
    pld.system.put_fence()  # NEW: block until ALL of this rank's outbound TPUTs have LANDED at peers
    pld.system.notify(target=combine_done, ...)  # now guaranteed to follow the data
    pld.system.wait(signal=combine_done, ...)
Option B — make deps imply TPUT-landing: define that a scope with deps=[push_tid], where push_tid issued cross-rank puts, is guaranteed to start only after push_tid's outbound TPUTs have landed (not merely been issued to SDMA), and add an ST test mirroring this combine_push pattern.
Either way the deliverable is: a kernel author can split put-issue across cores and still get correct cross-rank ordering before the done-notify.
Alternatives Considered
- Flat static-bound form (serial route-table build + pl.spmd over N_ROUTES with a per-lane mask): retires the dynamic-loop-bound 507018 class, but keeps the same push/barrier split → same cross-core ordering hazard → expected to deadlock identically. Not a fix.
- Per-expert streaming (push inside expert_routed's per-expert pl.parallel to hide under the expert cloud): needs the exact same cross-core put→notify ordering guarantee plus an all-experts join, with larger blast radius; strictly worse than fixing the primitive.
- Keep push+notify atomic (status quo): correct, but leaves ~16-18 us/layer of put-issue serialized on one core with 71 cores idle.
Additional Context
Empirical result. We implemented the 3-scope split (with pl.spmd(EP*N_LOCAL) as push_tid → pl.at(CORE_GROUP, deps=[push_tid]) barrier → pl.spmd(deps=[barrier_tid]) reduce, capture-form with pl.tile.get_block_idx()). It compiles, but on a2a3 EP=2 it DEADLOCKS (300 s timeout, killed) on the first run. This multi-core cross-rank put + deps-separated barrier pattern is not exercised anywhere in pypto tests.
Why it breaks. The current code deliberately keeps the structure atomic — combine.py:88: "keep TPUT + notify/wait in one atomic task". In one InCore task, the puts and the notify share the same SDMA channel + program order, so the notify is guaranteed to follow the puts. After the split, the barrier's notify runs on a different SDMA channel than the puts (channelGroupIdx = get_block_idx()), with no hardware ordering between this rank's puts-landing and its notify-to-peer. deps=[push_tid] only guarantees task-completion of the push scope, not remote-landing of its TPUTs. There is no in-kernel "drain my outbound TPUTs" primitive (comm_barrier is host-side all-rank, unusable mid-kernel).
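The ordering hazard described above can be illustrated with a toy model in plain Python. This is not a model of the real SDMA hardware: it only assumes that each channel is an independent FIFO, that the push scope counts as complete once its puts have been issued, and that `PUT_COST_US`, `NOTIFY_COST_US` and `PUSH_DONE_US` are hypothetical values picked for illustration.

```python
PUT_COST_US = 2.0      # hypothetical time for one put to land at the peer
NOTIFY_COST_US = 0.2   # hypothetical time for a notify to land at the peer
PUSH_DONE_US = 0.5     # hypothetical time at which the push task "completes"


def simulate(ops):
    """ops: (issue_us, channel, cost_us, label) in program order.

    Each channel is a FIFO; channels do not order against each other.
    Returns {label: landing time in us}.
    """
    channel_free_at = {}
    landed = {}
    for issue_at, channel, cost, label in ops:
        start = max(issue_at, channel_free_at.get(channel, 0.0))
        landed[label] = start + cost
        channel_free_at[channel] = landed[label]
    return landed


def notify_follows_data(landed):
    last_put = max(t for name, t in landed.items() if name.startswith("put"))
    return landed["notify"] >= last_put


# Status quo: puts and notify share one channel, so FIFO order protects the data.
atomic = simulate([(0.0, 0, PUT_COST_US, f"put{i}") for i in range(4)]
                  + [(0.0, 0, NOTIFY_COST_US, "notify")])

# Split: one put per channel, notify on a different channel after task completion.
split_puts = [(0.0, ch, PUT_COST_US, f"put{ch}") for ch in range(4)]
split = simulate(split_puts + [(PUSH_DONE_US, 4, NOTIFY_COST_US, "notify")])

# Split plus a fence: the notify is held back until every put has landed.
fence_at = max(simulate(split_puts).values())
fenced = simulate(split_puts + [(fence_at, 4, NOTIFY_COST_US, "notify")])

for name, result in (("atomic", atomic), ("split", split), ("fenced", fenced)):
    print(f"{name:6s} notify follows data: {notify_follows_data(result)}")
```

In the split case the notify overtakes the data because nothing ties the notify's channel to the channels carrying the puts; holding the notify until the last put has landed is the behavior the proposed put-fence (or landing-aware deps) would have to provide.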
Hardware path references: per-core channel selection channelGroupIdx = get_block_idx() (pto-isa .../comm/async/sdma/sdma_async_intrin.hpp:514-532), kSdmaMaxChannel = 48 (sdma_types.hpp:30); TPUT emit is pto.barrier<PIPE_ALL> + pto.comm.tput with no post-put completion/quiesce barrier (src/backend/common/pto_ops_common.cpp:2913-2919).
Affected kernels: pypto-lib/models/deepseek/v4/combine.py:92-153 (combine_push), dispatch.py:209-235 (symmetric payload push).
Versions: pypto 55fdd38 (main), pypto-lib 1a23bd3. Target: a2a3 EP=2. Blocks a measured ~16-18 us/layer MoE-tail win.