[Feature] In-kernel put-fence / TPUT-quiesce primitive to enable multi-core parallelized cross-rank push (combine_push / dispatch) #1906

Description

@lwDavid

Summary

The MoE combine_push (and the symmetric dispatch payload push) in the DeepSeek-V4 kernels is a serial, put-issue-bound task. It cannot be parallelized across cores today because pypto has no in-kernel primitive that guarantees a rank's outbound TPUTs have landed at the peer before a subsequent cross-rank notify.

We need either:

  • (A) an in-kernel put-fence / TPUT-quiesce op (per-rank drain of outbound TPUTs), usable mid-kernel between a multi-core push scope and a single-core barrier scope; and/or
  • (B) formal support + validation for a multi-core pl.spmd cross-rank pld.tensor.put followed by a deps-ordered single-core notify/wait barrier, with the semantics that the deps edge implies remote-landing (not just task-completion).

Motivation / Use Case

combine_push (pypto-lib/models/deepseek/v4/combine.py:92-153) is one pl.at(CORE_GROUP) serial task: one scalar core builds ~48 scattered pld.tensor.put descriptors (~0.5 us each), then a cross-rank combine_done notify/wait barrier. Measured ~24 us on a2a3 EP=2, and during its window all 71 other cores are idle.

It is put-ISSUE-bound, not transfer-bound:

  • 384 KB / 24 us = 16 GB/s ≈ 17% of the ~96 GB/s link (6× headroom) — a transfer-bound kernel would sit near peak.
  • 24 us / 48 puts ≈ 0.5 us/put = one scalar core serially building scatter descriptors. Scattered dst_offsets=[r_route, 0] forbid coalescing, so each row is its own descriptor.
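
The two bullets above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this issue and assumes decimal units (384 KB = 384e3 bytes, 1 GB/s = 1e3 bytes/us); the variable names are illustrative.

```python
# Back-of-envelope check of the "put-issue-bound" claim.
# All inputs are the figures quoted in this issue; decimal units are an assumption.
PAYLOAD_BYTES = 384e3   # data pushed per combine_push call
ELAPSED_US = 24.0       # measured combine_push duration on a2a3 EP=2
N_PUTS = 48             # scattered pld.tensor.put descriptors per call
LINK_GBPS = 96.0        # approximate link bandwidth

# bytes/us divided by 1e3 gives GB/s
achieved_gbps = PAYLOAD_BYTES / ELAPSED_US / 1e3
link_fraction = achieved_gbps / LINK_GBPS
headroom = LINK_GBPS / achieved_gbps
per_put_us = ELAPSED_US / N_PUTS

print(f"achieved bandwidth: {achieved_gbps:.1f} GB/s")   # 16.0 GB/s
print(f"fraction of link:   {link_fraction:.0%}")        # 17%
print(f"headroom:           {headroom:.0f}x")            # 6x
print(f"time per put:       {per_put_us:.2f} us")        # 0.50 us
```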

The hardware already supports parallel issue: each core's TPUT selects channelGroupIdx = get_block_idx() → its own SDMA SQ/CQ among kSdmaMaxChannel = 48 channels. Distributing the put-issue across the idle cores would cut combine_push to a ~5-6 us floor (transfer + barrier), saving ~16-18 us per call. Because it is on the fully-serial MoE tail, that converts 1:1 to wall-clock and multiplies across the model's MoE layers per decode step.
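
To see where a floor of a few microseconds could come from, the sketch below separates the issue term from the transfer term. It is a rough model, not a measurement: it assumes puts are dealt round-robin to the issuing cores, that each core issues at the measured ~0.5 us/put, and that the link sustains the quoted ~96 GB/s. The function names are hypothetical, and barrier cost is deliberately left out because this issue does not quantify it separately.

```python
import math

PER_PUT_US = 0.5        # measured serial issue cost per put (from this issue)
N_PUTS = 48             # puts per combine_push call (from this issue)
PAYLOAD_BYTES = 384e3   # decimal KB assumed
LINK_GBPS = 96.0        # approximate link bandwidth (from this issue)


def issue_time_us(n_puts, n_cores, per_put_us=PER_PUT_US):
    """Issue time when puts are split round-robin across n_cores (assumed model)."""
    return math.ceil(n_puts / n_cores) * per_put_us


def transfer_floor_us(payload_bytes=PAYLOAD_BYTES, link_gbps=LINK_GBPS):
    """Time to move the payload at full link rate; 1 GB/s equals 1e3 bytes/us."""
    return payload_bytes / (link_gbps * 1e3)


for cores in (1, 8, 48):
    print(f"{cores:2d} issuing cores: issue {issue_time_us(N_PUTS, cores):5.1f} us, "
          f"transfer floor {transfer_floor_us():.1f} us")
```

Under these assumptions, a single issuing core spends 24 us on issue alone, while 48 issuing cores spend 0.5 us, at which point the 4 us transfer term dominates. The remaining gap to the ~5-6 us floor quoted above would be barrier cost.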

Proposed API / Behavior

Option A — explicit in-kernel put-fence (preferred; minimal new surface):

with pl.spmd(EP_WORLD_SIZE * N_LOCAL_EXPERTS, name_hint="combine_push") as push_tid:
    bucket = pl.tile.get_block_idx()
    ...                                  # each core issues its own bucket's puts on its own SDMA channel
    pld.tensor.put(dst=routed_y_buf, peer=dst, src=recv_y_flat, ...)

with pl.at(level=pl.Level.CORE_GROUP, deps=[push_tid], name_hint="combine_done") as barrier_tid:
    pld.system.put_fence()               # NEW: block until ALL of this rank's outbound TPUTs have LANDED at peers
    pld.system.notify(target=combine_done, ...)   # now guaranteed to follow the data
    pld.system.wait(signal=combine_done, ...)

Option B — make deps imply TPUT-landing: define that a scope with deps=[push_tid], where push_tid issued cross-rank puts, is guaranteed to start only after push_tid's outbound TPUTs have landed (not merely been issued to SDMA), and add an ST test mirroring this combine_push pattern.

Either way the deliverable is: a kernel author can split put-issue across cores and still get correct cross-rank ordering before the done-notify.

Alternatives Considered

  • Flat static-bound form (serial route-table build + pl.spmd over N_ROUTES with a per-lane mask): retires the dynamic-loop-bound 507018 class, but keeps the same push/barrier split → same cross-core ordering hazard → expected to deadlock identically. Not a fix.
  • Per-expert streaming (push inside expert_routed's per-expert pl.parallel to hide under the expert cloud): needs the exact same cross-core put→notify ordering guarantee plus an all-experts join, with larger blast radius — strictly worse than fixing the primitive.
  • Keep push+notify atomic (status quo): correct, but leaves ~16-18 us/layer of put-issue serialized on one core with 71 cores idle.

Additional Context

Empirical result. We implemented the 3-scope split (with pl.spmd(EP*N_LOCAL) as push_tid → pl.at(CORE_GROUP, deps=[push_tid]) barrier → pl.spmd(deps=[barrier_tid]) reduce, capture-form with pl.tile.get_block_idx()). It compiles, but on a2a3 EP=2 it DEADLOCKS (300 s timeout, killed) on the first run. This multi-core cross-rank put + deps-separated barrier pattern is not exercised anywhere in pypto tests.

Why it breaks. The current code deliberately keeps the structure atomic — combine.py:88: "keep TPUT + notify/wait in one atomic task". In one InCore task, the puts and the notify share the same SDMA channel + program order, so the notify is guaranteed to follow the puts. After the split, the barrier's notify runs on a different SDMA channel than the puts (channelGroupIdx = get_block_idx()), with no hardware ordering between this rank's puts-landing and its notify-to-peer. deps=[push_tid] only guarantees task-completion of the push scope, not remote-landing of its TPUTs. There is no in-kernel "drain my outbound TPUTs" primitive (comm_barrier is host-side all-rank, unusable mid-kernel).
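
The ordering argument above can be illustrated with a toy timing model in plain Python. It is a sketch under stated assumptions, not a model of the real SDMA engine. Each channel is treated as a FIFO whose operations land in order, the deps edge is treated as waiting only for put issue, and the split barrier's notify is placed on a channel that carries none of the puts. The put and notify latencies are hypothetical values chosen for illustration, and `fence=True` stands in for the proposed put_fence, which does not exist today. The model shows only the put/notify ordering; it does not reproduce the observed deadlock.

```python
def in_order_land(prev_landed, issued_at, latency):
    """Landing time on one FIFO channel: a later op never lands before an earlier one."""
    return max(prev_landed, issued_at + latency)


def simulate(n_puts, n_cores, fence,
             issue_us=0.5, put_latency_us=2.0, notify_latency_us=0.25):
    """Return (last_put_landed, notify_landed) in us for a toy timing model.

    n_cores == 1 models the atomic task: the notify shares the puts' channel.
    n_cores > 1 models the split: the notify uses a separate channel.
    Latency defaults other than issue_us are hypothetical.
    """
    core_clock = [0.0] * n_cores    # when each core finishes issuing its puts
    chan_landed = [0.0] * n_cores   # when each channel's latest put lands
    for p in range(n_puts):
        c = p % n_cores             # round-robin put assignment (assumed)
        core_clock[c] += issue_us
        chan_landed[c] = in_order_land(chan_landed[c], core_clock[c], put_latency_us)

    last_put_landed = max(chan_landed)
    push_done = max(core_clock)     # task completion: all puts issued, not landed

    if n_cores == 1:
        # Same channel and program order: FIFO keeps the notify behind the puts.
        notify_landed = in_order_land(chan_landed[0], push_done, notify_latency_us)
    else:
        # Separate channel: only an explicit fence delays the notify until landing.
        start = max(push_done, last_put_landed) if fence else push_done
        notify_landed = start + notify_latency_us
    return last_put_landed, notify_landed


for label, args in [("atomic", (48, 1, False)),
                    ("split, no fence", (48, 48, False)),
                    ("split + fence", (48, 48, True))]:
    puts, notify = simulate(*args)
    print(f"{label:16s} last put lands {puts:5.2f} us, notify lands {notify:5.2f} us, "
          f"ordered={notify >= puts}")
```

In this model the atomic and fenced cases keep the notify behind the last put, while the unfenced split lets the notify arrive first. That is the hazard either Option A or Option B would need to close.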

Hardware path references: per-core channel selection channelGroupIdx = get_block_idx() (pto-isa .../comm/async/sdma/sdma_async_intrin.hpp:514-532), kSdmaMaxChannel = 48 (sdma_types.hpp:30); TPUT emit is pto.barrier<PIPE_ALL> + pto.comm.tput with no post-put completion/quiesce barrier (src/backend/common/pto_ops_common.cpp:2913-2919).

Affected kernels: pypto-lib/models/deepseek/v4/combine.py:92-153 (combine_push), dispatch.py:209-235 (symmetric payload push).

Versions: pypto 55fdd38 (main), pypto-lib 1a23bd3. Target: a2a3 EP=2. Blocks a measured ~16-18 us/layer MoE-tail win.
