[Feature] In-kernel put-fence / TPUT-quiesce primitive to enable multi-core parallelized cross-rank push (combine_push / dispatch)

### Summary

The MoE `combine_push` (and the symmetric `dispatch` payload push) in the DeepSeek-V4 kernels is a **serial, put-issue-bound** task that **cannot be parallelized across cores today** because pypto has no in-kernel primitive to guarantee a rank's outbound `TPUT`s have **landed at the peer** before a subsequent cross-rank `notify`.

We need either:
- **(A)** an in-kernel **put-fence / TPUT-quiesce** op (per-rank drain of outbound TPUTs), usable mid-kernel between a multi-core push scope and a single-core barrier scope; and/or
- **(B)** formal support + validation for a **multi-core `pl.spmd` cross-rank `pld.tensor.put` followed by a `deps`-ordered single-core `notify`/`wait` barrier**, with the semantics that the `deps` edge implies remote-landing (not just task-completion).

### Motivation / Use Case

`combine_push` (`pypto-lib/models/deepseek/v4/combine.py:92-153`) is one serial `pl.at(CORE_GROUP)` task: a single scalar core builds ~48 scattered `pld.tensor.put` descriptors (~0.5 us each), then runs a cross-rank `combine_done` `notify`/`wait` barrier. It measures **~24 us** on a2a3 EP=2, and **during that window all 71 other cores are idle**.

It is **put-ISSUE-bound, not transfer-bound**:
- 384 KB / 24 us = **16 GB/s ≈ 17% of the ~96 GB/s link (6× headroom)**; a transfer-bound kernel would sit near peak.
- 24 us / 48 puts ≈ **0.5 us/put**, which is consistent with one scalar core serially building scatter descriptors. The scattered `dst_offsets=[r_route, 0]` prevent coalescing, so each row needs its own descriptor.
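
The arithmetic behind these two bullets can be reproduced in a few lines. The inputs are the figures quoted above; the variable names are illustrative only, and KB/GB are assumed to be decimal units.

```python
# Figures quoted in this issue; variable names are illustrative, not pypto identifiers.
payload_bytes = 384e3      # 384 KB pushed per combine_push call (decimal KB assumed)
measured_s = 24e-6         # ~24 us measured on a2a3 EP=2
link_bytes_per_s = 96e9    # ~96 GB/s link peak (decimal GB assumed)
num_puts = 48              # scattered put descriptors per call

achieved = payload_bytes / measured_s         # achieved bandwidth in bytes/s
utilization = achieved / link_bytes_per_s     # fraction of link peak
per_put_us = measured_s / num_puts * 1e6      # serial issue cost per put

print(f"achieved bandwidth: {achieved / 1e9:.1f} GB/s")
print(f"link utilization:   {utilization:.0%}")
print(f"headroom:           {1 / utilization:.1f}x")
print(f"issue cost per put: {per_put_us:.2f} us")
```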

The hardware **already supports parallel issue**: each core's TPUT selects `channelGroupIdx = get_block_idx()` → its **own** SDMA SQ/CQ among `kSdmaMaxChannel = 48` channels. Distributing the put-issue across the idle cores would cut `combine_push` to a **~5-6 us floor** (transfer + barrier), saving **~16-18 us per call**. Because it is on the **fully-serial MoE tail**, that converts **1:1 to wall-clock** and **multiplies across the model's MoE layers** per decode step.
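
As a rough sanity check on that floor, the sketch below models how the issue time shrinks when the 48 puts are split evenly across cores, and compares it with the wire time at link peak. It is a back-of-the-envelope model, not a measurement: the function names are hypothetical, the 0.5 us/put cost is assumed to stay constant per core, and the barrier cost is not modeled.

```python
import math

def issue_time_us(num_puts: int, num_cores: int, per_put_us: float = 0.5) -> float:
    """Issue time on the busiest core when puts are split evenly across cores."""
    return math.ceil(num_puts / num_cores) * per_put_us

def transfer_time_us(payload_bytes: float, link_bytes_per_s: float) -> float:
    """Wire time for the whole payload at link peak."""
    return payload_bytes / link_bytes_per_s * 1e6

wire_us = transfer_time_us(384e3, 96e9)  # figures quoted in this issue
for cores in (1, 2, 8, 48):
    issue_us = issue_time_us(48, cores)
    bound = "issue-bound" if issue_us > wire_us else "transfer-bound"
    print(f"{cores:2d} cores: issue {issue_us:5.1f} us, wire {wire_us:.1f} us -> {bound}")
```

Under these assumptions the push stops being issue-bound once the per-core issue time drops below the roughly 4 us wire time; whatever remains of the quoted ~5-6 us floor would then be the barrier, which this sketch does not attempt to estimate.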

### Proposed API / Behavior

**Option A — explicit in-kernel put-fence** (preferred; minimal new surface):

```python
with pl.spmd(EP_WORLD_SIZE * N_LOCAL_EXPERTS, name_hint="combine_push") as push_tid:
    bucket = pl.tile.get_block_idx()
    ...                                  # each core issues its own bucket's puts on its own SDMA channel
    pld.tensor.put(dst=routed_y_buf, peer=dst, src=recv_y_flat, ...)

with pl.at(level=pl.Level.CORE_GROUP, deps=[push_tid], name_hint="combine_done") as barrier_tid:
    pld.system.put_fence()               # NEW: block until ALL of this rank's outbound TPUTs have LANDED at peers
    pld.system.notify(target=combine_done, ...)   # now guaranteed to follow the data
    pld.system.wait(signal=combine_done, ...)
```

**Option B — make `deps` imply TPUT-landing:** define that a scope with `deps=[push_tid]`, where `push_tid` issued cross-rank puts, is guaranteed to start only after `push_tid`'s outbound TPUTs have **landed** (not merely been issued to SDMA), and add an ST test mirroring this `combine_push` pattern.

Either way the deliverable is: a kernel author can split put-issue across cores and still get correct cross-rank ordering before the done-`notify`.
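
The contract being requested can be illustrated with a host-side toy model in plain Python. Nothing here is pypto API: `ToyRank`, its methods, and the delays are hypothetical stand-ins, with timers playing the role of SDMA transfers that land some time after they are issued. The point is only the invariant: after `put_fence()` returns, every previously issued put has landed, so the `notify` cannot overtake the data.

```python
import threading

class ToyRank:
    """Toy model of one rank: puts land asynchronously, put_fence drains them."""

    def __init__(self):
        self._cv = threading.Condition()
        self._outstanding = 0
        self.peer_log = []  # order in which events become visible at the peer

    def put(self, row, delay_s):
        """Issue a put; it lands at the peer after delay_s (hypothetical latency)."""
        with self._cv:
            self._outstanding += 1

        def land():
            with self._cv:
                self.peer_log.append(("data", row))
                self._outstanding -= 1
                self._cv.notify_all()

        threading.Timer(delay_s, land).start()

    def put_fence(self):
        """Block until every put issued so far has landed."""
        with self._cv:
            self._cv.wait_for(lambda: self._outstanding == 0)

    def notify(self):
        """Signal the peer; modeled as immediately visible."""
        with self._cv:
            self.peer_log.append(("notify", None))

rank = ToyRank()
for row in range(4):
    rank.put(row, delay_s=0.01 * (4 - row))  # later rows land first
rank.put_fence()   # without this line, notify would be logged before any data
rank.notify()
print(rank.peer_log)
```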

### Alternatives Considered

- **Flat static-bound form** (serial route-table build + `pl.spmd` over `N_ROUTES` with a per-lane mask): retires the dynamic-loop-bound `507018` class, but keeps the **same push/barrier split** → same cross-core ordering hazard → expected to deadlock identically. Not a fix.
- **Per-expert streaming** (push inside `expert_routed`'s per-expert `pl.parallel` to hide under the expert cloud): needs the **exact same** cross-core put→notify ordering guarantee plus an all-experts join, with larger blast radius — strictly worse than fixing the primitive.
- **Keep push+notify atomic (status quo):** correct, but leaves ~16-18 us/layer of put-issue serialized on one core with 71 cores idle.

### Additional Context

**Empirical result.** We implemented the 3-scope split (`with pl.spmd(EP*N_LOCAL) as push_tid` → `pl.at(CORE_GROUP, deps=[push_tid])` barrier → `pl.spmd(deps=[barrier_tid])` reduce, capture-form with `pl.tile.get_block_idx()`). It **compiles**, but on **a2a3 EP=2 it DEADLOCKS (300 s timeout, killed)** on the first run. This multi-core cross-rank put + `deps`-separated barrier pattern is **not exercised anywhere** in pypto tests.

**Why it breaks.** The current code deliberately keeps the structure atomic — `combine.py:88`: *"keep TPUT + notify/wait in one atomic task"*. In one `InCore` task, the puts and the `notify` share the **same SDMA channel + program order**, so the `notify` is guaranteed to follow the puts. After the split, the barrier's `notify` runs on a **different** SDMA channel than the puts (`channelGroupIdx = get_block_idx()`), with **no hardware ordering** between this rank's puts-landing and its notify-to-peer. `deps=[push_tid]` only guarantees **task-completion** of the push scope, **not remote-landing** of its TPUTs. There is no in-kernel "drain my outbound TPUTs" primitive (`comm_barrier` is host-side all-rank, unusable mid-kernel).
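
The ordering argument can be made concrete with a small deterministic model. It assumes only what is stated above: operations on one SDMA channel arrive at the peer in program order, and there is no ordering across channels. The function name, issue times, and latencies are invented for illustration and do not reflect measured hardware behavior.

```python
def peer_arrival_order(ops):
    """Return op names in the order they become visible at the peer.

    ops: list of (name, channel, issue_time_us, latency_us), in issue order.
    Each channel is modeled as a FIFO: an op cannot arrive before an earlier
    op on the same channel. Different channels are unordered with respect
    to each other.
    """
    last_arrival = {}
    arrivals = []
    for name, channel, issue, latency in ops:
        t = max(issue + latency, last_arrival.get(channel, 0.0))
        last_arrival[channel] = t
        arrivals.append((t, name))
    # sorted() is stable, so ties keep issue order
    return [name for _, name in sorted(arrivals, key=lambda a: a[0])]

# Status quo: puts and notify issued by one task on one channel.
atomic = [("put0", 0, 0.0, 4.0), ("put1", 0, 0.5, 4.0),
          ("put2", 0, 1.0, 4.0), ("notify", 0, 1.5, 1.0)]

# After the split: each put on its own channel, notify on yet another.
split = [("put0", 1, 0.0, 4.0), ("put1", 2, 0.0, 4.0),
         ("put2", 3, 0.0, 4.0), ("notify", 0, 0.5, 1.0)]

print("atomic:", peer_arrival_order(atomic))
print("split: ", peer_arrival_order(split))
```

In the atomic case the small `notify` queues behind the puts on the shared channel; in the split case it travels alone and reaches the peer first, which is the hazard that `deps=[push_tid]` does not rule out.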

**Hardware path references:**
- Per-core channel selection: `channelGroupIdx = get_block_idx()` (`pto-isa .../comm/async/sdma/sdma_async_intrin.hpp:514-532`).
- Channel count: `kSdmaMaxChannel = 48` (`sdma_types.hpp:30`).
- TPUT emit: `pto.barrier<PIPE_ALL>` + `pto.comm.tput` with **no** post-put completion/quiesce barrier (`src/backend/common/pto_ops_common.cpp:2913-2919`).

**Affected kernels:** `pypto-lib/models/deepseek/v4/combine.py:92-153` (combine_push), `dispatch.py:209-235` (symmetric payload push).

**Versions:** pypto `55fdd38` (main), pypto-lib `1a23bd3`. Target: a2a3 EP=2. Blocks a measured **~16-18 us/layer** MoE-tail win.
