Chunked pto.comm.tput (staging tile smaller than transfer) does not drain its stores before a following TNOTIFY → peer reads partial data (a2a3) #872

Description

@YunjiQin

Summary

When the VEC staging tile passed to pto::comm::TPUT is smaller than the transfer extent (so TPUT internally 2-D-slides the transfer through the tile in multiple chunks), the generated kernel does not drain the chunked MTE3 stores before a following cross-rank pto::comm::TNOTIFY. The signal therefore races ahead of the last in-flight chunk stores; the peer's TWAIT unblocks and reads partial data.

Single-shot TPUT (staging tile == transfer extent) does not exhibit this — it is self-draining.
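To make the chunk count concrete, here is a minimal host-side sketch of how a transfer could be split into tile-sized pieces. `Chunk` and `plan_chunks` are hypothetical names introduced for illustration; the actual slide order and edge handling inside pto::comm::TPUT are not shown in this report and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One rectangular piece of the transfer that fits in the staging tile.
struct Chunk {
    int row0, col0;  // top-left corner inside the transfer
    int rows, cols;  // extent of this chunk (edge chunks may be smaller)
};

// Hypothetical helper (not part of pto-isa): split a rows x cols transfer
// into tile-sized chunks, walking row blocks first, then column blocks.
std::vector<Chunk> plan_chunks(int rows, int cols, int tile_rows, int tile_cols) {
    assert(rows > 0 && cols > 0 && tile_rows > 0 && tile_cols > 0);
    std::vector<Chunk> chunks;
    for (int r = 0; r < rows; r += tile_rows) {
        for (int c = 0; c < cols; c += tile_cols) {
            chunks.push_back({r, c,
                              std::min(tile_rows, rows - r),
                              std::min(tile_cols, cols - c)});
        }
    }
    return chunks;
}
```

Under this model, `plan_chunks(8, 64, 2, 32)` yields 8 chunks and `plan_chunks(8, 64, 2, 64)` yields 4, while a full `[8, 64]` tile yields a single chunk, which corresponds to the single-shot case.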

This is the same insert-sync / pto.comm.tput hazard family as #706 and #730, but specific to the chunked (sub-tile staging) TPUT path.

Repro

Frontend (PyPTO) — a 2-rank ring-shuffle pld.tensor.put of an [8, 64] FP32 window, with the staging tile forced down to [2, 64] (or [2, 32]) so TPUT must auto-chunk:

  • rank r pushes its [8,64] slice into peer (r+1)%2's slice, then TNOTIFYs the peer; peer TWAITs then reads its slice back.
  • Golden: out[r] == in[(r-1)%2].
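The golden relation above can be sketched as a small host-side reference. `Window` and `ring_shuffle_golden` are hypothetical names, not part of PyPTO or pto-isa. The sketch writes the source index as `(r + n - 1) % n` because `(r - 1) % 2` evaluates to -1 for `r == 0` with signed C++ integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Window = std::vector<float>;  // one rank's flattened window

// Hypothetical host-side reference: rank r pushes its window into rank
// (r + 1) % n, so after the exchange rank r holds in[(r + n - 1) % n].
std::vector<Window> ring_shuffle_golden(const std::vector<Window>& in) {
    const std::size_t n = in.size();
    assert(n > 0);
    std::vector<Window> out(n);
    for (std::size_t r = 0; r < n; ++r) {
        out[(r + 1) % n] = in[r];
    }
    return out;
}
```

With two ranks this simply swaps the two windows, which is the expectation the device output is compared against.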

Observed on device (a2a3 / Ascend 910B, 2 ranks):

  • chunked staging ([2,32] or [2,64] tile, full [8,64] transfer) → wrong: one direction is fully correct; in the other, the top rows are correct and the bottom rows are zero (a partial transfer; max diff = 511). Reproduced with both column-split and row-only chunking.
  • single-shot staging (full [8,64] tile) → correct.
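A partial transfer of this shape can be characterized by the first mismatching row and the maximum absolute difference. The following is a minimal sketch of such a check; `MismatchReport` and `compare_rows` are hypothetical names, and the suggestion that the input was filled with 0..511 is an assumption that would be consistent with max diff = 511, not something the report states.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MismatchReport {
    int first_bad_row;   // -1 if every row matches
    float max_abs_diff;  // 0 if every element matches
};

// Hypothetical helper: compare a row-major rows x cols result against the
// golden data and report where the mismatch starts and how large it is.
MismatchReport compare_rows(const std::vector<float>& got,
                            const std::vector<float>& want,
                            int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(got.size() == static_cast<std::size_t>(rows) * cols);
    assert(want.size() == got.size());
    MismatchReport rep{-1, 0.0f};
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const std::size_t i = static_cast<std::size_t>(r) * cols + c;
            const float d = std::fabs(got[i] - want[i]);
            if (d > 0.0f && rep.first_bad_row < 0) rep.first_bad_row = r;
            rep.max_abs_diff = std::max(rep.max_abs_diff, d);
        }
    }
    return rep;
}
```

If the golden `[8, 64]` window holds 0..511 (assumed) and the last rows of the result are still zero, the reported maximum difference is 511 and `first_bad_row` marks where the missing chunks begin.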

Generated C++ (PTOAS output)

The chunked TPUT region, taken from next_levels/chip_orch/kernels/aiv/ring_step.cpp. Note that the GlobalTensor is the full Shape<1,1,1,8,64> transfer, while the staging tile is Tile<...,2,32,...,-1,-1,...> (allocated as 2x32, with ValidRow = ValidCol = -1, i.e. DYNAMIC):

GlobalTensor<float, Shape<1,1,1,8,64>, Stride<512,512,512,64,1>, Layout::ND> v36 = ...;  // full transfer
// v28 = Tile<TileType::Vec, float, 2, 32, RowMajor, -1, -1, NoneBox, ...>               // sub-tile staging

pipe_barrier(PIPE_ALL);                          // orders the preceding source TSTORE (ok)
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
pto::comm::TPUT(v36, v26, v28);                  // chunked: 4x2 internal slide, many MTE3 stores
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
// >>> nothing drains the chunked stores here <<<
pto::comm::TNOTIFY(v41, v12, v11);               // cross-rank signal races the in-flight stores
...
pto::comm::TWAIT(v44, v12, v10);                 // peer unblocks and reads partial data

PTOAS emits only the local set_flag(PIPE_MTE3, PIPE_MTE2, ...) after TPUT; there is no pipe_barrier(PIPE_ALL) (or equivalent MTE3 drain) between the chunked TPUT and the following TNOTIFY.

Expected

A chunked pto::comm::TPUT followed by a visibility/signal op (TNOTIFY) must guarantee that all chunk stores have landed before the signal fires, giving the same effective completion guarantee as single-shot TPUT. Either:

  1. insert-sync should emit a full pipe_barrier(PIPE_ALL) (or MTE3 drain) between a chunked pto.comm.tput and a following pto.system.notify / pto.comm.tnotify; or
  2. the chunked TPUT implementation should be self-completing (drain its last chunk before returning), matching single-shot semantics.
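The ordering requirement behind both options can be illustrated with a toy host-side model. This is not the device API: `ToyRank` and its methods are hypothetical, and the model only captures the idea that queued stores become visible when the pipe retires them, with `drain()` standing in for the role pipe_barrier(PIPE_ALL) plays in this report.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Toy model (assumption, not pto-isa): stores are queued on a "pipe" and only
// land in peer memory once the pipe retires them.
struct ToyRank {
    std::vector<float> peer_mem;                          // destination window
    std::deque<std::pair<std::size_t, float>> in_flight;  // queued (index, value)

    explicit ToyRank(std::size_t n) : peer_mem(n, 0.0f) {}

    // Issue one chunk of stores; nothing is visible to the peer yet.
    void put_chunk(const std::vector<float>& src, std::size_t begin, std::size_t end) {
        assert(begin <= end && end <= src.size() && end <= peer_mem.size());
        for (std::size_t i = begin; i < end; ++i) in_flight.emplace_back(i, src[i]);
    }

    // Retire up to `count` queued stores (partial progress of the pipe).
    void retire(std::size_t count) {
        while (count > 0 && !in_flight.empty()) {
            peer_mem[in_flight.front().first] = in_flight.front().second;
            in_flight.pop_front();
            --count;
        }
    }

    // Full drain: every queued store lands before this returns.
    void drain() { retire(in_flight.size()); }

    // The signal: the peer observes whatever has landed at this instant.
    std::vector<float> notify_and_read() const { return peer_mem; }
};
```

In this model, calling `notify_and_read()` after two `put_chunk` calls and only a partial `retire` returns a window whose tail is still zero, whereas calling `drain()` first returns the full source data. Option 1 corresponds to the caller-side `drain()` before the signal; option 2 corresponds to the last `put_chunk` draining internally.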

Workaround (verified)

Inserting a pipe_barrier(PIPE_ALL) immediately after TPUT (before TNOTIFY) makes the chunked case correct on device. We currently inject this from the PyPTO frontend codegen as a stopgap, but it should be PTOAS's responsibility (single-shot TPUT needs no such caller-inserted barrier).

Environment

  • Platform: a2a3 (Ascend 910B), 2 ranks
  • pto-isa commit: 016396b57e2c17093f1194e6acd89bb112b0ab24
  • Frontend: PyPTO pld.tensor.put with sub-tile VEC staging (auto-chunk)

Related
