Summary
When the VEC staging tile passed to pto::comm::TPUT is smaller than the transfer extent (so TPUT internally 2-D-slides the transfer through the tile in multiple chunks), the generated kernel does not drain the chunked MTE3 stores before a following cross-rank pto::comm::TNOTIFY. The signal therefore races ahead of the last in-flight chunk stores; the peer's TWAIT unblocks and reads partial data.
Single-shot TPUT (staging tile == transfer extent) does not exhibit this — it is self-draining.
This is the same insert-sync / pto.comm.tput hazard family as #706 and #730, but specific to the chunked (sub-tile staging) TPUT path.
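To make the hazard concrete, here is a minimal host-side model of the ordering problem. It is a sketch, not PTO code: `PeerWindow`, `put_then_notify`, and the explicit "in-flight store" queue are hypothetical names standing in for the MTE3 pipe and the peer's window. How many chunks happen to land before the signal is a free parameter of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Toy model only: none of these names are PTO APIs.
struct PendingStore {
    std::size_t offset;       // destination offset of the chunk, in elements
    std::vector<float> data;  // chunk payload
};

struct PeerWindow {
    std::vector<float> mem;             // peer-visible destination memory
    std::deque<PendingStore> inflight;  // stores issued but not yet landed

    explicit PeerWindow(std::size_t n) : mem(n, 0.0f) {}

    // Retire the oldest in-flight store: it "lands" in peer memory.
    void retire_one() {
        assert(!inflight.empty());
        PendingStore s = std::move(inflight.front());
        inflight.pop_front();
        for (std::size_t i = 0; i < s.data.size(); ++i) mem[s.offset + i] = s.data[i];
    }
    void drain() { while (!inflight.empty()) retire_one(); }
};

// Issue src as chunks of chunk_elems, let `landed_before_notify` chunks land,
// optionally drain, then signal. Returns what the peer reads when it unblocks.
std::vector<float> put_then_notify(const std::vector<float>& src,
                                   std::size_t chunk_elems,
                                   std::size_t landed_before_notify,
                                   bool drain_before_notify) {
    assert(chunk_elems > 0 && src.size() % chunk_elems == 0);
    PeerWindow peer(src.size());
    for (std::size_t off = 0; off < src.size(); off += chunk_elems) {
        auto first = src.begin() + static_cast<std::ptrdiff_t>(off);
        auto last = first + static_cast<std::ptrdiff_t>(chunk_elems);
        peer.inflight.push_back({off, std::vector<float>(first, last)});
    }
    for (std::size_t i = 0; i < landed_before_notify && !peer.inflight.empty(); ++i) {
        peer.retire_one();
    }
    if (drain_before_notify) peer.drain();  // the missing step in the bug
    std::vector<float> seen = peer.mem;     // signal fires; peer reads here
    peer.drain();                           // late stores land after the read
    return seen;
}
```

In this model, signalling without a drain returns a buffer whose leading chunks are correct and whose trailing chunks are still zero, which matches the shape of the failure described in this issue. Draining first returns the full source.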
Repro
Frontend (PyPTO) — a 2-rank ring-shuffle pld.tensor.put of an [8, 64] FP32 window, with the staging tile forced down to [2, 64] (or [2, 32]) so TPUT must auto-chunk:
- rank r pushes its [8,64] slice into peer (r+1)%2's slice, then TNOTIFYs the peer; peer TWAITs then reads its slice back.
- Golden: out[r] == in[(r-1)%2].
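For reference, the golden check can be sketched on the host as follows. This is an illustrative helper, not the actual test harness; `ring_shuffle_golden` and `max_abs_diff` are hypothetical names. Note that in C++ the expression `(r - 1) % 2` evaluates to -1 for a signed `r == 0`, so the sketch uses `(r + n - 1) % n` to get the wrap-around that the Python-style formula above implies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Slice = std::vector<float>;

// Golden for an n-rank ring shuffle: rank r pushes into rank (r+1)%n,
// so rank r ends up holding the input of rank (r-1) mod n.
std::vector<Slice> ring_shuffle_golden(const std::vector<Slice>& in) {
    const std::size_t n = in.size();
    assert(n > 0);
    std::vector<Slice> out(n);
    for (std::size_t r = 0; r < n; ++r) {
        out[r] = in[(r + n - 1) % n];  // unsigned-safe form of (r-1) mod n
    }
    return out;
}

// Largest element-wise absolute difference between two equally sized slices.
float max_abs_diff(const Slice& a, const Slice& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Comparing each rank's device output against `ring_shuffle_golden(in)[r]` with `max_abs_diff` gives the kind of per-direction diff reported in the observations.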
Observed on device (a2a3 / Ascend 910B, 2 ranks):
- chunked staging ([2,32] or [2,64] tile, full [8,64] transfer) → wrong: one direction fully correct, the other has the top rows correct and the bottom rows zero (a partial transfer; max diff = 511). Reproduced with both column-split and row-only chunking.
- single-shot staging (full [8,64] tile) → correct.
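The chunk counts behind these three cases can be worked out with a small sketch. The traversal order below (rows outer, columns inner) is an assumption for illustration; the real order inside TPUT may differ, and `slide_chunks` is a hypothetical helper, not part of pto-isa.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One chunk of a 2-D slide: origin (row, col) and extent (rows, cols).
struct Chunk {
    std::size_t row, col, rows, cols;
};

// Enumerate the chunks needed to move a rows x cols transfer through a
// tile_rows x tile_cols staging tile. Edge chunks are clipped to the extent.
std::vector<Chunk> slide_chunks(std::size_t rows, std::size_t cols,
                                std::size_t tile_rows, std::size_t tile_cols) {
    assert(tile_rows > 0 && tile_cols > 0);
    std::vector<Chunk> chunks;
    for (std::size_t r = 0; r < rows; r += tile_rows) {
        for (std::size_t c = 0; c < cols; c += tile_cols) {
            chunks.push_back({r, c,
                              std::min(tile_rows, rows - r),
                              std::min(tile_cols, cols - c)});
        }
    }
    return chunks;
}
```

For the [8,64] transfer, a [2,32] tile gives 4 x 2 = 8 chunks (column-split), a [2,64] tile gives 4 x 1 = 4 chunks (row-only), and a full [8,64] tile gives a single chunk, which is the single-shot case that behaves correctly.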
Generated C++ (PTOAS output)
The chunked TPUT region (next_levels/chip_orch/kernels/aiv/ring_step.cpp). Note the GlobalTensor is the full Shape<1,1,1,8,64> transfer while the staging tile is Tile<...,2,32,...,-1,-1,...> (allocated 2x32, ValidRow=ValidCol=-1 = DYNAMIC):
GlobalTensor<float, Shape<1,1,1,8,64>, Stride<512,512,512,64,1>, Layout::ND> v36 = ...; // full transfer
// v28 = Tile<TileType::Vec, float, 2, 32, RowMajor, -1, -1, NoneBox, ...> // sub-tile staging
pipe_barrier(PIPE_ALL); // orders the preceding source TSTORE (ok)
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
pto::comm::TPUT(v36, v26, v28); // chunked: 4x2 internal slide, many MTE3 stores
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
// >>> nothing drains the chunked stores here <<<
pto::comm::TNOTIFY(v41, v12, v11); // cross-rank signal races the in-flight stores
...
pto::comm::TWAIT(v44, v12, v10); // peer unblocks and reads partial data
PTOAS emits only the local set_flag(PIPE_MTE3, PIPE_MTE2, ...) after TPUT; there is no pipe_barrier(PIPE_ALL) (or equivalent MTE3 drain) between the chunked TPUT and the following TNOTIFY.
Expected
A chunked pto::comm::TPUT followed by a visibility/signal op (TNOTIFY) must guarantee all chunk stores have landed before the signal fires — same effective completion guarantee as single-shot TPUT. Either:
- insert-sync should emit a full pipe_barrier(PIPE_ALL) (or MTE3 drain) between a chunked pto.comm.tput and a following pto.system.notify / pto.comm.tnotify; or
- the chunked TPUT implementation should be self-completing (drain its last chunk before returning), matching single-shot semantics.
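The first option can be illustrated with a toy pass over a flat op list. This is only a sketch of the rule, not PTOAS's insert-sync implementation: `OpKind`, `Op`, and `insert_drain_before_notify` are hypothetical, and a real pass would work on the IR and could choose a narrower MTE3 drain instead of a full barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified op kinds; not the PTOAS IR.
enum class OpKind { Tput, Notify, Wait, Barrier, Other };

struct Op {
    OpKind kind;
    bool chunked;  // meaningful for Tput only: staging tile < transfer extent
};

// Insert a Barrier before any Notify that follows a chunked Tput with no
// Barrier in between. Single-shot Tput is left alone.
std::vector<Op> insert_drain_before_notify(const std::vector<Op>& ops) {
    std::vector<Op> out;
    out.reserve(ops.size() + 1);
    bool undrained = false;  // true while a chunked Tput may still be in flight
    for (const Op& op : ops) {
        if (op.kind == OpKind::Barrier) {
            undrained = false;  // an existing barrier already drains the stores
        }
        if (op.kind == OpKind::Notify && undrained) {
            out.push_back({OpKind::Barrier, false});  // drain before signalling
            undrained = false;
        }
        if (op.kind == OpKind::Tput && op.chunked) {
            undrained = true;
        }
        out.push_back(op);
    }
    return out;
}
```

Applied to a sequence shaped like the generated kernel (chunked put, local flag op, notify, wait), the sketch places one barrier directly before the notify and leaves sequences with a single-shot put or an existing barrier unchanged.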
Workaround (verified)
Inserting a pipe_barrier(PIPE_ALL) immediately after TPUT (before TNOTIFY) makes the chunked case correct on device. We currently inject this from the PyPTO frontend codegen as a stopgap, but it should be PTOAS's responsibility (single-shot TPUT needs no such caller-inserted barrier).
Environment
- Platform: a2a3 (Ascend 910B), 2 ranks
- pto-isa commit: 016396b57e2c17093f1194e6acd89bb112b0ab24
- Frontend: PyPTO pld.tensor.put with sub-tile VEC staging (auto-chunk)
Related
pto.tstore and pto.comm.tput