Chunked pto.comm.tput (staging tile smaller than transfer) does not drain its stores before a following TNOTIFY → peer reads partial data (a2a3) #872

## Summary

When the VEC staging tile passed to `pto::comm::TPUT` is **smaller than the transfer extent** (so TPUT internally 2-D-slides the transfer through the tile in multiple chunks), the generated kernel does **not** drain the chunked MTE3 stores before a following cross-rank `pto::comm::TNOTIFY`. The signal therefore races ahead of the last in-flight chunk stores; the peer's `TWAIT` unblocks and reads **partial data**.

Single-shot TPUT (staging tile == transfer extent) does **not** exhibit this — it is self-draining.

This is the same insert-sync / `pto.comm.tput` hazard family as #706 and #730, but specific to the **chunked (sub-tile staging)** TPUT path.

## Repro

Frontend (PyPTO) — a 2-rank ring-shuffle `pld.tensor.put` of an `[8, 64]` FP32 window, with the staging tile forced down to `[2, 64]` (or `[2, 32]`) so TPUT must auto-chunk:

- rank `r` pushes its `[8,64]` slice into peer `(r+1)%2`'s slice, then `TNOTIFY`s the peer; peer `TWAIT`s then reads its slice back.
- Golden: `out[r] == in[(r-1)%2]`.
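
As a host-side reference for the golden check above, here is a minimal sketch. It does not use the PyPTO or PTO APIs; `Window`, `source_rank`, `golden`, and `first_bad_row` are hypothetical names introduced only for illustration. Note that a literal C++ translation of `(r-1)%2` yields `-1` for `r == 0`, so the sketch adds the rank count before taking the remainder.

```cpp
#include <cassert>
#include <vector>

// Host-side reference only; nothing here touches the device or the PTO API.
constexpr int kRows = 8;
constexpr int kCols = 64;

using Window = std::vector<float>;  // one rank's [8, 64] FP32 window, row-major

// Rank whose input rank r should hold after the shuffle.
// Adding n keeps the result non-negative; plain (r - 1) % n is -1 in C++ for r == 0.
int source_rank(int r, int n) { return (r - 1 + n) % n; }

// Expected output per rank: out[r] == in[source_rank(r, n)].
std::vector<Window> golden(const std::vector<Window>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<Window> out(n);
    for (int r = 0; r < n; ++r) out[r] = in[source_rank(r, n)];
    return out;
}

// Index of the first row where got differs from want, or kRows if every row matches.
int first_bad_row(const Window& got, const Window& want) {
    for (int row = 0; row < kRows; ++row)
        for (int col = 0; col < kCols; ++col)
            if (got[row * kCols + col] != want[row * kCols + col]) return row;
    return kRows;
}
```

Reporting the first mismatching row, rather than only a maximum difference, makes a "top rows correct, bottom rows zero" result easy to recognize as a truncated transfer.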

Observed on **device (a2a3 / Ascend 910B, 2 ranks)**:
- **chunked** staging (`[2,32]` or `[2,64]` tile, full `[8,64]` transfer) → **wrong**: one direction fully correct, the other has the **top rows correct and the bottom rows zero** (a partial transfer; `max diff = 511`). Reproduced with both column-split and row-only chunking.
- **single-shot** staging (full `[8,64]` tile) → **correct**.

## Generated C++ (PTOAS output)

The chunked TPUT region is from `next_levels/chip_orch/kernels/aiv/ring_step.cpp`. Note that the GlobalTensor is the **full** `Shape<1,1,1,8,64>` transfer, while the staging tile is `Tile<...,2,32,...,-1,-1,...>` (allocated 2x32, with `ValidRow=ValidCol=-1`, i.e. DYNAMIC):

```cpp
GlobalTensor<float, Shape<1,1,1,8,64>, Stride<512,512,512,64,1>, Layout::ND> v36 = ...;  // full transfer
// v28 = Tile<TileType::Vec, float, 2, 32, RowMajor, -1, -1, NoneBox, ...>               // sub-tile staging

pipe_barrier(PIPE_ALL);                          // orders the preceding source TSTORE (ok)
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
pto::comm::TPUT(v36, v26, v28);                  // chunked: 4x2 internal slide, many MTE3 stores
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
// >>> nothing drains the chunked stores here <<<
pto::comm::TNOTIFY(v41, v12, v11);               // cross-rank signal races the in-flight stores
...
pto::comm::TWAIT(v44, v12, v10);                 // peer unblocks and reads partial data
```

PTOAS emits only the local `set_flag(PIPE_MTE3, PIPE_MTE2, ...)` after `TPUT`; there is no `pipe_barrier(PIPE_ALL)` (or equivalent MTE3 drain) between the chunked `TPUT` and the following `TNOTIFY`.
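
To make the chunk count concrete, the sketch below enumerates the tile-sized chunks needed to cover the transfer. It is plain host-side arithmetic, not the TPUT implementation; `Chunk` and `chunk_grid` are hypothetical names, and the row-major traversal order is an assumption (the real internal slide order may differ).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One staging-tile-sized piece of the transfer: origin plus extent.
struct Chunk { int row, col, rows, cols; };

// Enumerate the chunks covering a rows x cols transfer with a
// tile_rows x tile_cols staging tile. Row-major order is an assumption.
std::vector<Chunk> chunk_grid(int rows, int cols, int tile_rows, int tile_cols) {
    std::vector<Chunk> chunks;
    for (int r = 0; r < rows; r += tile_rows)
        for (int c = 0; c < cols; c += tile_cols)
            chunks.push_back({r, c,
                              std::min(tile_rows, rows - r),    // clamp at the bottom edge
                              std::min(tile_cols, cols - c)});  // clamp at the right edge
    return chunks;
}
```

For the `[8, 64]` transfer, a `[2, 32]` tile gives a 4x2 grid (8 chunks), a `[2, 64]` tile gives 4 chunks, and a full `[8, 64]` tile gives a single chunk, which is the single-shot case.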

## Expected

A chunked `pto::comm::TPUT` followed by a visibility/signal op (`TNOTIFY`) must guarantee all chunk stores have landed before the signal fires — same effective completion guarantee as single-shot TPUT. Either:

1. **insert-sync** should emit a full `pipe_barrier(PIPE_ALL)` (or MTE3 drain) between a chunked `pto.comm.tput` and a following `pto.system.notify` / `pto.comm.tnotify`; or
2. the **chunked TPUT** implementation should be self-completing (drain its last chunk before returning), matching single-shot semantics.

## Workaround (verified)

Inserting a `pipe_barrier(PIPE_ALL)` immediately after `TPUT` (before `TNOTIFY`) makes the chunked case correct on device. We currently inject this from the PyPTO frontend codegen as a stopgap, but it should be PTOAS's responsibility (single-shot TPUT needs no such caller-inserted barrier).
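
The ordering argument can be illustrated with a toy host-side model. This is only a sketch: `PeerModel` and its methods are hypothetical, `drain()` merely stands in for the role the inserted `pipe_barrier(PIPE_ALL)` plays, and the number of stores that have landed when the signal fires is an arbitrary parameter rather than measured hardware behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Toy model of the peer's view of an 8-row transfer. Not a PTO type.
struct PeerModel {
    std::vector<int> row_visible;    // 1 = row has landed in peer memory
    std::deque<int> in_flight_rows;  // rows issued but not yet landed

    explicit PeerModel(int rows) : row_visible(rows, 0) {}

    // Issue one store per row; only the first `landed` complete immediately.
    void put(int landed) {
        const int rows = static_cast<int>(row_visible.size());
        for (int r = 0; r < rows; ++r) {
            if (r < landed) row_visible[r] = 1;
            else in_flight_rows.push_back(r);
        }
    }

    // Stand-in for the barrier: every in-flight store lands before returning.
    void drain() {
        for (int r : in_flight_rows) row_visible[r] = 1;
        in_flight_rows.clear();
    }

    // Number of rows the peer would observe if the signal fired right now.
    int rows_seen_at_notify() const {
        return static_cast<int>(std::count(row_visible.begin(), row_visible.end(), 1));
    }
};
```

Calling `put(6)` followed directly by `rows_seen_at_notify()` reports 6 of 8 rows, matching the "top rows correct, bottom rows zero" symptom. Calling `drain()` in between reports all 8.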

## Environment

- Platform: a2a3 (Ascend 910B), 2 ranks
- pto-isa commit: `016396b57e2c17093f1194e6acd89bb112b0ab24`
- Frontend: PyPTO `pld.tensor.put` with sub-tile VEC staging (auto-chunk)

## Related

- #706 — [Pass Bug] insert-sync: missing MTE3->MTE2 hazard between `pto.tstore` and `pto.comm.tput`
- #730 — [Bug] insert-sync: MTE3->MTE2 pipe flag for same-address GM store->load round-trip doesn't guarantee GM visibility

