Diagnosis
pypto (orchestration auto-dep / OverlapMap) — no WAR anti-dependency is inserted between two distinct reshape views of the same external inout tensor. Filed here because the workaround and the proposed fix both live in pypto-lib kernels.
Description
Context. decode_attention_{swa,hca} now update the KV cache in-place on the kv_cache inout tensor (validated directly), instead of via a separate kv_cache_out output. The in-place scatter must run after sparse_attn reads the pristine cache: with multi-token decode (S=2) sliding-window, the current token reaches attention via the kv overlay, and committing it to the cache before attention corrupts an earlier token's still-in-window eviction slot.
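The ordering constraint above can be reproduced with a toy ring-buffer cache. This is only a sketch under assumptions not stated in the issue: the window size `W = 4`, the slot mapping `pos % W`, and the helper name `cached_window` are all hypothetical, and the real cache layout may differ. Positions 8 and 9 stand in for the two tokens of an S=2 decode step.

```python
W = 4  # assumed sliding-window size; slot for a position is pos % W (assumption)

def cached_window(cache, pos, overlay_positions):
    """Return what token `pos` reads from the cache for its window.

    Positions supplied by the current step's overlay are skipped,
    because they reach attention without going through the cache.
    """
    needed = [p for p in range(pos - W + 1, pos + 1) if p not in overlay_positions]
    return [cache[p % W] for p in needed]

# Cache holds positions 4..7; the current step decodes positions 8 and 9.
cache = [None] * W
for p in range(4, 8):
    cache[p % W] = p
new_positions = [8, 9]

# Correct order: attention reads the pristine cache, writeback happens later.
read_first = {pos: cached_window(cache, pos, new_positions) for pos in new_positions}

# Wrong order: both new tokens are committed before attention reads.
committed = list(cache)
for p in new_positions:
    committed[p % W] = p
write_first = {pos: cached_window(committed, pos, new_positions) for pos in new_positions}

print("read first :", read_first)
print("write first:", write_first)
```

In this model, position 9 lands in the slot that still holds position 5, which is inside the window of position 8. Committing early therefore replaces one of token 8's keys with token 9's.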
Bug. The orchestration does not order the writeback after the gather. In the generated orchestration cpp, the gather reads ori_kv_flat = ext_kv_cache.reshape(...) (add_input) and the writeback writes kv_cache_flat = ext_kv_cache.reshape(...) (add_output) — two distinct reshape objects of the same buffer. Auto-dep tracks RAW/WAW by SSA value (value-flow), not the underlying buffer, so no WAR edge is inserted → the in-place write races ahead of the read → x_out fails ~7% with non-deterministic failing indices across runs (kv_cache itself passes).
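The missing edge can be illustrated with a toy dependency tracker. It is a simplified model, not pypto's OverlapMap or auto-dep code: `View`, `build_edges`, and the `key` parameter are hypothetical names. The only point is that a tracker keyed on the view's own identity sees two unrelated objects, while one keyed on the underlying buffer finds the WAR hazard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    """A reshape view: its own name, but a shared underlying buffer."""
    name: str
    buffer: str

def build_edges(tasks, key):
    """Return (earlier_task, later_task, kind) edges in submission order.

    tasks: list of (task_name, reads, writes).
    key:   maps a View to the identity used for hazard tracking.
    """
    last_writer = {}  # key -> task that last wrote it
    readers = {}      # key -> tasks that read it since the last write
    edges = []
    for name, reads, writes in tasks:
        for v in reads:
            k = key(v)
            if k in last_writer:
                edges.append((last_writer[k], name, "RAW"))
            readers.setdefault(k, []).append(name)
        for v in writes:
            k = key(v)
            for r in readers.get(k, []):
                if r != name:
                    edges.append((r, name, "WAR"))
            if k in last_writer and last_writer[k] != name:
                edges.append((last_writer[k], name, "WAW"))
            last_writer[k] = name
            readers[k] = []
    return edges

# Two distinct reshape views of the same external buffer, as in the generated cpp.
ori_kv_flat = View("ori_kv_flat", "ext_kv_cache")
kv_cache_flat = View("kv_cache_flat", "ext_kv_cache")
tasks = [
    ("gather_kv", [ori_kv_flat], []),         # add_input
    ("cache_writeback", [], [kv_cache_flat]), # add_output
]

by_value = build_edges(tasks, key=lambda v: v.name)     # value-flow tracking
by_buffer = build_edges(tasks, key=lambda v: v.buffer)  # buffer-level tracking

print("keyed by value :", by_value)
print("keyed by buffer:", by_buffer)
```

Keyed by value, the model emits no edge between the two tasks, which matches the race described above. Keyed by buffer, it emits the gather-before-writeback WAR edge.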
Minimal confirmation. Manually changing add_input/add_output to add_inout on those two lines of the generated orchestration cpp, then rerunning via --runtime-dir, makes x_out PASS, confirming that the missing WAR edge is the sole cause.
Current workaround (in tree). A no-op self-copy (t[i:i+1] = t[i:i+1]) in the gather_kv scope of decode_sparse_attn.sparse_attn and in the *_cache_writeback scope, forcing both tasks to declare kv_cache as add_inout; the runtime then serializes inout-vs-inout on the shared buffer by submission order. Verified PASS (swa/hca/csa) on a2a3 with the PTO2_RING_* big pool. Downsides: the shared sparse_attn pays T no-op tile copies on every call (incl. csa/prefill that don't need it), and it relies on a serialization side-effect rather than an explicit edge.
Proposed fix — manual dependency edge. pypto exposes result, tid = pl.submit(kernel, *args, deps=[...]) (pypto/language/scope.py) — a precision tool for "edges the runtime cannot infer" that unions with auto-deps (final fanin = auto ∪ explicit). Capture the sparse_attn/gather producer TaskId and submit the cache writeback with deps=[gather_tid], forcing writeback-after-gather explicitly, and drop the gather_kv self-copy (removing the shared-kernel overhead).
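The effect of the explicit edge can be sketched with a small stand-in scheduler. `ToyScheduler` and its `reads`/`writes` keywords are hypothetical and are not pypto's API; only the `deps=[...]` argument, the `(result, tid)` return shape, and the "final fanin = auto ∪ explicit" rule come from the description above. The `kv_proj` producer is likewise invented, so that the writeback has one ordinary auto-dep.

```python
import itertools

class ToyScheduler:
    """Hypothetical stand-in for the orchestration runtime, not pypto's API."""

    def __init__(self):
        self._next_tid = itertools.count()
        self._last_writer = {}  # value name -> tid of its producer
        self.fanin = {}         # tid -> set of tids it must wait on

    def submit(self, kernel, reads=(), writes=(), deps=()):
        tid = next(self._next_tid)
        # Auto-deps: RAW edges keyed by value name only (value-flow).
        auto = {self._last_writer[v] for v in reads if v in self._last_writer}
        # Final fanin is the union of inferred and explicit edges.
        self.fanin[tid] = auto | set(deps)
        for v in writes:
            self._last_writer[v] = tid
        return kernel, tid

def writeback_fanin(use_explicit_dep):
    sched = ToyScheduler()
    _, proj_tid = sched.submit("kv_proj", writes=["kv"])
    _, gather_tid = sched.submit("gather_kv", reads=["ori_kv_flat", "kv"])
    deps = [gather_tid] if use_explicit_dep else []
    _, wb_tid = sched.submit(
        "cache_writeback", reads=["kv"], writes=["kv_cache_flat"], deps=deps
    )
    return sched.fanin[wb_tid], proj_tid, gather_tid

auto_only, proj_tid, gather_tid = writeback_fanin(use_explicit_dep=False)
with_dep, _, _ = writeback_fanin(use_explicit_dep=True)

print("auto only    :", auto_only)
print("with explicit:", with_dep)
```

Without the explicit edge, the writeback waits only on its value-flow producer. With `deps=[gather_tid]`, its fanin also contains the gather, so neither task has to be declared inout.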
Repro.
PTO2_RING_DEP_POOL=1048576 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_HEAP=4294967296 \
python models/deepseek/v4/decode_attention_swa.py -d <dev>
Without any inout-forcing, x_out fails ~7% non-deterministically; with both sides forced add_inout (current hack, or manual cpp edit), it PASSES.
Environment
| Component | Version |
|---|---|
| pypto-lib | 67bcc6f3 |
| pypto | 778dae48 (branch: main) |
| pypto runtime (submodule) | 48980572 |
| ptoas | 0.43 |
| CANN | not detected |
Host Platform
Linux (aarch64)