[Bug] decode_attention_swa/hca need a no-op self-copy hack to force a WAR edge auto-dep misses: replace with pl.submit(deps=) manual dependency (#481)

### Diagnosis

**pypto** (orchestration auto-dep / OverlapMap) — no WAR anti-dependency is inserted between two distinct reshape views of the same external inout tensor. Filed here because the workaround and the proposed fix both live in pypto-lib kernels.

### Description

**Context.** `decode_attention_{swa,hca}` now update the KV cache in place on the `kv_cache` inout tensor (which is validated directly), instead of writing a separate `kv_cache_out` output. The in-place scatter must run **after** `sparse_attn` has read the pristine cache. With multi-token decode (S=2) and a sliding window, the current token reaches attention via the `kv` overlay, and committing it to the cache before attention overwrites an eviction slot that is still inside an earlier token's window.
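To make the ordering constraint concrete, here is a small toy model. It is a sketch under assumptions, not the kernel's actual layout: it assumes a ring cache where a token at position `p` lives in slot `p % WINDOW`, and every name in it (`WINDOW`, `make_cache`, `commit`, `missing`) is hypothetical. It compares read-then-write against write-then-read for two new tokens (S=2).

```python
# Toy model of a sliding-window ring cache (assumed layout: slot = pos % WINDOW).
WINDOW = 4

def window_positions(pos):
    """Positions a token at `pos` attends to, itself included."""
    return set(range(max(0, pos - WINDOW + 1), pos + 1))

def make_cache(next_pos):
    """Cache contents after positions 0..next_pos-1 have been committed."""
    cache = [None] * WINDOW
    for p in range(next_pos):
        cache[p % WINDOW] = p
    return cache

def commit(cache, positions):
    """Return a copy of the cache with `positions` scattered into their slots."""
    out = list(cache)
    for p in positions:
        out[p % WINDOW] = p
    return out

def missing(cache, overlay, pos):
    """In-window positions that are in neither the cache nor the overlay."""
    reachable = {p for p in cache if p is not None} | set(overlay)
    return window_positions(pos) - reachable

new_tokens = [8, 9]            # S=2 decode step
pristine = make_cache(8)       # holds positions 4..7

# Attention reads the pristine cache, then the writeback runs.
read_then_write = {p: missing(pristine, new_tokens, p) for p in new_tokens}

# The writeback races ahead of the read.
write_then_read = {
    p: missing(commit(pristine, new_tokens), new_tokens, p) for p in new_tokens
}

print(read_then_write)   # nothing missing for either token
print(write_then_read)   # token 8 loses position 5
```

In this toy, committing token 9 evicts position 5, which token 8 still needs, while token 9 itself is unaffected. That is the shape of the corruption described above.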

**Bug.** The orchestration does not order the writeback after the gather. In the generated orchestration cpp, the gather reads `ori_kv_flat = ext_kv_cache.reshape(...)` (`add_input`) and the writeback writes `kv_cache_flat = ext_kv_cache.reshape(...)` (`add_output`): two distinct reshape objects over the same buffer. Auto-dep tracks RAW/WAW by SSA value (value flow), not by underlying buffer, so it inserts no WAR edge between them. The in-place write can therefore race ahead of the read, and `x_out` fails (~7%, with **non-deterministic** failing indices across runs), while `kv_cache` itself passes.
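The difference between tracking by value and tracking by buffer can be illustrated with a toy dependency tracker. This is a sketch only and does not reproduce pypto's OverlapMap; `View`, `infer_edges`, and the task tuples are hypothetical names, and the toy simply orders any conflicting accesses that share a tracking key.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    name: str    # SSA-like value name, e.g. "ori_kv_flat"
    buffer: str  # underlying storage, e.g. "ext_kv_cache"

def infer_edges(tasks, key):
    """Infer (before, after) edges from tasks given in submission order.

    tasks: list of (task_name, reads, writes), each a list of View.
    key:   maps a View to the identity used for hazard tracking.
    """
    last_writer, readers, edges = {}, {}, set()
    for name, reads, writes in tasks:
        for v in reads:
            k = key(v)
            if k in last_writer:
                edges.add((last_writer[k], name))      # RAW
            readers.setdefault(k, []).append(name)
        for v in writes:
            k = key(v)
            if k in last_writer and last_writer[k] != name:
                edges.add((last_writer[k], name))      # WAW
            for r in readers.get(k, []):
                if r != name:
                    edges.add((r, name))               # WAR
            last_writer[k] = name
            readers[k] = []
    return edges

tasks = [
    ("gather_kv",
     [View("ori_kv_flat", "ext_kv_cache")],            # read of the cache
     [View("gathered", "tmp")]),
    ("cache_writeback",
     [View("kv", "kv")],
     [View("kv_cache_flat", "ext_kv_cache")]),         # write via another view
]

by_value = infer_edges(tasks, key=lambda v: v.name)
by_buffer = infer_edges(tasks, key=lambda v: v.buffer)

print(by_value)    # no edge: the two views are different values
print(by_buffer)   # WAR edge from gather_kv to cache_writeback
```

Keyed by value, the two reshape views never meet in the tracker, so the writeback is unordered with respect to the gather. Keyed by buffer, the same task list yields the WAR edge.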

**Minimal confirmation.** Manually changing `add_input`/`add_output` to `add_inout` on those two lines of the generated orchestration cpp and rerunning via `--runtime-dir` makes `x_out` PASS, confirming that the missing WAR edge is the sole cause.

**Current workaround (in tree).** A no-op self-copy (`t[i:i+1] = t[i:i+1]`) is placed in the `gather_kv` scope of `decode_sparse_attn.sparse_attn` and in the `*_cache_writeback` scope. This forces both tasks to declare `kv_cache` as `add_inout`, and the runtime then serializes inout-vs-inout accesses to the shared buffer in submission order. Verified PASS (swa/hca/csa) on a2a3 with the `PTO2_RING_*` big pool. Downsides: the shared `sparse_attn` pays T no-op tile copies on every call (including csa/prefill, which don't need them), and correctness relies on a serialization side effect rather than an explicit edge.

**Proposed fix: manual dependency edge.** pypto exposes `result, tid = pl.submit(kernel, *args, deps=[...])` (`pypto/language/scope.py`), a precision tool for "edges the runtime cannot infer". Explicit deps are unioned with auto-deps (final fanin = auto ∪ explicit). Capture the TaskId of the `sparse_attn`/gather producer and submit the cache writeback with `deps=[gather_tid]`, which forces writeback-after-gather explicitly. The `gather_kv` self-copy can then be dropped, removing the shared-kernel overhead.
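The "final fanin = auto ∪ explicit" rule can be sketched with a toy task graph. This is not pypto's `pl.submit`: `ToyGraph` and its methods are hypothetical, and only the `(result, tid)` return shape and the `deps=` keyword mirror the API described above.

```python
import itertools

class ToyGraph:
    """Toy stand-in for a task graph; NOT pypto's runtime."""

    def __init__(self):
        self.fanin = {}
        self._ids = itertools.count()

    def submit(self, name, auto_deps=(), deps=()):
        """Register a task; its fanin is the union of auto and explicit deps."""
        tid = next(self._ids)
        self.fanin[tid] = set(auto_deps) | set(deps)
        return name, tid

    def allows(self, order):
        """True if `order` runs every task only after all of its fanin."""
        done = set()
        for tid in order:
            if not self.fanin[tid] <= done:
                return False
            done.add(tid)
        return True

# Without an explicit edge: auto-dep found nothing, so either order is legal.
g = ToyGraph()
_, gather_tid = g.submit("gather_kv")
_, wb_tid = g.submit("cache_writeback")
race_possible = g.allows([wb_tid, gather_tid])

# With deps=[gather_tid]: the writeback can no longer run first.
g2 = ToyGraph()
_, gather_tid2 = g2.submit("gather_kv")
_, wb_tid2 = g2.submit("cache_writeback", deps=[gather_tid2])
bad_order_allowed = g2.allows([wb_tid2, gather_tid2])
good_order_allowed = g2.allows([gather_tid2, wb_tid2])

print(race_possible, bad_order_allowed, good_order_allowed)
```

The explicit edge only adds to the fanin, so any auto-inferred dependencies on the writeback would be kept alongside it.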

**Repro.**
```
PTO2_RING_DEP_POOL=1048576 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_HEAP=4294967296 \
  python models/deepseek/v4/decode_attention_swa.py -d <dev>
```
Without any inout-forcing, `x_out` fails ~7% non-deterministically; with both sides forced `add_inout` (current hack, or manual cpp edit), it PASSES.

### Environment

| Component | Version |
|---|---|
| pypto-lib | `67bcc6f3` |
| pypto | `778dae48` (branch: main) |
| pypto runtime (submodule) | `48980572` |
| ptoas | 0.43 |
| CANN | not detected |

### Host Platform

Linux (aarch64)
