[Bug] decode_attention_swa/hca need a no-op self-copy hack to force a WAR edge auto-dep misses — replace with pl.submit(deps=) manual dependency #481

Description

@zhangqi-chen

Diagnosis

The root cause is in pypto (orchestration auto-dep / OverlapMap): no WAR anti-dependency is inserted between two distinct reshape views of the same external inout tensor. It is filed here because the workaround and the proposed fix both live in pypto-lib kernels.

Description

Context. decode_attention_{swa,hca} now update the KV cache in-place on the kv_cache inout tensor (validated directly), instead of via a separate kv_cache_out output. The in-place scatter must run after sparse_attn reads the pristine cache: with multi-token decode (S=2) sliding-window, the current token reaches attention via the kv overlay, and committing it to the cache before attention corrupts an earlier token's still-in-window eviction slot.
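
The ordering hazard described above can be illustrated with a toy NumPy model. This is only a sketch under assumptions that are not taken from the kernels: the cache is modelled as a ring buffer with slot = pos % W, the window size W = 4 is arbitrary, and gather_window is a hypothetical helper.

```python
import numpy as np

W = 4  # assumed: window size equals the number of ring-buffer slots (slot = pos % W)

def gather_window(cache, pos):
    """Read the W-1 cached positions that precede `pos` (hypothetical helper)."""
    prev = np.arange(pos - (W - 1), pos)
    return cache[prev % W]

# Cache holds positions 0..3; the stored value is 10 + position.
cache = np.array([10.0, 11.0, 12.0, 13.0])

# Multi-token decode (S=2): new tokens at positions 4 and 5.
new_pos = np.array([4, 5])
new_kv = np.array([14.0, 15.0])

# Correct order: token 4 gathers positions 1, 2, 3 from the pristine cache.
read_first = gather_window(cache, 4)

# Wrong order: commit both tokens first. Position 5 lands in slot 1,
# which still holds position 1, and position 1 is inside token 4's window.
committed = cache.copy()
committed[new_pos % W] = new_kv
write_first = gather_window(committed, 4)

print(read_first)   # [11. 12. 13.]
print(write_first)  # [15. 12. 13.]
```

In this model the early commit replaces position 1 with position 5 in the earlier token's window, which is the kind of corruption the writeback-after-gather ordering is meant to prevent.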

Bug. The orchestration does not order the writeback after the gather. In the generated orchestration cpp, the gather reads ori_kv_flat = ext_kv_cache.reshape(...) (add_input) and the writeback writes kv_cache_flat = ext_kv_cache.reshape(...) (add_output): two distinct reshape objects over the same buffer. Auto-dep tracks RAW/WAW by SSA value (value-flow), not by the underlying buffer, so it inserts no WAR edge between them. The in-place write can then race ahead of the read, and x_out fails ~7% with non-deterministic failing indices across runs (kv_cache itself passes).
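
A toy dependency tracker makes the identity problem concrete. It is not pypto's OverlapMap; View, Task, and infer_edges are hypothetical names, and the only point is that the same hazard analysis yields a WAR edge when operands are keyed by buffer and none when they are keyed by view identity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    name: str    # SSA-like identity of the reshape result
    buffer: str  # underlying storage the view aliases

@dataclass
class Task:
    name: str
    reads: tuple = ()
    writes: tuple = ()

def infer_edges(tasks, key):
    """Return (earlier, later) edges for RAW, WAW and WAR hazards,
    where operand identity is decided by key(view)."""
    last_writer = {}
    readers = {}  # readers seen since the last write to each key
    edges = set()
    for t in tasks:
        for v in t.reads:
            k = key(v)
            if k in last_writer:
                edges.add((last_writer[k], t.name))  # RAW
            readers.setdefault(k, []).append(t.name)
        for v in t.writes:
            k = key(v)
            if k in last_writer:
                edges.add((last_writer[k], t.name))  # WAW
            for r in readers.get(k, []):
                if r != t.name:
                    edges.add((r, t.name))  # WAR
            last_writer[k] = t.name
            readers[k] = []
    return edges

ori_kv_flat = View("ori_kv_flat", buffer="ext_kv_cache")
kv_cache_flat = View("kv_cache_flat", buffer="ext_kv_cache")
tasks = [
    Task("gather_kv", reads=(ori_kv_flat,)),
    Task("cache_writeback", writes=(kv_cache_flat,)),
]

by_value = infer_edges(tasks, key=lambda v: v.name)
by_buffer = infer_edges(tasks, key=lambda v: v.buffer)
print(by_value)   # set()
print(by_buffer)  # {('gather_kv', 'cache_writeback')}
```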

Minimal confirmation. Manually changing those two lines in the orchestration cpp from add_input/add_output to add_inout and rerunning via --runtime-dir makes x_out PASS, confirming that the missing WAR edge is the sole cause.

Current workaround (in tree). A no-op self-copy (t[i:i+1] = t[i:i+1]) is added in the gather_kv scope of decode_sparse_attn.sparse_attn and in the *_cache_writeback scope, forcing both tasks to declare kv_cache as add_inout; the runtime then serializes inout-vs-inout on the shared buffer by submission order. Verified PASS (swa/hca/csa) on a2a3 with the PTO2_RING_* big pool. Downsides: the shared sparse_attn pays T no-op tile copies on every call (including csa/prefill, which don't need it), and the workaround relies on a serialization side effect rather than an explicit edge.
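
The serialization side effect the workaround relies on can be modelled roughly as follows. This is a simplified stand-in, not the runtime's actual logic: serialize_inout and the (task, buffer, mode) tuples are hypothetical, and the only rule modelled is "inout tasks on the same buffer are chained in submission order".

```python
def serialize_inout(submissions):
    """submissions: list of (task_name, buffer, mode) in submission order.
    Chains tasks that declare the same buffer as inout (toy rule)."""
    last_inout = {}
    edges = []
    for name, buf, mode in submissions:
        if mode != "inout":
            continue
        if buf in last_inout:
            edges.append((last_inout[buf], name))
        last_inout[buf] = name
    return edges

# Before the hack: one side is an input, the other an output, so nothing is chained.
before = serialize_inout([
    ("gather_kv", "ext_kv_cache", "input"),
    ("cache_writeback", "ext_kv_cache", "output"),
])

# After the self-copy hack: both sides declare inout, so they are chained.
after = serialize_inout([
    ("gather_kv", "ext_kv_cache", "inout"),
    ("cache_writeback", "ext_kv_cache", "inout"),
])

print(before)  # []
print(after)   # [('gather_kv', 'cache_writeback')]
```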

Proposed fix: a manual dependency edge. pypto exposes result, tid = pl.submit(kernel, *args, deps=[...]) (pypto/language/scope.py), a precision tool for "edges the runtime cannot infer" whose explicit deps are unioned with auto-deps (final fanin = auto ∪ explicit). Capture the sparse_attn/gather producer TaskId and submit the cache writeback with deps=[gather_tid], which forces writeback-after-gather explicitly, then drop the gather_kv self-copy (removing the shared-kernel overhead).
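
The union semantics (final fanin = auto ∪ explicit) can be sketched with a stand-in runtime. ToyRuntime and its submit method are hypothetical and do not reproduce pl.submit's real signature: the reads/writes keywords, the kv_proj producer, and the kv_new operand are assumptions for illustration only. Auto-deps here are deliberately keyed by view name, mirroring the value-flow behaviour described in the bug.

```python
import itertools

class ToyRuntime:
    """Stand-in scheduler: auto-deps by view name, plus explicit deps."""

    def __init__(self):
        self._ids = itertools.count()
        self._last_writer = {}  # view name -> task id of its last writer
        self.fanin = {}         # task id -> set of task ids it waits on

    def submit(self, kernel_name, reads=(), writes=(), deps=()):
        tid = next(self._ids)
        # Auto-deps: RAW/WAW on the same view name only (value-flow).
        auto = {self._last_writer[v] for v in (*reads, *writes)
                if v in self._last_writer}
        # Final fanin is the union of inferred and explicit edges.
        self.fanin[tid] = auto | set(deps)
        for v in writes:
            self._last_writer[v] = tid
        return None, tid  # (result placeholder, task id)

rt = ToyRuntime()
_, proj_tid = rt.submit("kv_proj", writes=["kv_new"])          # hypothetical producer
_, gather_tid = rt.submit("gather_kv", reads=["ori_kv_flat"])
_, wb_tid = rt.submit(
    "cache_writeback",
    reads=["kv_new"],
    writes=["kv_cache_flat"],
    deps=[gather_tid],  # the edge auto-dep cannot infer
)

print(rt.fanin[wb_tid] == {proj_tid, gather_tid})  # True
```

Without deps=[gather_tid], the writeback's fanin in this model would contain only the kv_proj producer, so nothing would hold it back until the gather has read the cache.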

Repro.

PTO2_RING_DEP_POOL=1048576 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_HEAP=4294967296 \
  python models/deepseek/v4/decode_attention_swa.py -d <dev>

Without any inout-forcing, x_out fails ~7% non-deterministically; with both sides forced add_inout (current hack, or manual cpp edit), it PASSES.

Environment

| Component | Version |
| --- | --- |
| pypto-lib | 67bcc6f3 |
| pypto | 778dae48 (branch: main) |
| pypto runtime (submodule) | 48980572 |
| ptoas | 0.43 |
| CANN | not detected |

Host Platform

Linux (aarch64)

Metadata

Labels: bug
