Track: a2a3 tile-op ST gaps deferred from PR #1823 (gemv family, bitwise int32, prelu, rem, sin/cos width-128, matmul_bias narrow-N) #1846

_Auto-audited tracking issue: each deferred op was re-verified against the freshly-cloned a2a3 pto-isa headers and PyPTO codegen; reported symptoms were corrected where the audit disagreed (see Verification notes)._


These tile operators were excluded from a2a3 on-board ST coverage in PR #1823 due to a2a3 ISA / codegen gaps. This issue tracks them so they are not forgotten. Each item below records the audited root cause (which can differ from the originally reported symptom — see Verification notes), the load-bearing evidence, and the fix direction.

## pto-isa limitations (a2a3 ISA / PTOAS, file against pto-isa)

- [ ] **gemv_acc** — a2a3 lacks the 3-tile shared-accumulator overload of `TGEMV_ACC` that `TMATMUL_ACC` has; in-place `cOut==cIn` accumulation can't be assembled. PyPTO codegen is correct (Acc aliasing via `make_acc_codegen`, not a tmov). Evidence: `build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:120` (only 4-tile `TGEMV_ACC_IMPL`, no 3-tile overload like `:180`), `pto_instr.hpp:800`/`:808`, `README.md:81` (TGEMV_ACC = TODO vs `:94` TMATMUL_ACC = Yes). Fix: add the 3-tile shared-acc overload + PTOAS support in pto-isa, flip README TODO→Yes; PyPTO just un-defers ST. No PyPTO code change.

- [ ] **rem / rems** — a2a3 `RemInt32Instr` emulates int32 modulo via int32→fp32→int32 round-trip, clobbering src0/src1 in place and losing precision past fp32's 24-bit mantissa; `rems` shares the same precision-loss class plus a tmp-row mismatch. PyPTO codegen is correct (kSimpleOps + tmp operand from PR #1823). Evidence: `build_output/_deps/pto-isa/include/pto/npu/a2a3/TRem.hpp:67-85` (round-trip, in-place input clobber at `:82-83`), `:147-148` (static_assert advertises int32), a5's true `vmod` at `TRem.hpp:72`; `TRemS.hpp:67-82`; tmp-row asymmetry `TRemS.hpp:135` (≥1) vs `TRem.hpp:161` (≥2). Fix: pto-isa stage the round-trip through scratch tmp (stop clobbering inputs), route int32 to a true integer-modulo path like a5. Keep out of a2a3 ST until landed.

- [ ] **xor / xors** — a2a3 `TXOR`/`TXORS` are TOR+TAND+TNOT composites whose static_assert restricts element type to 16-/8-bit ints; int32 (default `int`/`torch.int32`) trips the assert at PTOAS. PyPTO codegen is dtype-agnostic passthrough. Evidence: `build_output/_deps/pto-isa/include/pto/npu/a2a3/TXor.hpp:28-30` (int8/uint8/int16/uint16 only — int8/uint8 ARE allowed), `TXor.hpp:49-60`, `TBitwiseSOp.hpp:121-135`; `pto_ops_common.cpp:2254`/`:2282`; `tests/st/runtime/ops/test_bitwise.py:21`. Fix (immediate, no code): re-enable a2a3 ST restricted to int16/uint16 (mirror the shipped `not_` pattern). If int32 xor is actually required, file pto-isa to widen the composites. Optionally tighten PyPTO deducers so int32 xor fails at compile with a clear message.

- [ ] **and_ / or_ (+ scalar ands/ors)** — int32 specifically is a2a3-unsupported: `TAND`/`TOR` accept only 1-/2-byte integral types (32-bit is A5-only), so an int32 tile yields the cryptic PTOAS `pto.tand invalid kind of type`. The ops themselves ARE supported on a2a3. Evidence: `build_output/_deps/pto-isa/include/pto/npu/a2a3/TAnd.hpp:75`, `TOr.hpp:65-67`, `tand.md` Target-Profile Restrictions (A2A3 = 1-/2-byte; A5 adds 4-byte); `KNOWN_ISSUES.md:216-222`. Fix: re-test with INT16/UINT16 and un-defer; add a PyPTO DSL/IR-level dtype check rejecting 32-bit and_/or_ on a2a3 with an actionable message. (See Verification notes — `shl`/`shr` split out below.)

- [ ] **sin / cos at width-128 / cols ≥ 64** — on-device-reproduced: `[8,128]` FP32 sin/cos correct for cols 0–63, garbage for 64–127 (64/1024 wrong), while `sqrt` at the same shape passes. sin/cos are software-decomposed (`LowerSinCos`) into ~30 primitives including FP32↔INT32 `tcvt` round-trips and 2-source binary ops — the only ops exercising those at width 128. Readable pto-isa headers handle 128 cols correctly (`TCvt.hpp:1166-1176`/`:1207-1208` → 2 repeats; `TBinOp.hpp` contiguous flatten), so the residual defect is most plausibly an assembler/intrinsic-level miscompile of the tcvt round-trip or 2-source binary on the second 64-element segment, below the C++ template layer. Evidence: `src/ir/transforms/lower_composite_ops_pass.cpp:317-373`/`:647-648`; `pto_ops_common.cpp:2261-2270` (no sin/cos mapping, sqrt single `pto.tsqrt`); `KNOWN_ISSUES.md:278-284`; `tests/st/runtime/ops/test_unary_math.py:49-58`. Fix: do NOT change PyPTO. Localize on device with an isolation ST running only (a) FP32→INT32→FP32 cast and (b) a single 2-source binary at `[8,128]`, comparing cols 0–63 vs 64–127; file pto-isa against the offending intrinsic (mirrors issue #173 / #1790 second-segment-tail pattern). Keep the `[8,128]` case skipped meanwhile.
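The precision-loss class named in the rem / rems item can be reproduced on the host, without a device. The following is a minimal sketch and not the `RemInt32Instr` implementation: it only models an int32→fp32→int32 round-trip with Python's `struct` module, and the helper names `to_fp32` and `int32_via_fp32` are hypothetical. It shows that values needing more than 24 significant bits do not survive the round-trip, so a modulo computed from the converted operands can differ from true integer modulo.

```python
import struct


def to_fp32(value: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def int32_via_fp32(x: int) -> int:
    """Host-side model of an int32 -> fp32 -> int32 round-trip (illustration only)."""
    return int(to_fp32(float(x)))


exact = (1 << 24) - 1  # 16777215 fits in fp32's 24-bit significand
lossy = (1 << 24) + 1  # 16777217 needs 25 bits, so fp32 has to round it

print(int32_via_fp32(exact) == exact)        # True
print(int32_via_fp32(lossy))                 # 16777216
print(lossy % 2, int32_via_fp32(lossy) % 2)  # 1 0: the remainder changes
```

If the round-trip path stays in place, an ST that wants to expose the defect would probably need operands with magnitude above 2^24; smaller operands convert exactly and would pass.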

## PyPTO codegen bugs

- [ ] **prelu** — codegen mapping-table arity is wrong (2, should be 3); the DSL/IR 3-arg shape `(tile, slope, tmp)` is correct and matches the 4-operand `TPRELU_IMPL(dst,src0,src1,tmp)` ISA contract (tmp is mandatory scratch). `MakeNaryCodegenPTO`'s `CHECK(args.size()==arity)` rejects the 3-arg call at compile time. Evidence: root cause at `src/backend/common/pto_ops_common.cpp:2260` `{tile.prelu, pto.tprelu, 2}`; arity check `:437`; `build_output/_deps/pto-isa/include/pto/npu/a2a3/TPrelu.hpp:42-62`/`:38-39`; DSL `python/pypto/language/op/tile_ops.py:1894`/`:1907`; IR `src/ir/op/tile_ops/elementwise.cpp:682-695`. Fix: change `:2260` to arity 3, then re-enable and verify a2a3 prelu ST. One-line change.

- [ ] **subc / subsc — wrong-doc (NOT a miscompute)** — hardware and PyPTO codegen are both correct (`a - b + c`); the defect is PyPTO docs/op-descriptions advertising `lhs - rhs - rhs2`, so a user gets `a-b+c` silently. Evidence: `build_output/_deps/pto-isa/include/pto/cpu/ElementOp.h:399-403`/`:605-608` (`a-b+c`); `tsubc.md:15`/`:61`, `tsubsc.md:15`; faithful codegen `pto_ops_common.cpp:2273`/`:2290`; wrong docs at `python/pypto/language/op/tile_ops.py:1946`/`:1981` and `src/ir/op/tile_ops/elementwise.cpp:714`/`:743`. Fix: correct the four docstrings/descriptions to `lhs - rhs + rhs2` / `lhs - scalar + rhs2`. No codegen change. If `a-b-c` is genuinely intended, that's a separate feature (no a-b-c primitive in pto-isa) and must be raised with the user.

- [ ] **expands (tile.expands)** — orphan duplicate op: IR op + DSL alias exist and `tensor.expands→tile.expands` routes to it, but it has NO f_codegen, so it crashes codegen. `TEXPANDS` is only reachable via the separate `tile.full` (`pl.full`) path. Evidence: `src/ir/op/tile_ops/broadcast.cpp:412`; reachable via `tile_ops.py:1402-1414` and `op_conversion_registry.cpp:177`; working path `pto_ops_common.cpp:3429-3435`; no reg in `pto_ops_common.cpp`. Fix: register an f_codegen emitting `pto.texpands` (mirror `MakeFullCodegenPTO`), or preferably retire `tile.expands` and route `pl.expands`/`tensor.expands` to the existing `tile.full` path.

- [ ] **sum (tile.sum)** — has axis/keepdim but is never lowered to `tile.row_sum`/`tile.col_sum` and has NO f_codegen → crashes codegen (intentionally used as a guaranteed-failure case in a codegen UT). The working reduction path is `pl.row_sum`/`pl.col_sum` → `pto.trowsum`/`pto.tcolsum`. Evidence: `src/ir/op/tile_ops/reduction.cpp:209`; `flatten_tile_nd_to_2d_pass.cpp:2020-2050` (only remaps axis); `tests/ut/codegen/test_pto_codegen.py:1548-1574`. Fix: add a lowering rewriting `tile.sum(axis)` → row_sum (last axis) / col_sum (axis 0) with the required tmp, or retire `tile.sum` and steer users to `pl.row_sum`/`pl.col_sum`.
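For the subc / subsc item, the gap between the audited semantics (`a - b + c`) and the documented ones (`lhs - rhs - rhs2`) can be made concrete with a host-side reference. This is a sketch on plain Python lists; the function names `subc_audited` and `subc_as_documented` are hypothetical and are not part of PyPTO or pto-isa.

```python
def subc_audited(a, b, c):
    """Elementwise a - b + c, the behaviour the audit attributes to hardware and codegen."""
    return [x - y + z for x, y, z in zip(a, b, c)]


def subc_as_documented(a, b, c):
    """Elementwise a - b - c, what the current docstrings advertise."""
    return [x - y - z for x, y, z in zip(a, b, c)]


a = [10.0, 4.0, 0.5]
b = [3.0, 1.0, 0.25]
c = [2.0, 0.0, 1.0]

audited = subc_audited(a, b, c)           # [9.0, 3.0, 1.25]
documented = subc_as_documented(a, b, c)  # [5.0, 3.0, -0.75]

# The two references agree only where c is zero.
mismatches = [i for i, (p, q) in enumerate(zip(audited, documented)) if p != q]
print(mismatches)  # [0, 2]
```

An ST golden built like `subc_as_documented` would report a mismatch against a correct device result, which matches the explanation given in the Verification notes.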

## Needs design (operand-shape / valid-shape decision)

- [ ] **gemv / gemv_bias — IN PROGRESS** (see below) — 1-row lhs must be loaded to the Left (L0A) cube tile as `Rows==1` + row_major (none_box L1) with K aligned to `512/sizeof` (K%128 fp32, K%256 bf16) so a2a3 `TExtractToLeft`/`TMovToLeft` dispatch to the `TExtractToAVector` 1-row path. No ISA feature missing. Evidence: `build_output/_deps/pto-isa/include/pto/npu/a2a3/textract_common.hpp:470-479` (Rows==1 && isRowMajor → vector path), constraint on COLS at `:99-118` (`srcCol/dstCol % (512/sizeof)`), `tmov_common.hpp:17-20`, `tmatmul_kernel.cpp:347-353` (`RunTGEMV` reference), `constants.hpp:33-34`; PyPTO `pto_type_utils.cpp:129-137`, `pto_ops_common.cpp:2323-2325`, `matmul.cpp:308-350`. Fix: drive K alignment and row_major/none_box lhs load layout from the DSL/load (user controls K alignment via shapes/valid_shapes per the perf-sizes rule). **Being actively addressed in this worktree** (`test_gemv.py` loads `[1,k]` lhs to Mat with K aligned to 128/256) — track as in-progress, not just deferred.

- [ ] **sels (tile.sels)** — IR op reachable from `pl.sels` but NO f_codegen, and its 3-arg signature `(lhs,rhs,select_mode)` does not match the real a2a3 `TSELS` (vsel-style: `dst,mask,src,tmp,scalar`). Evidence: `src/ir/op/tile_ops/elementwise.cpp:872`; `build_output/_deps/pto-isa/include/pto/npu/a2a3/TSels.hpp:20-26`/`:52-56`; no reg in `pto_ops_common.cpp`. Fix: correct REGISTER_OP to the real TSELS shape (mirror `tile.sel`/`TSEL`) then add an f_codegen emitting `pto.tsels`; or retire the orphan op + DSL alias if redundant.

- [ ] **matmul_bias narrow-N** — the matmul_bias output (L0C Acc) valid_shape is always the full physical `[M,N]` (operand narrowing never propagated), so `tile.store` copies the full `[M,N]`; but `mad` writes only `[validRow x validCol]` and never clears L0C tail columns, so `[:, VALID_N:]` ships stale data. narrow-M/K pass (whole-row zero / no tail); narrow-N fails (stale within partially-written Nz blocks). Evidence: `src/ir/op/tile_ops/matmul.cpp:232`/`:255-257`; `build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:192-199`; CPU ref `pto/cpu/TMatmul.hpp:23-41`/`:196-214`; `tstore_common.hpp:163-204`; `pto_ops_common.cpp:1410`; `tests/st/runtime/ops/test_gemv.py:179-186`. Fix (primary, pypto-codegen): propagate operand valid_shapes into `DeduceTileMatMul*Type` (output valid_row from lhs, valid_col from rhs) so `tile.store` copies only `[M, VALID_N]` and GM `[:, VALID_N:]` keeps its pre-initialized zero. Alternative: clear the Acc tail columns in cube codegen. (Spans both layers; the pto-isa no-tail-zero behavior is a hardware reality to work around, not a fix target.)
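The narrow-N failure in the matmul_bias item can be illustrated with a toy host-side model. It is a sketch under loud assumptions: the accumulator is a plain row-major list of lists (the real Nz block layout is ignored), `STALE` stands in for whatever L0C held before, and `mad_into_acc` and `store` are hypothetical helpers, not PyPTO or pto-isa functions. It compares storing the full physical `[M, N]` with storing only `[M, VALID_N]` into a zero-initialized GM buffer.

```python
STALE = -99.0  # stand-in for leftover accumulator contents


def mad_into_acc(acc, lhs, rhs, valid_m, valid_n, k):
    """Write only the [valid_m x valid_n] region; other acc cells stay untouched."""
    for i in range(valid_m):
        for j in range(valid_n):
            acc[i][j] = sum(lhs[i][p] * rhs[p][j] for p in range(k))


def store(acc, gm, rows, cols):
    """Copy a [rows x cols] window of the accumulator into GM."""
    for i in range(rows):
        for j in range(cols):
            gm[i][j] = acc[i][j]


M, N, K, VALID_N = 2, 4, 2, 3
lhs = [[1.0, 2.0], [3.0, 4.0]]
rhs = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]

acc = [[STALE] * N for _ in range(M)]
mad_into_acc(acc, lhs, rhs, M, VALID_N, K)

gm_full = [[0.0] * N for _ in range(M)]
store(acc, gm_full, M, N)          # output valid_shape == full physical [M, N]

gm_narrow = [[0.0] * N for _ in range(M)]
store(acc, gm_narrow, M, VALID_N)  # output valid_shape narrowed to [M, VALID_N]

print(gm_full[0])    # [1.0, 2.0, 3.0, -99.0]: stale tail column shipped
print(gm_narrow[0])  # [1.0, 2.0, 3.0, 0.0]: GM keeps its pre-initialized zero
```

In this model the primary fix corresponds to changing only the window passed to `store`; the alternative fix would correspond to overwriting the `STALE` cells before the store.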

## Verification notes

Several originally reported symptoms were refuted or corrected by the audit:

- **gemv_acc** — reported as a "missing acc→acc `pto.tmov`" in PyPTO. Refuted: PyPTO handles acc accumulation correctly via Acc-address aliasing (`make_acc_codegen`, same lambda as the working `matmul_acc`); the real gap is the pto-isa TGEMV_ACC TODO / missing 3-tile shared-acc overload.
- **subc / subsc** — reported as `pto.tsubc` miscomputing (`a-b-c`). Refuted: hardware computes `a-b+c` and PyPTO codegen is faithful; the defect is a PyPTO wrong-doc that advertises `a-b-c`. A naive `a-b-c` numpy reference is why ST mismatched, not a device miscompute.
- **gemv 1-row** — reported as hitting the `TExtractToA` `srcRow/dstRow % 16 == 0` static_assert. Corrected: the 1-row + row_major case explicitly *skips* `TExtractToA` (`textract_common.hpp:470-479`) and goes to `TExtractToAVector`, whose real constraint is on K (cols) `% (512/sizeof)`, not rows. Symptom cited the wrong static_assert.
- **xor / xors** — reported as "only int16/uint16". Imprecise: the static_assert also permits int8/uint8; only int32/int64 are unsupported.
- **and_ / or_ / shl / shr** — reported as "PTOAS rejects `pto.tand/tor/tshl/tshr` as unsupported on a2a3". Overbroad: the ops ARE supported; the real trigger is the **int32 element type** on tand/tor (a2a3 is 8/16-bit only, 32-bit is A5-only). Furthermore **shl / shr are likely over-attributed** — `TShl.hpp:42-45` / `TShr.hpp:42-45` allow 8/16/32-bit signed+unsigned, so i32 shifts should assemble at the ISA-impl level. The i32 error was observed for `tand` and assumed family-wide. Action: re-test shl/shr with int32 on device; if PTOAS still rejects, that is a *separate* real pto-isa/PTOAS verifier-vs-impl gap to file with an i32-shift repro — confirm on device before attributing. Correct the KNOWN_ISSUES/PR wording from "op unsupported" to "i32 element type unsupported for tand/tor on a2a3 (8/16-bit only)".
- **prelu** — reported as a DSL/`pto.tprelu` signature mismatch. Corrected: the DSL/IR 3-arg shape is correct and matches the ISA; the actual bug is the codegen mapping-table arity (2 instead of 3).
- **matmul_bias narrow-N** — reported simply as "the cube does not zero `[:, VALID_N:]`". Refined: the precise cause is a PyPTO valid_shape/value mismatch (output valid_shape hardcoded to full physical `[M,N]`), with the cube's no-tail-zero being the underlying hardware reality.
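The gemv 1-row correction above places the real constraint on K (cols) being a multiple of `512/sizeof`. The arithmetic is small enough to sketch; the helper names below (`k_alignment`, `is_k_aligned`, `align_k_up`) are hypothetical and exist only for illustration, and the element sizes assume fp32 = 4 bytes and bf16 = 2 bytes.

```python
ALIGN_BYTES = 512  # from the `512/sizeof` constraint cited in this issue


def k_alignment(elem_bytes: int) -> int:
    """Number of elements K must be a multiple of for a given element size."""
    return ALIGN_BYTES // elem_bytes


def is_k_aligned(k: int, elem_bytes: int) -> bool:
    """True when K satisfies K % (512 / sizeof) == 0."""
    return k % k_alignment(elem_bytes) == 0


def align_k_up(k: int, elem_bytes: int) -> int:
    """Smallest aligned K that is >= k (ceiling to the alignment)."""
    step = k_alignment(elem_bytes)
    return -(-k // step) * step


FP32_BYTES, BF16_BYTES = 4, 2

print(k_alignment(FP32_BYTES), k_alignment(BF16_BYTES))  # 128 256
print(is_k_aligned(200, FP32_BYTES))                     # False
print(align_k_up(200, FP32_BYTES))                       # 256
print(align_k_up(257, BF16_BYTES))                       # 512
```

A check along these lines could sit in a test helper that picks shapes for `[1,k]` lhs loads; whether padding K up or rejecting unaligned K is the right behaviour is part of the design decision tracked above.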

A temp-operand codegen fix for `trem/trems/txor/txors` already landed in PR #1823; only the `not_` and `matmul_bias` STs were retained on a2a3 in that PR.
