Auto-audited tracking issue: each deferred op was re-verified against the freshly-cloned a2a3 pto-isa headers and PyPTO codegen; reported symptoms were corrected where the audit disagreed (see Verification notes).
These tile operators were excluded from a2a3 on-board ST coverage in PR #1823 due to a2a3 ISA / codegen gaps. This issue tracks them so they are not forgotten. Each item below records the audited root cause (which can differ from the originally reported symptom — see Verification notes), the load-bearing evidence, and the fix direction.
pto-isa limitations (a2a3 ISA / PTOAS, file against pto-isa)
gemv_acc — a2a3 lacks the 3-tile shared-accumulator overload of TGEMV_ACC that TMATMUL_ACC has; in-place cOut==cIn accumulation can't be assembled. PyPTO codegen is correct (Acc aliasing via make_acc_codegen, not a tmov). Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:120 (only 4-tile TGEMV_ACC_IMPL, no 3-tile overload like :180), pto_instr.hpp:800/:808, README.md:81 (TGEMV_ACC = TODO vs :94 TMATMUL_ACC = Yes). Fix: add the 3-tile shared-acc overload + PTOAS support in pto-isa, flip README TODO→Yes; PyPTO just un-defers ST. No PyPTO code change.
rem / rems — a2a3 RemInt32Instr emulates int32 modulo via an int32→fp32→int32 round-trip, clobbering src0/src1 in place and losing precision past fp32's 24-bit mantissa; rems shares the same precision-loss class plus a tmp-row mismatch. PyPTO codegen is correct (kSimpleOps + tmp operand from PR #1823). Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TRem.hpp:67-85 (round-trip, in-place input clobber at :82-83), :147-148 (static_assert advertises int32), a5's true vmod at TRem.hpp:72; TRemS.hpp:67-82; tmp-row asymmetry TRemS.hpp:135 (≥1) vs TRem.hpp:161 (≥2). Fix: in pto-isa, stage the round-trip through scratch tmp (stop clobbering inputs) and route int32 to a true integer-modulo path like a5. Keep out of a2a3 ST until landed.
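The precision loss described above can be reproduced off-device. The following is a minimal host-side sketch, not the actual RemInt32Instr sequence: `to_fp32` and `rem_via_fp32` are hypothetical names, and the truncated-quotient formulation is an assumption about how such an emulation might be structured.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rem_via_fp32(a: int, b: int) -> int:
    """Hypothetical model of an int32 -> fp32 -> int32 remainder emulation."""
    fa = to_fp32(float(a))              # int32 -> fp32 (lossy above 2**24)
    fb = to_fp32(float(b))
    q = to_fp32(float(int(fa / fb)))    # truncated quotient, kept in fp32
    return int(to_fp32(fa - to_fp32(q * fb)))  # fp32 -> int32


small = rem_via_fp32(7, 3)              # operands fit in the 24-bit mantissa
large = rem_via_fp32(2**24 + 1, 2)      # 16777217 is not representable in fp32
exact = (2**24 + 1) % 2                 # true integer modulo
```

In this model `small` agrees with integer modulo, while `large` differs from `exact` because 16777217 rounds to 16777216 on the way into fp32. An isolation ST for the pto-isa fix would likely want operands on both sides of the 2**24 boundary.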
xor / xors — a2a3 TXOR/TXORS are TOR+TAND+TNOT composites whose static_assert restricts element type to 16-/8-bit ints; int32 (default int/torch.int32) trips the assert at PTOAS. PyPTO codegen is dtype-agnostic passthrough. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TXor.hpp:28-30 (int8/uint8/int16/uint16 only — int8/uint8 ARE allowed), TXor.hpp:49-60, TBitwiseSOp.hpp:121-135; pto_ops_common.cpp:2254/:2282; tests/st/runtime/ops/test_bitwise.py:21. Fix (immediate, no code): re-enable a2a3 ST restricted to int16/uint16 (mirror the shipped not_ pattern). If int32 xor is actually required, file pto-isa to widen the composites. Optionally tighten PyPTO deducers so int32 xor fails at compile with a clear message.
and_ / or_ (+ scalar ands/ors) — int32 specifically is a2a3-unsupported: TAND/TOR accept only 1-/2-byte integral types (32-bit is A5-only), so an int32 tile yields the cryptic PTOAS pto.tand invalid kind of type. The ops themselves ARE supported on a2a3. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TAnd.hpp:75, TOr.hpp:65-67, tand.md Target-Profile Restrictions (A2A3 = 1-/2-byte; A5 adds 4-byte); KNOWN_ISSUES.md:216-222. Fix: re-test with INT16/UINT16 and un-defer; add a PyPTO DSL/IR-level dtype check rejecting 32-bit and_/or_ on a2a3 with an actionable message. (See Verification notes — shl/shr split out below.)
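The dtype check proposed for xor and and_/or_ could look roughly like the sketch below. Everything here is hypothetical: `check_bitwise_dtype`, the op-name set, and the `target` string are illustrative and do not correspond to existing PyPTO deducer APIs. Only the a2a3 restriction (1-/2-byte integers) is taken from the evidence above; other targets are deliberately left unchecked.

```python
# Hypothetical guard; not an existing PyPTO API.
A2A3_BITWISE_BITS = {8, 16}  # a2a3 TAND/TOR/TXOR: 1-/2-byte integral types only
BITWISE_OPS = {"and_", "or_", "ands", "ors", "xor", "xors"}


def check_bitwise_dtype(op: str, bits: int, target: str) -> None:
    """Raise early, with an actionable message, instead of failing in PTOAS."""
    if target != "a2a3" or op not in BITWISE_OPS:
        return  # assumption: only the a2a3 restriction is modelled here
    if bits not in A2A3_BITWISE_BITS:
        raise TypeError(
            f"{op}: {bits}-bit integer tiles are not supported on a2a3 "
            f"(8-/16-bit only); cast to int16/uint16 or target a5"
        )


def is_rejected(op: str, bits: int, target: str) -> bool:
    """Small helper for exercising the guard."""
    try:
        check_bitwise_dtype(op, bits, target)
    except TypeError:
        return True
    return False
```

With this shape, `is_rejected("and_", 32, "a2a3")` is true while the same op with 16-bit elements passes, which replaces the cryptic "invalid kind of type" assembler error with a message that names the fix.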
sin / cos at width-128 / cols ≥ 64 — on-device-reproduced: [8,128] FP32 sin/cos correct for cols 0–63, garbage for 64–127 (64/1024 wrong), while sqrt at the same shape passes. sin/cos are software-decomposed (LowerSinCos) into ~30 primitives including FP32↔INT32 tcvt round-trips and 2-source binary ops — the only ops exercising those at width 128. Readable pto-isa headers handle 128 cols correctly (TCvt.hpp:1166-1176/:1207-1208 → 2 repeats; TBinOp.hpp contiguous flatten), so the residual defect is most plausibly an assembler/intrinsic-level miscompile of the tcvt round-trip or 2-source binary on the second 64-element segment, below the C++ template layer. Evidence: src/ir/transforms/lower_composite_ops_pass.cpp:317-373/:647-648; pto_ops_common.cpp:2261-2270 (no sin/cos mapping, sqrt single pto.tsqrt); KNOWN_ISSUES.md:278-284; tests/st/runtime/ops/test_unary_math.py:49-58. Fix: do NOT change PyPTO. Localize on device with an isolation ST running only (a) FP32→INT32→FP32 cast and (b) a single 2-source binary at [8,128], comparing cols 0–63 vs 64–127; file pto-isa against the offending intrinsic (mirrors the second-segment-tail pattern of issues #173 / #1790). Keep the [8,128] case skipped meanwhile.
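For the isolation ST, the comparison of cols 0–63 against cols 64–127 can be reduced to a per-segment mismatch count. The helper below is a sketch under assumptions: `mismatches_per_segment` is a hypothetical name, the tolerances are placeholders, and the "device" output is simulated on the host by corrupting the second segment.

```python
import math


def mismatches_per_segment(actual, expected, seg_cols=64, rtol=1e-5, atol=1e-6):
    """Count mismatching elements in each seg_cols-wide column segment."""
    n_cols = len(expected[0])
    counts = [0] * math.ceil(n_cols / seg_cols)
    for row_a, row_e in zip(actual, expected):
        for col, (a, e) in enumerate(zip(row_a, row_e)):
            # math.isclose returns False for NaN, so NaN counts as a mismatch
            if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol):
                counts[col // seg_cols] += 1
    return counts


# Simulated [8,128] result: cols 0-63 correct, cols 64-127 replaced by NaN.
expected = [[math.sin(0.01 * c) for c in range(128)] for _ in range(8)]
actual = [row[:64] + [float("nan")] * 64 for row in expected]
counts = mismatches_per_segment(actual, expected)  # -> [0, 512]
```

Running the same helper on the outputs of the cast-only and binary-only kernels should show which primitive first produces a non-zero count in the second segment.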
PyPTO codegen bugs
prelu — codegen mapping-table arity is wrong (2, should be 3); the DSL/IR 3-arg shape (tile, slope, tmp) is correct and matches the 4-operand TPRELU_IMPL(dst,src0,src1,tmp) ISA contract (tmp is mandatory scratch). MakeNaryCodegenPTO's CHECK(args.size()==arity) rejects the 3-arg call at compile time. Evidence: root cause at src/backend/common/pto_ops_common.cpp:2260 {tile.prelu, pto.tprelu, 2}; arity check :437; build_output/_deps/pto-isa/include/pto/npu/a2a3/TPrelu.hpp:42-62/:38-39; DSL python/pypto/language/op/tile_ops.py:1894/:1907; IR src/ir/op/tile_ops/elementwise.cpp:682-695. Fix: change :2260 to arity 3, then re-enable and verify a2a3 prelu ST. One-line change.
subc / subsc — wrong-doc (NOT a miscompute) — hardware and PyPTO codegen are both correct (a - b + c); the defect is PyPTO docs/op-descriptions advertising lhs - rhs - rhs2, so a user gets a-b+c silently. Evidence: build_output/_deps/pto-isa/include/pto/cpu/ElementOp.h:399-403/:605-608 (a-b+c); tsubc.md:15/:61, tsubsc.md:15; faithful codegen pto_ops_common.cpp:2273/:2290; wrong docs at python/pypto/language/op/tile_ops.py:1946/:1981 and src/ir/op/tile_ops/elementwise.cpp:714/:743. Fix: correct the four docstrings/descriptions to lhs - rhs + rhs2 / lhs - scalar + rhs2. No codegen change. If a-b-c is genuinely intended, that's a separate feature (no a-b-c primitive in pto-isa) and must be raised with the user.
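The mismatch between the two formulas is small enough to show directly. This is a plain-Python sketch of the two references, with hypothetical function names; it does not use the PyPTO DSL.

```python
def subc_ref(a, b, c):
    """Reference matching the pto-isa behaviour cited above: a - b + c."""
    return [x - y + z for x, y, z in zip(a, b, c)]


def subc_naive(a, b, c):
    """What the current docstrings advertise: a - b - c."""
    return [x - y - z for x, y, z in zip(a, b, c)]


a, b, c = [5.0, 1.0], [2.0, 4.0], [1.0, 0.5]
device_like = subc_ref(a, b, c)    # [4.0, -2.5]
doc_like = subc_naive(a, b, c)     # [2.0, -3.5]
```

The two results differ by 2*c per element, which is consistent with an ST that fails against a naive a-b-c reference even though the device output is correct.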
expands (tile.expands) — orphan duplicate op: IR op + DSL alias exist and tensor.expands→tile.expands routes to it, but it has NO f_codegen, so it crashes codegen. TEXPANDS is only reachable via the separate tile.full (pl.full) path. Evidence: src/ir/op/tile_ops/broadcast.cpp:412; reachable via tile_ops.py:1402-1414 and op_conversion_registry.cpp:177; working path pto_ops_common.cpp:3429-3435; no reg in pto_ops_common.cpp. Fix: register an f_codegen emitting pto.texpands (mirror MakeFullCodegenPTO), or preferably retire tile.expands and route pl.expands/tensor.expands to the existing tile.full path.
sum (tile.sum) — has axis/keepdim but is never lowered to tile.row_sum/tile.col_sum and has NO f_codegen → crashes codegen (intentionally used as a guaranteed-failure case in a codegen UT). The working reduction path is pl.row_sum/pl.col_sum → pto.trowsum/pto.tcolsum. Evidence: src/ir/op/tile_ops/reduction.cpp:209; flatten_tile_nd_to_2d_pass.cpp:2020-2050 (only remaps axis); tests/ut/codegen/test_pto_codegen.py:1548-1574. Fix: add a lowering rewriting tile.sum(axis) → row_sum (last axis) / col_sum (axis 0) with the required tmp, or retire tile.sum and steer users to pl.row_sum/pl.col_sum.
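The axis dispatch for the proposed tile.sum lowering might be sketched as follows. This is only an illustration in Python of the decision rule stated above (last axis to row_sum, axis 0 to col_sum); `lower_tile_sum` is a hypothetical name, the real pass would be written in C++, and the required tmp operand is not modelled.

```python
def lower_tile_sum(axis: int, ndim: int = 2) -> str:
    """Pick the reduction op that tile.sum(axis) would be rewritten to."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} out of range for a {ndim}-D tile")
    axis %= ndim                      # normalize negative axes
    if axis == ndim - 1:
        return "tile.row_sum"         # reduce along the last axis
    if axis == 0:
        return "tile.col_sum"         # reduce along axis 0
    raise NotImplementedError(f"no direct lowering for axis {axis} of {ndim}-D tile")
```

For the 2-D tiles left after flattening, every valid axis maps to one of the two existing ops, so the NotImplementedError branch only matters if the lowering runs before flatten_tile_nd_to_2d.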
Needs design (operand-shape / valid-shape decision)
gemv / gemv_bias — IN PROGRESS (see below) — 1-row lhs must be loaded to the Left (L0A) cube tile as Rows==1 + row_major (none_box L1) with K aligned to 512/sizeof (K % 128 == 0 for fp32, K % 256 == 0 for bf16) so a2a3 TExtractToLeft/TMovToLeft dispatch to the TExtractToAVector 1-row path. No ISA feature missing. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/textract_common.hpp:470-479 (Rows==1 && isRowMajor → vector path), constraint on COLS at :99-118 (srcCol/dstCol % (512/sizeof)), tmov_common.hpp:17-20, tmatmul_kernel.cpp:347-353 (RunTGEMV reference), constants.hpp:33-34; PyPTO pto_type_utils.cpp:129-137, pto_ops_common.cpp:2323-2325, matmul.cpp:308-350. Fix: drive K alignment and row_major/none_box lhs load layout from the DSL/load (user controls K alignment via shapes/valid_shapes per the perf-sizes rule). Being actively addressed in this worktree (test_gemv.py loads [1,k] lhs to Mat with K aligned to 128/256) — track as in-progress, not just deferred.
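The K-alignment rule works out as below. This is a small arithmetic sketch; `k_alignment` and `align_up_k` are hypothetical helper names, and the 512-byte figure is taken from the "512/sizeof" constraint quoted above.

```python
ALIGN_BYTES = 512  # from the srcCol/dstCol % (512/sizeof) constraint


def k_alignment(elem_bytes: int) -> int:
    """Number of elements K must be a multiple of for a given element size."""
    return ALIGN_BYTES // elem_bytes


def align_up_k(k: int, elem_bytes: int) -> int:
    """Smallest physical K >= k that satisfies the alignment."""
    a = k_alignment(elem_bytes)
    return -(-k // a) * a  # ceiling division, then scale back up


fp32_align = k_alignment(4)        # 128
bf16_align = k_alignment(2)        # 256
padded_fp32 = align_up_k(100, 4)   # 128
padded_bf16 = align_up_k(257, 2)   # 512
```

A test such as test_gemv.py would then use the padded value as the physical K and carry the logical K in valid_shape.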
sels (tile.sels) — IR op reachable from pl.sels but NO f_codegen, and its 3-arg signature (lhs,rhs,select_mode) does not match the real a2a3 TSELS (vsel-style: dst,mask,src,tmp,scalar). Evidence: src/ir/op/tile_ops/elementwise.cpp:872; build_output/_deps/pto-isa/include/pto/npu/a2a3/TSels.hpp:20-26/:52-56; no reg in pto_ops_common.cpp. Fix: correct REGISTER_OP to the real TSELS shape (mirror tile.sel/TSEL) then add an f_codegen emitting pto.tsels; or retire the orphan op + DSL alias if redundant.
matmul_bias narrow-N — the matmul_bias output (L0C Acc) valid_shape is always the full physical [M,N] (operand narrowing never propagated), so tile.store copies the full [M,N]; but mad writes only [validRow x validCol] and never clears L0C tail columns, so [:, VALID_N:] ships stale data. narrow-M/K pass (whole-row zero / no tail); narrow-N fails (stale within partially-written Nz blocks). Evidence: src/ir/op/tile_ops/matmul.cpp:232/:255-257; build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:192-199; CPU ref pto/cpu/TMatmul.hpp:23-41/:196-214; tstore_common.hpp:163-204; pto_ops_common.cpp:1410; tests/st/runtime/ops/test_gemv.py:179-186. Fix (primary, pypto-codegen): propagate operand valid_shapes into DeduceTileMatMul*Type (output valid_row from lhs, valid_col from rhs) so tile.store copies only [M, VALID_N] and GM [:, VALID_N:] keeps its pre-initialized zero. Alternative: clear the Acc tail columns in cube codegen. (Spans both layers; the pto-isa no-tail-zero behavior is a hardware reality to work around, not a fix target.)
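The stale-tail mechanism can be modelled on the host in a few lines. This is a toy simulation, not the cube or tstore implementation: `mad`, `store`, and the `STALE` sentinel are hypothetical stand-ins for "write only the valid columns" and "copy valid_shape columns to GM".

```python
M, N, VALID_N = 2, 4, 3
STALE = -1.0  # stands in for whatever a previous kernel left in L0C


def mad(acc, valid_n, value=1.0):
    """Write only the valid columns; tail columns keep their old contents."""
    for row in acc:
        for col in range(valid_n):
            row[col] = value


def store(acc, gm, valid_n):
    """Copy the first valid_n columns of each row from Acc to GM."""
    for r, row in enumerate(acc):
        gm[r][:valid_n] = row[:valid_n]


acc = [[STALE] * N for _ in range(M)]
mad(acc, VALID_N)

gm_full = [[0.0] * N for _ in range(M)]
store(acc, gm_full, N)            # output valid_shape == full physical [M, N]

gm_narrow = [[0.0] * N for _ in range(M)]
store(acc, gm_narrow, VALID_N)    # output valid_shape propagated from rhs
```

In this model `gm_full` ends up with the stale sentinel in column 3 while `gm_narrow` keeps the pre-initialized zero there, which is the behaviour the primary fix aims for.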
Verification notes
Several originally reported symptoms were refuted or corrected by the audit:
gemv_acc — reported as a "missing acc→acc pto.tmov" in PyPTO. Refuted: PyPTO handles acc accumulation correctly via Acc-address aliasing (make_acc_codegen, same lambda as the working matmul_acc); the real gap is the pto-isa TGEMV_ACC TODO / missing 3-tile shared-acc overload.
subc / subsc — reported as pto.tsubc miscomputing (a-b-c). Refuted: hardware computes a-b+c and PyPTO codegen is faithful; the defect is a PyPTO wrong-doc that advertises a-b-c. A naive a-b-c numpy reference is why ST mismatched, not a device miscompute.
gemv 1-row — reported as hitting the TExtractToA srcRow/dstRow % 16 == 0 static_assert. Corrected: the 1-row + row_major case explicitly skips TExtractToA (textract_common.hpp:470-479) and goes to TExtractToAVector, whose real constraint is on K (cols) % (512/sizeof), not rows. Symptom cited the wrong static_assert.
xor / xors — reported as "only int16/uint16". Imprecise: the static_assert also permits int8/uint8; only int32/int64 are unsupported.
and_ / or_ / shl / shr — reported as "PTOAS rejects pto.tand/tor/tshl/tshr as unsupported on a2a3". Overbroad: the ops ARE supported; the real trigger is the int32 element type on tand/tor (a2a3 is 8/16-bit only, 32-bit is A5-only). Furthermore shl / shr are likely over-attributed — TShl.hpp:42-45 / TShr.hpp:42-45 allow 8/16/32-bit signed+unsigned, so i32 shifts should assemble at the ISA-impl level. The i32 error was observed for tand and assumed family-wide. Action: re-test shl/shr with int32 on device; if PTOAS still rejects, that is a separate real pto-isa/PTOAS verifier-vs-impl gap to file with an i32-shift repro — confirm on device before attributing. Correct the KNOWN_ISSUES/PR wording from "op unsupported" to "i32 element type unsupported for tand/tor on a2a3 (8/16-bit only)".
prelu — reported as a DSL/pto.tprelu signature mismatch. Corrected: the DSL/IR 3-arg shape is correct and matches the ISA; the actual bug is the codegen mapping-table arity (2 instead of 3).
matmul_bias narrow-N — reported simply as "the cube does not zero [:, VALID_N:]". Refined: the precise cause is a PyPTO valid_shape/value mismatch (output valid_shape hardcoded to full physical [M,N]), with the cube's no-tail-zero being the underlying hardware reality.
A temp-operand codegen fix for trem/trems/txor/txors was already landed in PR #1823; only not_ and matmul_bias ST were retained on a2a3 in that PR.