
Track: a2a3 tile-op ST gaps deferred from PR #1823 (gemv family, bitwise int32, prelu, rem, sin/cos width-128, matmul_bias narrow-N) #1846

Description

@Little-oil

Auto-audited tracking issue: each deferred op was re-verified against the freshly-cloned a2a3 pto-isa headers and PyPTO codegen; reported symptoms were corrected where the audit disagreed (see Verification notes).

These tile operators were excluded from a2a3 on-board ST coverage in PR #1823 due to a2a3 ISA / codegen gaps. This issue tracks them so they are not forgotten. Each item below records the audited root cause (which can differ from the originally reported symptom — see Verification notes), the load-bearing evidence, and the fix direction.

pto-isa limitations (a2a3 ISA / PTOAS, file against pto-isa)

  • gemv_acc — a2a3 lacks the 3-tile shared-accumulator overload of TGEMV_ACC that TMATMUL_ACC has; in-place cOut==cIn accumulation can't be assembled. PyPTO codegen is correct (Acc aliasing via make_acc_codegen, not a tmov). Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:120 (only 4-tile TGEMV_ACC_IMPL, no 3-tile overload like :180), pto_instr.hpp:800/:808, README.md:81 (TGEMV_ACC = TODO vs :94 TMATMUL_ACC = Yes). Fix: add the 3-tile shared-acc overload + PTOAS support in pto-isa, flip README TODO→Yes; PyPTO just un-defers ST. No PyPTO code change.

  • rem / rems — a2a3 RemInt32Instr emulates int32 modulo via int32→fp32→int32 round-trip, clobbering src0/src1 in place and losing precision past fp32's 24-bit mantissa; rems shares the same precision-loss class plus a tmp-row mismatch. PyPTO codegen is correct (kSimpleOps + tmp operand from PR #1823). Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TRem.hpp:67-85 (round-trip, in-place input clobber at :82-83), :147-148 (static_assert advertises int32), a5's true vmod at TRem.hpp:72; TRemS.hpp:67-82; tmp-row asymmetry TRemS.hpp:135 (≥1) vs TRem.hpp:161 (≥2). Fix: in pto-isa, stage the round-trip through a scratch tmp (stop clobbering inputs) and route int32 to a true integer-modulo path like a5. Keep out of a2a3 ST until landed.

  • xor / xors — a2a3 TXOR/TXORS are TOR+TAND+TNOT composites whose static_assert restricts element type to 16-/8-bit ints; int32 (default int/torch.int32) trips the assert at PTOAS. PyPTO codegen is dtype-agnostic passthrough. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TXor.hpp:28-30 (int8/uint8/int16/uint16 only — int8/uint8 ARE allowed), TXor.hpp:49-60, TBitwiseSOp.hpp:121-135; pto_ops_common.cpp:2254/:2282; tests/st/runtime/ops/test_bitwise.py:21. Fix (immediate, no code): re-enable a2a3 ST restricted to int16/uint16 (mirror the shipped not_ pattern). If int32 xor is actually required, file pto-isa to widen the composites. Optionally tighten PyPTO deducers so int32 xor fails at compile with a clear message.

  • and_ / or_ (+ scalar ands/ors) — int32 specifically is a2a3-unsupported: TAND/TOR accept only 1-/2-byte integral types (32-bit is A5-only), so an int32 tile yields the cryptic PTOAS pto.tand invalid kind of type. The ops themselves ARE supported on a2a3. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/TAnd.hpp:75, TOr.hpp:65-67, tand.md Target-Profile Restrictions (A2A3 = 1-/2-byte; A5 adds 4-byte); KNOWN_ISSUES.md:216-222. Fix: re-test with INT16/UINT16 and un-defer; add a PyPTO DSL/IR-level dtype check rejecting 32-bit and_/or_ on a2a3 with an actionable message. (See Verification notes — shl/shr split out below.)

  • sin / cos at width-128 / cols ≥ 64 — on-device-reproduced: [8,128] FP32 sin/cos correct for cols 0–63, garbage for 64–127 (64/1024 wrong), while sqrt at the same shape passes. sin/cos are software-decomposed (LowerSinCos) into ~30 primitives including FP32↔INT32 tcvt round-trips and 2-source binary ops — the only ops exercising those at width 128. Readable pto-isa headers handle 128 cols correctly (TCvt.hpp:1166-1176/:1207-1208 → 2 repeats; TBinOp.hpp contiguous flatten), so the residual defect is most plausibly an assembler/intrinsic-level miscompile of the tcvt round-trip or 2-source binary on the second 64-element segment, below the C++ template layer. Evidence: src/ir/transforms/lower_composite_ops_pass.cpp:317-373/:647-648; pto_ops_common.cpp:2261-2270 (no sin/cos mapping, sqrt single pto.tsqrt); KNOWN_ISSUES.md:278-284; tests/st/runtime/ops/test_unary_math.py:49-58. Fix: do NOT change PyPTO. Localize on device with an isolation ST running only (a) FP32→INT32→FP32 cast and (b) a single 2-source binary at [8,128], comparing cols 0–63 vs 64–127; file pto-isa against the offending intrinsic (mirrors the second-segment-tail pattern of issues #173 / #1790). Keep the [8,128] case skipped meanwhile.
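For the rem / rems item above, the precision-loss class can be illustrated off-device. The sketch below is a simplified model only, assuming the emulation amounts to "round both operands to fp32, take the remainder, convert back"; it does not reproduce the actual RemInt32Instr sequence in TRem.hpp, and the helper names (to_fp32, rem_via_fp32) are hypothetical.

```python
import math
import struct


def to_fp32(x: int) -> float:
    """Round an integer to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def rem_via_fp32(a: int, b: int) -> int:
    """Hypothetical model of an int32 -> fp32 -> int32 remainder (non-negative inputs)."""
    return int(math.fmod(to_fp32(a), to_fp32(b)))


# Operands that fit in fp32's 24-bit significand survive the round-trip.
small_ok = rem_via_fp32(1000, 7) == 1000 % 7

# 2**24 + 1 is not representable in fp32; it rounds to 2**24 before the remainder.
a, b = 2**24 + 1, 10
emulated = rem_via_fp32(a, b)  # remainder of 16777216 by 10
exact = a % b                  # remainder of 16777217 by 10
```

Any a2a3 ST reference for int32 rem would need inputs above 2**24 to expose this class of error; small-magnitude test data passes under the same model.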

PyPTO codegen bugs

  • prelu — codegen mapping-table arity is wrong (2, should be 3); the DSL/IR 3-arg shape (tile, slope, tmp) is correct and matches the 4-operand TPRELU_IMPL(dst,src0,src1,tmp) ISA contract (tmp is mandatory scratch). MakeNaryCodegenPTO's CHECK(args.size()==arity) rejects the 3-arg call at compile time. Evidence: root cause at src/backend/common/pto_ops_common.cpp:2260 {tile.prelu, pto.tprelu, 2}; arity check :437; build_output/_deps/pto-isa/include/pto/npu/a2a3/TPrelu.hpp:42-62/:38-39; DSL python/pypto/language/op/tile_ops.py:1894/:1907; IR src/ir/op/tile_ops/elementwise.cpp:682-695. Fix: change :2260 to arity 3, then re-enable and verify a2a3 prelu ST. One-line change.

  • subc / subsc — wrong-doc (NOT a miscompute) — hardware and PyPTO codegen are both correct (a - b + c); the defect is PyPTO docs/op-descriptions advertising lhs - rhs - rhs2, so a user gets a-b+c silently. Evidence: build_output/_deps/pto-isa/include/pto/cpu/ElementOp.h:399-403/:605-608 (a-b+c); tsubc.md:15/:61, tsubsc.md:15; faithful codegen pto_ops_common.cpp:2273/:2290; wrong docs at python/pypto/language/op/tile_ops.py:1946/:1981 and src/ir/op/tile_ops/elementwise.cpp:714/:743. Fix: correct the four docstrings/descriptions to lhs - rhs + rhs2 / lhs - scalar + rhs2. No codegen change. If a-b-c is genuinely intended, that's a separate feature (no a-b-c primitive in pto-isa) and must be raised with the user.

  • expands (tile.expands) — orphan duplicate op: IR op + DSL alias exist and tensor.expands→tile.expands routes to it, but it has NO f_codegen, so it crashes codegen. TEXPANDS is only reachable via the separate tile.full (pl.full) path. Evidence: src/ir/op/tile_ops/broadcast.cpp:412; reachable via tile_ops.py:1402-1414 and op_conversion_registry.cpp:177; working path pto_ops_common.cpp:3429-3435; no reg in pto_ops_common.cpp. Fix: register an f_codegen emitting pto.texpands (mirror MakeFullCodegenPTO), or preferably retire tile.expands and route pl.expands/tensor.expands to the existing tile.full path.

  • sum (tile.sum) — has axis/keepdim but is never lowered to tile.row_sum/tile.col_sum and has NO f_codegen → crashes codegen (intentionally used as a guaranteed-failure case in a codegen UT). The working reduction path is pl.row_sum/pl.col_sum → pto.trowsum/pto.tcolsum. Evidence: src/ir/op/tile_ops/reduction.cpp:209; flatten_tile_nd_to_2d_pass.cpp:2020-2050 (only remaps axis); tests/ut/codegen/test_pto_codegen.py:1548-1574. Fix: add a lowering rewriting tile.sum(axis) → row_sum (last axis) / col_sum (axis 0) with the required tmp, or retire tile.sum and steer users to pl.row_sum/pl.col_sum.
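The prelu arity failure above can be modeled in a few lines. This is a toy stand-in, not the PyPTO implementation: OP_TABLE_BUGGY, OP_TABLE_FIXED, and emit are hypothetical names, and the emitted string format is purely illustrative. Only the op names and the arity values (2 today, 3 after the fix) come from the issue text.

```python
# Hypothetical stand-ins for the codegen mapping table entry for tile.prelu.
OP_TABLE_BUGGY = {"tile.prelu": ("pto.tprelu", 2)}  # arity as reported at :2260
OP_TABLE_FIXED = {"tile.prelu": ("pto.tprelu", 3)}  # arity after the one-line fix


def emit(op_table, op_name, args):
    """Toy n-ary codegen: reject calls whose arg count differs from the table arity."""
    pto_name, arity = op_table[op_name]
    if len(args) != arity:
        raise ValueError(f"{op_name}: expected {arity} args, got {len(args)}")
    return f"{pto_name}({', '.join(args)})"


call_args = ["tile", "slope", "tmp"]  # the DSL/IR 3-arg shape

try:
    emit(OP_TABLE_BUGGY, "tile.prelu", call_args)
    buggy_rejected = False
except ValueError:
    buggy_rejected = True

fixed_output = emit(OP_TABLE_FIXED, "tile.prelu", call_args)
```

Under this model the DSL call is well-formed and only the table entry is wrong, which matches the audited root cause.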

Needs design (operand-shape / valid-shape decision)

  • gemv / gemv_bias — IN PROGRESS (see below) — 1-row lhs must be loaded to the Left (L0A) cube tile as Rows==1 + row_major (none_box L1) with K aligned to 512/sizeof (K%128 fp32, K%256 bf16) so a2a3 TExtractToLeft/TMovToLeft dispatch to the TExtractToAVector 1-row path. No ISA feature missing. Evidence: build_output/_deps/pto-isa/include/pto/npu/a2a3/textract_common.hpp:470-479 (Rows==1 && isRowMajor → vector path), constraint on COLS at :99-118 (srcCol/dstCol % (512/sizeof)), tmov_common.hpp:17-20, tmatmul_kernel.cpp:347-353 (RunTGEMV reference), constants.hpp:33-34; PyPTO pto_type_utils.cpp:129-137, pto_ops_common.cpp:2323-2325, matmul.cpp:308-350. Fix: drive K alignment and row_major/none_box lhs load layout from the DSL/load (user controls K alignment via shapes/valid_shapes per the perf-sizes rule). Being actively addressed in this worktree (test_gemv.py loads [1,k] lhs to Mat with K aligned to 128/256) — track as in-progress, not just deferred.

  • sels (tile.sels) — IR op reachable from pl.sels but NO f_codegen, and its 3-arg signature (lhs,rhs,select_mode) does not match the real a2a3 TSELS (vsel-style: dst,mask,src,tmp,scalar). Evidence: src/ir/op/tile_ops/elementwise.cpp:872; build_output/_deps/pto-isa/include/pto/npu/a2a3/TSels.hpp:20-26/:52-56; no reg in pto_ops_common.cpp. Fix: correct REGISTER_OP to the real TSELS shape (mirror tile.sel/TSEL) then add an f_codegen emitting pto.tsels; or retire the orphan op + DSL alias if redundant.

  • matmul_bias narrow-N — the matmul_bias output (L0C Acc) valid_shape is always the full physical [M,N] (operand narrowing never propagated), so tile.store copies the full [M,N]; but mad writes only [validRow x validCol] and never clears L0C tail columns, so [:, VALID_N:] ships stale data. narrow-M/K pass (whole-row zero / no tail); narrow-N fails (stale within partially-written Nz blocks). Evidence: src/ir/op/tile_ops/matmul.cpp:232/:255-257; build_output/_deps/pto-isa/include/pto/npu/a2a3/TMatmul.hpp:192-199; CPU ref pto/cpu/TMatmul.hpp:23-41/:196-214; tstore_common.hpp:163-204; pto_ops_common.cpp:1410; tests/st/runtime/ops/test_gemv.py:179-186. Fix (primary, pypto-codegen): propagate operand valid_shapes into DeduceTileMatMul*Type (output valid_row from lhs, valid_col from rhs) so tile.store copies only [M, VALID_N] and GM [:, VALID_N:] keeps its pre-initialized zero. Alternative: clear the Acc tail columns in cube codegen. (Spans both layers; the pto-isa no-tail-zero behavior is a hardware reality to work around, not a fix target.)
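The narrow-N failure mode can be reproduced with a small host-side simulation. This is a sketch under stated assumptions: plain Python lists stand in for the L0C Acc tile and GM, STALE marks leftover accumulator contents, and mad, store, and run are hypothetical helpers, not PyPTO or pto-isa APIs.

```python
M, N, VALID_N = 2, 4, 3
STALE = -1.0  # marker for whatever a previous kernel left in L0C


def mad(acc, valid_rows, valid_cols, value=1.0):
    """Write only the [valid_rows x valid_cols] window; the tail is never cleared."""
    for r in range(valid_rows):
        for c in range(valid_cols):
            acc[r][c] = value


def store(gm, acc, rows, cols):
    """Copy acc[:rows, :cols] into gm, leaving the rest of gm untouched."""
    for r in range(rows):
        for c in range(cols):
            gm[r][c] = acc[r][c]


def run(store_cols):
    acc = [[STALE] * N for _ in range(M)]  # L0C starts with stale data
    gm = [[0.0] * N for _ in range(M)]     # GM is pre-initialized to zero
    mad(acc, M, VALID_N)
    store(gm, acc, M, store_cols)
    return gm


full_store = run(N)          # output valid_shape is the full physical [M, N]
narrow_store = run(VALID_N)  # valid_col propagated from rhs
```

With the full-width store, column VALID_N of every row carries the stale marker; with the narrowed store it keeps the pre-initialized zero, which is the behavior the primary fix aims for.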

Verification notes

Several originally reported symptoms were refuted or corrected by the audit:

  • gemv_acc — reported as a "missing acc→acc pto.tmov" in PyPTO. Refuted: PyPTO handles acc accumulation correctly via Acc-address aliasing (make_acc_codegen, same lambda as the working matmul_acc); the real gap is the pto-isa TGEMV_ACC TODO / missing 3-tile shared-acc overload.
  • subc / subsc — reported as pto.tsubc miscomputing (a-b-c). Refuted: hardware computes a-b+c and PyPTO codegen is faithful; the defect is a PyPTO wrong-doc that advertises a-b-c. A naive a-b-c numpy reference is why ST mismatched, not a device miscompute.
  • gemv 1-row — reported as hitting the TExtractToA srcRow/dstRow % 16 == 0 static_assert. Corrected: the 1-row + row_major case explicitly skips TExtractToA (textract_common.hpp:470-479) and goes to TExtractToAVector, whose real constraint is on K (cols) % (512/sizeof), not rows. Symptom cited the wrong static_assert.
  • xor / xors — reported as "only int16/uint16". Imprecise: the static_assert also permits int8/uint8; only int32/int64 are unsupported.
  • and_ / or_ / shl / shr — reported as "PTOAS rejects pto.tand/tor/tshl/tshr as unsupported on a2a3". Overbroad: the ops ARE supported; the real trigger is the int32 element type on tand/tor (a2a3 is 8/16-bit only, 32-bit is A5-only). Furthermore shl / shr are likely over-attributed: TShl.hpp:42-45 / TShr.hpp:42-45 allow 8/16/32-bit signed+unsigned, so i32 shifts should assemble at the ISA-impl level. The i32 error was observed for tand and assumed family-wide. Action: re-test shl/shr with int32 on device; if PTOAS still rejects, that is a separate real pto-isa/PTOAS verifier-vs-impl gap to file with an i32-shift repro — confirm on device before attributing. Correct the KNOWN_ISSUES/PR wording from "op unsupported" to "i32 element type unsupported for tand/tor on a2a3 (8/16-bit only)".
  • prelu — reported as a DSL/pto.tprelu signature mismatch. Corrected: the DSL/IR 3-arg shape is correct and matches the ISA; the actual bug is the codegen mapping-table arity (2 instead of 3).
  • matmul_bias narrow-N — reported simply as "the cube does not zero [:, VALID_N:]". Refined: the precise cause is a PyPTO valid_shape/value mismatch (output valid_shape hardcoded to full physical [M,N]), with the cube's no-tail-zero being the underlying hardware reality.
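The subc / subsc note above comes down to which reference the ST compares against. A minimal sketch with plain lists, where subc_device follows the audited a-b+c semantics and subc_naive is the a-b-c reference implied by the wrong docs (both function names are hypothetical):

```python
def subc_device(a, b, c):
    """Audited semantics: lhs - rhs + rhs2, elementwise."""
    return [x - y + z for x, y, z in zip(a, b, c)]


def subc_naive(a, b, c):
    """What the incorrect docs advertise: lhs - rhs - rhs2."""
    return [x - y - z for x, y, z in zip(a, b, c)]


a, b, c = [5, 8], [2, 3], [1, 0]
device = subc_device(a, b, c)
naive = subc_naive(a, b, c)

# The two references agree only where rhs2 is zero.
mismatched = [i for i, (d, n) in enumerate(zip(device, naive)) if d != n]
```

An ST whose third operand is all zeros would pass against either reference, so corrected tests should use nonzero rhs2 values.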

A temp-operand codegen fix for trem/trems/txor/txors was already landed in PR #1823; only not_ and matmul_bias ST were retained on a2a3 in that PR.
