intrinsics: align spec with the GCC implementation surface by ptomsich · Pull Request #41 · riscv/integrated-matrix-extension

ptomsich · 2026-05-03T19:04:44Z

When implementing the intrinsics and porting tests/examples to them, we had to implement a slightly larger API surface for the intrinsics than in the original specification. This PR aligns the specification with the end-to-end tested GCC implementation.

…me_lambda geometry queries

Add a normative subsection for the two geometry-query intrinsics that GCC and Clang already implement to enable runtime VLEN/lambda detection:

    size_t __riscv_ime_vlen (void);
    size_t __riscv_ime_lambda (void);

Both fold to compile-time constants when VLEN is statically known (-mrvv-vector-bits=zvl) and otherwise emit a small runtime sequence (csrr vlenb + shift, or csrr vlenb + ctz + shift respectively).

These intrinsics are the supported way for software to discover the implementation's tile geometry without parsing CSR fields directly, and are the building blocks for the runtime-dispatch pattern described in the existing VLEN-portable code subsection.

A note clarifies that __riscv_ime_lambda returns a single representative value; software that needs to enumerate the WARL set must still use vsetvl write-readback.
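As a rough illustration of the runtime-dispatch pattern mentioned in this commit message, the sketch below selects a kernel by VLEN. It is not taken from the spec or from GCC: model_ime_vlen is a hypothetical host-side stand-in for __riscv_ime_vlen, modelling only the "csrr vlenb + shift" path (vlenb holds VLEN in bytes, so VLEN in bits is vlenb << 3), and the kernel identifiers are invented for the example. The lambda query is left out because its exact computation is not given here.

```c
#include <stddef.h>

/* Hypothetical stand-in for __riscv_ime_vlen() so the sketch builds on any
 * host. vlenb is VLEN in bytes, so VLEN in bits is vlenb << 3. */
size_t model_ime_vlen(size_t vlenb)
{
    return vlenb << 3;
}

/* Hypothetical kernel identifiers; a real program would more likely keep
 * a table of function pointers. */
enum kernel_id {
    KERNEL_GENERIC,
    KERNEL_VLEN128,
    KERNEL_VLEN256,
    KERNEL_VLEN512
};

/* Pick a kernel specialized for the detected VLEN, else fall back to a
 * generic (VLEN-portable) one. */
enum kernel_id select_kernel(size_t vlen_bits)
{
    switch (vlen_bits) {
    case 128: return KERNEL_VLEN128;
    case 256: return KERNEL_VLEN256;
    case 512: return KERNEL_VLEN512;
    default:  return KERNEL_GENERIC;
    }
}
```

On an IME-enabled toolchain the call select_kernel(model_ime_vlen(vlenb)) would presumably become select_kernel(__riscv_ime_vlen()), which the compiler can fold to a constant when VLEN is statically known.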

…s for microscaled intrinsics

Microscaled multiply-accumulate intrinsics carry two qualifiers that the existing intrinsics section did not name:

  _scaled - distinguishes the MX-scaled form of vfwmmacc / vfqmmacc / vf8wmmacc from their unscaled siblings.
  _bs{N} - selects the block size (16 or 32). Applies to all MX intrinsics, including the integer-input ones (vfwimmacc / vfqimmacc / vf8wimmacc), which exist only in the microscaled form and therefore do not carry _scaled.

Add a new subsection "Microscaled multiply-accumulate intrinsics" between the FP multiply-accumulate prototypes and the VLEN-portable code discussion. The subsection extends the canonical-suffix grammar already defined in the intrinsics overview, lists representative prototypes for each (FP, INT)-input case and each block size, and confirms that the MX scale format is implied by the input data type (no separate scale-format selector is needed).

Aligns the spec with what GCC and Clang already emit.

… and clarify _L{N} / _m orthogonality

Tile load/store intrinsics support an optional mask through the _m suffix (the same convention as base V-extension load/store). The canonical suffix order is updated to allow _m as the final qualifier, and the spec explicitly states that _L{N} and _m are orthogonal and may be combined as _L{N}_m.

Add representative masked prototypes for each of the four mnemonics (vmtl.v, vmts.v, vmttl.v, vmtts.v) and a combined-suffix example (vmtl_v_i8m1_L4_m, vmts_v_i8m1_L4_m). The mask bit width follows V's convention: vbool{N}_t where N matches the data element width (vbool8_t for i8, vbool32_t for i32, etc.).

Closes a gap between the spec and what GCC emits today (test: zvmma-tile-masked.c, zvmma-ofp8-tile-imm-lambda.c).
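The suffix ordering described in this commit message can be sketched as a small name composer. This is only an illustration: tile_intrinsic_name and mask_type_name are hypothetical helpers, not part of any toolchain, and the names are reproduced exactly as spelled in the commit message (any prefix the real intrinsics carry is not modelled).

```c
#include <stddef.h>
#include <stdio.h>

/* Compose a tile load/store intrinsic name such as "vmtl_v_i8m1_L4_m".
 * lambda == 0 omits the _L{N} qualifier; a non-zero `masked` appends _m as
 * the final qualifier, following the canonical order described above.
 * Returns the value of snprintf (length the full name needs). */
int tile_intrinsic_name(char *buf, size_t len, const char *mnemonic,
                        const char *type, unsigned lambda, int masked)
{
    char lpart[16] = "";

    if (lambda != 0)
        snprintf(lpart, sizeof lpart, "_L%u", lambda);
    return snprintf(buf, len, "%s_%s%s%s", mnemonic, type, lpart,
                    masked ? "_m" : "");
}

/* Mask type per the commit message: vbool{N}_t with N equal to the data
 * element width in bits (vbool8_t for i8, vbool32_t for i32). */
int mask_type_name(char *buf, size_t len, unsigned elem_bits)
{
    return snprintf(buf, len, "vbool%u_t", elem_bits);
}
```

Because _L{N} and _m are orthogonal, each of the four combinations (neither, _L{N} only, _m only, both) comes out of the same two independent arguments.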

…ad/store intrinsics

Extend the tile load/store intrinsic table to cover the alternate-format input vector types (OFP8 E4M3 / E5M2, OFP4 E2M1, signed/unsigned Int4, and BFloat16) so that input tiles can be loaded and stored without an intervening vreinterpret.

Element widths match the underlying storage:
- 8 bits for OFP8
- 4 bits for OFP4 / Int4
- 16 bits for BF16

Base pointer type is uint8_t * for OFP8, OFP4, and Int4 (since these are packed into byte-addressable memory), and __bf16 * for BF16.

The note clarifies that the masking (_m) and immediate-lambda (_L{N}) qualifiers extend to these alternate-format intrinsics on the same orthogonal basis as for the IEEE FP and standard-int types, and that the same expansion applies to the transposing variants vmttl.v / vmtts.v.

Closes a gap between the spec and what GCC emits today (gap #14; commits 561cbb5a and 4d74ffa7 in vrull/ime-intrinsics).
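To make the element widths above concrete, the following sketch computes how many bytes a buffer needs for a given number of elements in each alternate format. The enum and both function names are hypothetical, and the sketch assumes dense packing (two 4-bit elements per byte, rounded up to whole bytes), which the commit message implies but does not spell out.

```c
#include <stddef.h>

/* Hypothetical tags for the alternate input formats listed above. */
enum alt_format { FMT_OFP8, FMT_OFP4, FMT_INT4, FMT_BF16 };

/* Element width in bits, as listed in the commit message. */
unsigned alt_format_bits(enum alt_format f)
{
    switch (f) {
    case FMT_OFP8: return 8;
    case FMT_OFP4: return 4;
    case FMT_INT4: return 4;
    case FMT_BF16: return 16;
    }
    return 0; /* unknown format */
}

/* Bytes needed for n elements, assuming dense packing and rounding up to a
 * whole byte (an assumption of this sketch). */
size_t alt_format_bytes(enum alt_format f, size_t n)
{
    return (n * alt_format_bits(f) + 7) / 8;
}
```

For example, five OFP4 elements occupy 20 bits and therefore need three bytes behind the uint8_t * base pointer under this packing assumption.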

…_su_lm{N} examples

Two intrinsic-naming patterns were permitted by the spec grammar but never illustrated, leaving the surface ambiguous in practice:

1. Three-token long-form names arise when both altfmt_A and altfmt_B differ from the default *and* the accumulator type itself is the alternative encoding (e.g. BF16 from vfwmmacc.vv, or non-default OFP8 accumulator from non-widening vfmmacc.vv). Add three concrete examples (vfwmmacc bf16<-E4M3xE5M2; vfmmacc OFP8<-E4M3xE5M2; matching overloaded short form) and an explanatory paragraph that states the token order: accumulator first, then A and B input types in order.

2. The _su/_us mixed-sign suffix and the _lm{N} LMUL suffix combine in the canonical order _su|_us followed by _lm{N}. Add three examples covering vmmacc / vwmmacc / vqwmmacc, both _su and _us, and LMUL = 2, 4, 8.

Both patterns are already implemented and tested in GCC; the spec now shows them explicitly so users do not have to derive them from the suffix grammar at line 1448.
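The canonical order in item 2 (_su|_us first, then _lm{N}) can be sketched as a helper that builds only the qualifier tail. mmacc_qualifiers and the mixed_sign enum are hypothetical names for this illustration; the helper does not check that the LMUL value is one the spec allows, and it does not build the full intrinsic name because that is not quoted here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tag for the mixed-sign qualifier. */
enum mixed_sign { SIGN_NONE, SIGN_SU, SIGN_US };

/* Build the qualifier tail in canonical order: the _su or _us suffix (if
 * any) first, then _lm{N} (omitted when lmul == 0). Returns the value of
 * snprintf. No validation of lmul is attempted. */
int mmacc_qualifiers(char *buf, size_t len, enum mixed_sign sign,
                     unsigned lmul)
{
    const char *s = sign == SIGN_SU ? "_su"
                  : sign == SIGN_US ? "_us"
                  : "";
    char lm[16] = "";

    if (lmul != 0)
        snprintf(lm, sizeof lm, "_lm%u", lmul);
    return snprintf(buf, len, "%s%s", s, lm);
}
```

Keeping the sign and LMUL arguments separate means the order is fixed in one place, so a caller cannot produce a tail with _lm{N} ahead of the sign suffix.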

joseemoreira

I am approving and merging, but I believe that

size_t __riscv_ime_lambda (void);

must take a parameter, namely the SEW, so that it can return the proper lambda.

ptomsich added 5 commits May 3, 2026 20:37

ptomsich requested review from efocht, efocht-oct and joseemoreira May 3, 2026 19:04

joseemoreira approved these changes May 3, 2026

joseemoreira merged commit eb2e37e into main May 3, 2026
6 checks passed

joseemoreira merged 5 commits into main from ptomsich/intrinsics-gcc-feature-parity
