intrinsics: align spec with the GCC implementation surface #41

Merged
joseemoreira merged 5 commits into main from ptomsich/intrinsics-gcc-feature-parity
May 3, 2026

@ptomsich ptomsich commented May 3, 2026

While implementing the intrinsics and porting tests and examples to them, we had to implement a slightly larger API surface than the original specification defined. This PR aligns the specification with the end-to-end-tested GCC implementation.

ptomsich added 5 commits May 3, 2026 20:37
…me_lambda geometry queries

Add a normative subsection for the two geometry-query intrinsics that
GCC and Clang already implement to enable runtime VLEN/lambda detection:

  size_t __riscv_ime_vlen   (void);
  size_t __riscv_ime_lambda (void);

Both fold to compile-time constants when VLEN is statically known
(-mrvv-vector-bits=zvl) and otherwise emit a small runtime sequence
(csrr vlenb + shift, or csrr vlenb + ctz + shift respectively).

These intrinsics are the supported way for software to discover the
implementation's tile geometry without parsing CSR fields directly,
and are the building blocks for the runtime-dispatch pattern described
in the existing VLEN-portable code subsection.

A note clarifies that __riscv_ime_lambda returns a single representative
value; software that needs to enumerate the WARL set must still use
vsetvl write-readback.
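
The runtime path described above can be sketched in portable C. This is a minimal model, not the GCC expansion: `vlen_bits_from_vlenb` mirrors the `csrr vlenb` + shift sequence using the fact that `vlenb` holds VLEN/8, while `pick_kernel` and the `kernel_id` enum are hypothetical dispatch helpers invented for illustration. On a real target the `vlenb` argument would be replaced by a call to `__riscv_ime_vlen()`. The lambda computation is not modeled because the commit message does not state its formula.

```c
#include <assert.h>
#include <stddef.h>

/* Models "csrr vlenb + shift": vlenb is VLEN in bytes, so VLEN in bits
   is vlenb << 3. The argument stands in for the CSR read. */
static size_t vlen_bits_from_vlenb(size_t vlenb)
{
    /* RVV requires VLEN to be a power of two. */
    assert(vlenb != 0 && (vlenb & (vlenb - 1)) == 0);
    return vlenb << 3;
}

/* Hypothetical kernel identifiers for a runtime-dispatch table. */
typedef enum { KERNEL_GENERIC, KERNEL_VLEN256, KERNEL_VLEN512 } kernel_id;

/* Pick the widest kernel variant that the reported VLEN can support. */
static kernel_id pick_kernel(size_t vlen_bits)
{
    if (vlen_bits >= 512) return KERNEL_VLEN512;
    if (vlen_bits >= 256) return KERNEL_VLEN256;
    return KERNEL_GENERIC;
}
```

When VLEN is statically known, a compiler can fold the whole selection to a constant, which matches the constant-folding behaviour the commit describes for the intrinsics themselves.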
…s for microscaled intrinsics

Microscaled multiply-accumulate intrinsics carry two qualifiers that
the existing intrinsics section did not name:

  _scaled  - distinguishes the MX-scaled form of vfwmmacc / vfqmmacc /
             vf8wmmacc from their unscaled siblings.

  _bs{N}   - selects the block size (16 or 32).  Applies to all MX
             intrinsics, including the integer-input ones (vfwimmacc /
             vfqimmacc / vf8wimmacc), which exist only in the
             microscaled form and therefore do not carry _scaled.

Add a new subsection "Microscaled multiply-accumulate intrinsics"
between the FP multiply-accumulate prototypes and the VLEN-portable
code discussion.  The subsection extends the canonical-suffix grammar
already defined in the intrinsics overview, lists representative
prototypes for each (FP, INT)-input case and each block size, and
confirms that the MX scale format is implied by the input data type
(no separate scale-format selector is needed).

Aligns the spec with what GCC and Clang already emit.
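
To make the effect of the `_bs{N}` qualifier concrete, the sketch below counts how many shared scale values a run of elements needs. It assumes the usual microscaling layout of one scale per block of N consecutive elements; the function name `mx_scale_count` is hypothetical and the spec's actual scale operand layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Number of shared scales needed for n_elems inputs, assuming one scale
   per block of block_size consecutive elements (a partial trailing block
   still needs its own scale). */
static size_t mx_scale_count(size_t n_elems, unsigned block_size)
{
    /* _bs{N} only permits N = 16 or N = 32. */
    assert(block_size == 16 || block_size == 32);
    return (n_elems + block_size - 1) / block_size;
}
```

Halving the block size from 32 to 16 doubles the number of scales for the same input length, which is why the block size has to be part of the intrinsic name rather than inferred from the data type.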
… and clarify _L{N} / _m orthogonality

Tile load/store intrinsics support an optional mask through the _m
suffix (the same convention as base V-extension load/store).  The
canonical suffix order is updated to allow _m as the final qualifier,
and the spec explicitly states that _L{N} and _m are orthogonal and
may be combined as _L{N}_m.

Add representative masked prototypes for each of the four mnemonics
(vmtl.v, vmts.v, vmttl.v, vmtts.v) and a combined-suffix example
(vmtl_v_i8m1_L4_m, vmts_v_i8m1_L4_m).

The mask type follows V's convention: vbool{N}_t, where N is the SEW/LMUL
ratio. For the LMUL=1 types shown here, N equals the data element width
(vbool8_t for i8, vbool32_t for i32, etc.).

Closes a gap between the spec and what GCC emits today
(tests: zvmma-tile-masked.c, zvmma-ofp8-tile-imm-lambda.c).
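
The suffix ordering above (`_L{N}` first, `_m` last, each optional) can be illustrated with a small name composer. This is only a sketch: `tile_intrinsic_name` is a hypothetical helper, and the base names are taken from the examples in this commit message without any prefix the real headers may add.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends the optional qualifiers to a base name in canonical order:
   _L{N} (immediate lambda, omitted when lambda_imm == 0), then _m.
   Returns the name length, or -1 if the buffer is too small. */
static int tile_intrinsic_name(char *out, size_t cap, const char *base,
                               unsigned lambda_imm, int masked)
{
    int n;
    assert(out != NULL && base != NULL && strlen(base) > 0);
    if (lambda_imm != 0)
        n = snprintf(out, cap, "%s_L%u%s", base, lambda_imm,
                     masked ? "_m" : "");
    else
        n = snprintf(out, cap, "%s%s", base, masked ? "_m" : "");
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Because the two qualifiers are orthogonal, all four combinations are valid outputs; only their relative order is fixed.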
…ad/store intrinsics

Extend the tile load/store intrinsic table to cover the alternate-format
input vector types (OFP8 E4M3 / E5M2, OFP4 E2M1, signed/unsigned Int4,
and BFloat16) so that input tiles can be loaded and stored without an
intervening vreinterpret.

Element widths match the underlying storage:
  - 8 bits for OFP8
  - 4 bits for OFP4 / Int4
  - 16 bits for BF16

Base pointer type is uint8_t * for OFP8, OFP4, and Int4 (since these are
packed into byte-addressable memory), and __bf16 * for BF16.

The note clarifies that the masking (_m) and immediate-lambda (_L{N})
qualifiers extend to these alternate-format intrinsics on the same
orthogonal basis as for the IEEE FP and standard-int types, and that
the same expansion applies to the transposing variants vmttl.v / vmtts.v.

Closes a gap between the spec and what GCC emits today (gap #14;
commits 561cbb5a and 4d74ffa7 in vrull/ime-intrinsics).
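
The element widths listed above determine how many bytes a caller must provide behind the base pointer. The sketch below computes that size; the `alt_format` enum and both function names are hypothetical, and rounding an odd count of 4-bit elements up to a whole byte is an assumption about the packed layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the alternate input formats named in the commit. */
typedef enum { FMT_OFP8, FMT_OFP4, FMT_INT4, FMT_BF16 } alt_format;

/* Storage width in bits, as listed in the commit message. */
static unsigned alt_format_bits(alt_format f)
{
    switch (f) {
    case FMT_OFP8: return 8;
    case FMT_OFP4:
    case FMT_INT4: return 4;
    case FMT_BF16: return 16;
    }
    assert(!"unknown format");
    return 0;
}

/* Bytes needed for n_elems packed elements, rounded up to a whole byte. */
static size_t tile_storage_bytes(alt_format f, size_t n_elems)
{
    return (n_elems * alt_format_bits(f) + 7) / 8;
}
```

For the 4-bit formats two elements share one byte, which is why the base pointer type is `uint8_t *` rather than a dedicated element type.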
…_su_lm{N} examples

Two intrinsic-naming patterns were permitted by the spec grammar but
never illustrated, leaving the surface ambiguous in practice:

1. Three-token long-form names arise when both altfmt_A and altfmt_B
   differ from the default *and* the accumulator type itself is the
   alternative encoding (e.g. BF16 from vfwmmacc.vv, or non-default
   OFP8 accumulator from non-widening vfmmacc.vv).  Add three concrete
   examples (vfwmmacc bf16<-E4M3xE5M2; vfmmacc OFP8<-E4M3xE5M2;
   matching overloaded short form) and an explanatory paragraph that
   states the token order: accumulator first, then A and B input
   types in order.

2. The _su/_us mixed-sign suffix and the _lm{N} LMUL suffix combine
   in the canonical order _su|_us followed by _lm{N}.  Add three
   examples covering vmmacc / vwmmacc / vqwmmacc, both _su and _us,
   and LMUL = 2, 4, 8.

Both patterns are already implemented and tested in GCC; the spec now
shows them explicitly so users do not have to derive them from the
suffix grammar at line 1448.

@joseemoreira joseemoreira left a comment

I am approving and merging, but I believe that

size_t __riscv_ime_lambda (void);

must take a parameter, namely the SEW, so that it can return the proper lambda.

@joseemoreira joseemoreira merged commit eb2e37e into main May 3, 2026
6 checks passed