intrinsics: align spec with the GCC implementation surface #41
Merged
joseemoreira merged 5 commits into main on May 3, 2026
…me_lambda geometry queries
Add a normative subsection for the two geometry-query intrinsics that
GCC and Clang already implement to enable runtime VLEN/lambda detection:
    size_t __riscv_ime_vlen (void);
    size_t __riscv_ime_lambda (void);
Both fold to compile-time constants when VLEN is statically known
(-mrvv-vector-bits=zvl) and otherwise emit a small runtime sequence
(csrr vlenb + shift, or csrr vlenb + ctz + shift respectively).
These intrinsics are the supported way for software to discover the
implementation's tile geometry without parsing CSR fields directly, and
are the building blocks for the runtime-dispatch pattern described in
the existing VLEN-portable code subsection.
A note clarifies that __riscv_ime_lambda returns a single representative
value; software that needs to enumerate the WARL set must still use
vsetvl write-readback.
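The runtime-dispatch pattern mentioned above could look roughly like the following host-side sketch. It is not the spec's code: `ime_vlen_stub`, `stub_vlenb`, `select_kernel`, and the `kernel_id` values are hypothetical names invented here so the example compiles without a RISC-V toolchain. Only the vlenb-to-VLEN conversion (vlenb is VLEN in bytes, so VLEN in bits is vlenb << 3) mirrors the "csrr vlenb + shift" sequence; the lambda derivation is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the value a `csrr vlenb` would return.
   vlenb holds VLEN in bytes; 32 models a VLEN=256 machine. */
static size_t stub_vlenb = 32;

/* Host-side stand-in for __riscv_ime_vlen(): VLEN in bits = vlenb << 3. */
static size_t ime_vlen_stub(void) {
    /* VLEN is a power of two, so vlenb must be one as well. */
    assert(stub_vlenb != 0 && (stub_vlenb & (stub_vlenb - 1)) == 0);
    return stub_vlenb << 3;
}

/* Placeholder kernel identifiers; a real build would map these to
   tuned kernels. */
enum kernel_id {
    KERNEL_GENERIC,
    KERNEL_VLEN128,
    KERNEL_VLEN256,
    KERNEL_VLEN512
};

/* Query the geometry once and pick a kernel; any VLEN without a tuned
   variant falls back to the generic path. */
static enum kernel_id select_kernel(void) {
    switch (ime_vlen_stub()) {
    case 128: return KERNEL_VLEN128;
    case 256: return KERNEL_VLEN256;
    case 512: return KERNEL_VLEN512;
    default:  return KERNEL_GENERIC;
    }
}
```

On a target where VLEN is fixed at compile time, the real intrinsic would fold to a constant and the switch would presumably collapse to a single branch.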
…s for microscaled intrinsics
Microscaled multiply-accumulate intrinsics carry two qualifiers that
the existing intrinsics section did not name:
  _scaled - distinguishes the MX-scaled form of vfwmmacc / vfqmmacc /
            vf8wmmacc from their unscaled siblings.
  _bs{N}  - selects the block size (16 or 32). Applies to all MX
            intrinsics, including the integer-input ones (vfwimmacc /
            vfqimmacc / vf8wimmacc), which exist only in the
            microscaled form and therefore do not carry _scaled.
Add a new subsection "Microscaled multiply-accumulate intrinsics"
between the FP multiply-accumulate prototypes and the VLEN-portable
code discussion. The subsection extends the canonical-suffix grammar
already defined in the intrinsics overview, lists representative
prototypes for each (FP, INT)-input case and each block size, and
confirms that the MX scale format is implied by the input data type
(no separate scale-format selector is needed).
Aligns the spec with what GCC and Clang already emit.
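The qualifier rules in this commit message can be restated as a few predicates. This is only an illustrative sketch: the function names are hypothetical, and because the relative order of _scaled and _bs{N} in a full intrinsic name is not stated here, the code checks the rules without assembling any names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* FP-input mnemonics have both scaled and unscaled forms, so their MX
   form is marked with _scaled. */
static bool is_fp_input_mx(const char *mnemonic) {
    return strcmp(mnemonic, "vfwmmacc") == 0 ||
           strcmp(mnemonic, "vfqmmacc") == 0 ||
           strcmp(mnemonic, "vf8wmmacc") == 0;
}

/* Integer-input mnemonics exist only in the microscaled form. */
static bool is_int_input_mx(const char *mnemonic) {
    return strcmp(mnemonic, "vfwimmacc") == 0 ||
           strcmp(mnemonic, "vfqimmacc") == 0 ||
           strcmp(mnemonic, "vf8wimmacc") == 0;
}

/* _bs{N} accepts only the two block sizes named in the commit message. */
static bool valid_block_size(unsigned n) {
    return n == 16 || n == 32;
}

/* True when the MX intrinsic for this mnemonic carries _scaled. */
static bool carries_scaled(const char *mnemonic) {
    assert(mnemonic != NULL);
    return is_fp_input_mx(mnemonic);
}
```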
… and clarify _L{N} / _m orthogonality
Tile load/store intrinsics support an optional mask through the _m
suffix (the same convention as base V-extension load/store). The
canonical suffix order is updated to allow _m as the final qualifier,
and the spec explicitly states that _L{N} and _m are orthogonal and
may be combined as _L{N}_m.
Add representative masked prototypes for each of the four mnemonics
(vmtl.v, vmts.v, vmttl.v, vmtts.v) and a combined-suffix example
(vmtl_v_i8m1_L4_m, vmts_v_i8m1_L4_m).
The mask type follows V's convention: vbool{N}_t, where N matches
the data element width (vbool8_t for i8, vbool32_t for i32, etc.).
Closes a gap between the spec and what GCC emits today
(tests: zvmma-tile-masked.c, zvmma-ofp8-tile-imm-lambda.c).
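The suffix composition described in this commit can be sketched as a small name builder. The helper names (`tile_intrinsic_name`, `is_masked_name`, `mask_type_name`) are hypothetical and exist only for illustration; the one behavior taken from the text is that _L{N} comes before _m and that _m is always the final qualifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Append the optional tile load/store qualifiers to a base name.
   imm_lambda == 0 means "no _L{N} qualifier"; masked adds the final _m.
   Returns the snprintf result (the length of the full name). */
static int tile_intrinsic_name(char *out, size_t cap, const char *base,
                               unsigned imm_lambda, bool masked) {
    assert(out != NULL && base != NULL && cap > 0);
    char lpart[16] = "";
    if (imm_lambda != 0)
        snprintf(lpart, sizeof lpart, "_L%u", imm_lambda);
    return snprintf(out, cap, "%s%s%s", base, lpart, masked ? "_m" : "");
}

/* Because _m is always the final qualifier, a masked name ends in "_m". */
static bool is_masked_name(const char *name) {
    size_t n = strlen(name);
    return n >= 2 && strcmp(name + n - 2, "_m") == 0;
}

/* Mask type name per the rule quoted above: vbool{N}_t with N equal to
   the data element width in bits. */
static int mask_type_name(char *out, size_t cap, unsigned sew_bits) {
    assert(out != NULL && cap > 0);
    return snprintf(out, cap, "vbool%u_t", sew_bits);
}
```

Calling `tile_intrinsic_name(buf, sizeof buf, "vmtl_v_i8m1", 4, true)` reproduces the combined-suffix example vmtl_v_i8m1_L4_m from the commit message.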
…ad/store intrinsics
Extend the tile load/store intrinsic table to cover the alternate-format
input vector types (OFP8 E4M3 / E5M2, OFP4 E2M1, signed/unsigned Int4,
and BFloat16) so that input tiles can be loaded and stored without an
intervening vreinterpret.
Element widths match the underlying storage:
- 8 bits for OFP8
- 4 bits for OFP4 / Int4
- 16 bits for BF16
Base pointer type is uint8_t * for OFP8, OFP4, and Int4 (since these are
packed into byte-addressable memory), and __bf16 * for BF16.
The note clarifies that the masking (_m) and immediate-lambda (_L{N})
qualifiers extend to these alternate-format intrinsics on the same
orthogonal basis as for the IEEE FP and standard-int types, and that
the same expansion applies to the transposing variants vmttl.v / vmtts.v.
Closes a gap between the spec and what GCC emits today (gap #14;
commits 561cbb5a and 4d74ffa7 in vrull/ime-intrinsics).
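The storage rules listed in this commit fit in a small lookup table. The sketch below uses hypothetical names (`alt_format`, `alt_element_bits`, `alt_base_pointer`, `alt_storage_bytes`) and assumes only what the text states: element widths of 8, 4, and 16 bits, and sub-byte elements packed into byte-addressable memory. Nibble order within a byte is not specified here and is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Alternate-format element kinds named in the commit message. OFP8
   covers both E4M3 and E5M2; INT4 covers signed and unsigned. */
enum alt_format { ALT_OFP8, ALT_OFP4, ALT_INT4, ALT_BF16 };

/* Storage width in bits of one element. */
static unsigned alt_element_bits(enum alt_format f) {
    switch (f) {
    case ALT_OFP8: return 8;
    case ALT_OFP4: return 4;
    case ALT_INT4: return 4;
    case ALT_BF16: return 16;
    }
    assert(!"unknown alt_format");
    return 0;
}

/* Spelling of the base pointer type the load/store intrinsic takes. */
static const char *alt_base_pointer(enum alt_format f) {
    return f == ALT_BF16 ? "__bf16 *" : "uint8_t *";
}

/* Bytes needed to hold n packed elements, rounding a trailing
   half-byte up to a whole byte. */
static size_t alt_storage_bytes(enum alt_format f, size_t n) {
    return (n * alt_element_bits(f) + 7) / 8;
}
```

For example, five OFP4 elements occupy 20 bits, so `alt_storage_bytes(ALT_OFP4, 5)` rounds up to 3 bytes.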
…_su_lm{N} examples
Two intrinsic-naming patterns were permitted by the spec grammar but
never illustrated, leaving the surface ambiguous in practice:
1. Three-token long-form names arise when both altfmt_A and altfmt_B
differ from the default *and* the accumulator type itself is the
alternative encoding (e.g. BF16 from vfwmmacc.vv, or non-default
OFP8 accumulator from non-widening vfmmacc.vv). Add three concrete
examples (vfwmmacc bf16<-E4M3xE5M2; vfmmacc OFP8<-E4M3xE5M2;
matching overloaded short form) and an explanatory paragraph that
states the token order: accumulator first, then A and B input
types in order.
2. The _su/_us mixed-sign suffix and the _lm{N} LMUL suffix combine
in the canonical order _su|_us followed by _lm{N}. Add three
examples covering vmmacc / vwmmacc / vqwmmacc, both _su and _us,
and LMUL = 2, 4, 8.
Both patterns are already implemented and tested in GCC; the spec now
shows them explicitly so users do not have to derive them from the
suffix grammar at line 1448.
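The canonical order from item 2 (_su|_us first, then _lm{N}) can be sketched as a suffix builder. Everything here is illustrative: `sign_mix` and `mixed_sign_lmul_suffix` are hypothetical names, and restricting LMUL to 2, 4, and 8 is an assumption made only because those are the values the examples are said to cover.

```c
#include <assert.h>
#include <stdio.h>

/* Which mixed-sign qualifier, if any, the intrinsic carries. */
enum sign_mix { SIGN_DEFAULT, SIGN_SU, SIGN_US };

/* Build the trailing qualifiers in canonical order: _su or _us first,
   then _lm{N}. lmul == 0 means "no _lm{N} qualifier".
   Returns the snprintf result (the length of the suffix). */
static int mixed_sign_lmul_suffix(char *out, size_t cap,
                                  enum sign_mix mix, unsigned lmul) {
    assert(out != NULL && cap > 0);
    /* Assumption: only the LMUL values mentioned in the text. */
    assert(lmul == 0 || lmul == 2 || lmul == 4 || lmul == 8);
    const char *sign = mix == SIGN_SU ? "_su"
                     : mix == SIGN_US ? "_us"
                     : "";
    char lm[8] = "";
    if (lmul != 0)
        snprintf(lm, sizeof lm, "_lm%u", lmul);
    return snprintf(out, cap, "%s%s", sign, lm);
}
```

With `SIGN_SU` and LMUL 4 the builder yields `_su_lm4`; the base intrinsic name that precedes the suffix is left to the spec's grammar.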
joseemoreira (Collaborator) approved these changes on May 3, 2026 and left a comment:
I am approving and merging, but I believe that
size_t __riscv_ime_lambda (void);
must take a parameter, namely the SEW, so that it can return the proper lambda.
When implementing the intrinsics and porting tests/examples to them, we had to implement a slightly larger API surface for the intrinsics than in the original specification. This PR aligns the specification with the end-to-end tested GCC implementation.