Skip to content

Commit eb2e37e

Browse files
authored
intrinsics: align spec with the GCC implementation surface (#41)
* unprivileged/integrated-matrix: Document __riscv_ime_vlen / __riscv_ime_lambda geometry queries Add a normative subsection for the two geometry-query intrinsics that GCC and Clang already implement to enable runtime VLEN/lambda detection: size_t __riscv_ime_vlen (void); size_t __riscv_ime_lambda (void); Both fold to compile-time constants when VLEN is statically known (-mrvv-vector-bits=zvl) and otherwise emit a small runtime sequence (csrr vlenb + shift, or csrr vlenb + ctz + shift respectively). These intrinsics are the supported way for software to discover the implementation's tile geometry without parsing CSR fields directly, and are the building blocks for the runtime-dispatch pattern described in the existing VLEN-portable code subsection. A note clarifies that __riscv_ime_lambda returns a single representative value; software that needs to enumerate the WARL set must still use vsetvl write-readback. * unprivileged/integrated-matrix: Document _scaled and _bs{N} qualifiers for microscaled intrinsics Microscaled multiply-accumulate intrinsics carry two qualifiers that the existing intrinsics section did not name: _scaled - distinguishes the MX-scaled form of vfwmmacc / vfqmmacc / vf8wmmacc from their unscaled siblings. _bs{N} - selects the block size (16 or 32). Applies to all MX intrinsics, including the integer-input ones (vfwimmacc / vfqimmacc / vf8wimmacc), which exist only in the microscaled form and therefore do not carry _scaled. Add a new subsection "Microscaled multiply-accumulate intrinsics" between the FP multiply-accumulate prototypes and the VLEN-portable code discussion. The subsection extends the canonical-suffix grammar already defined in the intrinsics overview, lists representative prototypes for each (FP, INT)-input case and each block size, and confirms that the MX scale format is implied by the input data type (no separate scale-format selector is needed). Aligns the spec with what GCC and Clang already emit. * unprivileged/integrated-matrix: Add masked tile load/store intrinsics and clarify _L{N} / _m orthogonality Tile load/store intrinsics support an optional mask through the _m suffix (the same convention as base V-extension load/store). The canonical suffix order is updated to allow _m as the final qualifier, and the spec explicitly states that _L{N} and _m are orthogonal and may be combined as _L{N}_m. Add representative masked prototypes for each of the four mnemonics (vmtl.v, vmts.v, vmttl.v, vmtts.v) and a combined-suffix example (vmtl_v_i8m1_L4_m, vmts_v_i8m1_L4_m). The mask bit width follows V's convention: vbool{N}_t where N matches the data element width (vbool8_t for i8, vbool32_t for i32, etc.). Closes a gap between the spec and what GCC emits today (test: zvmma-tile-masked.c, zvmma-ofp8-tile-imm-lambda.c). * unprivileged/integrated-matrix: Add typed OFP8/OFP4/Int4/BF16 tile load/store intrinsics Extend the tile load/store intrinsic table to cover the alternate-format input vector types (OFP8 E4M3 / E5M2, OFP4 E2M1, signed/unsigned Int4, and BFloat16) so that input tiles can be loaded and stored without an intervening vreinterpret. Element widths match the underlying storage: - 8 bits for OFP8 - 4 bits for OFP4 / Int4 - 16 bits for BF16 Base pointer type is uint8_t * for OFP8, OFP4, and Int4 (since these are packed into byte-addressable memory), and __bf16 * for BF16. The note clarifies that the masking (_m) and immediate-lambda (_L{N}) qualifiers extend to these alternate-format intrinsics on the same orthogonal basis as for the IEEE FP and standard-int types, and that the same expansion applies to the transposing variants vmttl.v / vmtts.v. Closes a gap between the spec and what GCC emits today (gap #14; commits 561cbb5a and 4d74ffa7 in vrull/ime-intrinsics). * unprivileged/integrated-matrix: Document three-token mixed names and _su_lm{N} examples Two intrinsic-naming patterns were permitted by the spec grammar but never illustrated, leaving the surface ambiguous in practice: 1. Three-token long-form names arise when both altfmt_A and altfmt_B differ from the default *and* the accumulator type itself is the alternative encoding (e.g. BF16 from vfwmmacc.vv, or non-default OFP8 accumulator from non-widening vfmmacc.vv). Add three concrete examples (vfwmmacc bf16<-E4M3xE5M2; vfmmacc OFP8<-E4M3xE5M2; matching overloaded short form) and an explanatory paragraph that states the token order: accumulator first, then A and B input types in order. 2. The _su/_us mixed-sign suffix and the _lm{N} LMUL suffix combine in the canonical order _su|_us followed by _lm{N}. Add three examples covering vmmacc / vwmmacc / vqwmmacc, both _su and _us, and LMUL = 2, 4, 8. Both patterns are already implemented and tested in GCC; the spec now shows them explicitly so users do not have to derive them from the suffix grammar at line 1448.
1 parent e9ce69a commit eb2e37e

1 file changed

Lines changed: 204 additions & 1 deletion

File tree

src/integrated-matrix.adoc

Lines changed: 204 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1446,8 +1446,11 @@ The naming convention and type system extend the RISC-V V Intrinsics API
14461446
`__riscv_{mnemonic}_vv_{accum}_{inputA}_{inputB}`.
14471447
When `altfmt_A = altfmt_B`, a single input-type suffix suffices.
14481448
* The canonical suffix ordering for the long-form name is:
1449-
`{type-suffix}[_{inputA}[_{inputB}]][_su|_us][_lm{N}][_L{N}]`
1449+
`{type-suffix}[_{inputA}[_{inputB}]][_su|_us][_lm{N}][_L{N}][_m]`
14501450
where each bracketed component is omitted when not applicable.
1451+
The `_m` (masked) suffix is reserved for tile load/store intrinsics
1452+
and always appears last; `_L{N}` (immediate-lambda) and `_m` are
1453+
orthogonal and may be combined as `_L{N}_m`.
14511454

14521455
===== Overloaded short forms (GCC and Clang)
14531456

@@ -1491,6 +1494,39 @@ NOTE: The mechanism used for overload resolution in C is compiler-specific
14911494
The resulting user-visible API is identical across compilers; only the internal
14921495
implementation differs.
14931496

1497+
===== Geometry-query intrinsics
1498+
1499+
Two non-instruction intrinsics expose the implementation's IME geometry to
1500+
software:
1501+
1502+
[source,c]
1503+
--
1504+
size_t __riscv_ime_vlen (void); /* VLEN in bits */
1505+
size_t __riscv_ime_lambda (void); /* implementation lambda value */
1506+
--
1507+
1508+
`__riscv_ime_vlen()` returns the architectural VLEN (in bits). When VLEN is
1509+
known to the compiler — for example via `-mrvv-vector-bits=zvl` on GCC or
1510+
Clang — the call folds to a load-immediate; otherwise the compiler emits
1511+
`csrr vlenb` followed by a left-shift by 3.
1512+
1513+
`__riscv_ime_lambda()` returns the implementation's lambda value as observed
1514+
by the matrix unit. When VLEN is statically known the call folds to a
1515+
constant; otherwise the compiler emits a runtime sequence (typically
1516+
`csrr vlenb` followed by `ctz` and a shift) that derives lambda from VLEN.
1517+
1518+
These intrinsics are the supported way for software to discover the
1519+
implementation's tile geometry without parsing CSR fields directly, and are
1520+
intended for use in the runtime-dispatch pattern described in
1521+
<<_vlen_portable_code>>: software queries lambda (and VLEN, if needed),
1522+
then selects an appropriate code path among several specialised by MUL_C.
1523+
1524+
NOTE: `__riscv_ime_lambda()` returns a single representative lambda value
1525+
for the implementation. Implementations may support multiple lambda values
1526+
in `vtype.lambda[2:0]` (the field is WARL). Software that needs to enumerate
1527+
all supported lambda values must do so through `vsetvl` write-readback at
1528+
the relevant SEW; see <<_vlen_portable_code>>.
1529+
14941530
===== Tile load and store (Zvvmtls) and transposed tile load and store (Zvvmttls)
14951531

14961532
The load intrinsics accept a base pointer, a leading-dimension `ld` in
@@ -1542,6 +1578,43 @@ void __riscv_vmtts_v_f32m1 (float *base, size_t ld, vfloat32m1_t vs3, size_t
15421578
void __riscv_vmtts_v_f64m1 (double *base, size_t ld, vfloat64m1_t vs3, size_t vl);
15431579
--
15441580

1581+
The tile load/store intrinsics are also defined for the alternate-format
1582+
input vector types (OFP4, OFP8, BF16, Int4) so that input tiles can be
1583+
loaded and stored without a `vreinterpret`. The element width follows
1584+
the underlying storage width: 8 bits for OFP8 and 4 bits for OFP4 / Int4.
1585+
The base-pointer type is `uint8_t *` for OFP8 (and for OFP4 / Int4 packed
1586+
into bytes).
1587+
1588+
[source,c]
1589+
--
1590+
/* OFP8 — order-preserving load/store: */
1591+
vfloat8e4m3m1_t __riscv_vmtl_v_f8e4m3m1 (const uint8_t *base, size_t ld, size_t vl);
1592+
vfloat8e5m2m1_t __riscv_vmtl_v_f8e5m2m1 (const uint8_t *base, size_t ld, size_t vl);
1593+
void __riscv_vmts_v_f8e4m3m1 (uint8_t *base, size_t ld, vfloat8e4m3m1_t vs3, size_t vl);
1594+
void __riscv_vmts_v_f8e5m2m1 (uint8_t *base, size_t ld, vfloat8e5m2m1_t vs3, size_t vl);
1595+
1596+
/* OFP8 — transposing load/store: */
1597+
vfloat8e4m3m1_t __riscv_vmttl_v_f8e4m3m1 (const uint8_t *base, size_t ld, size_t vl);
1598+
void __riscv_vmtts_v_f8e5m2m1 (uint8_t *base, size_t ld, vfloat8e5m2m1_t vs3, size_t vl);
1599+
1600+
/* OFP4 / Int4 — order-preserving load/store: */
1601+
vfloat4e2m1m1_t __riscv_vmtl_v_f4e2m1m1 (const uint8_t *base, size_t ld, size_t vl);
1602+
vint4m1_t __riscv_vmtl_v_i4m1 (const uint8_t *base, size_t ld, size_t vl);
1603+
vuint4m1_t __riscv_vmtl_v_u4m1 (const uint8_t *base, size_t ld, size_t vl);
1604+
void __riscv_vmts_v_f4e2m1m1 (uint8_t *base, size_t ld, vfloat4e2m1m1_t vs3, size_t vl);
1605+
void __riscv_vmts_v_i4m1 (uint8_t *base, size_t ld, vint4m1_t vs3, size_t vl);
1606+
void __riscv_vmts_v_u4m1 (uint8_t *base, size_t ld, vuint4m1_t vs3, size_t vl);
1607+
1608+
/* BFloat16 — order-preserving load/store: */
1609+
vbfloat16m1_t __riscv_vmtl_v_bf16m1 (const __bf16 *base, size_t ld, size_t vl);
1610+
void __riscv_vmts_v_bf16m1 (__bf16 *base, size_t ld, vbfloat16m1_t vs3, size_t vl);
1611+
--
1612+
1613+
The transposing tile load/store intrinsics (`vmttl.v` / `vmtts.v`) follow
1614+
the same pattern; replace `vmtl` / `vmts` with `vmttl` / `vmtts`.
1615+
All masking (`_m`) and immediate-lambda (`_L{N}`) qualifiers extend to
1616+
these alternate-format intrinsics on the same orthogonal basis.
1617+
15451618
When an immediate lambda override is required, a `_L{N}` variant encodes the
15461619
lambda value (N ∈ {1, 2, 4, 8, 16, 32, 64}) directly in the intrinsic name.
15471620
Because the lambda value maps to a 3-bit instruction-immediate field, the
@@ -1559,6 +1632,38 @@ vint32m1_t __riscv_vmttl_v_i32m1_L4 (const int32_t *base, size_t ld, size_t vl
15591632
void __riscv_vmtts_v_i32m1_L4 (int32_t *base, size_t ld, vint32m1_t vs3, size_t vl);
15601633
--
15611634

1635+
Tile load and store intrinsics support an optional mask through the
1636+
`_m` suffix. The masked load takes a leading `vbool{N}_t mask` argument;
1637+
the masked store takes a leading `vbool{N}_t mask` followed by the usual
1638+
arguments. The mask bit width matches the data type's element width
1639+
(e.g., `vbool8_t` for `i8m1`, `vbool32_t` for `i32m1`, `vbool16_t` for
1640+
`bf16m1`), following the standard V-extension convention.
1641+
1642+
[source,c]
1643+
--
1644+
/* vmtl.v masked: */
1645+
vint8m1_t __riscv_vmtl_v_i8m1_m (vbool8_t mask, const int8_t *base, size_t ld, size_t vl);
1646+
vint32m1_t __riscv_vmtl_v_i32m1_m (vbool32_t mask, const int32_t *base, size_t ld, size_t vl);
1647+
/* vmts.v masked: */
1648+
void __riscv_vmts_v_i8m1_m (vbool8_t mask, int8_t *base, size_t ld, vint8m1_t vs3, size_t vl);
1649+
void __riscv_vmts_v_i32m1_m (vbool32_t mask, int32_t *base, size_t ld, vint32m1_t vs3, size_t vl);
1650+
/* vmttl.v masked: */
1651+
vint32m1_t __riscv_vmttl_v_i32m1_m (vbool32_t mask, const int32_t *base, size_t ld, size_t vl);
1652+
/* vmtts.v masked: */
1653+
void __riscv_vmtts_v_i32m1_m (vbool32_t mask, int32_t *base, size_t ld, vint32m1_t vs3, size_t vl);
1654+
--
1655+
1656+
The `_L{N}` (immediate-lambda) and `_m` (masked) qualifiers are orthogonal
1657+
and may be combined. `_L{N}` precedes `_m` per the canonical suffix order:
1658+
1659+
[source,c]
1660+
--
1661+
/* vmtl.v with lambda=4 and a mask: */
1662+
vint8m1_t __riscv_vmtl_v_i8m1_L4_m (vbool8_t mask, const int8_t *base, size_t ld, size_t vl);
1663+
/* vmts.v with lambda=4 and a mask: */
1664+
void __riscv_vmts_v_i8m1_L4_m (vbool8_t mask, int8_t *base, size_t ld, vint8m1_t vs3, size_t vl);
1665+
--
1666+
15621667
===== Integer matrix multiply-accumulate (Zvvmm)
15631668

15641669
The accumulator register group C uses MUL_C = VLEN ÷ (SEW × λ²) registers.
@@ -1658,6 +1763,22 @@ vint32m2_t __riscv_vwmmacc_vv_i32m2_us(vint32m2_t vd, vuint16m1_t vs1,
16581763
vint16m1_t vs2, size_t vl);
16591764
--
16601765

1766+
The `_su` / `_us` mixed-sign suffix and the `_lm{N}` LMUL suffix combine
1767+
in the canonical order `_su|_us` followed by `_lm{N}`:
1768+
1769+
[source,c]
1770+
--
1771+
/* vmmacc.vv signed A × unsigned B with LMUL=2 (MUL_C=2, LMUL=2): */
1772+
vint32m2_t __riscv_vmmacc_vv_i32m2_su_lm2(vint32m2_t vd, vint32m2_t vs1,
1773+
vuint32m2_t vs2, size_t vl);
1774+
/* vwmmacc.vv signed A × unsigned B with LMUL=4 (MUL_C=2, LMUL=4): */
1775+
vint32m2_t __riscv_vwmmacc_vv_i32m2_su_lm4(vint32m2_t vd, vint16m4_t vs1,
1776+
vuint16m4_t vs2, size_t vl);
1777+
/* vqwmmacc.vv unsigned A × signed B with LMUL=8 (MUL_C=2, LMUL=8): */
1778+
vint32m2_t __riscv_vqwmmacc_vv_i32m2_us_lm8(vint32m2_t vd, vuint8m8_t vs1,
1779+
vint8m8_t vs2, size_t vl);
1780+
--
1781+
16611782
===== Floating-point matrix multiply-accumulate (Zvvfmm)
16621783

16631784
The OFP8 vector types `vfloat8e4m3mX_t` (E4M3) and `vfloat8e5m2mX_t` (E5M2),
@@ -1763,13 +1884,31 @@ vfloat16m4_t __riscv_vfwmmacc_vv_f16m4_f8e4m3m1_f8e5m2m1(vfloat16m4_t vd,
17631884
/* Mixed 16-bit: FP16 × BF16 → FP32 via vfwmmacc.vv (altfmt_A=0, altfmt_B=1): */
17641885
vfloat32m2_t __riscv_vfwmmacc_vv_f32m2_f16m1_bf16m1(vfloat32m2_t vd,
17651886
vfloat16m1_t vs1, vbfloat16m1_t vs2, size_t vl);
1887+
/* Mixed OFP8 with BF16 accumulator (altfmt selects BF16 over FP16
1888+
for the C type, plus altfmt_A=0, altfmt_B=1 for the inputs).
1889+
The accumulator type appears once, followed by both input types: */
1890+
vbfloat16m4_t __riscv_vfwmmacc_vv_bf16m4_f8e4m3m1_f8e5m2m1(vbfloat16m4_t vd,
1891+
vfloat8e4m3m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl);
1892+
/* Mixed OFP8 → OFP8 (non-widening vfmmacc.vv). Both input formats and
1893+
the accumulator format are independent; up to three type tokens: */
1894+
vfloat8e4m3m1_t __riscv_vfmmacc_vv_f8e4m3m1_f8e4m3m1_f8e5m2m1(vfloat8e4m3m1_t vd,
1895+
vfloat8e4m3m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl);
17661896
/* Overloaded short forms — compiler resolves from vs1/vs2 types: */
17671897
vfloat16m4_t __riscv_vfwmmacc_vv(vfloat16m4_t vd,
17681898
vfloat8e4m3m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl) __attribute__((overloadable));
17691899
vfloat32m2_t __riscv_vfwmmacc_vv(vfloat32m2_t vd,
17701900
vfloat16m1_t vs1, vbfloat16m1_t vs2, size_t vl) __attribute__((overloadable));
1901+
vbfloat16m4_t __riscv_vfwmmacc_vv(vbfloat16m4_t vd,
1902+
vfloat8e4m3m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl) __attribute__((overloadable));
17711903
--
17721904

1905+
The three-token long-form name encodes the accumulator type first (as
1906+
the canonical type-suffix), followed by the A and B input-type suffixes
1907+
in order. When the accumulator is the default for the given instruction
1908+
and SEW (e.g., FP16 for `vfwmmacc.vv` with 8-bit inputs), the two-token
1909+
form is used; the three-token form arises only when the accumulator type
1910+
itself is a non-default selection (e.g., BF16 from `vfwmmacc.vv`).
1911+
17731912

17741913
The `vfwimmacc.vv` and `vfqimmacc.vv` instructions use integer input tiles
17751914
with a floating-point accumulator. The input tile type reflects the integer
@@ -1793,6 +1932,70 @@ vfloat16m4_t __riscv_vfqimmacc_vv_f16m4_i4m1(vfloat16m4_t vd,
17931932
vbfloat16m4_t __riscv_vfqimmacc_vv_bf16m4_u4m1(vbfloat16m4_t vd,
17941933
vuint4m1_t vs1, vuint4m1_t vs2, size_t vl);
17951934
--
1935+
1936+
===== Microscaled multiply-accumulate intrinsics
1937+
1938+
Microscaled (MX) multiply-accumulate variants take an extra `vuint16m1_t v0`
1939+
argument carrying the paired block-scale values (see <<integrated-matrix-microscaling>>).
1940+
Two qualifiers in the intrinsic name distinguish them from the unscaled forms:
1941+
1942+
* `_scaled` — applied to the floating-point widening intrinsics
1943+
(`vfwmmacc`, `vfqmmacc`, `vf8wmmacc`). These instructions exist in both
1944+
unscaled and scaled forms; the qualifier disambiguates.
1945+
* `_bs{N}` (N ∈ {16, 32}) — appended to the type-suffix to select the
1946+
block size. This applies to all microscaled intrinsics, including the
1947+
integer-input ones (`vfwimmacc`, `vfqimmacc`, `vf8wimmacc`), which
1948+
exist only in the microscaled form and therefore do not carry `_scaled`.
1949+
1950+
The full canonical suffix order for microscaled multiply-accumulate
1951+
intrinsics extends the one given in <<integrated-matrix-intrinsics>>:
1952+
1953+
----
1954+
{type-suffix}[_{inputA}[_{inputB}]][_su|_us][_lm{N}]_bs{N}
1955+
----
1956+
1957+
[source,c]
1958+
--
1959+
/* Floating-point microscaled (FP × FP × MX scale): _scaled qualifier */
1960+
1961+
/* FP16 × FP16 → FP32, block size 32 */
1962+
vfloat32m2_t __riscv_vfwmmacc_scaled_vv_f32m2_f16m1_bs32(
1963+
vfloat32m2_t vd, vfloat16m1_t vs1, vfloat16m1_t vs2,
1964+
vuint16m1_t v0, size_t vl);
1965+
/* FP16 × FP16 → FP32, block size 16 */
1966+
vfloat32m2_t __riscv_vfwmmacc_scaled_vv_f32m2_f16m1_bs16(
1967+
vfloat32m2_t vd, vfloat16m1_t vs1, vfloat16m1_t vs2,
1968+
vuint16m1_t v0, size_t vl);
1969+
/* OFP8 E4M3 × OFP8 E4M3 → FP32, block size 32 */
1970+
vfloat32m2_t __riscv_vfqwmmacc_scaled_vv_f32m2_f8e4m3m1_bs32(
1971+
vfloat32m2_t vd, vfloat8e4m3m1_t vs1, vfloat8e4m3m1_t vs2,
1972+
vuint16m1_t v0, size_t vl);
1973+
/* OFP4 E2M1 × OFP4 E2M1 → FP32, block size 32 */
1974+
vfloat32m2_t __riscv_vf8wmmacc_scaled_vv_f32m2_f4e2m1m1_bs32(
1975+
vfloat32m2_t vd, vfloat4e2m1m1_t vs1, vfloat4e2m1m1_t vs2,
1976+
vuint16m1_t v0, size_t vl);
1977+
1978+
/* Integer-input microscaled (Int × Int × MX scale → FP):
1979+
no _scaled qualifier — these instructions only exist in MX form. */
1980+
1981+
/* MXINT8 → FP16, block size 32 */
1982+
vfloat16m1_t __riscv_vfwimmacc_vv_f16m1_i8m1_bs32(
1983+
vfloat16m1_t vd, vint8m1_t vs1, vint8m1_t vs2,
1984+
vuint16m1_t v0, size_t vl);
1985+
/* MXINT8 → FP32, block size 32 */
1986+
vfloat32m2_t __riscv_vfqwimmacc_vv_f32m2_i8m1_bs32(
1987+
vfloat32m2_t vd, vint8m1_t vs1, vint8m1_t vs2,
1988+
vuint16m1_t v0, size_t vl);
1989+
/* MXINT4 → FP64, block size 32 */
1990+
vfloat64m4_t __riscv_vf8wimmacc_vv_f64m4_i8m1_bs32(
1991+
vfloat64m4_t vd, vint8m1_t vs1, vint8m1_t vs2,
1992+
vuint16m1_t v0, size_t vl);
1993+
--
1994+
1995+
The scale-format is implied by the input data type per the encoding maps in
1996+
<<integrated-matrix-microscaling>>; software does not select the scale
1997+
format separately.
1998+
17961999
===== VLEN-portable code
17972000

17982001
Because MUL_C = VLEN / (SEW × λ²), the accumulator register-group multiplier

0 commit comments

Comments
 (0)