diff --git a/src/integrated-matrix.adoc b/src/integrated-matrix.adoc index a31c9efea..f0a7fa63a 100644 --- a/src/integrated-matrix.adoc +++ b/src/integrated-matrix.adoc @@ -72,8 +72,8 @@ The accumulator register group has cardinality MUL_C = VLEN ÷ (SEW × λ²), in The elements in the vector registers are contiguous in the λ direction, as depicted in <>. Tile A and C elements are sorted in row-major order while tile B^T^ elements are sorted in column-major order. This choice allows the implementation of the -matrix tile multiplication as inner product (eg. a systolic array) or outer product and -simplifies the implementation of high-rank updates based outer products. +matrix tile multiplication as inner product (e.g., a systolic array) or outer product and +simplifies the implementation of high-rank updates based on outer products. [#ime-tile-lmul-fig] .LMUL scaling of matrix tiles for LMUL=2 (left) and LMUL=4 (right). @@ -100,12 +100,12 @@ The type tokens are: * Integer: `i4`, `i8`, `i16`, `i32`, `i64` * Floating-point: `ofp4`, `ofp8`, `fp16`, `bf16`, `fp32`, `fp64` -* Microscaling (E8M0-scaled, block size 32): prefix `x` — e.g. `xofp8`, `xi8` -* Microscaling (E8M0-scaled, block size 16): prefix `xn` — e.g. `xnofp8`, `xni8` +* Microscaling (E8M0-scaled, block size 32): prefix `x` — e.g., `xofp8`, `xi8` +* Microscaling (E8M0-scaled, block size 16): prefix `xn` — e.g., `xnofp8`, `xni8` Examples: `Zvvi8i32mm` (Int8 inputs, Int32 accumulator), `Zvvfp16fp32mm` (FP16 inputs, FP32 accumulator), `Zvvxofp8fp16mm` (MXFP8 inputs with BS=32, FP16 accumulator). -When the accumulator type matches the input type (e.g. `Zvvi16mm`, `Zvvfp32mm`), +When the accumulator type matches the input type (e.g., `Zvvi16mm`, `Zvvfp32mm`), the output token is dropped. ==== Computational subextensions @@ -151,6 +151,103 @@ The subextensions specific to microscaling are listed separately in <> can be combined in many ways, +depending on the target workload and implementation goals. 
+This section defines specialized Zvvm extensions as useful and interoperable +bundles of existing computational subextensions for major application domains. + +Each specialized extension is defined solely by the set of computational subextensions it includes. +These specialized extensions do not define new instructions or new arithmetic semantics beyond those already +defined by the included subextensions. +Implementations are free to support other combinations of subextensions. + +==== Zvvhpc + +The `Zvvhpc` extension is a specialized bundle of Zvvm computational subextensions +targeting high-performance computing workloads. +It emphasizes support for the floating-point data types most commonly used in +scientific computing, numerical linear algebra, simulation, and technical computing. + +The `Zvvhpc` extension is characterized by support for: + +* IEEE binary64 matrix multiplication and accumulation +* IEEE binary32 matrix multiplication and accumulation +* IEEE binary16 and BFloat16 inputs with accumulation into IEEE binary32 and, where supported, IEEE binary64 + +The `Zvvhpc` extension includes: + +* `Zvvfp64mm` +* `Zvvfp32mm` +* `Zvvfp32fp64mm` +* `Zvvfp16fp32mm` +* `Zvvfp16fp64mm` +* `Zvvbf16fp32mm` +* `Zvvbf16fp64mm` + +Implementations targeting `Zvvhpc` may also support additional integer subextensions, +but low-precision OFP4/OFP8 and microscaling-oriented subextensions are not the primary focus of this extension. + +==== Zvvaimlfp + +The `Zvvaimlfp` extension is a specialized bundle of Zvvm computational subextensions +targeting artificial-intelligence and machine-learning workloads based on floating-point +and mixed-precision arithmetic. +It emphasizes support for lower-precision floating-point formats, widening accumulation, +and microscaling-friendly data types commonly used in training and inference. 
+ +The `Zvvaimlfp` extension is characterized by support for: + +* OFP8 matrix multiplication and accumulation +* IEEE binary16 and BFloat16 matrix multiplication and accumulation +* OFP8, IEEE binary16, and BFloat16 inputs with accumulation into wider formats such as IEEE binary16 and IEEE binary32 +* Microscaling-oriented floating-point data types and associated computational subextensions + +The `Zvvaimlfp` extension includes: + +* `Zvvofp8mm` +* `Zvvofp8fp16mm` +* `Zvvofp8bf16mm` +* `Zvvofp8fp32mm` +* `Zvvfp16mm` +* `Zvvfp16fp32mm` +* `Zvvbf16mm` +* `Zvvbf16fp32mm` + +Implementations targeting `Zvvaimlfp` may also support OFP4-derived widening subextensions +such as `Zvvofp4ofp8mm`, `Zvvofp4fp16mm`, `Zvvofp4bf16mm`, and `Zvvofp4fp32mm`, +as well as microscaling-specific subextensions listed in <>. + +==== Zvvaimlint + +The `Zvvaimlint` extension is a specialized bundle of Zvvm computational subextensions +targeting artificial-intelligence and machine-learning workloads based on quantized integer arithmetic. +It emphasizes support for low-precision integer inputs with accumulation into wider integer formats, +as commonly used in quantized inference. + +The `Zvvaimlint` extension is characterized by support for: + +* Int8 and Int16 matrix multiplication and accumulation +* Widening accumulation into Int16, Int32, and, where useful, Int64 +* Optional support for Int4 input formats for more aggressively quantized workloads + +The `Zvvaimlint` extension includes: + +* `Zvvi8mm` +* `Zvvi8i16mm` +* `Zvvi8i32mm` +* `Zvvi16mm` +* `Zvvi16i32mm` +* `Zvvi4i8mm` +* `Zvvi4i16mm` +* `Zvvi4i32mm` + +Implementations targeting `Zvvaimlint` may also support wider integer accumulation +subextensions such as `Zvvi8i64mm`, `Zvvi16i64mm`, and `Zvvi32i64mm`, depending on workload requirements. + +NOTE: The `Zvvhpc`, `Zvvaimlfp`, and `Zvvaimlint` extensions are not mutually exclusive. +A single implementation may support subextensions from more than one specialized extension.
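Since each specialized extension is defined solely by its set of computational subextensions, bundle support reduces to a subset test over the implementation's supported set. The C sketch below illustrates this set-inclusion view only; the bit assignments and the `supports_bundle` helper are hypothetical and not part of this specification, and real feature discovery would go through the platform's ISA string or probing mechanism.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-subextension flags: the bit assignments are
 * illustrative, not architectural. */
enum {
    ZVVFP64MM     = 1u << 0,
    ZVVFP32MM     = 1u << 1,
    ZVVFP32FP64MM = 1u << 2,
    ZVVFP16FP32MM = 1u << 3,
    ZVVFP16FP64MM = 1u << 4,
    ZVVBF16FP32MM = 1u << 5,
    ZVVBF16FP64MM = 1u << 6,
    ZVVI8MM       = 1u << 7,
};

/* Zvvhpc is exactly the union of the subextensions it includes. */
static const uint32_t ZVVHPC =
    ZVVFP64MM | ZVVFP32MM | ZVVFP32FP64MM | ZVVFP16FP32MM |
    ZVVFP16FP64MM | ZVVBF16FP32MM | ZVVBF16FP64MM;

/* A bundle is supported iff every included subextension is supported;
 * extra subextensions (e.g., integer ones) do not interfere. */
static int supports_bundle(uint32_t supported, uint32_t bundle)
{
    return (supported & bundle) == bundle;
}
```

Because the bundles are plain sets, an implementation that supports subextensions from more than one bundle simply has the union of the corresponding bits set.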
=== New fields in the Vector Type (`vtype`) Register @@ -287,7 +384,7 @@ tile load instructions always transfer data at SEW granularity, every loaded SEW-bit position contains W contiguous narrow elements that the multiply-accumulate instruction consumes as a sub-dot-product. -[#ime-tile-widening-fig]] +[#ime-tile-widening-fig] .Element distribution and tile geometry example for L=32, SEW wide elements (left), two SEW/2 wide elements packed per SEW (middle), and four SEW/4 wide elements per SEW (right). Packing/widening by W increases the effective K dimension of the tile by a factor of W. image::png/ime-tile-widening.png[align="center"] @@ -337,15 +434,15 @@ The following table lists the integer matrix tile multiplication instructions. T | `vmmacc.vv vd, vs1, vs2` | 1 | SEW | SEW | `vwmmacc.vv vd, vs1, vs2` | 2 | SEW/2 | SEW -| `vqwmmacc.vv vd, vs1, vs2` | 4 | SEW/4 | SEW +| `vqmmacc.vv vd, vs1, vs2` | 4 | SEW/4 | SEW | `v8wmmacc.vv vd, vs1, vs2` | 8 | SEW/8 | SEW |=== Vector masking is not supported on matrix multiply-accumulate instructions: the `vm` bit in the encoding must be 1. For `vmmacc.vv`, `vm=0` is _reserved_. -For `vwmmacc.vv`, `vqwmmacc.vv`, and `v8wmmacc.vv`, `vm=0` encodes -`vfwimmacc.vv`, `vfqwimmacc.vv`, and `vf8wimmacc.vv` respectively — +For `vwmmacc.vv`, `vqmmacc.vv`, and `v8wmmacc.vv`, `vm=0` encodes +`vfwimmacc.vv`, `vfqimmacc.vv`, and `vf8wimmacc.vv` respectively — integer-input, floating-point-accumulate, microscaled instructions (see <> and <>). For floating-point multiply-accumulate instructions, `vm=0` enables @@ -364,7 +461,7 @@ Because modular addition is both associative and commutative, the final result i The `Zvvfmm` family of extensions provides floating-point matrix multiply-accumulate instructions, computing C ← C + A × B^T^. A, B, and C are matrix tiles held in the vector register file. -The following table lists the integer matrix tile multiplication instructions. 
The arguments are: +The following table lists the floating-point matrix tile multiplication instructions. The arguments are: * `vd`: Destination vector register group containing the C matrix tile. * `vs1`: Source vector register (group) containing the A matrix tile. @@ -375,12 +472,12 @@ The following table lists the integer matrix tile multiplication instructions. T |=== | Mnemonic | W | A/B element width | C element width | Remark -| `vfmmacc.vv vd, vs1, vs2[, v0.scale]` | 1 | SEW | SEW | floating point and MX scaled A/B tiles -| `vfwmmacc.vv vd, vs1, vs2[, v0.scale]` | 2 | SEW/2 | SEW | floating point and MX scaled A/B tiles -| `vfqwmmacc.vv vd, vs1, vs2[, v0.scale]` | 4 | SEW/4 | SEW | floating point and MX scaled A/B tiles -| `vf8wmmacc.vv vd, vs1, vs2[, v0.scale]` | 8 | SEW/8 | SEW | floating point and MX scaled A/B tiles +| `vfmmacc.vv vd, vs1, vs2` | 1 | SEW | SEW | floating-point A/B tiles only +| `vfwmmacc.vv vd, vs1, vs2[, v0.scale]` | 2 | SEW/2 | SEW | floating-point and MX-scaled A/B tiles +| `vfqmmacc.vv vd, vs1, vs2[, v0.scale]` | 4 | SEW/4 | SEW | floating-point and MX-scaled A/B tiles +| `vf8wmmacc.vv vd, vs1, vs2[, v0.scale]`| 8 | SEW/8 | SEW | floating-point and MX-scaled A/B tiles | `vfwimmacc.vv vd, vs1, vs2, v0.scale` | 2 | SEW/2 | SEW | only MX-scaled integer A/B tiles -| `vfqwimmacc.vv vd, vs1, vs2, v0.scale` | 4 | SEW/4 | SEW | only MX-scaled integer A/B tiles +| `vfqimmacc.vv vd, vs1, vs2, v0.scale` | 4 | SEW/4 | SEW | only MX-scaled integer A/B tiles | `vf8wimmacc.vv vd, vs1, vs2, v0.scale` | 8 | SEW/8 | SEW | only MX-scaled integer A/B tiles |=== @@ -399,25 +496,24 @@ The floating-point format of each operand tile is controlled by fields in `vtype ==== Mixed-format inputs (`altfmt_A` ≠ `altfmt_B`) Mixed-format inputs (`altfmt_A` ≠ `altfmt_B`) are permitted for all -floating-point multiply-accumulate instructions. 
Each A×B product is -computed in sufficient internal precision and rounded to the C accumulator -format; the two input formats need not match. The valid mixed-format -combinations are listed in <>. +floating-point multiply-accumulate instructions. +Each A×B product is defined by the exact mathematical product of the corresponding +input values, and the floating-point result of the overall matrix multiply-accumulate +operation is determined according to <>. +The valid mixed-format combinations are listed in <>. ===== Mixed-format multiplication semantics -Each A×B element product is computed per IEEE 754: the exact -(infinite-precision) mathematical product is formed, then rounded once to -the C accumulator format using the dynamic rounding mode from `frm`. -This definition applies uniformly regardless of whether `fmt_A` equals -`fmt_B`. +Each A×B element product is defined by the exact mathematical product of the corresponding input values, regardless of whether `fmt_A` equals `fmt_B`. +The floating-point result of the overall matrix multiply-accumulate operation, including any intermediate rounding, grouping, and partial-sum formation, is then determined according to <>. +In particular, the implementation-defined mapping from (`SEW`, `W`, `lambda`) to (`G`, `psm`, `rnd`) determines how exact products are grouped into partial sums and whether those partial sums are rounded before accumulation into C. [NOTE] ==== The exact product of two floating-point values with p~A~ and p~B~ significand bits has at most p~A~ + p~B~ significand bits. An implementation may compute the product in any internal format with at -least that many significand bits and then round once to `fmt_C`. +least that many significand bits. .Significand widths (p) of supported formats [cols="4,1"] @@ -434,17 +530,21 @@ least that many significand bits and then round once to `fmt_C`. | IEEE binary64 | 53 |=== -When p~A~ + p~B~ ≤ p~C~, the product is exact in `fmt_C` and no rounding -is needed. 
This holds for all mixed-format widening combinations — for -example, E4M3 × E5M2 → FP16 (4 + 3 = 7 ≤ 11) and FP16 × BF16 → FP32 -(11 + 8 = 19 ≤ 24). +When p~A~ + p~B~ ≤ p~C~, the exact product is representable in `fmt_C`, so any later rounding of that value to the C accumulator format is exact. +This holds for all mixed-format widening combinations. For example, E4M3 × E5M2 → FP16 (4 + 3 = 7 ≤ 11) and FP16 × BF16 → FP32 (11 + 8 = 19 ≤ 24). An implementation need not use a standard IEEE 754 type for the internal product; any representation that captures the exact product (e.g., an -integer significand–exponent pair) suffices, provided the final result -equals the correctly-rounded value in `fmt_C`. +integer significand-exponent pair) suffices, provided the overall matrix +multiply-accumulate result matches the semantics defined in <>. ==== +All floating-point matrix multiply-accumulate instructions, including mixed-format and microscaling variants, shall follow the normative rules in <> for grouping, partial-sum formation, optional rounding of partial sums, and final accumulation into C. + +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. +Floating-point exception flags accumulate in `fflags`. + The subextension requirements for mixed-format operation depend on the input element width: @@ -458,7 +558,7 @@ E4M3 or E5M2. All four combinations are covered by the same OFP8 subextension for the given output format. NOTE: Mixed OFP8 inputs (E4M3 × E5M2) are only permitted with widening -instructions (`vfwmmacc.vv`, `vfqwmmacc.vv`), not with `vfmmacc.vv` +instructions (`vfwmmacc.vv`, `vfqmmacc.vv`), not with `vfmmacc.vv` (OFP8 → OFP8), because the exact product (up to 7 significand bits) exceeds the OFP8 output precision (p ≤ 4). @@ -467,7 +567,7 @@ the OFP8 output precision (p ≤ 4). 
For 16-bit inputs, `altfmt_A` and `altfmt_B` independently select IEEE binary16 or BFloat16. When both are the same, a single subextension suffices (Zvvfp16mm* for IEEE binary16, Zvvbf16mm* for BFloat16), and all -three instructions (`vfmmacc.vv`, `vfwmmacc.vv`, `vfqwmmacc.vv`) are +three instructions (`vfmmacc.vv`, `vfwmmacc.vv`, `vfqmmacc.vv`) are available. When `altfmt_A ≠ altfmt_B` (one input is IEEE binary16 and the other is @@ -488,7 +588,7 @@ is missing, the instruction raises an illegal-instruction exception. | Instruction | Output format | Required subextensions (both must be present) | vfwmmacc.vv | FP32 (SEW=32) | Zvvfp16fp32mm AND Zvvbf16fp32mm -| vfqwmmacc.vv| FP64 (SEW=64) | Zvvfp16fp64mm AND Zvvbf16fp64mm +| vfqmmacc.vv| FP64 (SEW=64) | Zvvfp16fp64mm AND Zvvbf16fp64mm |=== If the required subextensions are not both present, the instruction raises an illegal-instruction exception. @@ -499,9 +599,11 @@ For IEEE binary32 and IEEE binary64 inputs, `altfmt_A` and `altfmt_B` are ignored (there is only one format per width), so mixed-format operation does not apply. -All multiply and add operations use the dynamic rounding mode from `frm`; floating-point exception flags are accumulated into `fflags`. +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. +Floating-point exception flags are accumulated into `fflags`. -The K-dimension, tile-dimension formulae, MUL_C, and instruction-to-widening-factor mapping are the same as for the integer family (see <>); the floating-point mnemonics are `vfmmacc.vv` (W=1), `vfwmmacc.vv` (W=2), and `vfqwmmacc.vv` (W=4). +The K-dimension, tile-dimension formulae, MUL_C, and instruction-to-widening-factor mapping are the same as for the integer family (see <>); the floating-point mnemonics are `vfmmacc.vv` (W=1), `vfwmmacc.vv` (W=2), and `vfqmmacc.vv` (W=4). 
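The significand-width argument for mixed-format exactness (p~A~ + p~B~ ≤ p~C~) can be sanity-checked with scalar host code. The sketch below is illustrative only, not the instruction datapath: it models narrow-format operands as C doubles and checks whether rounding their exact product once to binary32 is error-free.

```c
#include <assert.h>

/* Returns 1 if rounding the exact product a*b once to binary32 is
 * error-free. a and b are assumed exactly representable in their
 * narrow source formats; the double product is itself exact
 * whenever p_A + p_B <= 53. */
static int product_exact_in_fp32(double a, double b)
{
    double exact   = a * b;        /* exact for the formats considered here */
    float  rounded = (float)exact; /* one rounding to binary32 */
    return (double)rounded == exact;
}
```

For example, an FP16 operand (p = 11) such as `0x1.ffcp0` times a BF16 operand (p = 8) such as `0x1.fep0` needs at most 19 significand bits, so it is exact in binary32; a product of two full-precision binary32 operands (24 + 24 = 48 > 24) is not.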
As with the integer multiply-accumulate instructions, vector masking is not supported. When `vm=0`, the instruction does not apply a vector mask; @@ -515,77 +617,87 @@ Whenever mixed formats are used in the computational instructions, one must foll [#arithmetic-considerations] ==== Arithmetic considerations -Each multiply-accumulate instruction computes, for every output element stem:[C_{m,n}]: +Each multiply-accumulate instruction computes, for every output element stem:[C_{i,j}]: -stem:[C_{m,n} \leftarrow C_{m,n} + \sum_{k=0}^{K_{\text{eff}}-1} A_{m,k} \times B_{k,n}] +stem:[C_{i,j} \leftarrow C_{i,j} + \sum_{k=0}^{K_{\text{eff}}-1} A_{i,k} \times B_{k,j}] where K_eff = λ × W × LMUL. This section specifies how the K_eff product terms may be grouped and when intermediate rounding is permitted. ===== Sub-dot-products -For widening instructions (W = 2 or W = 4), the W narrow input-element pairs packed within one accumulator-width (SEW-bit) position form a natural computational unit called a _sub-dot-product_. -The W multiplications of (SEW÷W)-bit elements are performed and their products summed. +When `W > 1`, the `W` narrow input-element pairs packed within one accumulator-width (`SEW`-bit) position form a natural computational unit called a _sub-dot-product_. +The `W` multiplications of (`SEW÷W`)-bit elements are performed and their products summed. Each product of two (SEW÷W)-bit values is _exact_ at SEW precision: the significand has at most 2 × p~input~ − 1 bits, which fits in the wider SEW format. -The sub-dot-product can therefore be computed with very little loss at SEW or wider precision. +The sub-dot-product can therefore be computed with very little loss at `SEW` or wider precision. There are K_eff ÷ W = λ × LMUL sub-dot-products per output element. -For non-widening instructions (W = 1), each product of two SEW-bit values is exact at 2 × SEW bits; a sub-dot-product consists of a single product term. 
+When `W = 1`, each product of two `SEW`-bit values is exact at `2 × SEW` bits; a sub-dot-product consists of a single product term. ===== Accumulation and rounding model (floating-point) -The resulting value of element stem:[C_{m,n}] for any given LMUL must match the value computed by applying a series of multiply-accumulate instructions with LMUL=1, in increasing order of vector register indices. -Therefore, it is enough to specify the result of the computations +The resulting value of element stem:[C_{i,j}] for any given `LMUL` must match the value computed by applying a series of multiply-accumulate instructions with `LMUL=1`, in increasing order of vector-register indices. +Therefore, it is enough to specify the result of the computation -stem:[C_{m,n} \leftarrow C_{m,n} + \sum_{k=0}^{\lambda W - 1} A_{m,k} \times B_{k,n}] +stem:[C_{i,j} \leftarrow C_{i,j} + \sum_{k=0}^{\lambda W - 1} A_{i,k} \times B_{k,j}] -An implementation partitions the λ sub-dot-products for each output element into consecutive groups of G sub-dot-products. +An implementation partitions the λ sub-dot-products for each output element into consecutive groups of `G` sub-dot-products. +The implementation-defined grouping factor `G` is applied separately within each `LMUL=1` step, so that a group shall not cross the boundary between two consecutive `LMUL=1` steps. -* Within each group, the G partial results are accumulated into a partial sum S using internal precision that requires no rounding to SEW precision inside a group. +`G` must satisfy: -* After each group, the accumulated partial sum S is _added to the running value of_ stem:[C_{m,n}] by computing +* `G` is a power of two; +* 1 ≤ `G` ≤ λ. -stem:[C_{m,n} \leftarrow round_{frm}(C_{m,n} + (round_{rto}(S)))], +* Within each group, the `G` sub-dot-products jointly define a dot product of two vectors with `G × W` elements. This group dot product is reduced into a partial sum `S` according to an implementation-defined parameter `psm`. 
-where the accumulation rounding is performed with the rounding mode from `frm`. -The rounding of partial sum S _before_ it is accumulated to the running value of stem:[C_{m,n}] is optional and, if performed, must use round-to-odd (rto) mode. +** If `psm=0`, the partial sum `S` is formed using exact computation: the contributing products and sums are computed in sufficiently precise internal form, without rounding to the C accumulator format, until the next rounding step defined by this specification. -The value of G is _implementation-defined_ and may depend on SEW, W, λ, LMUL, and the microarchitecture. -It must satisfy: +** If `psm=1`, the exact products belonging to the group are accumulated in a bulk-normalized representation according to the RVBNA Algorithm, or in an equivalent representation producing the same numerical result. -* G is a power of two; -* 1 ≤ G ≤ λ. +* Once the partial sum `S` is computed, and before it is accumulated into the result C, it is treated according to an implementation-defined parameter `rnd`. -The resulting number of rounding operations per output element of C is [2](λ × LMUL) ÷ G, where the factor of 2 is present if the accumulated partial sum is rounded before being added to the running value. +** If `rnd=frm`, the partial sum `S` is rounded to the C accumulator format using the rounding mode from `frm`. -[NOTE] -==== -In a *systolic-array* datapath, G is typically 1: each sub-dot-product is rounded and added to C immediately, yielding λ × LMUL rounding additions per output element. +** If `rnd=rto`, the partial sum `S` is rounded to the C accumulator format using round-to-odd. -In an *outer-product* datapath, G is typically on the order of λ (e.g. λ, λ÷2, λ÷4, or λ÷8): multiple sub-dot-products are accumulated at extended internal precision before a single rounding and C addition. -This significantly reduces the number of expensive full-precision additions. 
+** If `rnd=xct`, no rounding to the C accumulator format is applied to the partial sum `S` before the final accumulation into `C`. -Software must not depend on a particular value of G. +* After each group, the resulting partial sum `S` is accumulated into the running value of stem:[C_{i,j}] by computing -It is expected an implementation will have to disclose its chosen value of G in order to be certified as compliant. stem:[C_{i,j} \leftarrow round_{frm}(C_{i,j} + S)]. -The rounding of partial sum S is made optional in order to match established practices in industry and other RISC-V matrix extensions. -==== +The accumulation rounding is performed with the rounding mode from `frm`. +If `rnd=xct`, the partial sum `S` is retained in sufficient internal precision so that the only rounding step in the update of stem:[C_{i,j}] by that group is the final stem:[round_{frm}(C_{i,j} + S)]. -Because G is implementation-defined, two conforming implementations may produce floating-point results that differ in the least-significant bits for identical inputs. -Bit-exact reproducibility of floating-point matrix multiply-accumulate results across different implementations is therefore _not_ guaranteed. +The values of `G`, `psm`, and `rnd` supported by an implementation may depend on `SEW`, `W`, and `lambda`. +An implementation must disclose all supported combinations by publishing a table of its mappings from (`SEW`, `W`, `lambda`) to (`G`, `psm`, `rnd`). +The floating-point result of a matrix multiply-accumulate instruction is fully determined once the implementation’s supported mapping from (`SEW`, `W`, `lambda`) to (`G`, `psm`, `rnd`) is known. -Floating-point exception flags (inexact, overflow, underflow, invalid, etc.) are accumulated into `fflags`; the order in which individual exceptions are raised within a single instruction execution is implementation-defined.
+The number of rounding operations per output element of C is (λ × `LMUL`) ÷ `G` if no rounding is applied to the reduced partial sum before accumulation, and 2 × (λ × `LMUL`) ÷ `G` if the reduced partial sum is rounded before accumulation into the running value of C. [NOTE] -.Editorial Note ==== -We could ammend the requirement that "... the G partial results are accumulated into a partial sum S using internal precision that requires no rounding to SEW precision inside a group" with the license to use the Bulk Normalization of Dot-Products (BNDP) procedure. If we do that, we must require the implementor to disclose the choice of BNDP parameters. -Adding the license would result in more compatibility with other RISC-V matrix extensions. +A Zvvm implementation can produce the exact same results as the analogous Zvtm implementation when operating on 16- and 8-bit inputs by setting `psm=1`, `rnd=rto`, and choosing a suitable architectural value of `G`: `G=2` for 16-bit inputs with 32-bit outputs, and `G=1` for 8-bit inputs with 32-bit outputs. +For input element widths of 32 bits or greater, compatibility with the analogous Zvtm instruction is obtained with `G=1`, `psm=0`, and `rnd=frm` so that each group contains a single product and the partial sum `S` is that product rounded according to `frm`. + +In a *systolic-array* datapath, `G` is typically 1: each sub-dot-product is rounded and added to C immediately, yielding λ × LMUL rounding additions per output element. + +In an *outer-product* datapath, `G` is typically on the order of λ (e.g., λ, λ÷2, λ÷4, or λ÷8): multiple sub-dot-products are accumulated at extended internal precision before a single rounding and C addition. +This significantly reduces the number of expensive full-precision additions. + +Portable software must not assume a particular mapping from (`SEW`, `W`, `lambda`) to (`G`, `psm`, `rnd`) unless that mapping has been disclosed by the implementation. 
+ +The rounding of partial sum `S` is optional in order to match established practices in industry and other RISC-V matrix extensions. ==== +Two conforming implementations may produce floating-point results that differ for identical inputs. +Bit-exact reproducibility of floating-point matrix multiply-accumulate results across different implementations is therefore guaranteed only when both implementations support the same (`SEW`, `W`, `lambda`) → (`G`, `psm`, `rnd`) mapping. + +Floating-point exception flags (inexact, overflow, underflow, invalid, etc.) are accumulated into `fflags`; the order in which individual exceptions are raised within a single instruction execution is implementation-defined. + [#integrated-matrix-microscaling] ==== Microscaling support (`v0.scale`) @@ -618,7 +730,7 @@ final operand, analogous to `v0.t` for masked operations in the base V extension: vfwmmacc.vv vd, vs1, vs2, v0.scale - vfqwmmacc.vv vd, vs1, vs2, v0.scale + vfqmmacc.vv vd, vs1, vs2, v0.scale vf8wmmacc.vv vd, vs1, vs2, v0.scale When `vm=0` is not specified (i.e., `vm=1`, the default), no scaling is @@ -659,7 +771,7 @@ Because E8M0 values are exact powers of two, scale application is equivalent to exponent addition and introduces no rounding error. When a non-NaN E8M0 value represents a power-of-two that overflows the -accumulator FP format (e.g. 2^128^ cannot be represented in FP16 which +accumulator FP format (e.g., 2^128^ cannot be represented in FP16 which has a maximum finite exponent of 2^15^), the converted scale value is +∞. The subsequent block-scale multiply follows IEEE 754 rules: +∞ multiplied by a non-zero finite value yields ±∞ (propagating into the @@ -778,18 +890,18 @@ illegal-instruction exception. 
|=== | Input format | Instruction | W | SEW | Max LMUL -| FP4 → FP16/BF16 | vfqwmmacc | 4 | 16 | 4 +| FP4 → FP16/BF16 | vfqmmacc | 4 | 16 | 4 | FP4 → OFP8 | vfwmmacc | 2 | 8 | 4 -| Int4 → FP16/BF16 | vfqwimmacc | 4 | 16 | 4 +| Int4 → FP16/BF16 | vfqimmacc | 4 | 16 | 4 | All others | — | — | — | 8 (no restriction) |=== ===== Applicability Microscaling via `v0.scale` is supported on the widening floating-point -multiply-accumulate instructions (`vfwmmacc.vv`, `vfqwmmacc.vv`, `vf8wmmacc.vv`) +multiply-accumulate instructions (`vfwmmacc.vv`, `vfqmmacc.vv`, `vf8wmmacc.vv`) and on the integer-input FP-accumulate instructions `vfwimmacc.vv` (MXINT8→FP16/BF16), -`vfqwimmacc.vv` (MXINT4→FP16/BF16, MXINT8→FP32), +`vfqimmacc.vv` (MXINT4→FP16/BF16, MXINT8→FP32), and `vf8wimmacc.vv` (MXINT4→FP32, MXINT8→FP64). For `vmmacc.vv`, `vm=0` remains _reserved_ and raises an illegal-instruction exception. @@ -986,7 +1098,7 @@ Physically both transfers are identical: they move contiguous segments of length The tile load/store instructions interpret the memory layout according to the specified leading dimension, but the resulting data layout in the VR is the same regardless of whether the source/destination matrix is stored in row-major or column-major order. [#ime-load-store-geometry] -.Loading a matrix tile from memory for LMUL=1. The matrix is layed out linearly in memory, the leading dimension LD specifies its row size (a) or column size (b). Element indices represent the offset of the elements in memory. Blue arrows indicate the data ordering in memory/VR. +.Loading a matrix tile from memory for LMUL=1. The matrix is laid out linearly in memory, the leading dimension LD specifies its row size (a) or column size (b). Element indices represent the offset of the elements in memory. Blue arrows indicate the data ordering in memory/VR. image::png/ime-load-store-geometry.png[align="center"] If (rs2) = 0, then the leading dimension LD is set to the _natural dimension_ of λ × LMUL. 
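The element-to-memory mapping described above can be written as a small host-side reference model. The sketch below is a hypothetical illustration for the LMUL=1 case only (segment length λ, natural dimension λ); it is not normative. When `ld` equals the natural dimension, the computed offsets are contiguous, matching the `rs2 = x0` fast path described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference model, LMUL=1: element i of the register
 * group maps to memory offset (i / lambda) * ld + (i % lambda),
 * i.e., contiguous segments of length lambda placed ld elements
 * apart in memory. With ld equal to the natural dimension (lambda
 * for LMUL=1), the whole transfer is contiguous. */
static size_t tile_elem_offset(size_t i, size_t lambda, size_t ld)
{
    return (i / lambda) * ld + (i % lambda);
}
```

With λ = 4 and `ld` = 10 (a tile embedded in a larger matrix), element 5 lands at offset 11: one full leading dimension past the start, plus one element into the second segment.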
@@ -1002,7 +1114,7 @@ Efficient implementation of tile loads with `rs2 = x0` are essential to high-per When using a vector length lower than the maximum for the data width, the data is loaded such that it fills full rows in an A tile or columns in a B^T^ tile. The impact is sketched in <>. [#ime-load-store-vl] -.Examples of elemnent distribution for full (left) and reduced vector length matrix tile load / store operations. +.Examples of element distribution for full (left) and reduced vector length matrix tile load / store operations. image::png/ime-load-store-vl.png[width="100%"] ==== Instructions @@ -1133,7 +1245,7 @@ For each element index `i` in the body `[vstart, VL)` where the mask is enabled: [NOTE] ==== -Transposing loads and stores should be performed with SEW set to the width of the actual data that is being loaded and transposed. Sub-byte elements (eg. OFP4) cannot be transposed directly. +Transposing loads and stores should be performed with SEW set to the width of the actual data that is being loaded and transposed. Sub-byte elements (e.g., OFP4) cannot be transposed directly. ==== <<< @@ -1255,8 +1367,8 @@ Key observations: and `altfmt_B` in `vtype` to select the desired operand signedness. The load instructions are unchanged. -* For widening variants (W=2 or W=4), use `vfwmmacc.vv`/`vfqwmmacc.vv` or - `vwmmacc.vv`/`vqwmmacc.vv`. +* For widening variants (W=2 or W=4), use `vfwmmacc.vv`/`vfqmmacc.vv` or + `vwmmacc.vv`/`vqmmacc.vv`. The narrower A and B element widths are encoded in the multiply-accumulate instruction; the tile load instructions require no modification. @@ -1458,14 +1570,14 @@ vint32m2_t __riscv_vwmmacc_vv_i32m2 (vint32m2_t vd, vint16m1_t vs1, vint16m1_t vint64m1_t __riscv_vwmmacc_vv_i64m1 (vint64m1_t vd, vint32m1_t vs1, vint32m1_t vs2, size_t vl); -- -For `vqwmmacc.vv` (W=4), A and B use SEW/4 elements: +For `vqmmacc.vv` (W=4), A and B use SEW/4 elements: [source,c] -- /* SEW=32 accum / EEW=8 inputs, VLEN=256, λ=2: MUL_C=2.
LMUL=1 (default): */ -vint32m2_t __riscv_vqwmmacc_vv_i32m2 (vint32m2_t vd, vint8m1_t vs1, vint8m1_t vs2, size_t vl); +vint32m2_t __riscv_vqmmacc_vv_i32m2 (vint32m2_t vd, vint8m1_t vs1, vint8m1_t vs2, size_t vl); /* SEW=64 accum / EEW=16 inputs, VLEN=256, λ=2: MUL_C=1. LMUL=1 (default): */ -vint64m1_t __riscv_vqwmmacc_vv_i64m1 (vint64m1_t vd, vint16m1_t vs1, vint16m1_t vs2, size_t vl); +vint64m1_t __riscv_vqmmacc_vv_i64m1 (vint64m1_t vd, vint16m1_t vs1, vint16m1_t vs2, size_t vl); -- The `vint4m{N}_t` and `vuint4m{N}_t` vector types are implementation-defined @@ -1481,12 +1593,12 @@ vint8m8_t __riscv_vwmmacc_vv_i8m8_i4m1 (vint8m8_t vd, vint4m1_t vs1, vint4m1_t vs2, size_t vl); -- -For `vqwmmacc.vv` (W=4) with SEW=16, the inputs are 4-bit integers (Zvvi4i16mm): +For `vqmmacc.vv` (W=4) with SEW=16, the inputs are 4-bit integers (Zvvi4i16mm): [source,c] -- /* Zvvi4i16mm: Int4→Int16, SEW=16, VLEN=256, λ=2: MUL_C=4. LMUL=1 (default): */ -vint16m4_t __riscv_vqwmmacc_vv_i16m4_i4m1(vint16m4_t vd, +vint16m4_t __riscv_vqmmacc_vv_i16m4_i4m1(vint16m4_t vd, vint4m1_t vs1, vint4m1_t vs2, size_t vl); -- @@ -1570,24 +1682,24 @@ vfloat64m1_t __riscv_vfwmmacc_vv_f64m1(vfloat64m1_t vd, vfloat32m1_t vs1, vfloat32m1_t vs2, size_t vl); -- -For `vfqwmmacc.vv` (W=4), A/B use SEW/4: +For `vfqmmacc.vv` (W=4), A/B use SEW/4: [source,c] -- /* FP16 accum / OFP4 E2M1 inputs, VLEN=256, λ=2: MUL_C=4. LMUL=1 (default): */ -vfloat16m4_t __riscv_vfqwmmacc_vv_f16m4_f4e2m1m1(vfloat16m4_t vd, +vfloat16m4_t __riscv_vfqmmacc_vv_f16m4_f4e2m1m1(vfloat16m4_t vd, vfloat4e2m1m1_t vs1, vfloat4e2m1m1_t vs2, size_t vl); /* BF16 accum / OFP4 E2M1 inputs, VLEN=256, λ=2: MUL_C=4. LMUL=1 (default): */ -vbfloat16m4_t __riscv_vfqwmmacc_vv_bf16m4_f4e2m1m1(vbfloat16m4_t vd, +vbfloat16m4_t __riscv_vfqmmacc_vv_bf16m4_f4e2m1m1(vbfloat16m4_t vd, vfloat4e2m1m1_t vs1, vfloat4e2m1m1_t vs2, size_t vl); /* FP32 accum / OFP8 E4M3 inputs, VLEN=256, λ=2: MUL_C=2. 
LMUL=1 (default): */ -vfloat32m2_t __riscv_vfqwmmacc_vv_f32m2_f8e4m3m1(vfloat32m2_t vd, +vfloat32m2_t __riscv_vfqmmacc_vv_f32m2_f8e4m3m1(vfloat32m2_t vd, vfloat8e4m3m1_t vs1, vfloat8e4m3m1_t vs2, size_t vl); /* FP64 accum / BF16 inputs, VLEN=256, λ=2: MUL_C=1. LMUL=1 (default): */ -vfloat64m1_t __riscv_vfqwmmacc_vv_f64m1_bf16m1(vfloat64m1_t vd, +vfloat64m1_t __riscv_vfqmmacc_vv_f64m1_bf16m1(vfloat64m1_t vd, vbfloat16m1_t vs1, vbfloat16m1_t vs2, size_t vl); /* FP64 accum / FP16 inputs, VLEN=256, λ=2: MUL_C=1. LMUL=1 (default): */ -vfloat64m1_t __riscv_vfqwmmacc_vv_f64m1(vfloat64m1_t vd, +vfloat64m1_t __riscv_vfqmmacc_vv_f64m1(vfloat64m1_t vd, vfloat16m1_t vs1, vfloat16m1_t vs2, size_t vl); -- @@ -1606,8 +1718,8 @@ vfloat32m2_t __riscv_vfwmmacc_vv_f32m2_bf16m1(vfloat32m2_t vd, /* vfwmmacc.vv FP16 accum / OFP8 E5M2 inputs (altfmt_A=B=1 for 8-bit inputs): */ vfloat16m4_t __riscv_vfwmmacc_vv_f16m4_f8e5m2m1(vfloat16m4_t vd, vfloat8e5m2m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl); -/* vfqwmmacc.vv FP32 accum / OFP8 E5M2 inputs (altfmt_A=B=1 for 8-bit inputs): */ -vfloat32m2_t __riscv_vfqwmmacc_vv_f32m2_f8e5m2m1(vfloat32m2_t vd, +/* vfqmmacc.vv FP32 accum / OFP8 E5M2 inputs (altfmt_A=B=1 for 8-bit inputs): */ +vfloat32m2_t __riscv_vfqmmacc_vv_f32m2_f8e5m2m1(vfloat32m2_t vd, vfloat8e5m2m1_t vs1, vfloat8e5m2m1_t vs2, size_t vl); -- @@ -1629,7 +1741,7 @@ vfloat32m2_t __riscv_vfwmmacc_vv(vfloat32m2_t vd, -- -The `vfwimmacc.vv` and `vfqwimmacc.vv` instructions use integer input tiles +The `vfwimmacc.vv` and `vfqimmacc.vv` instructions use integer input tiles with a floating-point accumulator. The input tile type reflects the integer element width and signedness; the accumulator type reflects the FP format. 
@@ -1641,14 +1753,14 @@ vfloat16m4_t __riscv_vfwimmacc_vv_f16m4_i8m1(vfloat16m4_t vd, /* MXINT8 → BF16: vfwimmacc.vv, SEW=16, UInt8×Int8 (mixed signedness) */ vbfloat16m4_t __riscv_vfwimmacc_vv_bf16m4_u8m1_i8m1(vbfloat16m4_t vd, vuint8m1_t vs1, vint8m1_t vs2, size_t vl); -/* MXINT8 → FP32: vfqwimmacc.vv, SEW=32, Int8×Int8, VLEN=256, λ=2 */ -vfloat32m2_t __riscv_vfqwimmacc_vv_f32m2_i8m1(vfloat32m2_t vd, +/* MXINT8 → FP32: vfqimmacc.vv, SEW=32, Int8×Int8, VLEN=256, λ=2 */ +vfloat32m2_t __riscv_vfqimmacc_vv_f32m2_i8m1(vfloat32m2_t vd, vint8m1_t vs1, vint8m1_t vs2, size_t vl); -/* MXINT4 → FP16: vfqwimmacc.vv, SEW=16, Int4×Int4 */ -vfloat16m4_t __riscv_vfqwimmacc_vv_f16m4_i4m1(vfloat16m4_t vd, +/* MXINT4 → FP16: vfqimmacc.vv, SEW=16, Int4×Int4 */ +vfloat16m4_t __riscv_vfqimmacc_vv_f16m4_i4m1(vfloat16m4_t vd, vint4m1_t vs1, vint4m1_t vs2, size_t vl); -/* MXINT4 → BF16: vfqwimmacc.vv, SEW=16, UInt4×UInt4 */ -vbfloat16m4_t __riscv_vfqwimmacc_vv_bf16m4_u4m1(vbfloat16m4_t vd, +/* MXINT4 → BF16: vfqimmacc.vv, SEW=16, UInt4×UInt4 */ +vbfloat16m4_t __riscv_vfqimmacc_vv_bf16m4_u4m1(vbfloat16m4_t vd, vuint4m1_t vs1, vuint4m1_t vs2, size_t vl); -- ===== VLEN-portable code @@ -1658,7 +1770,7 @@ varies across implementations with different values of VLEN and λ, even for a f The type suffix in a long-form multiply-accumulate intrinsic name directly encodes MUL_C, whereas in an overloaded name that MUL_C is implicit. Either way, source code with IME intrinsics is tied to a specific combination of input/output types and value of MUL_C. -Althought it is possible to write more general assembly code, it is common industry practice to favor coding with compiler intrinsics. +Although it is possible to write more general assembly code, it is common industry practice to favor coding with compiler intrinsics. The recommended approach for writing portable code with IME intrinsics is to package multiple code paths in the same executable, each optimized for a specific value of MUL_C. 
Runtime selection of the appropriate code path is then performed based on the result of `vsetvl` and computations of MUL_C = VLEN / (SEW × λ²). @@ -1670,6 +1782,10 @@ helper functions that are shared across all instructions in this chapter: [source,sail] -- +val is_pow2 : int -> bool + +val fp_internal_scale : (bits, fp_fmt, fp_internal) -> fp_internal + // Decode a 3-bit lambda encoding to the actual lambda value. // Returns None() if the encoding is 0b000 (meaning "use vtype.lambda"). function decode_vlambda(lam_enc : bits(3)) -> option(int) = @@ -1705,7 +1821,7 @@ function mat_C_idx(i : int, j : int, M : int, // fp_format_of(EEW : int, alt : bit) -> fp_fmt // Decode (element width in bits, altfmt flag) to the concrete FP format as specified -// by the altfmt encoding tables in the vtype section (e.g. EEW=8, alt=0 → E4M3 (OFP8)). +// by the altfmt encoding tables in the vtype section (e.g., EEW=8, alt=0 → E4M3 (OFP8)). // fp_mul_to(a : bits(EEW_A), fmt_A : fp_fmt, // b : bits(EEW_A), fmt_B : fp_fmt, @@ -1766,6 +1882,7 @@ function scale_width_of(sfmt : scale_format) -> int = struct gemm_geom = { EEW_C : int, // accumulator element width (= SEW) EEW_A : int, // input element width (= SEW / W) + W : int, // widening / packing factor LMUL : int, // register-group multiplier lambda : int, // tile-layout parameter from vtype K_eff : int, // effective K dimension (= lambda * W * LMUL) @@ -1803,11 +1920,121 @@ function decode_gemm_geometry(W : int, vl_divisor_is_lambda : bool) -> gemm_geom if unsigned(vl) % vl_div != 0 then return Illegal_Instruction(); if MUL_C not in {1,2,4,8,16} then return Illegal_Instruction(); - struct { EEW_C, EEW_A, LMUL, lambda, K_eff, M, N, MUL_C, epr_A, epr_C } + struct { EEW_C, EEW_A, W, LMUL, lambda, K_eff, M, N, MUL_C, epr_A, epr_C } +} + +// Partial-sum formation mode for floating-point group reduction. 
+enum psm_mode = { PSM_EXACT, PSM_BNA } + +// Abstract internal floating-point representation used for the group +// partial sum S before any rounding to the C accumulator format. +type fp_internal + +// Exact internal product of two FP operands. +// No rounding to the C accumulator format is performed here. +val fp_mul_exact : forall ('n, 'm). + (bits('n), fp_fmt, bits('m), fp_fmt) -> fp_internal + +// Zero value of the abstract internal FP representation. +val fp_internal_zero : unit -> fp_internal + +// Exact internal addition of two abstract internal FP values. +val fp_internal_add : (fp_internal, fp_internal) -> fp_internal + +// Bulk-normalized accumulation step. +// This may be implemented using RVBNA or any equivalent representation +// producing the same numerical result. +val fp_bna_add : (fp_internal, fp_internal) -> fp_internal + +// Group partial sum S for one output element. +// The group lies within one LMUL=1 step, identified by step, and begins +// at sub-dot-product index g0 within that step. Each sub-dot-product +// contributes W scalar products, so the full group corresponds to a dot +// product of two vectors with g_len × W elements. +// +// The returned value S has not yet been rounded to the C accumulator format; +// any such rounding is handled separately according to rnd. 
+function fp_group_sum(i : int, j : int, + step : int, g0 : int, g_len : int, + g : gemm_geom, + fmt_A : fp_fmt, fmt_B : fp_fmt, + psm : psm_mode, + vs1 : regidx, vs2 : regidx) -> fp_internal = { + let k_base : int = step * g.lambda * g.W; + let k_lo : int = k_base + g0 * g.W; + let k_hi : int = k_base + (g0 + g_len) * g.W - 1; + + match psm { + PSM_EXACT => { + var S : fp_internal = fp_internal_zero(); + foreach (k from k_lo to k_hi) { + let a_flat : int = mat_A_idx(i, k, g.K_eff, g.LMUL, g.lambda, g.epr_A); + let b_flat : int = mat_B_idx(k, j, g.K_eff, g.LMUL, g.lambda, g.epr_A); + + let a_bits : bits(g.EEW_A) = read_single_element(g.EEW_A, a_flat, vs1); + let b_bits : bits(g.EEW_A) = read_single_element(g.EEW_A, b_flat, vs2); + + let prod : fp_internal = fp_mul_exact(a_bits, fmt_A, b_bits, fmt_B); + S = fp_internal_add(S, prod) + }; + S + }, + + PSM_BNA => { + var S : fp_internal = fp_internal_zero(); + foreach (k from k_lo to k_hi) { + let a_flat : int = mat_A_idx(i, k, g.K_eff, g.LMUL, g.lambda, g.epr_A); + let b_flat : int = mat_B_idx(k, j, g.K_eff, g.LMUL, g.lambda, g.epr_A); + + let a_bits : bits(g.EEW_A) = read_single_element(g.EEW_A, a_flat, vs1); + let b_bits : bits(g.EEW_A) = read_single_element(g.EEW_A, b_flat, vs2); + + let prod : fp_internal = fp_mul_exact(a_bits, fmt_A, b_bits, fmt_B); + S = fp_bna_add(S, prod) + }; + S + } + } } +// Treatment of the group partial sum S before final accumulation into C. +enum rnd_mode = { RND_FRM, RND_RTO, RND_XCT } + +// Abstract implementation-defined parameters for FP group semantics. +// These correspond to the disclosed mapping from (SEW, W, lambda) +// to (G, psm, rnd). +val get_fp_grouping : (int, int, int) -> int +val get_fp_psm : (int, int, int) -> psm_mode +val get_fp_rnd : (int, int, int) -> rnd_mode + +// Round an internal FP value to the C accumulator format using frm. 
+val fp_round_to_frm : (fp_internal, int, fp_fmt, rounding_mode) -> bits + +// Round an internal FP value to the C accumulator format using round-to-odd. +val fp_round_to_rto : (fp_internal, int, fp_fmt) -> bits + +// Add an internal FP value directly to an accumulator-format operand, +// with the only rounding step being the final rounding according to frm. +val fp_add_internal : (bits, fp_internal, int, fp_fmt, rounding_mode) -> bits + +// Apply the implementation-selected treatment of the group partial sum S +// before the final accumulation into C. +// +// Returns Some(bits(...)) when S is rounded to the C accumulator format +// before accumulation, and None() when S is retained in internal form +// (rnd = RND_XCT). +function round_group_sum(S : fp_internal, + rnd : rnd_mode, + EEW_C : int, fmt_C : fp_fmt, + rm : rounding_mode) -> option(bits(EEW_C)) = + match rnd { + RND_FRM => Some(fp_round_to_frm(S, EEW_C, fmt_C, rm)), + RND_RTO => Some(fp_round_to_rto(S, EEW_C, fmt_C)), + RND_XCT => None() + } + // Read and unpack paired block scales from v0 for block index s. -// The scale format (e.g. E8M0) is determined by the input data type +// The scale format (e.g., E8M0) is determined by the input data type // and passed as scale_fmt. The per-scale width (sw) and pair width // (pw = 2 × sw) are derived from the scale format. // When vm=1 (no microscaling), returns (1.0, false). @@ -1842,6 +2069,8 @@ function read_block_scales(vm : bit, i : int, j : int, s : int, (blk_scale, fp_is_NaN(blk_scale, EEW_C, fmt_C)) } +// NOTE: Obsolete helper retained temporarily during the transition to the +// group-based floating-point semantics. Not used by fp_gemm or fp_scaled_gemm. // Floating-point inner product over elements [k_lo, k_hi] with widening. // Returns the block accumulator in format fmt_C at width EEW_C. 
function fp_block_dot(i : int, j : int, k_lo : int, k_hi : int, @@ -1920,13 +2149,38 @@ function fp_gemm(g : gemm_geom, fmt_A : fp_fmt, fmt_B : fp_fmt, fmt_C : fp_fmt, rm : rounding_mode, vs1 : regidx, vs2 : regidx, vd : regidx) -> unit = { + let G : int = get_fp_grouping(g.EEW_C, g.W, g.lambda); + let psm : psm_mode = get_fp_psm(g.EEW_C, g.W, g.lambda); + let rnd : rnd_mode = get_fp_rnd(g.EEW_C, g.W, g.lambda); + + if G < 1 | G > g.lambda | not(is_pow2(G)) then + return Illegal_Instruction(); + foreach (j from 0 to (g.N - 1)) { foreach (i from 0 to (g.M - 1)) { - let c_flat : int = mat_C_idx(i, j, g.M, g.MUL_C, g.lambda, g.epr_C); + let c_flat : int = mat_C_idx(i, j, g.M, g.MUL_C, g.lambda, g.epr_C); var acc : bits(g.EEW_C) = read_single_element(g.EEW_C, c_flat, vd); - acc = fp_add(acc, fp_block_dot(i, j, 0, g.K_eff - 1, g, - fmt_A, fmt_B, fmt_C, rm, vs1, vs2), - g.EEW_C, fmt_C, rm); + + // The semantics for LMUL > 1 are defined as repeated application + // of the LMUL = 1 computation, in increasing register-index order. + foreach (step from 0 to (g.LMUL - 1)) { + + // Each LMUL = 1 step contains lambda sub-dot-products. + foreach (g0 from 0 to (g.lambda - 1) by G) { + let S : fp_internal = + fp_group_sum(i, j, step, g0, G, + g, fmt_A, fmt_B, psm, + vs1, vs2); + + match round_group_sum(S, rnd, g.EEW_C, fmt_C, rm) { + Some(S_bits) => + acc = fp_add(acc, S_bits, g.EEW_C, fmt_C, rm), + None() => + acc = fp_add_internal(acc, S, g.EEW_C, fmt_C, rm) + } + } + }; + write_single_element(g.EEW_C, c_flat, vd, acc) } } @@ -1934,6 +2188,7 @@ function fp_gemm(g : gemm_geom, // FP scaled GEMM: C (FP) += scale_A × scale_B × (A (FP) × B (FP)), // with block scales from v0. 
+ function fp_scaled_gemm(g : gemm_geom, fmt_A : fp_fmt, fmt_B : fp_fmt, fmt_C : fp_fmt, scale_fmt : scale_format, @@ -1941,27 +2196,72 @@ function fp_scaled_gemm(g : gemm_geom, vs1 : regidx, vs2 : regidx, vd : regidx) -> unit = { let pw : int = 2 * scale_width_of(scale_fmt); let block_size : int = if vtype[bs] == 0b1 then 16 else 32; - let S : int = (g.K_eff + block_size - 1) / block_size; + let S_blocks : int = (g.K_eff + block_size - 1) / block_size; let R : int = g.lambda * g.EEW_C / pw; + let G : int = get_fp_grouping(g.EEW_C, g.W, g.lambda); + let psm : psm_mode = get_fp_psm(g.EEW_C, g.W, g.lambda); + let rnd : rnd_mode = get_fp_rnd(g.EEW_C, g.W, g.lambda); + + if G < 1 | G > g.lambda | not(is_pow2(G)) then + return Illegal_Instruction(); + foreach (j from 0 to (g.N - 1)) { foreach (i from 0 to (g.M - 1)) { - let c_flat : int = mat_C_idx(i, j, g.M, g.MUL_C, g.lambda, g.epr_C); - var acc : bits(g.EEW_C) = read_single_element(g.EEW_C, c_flat, vd); + let c_flat : int = mat_C_idx(i, j, g.M, g.MUL_C, g.lambda, g.epr_C); + var acc : bits(g.EEW_C) = read_single_element(g.EEW_C, c_flat, vd); var nan_out : bool = false; - foreach (s from 0 to (S - 1)) { - let (blk_scale, is_nan) = read_block_scales(0b0, i, j, s, R, - g.EEW_C, fmt_C, scale_fmt, rm); + foreach (s from 0 to (S_blocks - 1)) { + let (blk_scale, is_nan) = + read_block_scales(0b0, i, j, s, R, g.EEW_C, fmt_C, scale_fmt, rm); if is_nan then { nan_out = true; break }; - let k_lo : int = s * block_size; - let k_hi : int = min(k_lo + block_size, g.K_eff) - 1; - let blk_acc : bits(g.EEW_C) = fp_block_dot(i, j, k_lo, k_hi, g, - fmt_A, fmt_B, fmt_C, rm, - vs1, vs2); - acc = fp_add(acc, fp_mul(blk_scale, blk_acc, g.EEW_C, fmt_C, rm), - g.EEW_C, fmt_C, rm) + // Scalar k-range covered by this scale block. + let blk_k_lo : int = s * block_size; + let blk_k_hi : int = min(blk_k_lo + block_size, g.K_eff) - 1; + + // Process the block as repeated LMUL=1 steps, in order. 
+ foreach (step from 0 to (g.LMUL - 1)) { + let step_k_lo : int = step * g.lambda * g.W; + let step_k_hi : int = step_k_lo + g.lambda * g.W - 1; + + // Intersect this LMUL=1 step with the current scale block. + let int_k_lo : int = max(step_k_lo, blk_k_lo); + let int_k_hi : int = min(step_k_hi, blk_k_hi); + + if int_k_lo <= int_k_hi then { + // Convert the intersected scalar-k interval back into a + // sub-dot-product interval relative to this LMUL=1 step. + let subdot_lo : int = (int_k_lo - step_k_lo) / g.W; + let subdot_hi : int = (int_k_hi - step_k_lo) / g.W; + + foreach (g0 from subdot_lo to subdot_hi by G) { + // Groups are not permitted to cross either an LMUL=1 step boundary + // or a microscaling block boundary. Therefore the final group in this + // intersected range may contain fewer than G sub-dot-products. + let g_len : int = min(G, subdot_hi - g0 + 1); + + let S : fp_internal = + fp_group_sum(i, j, step, g0, g_len, + g, fmt_A, fmt_B, psm, + vs1, vs2); + + match round_group_sum(S, rnd, g.EEW_C, fmt_C, rm) { + Some(S_bits) => { + let scaled_S : bits(g.EEW_C) = + fp_mul(blk_scale, S_bits, g.EEW_C, fmt_C, rm); + acc = fp_add(acc, scaled_S, g.EEW_C, fmt_C, rm) + }, + None() => { + let scaled_S : fp_internal = + fp_internal_scale(blk_scale, fmt_C, S); + acc = fp_add_internal(acc, scaled_S, g.EEW_C, fmt_C, rm) + } + } + } + } + } }; if nan_out then @@ -2129,7 +2429,9 @@ VL selects how many columns of B (and C) are updated; VL must be a multiple of K The floating-point format for A and B elements is selected independently by `altfmt_A` and `altfmt_B`, interpreted for input element width SEW÷8; the C format is selected by `altfmt`, interpreted for width SEW. -All operations use the dynamic rounding mode from `frm`; floating-point exception flags accumulate in `fflags`. +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. 
+Floating-point exception flags accumulate in `fflags`. + When `vm=0` (`v0.scale`), the register `v0` supplies paired block-scale factors for microscaling; the scale format is determined by the input data @@ -2220,7 +2522,9 @@ VL selects how many columns of B (and C) are updated; VL must be a multiple of K The floating-point format for A and B elements is selected independently by `altfmt_A` and `altfmt_B` in `vtype` (see <>); the C accumulator format is selected by `altfmt`. -All operations use the dynamic rounding mode from `frm`; floating-point exception flags accumulate in `fflags`. +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. +Floating-point exception flags accumulate in `fflags`. + Microscaling (`vm=0`) is not supported for non-widening floating-point multiply-accumulate; `vm=0` is _reserved_. @@ -2260,22 +2564,18 @@ Included in:: |Zvvfmm (<<#zvvfmm>>) |0.1 |Draft - -|Zvvxofp8mm, Zvvxnofp8mm (<<#integrated-matrix-microscaling>>) -|0.1 -|Draft |=== <<< -[#insns-vfqwmmacc_vv,reftext="Floating-Point Quad-Widening Matrix Multiply-Accumulate"] -==== vfqwmmacc.vv +[#insns-vfqmmacc_vv,reftext="Floating-Point Quad-Widening Matrix Multiply-Accumulate"] +==== vfqmmacc.vv Synopsis:: Floating-Point Quad-Widening Matrix Multiply-Accumulate Mnemonic:: -vfqwmmacc.vv _vd_, _vs1_, _vs2_ + -vfqwmmacc.vv _vd_, _vs1_, _vs2_, v0.scale +vfqmmacc.vv _vd_, _vs1_, _vs2_ + +vfqmmacc.vv _vd_, _vs1_, _vs2_, v0.scale Encoding:: [wavedrom, , svg] @@ -2287,7 +2587,7 @@ Encoding:: { bits: 5, name: 'vs1' }, { bits: 5, name: 'vs2' }, { bits: 1, name: 1 }, - { bits: 6, name: 0x16, attr: ['vfqwmmacc.vv'] } + { bits: 6, name: 0x16, attr: ['vfqmmacc.vv'] } ]} .... 
@@ -2305,7 +2605,9 @@ VL selects how many columns of B (and C) are updated; VL must be a multiple of K The floating-point format for A and B elements is selected independently by `altfmt_A` and `altfmt_B`, interpreted for input element width SEW÷4; the C format is selected by `altfmt`, interpreted for width SEW. -All operations use the dynamic rounding mode from `frm`; floating-point exception flags accumulate in `fflags`. +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. +Floating-point exception flags accumulate in `fflags`. + When `vm=0` (`v0.scale`), the register `v0` supplies paired block-scale factors for microscaling; the scale format is determined by the input data @@ -2401,7 +2703,9 @@ VL selects how many columns of B (and C) are updated; VL must be a multiple of K The floating-point format for A and B elements is selected independently by `altfmt_A` and `altfmt_B`, interpreted for input element width SEW÷2; the C format is selected by `altfmt`, interpreted for width SEW. -All operations use the dynamic rounding mode from `frm`; floating-point exception flags accumulate in `fflags`. +Final accumulation into the C accumulator format uses the dynamic rounding mode from `frm`. +Any intermediate grouping, partial-sum formation, and optional rounding of partial sums follow <>. +Floating-point exception flags accumulate in `fflags`. 
+
 When `vm=0` (`v0.scale`), the register `v0` supplies paired block-scale
 factors for microscaling; the scale format is determined by the input data
@@ -2660,14 +2964,14 @@ Included in::
 |===
 
 <<<
 
-[#insns-vfqwimmacc_vv,reftext="Integer-Input Quad-Widening FP-Accumulate Matrix Multiply-Accumulate (MX)"]
-==== vfqwimmacc.vv
+[#insns-vfqimmacc_vv,reftext="Integer-Input Quad-Widening FP-Accumulate Matrix Multiply-Accumulate (MX)"]
+==== vfqimmacc.vv
 
 Synopsis::
 Integer-Input Quad-Widening Floating-Point-Accumulate Matrix Multiply-Accumulate with Microscaling
 
 Mnemonic::
-vfqwimmacc.vv _vd_, _vs1_, _vs2_, v0.scale
+vfqimmacc.vv _vd_, _vs1_, _vs2_, v0.scale
 
 Encoding::
 [wavedrom, , svg]
@@ -2679,12 +2983,12 @@ Encoding::
   { bits: 5, name: 'vs1' },
   { bits: 5, name: 'vs2' },
   { bits: 1, name: 0, attr: ['vm=0'] },
-  { bits: 6, name: 0x3a, attr: ['vfqwimmacc.vv'] }
+  { bits: 6, name: 0x3a, attr: ['vfqimmacc.vv'] }
 ]}
 ....
 
-NOTE: This instruction shares its opcode with `vqwmmacc.vv`; `vm=0` selects
-the integer-input FP-accumulate MX form. `vqwmmacc.vv` requires `vm=1`.
+NOTE: This instruction shares its opcode with `vqmmacc.vv`; `vm=0` selects
+the integer-input FP-accumulate MX form. `vqmmacc.vv` requires `vm=1`.
 
 Description::
 Computes the exact integer matrix-matrix product T = vs1 × vs2 block by block,
@@ -2703,7 +3007,7 @@ MXINT inputs are always signed; `altfmt_A` and `altfmt_B` must be 0.
 `altfmt` selects the FP format of the accumulator C. +
 
 The register `v0` supplies paired block-scale factors using the same
-tile-strided layout as `vfqwmmacc.vv` (see <>).
+tile-strided layout as `vfqmmacc.vv` (see <>).
 The `bs` field in `vtype` selects the block size (32 or 16 elements).
 +
 The encodings SEW=8 (EEW=2), altfmt=1 with SEW=32, and SEW=64 are _reserved_.
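As an informal illustration of the description above — an exact integer block product T, converted to floating point and multiplied by the block-scale factor before a single FP accumulation — the per-element update can be sketched in scalar C. The function name, the fixed Int8→FP32 shape, and the single-block simplification are illustrative assumptions, not part of the specification; block boundaries, NaN scale handling, and rounding-mode details are omitted.

```c
#include <stdint.h>

/* Illustrative scalar model of one C-element update for an
 * integer-input, FP-accumulate MX block (e.g. MXINT8 -> FP32):
 * the k_len products are summed exactly in integer arithmetic,
 * the block sum T is converted to FP, scaled by the paired
 * block-scale factor, and accumulated into C with one FP add. */
static float mx_int_block_update(float c, const int8_t *a, const int8_t *b,
                                 int k_len, float blk_scale) {
    int32_t t = 0;                       /* exact integer partial sum T */
    for (int k = 0; k < k_len; k++)
        t += (int32_t)a[k] * (int32_t)b[k];
    return c + blk_scale * (float)t;     /* single FP accumulation */
}
```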
@@ -3241,14 +3545,14 @@ Included in:: |=== <<< -[#insns-vqwmmacc_vv,reftext="Integer Quad-Widening Matrix Multiply-Accumulate"] -==== vqwmmacc.vv +[#insns-vqmmacc_vv,reftext="Integer Quad-Widening Matrix Multiply-Accumulate"] +==== vqmmacc.vv Synopsis:: Integer Quad-Widening Matrix Multiply-Accumulate Mnemonic:: -vqwmmacc.vv _vd_, _vs1_, _vs2_ +vqmmacc.vv _vd_, _vs1_, _vs2_ Encoding:: [wavedrom, , svg] @@ -3260,7 +3564,7 @@ Encoding:: { bits: 5, name: 'vs1' }, { bits: 5, name: 'vs2' }, { bits: 1, name: 1, attr: ['vm=1'] }, - { bits: 6, name: 0x3a, attr: ['vqwmmacc.vv'] } + { bits: 6, name: 0x3a, attr: ['vqmmacc.vv'] } ]} .... @@ -3394,12 +3698,12 @@ Each microscaling subextension enables the `vm=0` / `v0.scale` encoding for a specific (input format, accumulator format, block size) combination. A floating-point microscaling subextension implies its corresponding base subextension: for example, Zvvxofp4fp16mm implies Zvvofp4fp16mm, which provides -the underlying OFP4→FP16 instruction support. Only narrow input formats -(OFP4, OFP8) define FP microscaling variants. +the underlying OFP4→FP16 instruction support. +Only widening floating-point multiply-accumulate instructions define FP microscaling variants, and only for narrow input formats (OFP4, OFP8). The integer MX subextensions (`Zvvxi*mm`, `Zvvxni*mm`) do not imply a -base subextension: `vfwimmacc.vv` and `vfqwimmacc.vv` have no non-MX +base subextension: `vfwimmacc.vv` and `vfqimmacc.vv` have no non-MX FP-accumulate form (the `vm=1` encoding of those opcodes is the existing -integer-accumulate `vwmmacc.vv` / `vqwmmacc.vv`). +integer-accumulate `vwmmacc.vv` / `vqmmacc.vv`). The block size and implication relationships are listed in <>. [#tbl-mx-subextensions] @@ -3467,26 +3771,26 @@ combinations are listed explicitly in the table below. 
12+^e| *vfmmacc.vv (W=1)* -| vfmmacc | 8 | 8 | 0 | E4M3 | 0 | E4M3 | 0 | E4M3 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 0 | E4M3 | 0 | E4M3 | 1 | E5M2 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 1 | E5M2 | 1 | E5M2 | 0 | E4M3 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 1 | E5M2 | 1 | E5M2 | 1 | E5M2 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 0 | E4M3 | 1 | E5M2 | 0 | E4M3 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 0 | E4M3 | 1 | E5M2 | 1 | E5M2 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 1 | E5M2 | 0 | E4M3 | 0 | E4M3 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 8 | 8 | 1 | E5M2 | 0 | E4M3 | 1 | E5M2 | Zvvofp8mm | Zvvxofp8mm | Zvvxnofp8mm -| vfmmacc | 16 | 16 | 0 | FP16 | 0 | FP16 | 0 | FP16 | Zvvfp16mm | — | — -| vfmmacc | 16 | 16 | 0 | FP16 | 0 | FP16 | 1 | BF16 | _reserved_ | — | — -| vfmmacc | 16 | 16 | 1 | BF16 | 1 | BF16 | 0 | FP16 | _reserved_ | — | — -| vfmmacc | 16 | 16 | 1 | BF16 | 1 | BF16 | 1 | BF16 | Zvvbf16mm | — | — -| vfmmacc | 16 | 16 | 0 | FP16 | 1 | BF16 | 0 | FP16 | Zvvfp16mm, Zvvbf16mm | — | — -| vfmmacc | 16 | 16 | 0 | FP16 | 1 | BF16 | 1 | BF16 | Zvvfp16mm, Zvvbf16mm | — | — -| vfmmacc | 16 | 16 | 1 | BF16 | 0 | FP16 | 0 | FP16 | Zvvfp16mm, Zvvbf16mm | — | — -| vfmmacc | 16 | 16 | 1 | BF16 | 0 | FP16 | 1 | BF16 | Zvvfp16mm, Zvvbf16mm | — | — -| vfmmacc | 32 | 32 | × | FP32 | × | FP32 | 0 | FP32 | Zvvfp32mm | — | — -| vfmmacc | 32 | 32 | × | FP32 | × | FP32 | 1 | — | _reserved_ | — | — -| vfmmacc | 64 | 64 | × | FP64 | × | FP64 | 0 | FP64 | Zvvfp64mm | — | — -| vfmmacc | 64 | 64 | × | FP64 | × | FP64 | 1 | — | _reserved_ | — | — +| vfmmacc | 8 | 8 | 0 | E4M3 | 0 | E4M3 | 0 | E4M3 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 0 | E4M3 | 0 | E4M3 | 1 | E5M2 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 1 | E5M2 | 1 | E5M2 | 0 | E4M3 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 1 | E5M2 | 1 | E5M2 | 1 | E5M2 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 0 | 
E4M3 | 1 | E5M2 | 0 | E4M3 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 0 | E4M3 | 1 | E5M2 | 1 | E5M2 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 1 | E5M2 | 0 | E4M3 | 0 | E4M3 | Zvvofp8mm | — | — +| vfmmacc | 8 | 8 | 1 | E5M2 | 0 | E4M3 | 1 | E5M2 | Zvvofp8mm | — | — +| vfmmacc | 16 | 16 | 0 | FP16 | 0 | FP16 | 0 | FP16 | Zvvfp16mm | — | — +| vfmmacc | 16 | 16 | 0 | FP16 | 0 | FP16 | 1 | BF16 | _reserved_ | — | — +| vfmmacc | 16 | 16 | 1 | BF16 | 1 | BF16 | 0 | FP16 | _reserved_ | — | — +| vfmmacc | 16 | 16 | 1 | BF16 | 1 | BF16 | 1 | BF16 | Zvvbf16mm | — | — +| vfmmacc | 16 | 16 | 0 | FP16 | 1 | BF16 | 0 | FP16 | Zvvfp16mm, Zvvbf16mm | — | — +| vfmmacc | 16 | 16 | 0 | FP16 | 1 | BF16 | 1 | BF16 | Zvvfp16mm, Zvvbf16mm | — | — +| vfmmacc | 16 | 16 | 1 | BF16 | 0 | FP16 | 0 | FP16 | Zvvfp16mm, Zvvbf16mm | — | — +| vfmmacc | 16 | 16 | 1 | BF16 | 0 | FP16 | 1 | BF16 | Zvvfp16mm, Zvvbf16mm | — | — +| vfmmacc | 32 | 32 | × | FP32 | × | FP32 | 0 | FP32 | Zvvfp32mm | — | — +| vfmmacc | 32 | 32 | × | FP32 | × | FP32 | 1 | — | _reserved_ | — | — +| vfmmacc | 64 | 64 | × | FP64 | × | FP64 | 0 | FP64 | Zvvfp64mm | — | — +| vfmmacc | 64 | 64 | × | FP64 | × | FP64 | 1 | — | _reserved_ | — | — 12+^e| *vfwmmacc.vv (W=2)* @@ -3510,24 +3814,24 @@ combinations are listed explicitly in the table below. 
| vfwmmacc | 64 | 32 | × | FP32 | × | FP32 | 0 | FP64 | Zvvfp32fp64mm | — | — | vfwmmacc | 64 | 32 | × | FP32 | × | FP32 | 1 | — | _reserved_ | — | — -12+^e| *vfqwmmacc.vv (W=4)* - -| vfqwmmacc | 8 | 2 | × | — | × | — | × | — | _reserved_ | — | — -| vfqwmmacc | 16 | 4 | 0 | E2M1 | 0 | E2M1 | 0 | FP16 | Zvvofp4fp16mm | Zvvxofp4fp16mm | Zvvxnofp4fp16mm -| vfqwmmacc | 16 | 4 | 0 | E2M1 | 0 | E2M1 | 1 | BF16 | Zvvofp4bf16mm | Zvvxofp4bf16mm | Zvvxnofp4bf16mm -| vfqwmmacc | 16 | 4 | 1 | — | 1 | — | × | — | _reserved_ | — | — -| vfqwmmacc | 32 | 8 | 0 | E4M3 | 0 | E4M3 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm -| vfqwmmacc | 32 | 8 | 0 | E4M3 | 0 | E4M3 | 1 | — | _reserved_ | — | — -| vfqwmmacc | 32 | 8 | 1 | E5M2 | 1 | E5M2 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm -| vfqwmmacc | 32 | 8 | 1 | E5M2 | 1 | E5M2 | 1 | — | _reserved_ | — | — -| vfqwmmacc | 32 | 8 | 0 | E4M3 | 1 | E5M2 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm -| vfqwmmacc | 32 | 8 | 1 | E5M2 | 0 | E4M3 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm -| vfqwmmacc | 64 | 16 | 0 | FP16 | 0 | FP16 | 0 | FP64 | Zvvfp16fp64mm | — | — -| vfqwmmacc | 64 | 16 | 0 | FP16 | 0 | FP16 | 1 | — | _reserved_ | — | — -| vfqwmmacc | 64 | 16 | 1 | BF16 | 1 | BF16 | 0 | FP64 | Zvvbf16fp64mm | — | — -| vfqwmmacc | 64 | 16 | 1 | BF16 | 1 | BF16 | 1 | — | _reserved_ | — | — -| vfqwmmacc | 64 | 16 | 0 | FP16 | 1 | BF16 | 0 | FP64 | Zvvfp16fp64mm, Zvvbf16fp64mm | — | — -| vfqwmmacc | 64 | 16 | 1 | BF16 | 0 | FP16 | 0 | FP64 | Zvvfp16fp64mm, Zvvbf16fp64mm | — | — +12+^e| *vfqmmacc.vv (W=4)* + +| vfqmmacc | 8 | 2 | × | — | × | — | × | — | _reserved_ | — | — +| vfqmmacc | 16 | 4 | 0 | E2M1 | 0 | E2M1 | 0 | FP16 | Zvvofp4fp16mm | Zvvxofp4fp16mm | Zvvxnofp4fp16mm +| vfqmmacc | 16 | 4 | 0 | E2M1 | 0 | E2M1 | 1 | BF16 | Zvvofp4bf16mm | Zvvxofp4bf16mm | Zvvxnofp4bf16mm +| vfqmmacc | 16 | 4 | 1 | — | 1 | — | × | — | _reserved_ | — | — +| vfqmmacc | 32 | 8 | 0 | E4M3 | 
0 | E4M3 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm +| vfqmmacc | 32 | 8 | 0 | E4M3 | 0 | E4M3 | 1 | — | _reserved_ | — | — +| vfqmmacc | 32 | 8 | 1 | E5M2 | 1 | E5M2 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm +| vfqmmacc | 32 | 8 | 1 | E5M2 | 1 | E5M2 | 1 | — | _reserved_ | — | — +| vfqmmacc | 32 | 8 | 0 | E4M3 | 1 | E5M2 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm +| vfqmmacc | 32 | 8 | 1 | E5M2 | 0 | E4M3 | 0 | FP32 | Zvvofp8fp32mm | Zvvxofp8fp32mm | Zvvxnofp8fp32mm +| vfqmmacc | 64 | 16 | 0 | FP16 | 0 | FP16 | 0 | FP64 | Zvvfp16fp64mm | — | — +| vfqmmacc | 64 | 16 | 0 | FP16 | 0 | FP16 | 1 | — | _reserved_ | — | — +| vfqmmacc | 64 | 16 | 1 | BF16 | 1 | BF16 | 0 | FP64 | Zvvbf16fp64mm | — | — +| vfqmmacc | 64 | 16 | 1 | BF16 | 1 | BF16 | 1 | — | _reserved_ | — | — +| vfqmmacc | 64 | 16 | 0 | FP16 | 1 | BF16 | 0 | FP64 | Zvvfp16fp64mm, Zvvbf16fp64mm | — | — +| vfqmmacc | 64 | 16 | 1 | BF16 | 0 | FP16 | 0 | FP64 | Zvvfp16fp64mm, Zvvbf16fp64mm | — | — 12+^e| *vf8wmmacc.vv (W=8)* @@ -3548,8 +3852,8 @@ combinations are listed explicitly in the table below. `altfmt_A` and `altfmt_B` select signed (0) or unsigned (1) interpretation; the accumulator is always signed. `altfmt` is unused and `vm` must be 1 (`vm=0` is _reserved_) for all four instructions. -For `vwmmacc.vv`, `vqwmmacc.vv`, and `v8wmmacc.vv`, `vm=0` is separately defined as -`vfwimmacc.vv`, `vfqwimmacc.vv`, and `vf8wimmacc.vv` +For `vwmmacc.vv`, `vqmmacc.vv`, and `v8wmmacc.vv`, `vm=0` is separately defined as +`vfwimmacc.vv`, `vfqimmacc.vv`, and `vf8wimmacc.vv` (integer inputs, FP accumulator, microscaling); see <>. 
[#tbl-int-encoding-map]
@@ -3596,21 +3900,21 @@ For `vwmmacc.vv`, `vqwmmacc.vv`, and `v8wmmacc.vv`, `vm=0` is separately defined
| vwmmacc | 64 | 32 | 1 | UInt32 | 0 | Int32 | Int64 | Zvvi32i64mm
| vwmmacc | 64 | 32 | 1 | UInt32 | 1 | UInt32 | Int64 | Zvvi32i64mm

-9+^e| *vqwmmacc.vv (W=4)*
-
-| vqwmmacc | 8 | 2 | × | — | × | — | — | _reserved_
-| vqwmmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | Int16 | Zvvi4i16mm
-| vqwmmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | Int16 | Zvvi4i16mm
-| vqwmmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | Int16 | Zvvi4i16mm
-| vqwmmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | Int16 | Zvvi4i16mm
-| vqwmmacc | 32 | 8 | 0 | Int8 | 0 | Int8 | Int32 | Zvvi8i32mm
-| vqwmmacc | 32 | 8 | 0 | Int8 | 1 | UInt8 | Int32 | Zvvi8i32mm
-| vqwmmacc | 32 | 8 | 1 | UInt8 | 0 | Int8 | Int32 | Zvvi8i32mm
-| vqwmmacc | 32 | 8 | 1 | UInt8 | 1 | UInt8 | Int32 | Zvvi8i32mm
-| vqwmmacc | 64 | 16 | 0 | Int16 | 0 | Int16 | Int64 | Zvvi16i64mm
-| vqwmmacc | 64 | 16 | 0 | Int16 | 1 | UInt16 | Int64 | Zvvi16i64mm
-| vqwmmacc | 64 | 16 | 1 | UInt16 | 0 | Int16 | Int64 | Zvvi16i64mm
-| vqwmmacc | 64 | 16 | 1 | UInt16 | 1 | UInt16 | Int64 | Zvvi16i64mm
+9+^e| *vqmmacc.vv (W=4)*
+
+| vqmmacc | 8 | 2 | × | — | × | — | — | _reserved_
+| vqmmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | Int16 | Zvvi4i16mm
+| vqmmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | Int16 | Zvvi4i16mm
+| vqmmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | Int16 | Zvvi4i16mm
+| vqmmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | Int16 | Zvvi4i16mm
+| vqmmacc | 32 | 8 | 0 | Int8 | 0 | Int8 | Int32 | Zvvi8i32mm
+| vqmmacc | 32 | 8 | 0 | Int8 | 1 | UInt8 | Int32 | Zvvi8i32mm
+| vqmmacc | 32 | 8 | 1 | UInt8 | 0 | Int8 | Int32 | Zvvi8i32mm
+| vqmmacc | 32 | 8 | 1 | UInt8 | 1 | UInt8 | Int32 | Zvvi8i32mm
+| vqmmacc | 64 | 16 | 0 | Int16 | 0 | Int16 | Int64 | Zvvi16i64mm
+| vqmmacc | 64 | 16 | 0 | Int16 | 1 | UInt16 | Int64 | Zvvi16i64mm
+| vqmmacc | 64 | 16 | 1 | UInt16 | 0 | Int16 | Int64 | Zvvi16i64mm
+| vqmmacc | 64 | 16 | 1 | UInt16 | 1 | UInt16 | Int64 | Zvvi16i64mm

9+^e| *v8wmmacc.vv (W=8)*
@@ -3630,8 +3934,8 @@ For `vwmmacc.vv`, `vqwmmacc.vv`, and `v8wmmacc.vv`, `vm=0` is separately defined
[#tbl-intmx-encoding-map]
==== Integer MX encoding map (`vm=0`)
-When `vm=0`, `vwmmacc.vv`, `vqwmmacc.vv`, and `v8wmmacc.vv` are decoded as
-`vfwimmacc.vv`, `vfqwimmacc.vv`, and `vf8wimmacc.vv` respectively.
+When `vm=0`, `vwmmacc.vv`, `vqmmacc.vv`, and `v8wmmacc.vv` are decoded as
+`vfwimmacc.vv`, `vfqimmacc.vv`, and `vf8wimmacc.vv` respectively.
MXINT inputs are always signed;
`altfmt_A` and `altfmt_B` must be 0 (unsigned is _reserved_).
`altfmt` selects the FP accumulator format.
@@ -3656,23 +3960,23 @@ MXINT inputs are always signed;
| vfwimmacc | 32 | 16 | × | Int16 | × | Int16 | × | — | _reserved_ | _reserved_
| vfwimmacc | 64 | 32 | × | Int32 | × | Int32 | × | — | _reserved_ | _reserved_

-11+^e| *vfqwimmacc.vv (W=4, vm=0 of vqwmmacc opcode)*
-
-| vfqwimmacc | 8 | 2 | × | — | × | — | × | — | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | 0 | FP16 | Zvvxi4fp16mm | Zvvxni4fp16mm
-| vfqwimmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | 0 | FP16 | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | 0 | FP16 | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | 0 | FP16 | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | 1 | BF16 | Zvvxi4bf16mm | Zvvxni4bf16mm
-| vfqwimmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | 1 | BF16 | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | 1 | BF16 | _reserved_ | _reserved_
-| vfqwimmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | 1 | BF16 | _reserved_ | _reserved_
-| vfqwimmacc | 32 | 8 | 0 | Int8 | 0 | Int8 | 0 | FP32 | Zvvxi8fp32mm | Zvvxni8fp32mm
-| vfqwimmacc | 32 | 8 | 0 | Int8 | 1 | UInt8 | 0 | FP32 | _reserved_ | _reserved_
-| vfqwimmacc | 32 | 8 | 1 | UInt8 | 0 | Int8 | 0 | FP32 | _reserved_ | _reserved_
-| vfqwimmacc | 32 | 8 | 1 | UInt8 | 1 | UInt8 | 0 | FP32 | _reserved_ | _reserved_
-| vfqwimmacc | 32 | 8 | × | — | × | — | 1 | — | _reserved_ | _reserved_
-| vfqwimmacc | 64 | 16 | × | Int16 | × | Int16 | × | — | _reserved_ | _reserved_
+11+^e| *vfqimmacc.vv (W=4, vm=0 of vqmmacc opcode)*
+
+| vfqimmacc | 8 | 2 | × | — | × | — | × | — | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | 0 | FP16 | Zvvxi4fp16mm | Zvvxni4fp16mm
+| vfqimmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | 0 | FP16 | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | 0 | FP16 | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | 0 | FP16 | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 0 | Int4 | 0 | Int4 | 1 | BF16 | Zvvxi4bf16mm | Zvvxni4bf16mm
+| vfqimmacc | 16 | 4 | 0 | Int4 | 1 | UInt4 | 1 | BF16 | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 1 | UInt4 | 0 | Int4 | 1 | BF16 | _reserved_ | _reserved_
+| vfqimmacc | 16 | 4 | 1 | UInt4 | 1 | UInt4 | 1 | BF16 | _reserved_ | _reserved_
+| vfqimmacc | 32 | 8 | 0 | Int8 | 0 | Int8 | 0 | FP32 | Zvvxi8fp32mm | Zvvxni8fp32mm
+| vfqimmacc | 32 | 8 | 0 | Int8 | 1 | UInt8 | 0 | FP32 | _reserved_ | _reserved_
+| vfqimmacc | 32 | 8 | 1 | UInt8 | 0 | Int8 | 0 | FP32 | _reserved_ | _reserved_
+| vfqimmacc | 32 | 8 | 1 | UInt8 | 1 | UInt8 | 0 | FP32 | _reserved_ | _reserved_
+| vfqimmacc | 32 | 8 | × | — | × | — | 1 | — | _reserved_ | _reserved_
+| vfqimmacc | 64 | 16 | × | Int16 | × | Int16 | × | — | _reserved_ | _reserved_

11+^e| *vf8wimmacc.vv (W=8, vm=0 of v8wmmacc opcode)*