diff --git a/main/acle.md b/main/acle.md
index 3b066e93..5b3d2bdd 100644
--- a/main/acle.md
+++ b/main/acle.md
@@ -465,6 +465,8 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
* Added feature test macro for FEAT_SSVE_FEXPA.
* Added feature test macro for FEAT_CSSC.
+* Added support for modal 8-bit floating-point matrix multiply-accumulate widening intrinsics.
+* Added support for 16-bit floating-point matrix multiply-accumulate widening intrinsics.
### References
@@ -2346,6 +2348,26 @@ is hardware support for the SVE forms of these instructions and if the
associated ACLE intrinsics are available. This implies that
`__ARM_FEATURE_MATMUL_INT8` and `__ARM_FEATURE_SVE` are both nonzero.
+##### Multiplication of modal 8-bit floating-point matrices
+
+This section is in
+[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
+extended in the future.
+
+`__ARM_FEATURE_F8F16MM` is defined to `1` if there is hardware support
+for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to half-precision (FEAT_F8F16MM)
+instructions and if the associated ACLE intrinsics are available.
+
+`__ARM_FEATURE_F8F32MM` is defined to `1` if there is hardware support
+for the NEON and SVE modal 8-bit floating-point matrix multiply-accumulate to single-precision (FEAT_F8F32MM)
+instructions and if the associated ACLE intrinsics are available.
+
+##### Multiplication of 16-bit floating-point matrices
+
+`__ARM_FEATURE_SVE_F16F32MM` is defined to `1` if there is hardware support
+for the SVE 16-bit floating-point matrix multiply-accumulate to single-precision
+(FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.
+
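+These macros allow a source file to select an implementation at compile time.
+A minimal sketch (the helper name and the returned strings are hypothetical,
+not part of ACLE):
+
```c
/* Illustrative compile-time dispatch on the new feature macros.
   select_f8_matmul_path() is a hypothetical helper, not part of ACLE. */
static const char *select_f8_matmul_path(void) {
#if defined(__ARM_FEATURE_SVE2) && defined(__ARM_FEATURE_F8F16MM)
    return "sve2-f8f16mm";  /* SVE2 FP8 -> FP16 FMMLA intrinsics available */
#elif defined(__ARM_FEATURE_F8F32MM)
    return "f8f32mm";       /* FP8 -> FP32 FMMLA intrinsics available */
#else
    return "portable";      /* scalar fallback path */
#endif
}
```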
##### Multiplication of 32-bit floating-point matrices
`__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2637,6 +2659,9 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 |
| [`__ARM_FEATURE_SVE_MATMUL_FP32`](#multiplication-of-32-bit-floating-point-matrices) | 32-bit floating-point matrix multiply extension (FEAT_F32MM) | 1 |
| [`__ARM_FEATURE_SVE_MATMUL_FP64`](#multiplication-of-64-bit-floating-point-matrices) | 64-bit floating-point matrix multiply extension (FEAT_F64MM) | 1 |
+| [`__ARM_FEATURE_F8F16MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to half-precision extension (FEAT_F8F16MM) | 1 |
+| [`__ARM_FEATURE_F8F32MM`](#multiplication-of-modal-8-bit-floating-point-matrices) | Modal 8-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_F8F32MM) | 1 |
+| [`__ARM_FEATURE_SVE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices) | 16-bit floating-point matrix multiply-accumulate to single-precision extension (FEAT_SVE_F16F32MM) | 1 |
| [`__ARM_FEATURE_SVE_MATMUL_INT8`](#multiplication-of-8-bit-integer-matrices) | SVE support for the integer matrix multiply extension (FEAT_I8MM) | 1 |
| [`__ARM_FEATURE_SVE_PREDICATE_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE vector types | 1 |
| [`__ARM_FEATURE_SVE_VECTOR_OPERATORS`](#scalable-vector-extension-sve) | Level of support for C and C++ operators on SVE predicate types | 1 |
@@ -9374,6 +9399,31 @@ BFloat16 floating-point multiply vectors.
uint64_t imm_idx);
```
+### SVE2 floating-point matrix multiply-accumulate instructions
+
+#### FMMLA (widening, FP8 to FP16)
+
+Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
+```c
+ // Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F16MM)
+ svfloat16_t svmmla[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
+```
+
+#### FMMLA (widening, FP8 to FP32)
+
+Modal 8-bit floating-point matrix multiply-accumulate to single-precision.
+```c
+ // Only if (__ARM_FEATURE_SVE2 && __ARM_FEATURE_F8F32MM)
+ svfloat32_t svmmla[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
+```
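+
+Assuming the same 2x8-by-8x2 blocking as the existing 8-bit integer MMLA
+instructions, each 128-bit segment of `zn` and `zm` holds a row-major 2x8
+matrix and the corresponding segment of `zda` accumulates the 2x2 product
+ZN x ZM^T. A scalar sketch of one segment, with operands modelled as
+pre-decoded `float` values (FP8 decoding under FPMR control is deliberately
+elided; the function name is illustrative):
+
```c
/* Illustrative scalar model of one 128-bit segment of the widening
   FP8 -> FP32 FMMLA: zda (2x2) += zn (2x8) x zm^T (8x2).
   Operands are pre-decoded floats; FP8 decoding per FPMR is elided. */
static void fmmla_f8f32_segment_ref(float zda[2][2],
                                    float zn[2][8],
                                    float zm[2][8]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 8; k++)
                zda[i][j] += zn[i][k] * zm[j][k];
}
```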
+
+#### FMMLA (widening, FP16 to FP32)
+
+16-bit floating-point matrix multiply-accumulate to single-precision.
+```c
+ // Only if __ARM_FEATURE_SVE_F16F32MM
+ svfloat32_t svmmla[_f32_f16](svfloat32_t zda, svfloat16_t zn, svfloat16_t zm);
+```
+
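+As with the other SVE matrix multiply instructions, a vector of VL bits is
+processed as VL/128 independent 128-bit segments. Assuming the same blocking
+as BFMMLA (each `zn`/`zm` segment a row-major 2x4 matrix, each `zda` segment
+a 2x2 accumulator), the segment-wise behaviour can be sketched in scalar C,
+with half-precision inputs modelled as `float` (the function name is
+illustrative, not part of ACLE):
+
```c
/* Illustrative scalar model of the segment-wise FP16 -> FP32 FMMLA:
   for each 128-bit segment s, zda_s (2x2) += zn_s (2x4) x zm_s^T (4x2).
   FP16 inputs are modelled as float; nseg is VL/128. */
static void fmmla_f16f32_ref(int nseg, float *zda, const float *zn, const float *zm) {
    for (int s = 0; s < nseg; s++) {
        const float *n = zn + s * 8;  /* 8 halfwords per segment: 2x4 matrix */
        const float *m = zm + s * 8;
        float *d = zda + s * 4;       /* 4 words per segment: 2x2 matrix */
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 4; k++)
                    d[i * 2 + j] += n[i * 4 + k] * m[j * 4 + k];
    }
}
```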
### SVE2.1 instruction intrinsics
The specification for SVE2.1 is in
diff --git a/neon_intrinsics/advsimd.md b/neon_intrinsics/advsimd.md
index a87ad725..dad00fa3 100644
--- a/neon_intrinsics/advsimd.md
+++ b/neon_intrinsics/advsimd.md
@@ -6175,3 +6175,14 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
| float32x4_t vmlalltbq_laneq_f32_mf8_fpm(
float32x4_t vd,
mfloat8x16_t vn,
mfloat8x16_t vm,
const int lane,
fpm_t fpm) | `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
| float32x4_t vmlallttq_lane_f32_mf8_fpm(
float32x4_t vd,
mfloat8x16_t vn,
mfloat8x8_t vm,
const int lane,
fpm_t fpm) | `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
| float32x4_t vmlallttq_laneq_f32_mf8_fpm(
float32x4_t vd,
mfloat8x16_t vn,
mfloat8x16_t vm,
const int lane,
fpm_t fpm) | `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` |
+
+## Matrix multiplication intrinsics from Armv9.6-A
+
+### Vector arithmetic
+
+#### Matrix multiply
+
+| Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures |
+|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-------------------------------|-------------------|---------------------------|
+| float16x8_t vmmlaq_f16_mf8_fpm(
float16x8_t r,
mfloat8x16_t a,
mfloat8x16_t b,
fpm_t fpm) | `r -> Vd.8H`
`a -> Vn.16B`
`b -> Vm.16B` | `FMMLA Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` |
+| float32x4_t vmmlaq_f32_mf8_fpm(
float32x4_t r,
mfloat8x16_t a,
mfloat8x16_t b,
fpm_t fpm) | `r -> Vd.4S`
`a -> Vn.16B`
`b -> Vm.16B` | `FMMLA Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` |
diff --git a/tools/intrinsic_db/advsimd.csv b/tools/intrinsic_db/advsimd.csv
index 7b51c965..7fd96e6e 100644
--- a/tools/intrinsic_db/advsimd.csv
+++ b/tools/intrinsic_db/advsimd.csv
@@ -4810,3 +4810,7 @@ float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x
float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+
+ Matrix multiplication intrinsics from Armv9.6-A
+float16x8_t vmmlaq_f16_mf8_fpm(float16x8_t r, mfloat8x16_t a, mfloat8x16_t b, fpm_t fpm) r -> Vd.8H;a -> Vn.16B;b -> Vm.16B FMMLA Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64
+float32x4_t vmmlaq_f32_mf8_fpm(float32x4_t r, mfloat8x16_t a, mfloat8x16_t b, fpm_t fpm) r -> Vd.4S;a -> Vn.16B;b -> Vm.16B FMMLA Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64
diff --git a/tools/intrinsic_db/advsimd_classification.csv b/tools/intrinsic_db/advsimd_classification.csv
index a5e75510..b8a8a39f 100644
--- a/tools/intrinsic_db/advsimd_classification.csv
+++ b/tools/intrinsic_db/advsimd_classification.csv
@@ -4697,3 +4697,5 @@ vmlalltbq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and wi
vmlalltbq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
vmlallttq_lane_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
vmlallttq_laneq_f32_mf8_fpm Vector arithmetic|Multiply|Multiply-accumulate and widen
+vmmlaq_f16_mf8_fpm Vector arithmetic|Matrix multiply
+vmmlaq_f32_mf8_fpm Vector arithmetic|Matrix multiply