From 3ccdc79a30abd9b224604187f78919f6074eb3a7 Mon Sep 17 00:00:00 2001 From: rsandifo-arm Date: Thu, 1 Feb 2024 09:47:36 +0000 Subject: [PATCH] Generalise the SME state management attributes (#276) This patch replaces __arm_shared_za and __arm_new_za with more general attributes. The purpose is twofold: * To allow the same approach to be taken for ZT0, and for any other similar state that is added in future. * To allow the programmer to give the compiler more information about when state is live (if the programmer wants to). __arm_shared_za is directly equivalent to __arm_inout("za"). Any code that is already using __arm_shared_za, or that prefers that syntax for some reason, could just #define one to the other. Similarly, __arm_new_za is directly equivalent to __arm_new("za"). The patch also removes __arm_preserves_za and replaces it with a more restricted __arm_preserves attribute. The old __arm_preserves_za could be attached to both private-ZA and shared-ZA functions. However, the private-ZA version had somewhat dubious semantics: * It made a promise about how ZA would be handled by a C/C++ function that doesn't use ZA directly. It wasn't obvious how the burden of keeping that promise was distributed between the programmer and the compiler. (The feature was always intended to be low-level.) * The semantics for private-ZA functions meant that callers would still need to prepare a lazy save buffer. __arm_preserves_za just meant that they could avoid having to restore from it afterwards. I'm hoping to replace the private-ZA form of __arm_preserves_za with an alternative, optional, extension that avoids the need for the lazy save buffer, and that would be handled entirely by the compiler. In contrast, __arm_preserves("za") makes a function shared-ZA. Co-authored-by: Sander de Smalen --- main/acle.md | 1079 +++++++++++++++++++++++++++++--------------------- 1 file changed, 633 insertions(+), 446 deletions(-) diff --git a/main/acle.md b/main/acle.md index a1d363c3..9a76aff7 100644 --- a/main/acle.md +++ b/main/acle.md @@ -348,8 +348,8 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin [CMSE](#CMSE-ACLE)'s Arguments on the stack and floating point handling. * Corrected description and example in [CMSE](#CMSE-ACLE)'s section about non-secure function pointers. -* Added a requirement on [`arm_new_za`](#arm_new_za) to set the initial - contents of ZA to zero. +* Added a requirement on [`arm_new_za`] to set the initial contents of + ZA to zero. #### Changes for next release @@ -369,6 +369,9 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin for the Armv9.5-A's PAC Enhancements. * Introduced a new value to the `__ARM_FEATURE_PAC_DEFAULT` macro to indicate the use of PC as a diversifier for [Pointer Authentication](#pointer-authentication). +* Added a [State management](#state-management) section, replacing the + `__arm_shared_za`, `__arm_new_za`, and `__arm_preserves_za` attributes + in the previous Alpha SME spec. ### References @@ -806,9 +809,11 @@ predefine the associated macro to a nonzero value. | **Name** | **Target** | **Predefined macro** | | ----------------------------------------------------------- | --------------------- | --------------------------------- | | [`__arm_locally_streaming`](#arm_locally_streaming) | function declaration | `__ARM_FEATURE_LOCALLY_STREAMING` | -| [`__arm_new_za`](#arm_new_za) | function declaration | `__ARM_FEATURE_SME` | -| [`__arm_preserves_za`](#arm_preserves_za) | function type | `__ARM_FEATURE_SME` | -| [`__arm_shared_za`](#arm_shared_za) | function type | `__ARM_FEATURE_SME` | +| [`__arm_in`](#ways-of-sharing-state) | function type | Argument-dependent | +| [`__arm_inout`](#ways-of-sharing-state) | function type | Argument-dependent | +| [`__arm_new`](#arm_new) | function declaration | Argument-dependent | +| [`__arm_out`](#ways-of-sharing-state) | function type | Argument-dependent | +| [`__arm_preserves`](#ways-of-sharing-state) | function type | Argument-dependent | | [`__arm_streaming`](#arm_streaming) | function type | `__ARM_FEATURE_SME` | | [`__arm_streaming_compatible`](#arm_streaming_compatible) | function type | `__ARM_FEATURE_SME` | @@ -4588,6 +4593,427 @@ two pointers, ignoring the tags. The return value is the sign-extended result of the computation. The tag bits in the input pointers are ignored for this operation. +# State management + +The specification for SME is in +[**Alpha** state](#current-status-and-anticipated-changes) and may change or be +extended in the future. + +## Introduction + +ACLE often uses normal C and C++ objects to represent architectural state. +These objects are passed to intrinsics and returned from intrinsics in +the same way as they would be for a normal function. + +For example, ACLE defines an `svbool_t` type to represent the +contents of an SVE predicate register. SVE intrinsics that read +a predicate register have an `svbool_t` argument for that register. +SVE intrinsics that write to a predicate register have an `svbool_t` +return type for that register. + +SVE provides 16 predicate registers, but this number is not exposed to +C and C++ code. Instead, the program can have as many `svbool_t` objects +as it needs, and it is the compiler's responsibility to manage the +register allocation. + +However, there are some pieces of architectural state for which this +approach is not suitable. For example, [SME's ZA](#za-storage) is a +single piece of storage: there are not multiple ZAs, and so it does +not make sense for a C or C++ function to have multiple ZA objects. +It would also be inefficient to copy state from one ZA object to another. + +ACLE models this kind of architectural state using [keyword +attributes](#keyword-attributes) rather than types and objects. +The main purpose of these keyword attributes is to establish the following +binary choices for a given piece of architectural state S: + + + +1. A function with a given type either shares S with its caller, + or it does not. + +2. A given function definition either uses S internally, or it does not. + +A function definition that shares S with its caller also (implicitly) uses +S internally. + + + +A function definition that uses S internally but does not share S with +its caller is said to create a “new scope” for S. + +ACLE uses strings to identify each piece of state that is managed in +this way. The strings are listed in the table below. + + + +| **String** | **State** | **State macro** | **Feature macro** | +| ---------- | ------------------ | ----------------- | -------------------- | +| `"za"` | SME's ZA storage | `__ARM_STATE_ZA` | `__ARM_FEATURE_SME` | +| `"zt0"` | SME2's ZT0 | `__ARM_STATE_ZT0` | `__ARM_FEATURE_SME2` | + +For each string, there are two associated macros: + +* a so-called “state macro” that compilers predefine if they recognize + the string + +* a so-called “feature macro” that compilers predefine if they can compile + functions that use the state + +A compiler that predefines the feature macro must also predefine the +state macro. + +For example, `__ARM_STATE_ZA` allows declarations of functions that share ZA. +`__ARM_FEATURE_SME` allows function definitions to use ZA internally. + +The strings are case-sensitive. For example, `"za"` cannot be written `"ZA"`. + +## Ways of sharing state + +ACLE identifies several different ways in which a callee can share +state with its caller. Each one has an associated [keyword +attribute](#keyword-attributes): + +* `__arm_in`: the callee takes the state as input and returns with + the state unchanged. This is similar to a const reference in C++. + +* `__arm_out`: the callee ignores the incoming state and returns new state. + This is similar to a function return value. + +* `__arm_inout`: the callee takes the state as input and returns new state. + This is similar to a non-const reference in C++. + +* `__arm_preserves`: the callee does not read the incoming state and + returns with the state unchanged. + +Each keyword attribute takes a comma-separated list of state strings as +an argument. The list must be non-empty. + +For example: + +``` c + void f() __arm_inout("za") __arm_preserves("zt0"); +``` + +says that `f` uses ZA to receive data from callers and to pass data +back to callers. `f` also promises to preserve the contents of ZT0. + +A program is [ill-formed](#ill-formed) if a function type has two +attributes that specify different behavior for the same piece of state. +For example, a program is ill-formed if a function type has both +`__arm_in("za")` and `__arm_out("za")`, or both `__arm_in("za")` +and `__arm_inout("za")`. + +The same information may be specified multiple times. For example: + +``` c + void f() __arm_in("za", "za") __arm_in("za"); +``` + +is well-formed. + +Two function types are incompatible if they handle a piece of +architectural state differently. For example: + +``` c + void f1() __arm_in("za"); + void f2() __arm_out("za"); + void f3() __arm_inout("za"); + void f4() __arm_preserves("za"); + void f5(); + + void (*ptr1)() __arm_in("za"); + void (*ptr2)() __arm_out("za"); + void (*ptr3)() __arm_inout("za"); + void (*ptr4)() __arm_preserves("za"); + void (*ptr5)(); + + void test() { + ptr1 = f1; // OK + ptr2 = f1; // Invalid + ptr3 = f1; // Invalid + ptr4 = f1; // Invalid + ptr5 = f1; // Invalid + + ptr1 = f2; // Invalid + ptr2 = f2; // OK + ptr3 = f2; // Invalid + ptr4 = f2; // Invalid + ptr5 = f2; // Invalid + + ptr1 = f3; // Invalid + ptr2 = f3; // Invalid + ptr3 = f3; // OK + ptr4 = f3; // Invalid + ptr5 = f3; // Invalid + + ptr1 = f4; // Invalid + ptr2 = f4; // Invalid + ptr3 = f4; // Invalid + ptr4 = f4; // OK + ptr5 = f4; // Invalid + + ptr1 = f5; // Invalid + ptr2 = f5; // Invalid + ptr3 = f5; // Invalid + ptr4 = f5; // Invalid + ptr5 = f5; // OK + } +``` + +`__arm_in` and `__arm_preserves` both guarantee that a function leaves +the contents of the state unchanged on return. Functions are not +[ill-formed](#ill-formed) if they write to such state, but they must ensure +that the cumulative effect of such writes is to leave the state unchanged. + +A program is [ill-formed](#ill-formed) if a function that does not [use +state](#uses-state) S contains a call to a function that shares S with +its caller. This is conceptually similar to an undefined variable in +C. For example: + +``` c + void callee() __arm_inout("za"); + void caller() { + callee(); // Ill-formed, caller does not use ZA + } +``` + +If a C or C++ function F [uses state](#uses-state) S internally and calls +a function F2 that does not share S with its callers, the call to F2 has +no effect on F's S state. The compiler must ensure this by [restoring +the old contents of S](#za-implementation-notes) after the call to F2, +if such a restoration is necessary. For example: + +``` c + void za_callee() __arm_inout("za"); + void za_caller() __arm_inout("za") { + za_callee(); + printf("Here\n"); // Does not change za_caller's ZA state + za_callee(); + } +``` + +## Mapping to the Procedure Call Standard + +[[AAPCS64]](#AAPCS64) classifies functions as having one of the following +interfaces: + + + +* a “private-ZA” interface + +* a “shared-ZA” interface + +If a C or C++ function F forms part of the object code's ABI, that +object code function has a shared-ZA interface if and only if at least +one of the following is true: + +* F shares ZA with its caller + +* F shares ZT0 with its caller + +All other functions have a private-ZA interface. + +## Function definitions + + + +The [keyword attribute](#keyword-attributes) `__arm_new(...)` +applies to function definitions. It specifies that the function +creates a [new scope](#new-state-scope) for the given state. `...` is a +comma-separated list of [state strings](#state-strings). The list +must be non-empty. + +Each piece of state specified by `__arm_new` is zero-initialized. + +This attribute does not change a function's binary interface. If the +function forms part of the object code's ABI, that object code function +has the same interface as it would have had without `__arm_new`. + +A program is [ill-formed](#ill-formed) if a function definition F +uses `__arm_new` for some state that F also [shares](#shares-state) +with its caller. For example: + +``` c + __arm_new("za") void f1() __arm_in("za") { ... } // Invalid + __arm_new("za") void f2() __arm_out("za") { ... } // Invalid + __arm_new("za") void f3() __arm_inout("za") { ... } // Invalid + __arm_new("za") void f4() __arm_inout("zt0") { ... } // OK + __arm_new("zt0") void f5() __arm_inout("za") { ... } // OK +``` + +## Inline assembly + +Some ACLE implementations might support the GNU inline asm extension. +For implementations that do, suppose that an inline asm occurs in a +function F that [uses state](#uses-state) S. There are then two +possibilities: + +* S is not an input to the asm and S is not an output from the asm. + The contents of S immediately after executing the asm's instructions + are the same as the contents of S immediately before executing the + instructions. This is the default assumption. + +* S might be an input to the asm, might be an output from the asm, + and might be changed by the asm's instructions. This is indicated by + adding S's string to the asm's clobber list. For example, if an asm's + clobber list includes `"za"`, ZA might be an input to the asm, + an output from the asm, or something that the asm changes. + +If instead the inline asm occurs in a function that does not [use +state](#uses-state) S, the two possibilities are: + +* S is not an input to the asm and S is not an output from the asm. + The contents of S immediately after executing the asm's instructions + are the same as the contents of S immediately before executing the + instructions. This is the default assumption. + +* S is not an input to the asm and S is not an output from the asm. + However, the contents of S after executing the asm's instructions might + be different from the contents of S before executing the instructions. + This is indicated by adding S's string to the asm's clobber list. + + If an asm takes this option for state that is controlled by PSTATE.ZA, + the asm itself is responsible for handling the [[AAPCS64]](#AAPCS64) + lazy save scheme. + +[[AAPCS64]](#AAPCS64) defines three possible states for ZA: +“off”, “dormant” or “active”. These states describe the values +of two things: + +* the PSTATE.ZA bit + +* the TPIDR2_EL0 register + +ZA is guaranteed to be active on entry to an inline asm A in a +function F if at least one of the following is true: + +* F [uses `"za"`](#uses-state) and A's clobber list includes `"za"`. + +* F [uses `"zt0"`](#uses-state) and A's clobber list includes `"zt0"`. + +Otherwise, ZA can be in any state on entry to A if at least one of the +following is true: + +* F [uses](#uses-state) `"za"` + +* F [uses](#uses-state) `"zt0"` + +Otherwise, ZA can be off or dormant on entry to A, as for what AAPCS64 +calls “private-ZA” functions. + +If ZA is active on entry to A then A's instructions must ensure that +ZA is also active when the asm finishes. + +Similarly, if ZA is off on entry to A then A's instructions must ensure +that ZA is off when the asm finishes. + +If ZA is dormant on entry to A and A's clobber list does not include `"za"`, +A's instructions must ensure that ZA is dormant when the asm finishes. + +Otherwise, if ZA is dormant on entry to A and A's clobber list +includes `"za"`, A's instructions can leave ZA unchanged or in +the off state (A's choice). In the latter case, A's instructions +must commit the lazy save that was pending on entry to A. + +The table below summarizes the possible ZA states on entry to an +inline asm A in a function F. It also specifies what the state +is required to be when A finishes; it is the asm's responsibility +to ensure this. Since PSTATE.ZA controls both ZA and ZT0, the rules +depend on ZT0 as well as ZA. + +| **ZA state before A** | **ZA state after A** | **Possible if…** | +| --------------------- | -------------------- | -------------------------------------- | +| off | off | F's uses and A's clobbers are disjoint | +| dormant | dormant | " " " | +| dormant | off | " " ", and A clobbers `"za"` | +| active | active | F uses `"za"` and/or `"zt0"` | + +The [`__ARM_STATE` macros](#state-strings) indicate whether a compiler +is guaranteed to support a particular clobber string. For example, +the following code is a safe way to indicate that an asm might commit +a lazy ZA save: + +``` c + // Function foo doesn't use ZA or ZT0. + void foo() { + ... + asm volatile("bl something" + ::: ...call-clobbered registers... +#ifdef __ARM_STATE_ZA + , "za" +#endif + ); + ... + } +``` + +## Implementation notes + + + +A piece of state S can be said to be “live” at a point P during the +execution of a function F if: + +* F [uses](#uses-state) S; and + +* a random, exogenous change to S at P could change the behavior + of the program. + +If S is live before a call from F to a function F2 that does not share S +with its caller, the compiler must arrange for S to be preserved around +the call to F2. One way to do this is to save S before the call and +restore it afterwards. However, if S is ZA and F2 is a private-ZA +function, the compiler can instead make use of the lazy-save scheme +described in [[AAPCS64]](#AAPCS64). + +For example, the code below shows a function that uses ZA and ZT0 +internally. The comments describe when ZA should be saved and restored: + +``` c + void f_shares_zt0() __arm_inout("zt0"); + void f_shares_za() __arm_inout("za"); + void f_private(); + + void setup_za() __arm_out("za"); + void use_za() __arm_in("za"); + + __arm_new("za", "zt0") void test() { + f_private(); // ZA is not live, no save necessary + + setup_za(); // ZA is live after this call + f_shares_zt0(); // The compiler should save and restore ZA + // around the call ("caller-save") + f_shares_za(); // ZA is live before and after this call + f_private(); // The compiler should preserve ZA across the call + // It can use the lazy-save mechanism + use_za(); // ZA is no longer live after this call + + f_private(); // ZA is not live, no save necessary + } +``` + +ZT0 cannot be lazily saved, so if ZT0 is live before a call to a +function that does not share ZT0, the compiler must save and restore +ZT0 around the call. For example: + +``` c + void setup_zt0() __arm_out("zt0"); + void use_zt0() __arm_in("zt0"); + void f_private(); + + __arm_new("zt0") void test() { + f_private(); // ZT0 is not live, no save necessary + + setup_zt0(); // ZT0 is live after this call + f_private(); // The compiler should save and restore ZT0 + // around the call ("caller-save") + use_zt0(); // ZT0 is no longer live after this call + + f_private(); // ZT0 is not live, no save necessary + } +``` + # System register access ## Special register intrinsics @@ -8697,8 +9123,6 @@ whose values might be changed by the asm. ## ZA storage -### Introduction to ZA storage - SME provides an area of storage called ZA, of size SVL.B×SVL.B bytes. It also provides a processor state bit called PSTATE.ZA to control whether ZA is enabled. @@ -8708,15 +9132,15 @@ anchor that can be referred via an internal hyperlink to the paragraph following it. --> In C and C++ code, access to ZA is controlled at function granularity: -a function either uses ZA or it does not. Another way to say this is -that a function either “has ZA state” or it does not. +a function either [uses](#uses-state) ZA or it does not. Another way to +say this is that a function either “has ZA state” or it does not. -If a function does have ZA state, the function can either share that ZA -state with the function's caller or create new ZA state “from scratch”. -In the latter case, it is the compiler's responsibility to free up ZA -so that the function can use it; see the description of the lazy saving -scheme in [[AAPCS64]](#AAPCS64) for details about how the compiler -does this. +If a function does have ZA state, the function can either +[share](#shares-state) that ZA state with the function's caller or create +new ZA state “from scratch”. In the latter case, it is the compiler's +responsibility to free up ZA so that the function can use it; see the +description of the lazy saving scheme in [[AAPCS64]](#AAPCS64) for +details about how the compiler does this. These possibilities give a one-out-of-three choice for how a function handles ZA: @@ -8724,101 +9148,16 @@ handles ZA: 1. The function has no [ZA state](#za-state). This is the default. 2. The function has [ZA state](#za-state) that it shares with its caller. - This is indicated by adding [`__arm_shared_za`](#arm_shared_za) to - the function type. - - This case is similar in concept to passing an uncopyable (move-only) value - by reference to a C++ function: - - ``` c++ - // Pseudo-code showing the conceptual effect of __arm_shared_za. - struct pseudo_za_state { - ... - pseudo_za_state(const pseudo_za_state &) = delete; - pseudo_za_state &operator=(const pseudo_za_state &) = delete; - pseudo_za_state *operator&() const = delete; - ... - }; - void shared_za_f1(pseudo_za_state &); - void shared_za_f2(pseudo_za_state &shared_za) { - ... - shared_za_f1(shared_za); - ... - } - ``` + This is indicated by adding a [state-sharing + attribute](#ways-of-sharing-state) to the function type, + such as `__arm_inout("za")`. 3. The function has [ZA state](#za-state) that it creates “from scratch” and that it does not share with its caller. This is indicated by adding - `__arm_new_za` to the function definition. - - This case is similar in spirit to declaring a single uncopyable C++ - variable at function scope. Continuing the pseudo-code above: - - ``` c++ - // Pseudo-code showing the conceptual effect of arm_new_za. - void new_za_f3() { - pseudo_za_state new_za; - ... - shared_za_f2(new_za); - ... - } - ``` - - - -Reusing a term from [[AAPCS64]](#AAPCS64), the functions in category (2) -are called “shared-ZA” functions whereas the functions in categories (1) and -(3) are called “private-ZA” functions. Therefore, “private-ZA” is the -opposite of “shared-ZA”. - -A program is [ill-formed](#ill-formed) if: - -* a function that has no [ZA state](#za-state) contains an - [evaluated call](#evaluated-call) to a [shared-ZA](#shared-za) function. - -* [`__arm_new_za`](#arm_new_za) is used to define a - [shared-ZA](#shared-za) function. - -If a function F1 has [ZA state](#za-state) and it calls a function F2, then: - -* if F2 is a [shared-ZA](#shared-za) function, F1 shares its ZA state - with F2. - -* otherwise, F1's ZA state is unchanged by F2. - -Again the analogy is with passing or not passing a `pseudo_za_state` -reference to F2. + [`__arm_new("za")`](#arm_new) to the function definition. Functions that have ZA state can use the [SME instruction intrinsics](#sme-instruction-intrinsics) -to manipulate that state. These intrinsics themselves act as shared-ZA -functions and so share ZA state with their callers. - -### `asm` statements and ZA - -Some ACLE implementations might support the GNU “inline asm” extension. -For implementations that do, suppose that an inline asm occurs in a -function F. There are then two cases: - -1. If F has [ZA state](#za-state), PSTATE.ZA is guaranteed to be 1 on entry - to the inline asm. The inline asm must finish with PSTATE.ZA equal - to 1, otherwise the behavior is undefined. - - The inline asm can indicate that it reads the current contents of ZA - and/or that it changes the contents of ZA by adding `"za"` to the - asm's clobber list. Using the clobber list for this purpose is a - syntactic convenience: it does not fit the normal semantics for clobbers. - - If the inline asm does not have a `"za"` clobber but nevertheless - reads the current contents of ZA or changes the contents of ZA, - the behavior is undefined. - -2. If F does not have [ZA state](#za-state), the inline asm must - “comply with the lazy saving scheme”, in the sense of - [[AAPCS64]](#AAPCS64). The behavior in other cases is undefined. - - The inline asm is [ill-formed](#ill-formed) if it has a `"za"` clobber. +to manipulate that state. ## ZT0 Lookup Table @@ -8879,9 +9218,9 @@ The function type attributes cannot be used with K&R-style ``` c #define ATTR __arm_streaming - typedef int ft1() ATTR; // Ill-formed in C, OK in C++ - int f1() ATTR { ... } // Ill-formed in C18 and earlier, OK in + typedef int ft1() ATTR; // Ill-formed in C, C18 and earlier, OK in // later versions of C and in C++ + int f1() ATTR { ... } // Likewise typedef int ft2(void) ATTR; // OK int f2(void) ATTR { ... } // OK ``` @@ -8964,158 +9303,6 @@ an [`__arm_streaming`](#arm_streaming) type. See [Changing streaming mode locally](#changing-streaming-mode-locally) for more information. -### SME attributes relating to ZA - - - -#### `__arm_shared_za` - -This [keyword attribute](#keyword-attributes) applies to **function types** -and specifies the following: - -* The function has [ZA state](#za-state). - -* The function shares ZA state with the function's callers: the function - can use ZA to receive data from its callers and to pass data back to - its callers. - -* When the hardware supports SME2, the function has [ZT state](#zt-state). - The function's ZT state is created on entry to the function and destroyed - on return from the function. That is, the function does not use ZT0 - to receive data from callers or to pass data back to callers. - -* If the function forms part of the object code's ABI, that object code - function has a “shared-ZA interface”; see [[AAPCS64]](#AAPCS64) - for more details. - - - -#### `__arm_new_za` - -This [keyword attribute](#keyword-attributes) applies to **function -definitions**. It specifies the following: - -* The function has [ZA state](#za-state). - -* The function does not use ZA to receive data from callers or to pass - data back to callers. - -* Instead, the [abstract machine](#abstract-machine) creates new ZA state - on entry to the function and destroys the state on return from the - function. - -* The abstract machine initializes each byte of the function's ZA state - to zero. - -* When the hardware supports SME2, the function has [ZT state](#zt-state). - -This attribute does not change a function's binary interface. If the -function forms part of the object code's ABI, that object code function -has a “private-ZA interface”, just like all other non-`__arm_shared_za` -functions do. See [[AAPCS64]](#AAPCS64) for more details about -private-ZA interfaces. - -A function definition with this attribute is [ill-formed](#ill-formed) -if the function's type has an [`__arm_shared_za`](#arm_shared_za) -attribute or an [`__arm_preserves_za`](#arm_preserves_za) attribute. - - - -#### `__arm_preserves_za` - -This [keyword attribute](#keyword-attributes) applies to **function types** -and is simply an optimization hint to the compiler; it is never needed -for correctness. It can be attached to any function type, including: - -* a function type that has an [`__arm_shared_za`](#arm_shared_za) - keyword attribute - -* the type of a function whose definition has an - [`__arm_new_za`](#arm_new_za) keyword attribute - -* the type of a function whose definition has no ZA state - -The attribute specifies that the function “preserves ZA”, in the sense -of [[AAPCS64]](#AAPCS64). The mapping of this PCS concept to C and C++ -depends on whether the function is [shared-ZA](#shared-za) or -[private-ZA](#private-za): - -* If the function is [shared-ZA](#shared-za), the attribute guarantees - that the contents of the shared [ZA state](#za-state) on return from - the function are the same as the contents of the shared ZA state on - entry to the function. Either the function does not change the ZA - state at all, or the function undoes any changes to the ZA state - before returning. - - Note: C and C++ calls from shared-ZA functions to private-ZA - functions are defined to preserve ZA state. Such calls do not affect - whether a function can be marked [`__arm_preserves_za`](#arm_preserves_za). - -* If the function is [private-ZA](#private-za), the attribute guarantees - that a call to the function does not directly or indirectly involve - any of the following: - - * a call to an [`__arm_new_za`](#arm_new_za) function - - * a call to `setjmp` - - * a call to [`__arm_za_disable`](#arm_za_disable) - - * a call to a non-C/C++ function that “commits a lazy save”, in the - sense of the [[APCS64]](#AAPCS64) - - * an inline asm that commits a lazy save - - * the catching of an exception - - The platform may place additional requirements as well. - -* ZT state is also considered preserved when a function is marked with - [`arm_preserves_za`](#arm_preserves_za). - -In both cases, the onus is on the definition of the function to honor -the guarantee that is being made. The attribute does not direct the -compiler to do anything to honor the guarantee. - -If a function with an `__arm_preserves_za` type does not preserve ZA, -the behavior is undefined. (There is an analogy with functions -that are declared `noreturn` but do in fact return, and to functions -that are declared `const` but do in fact change memory.) - -The attribute is intended to be useful for functions at API boundaries, -where the compiler might not have access to the definition of the -function being called. As the description above implies, attaching -`__arm_preserves_za` to a [private-ZA](#private-za) function is quite -a low-level feature, but it is useful for -[streaming-compatible versions of standard routines](#streaming-compatible-versions-of-standard-routines) -and could be useful for things like vector math routines. - -Function types with this attribute implicitly convert to function types -that do not have the attribute. However, the reverse is not true. For example: - -``` c - void (*ptr1)(void) __arm_preserves_za; - void (*ptr2)(void); - void f1(void) __arm_preserves_za; - void f2(void); - - void code() { - ptr1 = ptr2; // Ill-formed - ptr1 = f1; // OK - ptr1 = f2; // Ill-formed - - ptr2 = ptr1; // OK - ptr2 = f1; // OK - ptr2 = f2; // OK - } -``` - ## SME types ### Predicate-as-counter @@ -9297,7 +9484,7 @@ ZA array vectors. The intrinsics model this in the following way: ``` c // Reads 2 consecutive horizontal tile slices from ZA into multi-vector. svint8x2_t svread_hor_za8_s8_vg2(uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); ``` * Intrinsic functions have a `_vg1x2`, `_vg1x4` suffix if the function @@ -9314,7 +9501,7 @@ ZA array vectors. The intrinsics model this in the following way: // SMLAL intrinsic for 2 quad-vector groups. void svmla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` * Intrinsic functions that take a multi-vector operand may have additional @@ -9327,15 +9514,15 @@ ZA array vectors. The intrinsics model this in the following way: ``` c void svmla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` @@ -9345,7 +9532,7 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svld1_hor_za8(uint64_t tile, uint32_t slice, svbool_t pg, const void *ptr) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the // address given by ptr. @@ -9353,12 +9540,12 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svld1_hor_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg, const void *ptr, int64_t vnum) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svld1_ver_za8(uint64_t tile, uint32_t slice, svbool_t pg, const void *ptr) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the // address given by ptr. @@ -9366,20 +9553,20 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svld1_ver_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg, const void *ptr, int64_t vnum) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### LDR ``` c void svldr_za(uint32_t slice, const void *ptr) - __arm_streaming_compatible __arm_shared_za; + __arm_streaming_compatible __arm_inout("za"); // Adds vnum to slice and vnum * svcntsb() to the address given by ptr. // This can be done in a single instruction if vnum is a constant in the // range [0, 15]. The intrinsic is synthetic for other vnum parameters. void svldr_vnum_za(uint32_t slice, const void *ptr, int64_t vnum) - __arm_streaming_compatible __arm_shared_za; + __arm_streaming_compatible __arm_inout("za"); ``` #### ST1B, ST1H, ST1W, ST1D, ST1Q @@ -9388,7 +9575,7 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svst1_hor_za8(uint64_t tile, uint32_t slice, svbool_t pg, void *ptr) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the // address given by ptr. @@ -9396,12 +9583,12 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svst1_hor_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg, void *ptr, int64_t vnum) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svst1_ver_za8(uint64_t tile, uint32_t slice, svbool_t pg, void *ptr) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the // address given by ptr. @@ -9409,20 +9596,20 @@ ZA array vectors. The intrinsics model this in the following way: // Also for _za16, _za32, _za64 and _za128 (with the same prototype). void svst1_ver_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg, void *ptr, int64_t vnum) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); ``` #### STR ``` c void svstr_za(uint32_t slice, void *ptr) - __arm_streaming_compatible __arm_shared_za __arm_preserves_za; + __arm_streaming_compatible __arm_in("za"); // Adds vnum to slice and vnum * svcntsb() to the address given by ptr. // This can be done in a single instruction if vnum is a constant in the // range [0, 15]. The intrinsic is synthetic for other vnum parameters. void svstr_vnum_za(uint32_t slice, void *ptr, int64_t vnum) - __arm_streaming_compatible __arm_shared_za __arm_preserves_za; + __arm_streaming_compatible __arm_in("za"); ``` #### MOVA @@ -9436,27 +9623,27 @@ parameter both have type `svuint8_t`. // And similarly for u8. svint8_t svread_hor_za8[_s8]_m(svint8_t zd, svbool_t pg, uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // And similarly for u16, bf16 and f16. svint16_t svread_hor_za16[_s16]_m(svint16_t zd, svbool_t pg, uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // And similarly for u32 and f32. svint32_t svread_hor_za32[_s32]_m(svint32_t zd, svbool_t pg, uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // And similarly for u64 and f64. svint64_t svread_hor_za64[_s64]_m(svint64_t zd, svbool_t pg, uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // And similarly for s16, s32, s64, u8, u16, u32, u64, bf16, f16, f32, f64 svint8_t svread_hor_za128[_s8]_m(svint8_t zd, svbool_t pg, uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); ``` Replacing `_hor` with `_ver` gives the associated vertical forms. @@ -9469,27 +9656,27 @@ the `zn` parameter to the `_u8` intrinsic has type `svuint8_t`. // And similarly for u8. void svwrite_hor_za8[_s8]_m(uint64_t tile, uint32_t slice, svbool_t pg, svint8_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // And similarly for u16, bf16 and f16. void svwrite_hor_za16[_s16]_m(uint64_t tile, uint32_t slice, svbool_t pg, svint16_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // And similarly for u32 and f32. void svwrite_hor_za32[_s32]_m(uint64_t tile, uint32_t slice, svbool_t pg, svint32_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // And similarly for u64 and f64. void svwrite_hor_za64[_s64]_m(uint64_t tile, uint32_t slice, svbool_t pg, svint64_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // And similarly for s16, s32, s64, u8, u16, u32, u64, bf16, f16, f32, f64 void svwrite_hor_za128[_s8]_m(uint64_t tile, uint32_t slice, svbool_t pg, svint8_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` Replacing `_hor` with `_ver` gives the associated vertical forms. @@ -9499,21 +9686,21 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svaddha_za32[_s32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint32_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svaddha_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint32_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svaddha_za64[_s64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint64_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svaddha_za64[_u64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint64_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### ADDVA @@ -9521,21 +9708,21 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svaddva_za32[_s32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint32_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svaddva_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint32_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svaddva_za64[_s64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint64_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svaddva_za64[_u64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint64_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMOPA, FMOPA (widening), SMOPA, UMOPA @@ -9543,11 +9730,11 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svmopa_za32[_bf16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmopa_za32[_f16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat16_t zn, svfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint8_t zn, svint8_t zm) @@ -9555,17 +9742,17 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. void svmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint8_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svmopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint16_t zn, svuint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMOPA (non-widening) @@ -9573,12 +9760,12 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svmopa_za32[_f32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat32_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_F64F64 != 0 void svmopa_za64[_f64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat64_t zn, svfloat64_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMOPS, FMOPS (widening), SMOPS, UMOPS @@ -9586,29 +9773,29 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svmops_za32[_bf16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmops_za32[_f16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat16_t zn, svfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmops_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svmops_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint8_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svmops_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svmops_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint16_t zn, svuint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMOPS (non-widening) @@ -9616,12 +9803,12 @@ Replacing `_hor` with `_ver` gives the associated vertical forms. ``` c void svmops_za32[_f32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat32_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_F64F64 != 0 void svmops_za64[_f64]_m(uint64_t tile, svbool_t pn, svbool_t pm, svfloat64_t zn, svfloat64_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### RDSVL @@ -9631,19 +9818,19 @@ The following intrinsics read the length of a streaming vector: ``` c // Return the number of bytes in a streaming vector. // Equivalent to svcntb() when called in streaming mode. - uint64_t svcntsb() __arm_streaming_compatible __arm_preserves_za; + uint64_t svcntsb() __arm_streaming_compatible; // Return the number of halfwords in a streaming vector. // Equivalent to svcnth() when called in streaming mode. - uint64_t svcntsh() __arm_streaming_compatible __arm_preserves_za; + uint64_t svcntsh() __arm_streaming_compatible; // Return the number of words in a streaming vector. // Equivalent to svcntw() when called in streaming mode. - uint64_t svcntsw() __arm_streaming_compatible __arm_preserves_za; + uint64_t svcntsw() __arm_streaming_compatible; // Return the number of doublewords in a streaming vector. // Equivalent to svcntd() when called in streaming mode. - uint64_t svcntsd() __arm_streaming_compatible __arm_preserves_za; + uint64_t svcntsd() __arm_streaming_compatible; ``` `svcntsb()` is equivalent to an RDSVL instruction with an immediate @@ -9665,12 +9852,12 @@ possible to write these operations using normal C arithmetic. For example: ``` c void svsumopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint8_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svsumopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svuint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### SUMOPS @@ -9678,12 +9865,12 @@ possible to write these operations using normal C arithmetic. For example: ``` c void svsumops_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint8_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svsumops_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svuint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### USMOPA @@ -9691,12 +9878,12 @@ possible to write these operations using normal C arithmetic. For example: ``` c void svusmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svusmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### USMOPS @@ -9704,21 +9891,21 @@ possible to write these operations using normal C arithmetic. For example: ``` c void svusmops_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Only if __ARM_FEATURE_SME_I16I64 != 0 void svusmops_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### ZERO ``` c void svzero_mask_za(uint64_t tile_mask) - __arm_streaming_compatible __arm_shared_za; + __arm_streaming_compatible __arm_inout("za"); - void svzero_za() __arm_streaming_compatible __arm_shared_za; + void svzero_za() __arm_streaming_compatible __arm_out("za"); ``` ### SME2 instruction intrinsics @@ -9741,7 +9928,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_write[_single]_za32[_s32]_vg1x2(uint32_t slice, svint32x2_t zn, svint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9751,7 +9938,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_write[_single]_za32[_s32]_vg1x4(uint32_t slice, svint32x4_t zn, svint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9761,7 +9948,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_write[_single]_za32[_u32]_vg1x2(uint32_t slice, svuint32x2_t zn, svuint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9771,7 +9958,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_write[_single]_za32[_u32]_vg1x4(uint32_t slice, svuint32x4_t zn, svuint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### ADD, SUB (store into ZA, multi) @@ -9789,7 +9976,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_write_za32[_s32]_vg1x2(uint32_t slice, svint32x2_t zn, svint32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9799,7 +9986,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_write_za32[_s32]_vg1x4(uint32_t slice, svint32x4_t zn, svint32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9809,7 +9996,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_write_za32[_u32]_vg1x2(uint32_t slice, svuint32x2_t zn, svuint32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9819,7 +10006,7 @@ the result is written directly into ZA. // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_write_za32[_u32]_vg1x4(uint32_t slice, svuint32x4_t zn, svuint32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### ADD (vectors) @@ -9852,7 +10039,7 @@ Multi-vector add/sub and accumulate into ZA // _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9863,7 +10050,7 @@ Multi-vector add/sub and accumulate into ZA // _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svadd_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9874,7 +10061,7 @@ Multi-vector add/sub and accumulate into ZA // _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9885,7 +10072,7 @@ Multi-vector add/sub and accumulate into ZA // _za64[_s64] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u64] (only if __ARM_FEATURE_SME_I16I64 != 0) void svsub_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFCVTN, FCVTN @@ -9980,7 +10167,7 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot[_single]_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -9994,27 +10181,27 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot[_single]_za32[_bf16]_vg1x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot[_single]_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot[_single]_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot[_single]_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot[_single]_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, multi) @@ -10033,7 +10220,7 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10047,23 +10234,23 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot_za32[_bf16]_vg1x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn, svuint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn, svint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, indexed) @@ -10082,7 +10269,7 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot_lane_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10096,27 +10283,27 @@ Multi-vector dot-product (2-way and 4-way) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svdot_lane_za32[_bf16]_vg1x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot_lane_za32[_s8]_vg1x2(uint32_t slice, svint8x2_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsudot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot_lane_za32[_u8]_vg1x2(uint32_t slice, svuint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusdot_lane_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FVDOT, BFVDOT, SUVDOT, USVDOT, SVDOT, UVDOT @@ -10126,18 +10313,18 @@ Multi-vector vertical dot-product by indexed element. ``` c void svsuvdot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusvdot_lane_za32[_u8]_vg1x4(uint32_t slice, svuint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svvdot_lane_za32[_bf16]_vg1x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10147,7 +10334,7 @@ Multi-vector vertical dot-product by indexed element. // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svvdot_lane_za32[_s8]_vg1x4(uint32_t slice, svint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### UMOPA, SMOPA, UMOPS, SMOPS @@ -10158,13 +10345,13 @@ Integer sum of outer products and accumulate/subtract (2-way) // Variants are also available for _za32[_u16] void svmopa_za32[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_u16] void svmops_za32[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm, svint16_t zn, svint16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BMOPA, BMOPS @@ -10175,13 +10362,13 @@ Bitwise exclusive NOR population count outer product and accumulate/subtract // Variants are also available for _za32[_s32] void svbmopa_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint32_t zn, svuint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_s32] void svbmops_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm, svuint32_t zn, svuint32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLA, FMLS (single) @@ -10194,7 +10381,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla[_single]_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10202,7 +10389,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla[_single]_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10210,7 +10397,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls[_single]_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10218,7 +10405,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls[_single]_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLA, FMLS (multi) @@ -10231,7 +10418,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10239,7 +10426,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10247,7 +10434,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10255,7 +10442,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLA, FMLS (indexed) @@ -10268,7 +10455,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla_lane_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10276,7 +10463,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmla_lane_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10284,7 +10471,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls_lane_za32[_f32]_vg1x2(uint32_t slice, svfloat32x2_t zn, svfloat32_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10292,7 +10479,7 @@ Multi-vector floating-point fused multiply-add/subtract // _za64[_f64] (only if __ARM_FEATURE_SME_F64F64 != 0) void svmls_lane_za32[_f32]_vg1x4(uint32_t slice, svfloat32x4_t zn, svfloat32_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLAL, BFMLAL, SMLAL, UMLAL (single) @@ -10303,19 +10490,19 @@ Multi-vector multiply-add long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla[_single]_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla[_single]_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLAL, BFMLAL, SMLAL, UMLAL (multi) @@ -10326,13 +10513,13 @@ Multi-vector multiply-add long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### FMLAL, BFMLAL, SMLAL, UMLAL (indexed) @@ -10343,19 +10530,19 @@ Multi-vector multiply-add long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_lane_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_lane_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmla_lane_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMLSL, FMLSL, UMLSL, SMLSL (single) @@ -10366,19 +10553,19 @@ Multi-vector multiply-subtract long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls[_single]_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls[_single]_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMLSL, FMLSL, UMLSL, SMLSL (multi) @@ -10389,13 +10576,13 @@ Multi-vector multiply-subtract long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMLSL, FMLSL, UMLSL, SMLSL (indexed) @@ -10406,19 +10593,19 @@ Multi-vector multiply-subtract long (widening) // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_lane_za32[_bf16]_vg2x1(uint32_t slice, svbfloat16_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_lane_za32[_bf16]_vg2x2(uint32_t slice, svbfloat16x2_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za32[_f16], _za32[_s16] and _za32[_u16] void svmls_lane_za32[_bf16]_vg2x4(uint32_t slice, svbfloat16x4_t zn, svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### UMLALL, SMLALL, USMLALL, SUMLALL (single) @@ -10432,7 +10619,7 @@ Multi-vector multiply-add long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10442,7 +10629,7 @@ Multi-vector multiply-add long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10452,35 +10639,35 @@ Multi-vector multiply-add long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svuint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_za32[_u8]_vg4x1(uint32_t slice, svuint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla[_single]_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla[_single]_za32[_u8]_vg4x4(uint32_t slice, svuint8x4_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### UMLALL, SMLALL, USMLALL, SUMLALL (multi) @@ -10494,7 +10681,7 @@ Multi-vector multiply-add long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10503,23 +10690,23 @@ Multi-vector multiply-add long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svuint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svuint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn, svint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_za32[_u8_vg4x4(uint32_t slice, svuint8x4_t zn, svint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### UMLALL, SMLALL, USMLALL, SUMLALL (indexed) @@ -10534,7 +10721,7 @@ Multi-vector multiply-add long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10544,7 +10731,7 @@ Multi-vector multiply-add long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10554,37 +10741,37 @@ Multi-vector multiply-add long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmla_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svsumla_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svuint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_lane_za32[_u8]_vg4x1(uint32_t slice, svuint8_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_lane_za32[_u8]_vg4x2(uint32_t slice, svuint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); void svusmla_lane_za32[_u8]_vg4x4(uint32_t slice, svuint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### SMLSLL, UMLSLL (single) @@ -10598,7 +10785,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10608,7 +10795,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls[_single]_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10618,7 +10805,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls[_single]_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### SMLSLL, UMLSLL (multi) @@ -10632,7 +10819,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8x2_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10641,7 +10828,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_s16] (only if __ARM_FEATURE_SME_I16I64 != 0) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8x4_t zm) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### SMLSLL, UMLSLL (indexed) @@ -10656,7 +10843,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_lane_za32[_s8]_vg4x1(uint32_t slice, svint8_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10666,7 +10853,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_lane_za32[_s8]_vg4x2(uint32_t slice, svint8x2_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are available for: @@ -10676,7 +10863,7 @@ Multi-vector multiply-subtract long long (widening) // _za64[_u16] (only if __ARM_FEATURE_SME_I16I64 != 0) void svmls_lane_za32[_s8]_vg4x4(uint32_t slice, svint8x4_t zn, svint8_t zm, uint64_t imm_idx) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### BFMLSLB, BFMLSLT @@ -11221,11 +11408,11 @@ Spill and fill of ZT0 ``` c void svldr_zt(uint64_t zt, const void *rn) - __arm_streaming_compatible __arm_shared_za __arm_preserves_za; + __arm_streaming_compatible __arm_inout("zt0"); void svstr_zt(uint64_t zt, void *rn) - __arm_streaming_compatible __arm_shared_za __arm_preserves_za; + __arm_streaming_compatible __arm_in("zt0"); ``` #### ZERO @@ -11234,7 +11421,7 @@ Zero ZT0 ``` c void svzero_zt(uint64_t zt) - __arm_streaming_compatible __arm_shared_za __arm_preserves_za; + __arm_streaming_compatible __arm_out("zt0"); ``` #### LUTI2, LUTI4 @@ -11245,41 +11432,41 @@ Lookup table read with 2-bit and 4-bit indexes // Variants are also available for _zt_u8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8_t svluti2_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); // Variants are also available for _zt_u8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x2_t svluti2_lane_zt_s8_x2(uint64_t zt, svuint8_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); // Variants are also available for _zt_u8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x4_t svluti2_lane_zt_s8_x4(uint64_t zt, svuint8_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); // Variants are also available for _zt_u8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8_t svluti4_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); // Variants are also available for _zt_u8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x2_t svluti4_lane_zt_s8_x2(uint64_t zt, svuint8_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); // Variants are also available for _zt_u16, _zt_f16, _zt_bf16, _zt_s32, // _zt_u32 and _zt_f32 svint16x4_t svluti4_lane_zt_s16_x4(uint64_t zt, svuint16_t zn, uint64_t imm_idx) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("zt0"); ``` #### MOVA @@ -11291,84 +11478,84 @@ Move multi-vectors to/from ZA // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_hor_za8_s8_vg2(uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8_u8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_hor_za8_s8_vg4(uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8_u8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_ver_za8_s8_vg2(uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8_u8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_ver_za8_s8_vg4(uint64_t tile, uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8_u8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_za8_s8_vg1x2(uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8_u8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_za8_s8_vg1x4(uint32_t slice) - __arm_streaming __arm_shared_za __arm_preserves_za; + __arm_streaming __arm_in("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_hor_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_hor_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_ver_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_ver_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_za8[_s8]_vg1x2(uint32_t slice, svint8x2_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); // Variants are also available for _za8[_u8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_za8[_s8]_vg1x4(uint32_t slice, svint8x4_t zn) - __arm_streaming __arm_shared_za; + __arm_streaming __arm_inout("za"); ``` #### PTRUE @@ -11736,16 +11923,16 @@ are named after. All of the functions have external linkage. ``` c void *__arm_sc_memcpy(void *dest, const void *src, size_t n) - __arm_streaming_compatible __arm_preserves_za; + __arm_streaming_compatible; void *__arm_sc_memmove(void *dest, const void *src, size_t n) - __arm_streaming_compatible __arm_preserves_za; + __arm_streaming_compatible; void *__arm_sc_memset(void *s, int c, size_t n) - __arm_streaming_compatible __arm_preserves_za; + __arm_streaming_compatible; void *__arm_sc_memchr(void *s, int c, size_t n) - __arm_streaming_compatible __arm_preserves_za; + __arm_streaming_compatible; ```