Added optimized ppc64le support functions for ML-KEM. #1184
base: main
Conversation
Thank you, @dannytsen, this is an exciting contribution 🎉
I think as the first stage of review, the goal should be to get your changes through CI and extend it so that the PPC64 backend is exercised (to this end: do you know if your assembly works with qemu-ppc64le, and what flags are needed?).
In a second phase, we can dive into the backend itself and hopefully convince ourselves that it is functionally correct and upholds the assumptions made by the frontend.
I left a few comments to kick things off, but additionally I can see that there are failures related to autogen
and format
, so a good starting point would be to resolve those. You should be able to run simpasm
with a PPC cross compiler to get simplified assembly that you can check in to main source tree.
@hanno-becker I believe the code will work on qemu-ppc64le even though I did not run it there. My testing platforms are p9 and p10 systems. I will go through the comments and fix the issues. Thanks.
#
#include "../../../common.h"
You probably want to guard those files by MLK_ARITH_BACKEND_PPC64LE_DEFAULT
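For illustration, a minimal sketch of what such a guard could look like (hypothetical file body; it reuses the macro named above and the MLK_CONFIG_MULTILEVEL_NO_SHARED convention visible in the AArch64 sources quoted later in this thread):

#include "../../../common.h"

#if defined(MLK_ARITH_BACKEND_PPC64LE_DEFAULT) && \
    !defined(MLK_CONFIG_MULTILEVEL_NO_SHARED)

/* ... ppc64le assembly / definitions ... */

#endif /* MLK_ARITH_BACKEND_PPC64LE_DEFAULT && !MLK_CONFIG_MULTILEVEL_NO_SHARED */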
#include "../../../common.h"

.machine "any"
What does this do? Does it have any effect to specify "any" machine?
@dannytsen Please see https://github.com/pq-code-package/mlkem-native/commits/ppc64le_backend for the changes to get the asm through the usual format/autogen/simpasm pipeline. Feel free to amend your commit(s). At least the base CI is happy with this: https://github.com/pq-code-package/mlkem-native/actions/runs/17640154327

NOTE: The resulting ASM in mlkem/* is currently unusable because the references to the .data section have been messed up during simpasm. As mentioned above, please see if you can follow the approach from the AArch64 backend: define the NTT and invNTT twiddle tables in *.c and pass them to the ASM routines as arguments. The other constants can be generated in the code itself, as in https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/ntt.S#L79 for example.

If it's inconvenient to do this, you can also go with a single large constant table including all constants you need, pass a pointer to that table to each ASM function, and load from a suitable offset in the ASM. This is the approach used in the x86_64 backend, see dev/x86_64/src/consts.c.
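For illustration, a rough C sketch of the suggested layout (the names below are hypothetical, not the actual mlkem-native API): the twiddle table lives in a C translation unit and is handed to the assembly routine as an argument, so no .data references survive into the simplified assembly.

#include <stdint.h>

/* Forward-NTT twiddles in the order the assembly consumes them.
 * Values elided here; the real table would hold the zeta constants. */
static const int16_t mlk_ppc64le_ntt_twiddles[128] = {0 /* ... */};

/* Hypothetical assembly entry point taking the table as a second argument. */
void mlk_ntt_asm_ppc64le(int16_t r[256], const int16_t *twiddles);

static void mlk_ntt_native(int16_t r[256])
{
  mlk_ntt_asm_ppc64le(r, mlk_ppc64le_ntt_twiddles);
}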
@hanno-becker Thanks for the pointer. But I am not a Python programmer and can't really comprehend Python, so changing the scripts will not be my first choice. I just want to get simpasm to work on my code. I can change my code to use a data array from a C file. But I need an example (a command line example) to generate the simplified assembly. So, where do you run simpasm from? From the scripts directory or the dev directory? And what options do I need to pass? Like simpasm -???? Just an example for x86 or Arm will be fine. I just want to know how to run it so I can fix my assembly code accordingly. Thanks. I have a t.S file with the .data section stripped, and here is the output for your reference, so you know what I am talking about.

[07:06] danny@ltc-zz4-lp9 dev % ../scripts/simpasm -i ${PWD}/ppc64le/src/t.S
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
@dannytsen I have basically done this for you in the branch (on top of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having them in a .data section. If you check out the branch and enter the
@hanno-becker Sure. I can do that.
@hanno-becker BTW, there is no nix for ppc.
@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64LE cross compiler, which is already part of the environment established by nix.
@hanno-becker Ok. I'll check that. Thanks.
@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or on real HW only?
@hanno-becker The code was run successfully in liboqs and the mlkem-native project on HW. The code was originally written for liboqs.
@dannytsen Independent of the work of separating the twiddles from the assembly: I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current
This gives:
If I comment out
It could just be some CPU configuration missing. Are you assuming a particular vector length, for example? I can also see the code failing when running under
Bottom line: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please?
@hanno-becker That means either it doesn't work with p8 or some instructions are not supported in qemu.
@hanno-becker Here is my output from p9.

[00:01] danny@ltc-zz4-lp9 mlkem-native_dev % make test
@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The
Can you find out which one it is? The PR documentation states that the ASM works on P8 upwards.
@hanno-becker It looks like the qemu cross compiler doesn't support the "xxpermdi" instruction. I'll check.
@dannytsen I don't know. You should be able to find out by assembling a minimal example and using
I'll check.
@hanno-becker I don't have qemu on my system. But I installed nix on my Mac and ran the following command under nix. These are the final build outputs:

FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024

I'll check about qemu.
Please see #1184 (comment) again -- once in the
# ppc64le backend (little endian)

This directory contains a native backend for little endian POWER 8 (ppc64le) and above systems.
As discussed before, I couldn't yet run the code with power8 emulation, and I think you mentioned you only tested power9 and power10. Are you sure it works on power8?
@dannytsen I note that you do not cache the twisted twiddles in the Montgomery multiplication. That is, in
vmladduhm 15, 15, V_QINV, 3
vmladduhm 20, 20, V_QINV, 3
vmladduhm 25, 25, V_QINV, 3
vmladduhm 30, 30, V_QINV, 3
this could all be precomputed and stored in the constant table. This is what the AArch64 and x86_64 backends do.
Have you considered that?
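For illustration, the precomputation being suggested amounts to something like the following (plain C, helper name hypothetical). Since multiplication modulo 2^16 is associative, low16(low16(c*zeta) * QINV) equals low16(c * (zeta*QINV mod 2^16)), so the per-twiddle factor zeta*QINV can be stored in the constant table once and the runtime multiply by V_QINV dropped. Note that V_QINV = -3327 is congruent to 62209 = q^-1 mod 2^16.

#include <stdint.h>

#define MLKEM_Q    3329
#define MLKEM_QINV 62209 /* q^-1 mod 2^16; equals -3327 as a signed 16-bit value */

/* "Twisted" twiddle: low 16 bits of zeta * q^-1, to be consumed directly by
 * the low-multiply step (vmladduhm) of the Montgomery reduction. */
static int16_t mlk_twist_twiddle(int16_t zeta)
{
  return (int16_t)(uint16_t)((int32_t)zeta * MLKEM_QINV);
}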
# fqmul = zeta * coefficient
# Modular multiplication bounded by 2^16 * q in abs value
vmladduhm 15, 13, \_vz0, 3
vmladduhm 20, 18, \_vz1, 3
vmladduhm 25, 23, \_vz2, 3
vmladduhm 30, 28, \_vz3, 3

# Signed multiply-high-round; outputs are bounded by 2^15 * q in abs value
vmhraddshs 14, 13, \_vz0, 3
vmhraddshs 19, 18, \_vz1, 3
vmhraddshs 24, 23, \_vz2, 3
vmhraddshs 29, 28, \_vz3, 3

vmladduhm 15, 15, V_QINV, 3
vmladduhm 20, 20, V_QINV, 3
vmladduhm 25, 25, V_QINV, 3
vmladduhm 30, 30, V_QINV, 3

vmhraddshs 15, 15, V_NMKQ, 14
vmhraddshs 20, 20, V_NMKQ, 19
vmhraddshs 25, 25, V_NMKQ, 24
vmhraddshs 30, 30, V_NMKQ, 29

vsrah 13, 15, 4 # >> 1
vsrah 18, 20, 4 # >> 1
vsrah 23, 25, 4 # >> 1
vsrah 28, 30, 4 # >> 1
@dannytsen @bhess You tread new ground here, algorithmically, I think? This is essentially Montgomery via Rounding but this has been documented to require an odd operand, which is not given (the original idea was that one would bump twiddles by Q to make them odd if necessary, but that's not happening here). Interestingly, the off-by-plus-one error that may (and does!) emerge from this is then compensated for by the >> 1, an observation that I don't think has been made in the literature.
I've exhaustively tested the equivalence between the above Montmul and a normal one here:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>
#define MLKEM_Q 3329
// PPC instruction equivalents for int16
int16_t vmladduhm(int16_t a, int16_t b, int16_t c) {
return (a * b + c) & 0xFFFF;
}
int16_t vmhraddshs(int16_t a, int16_t b, int16_t c) {
int32_t doubled_mul = 2 * (int32_t)a * (int32_t)b;
int16_t high = (doubled_mul + 0x8000) >> 16;
return high + c;
}
int16_t vsrah(int16_t a, int shift) {
return (int32_t) a >> shift;
}
// PPC Montgomery multiplication
int16_t montgomery_mult_ppc(int16_t coeff, int16_t zeta) {
const int16_t V_QINV = -3327;
const int16_t V_NMKQ = -3329;
int16_t r13 = coeff, r14, r15;
r15 = vmladduhm(r13, zeta, 0);
r14 = vmhraddshs(r13, zeta, 0);
r15 = vmladduhm(r15, V_QINV, 0);
r15 = vmhraddshs(r15, V_NMKQ, r14);
/* // We can tolerate an off-by-plus-one in the rounding which makes the result */
/* // odd, because the correct value is even and shifted by 1 in the end anyway. */
/* if ((r15 & 1) != 0) { */
/* printf("Assert failed: input to shift is odd: a=0x%04x (%d), b=0x%04x (%d)\n", */
/* coeff, coeff, zeta, zeta); */
/* } */
r13 = vsrah(r15, 1);
return r13;
}
// Ordinary Montgomery multiplication
int16_t montgomery_mult_c(int16_t a, int16_t b) {
const uint32_t QINV = 62209;
int32_t prod = (int32_t)a * (int32_t)b;
const uint16_t a_reduced = prod & UINT16_MAX;
const uint16_t a_inverted = (a_reduced * QINV) & UINT16_MAX;
const int16_t t = (int16_t)a_inverted;
int32_t r = prod - ((int32_t)t * MLKEM_Q);
r = r >> 16;
return (int16_t)r;
}
int main() {
printf("Exhaustively testing all int16 input pairs...\n");
long int passed = 0;
long int total = 0;
int16_t a_low = INT16_MIN;
int16_t a_high = INT16_MAX;
int16_t b_low = -29000;
int16_t b_high = 29000;
for (int a = a_low; a <= a_high; a++) {
for (int b = b_low; b <= b_high; b++) {
int16_t result_c = montgomery_mult_c((int16_t)a, (int16_t)b);
int16_t result_ppc = montgomery_mult_ppc((int16_t)a, (int16_t)b);
total++;
if (result_c == result_ppc) {
passed++;
} else {
printf("FAIL: a=0x%04x (%d) b=0x%04x (%d) c_result=0x%04x (%d) ppc_result=0x%04x (%d)\n",
(uint16_t)a, a,
(uint16_t)b, b,
(uint16_t)result_c, result_c,
(uint16_t)result_ppc, result_ppc);
}
}
}
printf("Passed %ld/%ld tests (%.6f%%)\n", passed, total, 100.0 * passed / total);
return passed == total ? 0 : 1;
}
Nice 😄
On the other hand, I don't think there is a reason here not to just use Barrett multiplication as in the AArch64 code: this would remove a) one low multiplication and b) the final >> 1, and would likely also pipeline better since it uses low-MLA rather than high-MLA. This would of course double the number of twiddles, though. What are your thoughts?
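To make the comparison concrete, here is a scalar C model of that Barrett multiplication, in the same style as the test program above. This is my reading of the AArch64 mulmod/mulmodq macros quoted later in this thread; the twisted constant would be roughly round(b * 2^15 / q) (round-to-nearest-even, per the [NeonNTT] reference cited in the assembly comments), so treat the details as a sketch rather than the definitive implementation.

#include <stdint.h>

#define MLKEM_Q 3329

/* Model of sqrdmulh (ignoring the saturation corner case, as above). */
static int16_t sqrdmulh(int16_t a, int16_t b)
{
  return (int16_t)((2 * (int32_t)a * (int32_t)b + (1 << 15)) >> 16);
}

/* Barrett multiplication by a fixed constant b with precomputed twist b_tw:
 * the result is congruent to a*b mod q and small in absolute value.
 * Mapping: sqrdmulh -> vmhraddshs, mul -> vmladduhm (zero accumulator),
 * mls -> vmladduhm with accumulator; no final shift is needed. */
static int16_t barrett_mult(int16_t a, int16_t b, int16_t b_tw)
{
  int16_t t  = sqrdmulh(a, b_tw);                  /* ~ round(a*b/q)   */
  int16_t lo = (int16_t)((int32_t)a * (int32_t)b); /* a*b mod 2^16     */
  return (int16_t)(lo - (int32_t)t * MLKEM_Q);     /* mls, mod 2^16    */
}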
@dannytsen @bhess A general note: I don't know anything about the Power microarchitectures, but depending on how OOO they are, whether latency/throughput/units are public knowledge, and how important performance is to you, you may consider adding a microarchitecture model for SLOTHY and applying that here. Ordinarily, that would be a fair amount of boilerplate work, but in today's world of LLMs, I would imagine that an LLM can do this pretty quickly for you if you feed it some microarchitecture documentation as input. @mkannwischer or I can of course help with generic SLOTHY questions. Obviously, this is not a blocker, but just an FYI for you.
#include "x86_64/meta.h" | ||
#endif | ||
|
||
#ifdef MLK_SYS_PPC64LE |
@dannytsen This needs to be guarded by feature flags indicating support for the extensions used by the backend. Could you adjust this?
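For illustration only (the exact flag names would have to match what mlkem-native and the toolchain actually provide, and the include path is assumed from the x86_64 line above): the selection could be tied both to the platform macro and to a compiler-provided indicator that the required vector extension is enabled, e.g. the GCC/Clang-defined __VSX__.

#if defined(MLK_SYS_PPC64LE) && defined(__VSX__)
#include "ppc64le/meta.h" /* assumed path, mirroring x86_64/meta.h */
#endif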
Removed non-p8 instruction, xxspltib. Signed-off-by: Danny Tsen <[email protected]>
@hanno-becker I did not read their code. I'll think about it.
@hanno-becker I don't know about SLOTHY. My code is all based on my experience and understanding of the PPC microarchitecture, which may not be optimal. I try to make my code easy for me to read and maintain later, but also to optimize performance. I'll spend some time on SLOTHY. Thanks.
Signed-off-by: Danny Tsen <[email protected]>
Is there such a thing as a Software Optimization Guide for Power (e.g. like this one for Cortex-A78)?
@hanno-becker @bhess As far as I know, there are none.
dev/ppc64le/src/intt_ppc.S
Outdated
.align 4
#
# Montgomery reduce loops with constant 1441
#
addi 14, 4, C1441_OFFSET
lvx V1441, 0, 14

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28
@dannytsen This will need to be moved to the beginning of the invNTT: we don't reduce the output of the base multiplication, which means that the input coefficients to the invNTT are essentially unconstrained in size. Frontloading the Montgomery scaling acts as a reduction.
This is likely the reason why you see the newly added unit tests failing in #1193.
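To spell out the suggestion in reference-style C (a sketch, not the actual backend code; mlk_fqmul here is a plain C model of the Montgomery multiplication whose bounds are quoted in the AArch64 NTT comments below): applying the constant 1441 = mont^2/128 mod q, the same constant the quoted loop uses, to every coefficient before the first Gentleman-Sande layer both performs the required scaling and brings arbitrary int16_t inputs back into a range below q.

#include <stdint.h>

#define MLKEM_Q 3329

/* Reference Montgomery multiplication (same arithmetic as montgomery_mult_c above). */
static int16_t mlk_fqmul(int16_t a, int16_t b)
{
  const uint32_t QINV = 62209; /* q^-1 mod 2^16 */
  int32_t prod = (int32_t)a * (int32_t)b;
  uint16_t lo = (uint16_t)(prod & 0xFFFF);
  int16_t t = (int16_t)(uint16_t)(lo * QINV);
  return (int16_t)((prod - (int32_t)t * MLKEM_Q) >> 16);
}

static void invntt_frontload_scaling(int16_t r[256])
{
  const int16_t f = 1441; /* mont^2/128 mod q */
  unsigned i;
  for (i = 0; i < 256; i++) {
    /* |r[i]| may be close to 2^15 on entry; after this, |r[i]| < q. */
    r[i] = mlk_fqmul(r[i], f);
  }
  /* ... Gentleman-Sande butterfly layers follow on the reduced data ... */
}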
@dannytsen I don't yet understand how the last two NTT layers handle data permutation. It seems to me that layers 1-5 load and store data without interleaving. But for layer 6, the butterflies operate at a distance of 4 coefficients, which is smaller than the vector size (8 coefficients), so I would expect some intra-vector rearrangement or a modified load (like ld2 on Arm). I see that the store operations do some interleaving, but I don't see anything up to and including the load for layer 6. What am I missing?
@dannytsen Looking at the invNTT, beyond the ask to move the scaling to the front (blocker for the PR), is there a particular reason why you Barrett-reduce the entire data at every layer? Especially when the scaling is moved to the front and the coefficients thereby reduced, you only need a few reductions in the middle of the invNTT. Have a look at the bounds estimates in https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L238. Performance improvements are not a blocker for the initial merge, but we should at least document them in the code.
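For reference, a scalar C model of the Barrett reduction that the per-layer vector reductions correspond to, using the same constant 20159 = round(2^26/q) that appears in the AArch64 constant setup below. With the 1441-scaling moved to the front, this would only need to be applied to a few coefficient vectors in the middle of the transform, per the bounds comments in the linked intt.S; the function name here is illustrative.

#include <stdint.h>

#define MLKEM_Q 3329

/* Returns a representative of a mod q with absolute value of roughly q/2 or less. */
static int16_t barrett_reduce(int16_t a)
{
  const int32_t v = 20159; /* round(2^26 / q) */
  int16_t t = (int16_t)((v * (int32_t)a + (1 << 25)) >> 26);
  return (int16_t)(a - t * MLKEM_Q);
}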
@hanno-becker I did not read the x86 or Arm code. I just coded it based on the C file. There may be something I missed that can be improved. I'm not a mathematician, so there may be some mistakes or an incorrect approach. Layer 6 could be reworked. The data is rearranged at the beginning and rearranged back, which may mean more work needs to be done. I will definitely look more into it. Thanks.
@hanno-becker My code was originally based on the C file for the Kyber NTT/INTT. I just followed the C code and did it in assembly. I am sure it may not be mathematically optimal and ideal.
@dannytsen Ok. We do need confidence in the correctness of the code, that's not negotiable. Performance is negotiable, but we leave a lot on the table without optimizations like a) lazy reduction or b) layer merging, and I'm not sure the effort of adding the PPC64 assembly is worth it if we don't unlock them. Ping @mkannwischer @bhess for other opinions on how to proceed here. Would you have time to provide a new version of the assembly for the NTT and invNTT which follows the same approach as the AArch64 [inv]NTT? The main points are:
I am as little an expert in PPC64LE as you are in AArch64, but I am sure that we can work out the translation. In fact, I asked an LLM to make a start at documenting the AArch64 NTT in such a way that will -- hopefully -- be of some use to you. I am pretty sure some of it will be wrong, but I did look more closely at what it did for the modular arithmetic primitives, and I think it's correct to say that there is a straightforward translation of Barrett multiplication into PPC64LE.

/* Copyright (c) 2022 Arm Limited
* Copyright (c) 2022 Hanno Becker
* Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
* Copyright (c) The mlkem-native project authors
* SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT
*/
/* ANNOTATED FOR PPC64LE TRANSLATION
* =================================
* This file contains the original AArch64 NTT implementation with detailed
* commentary explaining how each instruction maps to PPC64LE VSX equivalents.
*
* Key AArch64 -> PPC64LE instruction mappings:
* - NEON vectors (v0-v31) -> VSX vectors (v0-v31, accessed as VSR32-VSR63)
* - ld1/st1 -> lxvd2x/stxvx (with endianness correction via xxpermdi)
* - add/sub -> vadduhm/vsubuhm (unsigned halfword modulo)
* - mul -> vmladduhm with zero accumulator
* - sqrdmulh -> vmhraddshs (multiply-high-round-add signed halfword saturate)
* - mls -> vmladduhm with accumulator (effectively multiply-add)
* - trn1/trn2 -> xxmrglw/xxmrghw (merge low/high words)
*/
#include "../../../common.h"
#if defined(MLK_ARITH_BACKEND_AARCH64) && \
!defined(MLK_CONFIG_MULTILEVEL_NO_SHARED)
/* simpasm: header-end */
/* PPC64LE TRANSLATION: Montgomery Multiplication Macro
* ====================================================
* AArch64 uses sqrdmulh + mul + mls for Montgomery reduction.
* PPC64LE equivalent uses vmhraddshs + vmladduhm + vmladduhm (with zero/non-zero accumulators).
*
* AArch64 sqrdmulh: Signed saturating rounding doubling multiply high
* -> PPC64LE vmhraddshs: Vector multiply-high-round-add signed halfword saturate
*
* AArch64 mul: Vector multiply (low part)
* -> PPC64LE vmladduhm with zero accumulator: vmladduhm dst, src, const, zero_reg
*
* AArch64 mls: Vector multiply-subtract (dst = dst - src1*src2)
* -> PPC64LE vmladduhm with accumulator: vmladduhm dst, t2, neg_q, dst
*/
.macro mulmodq dst, src, const, idx0, idx1
// Signed barrett multiplication @[NeonNTT, Section 3.1.2] using
// round-to-nearest-even-integer approximation. Following loc.cit.,
// this is functionally the same as a signed Montgomery multiplication
// with a suitable constant of absolute value < q.
// AArch64: sqrdmulh t2.8h, src.8h, const.h[idx1]
// PPC64LE: vmhraddshs t2, src, const_high, zero_reg
// Computes high part of multiplication with rounding and saturation
sqrdmulh t2.8h, \src\().8h, \const\().h[\idx1\()]
// AArch64: mul dst.8h, src.8h, const.h[idx0]
// PPC64LE: vmladduhm dst, src, const_low, zero_reg (zero accumulator)
// Computes low part of multiplication (modulo 2^16)
mul \dst\().8h, \src\().8h, \const\().h[\idx0\()]
// AArch64: mls dst.8h, t2.8h, consts.h[0] (dst = dst - t2*q)
// PPC64LE: vmladduhm dst, t2, neg_q, dst (dst accumulator)
// Performs Montgomery reduction: dst = dst + (t2 * (-q))
mls \dst\().8h, t2.8h, consts.h[0]
.endm
/* PPC64LE TRANSLATION: Vector Montgomery Multiplication
* =====================================================
* This version uses full vector operands instead of indexed access.
* More suitable for PPC64LE where indexed vector operations are limited.
*/
.macro mulmod dst, src, const, const_twisted
// AArch64: sqrdmulh t2.8h, src.8h, const_twisted.8h
// PPC64LE: vmhraddshs t2, src, const_twisted, zero_reg
sqrdmulh t2.8h, \src\().8h, \const_twisted\().8h
// AArch64: mul dst.8h, src.8h, const.8h
// PPC64LE: vmladduhm dst, src, const, zero_reg (zero accumulator)
mul \dst\().8h, \src\().8h, \const\().8h
// AArch64: mls dst.8h, t2.8h, consts.h[0]
// PPC64LE: vmladduhm dst, t2, neg_q, dst (dst accumulator)
mls \dst\().8h, t2.8h, consts.h[0]
.endm
/* PPC64LE TRANSLATION: Cooley-Tukey Butterfly with Indexed Twiddles
* ==================================================================
* Implements: (a', b') = (a + t*w, a - t*w) where t = b, w = twiddle
*
* AArch64 approach:
* 1. Compute t = mulmodq(b, w) // Montgomery multiply b by twiddle
* 2. b' = a - t // Upper butterfly output
* 3. a' = a + t // Lower butterfly output
*
* PPC64LE equivalent uses the MREDUCE_4X macro pattern from existing code.
*/
.macro ct_butterfly a, b, root, idx0, idx1
// Step 1: Montgomery multiply b by twiddle factor
// AArch64: mulmodq tmp, b, root, idx0, idx1
// PPC64LE: Use MREDUCE_4X pattern with single twiddle
mulmodq tmp, \b, \root, \idx0, \idx1
// Step 2: Butterfly subtraction (upper output)
// AArch64: sub b.8h, a.8h, tmp.8h
// PPC64LE: vsubuhm b, a, tmp
sub \b\().8h, \a\().8h, tmp.8h
// Step 3: Butterfly addition (lower output)
// AArch64: add a.8h, a.8h, tmp.8h
// PPC64LE: vadduhm a, a, tmp
add \a\().8h, \a\().8h, tmp.8h
.endm
/* PPC64LE TRANSLATION: Cooley-Tukey Butterfly with Vector Twiddles
* =================================================================
* Same butterfly operation but with full vector twiddle factors.
* Better suited for PPC64LE's vector instruction set.
*/
.macro ct_butterfly_v a, b, root, root_twisted
// Montgomery multiply with vector twiddles
mulmod tmp, \b, \root, \root_twisted
// Butterfly operations
// AArch64: sub/add -> PPC64LE: vsubuhm/vadduhm
sub \b\().8h, \a\().8h, tmp.8h
add \a\().8h, \a\().8h, tmp.8h
.endm
/* PPC64LE TRANSLATION: Twiddle Factor Loading Macros
* ===================================================
* AArch64 uses ldr (load register) to load 128-bit vectors from memory.
* PPC64LE uses lxv/lxvd2x for vector loads, with endianness considerations.
*/
/* Load twiddle factors for layers 1, 2, 3
* AArch64: ldr loads 128 bits (8 int16_t values) into NEON register
* PPC64LE: lxv loads 128 bits into VSX register (no byte swap needed if data is correct endian)
* lxvd2x loads with automatic little-endian byte swap
*/
.macro load_roots_123
// AArch64: ldr q_root0, [r12345_ptr], #32 // Load and post-increment by 32 bytes
// PPC64LE: lxv VSR(root0), 0(r12345_ptr) // Load 16 bytes
// addi r12345_ptr, r12345_ptr, 32 // Manual increment
ldr q_root0, [r12345_ptr], #32
// AArch64: ldr q_root1, [r12345_ptr, #-16] // Load with negative offset
// PPC64LE: lxv VSR(root1), -16(r12345_ptr) // Load with offset
ldr q_root1, [r12345_ptr, #-16]
.endm
/* Load twiddle factors for layers 4, 5
* Simpler loading pattern - single vector per call
*/
.macro load_next_roots_45
// AArch64: ldr q_root0, [r12345_ptr], #16 // Load 16 bytes, post-increment
// PPC64LE: lxv VSR(root0), 0(r12345_ptr)
// addi r12345_ptr, r12345_ptr, 16
ldr q_root0, [r12345_ptr], #16
.endm
/* Load twiddle factors for layers 6, 7
* More complex loading pattern with multiple vectors and twisted versions
* PPC64LE note: "twisted" versions are pre-computed high parts for Montgomery reduction
*/
.macro load_next_roots_67
// Load primary twiddle factors
// AArch64: ldr with complex addressing: base + offset, then increment base
// PPC64LE: Multiple lxv instructions with calculated offsets
ldr q_root0, [r67_ptr], #(6*16) // Load root0, advance by 96 bytes
ldr q_root0_tw, [r67_ptr, #(-6*16 + 1*16)] // Load twisted version at offset -80
ldr q_root1, [r67_ptr, #(-6*16 + 2*16)] // Load root1 at offset -64
ldr q_root1_tw, [r67_ptr, #(-6*16 + 3*16)] // Load twisted version at offset -48
ldr q_root2, [r67_ptr, #(-6*16 + 4*16)] // Load root2 at offset -32
ldr q_root2_tw, [r67_ptr, #(-6*16 + 5*16)] // Load twisted version at offset -16
/* PPC64LE equivalent would be:
* lxv VSR(root0), 0(r67_ptr)
* lxv VSR(root0_tw), 16(r67_ptr)
* lxv VSR(root1), 32(r67_ptr)
* lxv VSR(root1_tw), 48(r67_ptr)
* lxv VSR(root2), 64(r67_ptr)
* lxv VSR(root2_tw), 80(r67_ptr)
* addi r67_ptr, r67_ptr, 96
*/
.endm
/* PPC64LE TRANSLATION: Matrix Transpose for 4x4 Layout
* =====================================================
* AArch64 uses trn1/trn2 (transpose) instructions for data reorganization.
* PPC64LE uses xxmrglw/xxmrghw (merge low/high words) and xxpermdi (permute doublewords).
*
* This macro transforms data layout from:
* [a0 a1 a2 a3 | b0 b1 b2 b3 | c0 c1 c2 c3 | d0 d1 d2 d3]
* to:
* [a0 b0 c0 d0 | a1 b1 c1 d1 | a2 b2 c2 d2 | a3 b3 c3 d3]
*/
.macro transpose4 data
// Step 1: Transpose at 32-bit word level
// AArch64: trn1/trn2 t.4s, data0.4s, data1.4s
// Interleaves 32-bit words: trn1 takes even positions, trn2 takes odd
// PPC64LE: xxmrglw/xxmrghw merge low/high 32-bit words from two vectors
trn1 t0.4s, \data\()0.4s, \data\()1.4s // t0 = [a0 b0 a2 b2]
trn2 t1.4s, \data\()0.4s, \data\()1.4s // t1 = [a1 b1 a3 b3]
trn1 t2.4s, \data\()2.4s, \data\()3.4s // t2 = [c0 d0 c2 d2]
trn2 t3.4s, \data\()2.4s, \data\()3.4s // t3 = [c1 d1 c3 d3]
// Step 2: Transpose at 64-bit doubleword level
// AArch64: trn1/trn2 t.2d, t0.2d, t2.2d
// Interleaves 64-bit doublewords
// PPC64LE: xxpermdi permutes doublewords between vectors
trn2 \data\()2.2d, t0.2d, t2.2d // data2 = [a2 b2 c2 d2]
trn2 \data\()3.2d, t1.2d, t3.2d // data3 = [a3 b3 c3 d3]
trn1 \data\()0.2d, t0.2d, t2.2d // data0 = [a0 b0 c0 d0]
trn1 \data\()1.2d, t1.2d, t3.2d // data1 = [a1 b1 c1 d1]
/* PPC64LE equivalent:
* xxmrglw t0, data0, data1 // Merge low words
* xxmrghw t1, data0, data1 // Merge high words
* xxmrglw t2, data2, data3 // Merge low words
* xxmrghw t3, data2, data3 // Merge high words
* xxpermdi data0, t0, t2, 0 // Permute: low of t0, low of t2
* xxpermdi data1, t1, t3, 0 // Permute: low of t1, low of t3
* xxpermdi data2, t0, t2, 3 // Permute: high of t0, high of t2
* xxpermdi data3, t1, t3, 3 // Permute: high of t1, high of t3
*/
.endm
/* PPC64LE TRANSLATION: Stack Management
* =====================================
* AArch64 uses stp/ldp (store/load pair) for efficient register saving.
* PPC64LE: This needs adjusting to whatever calling convention PPC64LE has.
*/
.macro save_vregs
// AArch64: Adjust stack pointer and save register pairs
sub sp, sp, #(16*4) // Allocate 64 bytes on stack
stp d8, d9, [sp, #16*0] // Store d8,d9 pair at sp+0
stp d10, d11, [sp, #16*1] // Store d10,d11 pair at sp+16
stp d12, d13, [sp, #16*2] // Store d12,d13 pair at sp+32
stp d14, d15, [sp, #16*3] // Store d14,d15 pair at sp+48
.endm
.macro restore_vregs
// AArch64: Restore register pairs and adjust stack pointer
ldp d8, d9, [sp, #16*0] // Load d8,d9 pair from sp+0
ldp d10, d11, [sp, #16*1] // Load d10,d11 pair from sp+16
ldp d12, d13, [sp, #16*2] // Load d12,d13 pair from sp+32
ldp d14, d15, [sp, #16*3] // Load d14,d15 pair from sp+48
add sp, sp, #(16*4) // Deallocate 64 bytes from stack
.endm
.macro push_stack
save_vregs
.endm
.macro pop_stack
restore_vregs
.endm
/* PPC64LE TRANSLATION: Register Assignments
* ==========================================
* AArch64 uses x0-x30 for general purpose, v0-v31 for NEON vectors.
* PPC64LE uses r0-r31 for general purpose, raw numbers for vector registers.
* VSX registers: v0-v31 are accessed as 32+0 through 32+31 in instructions.
*/
// Arguments - AArch64 calling convention
in .req x0 // Input/output buffer -> PPC64LE: 3
r12345_ptr .req x1 // twiddles for layer 0,1,2,3,4 -> PPC64LE: 4
r67_ptr .req x2 // twiddles for layer 5,6 -> PPC64LE: 5
// Working registers
inp .req x3 // Saved input pointer -> PPC64LE: 6
count .req x4 // Loop counter -> PPC64LE: 7
wtmp .req w5 // 32-bit temporary -> PPC64LE: 8
// Data vectors - using callee-saved NEON registers v8-v15
// PPC64LE: Use raw numbers, accessed as 32+N in instructions
data0 .req v8 // PPC64LE: 20 (accessed as 32+20)
data1 .req v9 // PPC64LE: 21 (accessed as 32+21)
data2 .req v10 // PPC64LE: 22 (accessed as 32+22)
data3 .req v11 // PPC64LE: 23 (accessed as 32+23)
data4 .req v12 // PPC64LE: 24 (accessed as 32+24)
data5 .req v13 // PPC64LE: 25 (accessed as 32+25)
data6 .req v14 // PPC64LE: 26 (accessed as 32+26)
data7 .req v15 // PPC64LE: 27 (accessed as 32+27)
// 128-bit (quadword) versions of data vectors
q_data0 .req q8 // PPC64LE: 32+20 in VSX instructions
q_data1 .req q9 // PPC64LE: 32+21 in VSX instructions
q_data2 .req q10 // PPC64LE: 32+22 in VSX instructions
q_data3 .req q11 // PPC64LE: 32+23 in VSX instructions
q_data4 .req q12 // PPC64LE: 32+24 in VSX instructions
q_data5 .req q13 // PPC64LE: 32+25 in VSX instructions
q_data6 .req q14 // PPC64LE: 32+26 in VSX instructions
q_data7 .req q15 // PPC64LE: 32+27 in VSX instructions
// Twiddle factor vectors - using caller-saved registers
// PPC64LE: Use raw numbers, accessed as 32+N in instructions
root0 .req v0 // PPC64LE: 0 (accessed as 32+0)
root1 .req v1 // PPC64LE: 1 (accessed as 32+1)
root2 .req v2 // PPC64LE: 2 (accessed as 32+2)
root0_tw .req v4 // PPC64LE: 4 (accessed as 32+4) - twisted version
root1_tw .req v5 // PPC64LE: 5 (accessed as 32+5)
root2_tw .req v6 // PPC64LE: 6 (accessed as 32+6)
// 128-bit versions of twiddle vectors
q_root0 .req q0 // PPC64LE: 32+0 in VSX instructions
q_root1 .req q1 // PPC64LE: 32+1 in VSX instructions
q_root2 .req q2 // PPC64LE: 32+2 in VSX instructions
q_root0_tw .req q4 // PPC64LE: 32+4 in VSX instructions
q_root1_tw .req q5 // PPC64LE: 32+5 in VSX instructions
q_root2_tw .req q6 // PPC64LE: 32+6 in VSX instructions
// Constants vector (holds MLKEM_Q and other constants)
consts .req v7 // PPC64LE: 7 (accessed as 32+7)
// Temporary vectors for intermediate calculations
tmp .req v24 // PPC64LE: 28 (accessed as 32+28)
t0 .req v25 // PPC64LE: 29 (accessed as 32+29)
t1 .req v26 // PPC64LE: 30 (accessed as 32+30)
t2 .req v27 // PPC64LE: 31 (accessed as 32+31)
t3 .req v28 // PPC64LE: 3 (accessed as 32+3) - additional temp if needed
.text
.global MLK_ASM_NAMESPACE(ntt_asm)
.balign 4
MLK_ASM_FN_SYMBOL(ntt_asm)
/* PPC64LE TRANSLATION: Function Entry
* ===================================
* AArch64: push_stack saves callee-saved NEON registers
* PPC64LE: Save callee-saved VSX registers and set up constants
*/
push_stack
/* PPC64LE TRANSLATION: Constant Initialization
* =============================================
* AArch64 uses mov to load immediate values into vector elements.
* PPC64LE loads constants from memory or uses vector splat operations.
*/
// Load MLKEM_Q = 3329 into constants vector
// AArch64: mov wtmp, #3329; mov consts.h[0], wtmp
// PPC64LE: li r9, 3329; mtvsrwz VSR39, r9; xxspltw VSR39, VSR39, 0
mov wtmp, #3329 // Load immediate 3329 into 32-bit register
mov consts.h[0], wtmp // Move to first halfword of constants vector
// Load Barrett constant 20159 for Montgomery reduction
// This is used in the high-part multiplication for reduction
mov wtmp, #20159 // Barrett reduction constant
mov consts.h[1], wtmp // Store in second halfword
/* PPC64LE equivalent:
* li r9, 3329 // Load immediate MLKEM_Q
* li r10, 20159 // Load Barrett constant
* mtvsrwz VSR39, r9 // Move r9 to VSX register
* mtvsrwz VSR40, r10 // Move r10 to VSX register
* xxspltw VSR39, VSR39, 0 // Splat word across vector (all elements = 3329)
* xxspltw VSR40, VSR40, 0 // Splat word across vector (all elements = 20159)
*/
// Initialize working pointers and loop counter
mov inp, in // Save original input pointer
mov count, #4 // Process 4 blocks in layers 1-2-3
// Load initial twiddle factors for layers 1, 2, 3
load_roots_123
.p2align 2
// Bounds reasoning:
// - There are 7 layers
// - When passing from layer N to layer N+1, each layer-N value
// is modified through the addition/subtraction of a Montgomery
// product of a twiddle of absolute value < q/2 and a layer-N value.
// - Recalling that for C such that |a| < C * q and |t|<q/2, we have
// |mlk_fqmul(a,t)| < q * (0.0254*C + 1/2), we see that the coefficients
// of layer N (starting with layer 0 = input data) are bound by q * f^N(1),
// where f(C) = 1/2 + 1.0508*C.
// For N=7, we get the bound of f^7(1) * q < 18295.
//
// See test/test_bounds.py for more details.
ntt_layer123_start:
/* PPC64LE TRANSLATION: Data Loading for Layers 1-2-3
* ====================================================
* AArch64: ldr loads 128 bits (8 int16_t values) from memory
* PPC64LE: lxvd2x loads 128 bits with little-endian byte swap
* xxpermdi corrects doubleword order if needed
*
* Memory layout: 256 int16_t values = 512 bytes total
* Each vector holds 8 int16_t values = 16 bytes
* 512/8 = 64 bytes between corresponding elements in different "columns"
*/
// Load 8 vectors of data (64 int16_t values total)
// AArch64: ldr q_data0, [in, #0] loads 16 bytes from in+0
// PPC64LE: lxvd2x VSR52, r3, r9 (where r9=0) loads with byte swap
ldr q_data0, [in, #0] // Load data[0:7]
ldr q_data1, [in, #(1*(512/8))] // Load data[64:71] (offset 64 bytes)
ldr q_data2, [in, #(2*(512/8))] // Load data[128:135] (offset 128 bytes)
ldr q_data3, [in, #(3*(512/8))] // Load data[192:199] (offset 192 bytes)
ldr q_data4, [in, #(4*(512/8))] // Load data[256:263] (offset 256 bytes)
ldr q_data5, [in, #(5*(512/8))] // Load data[320:327] (offset 320 bytes)
ldr q_data6, [in, #(6*(512/8))] // Load data[384:391] (offset 384 bytes)
ldr q_data7, [in, #(7*(512/8))] // Load data[448:455] (offset 448 bytes)
/* PPC64LE equivalent:
* li r9, 0 // Offset 0
* li r10, 64 // Offset 64
* li r11, 128 // Offset 128
* li r12, 192 // Offset 192
* lxvd2x VSR52, r3, r9 // Load data0 with byte swap
* lxvd2x VSR53, r3, r10 // Load data1 with byte swap
* lxvd2x VSR54, r3, r11 // Load data2 with byte swap
* lxvd2x VSR55, r3, r12 // Load data3 with byte swap
* xxpermdi VSR52, VSR52, VSR52, 2 // Correct doubleword order
* xxpermdi VSR53, VSR53, VSR53, 2 // Correct doubleword order
* xxpermdi VSR54, VSR54, VSR54, 2 // Correct doubleword order
* xxpermdi VSR55, VSR55, VSR55, 2 // Correct doubleword order
* (Continue for data4-data7 with offsets 256, 320, 384, 448)
*/
/* PPC64LE TRANSLATION: Layer 1 NTT Butterflies
* =============================================
* Layer 1 processes 4 butterflies with stride 128 (4*64 bytes)
* Each butterfly: (data[i], data[i+128]) -> (data[i]+t, data[i]-t)
* where t = Montgomery_multiply(data[i+128], twiddle)
*/
// Layer 1: 4 butterflies with the same twiddle factor
// Butterfly 1: (data0, data4) using root0.h[0] and root0.h[1]
ct_butterfly data0, data4, root0, 0, 1
// Butterfly 2: (data1, data5) using same twiddle
ct_butterfly data1, data5, root0, 0, 1
// Butterfly 3: (data2, data6) using same twiddle
ct_butterfly data2, data6, root0, 0, 1
// Butterfly 4: (data3, data7) using same twiddle
ct_butterfly data3, data7, root0, 0, 1
/* PPC64LE TRANSLATION: Layer 2 NTT Butterflies
* =============================================
* Layer 2 processes 8 butterflies with stride 64 (2*64 bytes)
* Uses different twiddle factors for different butterfly pairs
*/
// Layer 2: 8 butterflies with 2 different twiddle factors
// Butterflies 1-2: (data0,data2) and (data1,data3) using root0.h[2,3]
ct_butterfly data0, data2, root0, 2, 3
ct_butterfly data1, data3, root0, 2, 3
// Butterflies 3-4: (data4,data6) and (data5,data7) using root0.h[4,5]
ct_butterfly data4, data6, root0, 4, 5
ct_butterfly data5, data7, root0, 4, 5
/* PPC64LE TRANSLATION: Layer 3 NTT Butterflies
* =============================================
* Layer 3 processes 16 butterflies with stride 32 (1*64 bytes)
* Uses 4 different twiddle factors from root0 and root1
*/
// Layer 3: 16 butterflies with 4 different twiddle factors
// Butterflies 1-2: (data0,data1) using root0.h[6,7]
ct_butterfly data0, data1, root0, 6, 7
// Butterflies 3-4: (data2,data3) using root1.h[0,1]
ct_butterfly data2, data3, root1, 0, 1
// Butterflies 5-6: (data4,data5) using root1.h[2,3]
ct_butterfly data4, data5, root1, 2, 3
// Butterflies 7-8: (data6,data7) using root1.h[4,5]
ct_butterfly data6, data7, root1, 4, 5
/* PPC64LE TRANSLATION: Data Storage After Layers 1-2-3
* =====================================================
* AArch64: str stores 128 bits to memory with post-increment addressing
* PPC64LE: stxvd2x stores 128 bits with little-endian byte swap
*/
// Store results back to memory
// AArch64: str q_data0, [in], #16 stores and increments pointer by 16
// PPC64LE: stxvd2x VSR52, r3, r9; addi r3, r3, 16
str q_data0, [in], #(16) // Store data0, advance pointer by 16
str q_data1, [in, #(-16 + 1*(512/8))] // Store data1 at offset 48 from new position
str q_data2, [in, #(-16 + 2*(512/8))] // Store data2 at offset 112 from new position
str q_data3, [in, #(-16 + 3*(512/8))] // Store data3 at offset 176 from new position
str q_data4, [in, #(-16 + 4*(512/8))] // Store data4 at offset 240 from new position
str q_data5, [in, #(-16 + 5*(512/8))] // Store data5 at offset 304 from new position
str q_data6, [in, #(-16 + 6*(512/8))] // Store data6 at offset 368 from new position
str q_data7, [in, #(-16 + 7*(512/8))] // Store data7 at offset 432 from new position
/* PPC64LE equivalent:
* xxpermdi VSR52, VSR52, VSR52, 2 // Correct doubleword order before store
* xxpermdi VSR53, VSR53, VSR53, 2 // (Reverse the correction from load)
* stxvd2x VSR52, r3, r9 // Store data0 with byte swap
* addi r3, r3, 16 // Advance pointer
* li r10, 48 // Calculate offset for data1
* stxvd2x VSR53, r3, r10 // Store data1
* (Continue for remaining data vectors)
*/
// Loop control: decrement counter and branch if not zero
// AArch64: subs sets flags, cbnz branches if not zero
// PPC64LE: subic. sets condition register, bne branches if not equal
subs count, count, #1 // Decrement counter and set flags
cbnz count, ntt_layer123_start // Branch if count != 0
/* PPC64LE equivalent:
* subic. r7, r7, 1 // Subtract immediate and set CR0
* bne ntt_layer123_start // Branch if not equal (CR0[EQ] = 0)
*/
/* PPC64LE TRANSLATION: Setup for Layers 4-5-6-7
* ===============================================
* After layers 1-2-3, we reset pointers and process remaining layers
* with different stride patterns and data organization
*/
// Reset input pointer and set up for layers 4-5-6-7
mov in, inp // Restore original input pointer
mov count, #8 // Process 8 blocks in layers 4-5-6-7
.p2align 2
ntt_layer4567_start:
/* PPC64LE TRANSLATION: Data Loading for Layers 4-5-6-7
* ======================================================
* Now we process 4 vectors at a time (32 int16_t values)
* with different memory stride patterns
*/
// Load 4 consecutive vectors (64 bytes total)
// AArch64: ldr loads 16 bytes each
// PPC64LE: lxvd2x with consecutive offsets
ldr q_data0, [in, #(16*0)] // Load 16 bytes at offset 0
ldr q_data1, [in, #(16*1)] // Load 16 bytes at offset 16
ldr q_data2, [in, #(16*2)] // Load 16 bytes at offset 32
ldr q_data3, [in, #(16*3)] // Load 16 bytes at offset 48
/* PPC64LE equivalent:
* li r9, 0 // Offset 0
* li r10, 16 // Offset 16
* li r11, 32 // Offset 32
* li r12, 48 // Offset 48
* lxvd2x VSR52, r3, r9 // Load data0
* lxvd2x VSR53, r3, r10 // Load data1
* lxvd2x VSR54, r3, r11 // Load data2
* lxvd2x VSR55, r3, r12 // Load data3
* xxpermdi VSR52, VSR52, VSR52, 2 // Endian correction
* xxpermdi VSR53, VSR53, VSR53, 2
* xxpermdi VSR54, VSR54, VSR54, 2
* xxpermdi VSR55, VSR55, VSR55, 2
*/
// Load twiddle factors for layer 4-5
load_next_roots_45
/* PPC64LE TRANSLATION: Layer 4 and 5 Butterflies
* ===============================================
* Layer 4: stride 16, Layer 5: stride 8
* Uses indexed twiddle access for efficiency
*/
// Layer 4: 2 butterflies with stride 16 (2 vectors apart)
ct_butterfly data0, data2, root0, 0, 1 // Butterfly: (data0, data2)
ct_butterfly data1, data3, root0, 0, 1 // Butterfly: (data1, data3)
// Layer 5: 4 butterflies with stride 8 (1 vector apart)
ct_butterfly data0, data1, root0, 2, 3 // Butterfly: (data0, data1)
ct_butterfly data2, data3, root0, 4, 5 // Butterfly: (data2, data3)
/* PPC64LE TRANSLATION: Matrix Transpose for Layers 6-7
* =====================================================
* The transpose operation reorganizes data for efficient processing
* of the final two NTT layers with different access patterns
*/
// Transpose the 4x4 matrix of vectors for layers 6-7
// This changes the data layout to enable vectorized processing
// of the remaining butterfly operations
transpose4 data
// Load twiddle factors for layers 6-7 (including twisted versions)
load_next_roots_67
/* PPC64LE TRANSLATION: Layer 6 and 7 Butterflies
* ===============================================
* These layers use vector twiddle factors (not indexed)
* and process all elements in parallel within each vector
*/
// Layer 6: 8 butterflies with stride 4 (within vectors)
// Uses vector twiddle factors for parallel processing
ct_butterfly_v data0, data2, root0, root0_tw // Butterfly with vector twiddles
ct_butterfly_v data1, data3, root0, root0_tw // Same twiddles for parallel lanes
// Layer 7: 16 butterflies with stride 2 (within vectors)
ct_butterfly_v data0, data1, root1, root1_tw // Different twiddles for each pair
ct_butterfly_v data2, data3, root2, root2_tw // Different twiddles for each pair
/* PPC64LE TRANSLATION: Final Transpose and Storage
* ================================================
* Transpose back to restore the original data organization
* before storing results to memory
*/
// Transpose back to original layout
transpose4 data
// Store results back to memory
// AArch64: str with post-increment addressing
// PPC64LE: stxvd2x with manual pointer arithmetic
str q_data0, [in], #(16*4) // Store data0, advance by 64 bytes
str q_data1, [in, #(-16*3)] // Store data1 at offset -48
str q_data2, [in, #(-16*2)] // Store data2 at offset -32
str q_data3, [in, #(-16*1)] // Store data3 at offset -16
/* PPC64LE equivalent:
* xxpermdi VSR52, VSR52, VSR52, 2 // Endian correction before store
* xxpermdi VSR53, VSR53, VSR53, 2
* xxpermdi VSR54, VSR54, VSR54, 2
* xxpermdi VSR55, VSR55, VSR55, 2
* li r9, 0 // Offset 0
* stxvd2x VSR52, r3, r9 // Store data0
* addi r3, r3, 64 // Advance pointer by 64 bytes
* li r10, -48 // Offset -48
* li r11, -32 // Offset -32
* li r12, -16 // Offset -16
* stxvd2x VSR53, r3, r10 // Store data1
* stxvd2x VSR54, r3, r11 // Store data2
* stxvd2x VSR55, r3, r12 // Store data3
*/
// Loop control for layers 4-5-6-7
subs count, count, #1 // Decrement counter
cbnz count, ntt_layer4567_start // Continue if not zero
pop_stack // Restore saved NEON/VSX registers
ret // Return to caller
/* simpasm: footer-start */
#endif /* MLK_ARITH_BACKEND_AARCH64 && !MLK_CONFIG_MULTILEVEL_NO_SHARED */
/* PPC64LE TRANSLATION SUMMARY
* ===========================
*
* Key instruction mappings:
* 1. Memory operations:
* - ldr/str -> lxvd2x/stxvd2x (with endianness correction)
* - Addressing modes -> manual offset calculation
*
* 2. Arithmetic operations:
* - add/sub -> vadduhm/vsubuhm (unsigned halfword modulo)
* - mul -> vmladduhm with zero accumulator
* - sqrdmulh -> vmhraddshs (multiply-high-round-add signed halfword saturate)
* - mls -> vmladduhm with accumulator (effectively multiply-add)
*
* 3. Data reorganization:
* - trn1/trn2 -> xxmrglw/xxmrghw + xxpermdi
* - Matrix transpose -> word merge + doubleword permute
*/

Could you give this a shot?
@hanno-becker Thanks for taking on so much of the ppc64le integration. I am learning a lot here in all aspects. I don't know how much time it will take me, but I'll work on it.
@dannytsen This is great to hear! Please don't hesitate to ask if anything is unclear about the algorithmic aspects of the AArch64 implementation, or how to translate it. Otherwise, I'm looking forward to seeing the updated code!
Thanks @dannytsen! Feel free to let me know if there's any area where I can jump in and help out with coding.
@hanno-becker @bhess Thanks.
@dannytsen Any update?
@hanno-becker Still working on fixing the current code before the other work.
Re-arranged zeta array for NTT/INTT for Len 2 and 4. Signed-off-by: Danny Tsen <[email protected]>
@dannytsen Ack. Let me know when you're done with the rework or have any questions.
@hanno-becker @bhess My fix to make my implementation work with your new backend unit test is very straightforward and simple; it just matches the workflow of your C implementation. This implementation works as is, so I don't plan on a rework any time soon. Thanks.
@bhess @dannytsen Could you provide performance improvement data for the backend as it stands? What is the plan towards integrating assembly that leverages lazy reduction and layer merging?
Added optimized ppc64le support functions for ML-KEM.
The supported native functions include:
And other interface functions and headers.
Signed-off-by: Danny Tsen [email protected]