
Conversation

dannytsen

Added optimized ppc64le support functions for ML-KEM.

The supported native functions include:

  1. MLK_USE_NATIVE_NTT (ntt_ppc.S)
  2. MLK_USE_NATIVE_INTT (intt_ppc.S)
  3. MLK_USE_NATIVE_POLY_REDUCE (reduce.S)
  4. MLK_USE_NATIVE_POLY_TOMONT (poly_tomont.S)

Also included are the related interface functions and headers.

Signed-off-by: Danny Tsen [email protected]

@dannytsen dannytsen requested a review from a team as a code owner September 9, 2025 15:06
Contributor

@hanno-becker hanno-becker left a comment


Thank you, @dannytsen, this is an exciting contribution 🎉

I think as the first stage of review, the goal should be to get your changes through CI, and extend it so that the PPC64 backend is exercised (to this end: do you know if your assembly works with qemu-ppc64le, and what flags are needed?).

In a second phase, we can dive into the backend itself and hopefully convince ourselves that it is functionally correct and upholds the assumptions made by the frontend.

I left a few comments to kick things off, but additionally I can see that there are failures related to autogen and format, so a good starting point would be to resolve those. You should be able to run simpasm with a PPC cross compiler to get simplified assembly that you can check in to the main source tree.

@dannytsen
Author

@hanno-becker I believe the code will work on qemu-ppc64le, even though I have not run it there. My testing platforms are p9 and p10 systems. I will go through the comments and fix the issues. Thanks.


#include "../../../common.h"

Contributor


You probably want to guard those files by MLK_ARITH_BACKEND_PPC64LE_DEFAULT


#include "../../../common.h"

.machine "any"
Contributor


What does this do? Does it have any effect to specify "any" machine?

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen Please see https://github.com/pq-code-package/mlkem-native/commits/ppc64le_backend for the changes to get the asm through the usual format/autogen/simpasm pipeline. Feel free to amend your commit(s). At least the base CI is happy with this: https://github.com/pq-code-package/mlkem-native/actions/runs/17640154327

NOTE: The resulting ASM in mlkem/* is currently unusable because the references to the .data section have been messed up during simpasm. As mentioned above, please see if you can follow the approach from the AArch64 backend: Define the NTT and invNTT twiddle tables in *.c and pass them to the ASM routines as arguments. The other constants can be generated in the code itself, as in https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/ntt.S#L79 for example. If it's inconvenient to do this, you can also go with a single large constant table including all constants you need, pass the pointer to that to each ASM function, and load from a suitable offset in the ASM. This is the approach used in the x86_64 backend, see dev/x86_64/src/consts.c.

@dannytsen
Author

dannytsen commented Sep 11, 2025

@hanno-becker Thanks for the pointer. But I am not a Python programmer and can't really comprehend Python, so changing the scripts is not my first choice. I just want to get simpasm working on my code. I can change my code to use a data array from a C file, but I need an example (a command line) of how to generate the simplified assembly. Where do you run simpasm from — the scripts directory or the dev directory? And what options do I need to pass? An example for x86 or Arm would be fine. I just want to know how to run it so I can fix my assembly code accordingly. Thanks.

I have a t.S file with the .data section stripped. Here is the output for your reference, so you can see what I am talking about.

[07:06] danny@ltc-zz4-lp9 dev % ../scripts/simpasm -i ${PWD}/ppc64le/src/t.S
simpasm: Command failed: gcc -c -x assembler-with-cpp -o /tmp/tmpitbxzagr.o -
simpasm: Exit code: 1
simpasm: stderr: :13:10: fatal error: ../../../common.h: No such file or directory
compilation terminated.

Traceback (most recent call last):
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 158, in run_cmd
r = subprocess.run(
File "/usr/lib64/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['gcc', '-c', '-x', 'assembler-with-cpp', '-o', '/tmp/tmpitbxzagr.o', '-']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 445, in
_main()
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 436, in _main
simplify(logger, args, args.input, args.output)
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 206, in simplify
run_cmd(cmd, input=asm_no_if)
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 166, in run_cmd
raise Exception("simpasm failed") from e
Exception: simpasm failed
[07:06] danny@ltc-zz4-lp9 dev %

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.

If you checkout the branch, enter the nix develop .#ci-cross shell (this will take a very long time initially, unfortunately), and do autogen --force-cross, everything will just work: It'll run simpasm on your dev/ppc64le backend and update the code in mlkem/src/native/ppc64le/ accordingly.

@dannytsen
Author

@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.

@hanno-becker Sure. I can do that.

@dannytsen
Author

@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.

If you checkout the branch, enter the nix develop .#ci-cross shell (this will take a very long time initially, unfortunately), and do autogen, everything will just work.

@hanno-becker BTW, there is no nix for ppc.

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by nix develop .#ci-cross.

@dannytsen
Author

@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by nix develop .#ci-cross.

@hanno-becker Ok. I'll check that. Thanks.

@hanno-becker
Contributor

@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only?

@dannytsen
Author

@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only?

@hanno-becker The code ran successfully in liboqs and in the mlkem-native project on HW. The code was originally written for liboqs.

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

@dannytsen Independent of the work of separating the twiddles from the assembly:

I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current dannytsen:main:

nix develop --extra-experimental-features 'nix-command flakes'  .#ci-cross
make clean
tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt

This gives:

INFO  > Functional Test    Compile     (cross opt):      CROSS_PREFIX=powerpc64le-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
INFO  > Functional Test    ML-KEM-512  (cross opt):      EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32
ERROR > Functional Test    ML-KEM-512  (cross opt):      'EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32' failed with with 2
ERROR > Functional Test    ML-KEM-512  (cross opt):      ERROR (test/test_mlkem.c,49)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:58: run_func_512] Error 1

If I comment out MLK_USE_NATIVE_NTT and MLK_USE_NATIVE_INTT, it works, so something must be off related to the [inv]NTT.

It could just be some CPU configuration missing. Are you assuming a particular vector length, for example?

I can also see the code failing when running under qemu-ppc64le -cpu power8, with Illegal instruction aborts. From the documentation, I was expecting it to work from power8 upwards.

Bottom line: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please?

@dannytsen
Author

dannytsen commented Sep 12, 2025

@dannytsen Independent of the work of separating the twiddles from the assembly:

I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current dannytsen:main:

nix develop --extra-experimental-features 'nix-command flakes'  .#ci-cross
make clean
tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt

This gives:

INFO  > Functional Test    Compile     (cross opt):      CROSS_PREFIX=powerpc64le-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
INFO  > Functional Test    ML-KEM-512  (cross opt):      EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32
ERROR > Functional Test    ML-KEM-512  (cross opt):      'EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32' failed with with 2
ERROR > Functional Test    ML-KEM-512  (cross opt):      ERROR (test/test_mlkem.c,49)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:58: run_func_512] Error 1

If I comment out MLK_USE_NATIVE_NTT and MLK_USE_NATIVE_INTT, it works, so something must be off related to the [inv]NTT.

It could just be some CPU configuration missing. Are you assuming a particular vector length, for example?

I can also see the code failing when running under qemu-ppc64le -cpu power8, with Illegal instruction aborts. From the documentation, I was expecting it to work from power8 upwards.

Bottom line: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please?

@hanno-becker Which means that either it doesn't work on p8 or some instructions are not supported in qemu.
I only tested on p9 and p10 HW platforms.

@dannytsen
Author

@hanno-becker Here is my output from p9.

[00:01] danny@ltc-zz4-lp9 mlkem-native_dev % make test
AS test/build/mlkem512/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem512.a
LD test/build/mlkem512/bin/gen_KAT512
KAT ML-KEM-512: test/build/mlkem512/bin/gen_KAT512
set -o pipefail; test/build/mlkem512/bin/gen_KAT512 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-512 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-512 kat-sha256: OK
AS test/build/mlkem768/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem768.a
LD test/build/mlkem768/bin/gen_KAT768
KAT ML-KEM-768: test/build/mlkem768/bin/gen_KAT768
set -o pipefail; test/build/mlkem768/bin/gen_KAT768 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-768 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-768 kat-sha256: OK
AS test/build/mlkem1024/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem1024.a
LD test/build/mlkem1024/bin/gen_KAT1024
KAT ML-KEM-1024: test/build/mlkem1024/bin/gen_KAT1024
set -o pipefail; test/build/mlkem1024/bin/gen_KAT1024 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-1024 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-1024 kat-sha256: OK
LD test/build/mlkem512/bin/test_mlkem512
FUNC ML-KEM-512: test/build/mlkem512/bin/test_mlkem512
test/build/mlkem512/bin/test_mlkem512
CRYPTO_SECRETKEYBYTES: 1632
CRYPTO_PUBLICKEYBYTES: 800
CRYPTO_CIPHERTEXTBYTES: 768
LD test/build/mlkem768/bin/test_mlkem768
FUNC ML-KEM-768: test/build/mlkem768/bin/test_mlkem768
test/build/mlkem768/bin/test_mlkem768
CRYPTO_SECRETKEYBYTES: 2400
CRYPTO_PUBLICKEYBYTES: 1184
CRYPTO_CIPHERTEXTBYTES: 1088
LD test/build/mlkem1024/bin/test_mlkem1024
FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024
test/build/mlkem1024/bin/test_mlkem1024
CRYPTO_SECRETKEYBYTES: 3168
CRYPTO_PUBLICKEYBYTES: 1568
CRYPTO_CIPHERTEXTBYTES: 1568
LD test/build/mlkem512/bin/acvp_mlkem512
ACVP ML-KEM-512: test/build/mlkem512/bin/acvp_mlkem512
LD test/build/mlkem768/bin/acvp_mlkem768
ACVP ML-KEM-768: test/build/mlkem768/bin/acvp_mlkem768
LD test/build/mlkem1024/bin/acvp_mlkem1024
ACVP ML-KEM-1024: test/build/mlkem1024/bin/acvp_mlkem1024
python3 ./test/acvp_client.py
Using ACVP test vectors version v1.1.0.40
Running ACVP tests for test/.acvp-data/v1.1.0.40/files/ML-KEM-keyGen-FIPS203/prompt.json
Running keyGen test case 1 ... done
Running keyGen test case 2 ... done
Running keyGen test case 3 ... done
Running keyGen test case 4 ... done
Running keyGen test case 5 ... done
Running keyGen test case 6 ... done
Running keyGen test case 7 ... done
Running keyGen test case 8 ... done
Running keyGen test case 9 ... done
Running keyGen test case 10 ... done
Running keyGen test case 11 ... done
Running keyGen test case 12 ... done
Running keyGen test case 13 ... done
Running keyGen test case 14 ... done
Running keyGen test case 15 ... done
Running keyGen test case 16 ... done
Running keyGen test case 17 ... done
Running keyGen test case 18 ... done
Running keyGen test case 19 ... done
Running keyGen test case 20 ... done
Running keyGen test case 21 ... done
Running keyGen test case 22 ... done
Running keyGen test case 23 ... done
Running keyGen test case 24 ... done
Running keyGen test case 25 ... done
Running keyGen test case 26 ... done
Running keyGen test case 27 ... done
Running keyGen test case 28 ... done
Running keyGen test case 29 ... done
Running keyGen test case 30 ... done
Running keyGen test case 31 ... done
Running keyGen test case 32 ... done
Running keyGen test case 33 ... done
Running keyGen test case 34 ... done
Running keyGen test case 35 ... done
Running keyGen test case 36 ... done
Running keyGen test case 37 ... done
Running keyGen test case 38 ... done
Running keyGen test case 39 ... done
Running keyGen test case 40 ... done
Running keyGen test case 41 ... done
Running keyGen test case 42 ... done
Running keyGen test case 43 ... done
Running keyGen test case 44 ... done
Running keyGen test case 45 ... done
Running keyGen test case 46 ... done
Running keyGen test case 47 ... done
Running keyGen test case 48 ... done
Running keyGen test case 49 ... done
Running keyGen test case 50 ... done
Running keyGen test case 51 ... done
Running keyGen test case 52 ... done
Running keyGen test case 53 ... done
Running keyGen test case 54 ... done
Running keyGen test case 55 ... done
Running keyGen test case 56 ... done
Running keyGen test case 57 ... done
Running keyGen test case 58 ... done
Running keyGen test case 59 ... done
Running keyGen test case 60 ... done
Running keyGen test case 61 ... done
Running keyGen test case 62 ... done
Running keyGen test case 63 ... done
Running keyGen test case 64 ... done
Running keyGen test case 65 ... done
Running keyGen test case 66 ... done
Running keyGen test case 67 ... done
Running keyGen test case 68 ... done
Running keyGen test case 69 ... done
Running keyGen test case 70 ... done
Running keyGen test case 71 ... done
Running keyGen test case 72 ... done
Running keyGen test case 73 ... done
Running keyGen test case 74 ... done
Running keyGen test case 75 ... done
Comparing results with test/.acvp-data/v1.1.0.40/files/ML-KEM-keyGen-FIPS203/expectedResults.json
OK
Running ACVP tests for test/.acvp-data/v1.1.0.40/files/ML-KEM-encapDecap-FIPS203/prompt.json
Running encapDecap test case 1 (encapsulation) ... done
Running encapDecap test case 2 (encapsulation) ... done
Running encapDecap test case 3 (encapsulation) ... done
Running encapDecap test case 4 (encapsulation) ... done
Running encapDecap test case 5 (encapsulation) ... done
Running encapDecap test case 6 (encapsulation) ... done
Running encapDecap test case 7 (encapsulation) ... done
Running encapDecap test case 8 (encapsulation) ... done
Running encapDecap test case 9 (encapsulation) ... done
Running encapDecap test case 10 (encapsulation) ... done
Running encapDecap test case 11 (encapsulation) ... done
Running encapDecap test case 12 (encapsulation) ... done
Running encapDecap test case 13 (encapsulation) ... done
Running encapDecap test case 14 (encapsulation) ... done
Running encapDecap test case 15 (encapsulation) ... done
Running encapDecap test case 16 (encapsulation) ... done
Running encapDecap test case 17 (encapsulation) ... done
Running encapDecap test case 18 (encapsulation) ... done
Running encapDecap test case 19 (encapsulation) ... done
Running encapDecap test case 20 (encapsulation) ... done
Running encapDecap test case 21 (encapsulation) ... done
Running encapDecap test case 22 (encapsulation) ... done
Running encapDecap test case 23 (encapsulation) ... done
Running encapDecap test case 24 (encapsulation) ... done
Running encapDecap test case 25 (encapsulation) ... done
Running encapDecap test case 26 (encapsulation) ... done
Running encapDecap test case 27 (encapsulation) ... done
Running encapDecap test case 28 (encapsulation) ... done
Running encapDecap test case 29 (encapsulation) ... done
Running encapDecap test case 30 (encapsulation) ... done
Running encapDecap test case 31 (encapsulation) ... done
Running encapDecap test case 32 (encapsulation) ... done
Running encapDecap test case 33 (encapsulation) ... done
Running encapDecap test case 34 (encapsulation) ... done
Running encapDecap test case 35 (encapsulation) ... done
Running encapDecap test case 36 (encapsulation) ... done
Running encapDecap test case 37 (encapsulation) ... done
Running encapDecap test case 38 (encapsulation) ... done
Running encapDecap test case 39 (encapsulation) ... done
Running encapDecap test case 40 (encapsulation) ... done
Running encapDecap test case 41 (encapsulation) ... done
Running encapDecap test case 42 (encapsulation) ... done
Running encapDecap test case 43 (encapsulation) ... done
Running encapDecap test case 44 (encapsulation) ... done
Running encapDecap test case 45 (encapsulation) ... done
Running encapDecap test case 46 (encapsulation) ... done
Running encapDecap test case 47 (encapsulation) ... done
Running encapDecap test case 48 (encapsulation) ... done
Running encapDecap test case 49 (encapsulation) ... done
Running encapDecap test case 50 (encapsulation) ... done
Running encapDecap test case 51 (encapsulation) ... done
Running encapDecap test case 52 (encapsulation) ... done
Running encapDecap test case 53 (encapsulation) ... done
Running encapDecap test case 54 (encapsulation) ... done
Running encapDecap test case 55 (encapsulation) ... done
Running encapDecap test case 56 (encapsulation) ... done
Running encapDecap test case 57 (encapsulation) ... done
Running encapDecap test case 58 (encapsulation) ... done
Running encapDecap test case 59 (encapsulation) ... done
Running encapDecap test case 60 (encapsulation) ... done
Running encapDecap test case 61 (encapsulation) ... done
Running encapDecap test case 62 (encapsulation) ... done
Running encapDecap test case 63 (encapsulation) ... done
Running encapDecap test case 64 (encapsulation) ... done
Running encapDecap test case 65 (encapsulation) ... done
Running encapDecap test case 66 (encapsulation) ... done
Running encapDecap test case 67 (encapsulation) ... done
Running encapDecap test case 68 (encapsulation) ... done
Running encapDecap test case 69 (encapsulation) ... done
Running encapDecap test case 70 (encapsulation) ... done
Running encapDecap test case 71 (encapsulation) ... done
Running encapDecap test case 72 (encapsulation) ... done
Running encapDecap test case 73 (encapsulation) ... done
Running encapDecap test case 74 (encapsulation) ... done
Running encapDecap test case 75 (encapsulation) ... done
Running encapDecap test case 76 (decapsulation) ... done
Running encapDecap test case 77 (decapsulation) ... done
Running encapDecap test case 78 (decapsulation) ... done
Running encapDecap test case 79 (decapsulation) ... done
Running encapDecap test case 80 (decapsulation) ... done
Running encapDecap test case 81 (decapsulation) ... done
Running encapDecap test case 82 (decapsulation) ... done
Running encapDecap test case 83 (decapsulation) ... done
Running encapDecap test case 84 (decapsulation) ... done
Running encapDecap test case 85 (decapsulation) ... done
Running encapDecap test case 86 (decapsulation) ... done
Running encapDecap test case 87 (decapsulation) ... done
Running encapDecap test case 88 (decapsulation) ... done
Running encapDecap test case 89 (decapsulation) ... done
Running encapDecap test case 90 (decapsulation) ... done
Running encapDecap test case 91 (decapsulation) ... done
Running encapDecap test case 92 (decapsulation) ... done
Running encapDecap test case 93 (decapsulation) ... done
Running encapDecap test case 94 (decapsulation) ... done
Running encapDecap test case 95 (decapsulation) ... done
Running encapDecap test case 96 (decapsulation) ... done
Running encapDecap test case 97 (decapsulation) ... done
Running encapDecap test case 98 (decapsulation) ... done
Running encapDecap test case 99 (decapsulation) ... done
Running encapDecap test case 100 (decapsulation) ... done
Running encapDecap test case 101 (decapsulation) ... done
Running encapDecap test case 102 (decapsulation) ... done
Running encapDecap test case 103 (decapsulation) ... done
Running encapDecap test case 104 (decapsulation) ... done
Running encapDecap test case 105 (decapsulation) ... done
Running encapDecap test case 106 (decapsulationKeyCheck) ... done
Running encapDecap test case 107 (decapsulationKeyCheck) ... done
Running encapDecap test case 108 (decapsulationKeyCheck) ... done
Running encapDecap test case 109 (decapsulationKeyCheck) ... done
Running encapDecap test case 110 (decapsulationKeyCheck) ... done
Running encapDecap test case 111 (decapsulationKeyCheck) ... done
Running encapDecap test case 112 (decapsulationKeyCheck) ... done
Running encapDecap test case 113 (decapsulationKeyCheck) ... done
Running encapDecap test case 114 (decapsulationKeyCheck) ... done
Running encapDecap test case 115 (decapsulationKeyCheck) ... done
Running encapDecap test case 116 (encapsulationKeyCheck) ... done
Running encapDecap test case 117 (encapsulationKeyCheck) ... done
Running encapDecap test case 118 (encapsulationKeyCheck) ... done
Running encapDecap test case 119 (encapsulationKeyCheck) ... done
Running encapDecap test case 120 (encapsulationKeyCheck) ... done
Running encapDecap test case 121 (encapsulationKeyCheck) ... done
Running encapDecap test case 122 (encapsulationKeyCheck) ... done
Running encapDecap test case 123 (encapsulationKeyCheck) ... done
Running encapDecap test case 124 (encapsulationKeyCheck) ... done
Running encapDecap test case 125 (encapsulationKeyCheck) ... done
Running encapDecap test case 126 (decapsulationKeyCheck) ... done
Running encapDecap test case 127 (decapsulationKeyCheck) ... done
Running encapDecap test case 128 (decapsulationKeyCheck) ... done
Running encapDecap test case 129 (decapsulationKeyCheck) ... done
Running encapDecap test case 130 (decapsulationKeyCheck) ... done
Running encapDecap test case 131 (decapsulationKeyCheck) ... done
Running encapDecap test case 132 (decapsulationKeyCheck) ... done
Running encapDecap test case 133 (decapsulationKeyCheck) ... done
Running encapDecap test case 134 (decapsulationKeyCheck) ... done
Running encapDecap test case 135 (decapsulationKeyCheck) ... done
Running encapDecap test case 136 (encapsulationKeyCheck) ... done
Running encapDecap test case 137 (encapsulationKeyCheck) ... done
Running encapDecap test case 138 (encapsulationKeyCheck) ... done
Running encapDecap test case 139 (encapsulationKeyCheck) ... done
Running encapDecap test case 140 (encapsulationKeyCheck) ... done
Running encapDecap test case 141 (encapsulationKeyCheck) ... done
Running encapDecap test case 142 (encapsulationKeyCheck) ... done
Running encapDecap test case 143 (encapsulationKeyCheck) ... done
Running encapDecap test case 144 (encapsulationKeyCheck) ... done
Running encapDecap test case 145 (encapsulationKeyCheck) ... done
Running encapDecap test case 146 (decapsulationKeyCheck) ... done
Running encapDecap test case 147 (decapsulationKeyCheck) ... done
Running encapDecap test case 148 (decapsulationKeyCheck) ... done
Running encapDecap test case 149 (decapsulationKeyCheck) ... done
Running encapDecap test case 150 (decapsulationKeyCheck) ... done
Running encapDecap test case 151 (decapsulationKeyCheck) ... done
Running encapDecap test case 152 (decapsulationKeyCheck) ... done
Running encapDecap test case 153 (decapsulationKeyCheck) ... done
Running encapDecap test case 154 (decapsulationKeyCheck) ... done
Running encapDecap test case 155 (decapsulationKeyCheck) ... done
Running encapDecap test case 156 (encapsulationKeyCheck) ... done
Running encapDecap test case 157 (encapsulationKeyCheck) ... done
Running encapDecap test case 158 (encapsulationKeyCheck) ... done
Running encapDecap test case 159 (encapsulationKeyCheck) ... done
Running encapDecap test case 160 (encapsulationKeyCheck) ... done
Running encapDecap test case 161 (encapsulationKeyCheck) ... done
Running encapDecap test case 162 (encapsulationKeyCheck) ... done
Running encapDecap test case 163 (encapsulationKeyCheck) ... done
Running encapDecap test case 164 (encapsulationKeyCheck) ... done
Running encapDecap test case 165 (encapsulationKeyCheck) ... done
Comparing results with test/.acvp-data/v1.1.0.40/files/ML-KEM-encapDecap-FIPS203/expectedResults.json
OK
ALL GOOD!
Everything checks fine!
[00:01] danny@ltc-zz4-lp9 mlkem-native_dev %

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The ci-cross shell already provides you with a cross compiler and an emulator (and a test script to use them, e.g. tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt), but it appears that some configuration options are missing. We don't have PPC machines in CI, so the only way to test is via QEMU.

Which means that either it doesn't work on p8 or some instructions are not supported in qemu. I only tested on p9 and p10 HW platforms.

Can you find out which one it is? The PR documentation states that the ASM works P8 upwards.

@dannytsen
Author

@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The ci-cross shell already provides you with a cross compiler and an emulator, but it appears that some configuration options are missing. We don't have PPC machines in CI, so the only way to test is via QEMU.

Which means that either it doesn't work on p8 or some instructions are not supported in qemu. I only tested on p9 and p10 HW platforms.

Can you find out which one it is? The PR documentation states that the ASM works P8 upwards.

@hanno-becker It looks like the qemu cross compiler doesn't support the "xxpermdi" instruction. I'll check.
Does your cross compiler support ISA 2.07?

@hanno-becker
Contributor

@dannytsen I don't know. You should be able to find out by assembling a minimal example and using powerpc64le-unknown-linux-gnu-objdump -d your_object_file.o for the disassembly.

@dannytsen
Author

@dannytsen I don't know. You should be able to find out by assembling a minimal example and using powerpc64le-unknown-linux-gnu-objdump -d your_object_file.o for the disassembly.

I'll check.

@dannytsen
Author

@hanno-becker I don't have qemu on my system. But I installed Nix on my Mac and ran the following command under Nix
(CC=powerpc64le-unknown-linux-gnu-gcc make build), and it compiled fine. The objdump output also looks fine (powerpc64le-unknown-linux-gnu-objdump -d ntt_ppc.S.o), so the assembly file should be good. But I can't run tests on my Mac with the cross-compiled binary.

This is the final build output.

FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024
CC test/build/mlkem512/test/gen_KAT.c.o
LD test/build/mlkem512/bin/gen_KAT512
KAT ML-KEM-512: test/build/mlkem512/bin/gen_KAT512
CC test/build/mlkem768/test/gen_KAT.c.o
LD test/build/mlkem768/bin/gen_KAT768
KAT ML-KEM-768: test/build/mlkem768/bin/gen_KAT768
CC test/build/mlkem1024/test/gen_KAT.c.o
LD test/build/mlkem1024/bin/gen_KAT1024
KAT ML-KEM-1024: test/build/mlkem1024/bin/gen_KAT1024
CC test/build/mlkem512/test/acvp_mlkem.c.o
LD test/build/mlkem512/bin/acvp_mlkem512
ACVP ML-KEM-512: test/build/mlkem512/bin/acvp_mlkem512
CC test/build/mlkem768/test/acvp_mlkem.c.o
LD test/build/mlkem768/bin/acvp_mlkem768
ACVP ML-KEM-768: test/build/mlkem768/bin/acvp_mlkem768
CC test/build/mlkem1024/test/acvp_mlkem.c.o
LD test/build/mlkem1024/bin/acvp_mlkem1024
ACVP ML-KEM-1024: test/build/mlkem1024/bin/acvp_mlkem1024
Everything builds fine!

I'll check about qemu.

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

I don't have qemu on my system. But I installed Nix on my Mac and ran the following command under Nix (CC=powerpc64le-unknown-linux-gnu-gcc make build), and it compiled fine.

Please see #1184 (comment) again -- once in the ci-cross shell, you can use the tests script to build and run with cross-compiler / QEMU emulation.


# ppc64le backend (little endian)

This directory contains a native backend for little-endian POWER8 (ppc64le) and later systems.
Contributor


As discussed before, I couldn't yet run the code with power8 emulation, and I think you mentioned you only tested power9 and power10. Are you sure it works on power8?

Contributor

@hanno-becker hanno-becker left a comment


@dannytsen I note that you do not cache the twisted twiddles in the Montgomery multiplication: That is, in

        vmladduhm 15, 15, V_QINV, 3
        vmladduhm 20, 20, V_QINV, 3
        vmladduhm 25, 25, V_QINV, 3
        vmladduhm 30, 30, V_QINV, 3

this could all be precomputed and stored in the constant table. This is what the AArch64 and x86_64 backends do.

Have you considered that?

Comment on lines +57 to +83
# fqmul = zeta * coefficient
# Modular multiplication, bounded by 2^16 * q in abs value
vmladduhm 15, 13, \_vz0, 3
vmladduhm 20, 18, \_vz1, 3
vmladduhm 25, 23, \_vz2, 3
vmladduhm 30, 28, \_vz3, 3

# Signed multiply-high-round; outputs are bounded by 2^15 * q in abs value
vmhraddshs 14, 13, \_vz0, 3
vmhraddshs 19, 18, \_vz1, 3
vmhraddshs 24, 23, \_vz2, 3
vmhraddshs 29, 28, \_vz3, 3

vmladduhm 15, 15, V_QINV, 3
vmladduhm 20, 20, V_QINV, 3
vmladduhm 25, 25, V_QINV, 3
vmladduhm 30, 30, V_QINV, 3

vmhraddshs 15, 15, V_NMKQ, 14
vmhraddshs 20, 20, V_NMKQ, 19
vmhraddshs 25, 25, V_NMKQ, 24
vmhraddshs 30, 30, V_NMKQ, 29

vsrah 13, 15, 4 # >> 1
vsrah 18, 20, 4 # >> 1
vsrah 23, 25, 4 # >> 1
vsrah 28, 30, 4 # >> 1
Contributor

@hanno-becker hanno-becker Sep 22, 2025

@dannytsen @bhess You tread new ground here, algorithmically, I think? This is essentially Montgomery via Rounding but this has been documented to require an odd operand, which is not given (the original idea was that one would bump twiddles by Q to make them odd if necessary, but that's not happening here). Interestingly, the off-by-plus-one error that may (and does!) emerge from this is then compensated for by the >> 1, an observation that I don't think has been made in the literature.

I've exhaustively tested the equivalence between the above Montmul and a normal one here:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

#define MLKEM_Q 3329

// PPC instruction equivalents for int16
int16_t vmladduhm(int16_t a, int16_t b, int16_t c) {
    return (a * b + c) & 0xFFFF;
}

int16_t vmhraddshs(int16_t a, int16_t b, int16_t c) {
    int32_t doubled_mul = 2 * (int32_t)a * (int32_t)b;
    int16_t high = (doubled_mul + 0x8000) >> 16;
    return high + c;
}

int16_t vsrah(int16_t a, int shift) {
    return (int32_t) a >> shift;
}

// PPC Montgomery multiplication
int16_t montgomery_mult_ppc(int16_t coeff, int16_t zeta) {
    const int16_t V_QINV = -3327;
    const int16_t V_NMKQ = -3329;

    int16_t r13 = coeff, r14, r15;

    r15 = vmladduhm(r13, zeta, 0);
    r14 = vmhraddshs(r13, zeta, 0);

    r15 = vmladduhm(r15, V_QINV, 0);
    r15 = vmhraddshs(r15, V_NMKQ, r14);

    /* // We can tolerate an off-by-plus-one in the rounding which makes the result */
    /* // odd, because the correct value is even and shifted by 1 in the end anyway. */
    /*     if ((r15 & 1) != 0) { */
    /*         printf("Assert failed: input to shift is odd: a=0x%04x (%d), b=0x%04x (%d)\n", */
    /*                coeff, coeff, zeta, zeta); */
    /*     } */

    r13 = vsrah(r15, 1);

    return r13;
}

// Ordinary Montgomery multiplication
int16_t montgomery_mult_c(int16_t a, int16_t b) {
    const uint32_t QINV = 62209;
    int32_t prod = (int32_t)a * (int32_t)b;

    const uint16_t a_reduced = prod & UINT16_MAX;
    const uint16_t a_inverted = (a_reduced * QINV) & UINT16_MAX;
    const int16_t t = (int16_t)a_inverted;

    int32_t r = prod - ((int32_t)t * MLKEM_Q);
    r = r >> 16;

    return (int16_t)r;
}

int main() {
    printf("Exhaustively testing all int16 input pairs...\n");

    long int passed = 0;
    long int total = 0;

    int16_t a_low = INT16_MIN;
    int16_t a_high = INT16_MAX;

    int16_t b_low = -29000;
    int16_t b_high = 29000;

    for (int a = a_low; a <= a_high; a++) {
        for (int b = b_low; b <= b_high; b++) {
            int16_t result_c = montgomery_mult_c((int16_t)a, (int16_t)b);
            int16_t result_ppc = montgomery_mult_ppc((int16_t)a, (int16_t)b);

            total++;
            if (result_c == result_ppc) {
                passed++;
            } else {
                printf("FAIL: a=0x%04x (%d) b=0x%04x (%d) c_result=0x%04x (%d) ppc_result=0x%04x (%d)\n",
                       (uint16_t)a, a,
		       (uint16_t)b, b,
		       (uint16_t)result_c, result_c,
		       (uint16_t)result_ppc, result_ppc);
            }
        }
    }

    printf("Passed %ld/%ld tests (%.6f%%)\n", passed, total, 100.0 * passed / total);
    return passed == total ? 0 : 1;
}

Nice 😄

On the other hand, I don't think there is a reason here not to just use Barrett multiplication as in the AArch64 code: this would remove a) one low multiplication, and b) the final >> 1, and would likely also pipeline better since it uses low-MLA rather than high-MLA. This would of course double the number of twiddles, though. What are your thoughts?

@hanno-becker
Contributor

@dannytsen @bhess A general note: I don't know anything about the Power microarchitectures, but depending on how OOO they are, whether latency/throughput/units are public knowledge, and how important performance is to you, you may consider adding a microarchitecture model for SLOTHY and applying that here. Ordinarily, that would be a fair amount of boilerplate work, but in today's world of LLMs, I would imagine that an LLM can do this pretty quickly for you if you feed some microarchitecture documentation as input. Me or @mkannwischer can of course help with generic SLOTHY questions.

Obviously, this is not a blocker, but just an FYI for you.

#include "x86_64/meta.h"
#endif

#ifdef MLK_SYS_PPC64LE
Contributor

@dannytsen This needs to be guarded by feature flags indicating supporting for the extensions used by the backend. Could you adjust this?

Removed non-p8 instruction, xxspltib.

Signed-off-by: Danny Tsen <[email protected]>
@dannytsen
Author

@dannytsen I note that you do not cache the twisted twiddles in the Montgomery multiplication: That is, in

        vmladduhm 15, 15, V_QINV, 3
        vmladduhm 20, 20, V_QINV, 3
        vmladduhm 25, 25, V_QINV, 3
        vmladduhm 30, 30, V_QINV, 3

this could all be precomputed and stored in the constant table. This is what the AArch64 and x86_64 backends do.

Have you considered that?

@hanno-becker I did not read their codes. I'll think about it.

@dannytsen
Author

@dannytsen @bhess A general note: I don't know anything about the Power microarchitectures, but depending on how OOO they are, whether latency/throughput/units are public knowledge, and how important performance is to you, you may consider adding a microarchitecture model for SLOTHY and applying that here. Ordinarily, that would be a fair amount of boilerplate work, but in today's world of LLMs, I would imagine that an LLM can do this pretty quickly for you if you feed some microarchitecture documentation as input. Me or @mkannwischer can of course help with generic SLOTHY questions.

Obviously, this is not a blocker, but just an FYI for you.

@hanno-becker I don't know about SLOTHY. My code is all based on my experience and understanding of the PPC microarchitecture, which may not be optimal. I try to make my code easy to read and maintain while also optimizing performance. I'll spend some time on SLOTHY. Thanks.

@hanno-becker
Contributor

@hanno-becker I don't know about SLOTHY. My code is all based on my experience and understanding of the PPC microarchitecture, which may not be optimal. I try to make my code easy to read and maintain while also optimizing performance. I'll spend some time on SLOTHY. Thanks.

Is there such a thing as a Software Optimization Guide for Power (e.g. like this one for Cortex-A78)?

@dannytsen
Author

@hanno-becker I don't know about SLOTHY. My code is all based on my experience and understanding of the PPC microarchitecture, which may not be optimal. I try to make my code easy to read and maintain while also optimizing performance. I'll spend some time on SLOTHY. Thanks.

Is there such a thing as a Software Optimization Guide for Power (e.g. like this one for Cortex-A78)?

@hanno-becker @bhess As far as I know, there are none.

Comment on lines 590 to 619
.align 4
#
# Montgomery reduce loops with constant 1441
#
addi 14, 4, C1441_OFFSET
lvx V1441, 0, 14

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28

Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 6, 7, 8, 9
Reload_4coeffs
MREDUCE_4X V1441, V1441, V1441, V1441, 13, 18, 23, 28
MWrite_8X 32+6, 32+7, 32+8, 32+9, 32+13, 32+18, 32+23, 32+28
Contributor

@dannytsen This will need to be moved to the beginning of the invNTT: We don't reduce the output of the base multiplication, which means that input coefficient to the invNTT is essentially unconstrained in size. Frontloading the Montgomery scaling acts as a reduction.

Contributor

This is likely the reason why you see the newly added unit tests failing in #1193.

@hanno-becker
Contributor

hanno-becker commented Sep 25, 2025

@dannytsen I don't yet understand how the last two NTT layers handle data permutation. It seems to me that Layers 1-5 load and store data without interleaving. But for layer 6, the butterflies operate at distance of 4 coefficients, which is smaller than the vector size (8 coefficients), so I would expect some intra-vector rearrangement or a modified load (like ld2 on Arm). I see that the store operations do some interleaving, but I don't see anything up until and including the load for layer 6. What am I missing?

@hanno-becker
Contributor

hanno-becker commented Sep 25, 2025

@dannytsen Looking at the invNTT, beyond the ask to move the scaling to the front (blocker for the PR), is there a particular reason why you Barrett-reduce the entire data at every layer? Esp. when the scaling is moved to the front and the coefficients thereby reduced, you only need few reductions in the middle of the invNTT. Have a look at the bounds estimates in https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L238.

Performance improvements are not a blocker for the initial merge, but we should at least document them in the code.

@dannytsen
Author

@dannytsen I don't yet understand how the last two NTT layers handle data permutation. It seems to me that Layers 1-5 load and store data without interleaving. But for layer 6, the butterflies operate at distance of 4 coefficients, which is smaller than the vector size (8 coefficients), so I would expect some intra-vector rearrangement or a modified load (like ld2 on Arm). I see that the store operations do some interleaving, but I don't see anything up until and including the load for layer 6. What am I missing?

@hanno-becker I did not read the x86 or Arm code. I just coded it based on the C file. There may be something I missed that can be improved. I'm not a mathematician, so there may be some mistakes or an incorrect approach.

Layer 6 could be reworked. The data is rearranged at the beginning and rearranged back afterwards, which may be more work than necessary. I will definitely look more into it. Thanks.

@dannytsen
Author

@dannytsen Looking at the invNTT, beyond the ask to move the scaling to the front (blocker for the PR), is there a particular reason why you Barrett-reduce the entire data at every layer? Esp. when the scaling is moved to the front and the coefficients thereby reduced, you only need few reductions in the middle of the invNTT. Have a look at the bounds estimates in https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L238.

Performance improvements are not a blocker for the initial merge, but we should at least document them in the code.

@hanno-becker My code was originally based on the C file for the Kyber NTT/INTT. I just followed the C code and implemented it in assembly. I am sure it may not be mathematically optimal or ideal.

@hanno-becker
Contributor

hanno-becker commented Sep 26, 2025

I did not read the x86 or Arm code. I just coded it based on the C file. There may be something I missed that can be improved. I'm not a mathematician, so there may be some mistakes or an incorrect approach.

@dannytsen Ok. We do need confidence in the correctness of the code; that's not negotiable. Performance is negotiable, but we leave a lot on the table without optimizations like a) lazy reduction or b) layer merging, and I'm not sure the effort of adding the PPC64 assembly is worth it if we don't unlock them. Ping @mkannwischer @bhess for other opinions on how to proceed here.

Would you have time to provide a new version of the assembly for the NTT and invNTT which follows the same approach as the AArch64 [inv]NTT? The main points are:

  1. Use layer merging: With 32 vector registers, it is easy to "merge" layers 1,2,3 in the sense that you load the coefficients once, do the first 3 levels of butterflies, and then store (see https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/ntt.S#L221); similarly, you can then merge layers 4,5,6,7, with a transpose of coefficients in between layer 5 and 6.
  2. Use lazy reduction in the invNTT: Only apply Barrett reduction when coefficients are about to overflow. See https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L238
  3. Use Barrett multiplication for the modular arithmetic: https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/ntt.S#L64 It appears that you have the exact same instructions in VSX, so this should be completely straightforward.

I am as little an expert in PPC64LE as you are in AArch64, but I am sure that we can work out the translation. In fact, I asked an LLM to make a start at documenting the AArch64 NTT in such a way that will -- hopefully -- be of some use to you. I am pretty sure some of it will be wrong, but I did look more closely at what it did for the modular arithmetic primitives, and I think it is correct to say that there is a straightforward translation of Barrett multiplication into PPC64LE.

/* Copyright (c) 2022 Arm Limited
 * Copyright (c) 2022 Hanno Becker
 * Copyright (c) 2023 Amin Abdulrahman, Matthias Kannwischer
 * Copyright (c) The mlkem-native project authors
 * SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT
 */

/* ANNOTATED FOR PPC64LE TRANSLATION
 * =================================
 * This file contains the original AArch64 NTT implementation with detailed
 * commentary explaining how each instruction maps to PPC64LE VSX equivalents.
 *
 * Key AArch64 -> PPC64LE instruction mappings:
 * - NEON vectors (v0-v31) -> VSX vectors (v0-v31, accessed as VSR32-VSR63)
 * - ld1/st1 -> lxvd2x/stxvx (with endianness correction via xxpermdi)
 * - add/sub -> vadduhm/vsubuhm (unsigned halfword modulo)
 * - mul -> vmladduhm with zero accumulator
 * - sqrdmulh -> vmhraddshs (multiply-high-round-add signed halfword saturate)
 * - mls -> vmladduhm with accumulator (effectively multiply-add)
 * - trn1/trn2 -> xxmrglw/xxmrghw (merge low/high words)
 */

#include "../../../common.h"
#if defined(MLK_ARITH_BACKEND_AARCH64) && \
    !defined(MLK_CONFIG_MULTILEVEL_NO_SHARED)
/* simpasm: header-end */

/* PPC64LE TRANSLATION: Montgomery Multiplication Macro
 * ====================================================
 * AArch64 uses sqrdmulh + mul + mls for Montgomery reduction.
 * PPC64LE equivalent uses vmhraddshs + vmladduhm + vmladduhm (with zero/non-zero accumulators).
 *
 * AArch64 sqrdmulh: Signed saturating rounding doubling multiply high
 * -> PPC64LE vmhraddshs: Vector multiply-high-round-add signed halfword saturate
 *
 * AArch64 mul: Vector multiply (low part)
 * -> PPC64LE vmladduhm with zero accumulator: vmladduhm dst, src, const, zero_reg
 *
 * AArch64 mls: Vector multiply-subtract (dst = dst - src1*src2)
 * -> PPC64LE vmladduhm with accumulator: vmladduhm dst, t2, neg_q, dst
 */
.macro mulmodq dst, src, const, idx0, idx1
        // Signed barrett multiplication @[NeonNTT, Section 3.1.2] using
        // round-to-nearest-even-integer approximation. Following loc.cit.,
        // this is functionally the same as a signed Montgomery multiplication
        // with a suitable constant of absolute value < q.

        // AArch64: sqrdmulh t2.8h, src.8h, const.h[idx1]
        // PPC64LE: vmhraddshs t2, src, const_high, zero_reg
        // Computes high part of multiplication with rounding and saturation
        sqrdmulh t2.8h,      \src\().8h, \const\().h[\idx1\()]

        // AArch64: mul dst.8h, src.8h, const.h[idx0]
        // PPC64LE: vmladduhm dst, src, const_low, zero_reg (zero accumulator)
        // Computes low part of multiplication (modulo 2^16)
        mul      \dst\().8h, \src\().8h, \const\().h[\idx0\()]

        // AArch64: mls dst.8h, t2.8h, consts.h[0]  (dst = dst - t2*q)
        // PPC64LE: vmladduhm dst, t2, neg_q, dst (dst accumulator)
        // Performs Montgomery reduction: dst = dst + (t2 * (-q))
        mls      \dst\().8h, t2.8h,      consts.h[0]
.endm

/* PPC64LE TRANSLATION: Vector Montgomery Multiplication
 * =====================================================
 * This version uses full vector operands instead of indexed access.
 * More suitable for PPC64LE where indexed vector operations are limited.
 */
.macro mulmod dst, src, const, const_twisted
        // AArch64: sqrdmulh t2.8h, src.8h, const_twisted.8h
        // PPC64LE: vmhraddshs t2, src, const_twisted, zero_reg
        sqrdmulh t2.8h,   \src\().8h, \const_twisted\().8h

        // AArch64: mul dst.8h, src.8h, const.8h
        // PPC64LE: vmladduhm dst, src, const, zero_reg (zero accumulator)
        mul      \dst\().8h, \src\().8h, \const\().8h

        // AArch64: mls dst.8h, t2.8h, consts.h[0]
        // PPC64LE: vmladduhm dst, t2, neg_q, dst (dst accumulator)
        mls      \dst\().8h, t2.8h,   consts.h[0]
.endm

/* PPC64LE TRANSLATION: Cooley-Tukey Butterfly with Indexed Twiddles
 * ==================================================================
 * Implements: (a', b') = (a + t*w, a - t*w) where t = b, w = twiddle
 *
 * AArch64 approach:
 * 1. Compute t = mulmodq(b, w)  // Montgomery multiply b by twiddle
 * 2. b' = a - t                 // Upper butterfly output
 * 3. a' = a + t                 // Lower butterfly output
 *
 * PPC64LE equivalent uses the MREDUCE_4X macro pattern from existing code.
 */
.macro ct_butterfly a, b, root, idx0, idx1
        // Step 1: Montgomery multiply b by twiddle factor
        // AArch64: mulmodq tmp, b, root, idx0, idx1
        // PPC64LE: Use MREDUCE_4X pattern with single twiddle
        mulmodq tmp, \b, \root, \idx0, \idx1

        // Step 2: Butterfly subtraction (upper output)
        // AArch64: sub b.8h, a.8h, tmp.8h
        // PPC64LE: vsubuhm b, a, tmp
        sub \b\().8h, \a\().8h, tmp.8h

        // Step 3: Butterfly addition (lower output)
        // AArch64: add a.8h, a.8h, tmp.8h
        // PPC64LE: vadduhm a, a, tmp
        add \a\().8h, \a\().8h, tmp.8h
.endm

/* PPC64LE TRANSLATION: Cooley-Tukey Butterfly with Vector Twiddles
 * =================================================================
 * Same butterfly operation but with full vector twiddle factors.
 * Better suited for PPC64LE's vector instruction set.
 */
.macro ct_butterfly_v a, b, root, root_twisted
        // Montgomery multiply with vector twiddles
        mulmod tmp, \b, \root, \root_twisted

        // Butterfly operations
        // AArch64: sub/add -> PPC64LE: vsubuhm/vadduhm
        sub \b\().8h, \a\().8h, tmp.8h
        add \a\().8h, \a\().8h, tmp.8h
.endm

/* PPC64LE TRANSLATION: Twiddle Factor Loading Macros
 * ===================================================
 * AArch64 uses ldr (load register) to load 128-bit vectors from memory.
 * PPC64LE uses lxv/lxvd2x for vector loads, with endianness considerations.
 */

/* Load twiddle factors for layers 1, 2, 3
 * AArch64: ldr loads 128 bits (8 int16_t values) into NEON register
 * PPC64LE: lxv loads 128 bits into VSX register (no byte swap needed if data is correct endian)
 *          lxvd2x loads with automatic little-endian byte swap
 */
.macro load_roots_123
        // AArch64: ldr q_root0, [r12345_ptr], #32  // Load and post-increment by 32 bytes
        // PPC64LE: lxv VSR(root0), 0(r12345_ptr)   // Load 16 bytes
        //          addi r12345_ptr, r12345_ptr, 32  // Manual increment
        ldr q_root0, [r12345_ptr], #32

        // AArch64: ldr q_root1, [r12345_ptr, #-16] // Load with negative offset
        // PPC64LE: lxv VSR(root1), -16(r12345_ptr) // Load with offset
        ldr q_root1, [r12345_ptr, #-16]
.endm

/* Load twiddle factors for layers 4, 5
 * Simpler loading pattern - single vector per call
 */
.macro load_next_roots_45
        // AArch64: ldr q_root0, [r12345_ptr], #16  // Load 16 bytes, post-increment
        // PPC64LE: lxv VSR(root0), 0(r12345_ptr)
        //          addi r12345_ptr, r12345_ptr, 16
        ldr q_root0, [r12345_ptr], #16
.endm

/* Load twiddle factors for layers 6, 7
 * More complex loading pattern with multiple vectors and twisted versions
 * PPC64LE note: "twisted" versions are pre-computed high parts for Montgomery reduction
 */
.macro load_next_roots_67
        // Load primary twiddle factors
        // AArch64: ldr with complex addressing: base + offset, then increment base
        // PPC64LE: Multiple lxv instructions with calculated offsets
        ldr q_root0,    [r67_ptr], #(6*16)          // Load root0, advance by 96 bytes
        ldr q_root0_tw, [r67_ptr, #(-6*16 + 1*16)]  // Load twisted version at offset -80
        ldr q_root1,    [r67_ptr, #(-6*16 + 2*16)]  // Load root1 at offset -64
        ldr q_root1_tw, [r67_ptr, #(-6*16 + 3*16)]  // Load twisted version at offset -48
        ldr q_root2,    [r67_ptr, #(-6*16 + 4*16)]  // Load root2 at offset -32
        ldr q_root2_tw, [r67_ptr, #(-6*16 + 5*16)]  // Load twisted version at offset -16

        /* PPC64LE equivalent would be:
         * lxv VSR(root0), 0(r67_ptr)
         * lxv VSR(root0_tw), 16(r67_ptr)
         * lxv VSR(root1), 32(r67_ptr)
         * lxv VSR(root1_tw), 48(r67_ptr)
         * lxv VSR(root2), 64(r67_ptr)
         * lxv VSR(root2_tw), 80(r67_ptr)
         * addi r67_ptr, r67_ptr, 96
         */
.endm

/* PPC64LE TRANSLATION: Matrix Transpose for 4x4 Layout
 * =====================================================
 * AArch64 uses trn1/trn2 (transpose) instructions for data reorganization.
 * PPC64LE uses xxmrglw/xxmrghw (merge low/high words) and xxpermdi (permute doublewords).
 *
 * This macro transforms data layout from:
 * [a0 a1 a2 a3 | b0 b1 b2 b3 | c0 c1 c2 c3 | d0 d1 d2 d3]
 * to:
 * [a0 b0 c0 d0 | a1 b1 c1 d1 | a2 b2 c2 d2 | a3 b3 c3 d3]
 */
.macro transpose4 data
        // Step 1: Transpose at 32-bit word level
        // AArch64: trn1/trn2 t.4s, data0.4s, data1.4s
        // Interleaves 32-bit words: trn1 takes even positions, trn2 takes odd
        // PPC64LE: xxmrglw/xxmrghw merge low/high 32-bit words from two vectors
        trn1 t0.4s, \data\()0.4s, \data\()1.4s  // t0 = [a0 b0 a2 b2]
        trn2 t1.4s, \data\()0.4s, \data\()1.4s  // t1 = [a1 b1 a3 b3]
        trn1 t2.4s, \data\()2.4s, \data\()3.4s  // t2 = [c0 d0 c2 d2]
        trn2 t3.4s, \data\()2.4s, \data\()3.4s  // t3 = [c1 d1 c3 d3]

        // Step 2: Transpose at 64-bit doubleword level
        // AArch64: trn1/trn2 t.2d, t0.2d, t2.2d
        // Interleaves 64-bit doublewords
        // PPC64LE: xxpermdi permutes doublewords between vectors
        trn2 \data\()2.2d, t0.2d, t2.2d         // data2 = [a2 b2 c2 d2]
        trn2 \data\()3.2d, t1.2d, t3.2d         // data3 = [a3 b3 c3 d3]
        trn1 \data\()0.2d, t0.2d, t2.2d         // data0 = [a0 b0 c0 d0]
        trn1 \data\()1.2d, t1.2d, t3.2d         // data1 = [a1 b1 c1 d1]

        /* PPC64LE equivalent:
         * xxmrglw t0, data0, data1      // Merge low words
         * xxmrghw t1, data0, data1      // Merge high words
         * xxmrglw t2, data2, data3      // Merge low words
         * xxmrghw t3, data2, data3      // Merge high words
         * xxpermdi data0, t0, t2, 0     // Permute: low of t0, low of t2
         * xxpermdi data1, t1, t3, 0     // Permute: low of t1, low of t3
         * xxpermdi data2, t0, t2, 3     // Permute: high of t0, high of t2
         * xxpermdi data3, t1, t3, 3     // Permute: high of t1, high of t3
         */
.endm
/* PPC64LE TRANSLATION: Stack Management
 * =====================================
 * AArch64 uses stp/ldp (store/load pair) for efficient register saving.
 * PPC64LE: This needs adjusting to whatever calling convention PPC64LE has.
 */
.macro save_vregs
        // AArch64: Adjust stack pointer and save register pairs
        sub sp, sp, #(16*4)                    // Allocate 64 bytes on stack
        stp  d8,  d9, [sp, #16*0]             // Store d8,d9 pair at sp+0
        stp d10, d11, [sp, #16*1]             // Store d10,d11 pair at sp+16
        stp d12, d13, [sp, #16*2]             // Store d12,d13 pair at sp+32
        stp d14, d15, [sp, #16*3]             // Store d14,d15 pair at sp+48
.endm

.macro restore_vregs
        // AArch64: Restore register pairs and adjust stack pointer
        ldp  d8,  d9, [sp, #16*0]             // Load d8,d9 pair from sp+0
        ldp d10, d11, [sp, #16*1]             // Load d10,d11 pair from sp+16
        ldp d12, d13, [sp, #16*2]             // Load d12,d13 pair from sp+32
        ldp d14, d15, [sp, #16*3]             // Load d14,d15 pair from sp+48
        add sp, sp, #(16*4)                   // Deallocate 64 bytes from stack
.endm

.macro push_stack
        save_vregs
.endm

.macro pop_stack
        restore_vregs
.endm

/* PPC64LE TRANSLATION: Register Assignments
 * ==========================================
 * AArch64 uses x0-x30 for general purpose, v0-v31 for NEON vectors.
 * PPC64LE uses r0-r31 for general purpose, raw numbers for vector registers.
 * VSX registers: v0-v31 are accessed as 32+0 through 32+31 in instructions.
 */

        // Arguments - AArch64 calling convention
        in         .req x0 // Input/output buffer -> PPC64LE: 3
        r12345_ptr .req x1 // twiddles for layer 0,1,2,3,4 -> PPC64LE: 4
        r67_ptr    .req x2 // twiddles for layer 5,6 -> PPC64LE: 5

        // Working registers
        inp     .req x3    // Saved input pointer -> PPC64LE: 6
        count   .req x4    // Loop counter -> PPC64LE: 7
        wtmp    .req w5    // 32-bit temporary -> PPC64LE: 8

        // Data vectors - using callee-saved NEON registers v8-v15
        // PPC64LE: Use raw numbers, accessed as 32+N in instructions
        data0  .req v8     // PPC64LE: 20 (accessed as 32+20)
        data1  .req v9     // PPC64LE: 21 (accessed as 32+21)
        data2  .req v10    // PPC64LE: 22 (accessed as 32+22)
        data3  .req v11    // PPC64LE: 23 (accessed as 32+23)
        data4  .req v12    // PPC64LE: 24 (accessed as 32+24)
        data5  .req v13    // PPC64LE: 25 (accessed as 32+25)
        data6  .req v14    // PPC64LE: 26 (accessed as 32+26)
        data7  .req v15    // PPC64LE: 27 (accessed as 32+27)

        // 128-bit (quadword) versions of data vectors
        q_data0  .req q8   // PPC64LE: 32+20 in VSX instructions
        q_data1  .req q9   // PPC64LE: 32+21 in VSX instructions
        q_data2  .req q10  // PPC64LE: 32+22 in VSX instructions
        q_data3  .req q11  // PPC64LE: 32+23 in VSX instructions
        q_data4  .req q12  // PPC64LE: 32+24 in VSX instructions
        q_data5  .req q13  // PPC64LE: 32+25 in VSX instructions
        q_data6  .req q14  // PPC64LE: 32+26 in VSX instructions
        q_data7  .req q15  // PPC64LE: 32+27 in VSX instructions

        // Twiddle factor vectors - using caller-saved registers
        // PPC64LE: Use raw numbers, accessed as 32+N in instructions
        root0    .req v0   // PPC64LE: 0 (accessed as 32+0)
        root1    .req v1   // PPC64LE: 1 (accessed as 32+1)
        root2    .req v2   // PPC64LE: 2 (accessed as 32+2)
        root0_tw .req v4   // PPC64LE: 4 (accessed as 32+4) - twisted version
        root1_tw .req v5   // PPC64LE: 5 (accessed as 32+5)
        root2_tw .req v6   // PPC64LE: 6 (accessed as 32+6)

        // 128-bit versions of twiddle vectors
        q_root0    .req q0 // PPC64LE: 32+0 in VSX instructions
        q_root1    .req q1 // PPC64LE: 32+1 in VSX instructions
        q_root2    .req q2 // PPC64LE: 32+2 in VSX instructions
        q_root0_tw .req q4 // PPC64LE: 32+4 in VSX instructions
        q_root1_tw .req q5 // PPC64LE: 32+5 in VSX instructions
        q_root2_tw .req q6 // PPC64LE: 32+6 in VSX instructions

        // Constants vector (holds MLKEM_Q and other constants)
        consts    .req v7  // PPC64LE: 7 (accessed as 32+7)

        // Temporary vectors for intermediate calculations
        tmp .req v24       // PPC64LE: 28 (accessed as 32+28)
        t0  .req v25       // PPC64LE: 29 (accessed as 32+29)
        t1  .req v26       // PPC64LE: 30 (accessed as 32+30)
        t2  .req v27       // PPC64LE: 31 (accessed as 32+31)
        t3  .req v28       // PPC64LE: 3 (accessed as 32+3) - additional temp if needed
        .text
        .global MLK_ASM_NAMESPACE(ntt_asm)
        .balign 4
MLK_ASM_FN_SYMBOL(ntt_asm)
        /* PPC64LE TRANSLATION: Function Entry
         * ===================================
         * AArch64: push_stack saves callee-saved NEON registers
         * PPC64LE: Save callee-saved VSX registers and set up constants
         */
        push_stack

        /* PPC64LE TRANSLATION: Constant Initialization
         * =============================================
         * AArch64 uses mov to load immediate values into vector elements.
         * PPC64LE loads constants from memory or uses vector splat operations.
         */
        // Load MLKEM_Q = 3329 into constants vector
        // AArch64: mov wtmp, #3329; mov consts.h[0], wtmp
        // PPC64LE: li r9, 3329; mtvsrwz VSR39, r9; xxspltw VSR39, VSR39, 0
        mov wtmp, #3329                        // Load immediate 3329 into 32-bit register
        mov consts.h[0], wtmp                  // Move to first halfword of constants vector

        // Load Barrett constant 20159 for Montgomery reduction
        // This is used in the high-part multiplication for reduction
        mov wtmp, #20159                       // Barrett reduction constant
        mov consts.h[1], wtmp                  // Store in second halfword

        /* PPC64LE equivalent:
         * li r9, 3329                          // Load immediate MLKEM_Q
         * li r10, 20159                        // Load Barrett constant
         * mtvsrwz VSR39, r9                    // Move r9 to VSX register
         * mtvsrwz VSR40, r10                   // Move r10 to VSX register
         * xxspltw VSR39, VSR39, 0              // Splat word across vector (all elements = 3329)
         * xxspltw VSR40, VSR40, 0              // Splat word across vector (all elements = 20159)
         */

        // Initialize working pointers and loop counter
        mov inp, in                            // Save original input pointer
        mov count, #4                          // Process 4 blocks in layers 1-2-3

        // Load initial twiddle factors for layers 1, 2, 3
        load_roots_123

        .p2align 2

        // Bounds reasoning:
        // - There are 7 layers
        // - When passing from layer N to layer N+1, each layer-N value
        // is modified through the addition/subtraction of a Montgomery
        // product of a twiddle of absolute value < q/2 and a layer-N value.
        // - Recalling that for C such that |a| < C * q and |t|<q/2, we have
        // |mlk_fqmul(a,t)| < q * (0.0254*C + 1/2), we see that the coefficients
        // of layer N (starting with layer 0 = input data) are bound by q * f^N(1),
        // where f(C) = 1/2 + 1.0508*C.
        // For N=7, we get the bound of f^7(1) * q < 18295.
        //
        // See test/test_bounds.py for more details.
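The bound recursion in the comment above is easy to check numerically. A short Python model (a sketch; `f` and the constants 1.0508 and 0.5 are taken directly from the comment, with q = 3329):

```python
Q = 3329  # MLKEM_Q

def f(c):
    # One NTT layer: if layer-N coefficients satisfy |a| < c*q, then
    # layer-(N+1) coefficients satisfy |a| < f(c)*q, since each output
    # is a +/- mlk_fqmul(b, t) with |t| < q/2 and
    # |mlk_fqmul(b, t)| < q * (0.0254*c + 1/2).
    return 0.5 + 1.0508 * c

c = 1.0  # input coefficients are bound by q in absolute value
for _ in range(7):  # 7 NTT layers
    c = f(c)

bound = c * Q  # roughly 18294.7, i.e. f^7(1) * q < 18295 as stated
```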

ntt_layer123_start:
        /* PPC64LE TRANSLATION: Data Loading for Layers 1-2-3
         * ====================================================
         * AArch64: ldr loads 128 bits (8 int16_t values) from memory
         * PPC64LE: lxvd2x loads 128 bits with little-endian byte swap
         *          xxpermdi corrects doubleword order if needed
         *
         * Memory layout: 256 int16_t values = 512 bytes total
         * Each vector holds 8 int16_t values = 16 bytes
         * 512/8 = 64 bytes between corresponding elements in different "columns"
         */

        // Load 8 vectors of data (64 int16_t values total)
        // AArch64: ldr q_data0, [in, #0] loads 16 bytes from in+0
        // PPC64LE: lxvd2x VSR52, r3, r9 (where r9=0) loads with byte swap
        ldr q_data0, [in, #0]                  // Load data[0:7]
        ldr q_data1, [in, #(1*(512/8))]        // Load data[64:71] (offset 64 bytes)
        ldr q_data2, [in, #(2*(512/8))]        // Load data[128:135] (offset 128 bytes)
        ldr q_data3, [in, #(3*(512/8))]        // Load data[192:199] (offset 192 bytes)
        ldr q_data4, [in, #(4*(512/8))]        // Load data[256:263] (offset 256 bytes)
        ldr q_data5, [in, #(5*(512/8))]        // Load data[320:327] (offset 320 bytes)
        ldr q_data6, [in, #(6*(512/8))]        // Load data[384:391] (offset 384 bytes)
        ldr q_data7, [in, #(7*(512/8))]        // Load data[448:455] (offset 448 bytes)

        /* PPC64LE equivalent:
         * li r9, 0                             // Offset 0
         * li r10, 64                           // Offset 64
         * li r11, 128                          // Offset 128
         * li r12, 192                          // Offset 192
         * lxvd2x VSR52, r3, r9                 // Load data0 with byte swap
         * lxvd2x VSR53, r3, r10                // Load data1 with byte swap
         * lxvd2x VSR54, r3, r11                // Load data2 with byte swap
         * lxvd2x VSR55, r3, r12                // Load data3 with byte swap
         * xxpermdi VSR52, VSR52, VSR52, 2      // Correct doubleword order
         * xxpermdi VSR53, VSR53, VSR53, 2      // Correct doubleword order
         * xxpermdi VSR54, VSR54, VSR54, 2      // Correct doubleword order
         * xxpermdi VSR55, VSR55, VSR55, 2      // Correct doubleword order
         * (Continue for data4-data7 with offsets 256, 320, 384, 448)
         */

        /* PPC64LE TRANSLATION: Layer 1 NTT Butterflies
         * =============================================
         * Layer 1 processes 4 butterflies with stride 128 (4*64 bytes)
         * Each butterfly: (data[i], data[i+128]) -> (data[i]+t, data[i]-t)
         * where t = Montgomery_multiply(data[i+128], twiddle)
         */

        // Layer 1: 4 butterflies with the same twiddle factor
        // Butterfly 1: (data0, data4) using root0.h[0] and root0.h[1]
        ct_butterfly data0, data4, root0, 0, 1
        // Butterfly 2: (data1, data5) using same twiddle
        ct_butterfly data1, data5, root0, 0, 1
        // Butterfly 3: (data2, data6) using same twiddle
        ct_butterfly data2, data6, root0, 0, 1
        // Butterfly 4: (data3, data7) using same twiddle
        ct_butterfly data3, data7, root0, 0, 1

        /* PPC64LE TRANSLATION: Layer 2 NTT Butterflies
         * =============================================
         * Layer 2 processes 8 butterflies with stride 64 (2*64 bytes)
         * Uses different twiddle factors for different butterfly pairs
         */

        // Layer 2: 8 butterflies with 2 different twiddle factors
        // Butterflies 1-2: (data0,data2) and (data1,data3) using root0.h[2,3]
        ct_butterfly data0, data2, root0, 2, 3
        ct_butterfly data1, data3, root0, 2, 3
        // Butterflies 3-4: (data4,data6) and (data5,data7) using root0.h[4,5]
        ct_butterfly data4, data6, root0, 4, 5
        ct_butterfly data5, data7, root0, 4, 5

        /* PPC64LE TRANSLATION: Layer 3 NTT Butterflies
         * =============================================
         * Layer 3 processes 16 butterflies with stride 32 (1*64 bytes)
         * Uses 4 different twiddle factors from root0 and root1
         */

        // Layer 3: 16 butterflies with 4 different twiddle factors
        // Butterflies 1-2: (data0,data1) using root0.h[6,7]
        ct_butterfly data0, data1, root0, 6, 7
        // Butterflies 3-4: (data2,data3) using root1.h[0,1]
        ct_butterfly data2, data3, root1, 0, 1
        // Butterflies 5-6: (data4,data5) using root1.h[2,3]
        ct_butterfly data4, data5, root1, 2, 3
        // Butterflies 7-8: (data6,data7) using root1.h[4,5]
        ct_butterfly data6, data7, root1, 4, 5
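Functionally, every `ct_butterfly` invocation above is one Cooley-Tukey step. A simplified Python model (hypothetical names; plain modular multiplication stands in for the Montgomery product that the assembly computes with mul/sqrdmulh/mls on AArch64 or vmladduhm/vmhraddshs on PPC64LE):

```python
Q = 3329  # MLKEM_Q

def ct_butterfly(a, b, zeta):
    # Cooley-Tukey butterfly: (a, b) -> (a + zeta*b, a - zeta*b) mod q.
    # The assembly keeps coefficients in signed 16-bit lanes and only
    # reduces lazily; this model reduces fully for clarity.
    t = (b * zeta) % Q
    return (a + t) % Q, (a - t) % Q

a2, b2 = ct_butterfly(100, 200, 17)  # t = 3400 mod 3329 = 71
```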

        /* PPC64LE TRANSLATION: Data Storage After Layers 1-2-3
         * =====================================================
         * AArch64: str stores 128 bits to memory with post-increment addressing
         * PPC64LE: stxvd2x stores 128 bits with little-endian byte swap
         */

        // Store results back to memory
        // AArch64: str q_data0, [in], #16 stores and increments pointer by 16
        // PPC64LE: stxvd2x VSR52, r3, r9; addi r3, r3, 16
        str q_data0, [in], #(16)               // Store data0, advance pointer by 16
        str q_data1, [in, #(-16 + 1*(512/8))] // Store data1 at offset 48 from new position
        str q_data2, [in, #(-16 + 2*(512/8))] // Store data2 at offset 112 from new position
        str q_data3, [in, #(-16 + 3*(512/8))] // Store data3 at offset 176 from new position
        str q_data4, [in, #(-16 + 4*(512/8))] // Store data4 at offset 240 from new position
        str q_data5, [in, #(-16 + 5*(512/8))] // Store data5 at offset 304 from new position
        str q_data6, [in, #(-16 + 6*(512/8))] // Store data6 at offset 368 from new position
        str q_data7, [in, #(-16 + 7*(512/8))] // Store data7 at offset 432 from new position

        /* PPC64LE equivalent:
         * xxpermdi VSR52, VSR52, VSR52, 2      // Correct doubleword order before store
         * xxpermdi VSR53, VSR53, VSR53, 2      // (Reverse the correction from load)
         * stxvd2x VSR52, r3, r9                // Store data0 with byte swap
         * addi r3, r3, 16                      // Advance pointer
         * li r10, 48                           // Calculate offset for data1
         * stxvd2x VSR53, r3, r10               // Store data1
         * (Continue for remaining data vectors)
         */

        // Loop control: decrement counter and branch if not zero
        // AArch64: subs sets flags, cbnz branches if not zero
        // PPC64LE: subic. sets condition register, bne branches if not equal
        subs count, count, #1                  // Decrement counter and set flags
        cbnz count, ntt_layer123_start         // Branch if count != 0

        /* PPC64LE equivalent:
         * subic. r7, r7, 1                     // Subtract immediate and set CR0
         * bne ntt_layer123_start               // Branch if not equal (CR0[EQ] = 0)
         */
        /* PPC64LE TRANSLATION: Setup for Layers 4-5-6-7
         * ===============================================
         * After layers 1-2-3, we reset pointers and process remaining layers
         * with different stride patterns and data organization
         */

        // Reset input pointer and set up for layers 4-5-6-7
        mov in, inp                            // Restore original input pointer
        mov count, #8                          // Process 8 blocks in layers 4-5-6-7

        .p2align 2
ntt_layer4567_start:
        /* PPC64LE TRANSLATION: Data Loading for Layers 4-5-6-7
         * ======================================================
         * Now we process 4 vectors at a time (32 int16_t values)
         * with different memory stride patterns
         */

        // Load 4 consecutive vectors (64 bytes total)
        // AArch64: ldr loads 16 bytes each
        // PPC64LE: lxvd2x with consecutive offsets
        ldr q_data0, [in, #(16*0)]             // Load 16 bytes at offset 0
        ldr q_data1, [in, #(16*1)]             // Load 16 bytes at offset 16
        ldr q_data2, [in, #(16*2)]             // Load 16 bytes at offset 32
        ldr q_data3, [in, #(16*3)]             // Load 16 bytes at offset 48

        /* PPC64LE equivalent:
         * li r9, 0                             // Offset 0
         * li r10, 16                           // Offset 16
         * li r11, 32                           // Offset 32
         * li r12, 48                           // Offset 48
         * lxvd2x VSR52, r3, r9                 // Load data0
         * lxvd2x VSR53, r3, r10                // Load data1
         * lxvd2x VSR54, r3, r11                // Load data2
         * lxvd2x VSR55, r3, r12                // Load data3
         * xxpermdi VSR52, VSR52, VSR52, 2      // Endian correction
         * xxpermdi VSR53, VSR53, VSR53, 2
         * xxpermdi VSR54, VSR54, VSR54, 2
         * xxpermdi VSR55, VSR55, VSR55, 2
         */

        // Load twiddle factors for layer 4-5
        load_next_roots_45

        /* PPC64LE TRANSLATION: Layer 4 and 5 Butterflies
         * ===============================================
         * Layer 4: stride 16, Layer 5: stride 8
         * Uses indexed twiddle access for efficiency
         */

        // Layer 4: 2 butterflies with stride 16 (2 vectors apart)
        ct_butterfly data0, data2, root0, 0, 1 // Butterfly: (data0, data2)
        ct_butterfly data1, data3, root0, 0, 1 // Butterfly: (data1, data3)

        // Layer 5: 4 butterflies with stride 8 (1 vector apart)
        ct_butterfly data0, data1, root0, 2, 3 // Butterfly: (data0, data1)
        ct_butterfly data2, data3, root0, 4, 5 // Butterfly: (data2, data3)

        /* PPC64LE TRANSLATION: Matrix Transpose for Layers 6-7
         * =====================================================
         * The transpose operation reorganizes data for efficient processing
         * of the final two NTT layers with different access patterns
         */

        // Transpose the 4x4 matrix of vectors for layers 6-7
        // This changes the data layout to enable vectorized processing
        // of the remaining butterfly operations
        transpose4 data
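The effect of `transpose4` can be modeled in Python on a 4x4 matrix (a functional sketch of what the trn1/trn2, or xxmrglw/xxmrghw + xxpermdi, sequences achieve, ignoring that the real vectors hold 8 halfwords each):

```python
def transpose4(rows):
    # rows is a 4x4 matrix; returns its transpose, so lane i of each
    # input vector ends up in output vector i. The transpose is an
    # involution: applying it twice restores the original layout,
    # which is why the code transposes back before storing.
    return [list(col) for col in zip(*rows)]

m = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
t = transpose4(m)  # t[0] == [0, 4, 8, 12]
```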

        // Load twiddle factors for layers 6-7 (including twisted versions)
        load_next_roots_67

        /* PPC64LE TRANSLATION: Layer 6 and 7 Butterflies
         * ===============================================
         * These layers use vector twiddle factors (not indexed)
         * and process all elements in parallel within each vector
         */

        // Layer 6: 8 butterflies with stride 4 (within vectors)
        // Uses vector twiddle factors for parallel processing
        ct_butterfly_v data0, data2, root0, root0_tw // Butterfly with vector twiddles
        ct_butterfly_v data1, data3, root0, root0_tw // Same twiddles for parallel lanes

        // Layer 7: 16 butterflies with stride 2 (within vectors)
        ct_butterfly_v data0, data1, root1, root1_tw // Different twiddles for each pair
        ct_butterfly_v data2, data3, root2, root2_tw // Different twiddles for each pair

        /* PPC64LE TRANSLATION: Final Transpose and Storage
         * ================================================
         * Transpose back to restore the original data organization
         * before storing results to memory
         */

        // Transpose back to original layout
        transpose4 data

        // Store results back to memory
        // AArch64: str with post-increment addressing
        // PPC64LE: stxvd2x with manual pointer arithmetic
        str q_data0, [in], #(16*4)             // Store data0, advance by 64 bytes
        str q_data1, [in, #(-16*3)]            // Store data1 at offset -48
        str q_data2, [in, #(-16*2)]            // Store data2 at offset -32
        str q_data3, [in, #(-16*1)]            // Store data3 at offset -16

        /* PPC64LE equivalent:
         * xxpermdi VSR52, VSR52, VSR52, 2      // Endian correction before store
         * xxpermdi VSR53, VSR53, VSR53, 2
         * xxpermdi VSR54, VSR54, VSR54, 2
         * xxpermdi VSR55, VSR55, VSR55, 2
         * li r9, 0                             // Offset 0
         * stxvd2x VSR52, r3, r9                // Store data0
         * addi r3, r3, 64                      // Advance pointer by 64 bytes
         * li r10, -48                          // Offset -48
         * li r11, -32                          // Offset -32
         * li r12, -16                          // Offset -16
         * stxvd2x VSR53, r3, r10               // Store data1
         * stxvd2x VSR54, r3, r11               // Store data2
         * stxvd2x VSR55, r3, r12               // Store data3
         */

        // Loop control for layers 4-5-6-7
        subs count, count, #1                  // Decrement counter
        cbnz count, ntt_layer4567_start        // Continue if not zero

        pop_stack                              // Restore saved NEON/VSX registers
        ret                                    // Return to caller

/* simpasm: footer-start */
#endif /* MLK_ARITH_BACKEND_AARCH64 && !MLK_CONFIG_MULTILEVEL_NO_SHARED */

/* PPC64LE TRANSLATION SUMMARY
 * ===========================
 *
 * Key instruction mappings:
 * 1. Memory operations:
 *    - ldr/str -> lxvd2x/stxvd2x (with endianness correction)
 *    - Addressing modes -> manual offset calculation
 *
 * 2. Arithmetic operations:
 *    - add/sub -> vadduhm/vsubuhm (unsigned halfword modulo)
 *    - mul -> vmladduhm with zero accumulator
 *    - sqrdmulh -> vmhraddshs (multiply-high-round-add signed halfword saturate)
 *    - mls -> vmladduhm with accumulator (a multiply-add; negate one operand to realize the multiply-subtract)
 *
 * 3. Data reorganization:
 *    - trn1/trn2 -> xxmrglw/xxmrghw + xxpermdi
 *    - Matrix transpose -> word merge + doubleword permute
 */
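The mul/sqrdmulh/mls triple (resp. vmladduhm/vmhraddshs on PPC64LE) jointly computes a Montgomery multiplication modulo q. A Python reference of the underlying arithmetic (a sketch; `QINV` is q^-1 mod 2^16 as in the Kyber reference code):

```python
Q = 3329
QINV = -3327  # q^{-1} mod 2^16, since 3329 * (-3327) = 1 (mod 2^16)

def montgomery_reduce(a):
    # Given a with |a| < 2^15 * q, return r with r = a * 2^-16 (mod q)
    # and |r| < q. The low product (mul), high product (sqrdmulh) and
    # subtraction of t*q (mls) in the assembly realize this sequence.
    t = (a * QINV) & 0xFFFF  # low 16 bits of a * q^-1
    if t >= 0x8000:          # reinterpret as signed int16
        t -= 0x10000
    return (a - t * Q) >> 16  # exact: a - t*q is divisible by 2^16

def fqmul(a, b):
    # Montgomery product: fqmul(a, b) = a * b * 2^-16 (mod q)
    return montgomery_reduce(a * b)

r = fqmul(17, 42)  # r * 2^16 = 17 * 42 (mod q)
```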

Could you give this a shot?

@dannytsen
Copy link
Author

@hanno-becker Thanks for taking on so much of the ppc64le integration. I'm learning a lot here, in all aspects. I don't know how much time it will take me, but I'll work on it.

@hanno-becker
Copy link
Contributor

@hanno-becker Thanks for taking on so much of the ppc64le integration. I'm learning a lot here, in all aspects. I don't know how much time it will take me, but I'll work on it.

@dannytsen This is great to hear! Please don't hesitate to ask if anything is unclear about the algorithmic aspects of the AArch64 implementation, or how to translate it. Otherwise, I'm looking forward to seeing the updated code!

@bhess
Copy link
Contributor

bhess commented Sep 29, 2025

Thanks @dannytsen! Feel free to let me know if there’s any area where I can jump in and help out with coding.

@dannytsen
Copy link
Author

@hanno-becker @bhess Thanks.

@hanno-becker
Copy link
Contributor

@dannytsen Any update?

@dannytsen
Copy link
Author

@hanno-becker Still working on fixing the current code before moving on to the rest.

@hanno-becker
Copy link
Contributor

@dannytsen Ack. Let me know when you're done with the rework or have any questions.

@dannytsen
Copy link
Author

@dannytsen Ack. Let me know when you're done with the rework or have any questions.

@hanno-becker @bhess Adapting my implementation to your new backend unit tests was straightforward and simple: it just matches the workflow of your C implementation. The implementation works as is, so I don't plan on reworking it any time soon. Thanks.

@hanno-becker
Copy link
Contributor

@bhess @dannytsen Could you provide performance improvement data for the backend as it stands? What is the plan towards integrating assembly that leverages lazy reduction and layer merging?
