Initial riscv64 vector support (uses standard vector intrinsics for RVV 1.0. Presently VLEN=256 only.) #1037
Conversation
Force-pushed 0637941 to 6b4f845
@mjosaarinen If you have
Force-pushed dcab861 to 38e79bb
Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so it might provide some inspiration for a fully vectorized, hand-written back-end.
Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at it when considering RISC-V ISA extensions (see slide 17 for example in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive), I could have just read the documentation. I'll set up this nix thing.
@mjosaarinen Sorry, we should have pointed that out earlier. With the
@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass.
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1
Same for
Could you investigate?
The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.
It looks like some things are running fine, and some things have build problems. I suspect that platform auto-detection code (in the C header files + makefiles) is to blame (does failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?
Sure. If you have nix installed, then it's exactly what I posted above: you open the nix shell with
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
Building using cross compiler:
And then running:
The base command line used by
What is confusing to me is that this happens regardless of whether
If you don't want to be bothered by auto-settings, we can for now set
It looks like the issue is solely due to
it'll fail with
Probably some configuration is missing in the invocation of
Hmm. I'm not sure if qemu is able to pull the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with
Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.
I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.
Force-pushed 951ab19 to 0c4d7e8
Yes! For the record, this is exactly what the HOL-Light proofs for Arm are doing. You will see tight bounds in the HOL-Light specs that would be hard to verify on paper and impossible by eye. However, the frontend never relies on those bounds. The bounds required by the frontend can so far be audited by eye. This means we go as far into lazy reduction as can easily be annotated and followed in the code, but not further. That may leave some unnecessary reduction in the code, but at least for the invNTT the returns of more and more lazy reduction diminish quickly.
I think it can be a follow-up - there is always a fallback to the C back-end. The thing about that is: while some parts can be easily made VLEN-agnostic using the "strip-mining" approach (discussed in the RISC-V manual), it seems to me that for an efficient NTT/INTT, one would need to have separate versions for VLEN in { 128, 256, 512, 1024, perhaps others } due to the actual register allocation. Btw, in RVV the vector length VL is different from VLEN -- VL holds the dynamic vector length (in elements, like bytes or floats) while VLEN is the hardware size (in bits) of each of the 32 vector registers. Vectors (VL) do not have to be a power-of-two in size and can span up to 8 vector registers (LMUL). The max hardware register size allowed by the spec is VLEN=65536 bits. Due to LMUL one can technically do a lot of stuff in a single instruction (operate on 8*256 = 2048 bits even with VLEN=256), so that's another thing to explore. This code uses only LMUL=1 and LMUL=2.
@mjosaarinen Very interesting, thank you. I should say "VLEN" then, not "VL". What else needs doing then for this PR, from your perspective? I will aim to go through everything again, probably later or tomorrow. Do you have input on my note above on runtime detection of vector extension support?
Force-pushed b31855d to 040c339
0-diff rebase to remove merge commits.
Signed-off-by: Hanno Becker <[email protected]>
…sics.) Signed-off-by: Markku-Juhani O. Saarinen <[email protected]> Signed-off-by: Matthias J. Kannwischer <[email protected]> Signed-off-by: Hanno Becker <[email protected]>
Force-pushed 040c339 to 1ffff09
Did some history cleanup, and re-added bounds assertions aiming to check that coefficients stay unsigned-canonical.
It no longer even compiles (this file is needed.)
Well, I was thinking of running some benchmarks on real RVV hardware (SpacemiT X60) on Monday-Tuesday and perhaps tweaking some things a bit based on results.
From a practical viewpoint, it should be ok to have the existence of RVV (yes/no) as a build-time flag. After all, I understand that the Android and Ubuntu kernel & userland will be built for an RVA23-like processor profile and wouldn't work without Vector. Now, the physical vector length (VLEN) is another matter; it is intended to be dynamic and is supposed to be handled by the programs directly. RVV code can do that with 1 instruction (it's designed to be fast) and without syscalls etc.
Force-pushed d30f74e to 92d71c4
- Don't skip over 'opt' tests in CI when testing RV64 - Configure qemu to use 256-bit VL. Once we support other VLs, the invocation of qemu needs to be generalized accordingly. Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Force-pushed 92d71c4 to 3fa9c60
If that's the case, the PR needs further work to make the dispatch to the back-end dynamic -- at the moment, it is static. If I understand you correctly, at compile-time we should not have
Can you please make the corresponding adjustments? Is there a similarly convenient way to infer without syscalls whether RVV itself is present?
Current SpacemiT K1 8 (Banana Pi F3) benchmarks 👍
pos = 0;
ctr = 0;

while (ctr < len && pos + (vl23 * 2) <= buflen)
If buflen is not a multiple of vl23 * 2, then this function may leave bytes unprocessed in the input buffer, which would be incorrect. We should reject the input upfront if that's the case (thereby falling back to C), or add tail handling in support of this case. As far as I understand the code, it should be simple to basically handle the tail in the same way as the bulk of the loop, adapting vl23.
This starts to be an issue with VLEN>=512, where vl23 = 48 = 3 * 16, which is not a multiple of the concrete buflens MLK_XOF_RATE and MLK_XOF_RATE * MLKEM_GEN_MATRIX_NBLOCKS this function is called with.
The issue surfaces in the KATs as soon as one tests with VLEN=512.
while (ctr < len && pos < buflen)
{
const size_t vl = __riscv_vsetvl_e16m1((buflen - pos) * 8 / 12);
const size_t vl23 = (vl * 24) / 32;
const vuint16m1_t vid = __riscv_vid_v_u16m1(vl);
const vuint16m1_t srl12v = __riscv_vmul_vx_u16m1(vid, 12, vl);
const vuint16m1_t sel12v = __riscv_vsrl_vx_u16m1(srl12v, 4, vl);
const vuint16m1_t sll12v = __riscv_vsll_vx_u16m1(vid, 2, vl);
...
seems to work. Experimented a bit with the other functions, too, and it looks like 1f2fce3 passes QEMU tests for all supported VLENs (128,256,512,1024), skipping [inv]NTT unless VLEN=256 for now.
It's not yet clear to me why the [inv]NTT doesn't work for VLEN > 256 even if in the code one forces vl = 16. Should that behave as if VLEN=256?
@mjosaarinen At the moment, we compile with "CFLAGS += -march=rv64gcv_zvl256b". Is this still the right option if we want to compile for RVV 1.0 but without fixing the vector length? Or does this merely signify a minimal vector length? EDIT: Looks like we want just
Btw, leaving aside #1037 (comment), it's pretty cool to see the length-agnostic implementation of
Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
So, I did some further benchmarks. It's well known that LLVM has a really good autovectorizer for RISC-V (with large contributions from Google, I think); BoringSSL's MLKEM is not too slow on RISC-V because of that.
gcc: riscv64-unknown-linux-gnu-gcc (GCC) 14.2.1 20250322
The gcc intrin vs autovec numbers are within 1 or 2% of those previous benchmarks, probably down to measurement method (my runs were probably shorter).
Some observations:
Thanks for the analysis @mjosaarinen! Looks like the move to intrinsics still brings a notable benefit over autovectorization. Note that we omit
@mjosaarinen Thanks for those data. Could you re-run the "Autovec" set with clang 22 on the "fastntt3" branch please? I'm intrigued to see how clang does on that code, which was specifically written to be amenable to auto-vectorization. Results using GCC 14 on AArch64 were surprisingly good.
Unfortunately I already packed away my own SpacemiT X60 board as I am traveling tomorrow. (But I think there is one plugged into the CI somehow, perhaps one could try that one.)
Summary:
rv64v support (RISC-V vector extension 1.0, which is available on newer application-class silicon).
Do you expect this change to impact performance: Yes/No
Yes (RISC-V only).
If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.