
Conversation

rerdavies
Contributor

Provides better compiler optimization flags for ARM Cortex-A72/Pi 4 and ARM Cortex-A76/Pi 5.

The A72 optimizations provide about a 15% performance improvement on a Pi 4 compared to your current build. The A76 optimizations are more interesting. As reported by a remote colleague (I don't have a Pi 5), the optimizations allow 16 simultaneous instances to run on a Pi 5 (PiPedal hosted), instead of 8. (No buffering, 128x3 unfortunately, but no reason to think that it won't also work, or even work slightly better, at 64x3.) Informally, that's a 2x speed improvement compared to your current build with its current optimization flags (which are currently the CMake defaults on aarch64). Dramatic and very significant!

Optimizations are conditioned on two new CMake options in the root CMakeLists.txt:

option(A72_OPTIMIZATION "Optimize for cortex-a72/Pi4 (arm64 builds only)" ON)
option(A76_OPTIMIZATION "Optimize for cortex-a76/Pi5 (arm64 builds only)" OFF)

Both can be provided on the build command line as usual, e.g.:

cmake etc etc ... -D A72_OPTIMIZATION=ON

Compiler flags have to be applied in the root CMakeLists.txt file so that all of your deps build with the correct flags.
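For illustration, the gating might look something like this near the top of the root CMakeLists.txt (the option names match the pull request; the exact flag set shown is an assumption based on the flags quoted later in this thread):

```cmake
# Sketch only: apply tuning flags at the root so every subdirectory
# (NAM Core, RTNeural, NeuralAudio, ...) inherits them.
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
    if(A76_OPTIMIZATION)
        add_compile_options(-mcpu=cortex-a76 -mtune=cortex-a76 -O3 -ffast-math)
    elseif(A72_OPTIMIZATION)
        add_compile_options(-mcpu=cortex-a72 -mtune=cortex-a72 -O3 -ffast-math)
    endif()
endif()
```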

I went back and forth on whether the A72 optimizations should be on or off by default. I settled on ON, because optimizing for A72 is probably the Right Thing To Do by default when building for aarch64. Your discretion as to whether you think that's right or not.

For the A72/Pi 4, improvements are (probably) attributable to the addition of -ffast-math, to correct cache-line size definitions (which make a measurable improvement to Eigen matrix multiplication performance), and to better instruction scheduling for the A72 execution units. The resulting binary will run on any ARMv8-A processor, but will be mis-tuned for lesser processors in the family.

The A76/Pi 5 improvements are probably mainly attributable to new instructions on ARMv8.2-A processors that are specifically designed to do fast matrix multiplies. There are lesser improvements from -ffast-math, better instruction scheduling, and correct cache-line size definitions (twice the L1 cache compared to an A72, which would make a big difference when doing a matrix multiply). Binaries will run on any ARMv8.2-A processor, but will be mis-tuned on lesser processors in the family.

The optimizations have been lightly tested in your build, but heavily tested on a forthcoming release of TooB NAM, which switches over to your NeuralAudio library and abandons its previous optimizations. My current TooB NAM dev build uses these compiler flags applied to your NeuralAudio library. I have not explicitly tested your plugin on a Pi 5, but it will get the same performance boost that TooB NAM gets. Out of caution, you may want to verify that the 2x figure is correct if you have the hardware to do so.

I'm not sure how best to document the new options. I THINK Pi 4 and Pi 5 users will respond to "A72" and "A76", and that users of cousins of the Pi family (Orange Pi, Neoverse-based Pi-like things, a variety of other flavors and relatives) will know enough to recognize that the A72 and A76 improvements will also give them benefits on their particular flavor of Pi. (A76s are fairly common in Pi-like things, I think.)

Congratulations on NeuralAudio. Having done that optimization on NAM Core code (fixed-size matrices) for the TooB project, I know from first-hand experience exactly how difficult and precise that work is. And your implementation is MUCH tidier than mine. And it addresses all the pain points of NAM Core integration. It is a really fine piece of work! I am, frankly, in awe of what you have done.

If I can ever return the favor, please let me know. Or if you have development tasks on NeuralAudio you feel like delegating to someone who is already familiar with the fundamentals of the NeuralAudio code, feel free to ping me. Not that I'm short of things to do; but I do feel a duty to contribute if I can do so usefully.

And you really need to know (if you don't) that a 2x performance improvement on the Pi 5 is both astonishing and game changing. It completely changes the way NAM will be used on Pi 5s, where you can now afford to replace ALL of your non-linear effect pedals (overdrive and distortion pedals, compressors, a strange and beautiful NAM emulation of a Fender spring reverb that should not work at all but does, ...), and carelessly use split amp models, e.g. the front half of a vintage DeVille + a full 3-knob Fender tone stack + the back half of a vintage DeVille = a NAM-driven model of a Fender DeVille with full 3-knob tone controls. Or use mixed outputs from two or more amp models... or... or... I'm not sure what life looks like in the x64 world; but on a Pi 5 that really does change EVERYTHING!

And thank you ESPECIALLY for having provided the NeuralAudio library under an MIT license!

Best regards,

Robin Davies
Pipedal
TooB Plugins Project

@rerdavies rerdavies changed the title ARM Cortex-A72/Pi 4, and ARM Cortext-A76/Pi 5 compiler options ARM Cortex-A72/Pi 4, and ARM Cortex-A76/Pi 5 compiler options Aug 5, 2025
@rerdavies
Contributor Author

The -Ofast option and the -O3 -ffast-math options are nearly equivalent (-Ofast additionally enables a few extras, such as -fallow-store-data-races). I prefer the former in my projects; you may prefer the latter, for consistency.

These options are actually redundant for the dep/NeuralAudio/NeuralAudio build, and possibly others as well. I include them from the root just to make sure they get applied everywhere (and because I do so myself, in my own project). The cpu/architecture options, however, are not specified anywhere, and they actually boost the performance of NAM Core and RTNeural as well as NeuralAudio.

Benchmarks are performed on a Raspberry Pi 4. Builds are performed on the branch in my repo that generated the pull request. The "Recommended" build is A72_OPTIMIZATION ON. The "Default" build is A72_OPTIMIZATION OFF, with a build type of RelWithDebInfo, which would use -O2 in parts of the build prior to the pull request.

Benchmark Results (Pi 4)

Recommended options: -mcpu=cortex-a72 -mtune=cortex-a72 -O3 -ffast-math -g -DNDEBUG
Default options: -O3 -ffast-math -g -DNDEBUG

WaveNet (Standard) Test (lower is better)

    Recommended: Internal: 2.04793 (0.000797626)
    Default:     Internal: 2.52373 (0.00217203)

23% better.

LSTM (1x16) Test (higher is better)

    Recommended: Internal: 0.306412 (0.000201907)
    Default:     Internal: 0.235382 (0.000100666)

30% better.

ModelTest Results (Recommended)

Block size: 64
Loading models from: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models"
WaveNet (Standard) Test
Model: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models/BossWN-standard.nam"

NAM vs Internal RMS err: 8.41824e-08
NAM vs RTNeural RMS err: 0.000659912

NAM Core: 2.75893 (0.00313284)
RTNeural: 3.42047 (0.00231592)
Internal: 2.04793 (0.000797626)
RTNeural is: 0.806595x NAM
Internal is: 1.34718x NAM

LSTM (1x16) Test
Model: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models/BossLSTM-1x16.nam"

NAM vs Internal RMS err: 1.51097e-05
NAM vs RTNeural RMS err: 0.0168944

NAM Core: 0.445212 (0.000200091)
RTNeural: 0.429001 (0.000325758)
Internal: 0.306412 (0.000201907)
RTNeural is: 1.03779x NAM
Internal is: 1.45298x NAM


ModelTest Results (Default)

Block size: 64
Loading models from: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models"
WaveNet (Standard) Test
Model: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models/BossWN-standard.nam"

NAM vs Internal RMS err: 8.41824e-08
NAM vs RTNeural RMS err: 0.000659912

NAM Core: 3.17404 (0.001193)
RTNeural: 3.96817 (0.00184733)
Internal: 2.52373 (0.00217203)
RTNeural is: 0.799876x NAM
Internal is: 1.25768x NAM

LSTM (1x16) Test
Model: "~/src/neural-amp-modeler-lv2/build/src/NeuralAudio/Utils/Models/BossLSTM-1x16.nam"

NAM vs Internal RMS err: 1.51097e-05
NAM vs RTNeural RMS err: 0.0168944

NAM Core: 0.333684 (0.000172222)
RTNeural: 0.328912 (8.987e-05)
Internal: 0.235382 (0.000100666)
RTNeural is: 1.01451x NAM
Internal is: 1.41763x NAM


@mikeoliphant
Owner

Interesting. I thought that GCC did NEON optimizations by default on aarch64. I'm sure that I tested the A72-specific optimizations when I switched my RPi4 to a 64-bit OS...

I'll test again. Architecture-specific optimizations make a huge difference, so I definitely want to get it right.

Regarding the NeuralAudio library - I ended up rolling my own because I had a lot of changes I wanted to make that would be impractical to do as a contributor to NeuralAmpModelerCore. It allowed me to make structural changes I've been wanting to make for a while, as well as continuing to work on optimization.

It is pretty fiddly stuff. Working with template classes can be tricky, and was new to me. And interactions with Eigen have to be handled very carefully to avoid unexpected performance problems. I think I've got it in pretty good shape now, but I'm sure there is still potential work to be done.

I definitely understand how important this kind of optimization is for lower-powered devices. When I started running NAM on RPi4 back in the early days, I could only run a single feather model...

@mikeoliphant
Owner

Btw, I've got a discussion open on the NeuralAudio repo dealing with benchmarks:

mikeoliphant/NeuralAudio#9

I don't have an RPi5 to test with, but another user posted some numbers there, and could maybe be persuaded to try with the A76 optimizations.

@mikeoliphant
Owner

Hmm - I'm getting Internal: 1.98197 on my RPi4 with the default build.

I tried with the A72 options, and got Internal: 1.99201.

Both are in line with what you are seeing with your optimized build. I guess the question is, why are you seeing a worse result using the default build?

@rerdavies
Contributor Author

I think you have to do a clean rebuild to get the option changes to take effect. That was my experience. And it seems to be the same for other options you are already using: the NAM and RTNeural and Tool build options behave the same way. They don't take effect for me until I do a clean rebuild.

Bug in CMake? "Feature" in CMake? Or a fundamental misunderstanding about what "option" is and how it's supposed to work? I'm not sure.
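For what it's worth, this is documented CMake behavior rather than a bug: option() (and cache variables generally) record their value in build/CMakeCache.txt on the first configure, and later changes to the default in CMakeLists.txt are ignored while a cached value exists. A sketch of how to pick up a changed value without a full clean rebuild (the build directory name is an assumption):

```shell
# Drop the single cached entry, then reconfigure; -U accepts glob patterns.
cmake -U A72_OPTIMIZATION -B build .

# Or remove the whole cache (heavier, but guaranteed to pick up new defaults):
rm -f build/CMakeCache.txt
cmake -B build .
```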

@mikeoliphant
Owner

I just added the flags directly (ie: not through a CMake option), so I'm pretty sure they took effect. I'll double-check with a completely clean CMake, though.

@mikeoliphant
Owner

Same result.

I also added -DCMAKE_EXPORT_COMPILE_COMMANDS=ON to verify that the correct compiler flags are being used.

@rerdavies
Contributor Author

rerdavies commented Aug 5, 2025 via email

@mikeoliphant
Owner

I built from a completely cleaned out CMake, so there definitely isn't any caching.

I'm building directly on RPi4 - are you? If you want to put a build somewhere I can grab it, I'll be happy to test.

Given that I'm seeing the same performance as your optimized build in ModelTest (~2 secs), it seems more likely to me at this point that something is off with your default build, since it is significantly slower (~2.5 secs vs ~2 secs) than my default build.

@rerdavies
Contributor Author

rerdavies commented Aug 5, 2025 via email

@rerdavies
Contributor Author

rerdavies commented Aug 5, 2025 via email

@rerdavies
Contributor Author

rerdavies commented Aug 6, 2025

Ok. I see a potential issue:

Repo: rerdavies/neural-amp-modeler-lv2
Branch: toobAmpFork
(should be the same as your pull request).

I am currently running:

# cd to the root of the project....

## Reset the project
echo  build/ > .gitignore   # :-/ Move .gitignore to root to avoid an unpleasant surprise.
rm -rf build

## Configure
mkdir build 
cd build 
cmake .. -D CMAKE_BUILD_TYPE=Release -D A72_OPTIMIZATION=OFF -D BUILD_STATIC_RTNEURAL=ON -D BUILD_NAMCORE=ON -D BUILD_UTILS=ON 
cd ..

## Build.
cmake --build ./build --config Release --target all 

Or something similar that avoids removing the .gitignore file in the build directory, which has alarming consequences.

The build has started, and I will have results for you in about 15 minutes... -ish.

The stumbling block might be that, with the pull request as given, A72_OPTIMIZATION defaults to ON (on the theory that if you are building on aarch64, that's the one you want by default).

As a result, you must turn the optimization off to get "default" results:

-D A72_OPTIMIZATION=OFF

I know. That's a little bit weird. If you would prefer me to change it, I can certainly make it so.
While my build is running, perhaps you might want to get your build running in parallel, and we can compare notes in... 14 minutes.

Maybe something like this, instead.

option(AARCH64_CPU "cortex-a72 for Pi 4, cortex-a76 for Pi 5 (aarch64 build only)" "cortex-a72")

& following adjustments.

@rerdavies
Contributor Author

Or something like this (final answer; I need to check that single quotes actually do work):

option(AARCH_CPU '"cortex-a72" for Pi 4, "cortex-a76" for Pi 5, "default" for compiler default' "cortex-a72")
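One wrinkle worth noting: CMake's option() command only handles ON/OFF booleans, so a string-valued choice like this would normally be declared as a cache variable instead. A sketch (the variable name follows the suggestion above; the flag wiring is an assumption):

```cmake
# option() is boolean-only; a multi-valued choice uses a STRING cache variable.
set(AARCH64_CPU "cortex-a72" CACHE STRING
    "Target CPU for aarch64 builds: cortex-a72 (Pi 4), cortex-a76 (Pi 5), or default")
# Offer the valid values as a drop-down in cmake-gui / ccmake.
set_property(CACHE AARCH64_CPU PROPERTY STRINGS cortex-a72 cortex-a76 default)

if(NOT AARCH64_CPU STREQUAL "default")
    add_compile_options(-mcpu=${AARCH64_CPU} -mtune=${AARCH64_CPU})
endif()
```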

@rerdavies
Contributor Author

Pull request withdrawn. So sorry to have wasted your time. :-(

I think the A76 results are still valid: that there is a 2x performance boost when compiling for the A76. I'll contact you again when I can confirm the results. If correct (I think they are), you may want to think about providing an A76 distro. I do have a Pi 5 on the way. (And the results WOULD be completely game changing.)

The issue: I had the CPU governor set to its default value (I'm not actually sure what that is, tbh). CPU frequencies were bouncing around, with two CPUs running at 1.3 GHz and two CPUs running at 1.8 GHz, giving me inconsistent times between runs: anything between 1.8 and 2.53 seconds for the same executable. With the CPU governor set to performance, I get rock-solid times of about 1.72, and binaries with and without optimizations give pretty much identical results.
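For anyone reproducing this: the governor can be pinned through the standard Linux cpufreq sysfs interface (the paths below are the standard kernel ones; requires root, and settings revert on reboot):

```shell
# Pin every core to the performance governor for stable benchmark timings.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g" > /dev/null
done

# Watch per-core clocks while the benchmark runs; they should sit at max.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
```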

I did have trouble getting "consistent builds". I was building once, running the test once, and (I guess) repeating that process until I got the results I expected. Pure hallucination, combined with selection bias.

Consecutive runs of the SAME executable (A72_OPTIMIZATION=ON, in this case), with the governor at default settings. Values range between 1.81 and 2.53.

Internal: 2.07215 (0.000760034)
Internal: 2.48385 (0.00224075)
Internal: 2.07011 (0.00183727)
Internal: 2.02839 (0.000862089)
Internal: 2.48445 (0.00183694)
Internal: 2.53287 (0.00159053)
Internal: 2.29668 (0.00162985)
Internal: 1.81224 (0.00177773)
Internal: 2.51364 (0.00211882)
Internal: 2.53037 (0.00103396)
Internal: 2.30048 (0.000942478)
Internal: 2.34258 (0.00105313)

@rerdavies rerdavies closed this Aug 6, 2025
@mikeoliphant
Owner

👍

Let me know what you see with the A76 optimizations once you get your Pi5.
