ARM Cortex-A72/Pi 4, and ARM Cortex-A76/Pi 5 compiler options #90
Conversation
Benchmarks are performed on a Raspberry Pi 4. Builds are performed on the branch in my repo that generated the pull request. The "Recommended" build uses the following options:
-mcpu=cortex-a72 -mtune=cortex-a72 -O3 -ffast-math -g -DNDEBUG
Benchmark Results (Pi 4):
WaveNet (Standard) Test (lower is better): 23% better.
LSTM (1x16) Test (higher is better): 30% better.
[ModelTest results for the Recommended and default builds were attached as images in the original comment.]
|
Interesting. I thought that GCC did NEON optimizations by default on aarch64. I'm sure that I tested the A72-specific optimizations when I switched my RPi4 to a 64-bit OS... I'll test again. Architecture-specific optimizations make a huge difference, so I definitely want to get it right. Regarding the NeuralAudio library - I ended up rolling my own because I had a lot of changes I wanted to make that would be impractical to do as a contributor to NeuralAmpModelerCore. It allowed me to make structural changes I've been wanting to make for a while, as well as continuing to work on optimization. It is pretty fiddly stuff. Working with template classes can be tricky, and was new to me. And interactions with Eigen have to be handled very carefully to avoid unexpected performance problems. I think I've got it in pretty good shape now, but I'm sure there is still potential work to be done. I definitely understand how important this kind of optimization is for lower-powered devices. When I started running NAM on RPi4 back in the early days, I could only run a single feather model... |
Btw, I've got a discussion open on the NeuralAudio repo dealing with benchmarks: I don't have an RPi5 to test with, but another user posted some numbers there, and could maybe be persuaded to try with the A76 optimizations. |
Hmm - I'm getting [timings omitted]. I tried with the A72 options, and got [timings omitted]. Both are in line with what you are seeing with your optimized build. I guess the question is: why are you seeing a worse result using the default build? |
I think you have to do a clean rebuild to get the option changes to take effect. That was my experience. And it seems to be the same for other options you are already using: the NAM and RTNeural and Tool build options behave the same way. They don't take effect for me until I do a clean rebuild. Bug in CMake? "Feature" in CMake? Or a fundamental misunderstanding about what "option" is and how it's supposed to work? I'm not sure. |
I just added the flags directly (i.e., not through a CMake option), so I'm pretty sure they took effect. I'll double-check with a completely clean CMake, though. |
Same result. I also added -DCMAKE_EXPORT_COMPILE_COMMANDS=ON to verify that the correct compiler flags are being used. |
Ha! I was using the "introduce an arbitrary syntax error in a source file" feature to get CMake to cough up the command-line options.
I, too, struggled to get options applied properly. And I had exactly the same problems when trying to get the BUILD_NAMCORE, BUILD_STATIC_RTNEURAL and BUILD_TOOLS options to stick.
I think (in retrospect) the key is that these are all options with the CACHE keyword - and are therefore stored in build/CMakeCache.txt when the build is configured, and ignored on the command line when running an actual `--build`.
So, configure first, by running CMake without the `--build` option, with all the options specified on the command line. Then run the actual build, with the `--build` option, without (?) the configuration defines. During the actual build, CMake takes the definitions that were supplied while configuring. I have seen documentation that says that on Windows builds, the build configuration is also treated this way (that the Debug/Release/RelWithDebInfo settings on the build command line are ignored when doing the build, and that the actual settings are taken from the configuration step). Although I have never done a Windows build with CMake. You might know better.
That being said, at the time, what seemed to work for me was to run a clean rebuild, and then take a look inside build/CMakeCache.txt to confirm that the right options were being used. And there might have been something in there about a VS Code command: "CMake: rebuild cache". :-/ And there were quite a few builds where I *thought* I had applied build configuration changes, but there was no performance change, going in both directions.
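Concretely, the flow that seems to work reliably looks something like this (a sketch - the build directory, build type, and option values here are just illustrative):

```sh
# 1) Configure: this is where -D values get written into build/CMakeCache.txt.
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=ON

# 2) Build: the cached values from the configure step are used; -D options
#    passed at this point are not picked up.
cmake --build build

# To be safe when changing an option, wipe the cached value first, then
# confirm what actually got cached before building again:
rm -f build/CMakeCache.txt
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=OFF
grep A72_OPTIMIZATION build/CMakeCache.txt
cmake --build build
```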
|
I built from a completely cleaned-out CMake, so there definitely isn't any caching. I'm building directly on RPi4 - are you? If you want to put a build somewhere I can grab it, I'll be happy to test. Given that I'm seeing the same performance as your optimized build in ModelTest (~2 secs), it seems more likely to me at this point that something is off with your default build, since it is significantly slower (~2.5 secs vs ~2 secs) than my default build. |
GCC does do NEON optimizations by default. All Cortex 8.1a processors
implement Neon. And all Cortex 8.2a processors implement Neon plus the
matrix multiplication extensions. And GCC does absolutely amazing
vectorization when running on any one of them.
The benefits for architecture optimizations fall into two categories. The
first is that picking the machine architecture makes GCC use and report the
actual and correct sizes of the L1 and L2 cache. Cache line sizes are
accessible through the standard C++ library. Probably because they are
critically important when doing matrix multiplies! The default setting of
the cache sizes is much lower than it should be; specifying A72
doubles the reported L1 cache size, and specifying A76
doubles it again. Something to do with performing tiled matrix multiplies.
One half of the inputs to each tile needs to sit in L1 cache so it doesn't
need to be read from L2 again when moving to the next tile. Eigen uses the
compiler's value for L1 Cache Size to calculate what tile size to use, and
does it at compile time.
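If you want to see what the compiler is actually assuming, something like this works (a sketch; whether a given -mcpu changes the reported values depends on your GCC version):

```sh
# Cache-related tuning parameters GCC will use for a given CPU model:
gcc -Q --help=params -mcpu=cortex-a72 | grep -i cache

# What the OS reports for the real hardware, for comparison:
getconf LEVEL1_DCACHE_SIZE
getconf LEVEL1_DCACHE_LINESIZE
```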
The other is instruction scheduling. Modern compilers have very detailed
models of the execution units of the processors they are generating code
for, which allows them to do quite extraordinary things like shuffle
instructions in order to avoid... oh, I dunno, a stall seven stages deep in
the execution pipeline because only two write operations and one read
operation can be outstanding at any given time. Or to prevent a stall while
trying to rename a register at the 4th stage in the pipeline. That sort of
detail. It's pretty amazing. They are not guessing when they do that. They
are actually using highly detailed models of the execution units of
hundreds of different processors to predict and avoid stalls that are
occurring deep within specific execution units. And they are not broadly
generalizing. There are execution models for each of a few dozen ARM
processors. And models for the generations and variants of a very large
number of Intel CPUs. It is quite amazing to see the source code for that. Both
GCC and Clang do it. I assume MSVC does too. I would assume that ARM and
Intel do the hard work, pushing new CPU models into GCC and Clang when they
are about to be released.
The problem with not specifying an architecture and a CPU is that you end
up using a model of the least capable common ancestor: the worst execution
pipeline possible, with the smallest L1 cache you could ever encounter
(which turns out to be critical).
Across an ARM generation, there can be an extraordinary amount of variation
in configuration of the execution units. One processor may be able to do
parallel decoding of 3 simultaneous instructions, another 5. One may have 2
integer ALUs, another 4, and so forth all the way down the 11-ish stages of
an ARM execution pipeline. So specifying that you don't want to use the
execution model of the oldest or least capable member of the Arm 8.1a
processor family matters. Mercifully, each of the dozen or so aarch64 ARM
CPUs comes in only one flavor, and always has the same L1 cache size. Not
so in Intel world where there are hundreds of processor variations in each
generation, each of which has an L1 and L2 cache size tailored to match the
price tag.
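If you're curious how much the CPU model really changes code generation, a crude check is to compile the same hot file both ways and diff the assembly (a sketch; kernel.cpp is just a placeholder for any source file from the project):

```sh
# Compile with and without the CPU model and compare the generated assembly.
g++ -O3 -ffast-math -S -o generic.s kernel.cpp
g++ -O3 -ffast-math -mcpu=cortex-a72 -mtune=cortex-a72 -S -o a72.s kernel.cpp
diff generic.s a72.s | wc -l   # rough count of lines that differ
```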
My strong impression when doing detailed profiling work on my version of
the fixed-size-matrix optimization was that the matrix multiplies
(which are about 80% of the work) are completely memory bound, and the
entire cost of the computation is the cost of waiting for memory transfers
to complete, with all of the compute being done in parallel while waiting
for the memory units to become available. (And another 15% spent executing
the vectorized tanh function). So it is not impossible that ALL of the
performance improvement comes just from using the right L1 cache size.
(Twice the tile size, so 25% fewer L1 cache misses. That's a
pretty plausible theory).
The other possibility is that some of the improvement comes about because
the compiler has found a way to more efficiently schedule instructions that
make better use of features in the memory controller section of the A72
pipeline. Given that I'm virtually certain that the matrix multiply is
memory bound, based on profiling results, I can imagine that's a
possibility.
And it's kind of fun to speculate, because for our purposes the only thing
that really matters is how long it takes for the 7 or 8 instructions in a
4x4 matrix multiply to execute (and how often they hit L1 cache while doing
so). With possible scheduling variations as the compiler generates code for
the hard-coded row and column sizes that you have thoughtfully given it.
But we mortals can only speculate and do our best to make sense of the
results we actually see.
I have done A/B testing on cache line size before. It does provide a real
and substantial speed-up if you get it right.
If I HAD to guess, I would guess that L1-Cache size is the most important
thing in the Pi 4 case. With better instruction scheduling MAYBE
contributing another 5% or so. But that's very much a guess. If I HAD to
guess, I would guess that the 2x A76 improvement comes about because there
are burst transfers of large blocks of memory going on as part of the
implementation of the new matrix multiply instructions.
But that is somewhat clouded by the fact that your code runs faster than
mine, even though we are using the same trick/brilliant idea to make it
happen. Which I... ahem... didn't truthfully think was possible. So maybe
not completely memory bound.
|
Odd. I've run the non-architecture build several times. And generated non-performant versions of the default build at least twice. Maybe three times. Most recently to generate the benchmark report.
Check the build/CMakeCache.txt to make sure you have the correct options applied, maybe. Not sure if CMakeCache.txt is the place to SET options or not. But it definitely includes the most-recently-applied options.
And what is your version of the reference build? Your mainline build, or my pull request with the A72 optimizations turned off? Or the NeuralModel build? I will admit that I did hack the neural-amp-modeler-lv2 build a bit in order to convince dep/NeuralAudio to set the BUILD_NAMCORE and BUILD_STATIC_RTNEURAL and BUILD_TESTS options. I can try doing that with a proper command line if you like.
The -Ofast vs. -O3 -ffast-math issue is actually moot. You do seem to have -Ofast in the NeuralModel build under …
|
Ok. I see a potential issue.
Repo: rerdavies/neural-amp-modeler-v2
I am currently running: [command omitted]
Or something similar that avoids removing the .gitignore file in the build directory, which has alarming consequences. The build is started, and I will have results for you in about 15 minutes... -ish.
The stumbling block might be that, with the pull request as given, A72_OPTIMIZATION defaults to ON (on the theory that if you are building on aarch64, that's the one you want by default). As a result, you must turn the optimization off to get "default" results: [command omitted]
I know. That's a little bit weird. If you would prefer me to change it, I can certainly make it so. Maybe something like this, instead: [command omitted]
|
Or something like this (final answer; I need to check that single quotes actually work): [command omitted]
|
Pull request withdrawn. So sorry to have wasted your time. :-(
I think the A76 results are still valid - that there is a 2x performance boost when compiling for A76. I'll contact you again when I can confirm the results. If correct (I think they are), you may want to think about providing an A76 distro. I do have a Pi 5 on the way. (And the results WOULD be completely game changing.)
The issue: I had the CPU governor set to the default value (I'm not actually sure what that is, tbh). CPU frequencies were bouncing around, with two CPUs running at 1.3GHz and two CPUs running at 1.8GHz, giving me inconsistent times between runs - anything between 1.8 and 2.53 seconds for the same executable. With the CPU governor set to performance, I get rock-solid times of about 1.72 seconds, and binaries with and without optimizations give pretty much identical results.
I did have trouble getting "consistent builds". I was building once, running the test once, and (I guess) repeating that process until I got the results I expected. Pure hallucination, combined with selection bias. Consecutive runs of the SAME executable (A72_OPTIMIZATION=ON, in this case), with the governor at default settings, gave values between 1.81 and 2.53 seconds.
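For anyone reproducing these numbers, pinning the governor looks something like this (assuming the standard Linux cpufreq sysfs interface on Raspberry Pi OS):

```sh
# Check the current governor on each core:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Pin all cores to the performance governor for stable benchmark timings:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```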
|
👍 Let me know what you see with the A76 optimizations once you get your Pi 5. |
Provides better compiler optimization flags for ARM Cortex-A72/Pi 4, and ARM Cortex-A76/Pi 5.
The A72 optimizations provide about a 15% performance improvement on a Pi 4 compared to your current build. The A76 optimizations are more interesting. As reported by a remote colleague (I don't have a Pi 5), the optimizations allow 16 simultaneous instances to run on a Pi 5 (PiPedal hosted), instead of 8 instances. (No buffering, 128x3 unfortunately, but no reason to think that it won't also work, or even work slightly better, at 64x3.) Informally, a 2x speed improvement compared to your current build with the current optimization flags (which are currently CMake defaults on aarch64). Dramatic and very significant!
Optimizations are conditioned on two new CMake options in the root CMakeLists.txt.
Both can be provided on the cmake command line as usual, e.g.:
cmake ... -DA72_OPTIMIZATION=ON
Compiler flags have to be applied in the root CMakeLists.txt file so that all of your deps build with the correct flags. I went back and forth on whether the A72 optimizations should be on or off by default. I settled on ON, because optimizing for A72 is probably the Right Thing To Do by default when building for aarch64. Your discretion as to whether you think that's right or not.
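In practice that means something like this (a sketch; the build type and build directory are just examples):

```sh
# On aarch64, a plain configure picks up the A72 tuning (the option defaults to ON):
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Opt out explicitly to get the previous, generic aarch64 flags:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=OFF
cmake --build build
```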
For the A72/Pi 4, the improvements are (probably) attributable to the addition of -ffast-math, to the use of correct cache-line size definitions (which make a measurable improvement to Eigen matrix multiplication performance), and to better instruction scheduling for the A72 execution units. The resulting binary will run on any ARM Cortex 8.1a processor, but will be mis-tuned for lesser Cortex 8.1a processors.
The A76/Pi 5 improvements are probably mainly attributable to brand-new instructions on ARM Cortex 8.2a processors that are specifically designed to do fast matrix multiplies. There are lesser improvements from -ffast-math, better instruction scheduling, and correct cache-line size definitions (twice the L1 cache compared to an A72, which would make a big difference when doing a matrix multiply). Binaries will run on any ARM Cortex 8.2a processor, but will be mis-tuned on lesser processors in the family.
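One way to see the difference in what the compiler assumes for the two targets (illustrative; requires a GCC new enough to know both CPU names):

```sh
# Dump the ACLE feature macros predefined for each CPU model, then compare:
gcc -mcpu=cortex-a72 -dM -E - </dev/null | grep __ARM_FEATURE | sort > a72-features.txt
gcc -mcpu=cortex-a76 -dM -E - </dev/null | grep __ARM_FEATURE | sort > a76-features.txt
diff a72-features.txt a76-features.txt
```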
The optimizations have been lightly tested in your build, but heavily tested on a forthcoming release of TooB NAM, which switches over to your NeuralAudio library and abandons its previous optimizations. My current TooB NAM dev build uses these compiler flags applied to your NeuralAudio library. I have not explicitly tested your plugin on a Pi 5, but it will get the same performance boost that TooB NAM gets. Out of caution, you may want to verify that the 2x figure is correct if you have the hardware to do so.
I'm not sure how best to document the new options. I THINK Pi 4 and Pi 5 users will respond to "A72" and "A76", and that users of cousins of the Pi family (Orange Pi, Neoverse-based Pi-like things, a variety of other flavors and relatives) will know enough to recognize that the A72 and A76 improvements will also benefit their particular flavor of Pi. (A76s are fairly common in Pi-like things, I think.)
Congratulations on NeuralAudio. Having done that optimization on NAM Core code (fixed-size matrices) for the TooB project, I know exactly how difficult and precise that work is, from first-hand experience. And your implementation is MUCH tidier than mine. And it addresses all the pain points of NAM Core integration. It is a really fine piece of work! I am, frankly, in awe of what you have done.
If I can ever return the favor, please let me know. Or if you have development tasks on NeuralAudio you feel like delegating to someone who is already familiar with the fundamentals of the NeuralAudio code, feel free to ping me. It's not like I'm short of things to do; but I do feel a duty to contribute if I can do so usefully.
And you really need to know (if you don't) that a 2x performance improvement on the Pi 5 is both astonishing and game changing. It completely changes the way NAM will be used on Pi 5s, where you can now afford to replace ALL of your non-linear effect pedals (overdrive and distortion pedals, compressors, a strange and beautiful NAM emulation of a Fender spring reverb that should not work at all but does, ...), and carelessly use split amp models: e.g. the front half of a vintage DeVille + a full 3-knob Fender tone stack + the back half of a vintage DeVille = a NAM-driven model of a Fender DeVille with full 3-knob tone controls. Or use mixed outputs from two or more amp models... or... or... I'm not sure what life looks like in the x64 world; but on a Pi 5 that really does change EVERYTHING!
And thank you ESPECIALLY for having provided the NeuralAudio library under an MIT license!
Best regards,
Robin Davies
PiPedal
TooB Plugins Project