ARM Cortex-A72/Pi 4, and ARM Cortex-A76/Pi 5 compiler options #90
Conversation
Benchmarks are performed on a Raspberry Pi 4. Builds are performed on the branch in my repo that generated the pull request. The "Recommended" build uses the following options:
-mcpu=cortex-a72 -mtune=cortex-a72 -O3 -ffast-math -g -DNDEBUG
Benchmark Results (Pi 4):
WaveNet (Standard) Test (lower is better): 23% better.
LSTM (1x16) Test (higher is better): 30% better.
[ModelTest results for the Recommended and default builds were attached as images in the original comment.]
|
Interesting. I thought that GCC did NEON optimizations by default on aarch64. I'm sure that I tested the A72-specific optimizations when I switched my RPi4 to a 64-bit OS... I'll test again. Architecture-specific optimizations make a huge difference, so I definitely want to get it right. Regarding the NeuralAudio library - I ended up rolling my own because I had a lot of changes I wanted to make that would be impractical to do as a contributor to NeuralAmpModelerCore. It allowed me to make structural changes I've been wanting to make for a while, as well as continuing to work on optimization. It is pretty fiddly stuff. Working with template classes can be tricky, and was new to me. And interactions with Eigen have to be handled very carefully to avoid unexpected performance problems. I think I've got it in pretty good shape now, but I'm sure there is still potential work to be done. I definitely understand how important this kind of optimization is for lower-powered devices. When I started running NAM on RPi4 back in the early days, I could only run a single feather model... |
Btw, I've got a discussion open on the NeuralAudio repo dealing with benchmarks: I don't have an RPi5 to test with, but another user posted some numbers there, and could maybe be persuaded to try with the A76 optimizations. |
Hmm - I'm getting [timings omitted]. I tried with the A72 options, and got [timings omitted]. Both are in line with what you are seeing with your optimized build. I guess the question is: why are you seeing a worse result using the default build? |
I think you have to do a clean rebuild to get the option changes to take effect. That was my experience. And it seems to be the same for other options you are already using: the NAM and RTNeural and Tool build options behave the same way. They don't take effect for me until I do a clean rebuild. Bug in CMake? "Feature" in CMake? Or a fundamental misunderstanding about what "option" is and how it's supposed to work? I'm not sure. |
I just added the flags directly (i.e., not through a CMake option), so I'm pretty sure they took effect. I'll double-check with a completely clean CMake, though. |
Same result. I also added -DCMAKE_EXPORT_COMPILE_COMMANDS=ON to verify that the correct compiler flags are being used. |
Ha! I was using the "introduce an arbitrary syntax error in a source file" feature to get CMake to cough up the command-line options.
I, too, struggled to get options applied properly. And I had exactly the same problems when trying to get the BUILD_NAMCORE, BUILD_STATIC_RTNEURAL and BUILD_TOOLS options to stick.
I think (in retrospect) the key is that these are all options with the CACHE keyword - and are therefore stored in build/CMakeCache.txt when the build is configured, and ignored on the command line when running an actual `--build`.
So, configure first, by running CMake without the `--build` option, with all the options specified on the command line. Then run the actual build, with the `--build` option, without (?) the configuration defines. During the actual build, CMake takes the definitions that were supplied while configuring. I have seen documentation that says that on Windows builds, the build configuration is also treated this way (that the Debug/Release/RelWithDebInfo settings on the build command line are ignored when doing the build, and that the actual settings are taken from the configuration step). Although I have never done a Windows build with CMake. You might know better.
That being said, at the time, what seemed to work for me was to run a clean rebuild, and then take a look inside build/CMakeCache.txt to confirm that the right options were being used. And there might have been something in there about a VS Code command: "CMake: rebuild cache". :-/ And there were quite a few builds where I *thought* I had applied build configuration changes, but there was no performance change, going in both directions.
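Concretely, the flow that seems to work reliably looks something like this (a sketch - the build directory, build type, and option values here are just illustrative):

```sh
# 1) Configure: this is where -D values get written into build/CMakeCache.txt.
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=ON

# 2) Build: the cached values from the configure step are used; -D options
#    passed at this point are not picked up.
cmake --build build

# To be safe when changing an option, wipe the cached value first, then
# confirm what actually got cached before building again:
rm -f build/CMakeCache.txt
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=OFF
grep A72_OPTIMIZATION build/CMakeCache.txt
cmake --build build
```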
|
I built from a completely cleaned-out CMake, so there definitely isn't any caching. I'm building directly on RPi4 - are you? If you want to put a build somewhere I can grab it, I'll be happy to test. Given that I'm seeing the same performance as your optimized build in ModelTest (~2 secs), it seems more likely to me at this point that something is off with your default build, since it is significantly slower (~2.5 secs vs ~2 secs) than my default build. |
GCC does do NEON optimizations by default. All Cortex 8.1a processors
implement Neon. And all Cortex 8.2a processors implement Neon plus the
matrix multiplication extensions. And GCC does absolutely amazing
vectorization when running on any one of them.
The benefits for architecture optimizations fall into two categories. The
first is that picking the machine architecture makes GCC use and report the
actual and correct sizes of the L1 and L2 cache. Cache line sizes are
accessible through the standard C++ library. Probably because they are
critically important when doing matrix multiplies! The default setting of
the cache sizes is much lower than it should be; specifying A72
doubles the reported L1 cache size, and specifying A76
doubles it again. Something to do with performing tiled matrix multiplies.
One half of the inputs to each tile needs to sit in L1 cache so it doesn't
need to be read from L2 again when moving to the next tile. Eigen uses the
compiler's value for L1 Cache Size to calculate what tile size to use, and
does it at compile time.
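If you want to see what the compiler is actually assuming, something like this works (a sketch; whether a given -mcpu changes the reported values depends on your GCC version):

```sh
# Cache-related tuning parameters GCC will use for a given CPU model:
gcc -Q --help=params -mcpu=cortex-a72 | grep -i cache

# What the OS reports for the real hardware, for comparison:
getconf LEVEL1_DCACHE_SIZE
getconf LEVEL1_DCACHE_LINESIZE
```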
The other is instruction scheduling. Modern compilers have very detailed
models of the execution units of the processors they are generating code
for, which allows them to do quite extraordinary things like shuffle
instructions in order to avoid... oh, I dunno, a stall seven stages deep in
the execution pipeline because only two write operations and one read
operation can be outstanding at any given time. Or to prevent a stall while
trying to rename a register at the 4th stage in the pipeline. That sort of
detail. It's pretty amazing. They are not guessing when they do that. They
are actually using highly detailed models of the execution units of
hundreds of different processors to predict and avoid stalls that are
occurring deep within specific execution units. And they are not broadly
generalizing. There are execution models for each of a few dozen ARM
processors. And models for the generations and variants of a very large
number of Intel CPUs. It is quite amazing to see the source code for that. Both
GCC and Clang do it. I assume MSVC does too. I would assume that ARM and
Intel do the hard work, pushing new CPU models into GCC and Clang when they
are about to be released.
The problem with not specifying an architecture and a CPU is that you end
up using a model of the least capable common ancestor: the worst execution
pipeline possible, with the smallest L1 cache you could ever encounter
(which turns out to be critical).
Across an ARM generation, there can be an extraordinary amount of variation
in configuration of the execution units. One processor may be able to do
parallel decoding of 3 simultaneous instructions, another 5. One may have 2
integer ALUs, another 4, and so forth all the way down the 11-ish stages of
an ARM execution pipeline. So specifying that you don't want to use the
execution model of the oldest or least capable member of the Arm 8.1a
processor family matters. Mercifully, each of the dozen or so aarch64 ARM
CPUs comes in only one flavor, and always has the same L1 cache size. Not
so in Intel world where there are hundreds of processor variations in each
generation, each of which has an L1 and L2 cache size tailored to match the
price tag.
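If you're curious how much the CPU model really changes code generation, a crude check is to compile the same hot file both ways and diff the assembly (a sketch; kernel.cpp is just a placeholder for any source file from the project):

```sh
# Compile with and without the CPU model and compare the generated assembly.
g++ -O3 -ffast-math -S -o generic.s kernel.cpp
g++ -O3 -ffast-math -mcpu=cortex-a72 -mtune=cortex-a72 -S -o a72.s kernel.cpp
diff generic.s a72.s | wc -l   # rough count of lines that differ
```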
My strong impression when doing detailed profiling work on my version of
the fixed-size-matrix optimization was that the matrix multiplies
(which are about 80% of the work) are completely memory bound, and the
entire cost of the computation is the cost of waiting for memory transfers
to complete, with all of the compute being done in parallel while waiting
for the memory units to become available. (And another 15% spent executing
the vectorized tanh function). So it is not impossible that ALL of the
performance improvement comes just from using the right L1 cache size.
(Twice the tile size, so 25% fewer L1 cache misses. That's a
pretty plausible theory).
The other possibility is that some of the improvement comes about because
the compiler has found a way to more efficiently schedule instructions that
make better use of features in the memory controller section of the A72
pipeline. Given that I'm virtually certain that the matrix multiply is
memory bound, based on profiling results, I can imagine that's a
possibility.
And it's kind of fun to speculate, because for our purposes the only thing
that really matters is how long it takes for the 7 or 8 instructions in a
4x4 matrix multiply to execute (and how often they hit L1 cache while doing
so). With possible scheduling variations as the compiler generates code for
the hard-coded row and column sizes that you have thoughtfully given it.
But we mortals can only speculate and do our best to make sense of the
results we actually see.
I have done A/B testing on cache line size before. It does provide a real
and substantial speed-up if you get it right.
If I HAD to guess, I would guess that L1-Cache size is the most important
thing in the Pi 4 case. With better instruction scheduling MAYBE
contributing another 5% or so. But that's very much a guess. If I HAD to
guess, I would guess that the 2x A76 improvement comes about because there
are burst transfers of large blocks of memory going on as part of the
implementation of the new matrix multiply instructions.
But that is somewhat clouded by the fact that your code runs faster than
mine, even though we are using the same trick/brilliant idea to make it
happen. Which I... ahem... didn't truthfully think was possible. So maybe
not completely memory bound.
|
Odd. I've run the non-architecture build several times. And generated non-performant versions of the default build at least twice. Maybe three times. Most recently to generate the benchmark report.
Check the build/CMakeCache.txt to make sure you have the correct options applied, maybe. Not sure if CMakeCache.txt is the place to SET options or not. But it definitely includes the most-recently-applied options.
And what is your version of the reference build? Your mainline build, or my pull request with the A72 optimizations turned off? Or the NeuralModel build? I will admit that I did hack the neural-amp-modeler-lv2 build a bit in order to convince dep/NeuralAudio to set the BUILD_NAMCORE and BUILD_STATIC_RTNEURAL and BUILD_TESTS options. I can try doing that with a proper command line if you like.
The -Ofast vs. -O3 -ffast-math issue is actually moot. You do seem to have -Ofast in the NeuralModel build under …
|
Ok. I see a potential issue.
Repo: rerdavies/neural-amp-modeler-v2
I am currently running: [command omitted]
Or something similar that avoids removing the .gitignore file in the build directory, which has alarming consequences. The build is started, and I will have results for you in about 15 minutes... -ish.
The stumbling block might be that, with the pull request as given, A72_OPTIMIZATION defaults to ON (on the theory that if you are building on aarch64, that's the one you want by default). As a result, you must turn the optimization off to get "default" results: [command omitted]
I know. That's a little bit weird. If you would prefer me to change it, I can certainly make it so. Maybe something like this, instead: [command omitted]
|
Or something like this (final answer; I need to check that single quotes actually work): [command omitted]
|
Pull request withdrawn. So sorry to have wasted your time. :-(
I think the A76 results are still valid - that there is a 2x performance boost when compiling for A76. I'll contact you again when I can confirm the results. If correct (I think they are), you may want to think about providing an A76 distro. I do have a Pi 5 on the way. (And the results WOULD be completely game changing.)
The issue: I had the CPU governor set to the default value (I'm not actually sure what that is, tbh). CPU frequencies were bouncing around, with two CPUs running at 1.3GHz and two CPUs running at 1.8GHz, giving me inconsistent times between runs - anything between 1.8 and 2.53 seconds for the same executable. With the CPU governor set to performance, I get rock-solid times of about 1.72 seconds, and binaries with and without optimizations give pretty much identical results.
I did have trouble getting "consistent builds". I was building once, running the test once, and (I guess) repeating that process until I got the results I expected. Pure hallucination, combined with selection bias. Consecutive runs of the SAME executable (A72_OPTIMIZATION=ON, in this case), with the governor at default settings, gave values between 1.81 and 2.53 seconds.
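For anyone reproducing these numbers, pinning the governor looks something like this (assuming the standard Linux cpufreq sysfs interface on Raspberry Pi OS):

```sh
# Check the current governor on each core:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Pin all cores to the performance governor for stable benchmark timings:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```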
|
👍 Let me know what you see with the A76 optimizations once you get your Pi 5. |
Provides better compiler optimization flags for ARM Cortex-A72/Pi 4, and ARM Cortex-A76/Pi 5.
The A72 optimizations provide about a 15% performance improvement on a Pi 4 compared to your current build. The A76 optimizations are more interesting. As reported by a remote colleague (I don't have a Pi 5), the optimizations allow 16 simultaneous instances to run on a Pi 5 (PiPedal hosted), instead of 8 instances. (No buffering, 128x3 unfortunately, but no reason to think that it won't also work, or even work slightly better, at 64x3.) Informally, a 2x speed improvement compared to your current build with the current optimization flags (which are currently CMake defaults on aarch64). Dramatic and very significant!
Optimizations are conditioned on two new CMake options in the root CMakeLists.txt.
Both can be provided on the cmake command line as usual, e.g.:
cmake ... -DA72_OPTIMIZATION=ON
Compiler flags have to be applied in the root CMakeLists.txt file so that all of your deps build with the correct flags. I went back and forth on whether the A72 optimizations should be on or off by default. I settled on ON, because optimizing for A72 is probably the Right Thing To Do by default when building for aarch64. Your discretion as to whether you think that's right or not.
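In practice that means something like this (a sketch; the build type and build directory are just examples):

```sh
# On aarch64, a plain configure picks up the A72 tuning (the option defaults to ON):
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Opt out explicitly to get the previous, generic aarch64 flags:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DA72_OPTIMIZATION=OFF
cmake --build build
```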
For the A72/Pi 4, the improvements are (probably) attributable to the addition of -ffast-math, to the use of correct cache-line size definitions (which make a measurable improvement to Eigen matrix multiplication performance), and to better instruction scheduling for the A72 execution units. The resulting binary will run on any ARM Cortex 8.1a processor, but will be mis-tuned for lesser Cortex 8.1a processors.
The A76/Pi 5 improvements are probably mainly attributable to brand-new instructions on ARM Cortex 8.2a processors that are specifically designed to do fast matrix multiplies. There are lesser improvements from -ffast-math, better instruction scheduling, and correct cache-line size definitions (twice the L1 cache compared to an A72, which would make a big difference when doing a matrix multiply). Binaries will run on any ARM Cortex 8.2a processor, but will be mis-tuned on lesser processors in the family.
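One way to see the difference in what the compiler assumes for the two targets (illustrative; requires a GCC new enough to know both CPU names):

```sh
# Dump the ACLE feature macros predefined for each CPU model, then compare:
gcc -mcpu=cortex-a72 -dM -E - </dev/null | grep __ARM_FEATURE | sort > a72-features.txt
gcc -mcpu=cortex-a76 -dM -E - </dev/null | grep __ARM_FEATURE | sort > a76-features.txt
diff a72-features.txt a76-features.txt
```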
The optimizations have been lightly tested in your build, but heavily tested on a forthcoming release of TooB NAM, which switches over to your NeuralAudio library and abandons its previous optimizations. My current TooB NAM dev build uses these compiler flags applied to your NeuralAudio library. I have not explicitly tested your plugin on a Pi 5, but it will get the same performance boost that TooB NAM gets. Out of caution, you may want to verify that the 2x figure is correct if you have the hardware to do so.
I'm not sure how best to document the new options. I THINK Pi 4 and Pi 5 users will respond to "A72" and "A76", and that users of cousins of the Pi family (Orange Pi, Neoverse-based Pi-like things, a variety of other flavors and relatives) will know enough to recognize that the A72 and A76 improvements will also benefit their particular flavor of Pi. (A76s are fairly common in Pi-like things, I think.)
Congratulations on NeuralAudio. Having done that optimization on NAM Core code (fixed-size matrices) for the TooB project, I know exactly how difficult and precise that work is, from first-hand experience. And your implementation is MUCH tidier than mine. And it addresses all the pain points of NAM Core integration. It is a really fine piece of work! I am, frankly, in awe of what you have done.
If I can ever return the favor, please let me know. Or if you have development tasks on NeuralAudio you feel like delegating to someone who is already familiar with the fundamentals of the NeuralAudio code, feel free to ping me. It's not like I'm short of things to do; but I do feel a duty to contribute if I can do so usefully.
And you really need to know (if you don't) that a 2x performance improvement on the Pi 5 is both astonishing and game changing. It completely changes the way NAM will be used on Pi 5s, where you can now afford to replace ALL of your non-linear effect pedals (overdrive and distortion pedals, compressors, a strange and beautiful NAM emulation of a Fender spring reverb that should not work at all but does, ...), and carelessly use split amp models: e.g. the front half of a vintage DeVille + a full 3-knob Fender tone stack + the back half of a vintage DeVille = a NAM-driven model of a Fender DeVille with full 3-knob tone controls. Or use mixed outputs from two or more amp models... or... or... I'm not sure what life looks like in the x64 world; but on a Pi 5 that really does change EVERYTHING!
And thank you ESPECIALLY for having provided the NeuralAudio library under an MIT license!
Best regards,
Robin Davies
PiPedal
TooB Plugins Project