-
Hi @gj-raza, first I want to note that LCE runs only on the ARM CPU; it does not use any of the deep learning accelerators on the Jetson. Nevertheless, since these cores support the ARMv8.2 instruction set, they should indeed use LCE's optimized kernel implementations. Our docs report a time of 42 ms on the Pixel 1 phone, whereas you found 69 ms, assuming it's exactly the same network being benchmarked. I tried searching for some info to compare the CPUs: according to general benchmarks, the Nvidia Carmel CPU should be faster even though its clock speed is slightly lower. It is possible that the Carmel cores are heavily optimized for the particular tasks that appear in those benchmarks but not for the binary operations we employ in LCE. For example, the popcount instruction that we use may not be pipelined as efficiently as it is on the CPU in the Pixel 1. It is hard to say more without knowing the internals of the Nvidia Carmel core and without doing extensive profiling.
-
@gj-raza, did you run the Xavier NX in the MAXP power mode so that the cores can reach their limit of 1.9 GHz? Theoretically it should also be possible to build the kernel to exceed the official clock limits (at least for one core, so that the system remains stable for your single-core tests with LCE).
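For reference, on a stock JetPack install the power mode is controlled with `nvpmodel` and the clocks can be pinned with `jetson_clocks`. Mode IDs and names vary between Jetson models, so query first; this is a generic sketch, not NX-specific advice:

```shell
# List/query the current power mode (IDs and names differ per Jetson model)
sudo nvpmodel -q

# Select the maximum-performance mode; on many Jetson boards this is mode 0,
# but verify against the output of the query above
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to the maximum allowed by the selected mode
sudo jetson_clocks
```

Benchmarking without this step can easily leave the cores well below 1.9 GHz due to DVFS, which alone could explain a large gap against the Pixel 1 numbers.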
-
I wanted to benchmark the performance of Larq CE on a Jetson Xavier NX (which has an ARMv8.2-based Carmel CPU). For that I compiled CE from source natively as described in the docs, and here is the result for the BinaryResNetE18 model from the zoo:
The concern is that the inference-time numbers don't even beat the Pixel benchmarks reported on your zoo page, despite the Carmel being a better processor than the Pixel's. Can you please shed some light on that?
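For anyone reproducing this: LCE ships a benchmark binary built on TensorFlow Lite's benchmark tool. A typical single-threaded run looks roughly like the following (the exact binary path depends on your build, and `BinaryResNetE18.tflite` here stands in for whatever converted model file you exported from the zoo):

```shell
# Benchmark a converted .tflite model with one thread; flags follow the
# standard TFLite benchmark tool (num_runs, warmup_runs, etc.)
./lce_benchmark_model \
  --graph=BinaryResNetE18.tflite \
  --num_threads=1 \
  --num_runs=50
```

Comparing single-threaded runs at a pinned clock frequency is the fairest way to line these numbers up against the published Pixel 1 results.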