Query regarding support of Executorch for ARM Ethos-U65 backend #9356

Open
vikasbalaga opened this issue Mar 18, 2025 · 17 comments
Labels
partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm

Comments

@vikasbalaga

vikasbalaga commented Mar 18, 2025

Hi,

I have started working with Executorch, and in this section on launching Executorch on ARM Ethos-U, I observed that only Ethos-U55 and Ethos-U85 are mentioned.
Also, in the model conversion script and setup utilities, the supported targets only include Ethos-U55 and Ethos-U85 variants.

But I am trying to work with an Ethos-U65-based system, so does that mean Executorch only supports the above-mentioned variants, or does it also support Ethos-U65?

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218

@kimishpatel
Contributor

@zingo @Erik-Lundell @digantdesai can any of you answer?

@kimishpatel kimishpatel added the partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm label Mar 18, 2025
@freddan80
Collaborator

Hi @vikasbalaga , thx for your interest in Executorch and Ethos-U 🥇 Ethos-U65 is supported with Executorch as well, but we haven't given it too much love yet, for a couple of reasons: 1) there's no FVP to test it on, and 2) the AOT flow is very similar to Ethos-U55's. Ethos-U65 is supported conceptually; it just needs some plumbing, for example in the list you mention (model conversion script) and in ArmCompileSpecBuilder. There are some more places as well. If you are happy to give it a go, we can support you. Just push a PR and tag us (@digantdesai @freddan80 @per @zingo @oscarandersson8218 ). We'll give Ethos-U65 more attention in the medium term.

The runtime flow is slightly different. Ethos-U65 sits on an "ML island" (an embedded Cortex-M + Ethos-U subsystem) as part of a larger system (Cortex-A, rich OS). That means the Executorch runtime should run on the ML island, and the application calling into the Executorch runtime needs to communicate somehow with the Cortex-A system. That means of communication could build on, e.g., the ethos-u-linux-driver-stack. Some (not too big) modifications will probably be needed for Executorch workloads.

Hope this helps 👍

@vikasbalaga
Author

vikasbalaga commented Mar 19, 2025

@freddan80 and others, thanks for your quick response.

The runtime flow is slightly different. Ethos-U65 sits on an "ML island" (an embedded Cortex-M + Ethos-U subsystem) as part of a larger system (Cortex-A, rich OS)...

Yes, in my case it is a (Cortex-A55, OS) host and a (Cortex-M33 + Ethos-U65) ML island. I also have a hardware setup available, so I don't need the FVP.

There are some more places as well. If you are happy to give it a go, we can support you

Yes, I am interested in trying it. I could find the following places, which require modification:

So, could you help me in finding other modifications that are required?

Also (I think this is a naive question), will this Executorch implementation work on my Cortex-M33 CPU?

@freddan80
Collaborator

So, could you help me in finding other modifications that are required?

I'd start with those and debug from there. The important thing is that the arguments in the call to vela look right (we can help check that).

run.sh (it looks like some config is being done there; not sure what it will be for Ethos-U65)

The run.sh script will AOT compile, build, and run an inference using the FVP. In your case you'd probably be happy just generating a .pte (with arm_aot_compile.py) ahead of time, then using build_executorch_runner.sh, modified under the hood to use your application code, linker script, and startup code (rather than the Corstone FVP's), to produce an .elf. Or perhaps it's even better to modify the cmake build to adapt to your setup.

Note that you'd want to use a 'vela.ini' file that fits your system config, and provide that to build_executorch_runner.sh. You probably have such a file already with your dev board SDK?
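
For concreteness, the decomposed flow might look roughly like this (a sketch only: the ethos-u65-256 target string, the flag names, and the file names are assumptions, so check each script's --help for the real interface):

# AOT step: generate a .pte for a hypothetical Ethos-U65 target
# (target string and flags are assumptions, not a verified interface)
python3 arm_aot_compile.py --model_name=add --target=ethos-u65-256

# Build step: produce an .elf against your own linker script and startup
# code (rather than the Corstone FVP's), with your system's vela.ini
./build_executorch_runner.sh --pte=add_arm_delegate.pte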

@vikasbalaga
Author

@freddan80 ,

I have modified arm_aot_compile.py and I think I am able to generate a *.pte model for the Ethos-U65 backend. I picked the configuration details based on my hardware type. (I have forked the repo and committed my changes to a private branch for your reference.)

then using build_executorch_runner.sh, modified under the hood to use your application code...

I tried modifying build_executorch_runner.sh for my system (Cortex-M33 CPU and Ethos-U65 NPU), but I am observing CMake errors:

CMake Error at CMakeLists.txt:408 (message):
  Unsupported SYSTEM_CONFIG: Ethos_U65_High_End

I tried to debug it, but I couldn't understand how to update the "NPU timing adapters" as per Ethos-U65 requirements. Also, it looks like we need to specify a TARGET_BOARD corresponding to Corstone-300/320, but since there is no simulator in this case, how can I set that macro?

@freddan80
Collaborator

I have forked the repo and committed my changes to a private branch for your reference

Would it be possible to share the changes?

I assigned this to @AdrianLundell. He'll help you.

@vikasbalaga
Author

This points to the private branch I created by forking the repo.

Thanks!

@AdrianLundell
Collaborator

Hi, nice work so far!

The examples/arm/executor_runner code and the related CMake scripts used to build it should be viewed as an example to get you started when building your own application. The build_executorch_runner.sh script and all flags containing target-specific info in the runtime flow are there to make this example convenient to run and to help our testing, rather than being an official API.

For example, the timing adapters and the related macros TARGET_BOARD, SYSTEM_CONFIG, and MEMORY_MODE which you mention are only relevant for the simulators, so to answer your question: you can ignore those completely. The relevant parts of this CMake script are the linking of the libraries and the conversion of the .pte to a header file; with that done, you can approach this like writing any other application for the U65.
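
For illustration, the .pte-to-header step can be reproduced generically like this (a sketch; the file names are placeholders, and the example build uses its own helper for this):

# Embed the .pte flatbuffer as a C array so the firmware can link it in
xxd -i add_arm_delegate.pte > model_pte.h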

The simulator is of course very useful when developing, so if you have not done so, I would suggest starting by testing your model and executor_runner on the U55 using the Corstone-300 target, and moving to the U65 once you have that working.

@vikasbalaga
Author

vikasbalaga commented Mar 24, 2025

I would suggest starting by testing your model and executor_runner on the U55 using the Corstone-300 target

I have tried performing inference on the Corstone-300 FVP by following the steps mentioned here. With this, I am able to perform inference on the FVP (I tried a simple model with an "ADD" operation):

I [executorch:arm_perf_monitor.cpp:133] NPU Inferences : 1
I [executorch:arm_perf_monitor.cpp:134] Profiler report, CPU cycles per operator:
I [executorch:arm_perf_monitor.cpp:138] ethos-u : cycle_cnt : 0 cycles
I [executorch:arm_perf_monitor.cpp:145] Operator(s) total: 0 CPU cycles
I [executorch:arm_perf_monitor.cpp:151] Inference runtime: 2585 CPU cycles total
I [executorch:arm_perf_monitor.cpp:153] NOTE: CPU cycle values and ratio calculations require FPGA and identical CPU/NPU frequency
I [executorch:arm_perf_monitor.cpp:162] Inference CPU ratio: 90.10 %
I [executorch:arm_perf_monitor.cpp:166] Inference NPU ratio: 9.90 %
I [executorch:arm_perf_monitor.cpp:175] cpu_wait_for_npu_cntr : 256 CPU cycles
I [executorch:arm_perf_monitor.cpp:180] Ethos-U PMU report:
I [executorch:arm_perf_monitor.cpp:181] ethosu_pmu_cycle_cntr : 411
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr0 : 6
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr1 : 43
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr2 : 3
I [executorch:arm_perf_monitor.cpp:184] ethosu_pmu_cntr3 : 634
I [executorch:arm_perf_monitor.cpp:187] Ethos-U PMU Events:[ETHOSU_PMU_AXI0_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_AXI1_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_AXI0_WR_DATA_BEAT_WRITTEN, ETHOSU_PMU_NPU_IDLE]
I [executorch:arm_executor_runner.cpp:630] model_pte_program_size:     2032 bytes.
I [executorch:arm_executor_runner.cpp:631] model_pte_loaded_size:      2032 bytes.
I [executorch:arm_executor_runner.cpp:645] method_allocator_used:     308 / 62914560  free: 62914252 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:652] method_allocator_planned:  64 bytes
I [executorch:arm_executor_runner.cpp:654] method_allocator_loaded:   220 bytes
I [executorch:arm_executor_runner.cpp:655] method_allocator_input:    24 bytes
I [executorch:arm_executor_runner.cpp:656] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:659] temp_allocator_used:       0 / 1048576 free: 1048576 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:675] Model executed successfully.
I [executorch:arm_executor_runner.cpp:679] 1 outputs:
Output[0][0]: (int) 2
Output[0][1]: (int) 2
Output[0][2]: (int) 2
Output[0][3]: (int) 2
Output[0][4]: (int) 2

However, I have observed that the arm_executor_runner application is huge (~62 MB). Am I missing something here?

@zingo
Collaborator

zingo commented Mar 24, 2025

Nope, that is correct. The example just allocates a 60 MB buffer so we can test/use large models out of the box, as the FVP can use quite a lot of memory.

See

#if !defined(ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE)
#define ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE (60 * 1024 * 1024)
#endif
const size_t method_allocation_pool_size =
    ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE;
unsigned char __attribute__((
    section("input_data_sec"),
    aligned(16))) method_allocation_pool[method_allocation_pool_size];

You can either just change the code or set ET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE from cmake as a workaround.
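
As a minimal sketch of that workaround (assuming your build forwards the flag to the compiler; here a 1 MiB pool instead of 60 MiB):

# Shrink the method allocator pool via the preprocessor define
cmake -DCMAKE_CXX_FLAGS="-DET_ARM_BAREMETAL_METHOD_ALLOCATOR_POOL_SIZE=$((1024*1024))" <your other cmake args>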

We hope/plan to look into making the handling of this area better, so it doesn't end up in the elf and such, but right now it works like this. It's just a bad leftover from when we "forked" from examples/devtools/example_runner/example_runner.cpp :)

@vikasbalaga
Author

vikasbalaga commented Apr 4, 2025

The examples/arm/executor_runner code and the related CMake scripts used to build it should be viewed as an example to get you started when building your own application...

@AdrianLundell, I tried to build my own application by adding CMake scripts, but I have hit a roadblock in providing a custom linker script. I spent almost 2 weeks with not much progress :(

So I gave that up and tried a second approach, where I integrate the Executorch libraries into a "working firmware application" that is available for my board (which comes with its own linker script).
With this approach, at least I can see the application launching, but I received the following error message from the Executorch runtime:

E [executorch:method.cpp:748] Missing operator: [0] aten::add.out
E [executorch:method.cpp:989] There are 1 instructions don't have corresponding operator registered.See logs for details

(I have taken arm_executor_runner as a reference for my application and am using a sample .pte model with only an "add" operation.)

It looks like there is some operator registry in which all the ops need to be registered, but I am not sure how it works.
So, can I please get some insights into this "operator registration"?

Also, when I tried the examples, it looks like "add" is mapped to the Ethos-U delegate, so when I try to run it without the delegate option, I observe similar errors even on the simulator.

Thanks!

@Juanfi8
Contributor

Juanfi8 commented Apr 4, 2025

Hi,
Are you still using the example scripts? If yes, you can pass --portable_kernels=aten::add.out to run.sh and it will do the job. The full command is: ./run.sh --model_name=add --aot_arm_compiler_flags="" --portable_kernels=aten::add.out. If that is not the case, you can take a look at the backends/arm/scripts/build_portable_kernels.sh script to build the portable kernels library and link it to your executable.

@vikasbalaga
Author

Hi,
With that option, I can see the issue is fixed on the simulator!
But I tried to do the same for the application that I have built, invoking the following scripts while building the Executorch libs:

./executorch/backends/arm/scripts/build_executorch.sh
./executorch/backends/arm/scripts/build_portable_kernels.sh  --portable_kernels="aten::add.out"

And here are the libs that I am linking into my application (I am linking all the *.a files generated in the arm_test/cmake-out folder):

libexecutorch.a
libexecutorch_core.a
libarm_portable_ops_lib.a
libexecutorch_delegate_ethos_u.a
libextension_runner_util.a
libextension_tensor.a
liboptimized_portable_kernels.a
libportable_kernels.a
libportable_ops_lib.a
libquantized_kernels.a
libquantized_ops_lib.a

But I am still facing that issue in my application :(

I can confirm that the build logs generated by CMake are identical between the simulator and my application.

@AdrianLundell
Collaborator

Sounds like a good approach to me to start with a known working application and build from there!

For the operator registration: the executorch runtime has no operator implementations by default, so you need to cross-compile them for Cortex-M55 using the executorch/backends/arm/scripts/build_portable_kernels.sh script and link them as described by Juanfi8. You could try debugging with the toolchain tools, such as arm-none-eabi-readelf or arm-none-eabi-objdump (under executorch/examples/arm/ethos-u-scratch/arm-gnu-toolchain-13.3.rel1-x86_64-arm-none-eabi/bin/), to inspect your binaries and see what is included. Conceptually there should not be a difference between the simulator and a real application, so I suspect there is an issue in how you build. Are there any more logs you could share, since the error message mentions "See logs for details"?
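
For example (a sketch: my_app.elf is a placeholder, and the exact symbol names vary, so grep for whatever your missing op is called):

# Did the kernel registration symbols make it into the final image?
arm-none-eabi-nm --defined-only my_app.elf | grep -i add
# Compare section sizes between the working simulator build and yours
arm-none-eabi-objdump -h my_app.elf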

However, this error suggests to me that the network has probably not been lowered, since the add operator should be delegated to the Ethos-U rather than run on the CPU; but maybe you have not come to that part yet?
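
For reference, in the example flow delegation is controlled by the AOT compiler flags (treat the exact defaults as an assumption and check run.sh):

# Lowered/delegated to Ethos-U (flags assumed from the example defaults)
./run.sh --model_name=add --aot_arm_compiler_flags="--delegate --quantize"
# CPU-only, which then needs the portable kernel linked in
./run.sh --model_name=add --aot_arm_compiler_flags="" --portable_kernels=aten::add.out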

@vikasbalaga
Author

However, this error suggests to me that the network has probably not been lowered, since the add operator should be delegated to the Ethos-U rather than run on the CPU; but maybe you have not come to that part yet?...

Yes, I wanted to start slow: first executing directly on my CPU (Cortex-M33), and then delegating to the Ethos-U65.

I will try to compare both builds to see what I am missing, but one thing I am not sure about: does the "missing operator" error indicate that the above-mentioned libs I linked somehow don't contain the add operation, or is it related to some configuration, similar to the --portable_kernels=aten::add.out flag I missed earlier?
Can you confirm the above-mentioned libs are sufficient, or am I missing something?

"See logs for details"?...

I just haven't implemented that part yet ;)

@AdrianLundell
Collaborator

I see, it could be a problem with the bindings as well. From examples/arm/CMakeLists.txt:

# Generate C++ bindings to register kernels into both PyTorch (for AOT) and
# Executorch (for runtime). Here select all ops in functions.yaml
gen_selected_ops(
  LIB_NAME
  "arm_portable_ops_lib"
  OPS_SCHEMA_YAML
  ""
  ROOT_OPS
  "${EXECUTORCH_SELECT_OPS_LIST}"
  INCLUDE_ALL_OPS
  ""
)
generate_bindings_for_kernels(
  LIB_NAME "arm_portable_ops_lib" FUNCTIONS_YAML
  ${EXECUTORCH_ROOT}/kernels/portable/functions.yaml
)
gen_operators_lib(
  LIB_NAME "arm_portable_ops_lib" KERNEL_LIBS portable_kernels DEPS executorch
)

Are you doing this?

@vikasbalaga
Author

I tried comparing the .map file of my application with the simulator's and observed that in my app, the symbols from some of the libs (libportable_kernels.a, libportable_ops_lib.a, ...) are not being included. (I think that might be the reason for this issue.)

So I tried linking the libs with the --whole-archive option and observed overflows in the .text and .data sections.
I am not sure I can afford such large memory use on my board (as the Cortex-M33 is on an ML island).

So, is there a way I can target only specific libs (for the kernels) among the list, so that I can try to fit only those in the available memory?
