

mediouni-m commented on Nov 20, 2025

Runs models, but really slowly.

Performance numbers on Llama-3.2-1B-Instruct-q4_0.gguf with ./llama-cli -ngl 999 on Makena:

common_perf_print:    sampling time =      27.44 ms
common_perf_print:    samplers time =      12.94 ms /   137 tokens
common_perf_print:        load time =     625.88 ms
common_perf_print: prompt eval time =    1561.50 ms /    28 tokens (   55.77 ms per token,    17.93 tokens per second)
common_perf_print:        eval time =   14988.70 ms /   108 runs   (  138.78 ms per token,     7.21 tokens per second)
common_perf_print:       total time =   25197.05 ms /   136 tokens
common_perf_print: unaccounted time =    8619.42 ms /  34.2 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        108
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + ( 522 =   522 +       0 +       0) + 17592186043894 |
llama_memory_breakdown_print: |   - Host               |                 1186 =   266 +     128 +     792                   |

CPU for comparison:

common_perf_print:    sampling time =      51.48 ms
common_perf_print:    samplers time =      24.49 ms /   306 tokens
common_perf_print:        load time =     468.88 ms
common_perf_print: prompt eval time =     183.36 ms /    14 tokens (   13.10 ms per token,    76.35 tokens per second)
common_perf_print:        eval time =   10807.84 ms /   291 runs   (   37.14 ms per token,    26.92 tokens per second)
common_perf_print:       total time =   15182.34 ms /   305 tokens
common_perf_print: unaccounted time =    4139.66 ms /  27.3 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        289
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                 1165 =   779 +     128 +     258                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  522 =   522 +       0 +       0                |

And with FA (flash attention) enabled on the NPU:

common_perf_print: prompt eval time =     478.30 ms /    16 tokens (   29.89 ms per token,    33.45 tokens per second)
common_perf_print:        eval time =   18251.43 ms /   162 runs   (  112.66 ms per token,     8.88 tokens per second)

And on the CPU:

common_perf_print: prompt eval time =     197.30 ms /    16 tokens (   12.33 ms per token,    81.10 tokens per second)
common_perf_print:        eval time =    5058.76 ms /    91 runs   (   55.59 ms per token,    17.99 tokens per second)
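
For reference, roughly how these four runs can be reproduced; the model path, the -ngl values, and the flash-attention flag spelling are assumptions rather than the exact commands used (the PR only quotes ./llama-cli -ngl 999):

./llama-cli -m Llama-3.2-1B-Instruct-q4_0.gguf -ngl 999           # offload to the NPU (HTP0)
./llama-cli -m Llama-3.2-1B-Instruct-q4_0.gguf -ngl 0             # CPU baseline
./llama-cli -m Llama-3.2-1B-Instruct-q4_0.gguf -ngl 999 -fa on    # NPU with flash attention
./llama-cli -m Llama-3.2-1B-Instruct-q4_0.gguf -ngl 0   -fa on    # CPU with flash attention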

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 20, 2025
mediouni-m changed the title from "Initial Hexagon v68 support boilerplate" to "ggml-hexagon: Initial Hexagon v68 support boilerplate" on Nov 20, 2025
mediouni-m (Author) commented on Nov 20, 2025

edit: worked around

for reference, the observed crash traces:

adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: ############################### Process on cDSP CRASHED!!!!!!! ########################################
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: --------------------- Crash Details are furnished below ------------------------------------
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Process "/frpc/f0491850 test-backend-op" crashed in thread "" for unknown reason
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Crashed Shared Object "./libggml-htp-v68.so" load address : 0x20010000 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: fastrpc_shell_unsigned_3 load address : D00000  and size : FD434 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Fault PC   :    0x20020044 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: LR         :    0x2001FFB0 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: SP         :    0xE32D20 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Bad VA     :    0xFEC02000 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: FP         :    0xE32D28 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: SSR        :    0x1970428 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Error code :    0x428 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Call trace : 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<2001FFB0>] op_matmul_id+0x4148:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<2001FFB0>] op_matmul_id+0x4148:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<20020390>] op_matmul_id+0x4528:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<2001BB20>] op_matmul+0xC30:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<20015EB0>] worker_pool_run_jobs+0x250:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<20015FF4>] worker_pool_run_func+0xF4:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<2001B97C>] op_matmul+0xA8C:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<200139C4>] htp_iface_stop+0x710:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<20012B14>] htp_iface_start+0x708:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: [<20020044>] op_matmul_id+0x41DC:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: ----------------------------- End of Crash Report --------------------------------------------------
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x191:1: Please refer to Hexagon SDK documentation "<HEXAGON_SDK_ROOT>/docs/tools/debug.html" for debugging the user PD exceptions.
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: ############################### Process on cDSP CRASHED!!!!!!! ########################################
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: --------------------- Crash Details are furnished below ------------------------------------
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Process "/frpc/f0491850 test-backend-op" crashed in thread "0x  e54b30:work" because Application called qurt_exit()
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Crashed Shared Object "./libggml-htp-v68.so" load address : 0x20010000 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: fastrpc_shell_unsigned_3 load address : D00000  and size : FD434 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Fault PC   :    0xD270FC 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: LR         :    0x20026758 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: SP         :    0xE54700 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Bad VA     :    0x0 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: FP         :    0xE548A8 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: SSR        :    0x1970427 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Error code :    0x108 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: Call trace : 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: [<20026758>] op_binary+0x1D58:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: [<200259BC>] op_binary+0xFBC:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: [<2002521C>] op_binary+0x81C:     (./libggml-htp-v68.so) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: [<00D270FC>] qurt_exception_raise_nonfatal+0x4:     (fastrpc_shell_unsigned_3) 
adsprpc:dsp: CDSP:platform_qdi_driver.c:792:0x192:1: ----------------------------- End of Crash Report --------------------------------------------------

The latter fault goes away when running with GGML_HEXAGON_NHVX=1 ... maybe it's related to VTCM memory management?
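
For reference, an approximate repro for that fault; the test-backend-ops flag names and reading GGML_HEXAGON_NHVX as the HVX thread count are assumptions, and the backend name is taken from the memory breakdown above:

./test-backend-ops test -b HTP0                        # faults in op_binary as in the trace above
GGML_HEXAGON_NHVX=1 ./test-backend-ops test -b HTP0    # the op_binary fault goes away with a single HVX thread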

For the former, it's an issue with how the DMA descriptors are currently filled on v68. There's a dearth of public documentation about those.

A question (cc @max-krasnyansky): are there DMA docs around somewhere? And is there anything about the v1 DMA descriptors that isn't supported on v68? Should I use the v0 ones there instead? The v1 descriptors do work, but I had to disable bypass on the destination.

mediouni-m changed the title from "ggml-hexagon: Initial Hexagon v68 support boilerplate" to "ggml-hexagon: Initial Hexagon v68 support" on Nov 21, 2025
Add stdexcept include to fix GCC build errors

Signed-off-by: Mohamed Mediouni <[email protected]>
v68 is the Hexagon revision notably used on the Snapdragon 8cx
Gen 3 and the QCM6490.

Signed-off-by: Mohamed Mediouni <[email protected]>
It turns out that the reason why HAP_compute_res_attr_set_vtcm_param_v2
errored out is that 8MB isn't a supported page size.

Signed-off-by: Mohamed Mediouni <[email protected]>
At least on v68, this made things actually work... not a proper fix though, so something to look at later...

Signed-off-by: Mohamed Mediouni <[email protected]>
max-krasnyansky (Collaborator) commented:
@mediouni-m It's very cool that you got v68 to work!
Originally, we decided not to support it because the performance is not going to be that good.
Those missing int32 -> fp32 conversion instructions and other things do add up.
The changes you added are clean and small though. I don't mind merging this since it's functional.
Perhaps it'll be useful for running tiny models, and there are more general optimizations coming that would help.
