
Conversation

@Ethan-a2

This commit refines the README by:

  • Adding a step that explicitly creates the target directory on the Android device (with mkdir -p) before pushing GGUF models. This ensures the target directory exists and prevents potential adb push failures.
  • Correcting the example command by properly escaping the double quotes around the prompt string, so the command is interpreted correctly when it is re-parsed by the device shell.

These changes enhance the clarity and correctness of the instructions for deploying and running llama.cpp on Snapdragon-based Android devices.
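
For illustration, here is a minimal sketch of the two fixes; the device paths and model name are assumptions inferred from the log output later in this thread, not taken verbatim from the commit:

    # Fix 1 (sketch): create the model directory on the device first,
    # so the subsequent adb push cannot fail on a missing path.
    adb shell mkdir -p /data/local/tmp/gguf
    adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf/

    # Fix 2 (sketch): escape the double quotes so the prompt survives
    # re-parsing by the device shell (see the failure analysis below).
    adb shell /data/local/tmp/llama.cpp/bin/llama-cli \
        -m /data/local/tmp/gguf/Llama-3.2-1B-Instruct-Q4_0.gguf \
        -no-cnv -p \"what is the most popular cookie in the world?\"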

github-actions bot added the documentation (Improvements or additions to documentation) label on Nov 18, 2025
@Ethan-a2
Author

The original failing run was as follows:

M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"

  • adb shell cd /data/local/tmp/llama.cpp; ulimit -c unlimited; LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib ././bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0 -no-cnv -p what is the most popular cookie in the world?
    ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.13
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels.........................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007434518e10
error: invalid argument: is
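
For context (an explanatory note, not part of the original log): this failure is consistent with how adb shell handles arguments — adb joins its arguments into a single string and hands it to the device shell for re-parsing, so quotes consumed by the host shell are gone by the time the device shell splits the command. The prompt therefore breaks into separate words, and llama-cli sees the second word as a stray positional argument, hence "invalid argument: is". A minimal sketch:

    # Fails: the host shell strips the quotes, the device shell re-splits the
    # prompt, and llama-cli receives -p "what" plus stray arguments ("is", ...).
    adb shell ./bin/llama-cli -no-cnv -p "what is the most popular cookie in the world?"

    # Works: the escaped quotes reach the device shell intact, so the prompt
    # stays a single argument.
    adb shell ./bin/llama-cli -no-cnv -p \"what is the most popular cookie in the world?\"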

@Ethan-a2
Author

Fixed logs:

M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"

  • adb shell cd /data/local/tmp/llama.cpp; ulimit -c unlimited; LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/./lib ././bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0 -no-cnv -p "what is the most popular cookie in the world?"
    ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.13
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels.........................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400006d8e582950
build: 7094 (bc4064c) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device HTP0 (Hexagon) (unknown id) - 2048 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 147 tensors from /data/local/tmp/llama.cpp/../gguf/Llama-3.2-1B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 1B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 1B
llama_model_loader: - kv 6: general.license str = llama3.2
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 16
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 2048
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 64
llama_model_loader: - kv 18: llama.attention.value_length u32 = 64
llama_model_loader: - kv 19: general.file_type u32 = 2
llama_model_loader: - kv 20: llama.vocab_size u32 = 128256
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Llama-3.2-1B-Instruct-GGU...
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 112
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 34 tensors
llama_model_loader: - type q4_0: 110 tensors
llama_model_loader: - type q4_1: 2 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 729.75 MiB (4.95 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 16
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 1B
print_info: model params = 1.24 B
print_info: general.name = Llama 3.2 1B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128001 '<|end_of_text|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 225.49 MiB
load_tensors: HTP0 model buffer size = 0.26 MiB
load_tensors: HTP0-REPACK model buffer size = 504.00 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 128
llama_context: n_ubatch = 128
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
llama_kv_cache: HTP0 KV buffer size = 136.00 MiB
llama_kv_cache: size = 136.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 68.00 MiB, V (q8_0): 68.00 MiB
llama_context: HTP0 compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 62.62 MiB
llama_context: graph nodes = 503
llama_context: graph splits = 41
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
what is the most popular cookie in the world?
system_info: n_threads = 6 (n_threads_batch = 6) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 |

sampler seed: 2092038201
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 128, n_predict = -1, n_keep = 1

Chocolate chip cookies are the most popular cookie in the world, according to a survey conducted by the National Confectioners Association.
However, the popularity of cookies can vary greatly from country to country and culture. In the United States, for example, peanut butter cookies are the most popular, followed closely by chocolate chip and oatmeal raisin cookies.
Here are some interesting facts about cookies from around the world:

  1. In Italy, the most popular cookie is the biscotti, a twice-baked cookie that is traditionally broken into pieces and dipped in coffee or wine.
  2. In Japan, mochi cookies, made with glutinous rice flour and filled with sweet fillings, are a popular treat during the New Year's holiday (Oshogatsu).
  3. In India, cardamom-flavored cookies are a popular dessert during the festival of Diwali.
  4. In Brazil, the most popular cookie is the "bolo de batata," a sweet potato cookie that is traditionally served during Carnival.
  5. In Mexico, the "Polvorones" are a sweet cookie that is traditionally made with ground almonds and is often served as a dessert or snack.
  6. In Russia, the "Kuschei" are a type of sweet cookie that is traditionally made with ground nuts and is often served at holiday gatherings.
  7. In Greece, the "Melomakarona" are a traditional Christmas cookie that is made with honey and topped with walnuts.
  8. In Thailand, the "Khao Niew" are a sweet cookie that is traditionally made with tapioca flour and filled with a sweet filling.
  9. In Poland, the "Kremowka" are a type of sweet cookie that is traditionally made with ground nuts and is often served at holiday gatherings.
  10. In China, the "Zhú Mīng" are a sweet cookie that is traditionally made with ground nuts and filled with a sweet filling.
    These are just a few examples of the diverse and varied cookie traditions around the world. Each cookie has its unique ingredients, flavorings, and cultural associations, reflecting the richness and diversity of human culture.

Cookies are a universal language that bring people together and evoke joy and comfort. Whether it's a classic chocolate chip cookie or a unique cookie from another culture, cookies have the power to bring people together and create lasting memories. [end of text]

llama_perf_sampler_print: sampling time = 50.06 ms / 488 runs ( 0.10 ms per token, 9747.33 tokens per second)
llama_perf_context_print: load time = 447.06 ms
llama_perf_context_print: prompt eval time = 95.93 ms / 11 tokens ( 8.72 ms per token, 114.66 tokens per second)
llama_perf_context_print: eval time = 9207.60 ms / 476 runs ( 19.34 ms per token, 51.70 tokens per second)
llama_perf_context_print: total time = 9541.64 ms / 487 tokens
llama_perf_context_print: graphs reused = 474
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 504 = 504 + 0 + 0) + 17592186043912 |
llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 |

Contributor

@DamonFool left a comment


Looks reasonable to me.

github-actions bot added the script (Script related) label on Nov 19, 2025
github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) label on Nov 20, 2025
@max-krasnyansky
Collaborator

The README updates and the run-server.sh script look good.
The rest I cannot review because it's in Chinese. If you don't mind adding an English version (under docs/backend/hexagon) we can review and consider merging both.
The surfing.txt file is just an example in the README. No need to add that to the repo.
