Eval bug: Garbled text appears when running the Qwen3-0.6B model on a mobile phone using the Hexagon backend #16854

@Arvin-928

Description

Name and Version

zorn:/data/local/tmp/llama.cpp $ LD_LIBRARY_PATH=lib DSP_LIBRARY_PATH=lib ./bin/llama-cli --version
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels........................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v75
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v75.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb40000750d1ffe90
version: 6824 (128a2d37b)
built with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu

Operating systems

Linux

GGML backends

Hexagon

Hardware

Qualcomm Snapdragon 8 Gen 3

Models

Qwen/Qwen3-1.7B or Qwen/Qwen3-0.6B

Problem description & steps to reproduce

Hello @max-krasnyansky, thank you for your outstanding open-source work on "Add experimental ggml-hexagon backend for the Hexagon NPU".

I converted and quantized Q4_0 versions of the Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B models with the same procedure on the host machine, and all three run on the phone. Strangely, only the Qwen3-4B output looks normal; the other two produce garbled text.
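
For context, the host-side conversion followed the usual llama.cpp flow. A minimal sketch is below; the checkpoint paths and output filenames are placeholders rather than the exact commands used:

```sh
# Convert the Hugging Face checkpoint to an F16 GGUF, then quantize to Q4_0.
# Paths and filenames are placeholders; the same two steps were repeated for
# Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B.
python convert_hf_to_gguf.py /path/to/Qwen3-0.6B \
    --outfile Qwen3-0.6B-F16.gguf --outtype f16

./build/bin/llama-quantize Qwen3-0.6B-F16.gguf Qwen3-0.6B-Q4_0.gguf Q4_0
```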

Appendix: the build followed the documented way to build llama.cpp for a Snapdragon-based Android device; a rough sketch is included below.
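
Roughly, the cross-compile looked like the following. The Hexagon-specific CMake option and SDK variable names here are assumptions and may not match the documented flags exactly:

```sh
# Cross-compile llama.cpp for arm64 Android with the NDK toolchain.
# GGML_HEXAGON and HEXAGON_SDK_ROOT are assumed names -- check the backend's
# build documentation for the exact options.
cmake -B build-android \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-31 \
    -DGGML_OPENCL=ON \
    -DGGML_HEXAGON=ON \
    -DHEXAGON_SDK_ROOT=$HEXAGON_SDK_ROOT
cmake --build build-android --config Release -j

# Push bin/ and lib/ (including libggml-htp-v75.so) to /data/local/tmp/llama.cpp,
# then run with LD_LIBRARY_PATH=lib DSP_LIBRARY_PATH=lib as in the logs below.
```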

First Bad Commit

No response

Relevant log output

### The output from Qwen3-4B is normal:

zorn:/data/local/tmp/llama.cpp $ LD_LIBRARY_PATH=lib DSP_LIBRARY_PATH=lib ./bin/llama-cli -t 4 -fa on -m models-npu/Qwen3-4B-Q4_0.gguf -p "Hello my name is"
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels........................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v75
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v75.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400006f10582950
build: 6824 (128a2d37b) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM) 750) (unknown id) - 0 MiB free
llama_model_load_from_file_impl: using device HTP0 (Hexagon) (unknown id) - 2048 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 398 tensors from models-npu/Qwen3-4B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 4B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 4B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-4B/...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 4B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                          qwen3.block_count u32              = 36
llama_model_loader: - kv  13:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv  14:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv  15:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  16:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  17:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  21:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 2.11 GiB (4.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2560
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 9728
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 4.02 B
print_info: general.name     = Qwen3 4B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   208.65 MiB
load_tensors:   CPU_REPACK model buffer size =   208.65 MiB
load_tensors:       OpenCL model buffer size =    54.16 MiB
load_tensors:         HTP0 model buffer size =     0.73 MiB
load_tensors:  HTP0-REPACK model buffer size =  1894.92 MiB
.....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache:     OpenCL KV buffer size =    16.00 MiB
llama_kv_cache:       HTP0 KV buffer size =   560.00 MiB
llama_kv_cache: size =  576.00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):  288.00 MiB, V (f16):  288.00 MiB
llama_context:     OpenCL compute buffer size =    79.01 MiB
llama_context:       HTP0 compute buffer size =    75.00 MiB
llama_context:        CPU compute buffer size =   296.75 MiB
llama_context: graph nodes  = 1267
llama_context: graph splits = 216
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 3797008693
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
Hello my name is
assistant
<think>
Okay, the user started with "Hello my name is" and then stopped. I need to respond appropriately. Since they didn't finish their message, I should prompt them to continue. Let me make sure my response is friendly and encouraging. Maybe say something like, "Hello! I'm [Name], and I'm here to help. What's your name?" Wait, but the user hasn't introduced themselves yet. Wait, the user just said "Hello my name is" and then stopped. So maybe I should ask them to finish their sentence. Let me check the guidelines. I should respond in a way that's helpful and not assume their name. Maybe say, "Hello! I'm [Name], and I'm here to help. What's your name?" But since the user hasn't provided their name yet, maybe I should just ask them to finish their message. Alternatively, maybe I can respond with a friendly greeting and prompt them to continue. Let me make sure not to assume their name. So, "Hello! I'm [Name], and I'm here to help. What's your name?" Hmm, but the user hasn of yet introduced themselves. Wait, the user's message is "Hello my name is" and then it stops. So maybe the user is starting to introduce themselves. So the best response would be to ask them to finish their sentence. Like, "Hello! I'm [Name], and I'm here to help. What's your name?" Wait, but that might be assuming my own name. Wait, no, the user is the one introducing themselves. The assistant should respond by asking for their name. So maybe the correct approach is to say, "Hello! What's your name?" or "Hello, I'm [Name], and I'm here to help. What's your name?" But the user hasn't introduced themselves yet. Wait, the user's message is "Hello my name is" and then it stops. So the user is starting to say their name. So the assistant should prompt them to finish their sentence. Like, "Hello! I'm [Name], and I'm here to help. What's your name?" Wait, but the user hasn't said their name yet. Maybe the assistant should respond by asking for the user's name. So the correct response would be, "Hello! What's your name?" or "Hello, I'm [Name], and I'm here to help. What's your name?" But the user hasn't introduced themselves yet. Wait, the user's message is "Hello my name is" and then it stops. So the user is in the process of introducing themselves. The assistant should respond by asking them to complete their sentence. Maybe, "Hello! I'm [Name], and I'm here to help. What's your name?" But that might be confusing. Alternatively, "Hello! What's your name?" to prompt them to finish their message. I think the best approach is to ask the user to finish their sentence by saying, "Hello! I'm [Name], and I'm here to help. What's your name?" Wait, but the user hasn't introduced themselves yet. Maybe the assistant should just ask for the user's name. Let me check the guidelines again. The assistant should be helpful, polite, and not assume the user's name. So the response should be, "Hello! What's your name?" or "Hello, I'm [Name], and I'm here to help. What's your name?" Wait, but the user hasn't introduced themselves. So the assistant should ask the user to finish their message. So the correct response would be, "Hello! I'm [Name], and I'm here to help. What's your name?" But that might be confusing. Alternatively, "Hello! What's your name?" to prompt them to continue. I think that's the best approach. So the assistant should respond with a friendly greeting and ask for the user's name.
</think>

Hello! I'm [Name], and I'm here to help. What's your name?

> 
llama_perf_sampler_print:    sampling time =      73.20 ms /   848 runs   (    0.09 ms per token, 11585.33 tokens per second)
llama_perf_context_print:        load time =    2871.50 ms
llama_perf_context_print: prompt eval time =     618.70 ms /    12 tokens (   51.56 ms per token,    19.40 tokens per second)
llama_perf_context_print:        eval time =   82902.04 ms /   835 runs   (   99.28 ms per token,    10.07 tokens per second)
llama_perf_context_print:       total time =   86435.55 ms /   847 tokens
llama_perf_context_print:    graphs reused =        831
llama_memory_breakdown_print: | memory breakdown [MiB]                  | total   free    self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - GPUOpenCL (QUALCOMM Adreno(TM) 750) |     0 =    0 + ( 149 =    54 +      16 +      79) + 17592186044266 |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)                      |  2048 = 2048 + (   0 =     0 +       0 +       0) +              0 |
llama_memory_breakdown_print: |   - Host                                |                 1141 =   209 +     560 +     371                   |
llama_memory_breakdown_print: |   - CPU_REPACK                          |                  208 =   208 +       0 +       0                   |
llama_memory_breakdown_print: |   - HTP0-REPACK                         |                 1894 =  1894 +       0 +       0                   |
Interrupted by user


### The output from Qwen3-0.6B or Qwen3-1.7B contains garbled text:

zorn:/data/local/tmp/llama.cpp $ LD_LIBRARY_PATH=lib DSP_LIBRARY_PATH=lib ./bin/llama-cli -t 4 -fa on -m models-npu/Qwen3-0.6B-Q4_0.gguf -p "Hello my name is"
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: device max workgroup size: 1024
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels........................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v75
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v75.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007e00991bd0
build: 6824 (128a2d37b) with Android (13324770, +pgo, +bolt, +lto, +mlgo, based on r530567d) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM) 750) (unknown id) - 0 MiB free
llama_model_load_from_file_impl: using device HTP0 (Hexagon) (unknown id) - 2048 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 311 tensors from models-npu/Qwen3-0.6B-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 0.6B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 0.6B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-0.6...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3 0.6B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-0.6...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  12:                          qwen3.block_count u32              = 28
llama_model_loader: - kv  13:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv  14:                     qwen3.embedding_length u32              = 1024
llama_model_loader: - kv  15:                  qwen3.feed_forward_length u32              = 3072
llama_model_loader: - kv  16:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  17:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  21:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q4_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 403.42 MiB (4.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 1024
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 3072
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: model type       = 0.6B
print_info: model params     = 751.63 M
print_info: general.name     = Qwen3 0.6B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:       OpenCL model buffer size =     8.45 MiB
load_tensors:   CPU_REPACK model buffer size =    83.46 MiB
load_tensors:   CPU_Mapped model buffer size =    83.46 MiB
load_tensors:  HTP0-REPACK model buffer size =   227.81 MiB
load_tensors:         HTP0 model buffer size =     0.24 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache:     OpenCL KV buffer size =    16.00 MiB
llama_kv_cache:       HTP0 KV buffer size =   432.00 MiB
llama_kv_cache: size =  448.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_context:     OpenCL compute buffer size =    26.01 MiB
llama_context:       HTP0 compute buffer size =    24.00 MiB
llama_context:        CPU compute buffer size =   296.75 MiB
llama_context: graph nodes  = 987
llama_context: graph splits = 168
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 2872072462
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
Hello my name is
assistant
 Angeles, José
员انت
</think>和服务 // 服务线

服务 there is a service line.安德里亚是 service linelete.为了提上您的 service line,gment, pleaseانت

服务 thereheat isforces, but thereentials. please try it out.از

服务 there isuably, but there are. pleaseانت
员间的 service line, pleaseانت

服务 thereforces, but there. {
service line: 123amaha, 4uably, 777, 456, 7ithmetic, 45forces, 777, 456, 777
service line: 1
llama_perf_sampler_print:    sampling time =      12.40 ms /   151 runs   (    0.08 ms per token, 12181.35 tokens per second)
llama_perf_context_print:        load time =     640.53 ms
llama_perf_context_print: prompt eval time =     120.46 ms /    12 tokens (   10.04 ms per token,    99.62 tokens per second)
llama_perf_context_print:        eval time =    4096.02 ms /   138 runs   (   29.68 ms per token,    33.69 tokens per second)
llama_perf_context_print:       total time =   15409.25 ms /   150 tokens
llama_perf_context_print:    graphs reused =        138
llama_memory_breakdown_print: | memory breakdown [MiB]                  | total   free    self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - GPUOpenCL (QUALCOMM Adreno(TM) 750) |     0 =    0 + (  50 =     8 +      16 +      26) + 17592186044365 |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)                      |  2048 = 2048 + (   0 =     0 +       0 +       0) +              0 |
llama_memory_breakdown_print: |   - Host                                |                  836 =    83 +     432 +     320                   |
llama_memory_breakdown_print: |   - CPU_REPACK                          |                   83 =    83 +       0 +       0                   |
llama_memory_breakdown_print: |   - HTP0-REPACK                         |                  227 =   227 +       0 +       0                   |
Interrupted by user
