LlamaDemo fails to load model

Hi! I've been trying out the LlamaDemo app in executorch-examples and have a two-part question.

The Android app in question: https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android/LlamaDemo

1. I'm able to run the model using the `adb` CLI. As the screenshots show, the app can identify the model. However, it either throws error code 1 or does nothing when I try to configure the model settings. Even when no error is thrown, the model's output is empty. Attaching some screenshots (sorry for the giant images ☺️).

![Image](https://github.com/user-attachments/assets/0f6c5d1e-679b-45af-8afe-45add0c9af32)
![Image](https://github.com/user-attachments/assets/232e98eb-9142-45fa-a9c8-371398a495c2)
![Image](https://github.com/user-attachments/assets/ab4e30bc-dfc3-44be-9a12-73db1dfb310d)
![Image](https://github.com/user-attachments/assets/fad6d843-9b24-4bb2-b522-bc8af6b3a721)
![Image](https://github.com/user-attachments/assets/bfcec459-52f2-4c88-b92e-70b6fc019382)
![Image](https://github.com/user-attachments/assets/a48a197c-d980-4532-9075-5374f6b45a94)

CLI output (this works fine):
```
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte --tokenizer_path tokenizer.model --prompt '<|start_header_id|>system<|end_header_id|>\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nWhat is the capital of Germany?<|eot_id|><|start_header_id|>assistant<|end_header_id|>' --warmup=1 --cpu_threads=5" 
I tokenizers:regex.cpp:27] Registering override fallback regex
I 00:00:00.006891 executorch:main.cpp:87] Resetting threadpool with num threads = 5
I 00:00:00.009134 executorch:runner.cpp:44] Creating LLaMa runner: model_path=llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte, tokenizer_path=tokenizer.model
(...)
I 00:00:02.339483 executorch:text_llm_runner.cpp:208] Warmup run finished!
I 00:00:02.339496 executorch:text_llm_runner.cpp:95] RSS after loading model: 1128.445312 MiB (0 if unsupported)
I 00:00:02.339657 executorch:text_llm_runner.cpp:152] Max new tokens resolved: 108, given pos_ 0, num_prompt_tokens 20, max_context_len 1024

<|start_header_id|>system<|end_header_id|>\n<|eot_id|><|start_header_id|>user<|end_header_id|>\nWhat is the capital of Germany?<|eot_id|><|start_header_id|>assistant<|end_header_id|>I 00:00:02.412924 executorch:text_prefiller.cpp:93] Prefill token result numel(): 128256


I 00:00:02.413203 executorch:text_llm_runner.cpp:178] RSS after prompt prefill: 1128.445312 MiB (0 if unsupported)
Die Hauptstadt Deutschlands ist Berlin.<|eot_id|>I 00:00:02.583634 executorch:text_token_generator.h:123] 
Reached to the end of generation
(... the rest of the output is truncated)
```

2. I see that the README.md in the parent `llm` directory mentions the demos will be migrated to the main ExecuTorch examples directory. Do you have a timeline for this? Arm references this app in one of our [Learning Paths](https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/), so I'd like to keep that content up to date.

I appreciate your time!

Kind regards,
Annie

LlamaDemo fails to load model #14656