diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/1-prerequisites.md b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/1-prerequisites.md
index c26f0e762b..865a90aa20 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/1-prerequisites.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/1-prerequisites.md
@@ -14,6 +14,7 @@ Begin by installing the latest version of [Android Studio](https://developer.and
 Next, install the following command-line tools:
 - `cmake`; a cross-platform build system.
+- `python3`; an interpreted programming language, used by the project to fetch dependencies and models.
 - `git`; a version control system that you use to clone the Voice Assistant codebase.
 - `adb`; Android Debug Bridge, used to communicate with and control Android devices.
@@ -22,9 +23,20 @@ Install these tools with the appropriate command for your OS:
 {{< tabpane code=true >}}
  {{< tab header="Linux/Ubuntu" language="bash">}}
 sudo apt update
-sudo apt install git adb cmake -y
+sudo apt install git adb cmake python3 -y
  {{< /tab >}}
  {{< tab header="macOS" language="bash">}}
-brew install git android-platform-tools cmake
+brew install git android-platform-tools cmake python
+ {{< /tab >}}
+{{< /tabpane >}}
+
+Ensure the correct version of Python is installed; the project requires Python 3.9 or later:
+
+{{< tabpane code=true >}}
+ {{< tab header="Linux/Ubuntu" language="bash">}}
+python3 --version
+ {{< /tab >}}
+ {{< tab header="macOS" language="bash">}}
+python3 --version
 {{< /tab >}}
 {{< /tabpane >}}
diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/2-overview.md b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/2-overview.md
index 22348d6cfc..ddc12c895f 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/2-overview.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/2-overview.md
@@ -33,6 +33,26 @@ This process includes the following stages:
 - A neural network analyzes these features to predict the most likely transcription based on grammar and context.
 - The recognized text is passed to the next stage of the pipeline.
 
+The Voice Assistant pipeline imports and builds a separate module to provide this STT functionality. You can access it at:
+
+```
+https://gitlab.arm.com/kleidi/kleidi-examples/speech-to-text
+```
+
+You can build this module for various platforms to benchmark the STT functionality independently:
+
+|Platform|Details|
+|---|---|
+|Linux|x86_64: KleidiAI is disabled by default. aarch64: KleidiAI is enabled by default.|
+|Android|Cross-compile for an Android device; ensure the Android NDK path is set and the correct toolchain file is provided. KleidiAI is enabled by default.|
+|macOS|Native build or cross-compilation for a Mac device. KleidiAI and SME kernels can be used if available on the device.|
+
+Currently, this module uses [whisper.cpp](https://github.com/ggml-org/whisper.cpp) and wraps the backend library with a thin C++ layer. The module also provides JNI bindings for developers targeting Android-based applications.
+
+{{% notice %}}
+You can find more information on how to build and use this module [here](https://gitlab.arm.com/kleidi/kleidi-examples/speech-to-text/-/blob/main/README.md?ref_type=heads).
+{{% /notice %}}
+
 ## Large Language Model
 
 Large Language Models (LLMs) enable natural language understanding and, in this application, are used for question-answering.
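+Both the speech-to-text module above and the LLM module described below can be configured and built on their own, outside of the Android app. The sketch below shows one plausible way to do this for the speech-to-text module on a Linux host; the exact CMake options, targets, and benchmark binaries are documented in the module's README, so treat these commands as illustrative assumptions rather than the module's documented build steps.
+
+```bash
+# Minimal sketch: clone and build the speech-to-text module natively.
+# Assumes the repository's top-level CMakeLists.txt and default options;
+# per the table above, KleidiAI is enabled by default on aarch64 Linux hosts.
+git clone https://gitlab.arm.com/kleidi/kleidi-examples/speech-to-text
+cd speech-to-text
+cmake -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j"$(nproc)"
+
+# For Android, cross-compile with the NDK toolchain file instead, for example:
+# cmake -B build-android \
+#   -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
+#   -DANDROID_ABI=arm64-v8a
+```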
@@ -41,8 +61,37 @@ The text transcription from the previous part of the pipeline is used as input t
 By default, the LLM runs asynchronously, streaming tokens as they are generated. The UI updates in real time with each token, which is also passed to the final pipeline stage.
 
+The Voice Assistant pipeline imports and builds a separate module to provide this LLM functionality. You can access it at:
+
+```
+https://gitlab.arm.com/kleidi/kleidi-examples/large-language-models
+```
+
+You can build this module for various platforms to benchmark the LLM functionality independently:
+
+|Platform|Details|
+|---|---|
+|Linux|x86_64: KleidiAI is disabled by default. aarch64: KleidiAI is enabled by default.|
+|Android|Cross-compile for an Android device; ensure the Android NDK path is set and the correct toolchain file is provided. KleidiAI is enabled by default.|
+|macOS|Native build or cross-compilation for a Mac device. KleidiAI and SME kernels can be used if available on the device.|
+
+Currently, this module provides a thin C++ layer as well as JNI bindings for developers targeting Android-based applications. The supported backends are:
+
+|Framework|Dependency|Input modalities supported|Output modalities supported|Neural Network|
+|---|---|---|---|---|
+|llama.cpp|https://github.com/ggml-org/llama.cpp|`image`, `text`|`text`|phi-2, Qwen2-VL-2B-Instruct|
+|onnxruntime-genai|https://github.com/microsoft/onnxruntime-genai|`text`|`text`|phi-4-mini-instruct-onnx|
+|mediapipe|https://github.com/google-ai-edge/mediapipe|`text`|`text`|gemma-2b-it-cpu-int4|
+
+{{% notice %}}
+You can find more information on how to build and use this module [here](https://gitlab.arm.com/kleidi/kleidi-examples/large-language-models/-/blob/main/README.md?ref_type=heads).
+{{% /notice %}}
+
 ## Text-to-Speech
 
 This part of the application pipeline uses the Android Text-to-Speech API along with additional logic to produce smooth, natural speech. In synchronous mode, speech playback begins only after the full LLM response is received. By default, the application operates in asynchronous mode, where speech synthesis starts as soon as a full or partial sentence is ready. Remaining tokens are buffered and processed by the Android Text-to-Speech engine to ensure uninterrupted playback.
+
+You are now familiar with the building blocks of this application and can build them independently for various platforms. In the next step, you build the multi-modal Voice Assistant example, which runs on Android OS.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/4-run.md b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/4-run.md
index 21ef93cb95..6deafb4cee 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/4-run.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/4-run.md
@@ -16,33 +16,51 @@ By default, Android devices ship with developer mode disabled. To enable it, fol
 Once developer mode is enabled, connect your phone to your computer with USB. It should appear as a running device in the top toolbar. Select the device and click **Run** (a small green triangle, as shown below). This transfers the app to your phone and launches it.
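+If the device does not appear in the toolbar, you can check the connection from a terminal before clicking **Run**. This is a quick sanity check using standard `adb` commands:
+
+```bash
+# List the devices visible to adb; the connected phone should be reported
+# with the state "device".
+adb devices
+
+# A state of "unauthorized" means the USB debugging prompt has not been
+# accepted yet: unlock the phone, accept the prompt, and run the command again.
+```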
+In the graphic below, a Google Pixel 8 Pro phone is connected to the USB cable:
-In the graphic below, a Samsung Galaxy Z Flip 6 phone is connected to the USB cable:
 
 ![upload image alt-text#center](upload.png "Upload the Voice App")
-=======
+
 ## Launch the Voice Assistant
 
 The app starts with this welcome screen:
 
-![welcome image alt-text#center](voice_assistant_view1.jpg "Welcome Screen")
+![welcome image alt-text#center](voice_assistant_view1.png "Welcome Screen")
 
 Tap **Press to talk** at the bottom of the screen to begin speaking your request.
 
 ## Voice Assistant controls
 
-### View performance counters
+You can use the application controls to enable extra functionality or gather performance data.
 
-You can toggle performance counters such as:
-- Speech recognition time.
-- LLM encode tokens per second.
-- LLM decode tokens per second.
-- Speech generation time.
+|Button|Control name|Description|
+|---|---|---|
+|1|Performance counters|Performance counters are hidden by default. Click this to show the speech recognition time and the LLM encode and decode rates.|
+|2|Speech generation|Speech generation is disabled by default. Click this to use Android Text-to-Speech and get audible answers.|
+|3|Reset conversation|By default, the application keeps context so you can ask follow-up questions. Click this to reset the Voice Assistant's conversation history.|
 
 Click the icon circled in red in the top left corner to show or hide these metrics:
 
-![performance image alt-text#center](voice_assistant_view2.jpg "Performance Counters")
+![performance image alt-text#center](voice_assistant_view2.png "Performance Counters")
+
+### Multimodal question answering
+
+If you have built the application using the default `llama.cpp` backend, you can also use it in multimodal (image + text) question answering mode.
+
+To do this, click the image button first:
+
+![use image alt-text#center](voice_assistant_multimodal_1.png "Add image button")
+
+This brings up the photos you can choose from:
+
+![choose image alt-text#center](choose_image.png "Choose image from the gallery")
+
+Choose an image and add it to the question for the Voice Assistant:
+
+![add image alt-text#center](add_image.png "Add image to the question")
+
+You can now ask questions related to this image; the large language model uses both the image and the text for multimodal question answering.
-To reset the Voice Assistant's conversation history, click the icon circled in red in the top right:
+![ask question image alt-text#center](voice_assistant_multimodal_2.png "Ask a question about the image")
-![reset image alt-text#center](voice_assistant_view3.jpg "Reset the Voice Assistant's Context")
+Now that you have explored how the Android application is set up and built, you can see in detail how the KleidiAI library is used in the next step.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/5-kleidiai.md b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/5-kleidiai.md
index 31fd09ea46..cd311bf4e1 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/5-kleidiai.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/5-kleidiai.md
@@ -31,4 +31,5 @@ To disable KleidiAI during build:
 KleidiAI simplifies development by abstracting away low-level optimization: developers can write high-level code while the KleidiAI library selects the most efficient implementation at runtime based on the target hardware. This is possible thanks to its deeply optimized micro-kernels tailored for Arm architectures.
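+Because this selection happens at runtime, it can be useful to confirm which Arm features your device actually exposes. One way to check, assuming the device's kernel reports feature flags in `/proc/cpuinfo` (most recent Android devices do), is shown below:
+
+```bash
+# Print the CPU feature flags reported by the connected device.
+adb shell grep -i features /proc/cpuinfo
+
+# Look for "i8mm" (8-bit integer matrix multiplication), which this Learning Path
+# requires; newer devices may also report "sme" or "sme2".
+```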
-As newer versions of the architecture become available, KleidiAI becomes even more powerful: simply updating the library allows applications like the Voice Assistant to take advantage of the latest architectural improvements - such as SME2 — without requiring any code changes. This means better performance on newer devices with no additional effort from developers.
\ No newline at end of file
+As newer versions of the architecture become available, KleidiAI becomes even more powerful: simply updating the library allows applications like the multi-modal Voice Assistant to take advantage of the latest architectural improvements, such as SME2, without requiring any code changes. This means better performance on newer devices with no additional effort from developers.
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/_index.md b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/_index.md
index 1fb425143e..67f8a26d22 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/_index.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/_index.md
@@ -1,19 +1,23 @@
 ---
-title: Accelerate Voice Assistant performance with KleidiAI and SME2
+title: Accelerate multi-modal Voice Assistant performance with KleidiAI and SME2
 
 minutes_to_complete: 30
 
-who_is_this_for: This is an introductory topic for developers who want to accelerate Voice Assistant performance on Android devices using KleidiAI and SME2.
+who_is_this_for: This is an introductory topic for developers who want to explore the pipeline of a multi-modal Voice Assistant application and accelerate its performance on Android devices using KleidiAI and SME2.
 
 learning_objectives:
-    - Compile and run a Voice Assistant Android application.
-    - Optimize performance using KleidiAI and SME2.
+    - Learn about the multi-modal Voice Assistant pipeline and the different components it uses.
+    - Learn about the functionality of the ML components used and how they can be built and benchmarked on various platforms.
+    - Compile and run a multi-modal Voice Assistant example on Android OS.
+    - Optimize the performance of the multi-modal Voice Assistant using KleidiAI and SME2.
 
 prerequisites:
-    - An Android phone that supports the i8mm Arm architecture feature (8-bit integer matrix multiplication). This Learning Path was tested on a Samsung Galaxy Z Flip 6.
+    - An Android phone that supports the i8mm Arm architecture feature (8-bit integer matrix multiplication). This Learning Path was tested on a Google Pixel 8 Pro.
     - A development machine with [Android Studio](https://developer.android.com/studio) installed.
-author: Arnaud de Grandmaison +author: + - Arnaud de Grandmaison + - Nina Drozd skilllevels: Introductory subjects: Performance and Architecture @@ -22,10 +26,11 @@ armips: tools_software_languages: - Java - Kotlin + - C++ operatingsystems: + - Android - Linux - macOS - - Android further_reading: diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/add_image.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/add_image.png new file mode 100644 index 0000000000..b9db5a2421 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/add_image.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/choose_image.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/choose_image.png new file mode 100644 index 0000000000..26dd58ff93 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/choose_image.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload.png index 30d7a4e478..9768c1577b 100644 Binary files a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload.png and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload_old.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload_old.png new file mode 100644 index 0000000000..30d7a4e478 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/upload_old.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_1.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_1.png new file mode 100644 index 0000000000..f00927c744 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_1.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_2.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_2.png new file mode 100644 index 0000000000..6d2bb5f367 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_multimodal_2.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_1.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_1.png new file mode 100644 index 0000000000..dc75319530 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_1.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_2.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_2.png new file mode 100644 index 0000000000..d7fee1b46a Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_use_multimodal_2.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1.png new file mode 
100644 index 0000000000..59fbceb399 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1.jpg b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1_old.jpg similarity index 100% rename from content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1.jpg rename to content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view1_old.jpg diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.jpg b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.jpg deleted file mode 100644 index cd46a52085..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.jpg and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.png b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.png new file mode 100644 index 0000000000..50a479bc68 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view2.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view3.jpg b/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view3.jpg deleted file mode 100644 index 427cfe0ca8..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/voice-assistant/voice_assistant_view3.jpg and /dev/null differ