llama-server crashes with vk::DeviceLostError during the first llama_decode call after multimodal image processing (mmproj) on Mali-G720 MC10 using the panvk Vulkan driver. The vision encoder (mmproj) completes successfully on CPU (--no-mmproj-offload), but the subsequent LLM decode step on GPU triggers the crash. Pure text models (Qwen3-4B Q4_K_M) load and run on GPU without crashing, confirming the issue is specific to the multimodal decode path.
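For reference, an invocation of the kind that triggers the crash might look like the following sketch. Only `--no-mmproj-offload` is taken from the report above; the model filenames are placeholders and `-ngl 99` is an assumed full-offload setting:

```shell
# Hypothetical reproduction sketch: model paths are placeholders.
# --no-mmproj-offload keeps the vision encoder (mmproj) on the CPU, as
# described above; -ngl 99 offloads all LLM layers to the panvk Vulkan device.
./llama-server \
    -m models/vl-model.gguf \
    --mmproj models/vl-model-mmproj.gguf \
    --no-mmproj-offload \
    -ngl 99
# Sending a chat request with an attached image then crashes the first
# llama_decode on the GPU with vk::DeviceLostError.
```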
## Hardware & Software
| Component | Details |
| --- | --- |
| Device | Radxa Orion O6n |
| SoC | CIX P1 |
| GPU | Mali-G720 MC10 |
| RAM | 48 GB (UMA – shared with GPU) |
| OS | Debian 12 (aarch64) |
| Mesa | 26.0.0-1sky1.2 (Sky1-Linux apt repo) |
| Vulkan driver | panvk (DRIVER_ID_MESA_PANVK) |
| Vulkan API | 1.4.335 |
| llama.cpp | build 8208 (b5ed0e058), GNU 15.2.0, aarch64 |
| llama.cpp build flags | -DGGML_VULKAN=ON -DGGML_NATIVE=ON |
The pure text model (Qwen3-4B) running on GPU confirms that panvk and llama.cpp are correctly built and that basic GPU inference works. The crash is specific to the multimodal decode path – specifically the first llama_decode call after image token injection.
## Hypothesis
The image tokens (456 tokens from the vision encoder) are injected into the KV cache and the first decode batch contains 3 tokens (batch.n_tokens = 3 in logs). This mixed prompt — partly processed on CPU (vision), partly on GPU (LLM) — may trigger a memory barrier or synchronization issue in panvk when the GPU begins processing the combined context. This is potentially related to the WLS race condition fixed in:
Mesa commit 31525ee ("panvk/csf: serialize WLS dispatches to prevent pipelining race")
However, the crash here is DeviceLostError / waitForFences, not TRANSLATION_FAULT, suggesting a different (possibly related) synchronization failure in the CSF command stream when handling large batches after cross-backend (CPU→GPU) token handoff.
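Kernel-side fault information might help distinguish the two failure modes. Assuming the panthor DRM driver backs panvk on this CSF GPU, one way to capture any fault messages logged around the time of the DeviceLostError (commands are a suggestion, not from the original report):

```shell
# Check the kernel log for panthor/GPU fault messages emitted around the
# time of the DeviceLostError (panthor is the DRM driver behind panvk on
# CSF-based Mali GPUs such as the G720).
sudo dmesg --ctime | grep -iE 'panthor|gpu fault|csf'
```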