
Conversation

ggerganov (Member) commented on Nov 10, 2025

While looking into #17033 (comment), I found that the warmup in llama-batched-bench trashes the worst-case graph allocation when using the Metal backend, causing extra graph allocations later on. The reason is that the extra FA fleeting memory for the small warmup batch ends up being larger than the memory for the worst-case estimate.

Fix by making the extra size more correlated with the input shapes.
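Below is a minimal sketch of the idea, with a hypothetical function name and a made-up sizing formula (this is not the actual ggml-metal code): if the extra FA ("fleeting") workspace is sized from the ubatch input shapes, the size computed for the worst-case reservation graph becomes an upper bound for any smaller batch, so the warmup graph can no longer exceed the reserved buffer.

```c
// Illustrative sketch only: fa_extra_size() and its formula are assumptions
// for this example, not the actual ggml-metal implementation.
#include <stddef.h>
#include <stdint.h>

// Size the fleeting FA workspace from the ubatch shapes so that it grows
// monotonically with them: the worst-case graph (max n_batch, max n_kv) then
// bounds every later graph, including the tiny warmup batch.
static size_t fa_extra_size(int64_t n_head, int64_t n_batch, int64_t n_kv, int64_t head_dim) {
    // per (head, row) scratch for partial softmax/accumulator results
    const size_t per_row = (size_t)(n_kv + head_dim) * sizeof(float);
    return (size_t)n_head * (size_t)n_batch * per_row;
}
```

With a shape-correlated size like this, the warmup graph's extra allocation stays below the worst-case reservation, avoiding the reallocation churn observed on master.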

```
make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -c 150792 -npp 8192 -ntg 32 -npl 1,2,4,8,16 -kvu -tgs --no-mmap
```

master

```
main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 8192 | 32 | 1 | 8224 | 3.406 | 2405.30 | 0.289 | 110.58 | 3.695 | 2225.58 |
| 8192 | 32 | 2 | 16448 | 6.843 | 2394.42 | 0.590 | 108.46 | 7.433 | 2212.94 |
| 8192 | 32 | 4 | 32896 | 14.106 | 2322.90 | 1.238 | 103.37 | 15.345 | 2143.80 |
| 8192 | 32 | 8 | 65792 | 29.769 | 2201.46 | 2.767 | 92.53 | 32.536 | 2022.12 |
| 8192 | 32 | 16 | 131584 | 65.512 | 2000.74 | 6.663 | 76.84 | 72.175 | 1823.13 |

PR

```
main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 8192 | 32 | 1 | 8224 | 3.287 | 2491.90 | 0.289 | 110.62 | 3.577 | 2299.30 |
| 8192 | 32 | 2 | 16448 | 6.646 | 2465.21 | 0.589 | 108.59 | 7.235 | 2273.25 |
| 8192 | 32 | 4 | 32896 | 13.581 | 2412.81 | 1.240 | 103.19 | 14.821 | 2219.51 |
| 8192 | 32 | 8 | 65792 | 28.191 | 2324.72 | 2.751 | 93.04 | 30.942 | 2126.28 |
| 8192 | 32 | 16 | 131584 | 60.293 | 2173.91 | 6.633 | 77.19 | 66.927 | 1966.09 |

github-actions bot added the `ggml` (changes relating to the ggml tensor library for machine learning) and `Apple Metal` labels on Nov 10, 2025