
Conversation

@DajanaV commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#17143

While looking into ggml-org/llama.cpp#17033 (comment), I found that the warmup in llama-batched-bench trashes the worst-case graph allocation when using the Metal backend, causing extra graph allocations later on. The reason is that the extra FA fleeting memory for the small warmup batch ends up being larger than the memory for the worst-case estimate.

Fix by making the extra size more correlated with the input shapes.
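
For context, here is a minimal C sketch (invented names, not the actual ggml-alloc API) of the worst-case reservation pattern the warmup was defeating: the allocator is sized once from the largest expected graph, and any later graph whose measured size exceeds that reservation forces a reallocation.

```c
// Minimal sketch of worst-case graph reservation (names invented for
// illustration). The allocator is sized from the largest expected graph;
// if a later graph reports a bigger size, it must be reallocated.
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { size_t reserved; } galloc_t;

// Reserve once, using the worst-case (largest batch) graph.
void galloc_reserve(galloc_t * ga, size_t worst_case_size) {
    ga->reserved = worst_case_size;
}

// Later graphs fit as long as their measured size does not exceed the
// reservation. If the extra FA scratch is not correlated with the input
// shapes, a tiny warmup graph can measure *larger* than the worst case
// and take this reallocation path.
bool galloc_alloc_graph(galloc_t * ga, size_t graph_size) {
    if (graph_size > ga->reserved) {
        ga->reserved = graph_size; // reallocate: the situation this PR avoids
        return false;
    }
    return true;
}

int main(void) {
    galloc_t ga = {0};
    galloc_reserve(&ga, 1024);                      // worst-case estimate
    printf("%d\n", galloc_alloc_graph(&ga,  256));  // fits within reservation
    printf("%d\n", galloc_alloc_graph(&ga, 2048));  // would force a realloc
    return 0;
}
```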

make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -c 150792 -npp 8192 -ntg 32 -npl 1,2,4,8,16 -kvu -tgs --no-mmap

master

main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|----|----|---|------|--------|----------|--------|----------|-----|-------|
| 8192 | 32 | 1 | 8224 | 3.406 | 2405.30 | 0.289 | 110.58 | 3.695 | 2225.58 |
| 8192 | 32 | 2 | 16448 | 6.843 | 2394.42 | 0.590 | 108.46 | 7.433 | 2212.94 |
| 8192 | 32 | 4 | 32896 | 14.106 | 2322.90 | 1.238 | 103.37 | 15.345 | 2143.80 |
| 8192 | 32 | 8 | 65792 | 29.769 | 2201.46 | 2.767 | 92.53 | 32.536 | 2022.12 |
| 8192 | 32 | 16 | 131584 | 65.512 | 2000.74 | 6.663 | 76.84 | 72.175 | 1823.13 |

PR

main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|----|----|---|------|--------|----------|--------|----------|-----|-------|
| 8192 | 32 | 1 | 8224 | 3.287 | 2491.90 | 0.289 | 110.62 | 3.577 | 2299.30 |
| 8192 | 32 | 2 | 16448 | 6.646 | 2465.21 | 0.589 | 108.59 | 7.235 | 2273.25 |
| 8192 | 32 | 4 | 32896 | 13.581 | 2412.81 | 1.240 | 103.19 | 14.821 | 2219.51 |
| 8192 | 32 | 8 | 65792 | 28.191 | 2324.72 | 2.751 | 93.04 | 30.942 | 2126.28 |
| 8192 | 32 | 16 | 131584 | 60.293 | 2173.91 | 6.633 | 77.19 | 66.927 | 1966.09 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Metal Flash Attention Memory Allocation Optimization

Overview

This analysis examines PR #157, which addresses memory allocation inconsistencies in the Metal backend's Flash Attention implementation. The changes focus on ensuring predictable memory allocation patterns to prevent graph reallocations during batched inference operations.

Key Findings

Performance Metrics Impact:

  • Highest response time change: _RegexMask constructor with 0.082% increase (0.018 ns absolute change)
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)
  • Tokens per second impact: None - No core tokenization/inference functions were modified

Power Consumption Analysis:

  • All binaries show 0.0% power consumption change
  • build.bin.libllama.so maintains consistent power profile despite code modifications
  • No computational overhead introduced by the memory allocation changes

Code Changes Analysis:
The modifications target three Metal Flash Attention memory allocation functions (a rough sketch follows this list):

  • ggml_metal_op_flash_attn_ext_extra_pad: Always reserves padding space regardless of input alignment
  • ggml_metal_op_flash_attn_ext_extra_blk: Ensures block buffer allocation for all kernel types
  • ggml_metal_op_flash_attn_ext_extra_tmp: Caps temp buffer size at 32 batch elements while maintaining consistent allocation
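
A hedged sketch of the sizing rules described above, using invented names, constants, and element sizes (the real ggml_metal_op_flash_attn_ext_extra_* helpers operate on ggml tensors and kernel-specific block sizes):

```c
// Illustrative only: hypothetical helpers mirroring the three behaviors
// listed above. Constants and signatures are invented for this sketch.
#include <stddef.h>
#include <stdint.h>

enum { KV_PAD = 64, TMP_BATCH_CAP = 32 }; // assumed padding/cap values

// Always reserve the padding region, even when the KV length is already
// aligned, so a given shape always reports the same extra size.
size_t extra_pad(int64_t n_head, size_t row_bytes) {
    return (size_t) n_head * KV_PAD * row_bytes;
}

// Reserve the block buffer for every kernel variant, not just the one that
// happens to be selected for the current batch size.
size_t extra_blk(int64_t n_kv, int64_t n_batch) {
    const int64_t n_blk_kv    = (n_kv    + KV_PAD - 1) / KV_PAD;
    const int64_t n_blk_batch = (n_batch + KV_PAD - 1) / KV_PAD;
    return (size_t) (n_blk_kv * n_blk_batch); // e.g. one byte per block
}

// Temporary accumulator: scale with the input shapes but cap the batch
// dimension, so that a small warmup graph never reports a larger size than
// the worst-case graph used for the initial reservation.
size_t extra_tmp(int64_t n_batch, int64_t n_kv, int64_t n_head) {
    const int64_t n_batch_capped = n_batch < TMP_BATCH_CAP ? n_batch : TMP_BATCH_CAP;
    return (size_t) (n_batch_capped * n_head * n_kv) * sizeof(float);
}
```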

Flame Graph & CFG Analysis:

  • The _RegexMask constructor shows identical assembly code between versions
  • Performance difference (0.01 ns) attributed to external factors like memory layout or cache alignment
  • No structural changes in control flow or instruction sequences

GitHub Code Review Insights:

  • Addresses root cause of graph reallocation issues during warmup phases
  • Benchmark results show 3-8% throughput improvements in batched inference scenarios
  • Trades minimal memory overhead for allocation consistency and performance stability

Impact Assessment:
This optimization affects Metal backend acceleration without impacting core inference performance metrics. The changes improve batched processing stability while maintaining identical computational characteristics for primary inference functions.

@DajanaV force-pushed the main branch 12 times, most recently from 87bfdb3 to a14857a on November 11, 2025 at 19:07