
Conversation

@DajanaV commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#17143

While looking into ggml-org/llama.cpp#17033 (comment), I found that the warmup in llama-batched-bench trashes the worst-case graph allocation when using the Metal backend, causing extra graph allocations later on. The reason is that the extra FA fleeting memory for the small warmup batch ends up being larger than the memory for the worst-case estimate.

Fix by making the extra size more correlated with the input shapes.
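
For context, here is a minimal C sketch (invented names, not the actual ggml-alloc API) of the worst-case reservation pattern the warmup was defeating: the allocator is sized once from the largest expected graph, and any later graph whose measured size exceeds that reservation forces a reallocation.

```c
// Minimal sketch of worst-case graph reservation (names invented for
// illustration). The allocator is sized from the largest expected graph;
// if a later graph reports a bigger size, it must be reallocated.
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { size_t reserved; } galloc_t;

// Reserve once, using the worst-case (largest batch) graph.
void galloc_reserve(galloc_t * ga, size_t worst_case_size) {
    ga->reserved = worst_case_size;
}

// Later graphs fit as long as their measured size does not exceed the
// reservation. If the extra FA scratch is not correlated with the input
// shapes, a tiny warmup graph can measure *larger* than the worst case
// and take this reallocation path.
bool galloc_alloc_graph(galloc_t * ga, size_t graph_size) {
    if (graph_size > ga->reserved) {
        ga->reserved = graph_size; // reallocate: the situation this PR avoids
        return false;
    }
    return true;
}

int main(void) {
    galloc_t ga = {0};
    galloc_reserve(&ga, 1024);                      // worst-case estimate
    printf("%d\n", galloc_alloc_graph(&ga,  256));  // fits within reservation
    printf("%d\n", galloc_alloc_graph(&ga, 2048));  // would force a realloc
    return 0;
}
```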

make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -c 150792 -npp 8192 -ntg 32 -npl 1,2,4,8,16 -kvu -tgs --no-mmap

master

main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|----|----|---|------|--------|----------|--------|----------|-----|-------|
| 8192 | 32 | 1 | 8224 | 3.406 | 2405.30 | 0.289 | 110.58 | 3.695 | 2225.58 |
| 8192 | 32 | 2 | 16448 | 6.843 | 2394.42 | 0.590 | 108.46 | 7.433 | 2212.94 |
| 8192 | 32 | 4 | 32896 | 14.106 | 2322.90 | 1.238 | 103.37 | 15.345 | 2143.80 |
| 8192 | 32 | 8 | 65792 | 29.769 | 2201.46 | 2.767 | 92.53 | 32.536 | 2022.12 |
| 8192 | 32 | 16 | 131584 | 65.512 | 2000.74 | 6.663 | 76.84 | 72.175 | 1823.13 |

PR

main: n_kv_max = 151040, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|----|----|---|------|--------|----------|--------|----------|-----|-------|
| 8192 | 32 | 1 | 8224 | 3.287 | 2491.90 | 0.289 | 110.62 | 3.577 | 2299.30 |
| 8192 | 32 | 2 | 16448 | 6.646 | 2465.21 | 0.589 | 108.59 | 7.235 | 2273.25 |
| 8192 | 32 | 4 | 32896 | 13.581 | 2412.81 | 1.240 | 103.19 | 14.821 | 2219.51 |
| 8192 | 32 | 8 | 65792 | 28.191 | 2324.72 | 2.751 | 93.04 | 30.942 | 2126.28 |
| 8192 | 32 | 16 | 131584 | 60.293 | 2173.91 | 6.633 | 77.19 | 66.927 | 1966.09 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Metal Flash Attention Memory Allocation Optimization

Overview

This analysis examines PR #157, which addresses memory allocation inconsistencies in the Metal backend's Flash Attention implementation. The changes focus on ensuring predictable memory allocation patterns to prevent graph reallocations during batched inference operations.

Key Findings

Performance Metrics Impact:

  • Highest response time change: _RegexMask constructor with 0.082% increase (0.018 ns absolute change)
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)
  • Tokens per second impact: None - No core tokenization/inference functions were modified

Power Consumption Analysis:

  • All binaries show 0.0% power consumption change
  • build.bin.libllama.so maintains consistent power profile despite code modifications
  • No computational overhead introduced by the memory allocation changes

Code Changes Analysis:
The modifications target three Metal Flash Attention memory allocation functions (a rough sketch follows this list):

  • ggml_metal_op_flash_attn_ext_extra_pad: Always reserves padding space regardless of input alignment
  • ggml_metal_op_flash_attn_ext_extra_blk: Ensures block buffer allocation for all kernel types
  • ggml_metal_op_flash_attn_ext_extra_tmp: Caps temp buffer size at 32 batch elements while maintaining consistent allocation
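
A hedged sketch of the sizing rules described above, using invented names, constants, and element sizes (the real ggml_metal_op_flash_attn_ext_extra_* helpers operate on ggml tensors and kernel-specific block sizes):

```c
// Illustrative only: hypothetical helpers mirroring the three behaviors
// listed above. Constants and signatures are invented for this sketch.
#include <stddef.h>
#include <stdint.h>

enum { KV_PAD = 64, TMP_BATCH_CAP = 32 }; // assumed padding/cap values

// Always reserve the padding region, even when the KV length is already
// aligned, so a given shape always reports the same extra size.
size_t extra_pad(int64_t n_head, size_t row_bytes) {
    return (size_t) n_head * KV_PAD * row_bytes;
}

// Reserve the block buffer for every kernel variant, not just the one that
// happens to be selected for the current batch size.
size_t extra_blk(int64_t n_kv, int64_t n_batch) {
    const int64_t n_blk_kv    = (n_kv    + KV_PAD - 1) / KV_PAD;
    const int64_t n_blk_batch = (n_batch + KV_PAD - 1) / KV_PAD;
    return (size_t) (n_blk_kv * n_blk_batch); // e.g. one byte per block
}

// Temporary accumulator: scale with the input shapes but cap the batch
// dimension, so that a small warmup graph never reports a larger size than
// the worst-case graph used for the initial reservation.
size_t extra_tmp(int64_t n_batch, int64_t n_kv, int64_t n_head) {
    const int64_t n_batch_capped = n_batch < TMP_BATCH_CAP ? n_batch : TMP_BATCH_CAP;
    return (size_t) (n_batch_capped * n_head * n_kv) * sizeof(float);
}
```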

Flame Graph & CFG Analysis:

  • The _RegexMask constructor shows identical assembly code between versions
  • Performance difference (0.01 ns) attributed to external factors like memory layout or cache alignment
  • No structural changes in control flow or instruction sequences

GitHub Code Review Insights:

  • Addresses root cause of graph reallocation issues during warmup phases
  • Benchmark results show 3-8% throughput improvements in batched inference scenarios
  • Trades minimal memory overhead for allocation consistency and performance stability

Impact Assessment:
This optimization affects Metal backend acceleration without impacting core inference performance metrics. The changes improve batched processing stability while maintaining identical computational characteristics for primary inference functions.

@DajanaV force-pushed the main branch 12 times, most recently from 87bfdb3 to a14857a on November 11, 2025 at 19:07