The FlashAttention and FlexAttention benchmarks currently assess only the performance of MHA (Multi-Head Attention).
GQA (Grouped-Query Attention) is a technique widely used in LLMs (e.g. llama-3.1) to improve performance by reducing KV-cache memory usage.
The FlashAttention and FlexAttention benchmarks should be extended to evaluate attention performance with GQA; a shape sketch is given below.
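For reference, a minimal sketch of the kind of GQA configuration such a benchmark would need. The head counts are llama-3.1-style values chosen only for illustration, and it assumes PyTorch's `scaled_dot_product_attention` with its `enable_gqa` flag; the actual benchmark harness may call the kernels differently.

```python
import torch
import torch.nn.functional as F

# Illustrative GQA shapes (not from the issue): 32 query heads share 8 KV heads,
# i.e. a group size of 4, so the KV tensors are 4x smaller than in MHA.
batch, seq_len, head_dim = 4, 4096, 128
n_q_heads, n_kv_heads = 32, 8

q = torch.randn(batch, n_q_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)

# enable_gqa broadcasts the n_kv_heads K/V tensors across groups of query heads,
# so the benchmark exercises the GQA path instead of plain MHA.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
```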
mfrancepillois changed the title from "Enhance FlashAttention and FlexAttention benchmark with GQA" to "Enhance FlashAttention and FlexAttention benchmarks with GQA" on Mar 5, 2025.
Need to highlight that for DeepSeek-v3 the MHA computation becomes MQA with 128 heads and head_dim=512; the large head dimension may pose new challenges.
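A minimal sketch of that MQA case, assuming FlexAttention's `enable_gqa` flag and illustrative batch/sequence sizes (only the 128 query heads, single KV head, and head_dim=512 come from the comment above):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# DeepSeek-v3-style MQA case: 128 query heads share a single KV head,
# with a large head_dim of 512. Batch/seq values are placeholders.
batch, seq_len = 1, 2048
n_q_heads, n_kv_heads, head_dim = 128, 1, 512

q = torch.randn(batch, n_q_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)

# enable_gqa broadcasts the single KV head across all 128 query heads;
# the 512-wide head_dim is what stresses register/shared-memory budgets in the kernels.
out = flex_attention(q, k, v, enable_gqa=True)
```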