
Enhance FlashAttention and FlexAttention benchmarks with GQA #3615

Open
mfrancepillois opened this issue Mar 5, 2025 · 1 comment

mfrancepillois (Contributor) commented Mar 5, 2025

The FlashAttention and FlexAttention benchmarks currently only assess the performance of MHA (Multi-Head Attention).
GQA (Grouped-Query Attention) is a technique widely used in LLMs (e.g. Llama-3.1) to improve performance by reducing memory usage (a smaller KV cache).
The FlashAttention and FlexAttention benchmarks should be extended to evaluate the performance of these attention kernels with GQA.
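For reference, a minimal sketch of what a GQA configuration for such a benchmark could look like, assuming PyTorch 2.5+ and the torch.nn.attention.flex_attention API; the head counts, sequence length, and device/dtype selection below are illustrative (roughly following the Llama-3.1-8B layout of 32 query heads sharing 8 KV heads), not the benchmark's actual parameters:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative GQA shapes, roughly following Llama-3.1-8B:
# 32 query heads sharing 8 KV heads (4 query heads per group).
batch, seq_len, head_dim = 4, 1024, 128
num_q_heads, num_kv_heads = 32, 8

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

q = torch.randn(batch, num_q_heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim, device=device, dtype=dtype)

# enable_gqa=True broadcasts the KV heads across the query-head groups
# instead of requiring num_q_heads == num_kv_heads. A real benchmark would
# typically wrap the call in torch.compile and time it over many iterations.
out = flex_attention(q, k, v, enable_gqa=True)
print(out.shape)  # (4, 32, 1024, 128)
```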

@mfrancepillois changed the title from "Enhance FlashAttention and FlexAttention benchmark with GQA" to "Enhance FlashAttention and FlexAttention benchmarks with GQA" on Mar 5, 2025
@vlad-penkin added this to the "4. [Performance] Core" milestone on Mar 5, 2025
liangan1 commented Mar 5, 2025

It is worth highlighting that for DeepSeek-V3, the MHA computation becomes MQA with 128 heads and head_dim=512; such a large head dimension may pose new challenges.
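
To make that shape concrete, a small sketch of the MQA configuration described above (128 query heads sharing a single KV head, head_dim=512), expressed here via scaled_dot_product_attention's enable_gqa path (PyTorch 2.5+); the batch and sequence sizes are illustrative, and the actual DeepSeek-V3 MLA layout is more involved than this sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative MQA shape from the comment above: 128 query heads sharing a
# single KV head with head_dim=512. Batch and sequence sizes are made up.
batch, seq_len = 1, 512
num_q_heads, num_kv_heads, head_dim = 128, 1, 512

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# enable_gqa=True broadcasts the single KV head across all 128 query heads.
# head_dim=512 exceeds what many fused attention kernels support, so backends
# may fall back to the unfused math path -- exactly the kind of challenge
# mentioned above.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # (1, 128, 512, 512)
```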
