
Enhance FlashAttention and FlexAttention benchmarks with GQA #3615

Open
mfrancepillois opened this issue Mar 5, 2025 · 1 comment

mfrancepillois (Contributor) commented Mar 5, 2025

The FlashAttention and FlexAttention benchmarks currently only assess the performance of MHA (Multi-Head Attention).
GQA (Grouped-Query Attention) is a technique widely used in LLMs (e.g. Llama-3.1) to improve performance by reducing memory usage (a smaller KV cache).
The FlashAttention and FlexAttention benchmarks should be extended to evaluate the performance of these attention kernels with GQA.
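For reference, a minimal sketch of what a GQA configuration for such a benchmark could look like, assuming PyTorch 2.5+ and the torch.nn.attention.flex_attention API; the head counts, sequence length, and device/dtype selection below are illustrative (roughly following the Llama-3.1-8B layout of 32 query heads sharing 8 KV heads), not the benchmark's actual parameters:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Illustrative GQA shapes, roughly following Llama-3.1-8B:
# 32 query heads sharing 8 KV heads (4 query heads per group).
batch, seq_len, head_dim = 4, 1024, 128
num_q_heads, num_kv_heads = 32, 8

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

q = torch.randn(batch, num_q_heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim, device=device, dtype=dtype)

# enable_gqa=True broadcasts the KV heads across the query-head groups
# instead of requiring num_q_heads == num_kv_heads. A real benchmark would
# typically wrap the call in torch.compile and time it over many iterations.
out = flex_attention(q, k, v, enable_gqa=True)
print(out.shape)  # (4, 32, 1024, 128)
```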

@mfrancepillois changed the title from "Enhance FlashAttention and FlexAttention benchmark with GQA" to "Enhance FlashAttention and FlexAttention benchmarks with GQA" on Mar 5, 2025
@vlad-penkin added this to the "4. [Performance] Core" milestone on Mar 5, 2025
liangan1 commented Mar 5, 2025

It is worth highlighting that for DeepSeek-V3, the MHA computation becomes MQA with 128 heads and head_dim=512; such a large head dimension may pose new challenges.
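
To make that shape concrete, a small sketch of the MQA configuration described above (128 query heads sharing a single KV head, head_dim=512), expressed here via scaled_dot_product_attention's enable_gqa path (PyTorch 2.5+); the batch and sequence sizes are illustrative, and the actual DeepSeek-V3 MLA layout is more involved than this sketch:

```python
import torch
import torch.nn.functional as F

# Illustrative MQA shape from the comment above: 128 query heads sharing a
# single KV head with head_dim=512. Batch and sequence sizes are made up.
batch, seq_len = 1, 512
num_q_heads, num_kv_heads, head_dim = 128, 1, 512

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# enable_gqa=True broadcasts the single KV head across all 128 query heads.
# head_dim=512 exceeds what many fused attention kernels support, so backends
# may fall back to the unfused math path -- exactly the kind of challenge
# mentioned above.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # (1, 128, 512, 512)
```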
