
Enhance FlashAttention and FlexAttention benchmarks to evaluate Decode and Append stages #3617

Open
mfrancepillois opened this issue Mar 5, 2025 · 0 comments


mfrancepillois commented Mar 5, 2025

The FlashAttention and FlexAttention benchmarks currently assess only the performance of the FA prefill stage.
LLM inference workflows generally include three stages (the sketch after this list illustrates how they differ in attention input shapes):

  • Prefill stage: the prompt tokens are processed in parallel.
  • Decode stage: output tokens are generated autoregressively, one at a time, with each new token attending to the full KV cache.
  • Append stage: for example, in a multi-round chat, the first round has already been completed with the prefill and decode stages, so a KV cache already exists; for the second round, the new prompt has to be processed against that existing cache.
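
A minimal sketch of how the three stages map to attention input shapes; the concrete sizes, the `attn` helper, and the device selection are illustrative assumptions, not the standard shapes referenced in this issue:

```python
import torch
import torch.nn.functional as F

# "xpu" is assumed for the Intel GPU backend; fall back to CPU so the sketch stays runnable.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32
batch, heads, head_dim = 1, 32, 128

def attn(q_len, kv_len):
    # (batch, heads, seq_len, head_dim) layout, as expected by scaled_dot_product_attention.
    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=dtype)
    k = torch.randn(batch, heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn(batch, heads, kv_len, head_dim, device=device, dtype=dtype)
    return F.scaled_dot_product_attention(q, k, v)

# Prefill: the whole prompt is processed at once, so q_len == kv_len.
prefill_out = attn(q_len=1024, kv_len=1024)

# Decode: a single new token attends to the full KV cache, so q_len == 1.
decode_out = attn(q_len=1, kv_len=1024)

# Append: a new prompt chunk attends to the cached tokens plus itself, so 1 < q_len < kv_len.
append_out = attn(q_len=128, kv_len=1024 + 128)
```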

The FlashAttention and FlexAttention benchmarks should be extended to also evaluate FA performance for the Decode and Append stages.
Standard shapes for these two stages can be found in https://jira.devtools.intel.com/browse/TRITONXPU-172
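
To give a concrete idea of what the extension could look like, here is a hedged sketch of a shape sweep timed with `triton.testing.do_bench`; the `SHAPES` entries are placeholders, since the standard shapes live in the linked ticket and are not reproduced here:

```python
import torch
import torch.nn.functional as F
import triton

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

# (stage, batch, heads, q_len, kv_len, head_dim) -- illustrative values only.
SHAPES = [
    ("decode", 16, 32, 1, 2048, 128),
    ("append", 4, 32, 256, 2048 + 256, 128),
]

for stage, b, h, q_len, kv_len, d in SHAPES:
    q = torch.randn(b, h, q_len, d, device=device, dtype=dtype)
    k = torch.randn(b, h, kv_len, d, device=device, dtype=dtype)
    v = torch.randn(b, h, kv_len, d, device=device, dtype=dtype)
    ms = triton.testing.do_bench(lambda: F.scaled_dot_product_attention(q, k, v))
    # ~4 * b * h * q_len * kv_len * d FLOPs: 2 for Q @ K^T plus 2 for P @ V.
    tflops = 4 * b * h * q_len * kv_len * d / (ms * 1e-3) / 1e12
    print(f"{stage}: {ms:.3f} ms, {tflops:.2f} TFLOP/s")
```

The same sweep could cover FlexAttention by swapping the timed callable for a `flex_attention` call.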
