
Enhance FlashAttention and FlexAttention benchmarks to evaluate Decode and Append stages #3617

Open
mfrancepillois opened this issue Mar 5, 2025 · 0 comments


mfrancepillois commented Mar 5, 2025

The FlashAttention and FlexAttention benchmarks currently assess only the performance of the FA prefill stage.
LLM inference workflows generally include three stages (the sketch after this list illustrates how they differ in attention input shapes):

  • Prefill stage: the prompt tokens are processed in parallel.
  • Decode stage: output tokens are generated autoregressively, one at a time, with each new token attending to the full KV cache.
  • Append stage: for example, in a multi-round chat, the first round has already been completed with the prefill and decode stages, so a KV cache already exists; for the second round, the new prompt has to be processed against that existing cache.
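
A minimal sketch of how the three stages map to attention input shapes; the concrete sizes, the `attn` helper, and the device selection are illustrative assumptions, not the standard shapes referenced in this issue:

```python
import torch
import torch.nn.functional as F

# "xpu" is assumed for the Intel GPU backend; fall back to CPU so the sketch stays runnable.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32
batch, heads, head_dim = 1, 32, 128

def attn(q_len, kv_len):
    # (batch, heads, seq_len, head_dim) layout, as expected by scaled_dot_product_attention.
    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=dtype)
    k = torch.randn(batch, heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn(batch, heads, kv_len, head_dim, device=device, dtype=dtype)
    return F.scaled_dot_product_attention(q, k, v)

# Prefill: the whole prompt is processed at once, so q_len == kv_len.
prefill_out = attn(q_len=1024, kv_len=1024)

# Decode: a single new token attends to the full KV cache, so q_len == 1.
decode_out = attn(q_len=1, kv_len=1024)

# Append: a new prompt chunk attends to the cached tokens plus itself, so 1 < q_len < kv_len.
append_out = attn(q_len=128, kv_len=1024 + 128)
```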

The FlashAttention and FlexAttention benchmarks should be extended to also evaluate FA performance for the Decode and Append stages.
Standard shapes for these two stages can be found in https://jira.devtools.intel.com/browse/TRITONXPU-172
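
To give a concrete idea of what the extension could look like, here is a hedged sketch of a shape sweep timed with `triton.testing.do_bench`; the `SHAPES` entries are placeholders, since the standard shapes live in the linked ticket and are not reproduced here:

```python
import torch
import torch.nn.functional as F
import triton

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

# (stage, batch, heads, q_len, kv_len, head_dim) -- illustrative values only.
SHAPES = [
    ("decode", 16, 32, 1, 2048, 128),
    ("append", 4, 32, 256, 2048 + 256, 128),
]

for stage, b, h, q_len, kv_len, d in SHAPES:
    q = torch.randn(b, h, q_len, d, device=device, dtype=dtype)
    k = torch.randn(b, h, kv_len, d, device=device, dtype=dtype)
    v = torch.randn(b, h, kv_len, d, device=device, dtype=dtype)
    ms = triton.testing.do_bench(lambda: F.scaled_dot_product_attention(q, k, v))
    # ~4 * b * h * q_len * kv_len * d FLOPs: 2 for Q @ K^T plus 2 for P @ V.
    tflops = 4 * b * h * q_len * kv_len * d / (ms * 1e-3) / 1e12
    print(f"{stage}: {ms:.3f} ms, {tflops:.2f} TFLOP/s")
```

The same sweep could cover FlexAttention by swapping the timed callable for a `flex_attention` call.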
