FlashAttention and FlexAttention benchmarks currently assess the performance of the FA prefill stage.
LLM inference workflows generally include three stages:
- Prefill stage, which computes the prompt tokens in parallel.
- Decode stage, which autoregressively generates tokens one by one.
- Append stage: for example, in a multi-round chat, the first round has already gone through its prefill and decode stages, so some KV-cache tokens already exist; for the second round, the new prompt tokens must be processed on top of that existing cache (see the shape sketch after this list).
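To make the difference between the stages concrete, below is a minimal sketch of the query/key/value lengths each stage works with. This is not taken from the existing benchmarks; the batch size, head count, head_dim, and token counts are illustrative only, and causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

batch, heads, head_dim = 1, 32, 128
cached = 1024   # KV-cache tokens left over from earlier rounds
new = 128       # new prompt tokens in the current round

def attn(q_len, kv_len):
    # Causal masking is omitted to keep the sketch short.
    q = torch.randn(batch, heads, q_len, head_dim)
    k = torch.randn(batch, heads, kv_len, head_dim)
    v = torch.randn(batch, heads, kv_len, head_dim)
    return F.scaled_dot_product_attention(q, k, v)

prefill = attn(q_len=new, kv_len=new)           # q_len == kv_len
decode  = attn(q_len=1,   kv_len=cached + 1)    # a single query token
append  = attn(q_len=new, kv_len=cached + new)  # short q against a long kv
```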
The FlashAttention and FlexAttention benchmarks should be extended to also evaluate the performance of FA for the Decode and Append stages.
Standard shapes for these two stages can be found in https://jira.devtools.intel.com/browse/TRITONXPU-172
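As a rough illustration of how the new cases might be parametrized, here is a minimal sketch that sweeps placeholder `(q_len, kv_len)` pairs with `triton.testing.do_bench`. The shape list, batch/head/head_dim values, the `"xpu"` device, and the use of eager SDPA as a stand-in for the Triton FA kernel are all assumptions; the actual shapes should come from the ticket above.

```python
import torch
import torch.nn.functional as F
import triton

# (stage, q_len, kv_len) -- placeholder pairs, not the TRITONXPU-172 shapes.
CASES = [
    ("decode", 1, 2048),
    ("decode", 1, 8192),
    ("append", 512, 4096),
]

def bench_case(q_len, kv_len, batch=1, heads=32, head_dim=128,
               dtype=torch.float16, device="xpu"):  # device is an assumption
    q = torch.randn(batch, heads, q_len, head_dim, dtype=dtype, device=device)
    k = torch.randn(batch, heads, kv_len, head_dim, dtype=dtype, device=device)
    v = torch.randn(batch, heads, kv_len, head_dim, dtype=dtype, device=device)
    # Eager SDPA stands in for the Triton FA kernel being benchmarked.
    return triton.testing.do_bench(lambda: F.scaled_dot_product_attention(q, k, v))

for stage, q_len, kv_len in CASES:
    ms = bench_case(q_len, kv_len)
    print(f"{stage}: q_len={q_len}, kv_len={kv_len}, {ms:.3f} ms")
```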