I profiled the end-to-end latency of a Llama model with all attention layers set to sliding window attention (SWA). I experimented with different input and output sequence lengths, expecting that for a fixed large output length (e.g., 8k), increasing the input sequence length would leave overall latency roughly unchanged. This expectation arises because decoding dominates at an 8k output length, and SWA keeps the per-step decoding cost constant, so total decoding time depends only on the output length.
I found that this assumption holds when batch_size=1. However, when I increase the batch size, latency grows significantly with the input length, as shown in the attached figure (batch_size=64).
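For reference, the expectation above comes from a back-of-envelope model of per-step decode work under SWA. This is a sketch under my own assumptions, not vLLM's actual kernel behavior; the function name and the 4k window value are illustrative only:

```python
# Toy cost model (my assumption for illustration, not vLLM internals):
# at each decode step, every sequence in the batch attends to at most
# `window` cached tokens, so aggregate attention work per step should
# grow with batch size but stay flat once the context exceeds the
# window, regardless of the prompt length.

def decode_step_kv_tokens(batch_size: int, prompt_len: int,
                          step: int, window: int) -> int:
    """Total KV-cache tokens attended across the batch at one decode step."""
    context = prompt_len + step  # tokens already cached per sequence
    return batch_size * min(context, window)

# With a 4k window and batch_size=64, an 8k prompt and a 64k prompt
# should cost the same per decode step:
assert decode_step_kv_tokens(64, 8192, 0, 4096) == 64 * 4096
assert decode_step_kv_tokens(64, 65536, 0, 4096) == 64 * 4096
```

If this model held, latency at a fixed output length should be roughly flat in input length for any batch size, which matches what I see at batch_size=1 but not at batch_size=64.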
I have attached my profiling script below. I'm not sure whether I missed something in the profiling setup or whether support for SWA is still limited.
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Report of performance regression
No response
Misc discussion on performance
No response