I profiled the end-to-end latency of a Llama model with all attention layers set to sliding window attention (SWA). I experimented with different input and output sequence lengths, expecting that for a fixed large output length (e.g., 8k), increasing the input sequence length would leave overall latency roughly unchanged. This expectation arises because decoding dominates at an 8k output length, and SWA keeps the per-step decoding cost constant, so total decoding time depends only on the output length.
I found that this assumption holds when batch_size=1. However, when I increase the batch size, latency grows significantly with the input length, as shown in the attached figure (batch_size=64).
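For reference, the expectation above comes from a back-of-envelope model of per-step decode work under SWA. This is a sketch under my own assumptions, not vLLM's actual kernel behavior; the function name and the 4k window value are illustrative only:

```python
# Toy cost model (my assumption for illustration, not vLLM internals):
# at each decode step, every sequence in the batch attends to at most
# `window` cached tokens, so aggregate attention work per step should
# grow with batch size but stay flat once the context exceeds the
# window, regardless of the prompt length.

def decode_step_kv_tokens(batch_size: int, prompt_len: int,
                          step: int, window: int) -> int:
    """Total KV-cache tokens attended across the batch at one decode step."""
    context = prompt_len + step  # tokens already cached per sequence
    return batch_size * min(context, window)

# With a 4k window and batch_size=64, an 8k prompt and a 64k prompt
# should cost the same per decode step:
assert decode_step_kv_tokens(64, 8192, 0, 4096) == 64 * 4096
assert decode_step_kv_tokens(64, 65536, 0, 4096) == 64 * 4096
```

If this model held, latency at a fixed output length should be roughly flat in input length for any batch size, which matches what I see at batch_size=1 but not at batch_size=64.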
I have attached my profiling script below. I'm not sure whether I missed something in the profiling setup or whether support for SWA is still limited.
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Report of performance regression
No response
Misc discussion on performance
No response