[Performance]: Weird Sliding Window Attention Profiling Results #12616

Open
tilmto opened this issue Jan 31, 2025 · 0 comments
Labels
performance Performance-related issues

Comments

tilmto commented Jan 31, 2025

Proposal to improve performance

Hi,

I profiled the end-to-end latency of a Llama model with all attention layers set to sliding window attention (SWA). I experimented with different input and output sequence lengths, expecting that for a fixed large output length (e.g., 8k), increasing the input sequence length would result in comparable overall latency. This expectation arises because decoding is the bottleneck at an 8k output length, and with SWA each decode step attends to at most a window's worth of KV cache, so decoding time should stay roughly constant for a fixed output length.
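
To make that expectation concrete, here is a minimal back-of-envelope sketch of the reasoning (not taken from any profiler output; the window size of 4096 below is just a placeholder):

def attended_kv(position, window):
    # Number of cached KV entries a token at `position` (0-indexed) attends to
    # under sliding window attention: everything so far, capped at the window size.
    return min(position + 1, window)

def total_decode_attention(input_len, output_len, window):
    # Sum of attended KV entries over all decode steps; a rough proxy for
    # decode-side attention work.
    return sum(attended_kv(input_len + i, window) for i in range(output_len))

window = 4096  # placeholder, not the actual model's window size
for input_len in (1024, 4096, 8192, 32768):
    print(input_len, total_decode_attention(input_len, 8192, window))
    # Once input_len >= window, the total is identical regardless of input_len,
    # which is why I expected comparable end-to-end latency.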

I found that this assumption holds when batch_size=1. However, when I increase the batch size, latency grows significantly with the input length, as shown in the attached figure (batch_size=64).

[Figure: end-to-end latency vs. input sequence length at batch_size=64]

I have attached my profiling command below. I'm not sure whether I missed something in the profiling setup, or whether support for SWA is still limited.

python3 benchmarks/benchmark_latency.py --model $my_llama_model_with_swa --load-format dummy --trust-remote-code \
                --input-len 8192 \
                --output-len 8192 \
                --batch-size 64 \
                --num-iters-warmup 3 \
                --num-iters 5 \
                --output-json
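
As a sanity check on my side, I also confirm that the config vLLM loads actually advertises a sliding window before trusting the numbers. This is a minimal sketch, assuming a Mistral-style sliding_window field in the Hugging Face config (the field name may differ for a custom Llama variant, and the model path is a placeholder):

from transformers import AutoConfig

# Placeholder path; in practice this is $my_llama_model_with_swa.
cfg = AutoConfig.from_pretrained("my_llama_model_with_swa", trust_remote_code=True)
# `sliding_window` is the field name used by Mistral-style configs; it may not
# exist on a custom Llama config, hence the getattr with a default.
print(getattr(cfg, "sliding_window", None))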

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@tilmto tilmto added the performance Performance-related issues label Jan 31, 2025