[misc] Add LoRA kernel micro benchmarks #11579
base: main
Conversation
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@jeejeelee This PR adds some tooling for benchmarking LoRA kernels. It should be useful for further optimizing LoRA kernels and for #11234. Note that this PR emulates the … @mgoin fyi
benchmarks/kernels/benchmark_lora.py (Outdated)
                         args.with_cuda_graph))
            seq_len_timers.append(
                bench_optype(_ctx, args.arg_pool_size, bench_op,
                             args.with_cuda_graph))
Perhaps we need to ensure the computed results match (i.e., verify correctness).
For expand-related operations with add_inputs=True, testing the benchmarking results for correctness is hard, as the function is run an indeterminate number of times.
I have added a test_correctness method to the BenchmarkTensor class that can be invoked with the CLI argument --test-correctness. Note that this checks correctness before the benchmarking is run, which should give us enough confidence in the validity of the results.
What do you think?
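For illustration only (this is not the PR's actual test_correctness implementation; the op signature and tensor names are hypothetical), a minimal sketch of the pattern: check an expand-style op against a plain PyTorch reference once, before the same tensors are handed to the benchmark loop.

```python
import torch

def check_expand_correctness(op, x, lora_b, out, add_inputs=True):
    """Run `op` once on cloned tensors and compare against a plain-PyTorch
    reference, before benchmarking mutates `out` an indeterminate number
    of times. `op` is a stand-in for the LoRA expand kernel under test."""
    ref_out = out.clone()
    test_out = out.clone()

    # Reference result: a single matmul, optionally added into the output.
    ref = x @ lora_b
    ref_out = ref_out + ref if add_inputs else ref

    # One invocation of the kernel under test.
    op(x, lora_b, test_out, add_inputs=add_inputs)

    torch.testing.assert_close(test_out, ref_out, atol=1e-2, rtol=1e-2)
```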
        num_ops_in_cuda_graph=arg_pool_size) if with_cuda_graph else None
    with Bench(cuda_graph_params, ctx.bench_label(),
               ctx.bench_sublabel(op_type), description, torch.mm,
               **mm_kwargs) as bench:
QQ: Does torch.mm support grouped GEMM? If not, how does it compute the multi-LoRA GEMM as a baseline?
AFAIK, it does not. I meant for the torch.mm benchmark (just a plain matmul) to serve as a roofline. Sorry about the confusion; I have renamed the functions and added a comment.
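As an aside, a minimal sketch of what such a torch.mm roofline measurement might look like with torch.utils.benchmark; the shapes, dtype, and labels below are placeholders, not the PR's actual code.

```python
import torch
import torch.utils.benchmark as TBenchmark

# Time a single dense matmul of roughly the same problem size as a LoRA
# shrink op; the grouped LoRA kernels can then be compared against this
# best-case dense GEMM.
m, k, n = 4096, 4096, 16  # placeholder: num_tokens x hidden_size x lora_rank
a = torch.randn(m, k, device="cuda", dtype=torch.float16)
b = torch.randn(k, n, device="cuda", dtype=torch.float16)

timer = TBenchmark.Timer(
    stmt="torch.mm(a, b)",
    globals={"torch": torch, "a": a, "b": b},
    label="lora-roofline",
    description="torch.mm reference",
)
print(timer.blocked_autorange(min_run_time=1))
```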
        'max_seq_length': max_seq_len,
        'token_nums': num_tokens,
        'add_inputs': True,
    }
If add_inputs is True, the expand-related kernel performs group GEMM plus an add into the existing outputs, rather than group GEMM alone.
That was intentional, so that we benchmark the most used and most expensive version. But I see the value in controlling this via the CLI, so I have added an --expand-fn-add-inputs argument.
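To make the add_inputs semantics concrete (my reading of the discussion above, with a plain matmul standing in for the grouped kernel; shapes are placeholders):

```python
import torch

x = torch.randn(8, 16)        # num_tokens x lora_rank
lora_b = torch.randn(16, 64)  # lora_rank x hidden_size
out = torch.randn(8, 64)      # pre-existing output (e.g. the base-model result)

# add_inputs=False: the expand result overwrites the output buffer.
out_overwrite = x @ lora_b

# add_inputs=True: the expand result is accumulated into the existing output
# (group GEMM plus add), i.e. the more common and more expensive path.
out_accumulate = out + x @ lora_b
```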
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Add LoRA kernel micro benchmarks for tuning/optimizing LoRA kernels
Added a utils.py in benchmarks/kernels/ that implements a Bench class. The Bench class is abstract enough to be reused in future benchmark implementations (a sketch of the timing pattern it suggests appears after the mode list below). The benchmarking script can run in one of three modes:
1. range_bench — benchmark over a range of hidden-dimension sizes and LoRA ranks.
Example:
python3 benchmarks/kernels/benchmark_lora.py range_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph --hidden-sizes-start 1024 --hidden-sizes-end 4096 --hidden-sizes-increment 1024 --lora-ranks-start 8 --lora-ranks-end 24 --lora-ranks-increment 8
2. list_bench — when range benchmarking is too restrictive, use this mode to explicitly list the hidden-dimension sizes and LoRA-rank values.
Example:
python3 benchmarks/kernels/benchmark_lora.py list_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --hidden-sizes 2048 2049 4096 8192 --lora-ranks 2 8 16 20 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph
3. model_bench — specify a model so the benchmark uses that model's weight shapes, to understand model execution performance.
Example:
python3 benchmarks/kernels/benchmark_lora.py model_bench --models meta-llama/Llama-3-8b --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --lora-ranks 16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph
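For readers who want to reproduce the --with-cuda-graph / --arg-pool-size behaviour without this PR's Bench class, the sketch below shows the general CUDA-graph timing pattern it implies: capture one invocation per pooled argument set in a single graph, then time graph replay and divide by the pool size for an amortized per-op estimate. The op, argument pool, and labels here are placeholders, not the PR's actual API.

```python
import torch
import torch.utils.benchmark as TBenchmark

def bench_with_cuda_graph(op, args_pool, min_run_time=1.0):
    """Capture one call of `op` per entry in `args_pool` inside a single CUDA
    graph, then time graph replay. `op` and `args_pool` are placeholders for
    the kernel under test and its pre-allocated argument sets."""
    # Warm up on a side stream so capture does not record one-time setup work.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for args in args_pool:
            op(*args)
    torch.cuda.current_stream().wait_stream(s)

    # Record all pooled invocations into a single graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for args in args_pool:
            op(*args)

    # Time graph replay; divide by len(args_pool) for a per-op estimate.
    timer = TBenchmark.Timer(
        stmt="g.replay(); torch.cuda.synchronize()",
        globals={"g": g, "torch": torch},
        label="lora-kernel",
        description=f"cuda-graph x{len(args_pool)}",
    )
    return timer.blocked_autorange(min_run_time=min_run_time)
```

For example, bench_with_cuda_graph(lambda x, w, out: torch.mm(x, w, out=out), pool) with a pool of 32 pre-allocated (x, w, out) triples would mirror an --arg-pool-size 32 run against the dense-matmul roofline.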
Some benchmark results, run on main and later collated, can be found here: https://docs.google.com/spreadsheets/d/16iA8nZyuhfOctNg6KSJ1Y0Ve5udZKDOMsiDYDORNyks/edit?usp=sharing