[misc] Add LoRA kernel micro benchmarks #11579
base: main
Conversation
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@jeejeelee This PR adds some tooling for benchmarking LoRA kernels. It should be useful for further optimizing LoRA kernels and for #11234. Note that this PR emulates the … @mgoin fyi
benchmarks/kernels/benchmark_lora.py (Outdated)
                         args.with_cuda_graph))
            seq_len_timers.append(
                bench_optype(_ctx, args.arg_pool_size, bench_op,
                             args.with_cuda_graph))
Perhaps we need to ensure the computed results match (i.e., verify correctness).
For expand-related operations with add_inputs=True, testing the benchmarking results for correctness is hard, as the function is run an indeterminate number of times.
I have added a test_correctness method to the BenchmarkTensor class that can be invoked with the CLI argument --test-correctness. Note that this checks correctness before the benchmarking is run, which should give us enough confidence in the validity of the results.
What do you think?
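For illustration only (this is not the PR's actual test_correctness implementation; the op signature and tensor names are hypothetical), a minimal sketch of the pattern: check an expand-style op against a plain PyTorch reference once, before the same tensors are handed to the benchmark loop.

```python
import torch

def check_expand_correctness(op, x, lora_b, out, add_inputs=True):
    """Run `op` once on cloned tensors and compare against a plain-PyTorch
    reference, before benchmarking mutates `out` an indeterminate number
    of times. `op` is a stand-in for the LoRA expand kernel under test."""
    ref_out = out.clone()
    test_out = out.clone()

    # Reference result: a single matmul, optionally added into the output.
    ref = x @ lora_b
    ref_out = ref_out + ref if add_inputs else ref

    # One invocation of the kernel under test.
    op(x, lora_b, test_out, add_inputs=add_inputs)

    torch.testing.assert_close(test_out, ref_out, atol=1e-2, rtol=1e-2)
```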
        num_ops_in_cuda_graph=arg_pool_size) if with_cuda_graph else None
    with Bench(cuda_graph_params, ctx.bench_label(),
               ctx.bench_sublabel(op_type), description, torch.mm,
               **mm_kwargs) as bench:
QQ: Does torch.mm support grouped GEMM? If not, how does it compute the multi-LoRA GEMM as a baseline?
AFAIK, it does not. I meant for the torch.mm benchmark (just a plain matmul) to serve as a roofline. Sorry about the confusion; I have renamed the functions and added a comment.
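As an aside, a minimal sketch of what such a torch.mm roofline measurement might look like with torch.utils.benchmark; the shapes, dtype, and labels below are placeholders, not the PR's actual code.

```python
import torch
import torch.utils.benchmark as TBenchmark

# Time a single dense matmul of roughly the same problem size as a LoRA
# shrink op; the grouped LoRA kernels can then be compared against this
# best-case dense GEMM.
m, k, n = 4096, 4096, 16  # placeholder: num_tokens x hidden_size x lora_rank
a = torch.randn(m, k, device="cuda", dtype=torch.float16)
b = torch.randn(k, n, device="cuda", dtype=torch.float16)

timer = TBenchmark.Timer(
    stmt="torch.mm(a, b)",
    globals={"torch": torch, "a": a, "b": b},
    label="lora-roofline",
    description="torch.mm reference",
)
print(timer.blocked_autorange(min_run_time=1))
```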
        'max_seq_length': max_seq_len,
        'token_nums': num_tokens,
        'add_inputs': True,
    }
If add_inputs is True, the expand-related kernel performs group GEMM plus an add into the existing outputs, rather than group GEMM alone.
That was intentional, so that we benchmark the most used and most expensive version. But I see the value in controlling this via the CLI, so I have added an --expand-fn-add-inputs argument.
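To make the add_inputs semantics concrete (my reading of the discussion above, with a plain matmul standing in for the grouped kernel; shapes are placeholders):

```python
import torch

x = torch.randn(8, 16)        # num_tokens x lora_rank
lora_b = torch.randn(16, 64)  # lora_rank x hidden_size
out = torch.randn(8, 64)      # pre-existing output (e.g. the base-model result)

# add_inputs=False: the expand result overwrites the output buffer.
out_overwrite = x @ lora_b

# add_inputs=True: the expand result is accumulated into the existing output
# (group GEMM plus add), i.e. the more common and more expensive path.
out_accumulate = out + x @ lora_b
```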
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Add LoRA kernel micro benchmarks for tuning/optimizing LoRA kernels
Added a utils.py in benchmarks/kernels/ that implements a Bench class. The Bench class is abstract enough to be reused in future benchmark implementations (a sketch of the timing pattern it suggests appears after the mode list below). The benchmarking script can run in one of three modes:
1. range_bench — benchmark over a range of hidden-dimension sizes and LoRA ranks.
Example:
python3 benchmarks/kernels/benchmark_lora.py range_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph --hidden-sizes-start 1024 --hidden-sizes-end 4096 --hidden-sizes-increment 1024 --lora-ranks-start 8 --lora-ranks-end 24 --lora-ranks-increment 8
2. list_bench — when range benchmarking is too restrictive, use this mode to explicitly list the hidden-dimension sizes and LoRA-rank values.
Example:
python3 benchmarks/kernels/benchmark_lora.py list_bench --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --hidden-sizes 2048 2049 4096 8192 --lora-ranks 2 8 16 20 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph
3. model_bench — specify a model so the benchmark uses that model's weight shapes, to understand model execution performance.
Example:
python3 benchmarks/kernels/benchmark_lora.py model_bench --models meta-llama/Llama-3-8b --arg-pool-size 32 --batch-sizes 1 16 32 --dtype torch.float16 --lora-ranks 16 --num-loras 1 4 --op-types bgmv_shrink bgmv_expand sgmv_shrink sgmv_expand sgmv_expand_slice bgmv_expand_slice --seq-lengths 1 16 --sort-by-lora-id 1 --with-cuda-graph
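For readers who want to reproduce the --with-cuda-graph / --arg-pool-size behaviour without this PR's Bench class, the sketch below shows the general CUDA-graph timing pattern it implies: capture one invocation per pooled argument set in a single graph, then time graph replay and divide by the pool size for an amortized per-op estimate. The op, argument pool, and labels here are placeholders, not the PR's actual API.

```python
import torch
import torch.utils.benchmark as TBenchmark

def bench_with_cuda_graph(op, args_pool, min_run_time=1.0):
    """Capture one call of `op` per entry in `args_pool` inside a single CUDA
    graph, then time graph replay. `op` and `args_pool` are placeholders for
    the kernel under test and its pre-allocated argument sets."""
    # Warm up on a side stream so capture does not record one-time setup work.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for args in args_pool:
            op(*args)
    torch.cuda.current_stream().wait_stream(s)

    # Record all pooled invocations into a single graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for args in args_pool:
            op(*args)

    # Time graph replay; divide by len(args_pool) for a per-op estimate.
    timer = TBenchmark.Timer(
        stmt="g.replay(); torch.cuda.synchronize()",
        globals={"g": g, "torch": torch},
        label="lora-kernel",
        description=f"cuda-graph x{len(args_pool)}",
    )
    return timer.blocked_autorange(min_run_time=min_run_time)
```

For example, bench_with_cuda_graph(lambda x, w, out: torch.mm(x, w, out=out), pool) with a pool of 32 pre-allocated (x, w, out) triples would mirror an --arg-pool-size 32 run against the dense-matmul roofline.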
Some benchmark results, run on main and later collated, can be found here: https://docs.google.com/spreadsheets/d/16iA8nZyuhfOctNg6KSJ1Y0Ve5udZKDOMsiDYDORNyks/edit?usp=sharing