
[V1] LoRA Support #10957

Open · wants to merge 1 commit into base: main

Conversation

@varun-sundar-rabindranath (Contributor) commented Dec 6, 2024

Changes:

  • Run LoRA requests through V1
    • All LoRA functionality lives in a LoRAGPUModelRunnerMixin class that GPUModelRunner inherits from (see the sketch after this list).
    • Changes to GPUModelRunner for loading LoRA models and setting the active LoRAs before every run.
  • Prefix caching
    • Add lora_id as a key to the prefix-caching hash.
  • Scheduler:
    • Add code to track current and newly added LoRA requests.
  • Detokenizer:
    • Use LoRA tokenizers for LoRA requests.
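
For orientation, here is a minimal sketch of the mixin layout described in the list above; the method names and bodies are illustrative assumptions, not the PR's exact API:

```python
# Illustrative sketch only; the real classes live in vLLM's V1 worker code.


class LoRAGPUModelRunnerMixin:
    """Holds LoRA-specific state so the base runner stays LoRA-agnostic."""

    def load_lora_model(self, model, lora_config, device):
        # Hypothetical hook: wrap the base model with a LoRA-aware manager
        # once, at model-load time.
        raise NotImplementedError

    def set_active_loras(self, lora_requests, prompt_lora_mapping,
                         token_lora_mapping):
        # Hypothetical hook: called before every forward pass so the LoRA
        # kernels see exactly the adapters used by the current batch.
        raise NotImplementedError


class GPUModelRunner(LoRAGPUModelRunnerMixin):
    def execute_model(self, scheduler_output):
        if getattr(self, "lora_config", None):
            # Activate the LoRA adapters scheduled for this step.
            self.set_active_loras(..., ..., ...)
        # ... regular forward pass follows ...
```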

Benchmarks:
Machine: 1xA100
V1

VLLM_USE_V1="1" python3 benchmarks/benchmark_throughput.py --model  meta-llama/Llama-2-7b-hf --backend vllm   --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --max-loras 4 --max-lora-rank 8  --enable-lora --lora-path "yard1/llama-2-7b-sql-lora-test"

Throughput: 2.42 requests/s, 1225.95 total tokens/s, 628.29 output tokens/s

V0

python3 benchmarks/benchmark_throughput.py --model  meta-llama/Llama-2-7b-hf --backend vllm   --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 200 --max-loras 4 --max-lora-rank 8  --enable-lora --lora-path "yard1/llama-2-7b-sql-lora-test"

Throughput: 5.95 requests/s, 3021.90 total tokens/s, 1548.71 output tokens/s

The performance gap between V0 and V1 is due to CUDA Graphs. Refer to the benchmarks in reference PR #11613.

github-actions bot commented Dec 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

tokenizer_name=tokenizer_name,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
revision=revision)
Contributor Author:

@ywang96 @njhill small refactor to allow for per-request tokenizers.

@@ -602,269 +633,3 @@ def _get_padded_batch_size(self, batch_size: int) -> Optional[int]:
if batch_size <= size:
return size
return None

Contributor Author:

Refactor: moved CachedRequestState and InputBatch to input_batch.py. It looked like a good refactor to reduce file size, and in this PR it lets both gpu_model_runner.py and lora_model_runner_mixin.py import these data structures from input_batch.py.

max_num_logprobs=self.max_num_logprobs,
)

def make_lora_inputs(self, num_scheduled_tokens: np.array) \
Contributor Author:

Added for LoRA

@varun-sundar-rabindranath varun-sundar-rabindranath changed the title V1 LoRA Support [V1] LoRA Support Dec 6, 2024
@WoosukKwon (Collaborator) left a comment:

Thanks for doing this! Left a few early comments. Will look into more details later.

vllm/v1/engine/processor.py (outdated, resolved)
vllm/v1/core/scheduler.py (outdated, resolved)
vllm/v1/core/scheduler.py (outdated, resolved)
Comment on lines 175 to 190
if self.lora_config:
    requested_loras = set(
        req.lora_request.lora_int_id
        for req in scheduled_running_reqs
        if req.lora_request and req.lora_request.lora_int_id > 0)
    assert len(requested_loras) <= self.lora_config.max_loras
Collaborator:

Can we cache this state and incrementally update it whenever a new request joins or finishes?

Contributor Author:

I explored this a bit. Tracking additions and deletions to the running queue in the current code is hard: the updates happen in more than one place (new requests, finished requests, and requests moving between the running and preempted states). One way is to replace every append/remove/pop with

self.running.<operation>()
if lora_config:
    update_active_loras()

A better way is to subclass list so that any create/update/delete operation also updates the active LoRAs (see the sketch below). That is a considerable change, though; I believe we can do it after some profiling shows how costly this code actually is.
For the moment, I think this localized update is nicer as it doesn't introduce a bunch of `if self.lora_config` checks.

Is there a better way I am missing?
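
For reference, a minimal sketch of the list-subclass alternative mentioned above; this is purely illustrative (the class name and callback are assumptions), not code from this PR:

```python
from typing import Callable, Iterable


class RunningQueue(list):
    """A list that notifies a callback after any mutation, so active-LoRA
    state could be kept in sync without touching every call site."""

    def __init__(self, on_change: Callable[[], None],
                 items: Iterable = ()) -> None:
        super().__init__(items)
        self._on_change = on_change

    def append(self, item) -> None:
        super().append(item)
        self._on_change()

    def remove(self, item) -> None:
        super().remove(item)
        self._on_change()

    def pop(self, index: int = -1):
        item = super().pop(index)
        self._on_change()
        return item


# Usage sketch: the scheduler would recompute (or incrementally update) its
# active LoRA set whenever the running queue changes.
running = RunningQueue(on_change=lambda: print("running queue changed"))
running.append("req-1")
running.pop()
```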

vllm/v1/worker/input_batch.py (outdated, resolved) ×4
Comment on lines 290 to 302
req_lora_mapping = self.request_lora_mapping[:self.num_reqs]
prompt_lora_mapping = tuple(req_lora_mapping)
token_lora_mapping = tuple(
    req_lora_mapping.repeat(num_scheduled_tokens))

active_lora_ids: set[int] = set(np.unique(req_lora_mapping))
active_lora_requests: set[LoRARequest] = {
    lr for lr in self.lora_requests if lr.lora_int_id in active_lora_ids}
# Update lora requests
self.lora_requests = active_lora_requests

return prompt_lora_mapping, token_lora_mapping, self.lora_requests
Collaborator:

How does this work with the punica kernels?

Contributor Author:

We always use the punica SGMV kernel (set where the LoRAMapping is built from token_lora_mapping). Internally, the kernels launch a set of thread blocks for each request separately, so as long as prompt_lora_mapping is correct, the kernels work correctly.

The SGMV kernel codepath merges sequences that share the same lora-id (in compute_meta). I chose the SGMV kernel so this merging happens wherever possible.

I'll profile with both the SGMV and BGMV kernels and choose the better one. For now, SGMV looked like a good default/placeholder.
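
To make the merging concrete, here is an illustrative sketch (not the actual punica compute_meta code) of how tokens from adjacent requests that share a LoRA id can be grouped into a single segment:

```python
import numpy as np


def group_segments_by_lora_id(token_lora_mapping: np.ndarray):
    """Return (start, length, lora_id) for each run of consecutive tokens
    that use the same LoRA id (illustrative sketch only)."""
    segments = []
    start = 0
    n = len(token_lora_mapping)
    for i in range(1, n + 1):
        if i == n or token_lora_mapping[i] != token_lora_mapping[start]:
            segments.append((start, i - start, int(token_lora_mapping[start])))
            start = i
    return segments


# Two requests using LoRA 1 (3 tokens each) followed by one using LoRA 2:
mapping = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2])
print(group_segments_by_lora_id(mapping))  # [(0, 6, 1), (6, 3, 2)]
```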

Collaborator:

Regarding V0 LoRA: SGMV implements grouped GEMM, which performs better for the prefill stage, while BGMV implements grouped GEMV, which is better optimized for the decode stage. If only one can be chosen, SGMV is likely more suitable.

@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch from d21df49 to 797dab2 Compare December 17, 2024 03:47
mergify bot commented Dec 17, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Dec 31, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 31, 2024
@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch from 3200ed4 to 48e9185 Compare December 31, 2024 01:53
@mergify mergify bot removed the needs-rebase label Dec 31, 2024
@varun-sundar-rabindranath varun-sundar-rabindranath marked this pull request as ready for review December 31, 2024 01:54
logits = lm_head.linear_method.apply(lm_head,
                                     hidden_states,
                                     bias=embedding_bias)
def _gather_logits(self, logits: torch.Tensor) -> torch.Tensor:
Contributor Author:

Refactor: introduce _gather_logits(), which LogitsProcessorWithLoRA also uses.
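
As a rough sketch of what a shared helper like this could look like (the body here is an assumption about gathering/trimming vocab-sharded logits, not the PR's implementation):

```python
import torch


class LogitsProcessorSketch:
    """Illustrative only: the base logits processor and a LoRA-aware variant
    can both call the same _gather_logits() helper."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def _gather_logits(self, logits: torch.Tensor) -> torch.Tensor:
        # Assumption: under tensor parallelism this is where vocab-sharded
        # logits would be gathered across ranks; on a single GPU it only
        # trims any padded vocab entries.
        return logits[..., :self.vocab_size]

    def compute_logits(self, hidden_states: torch.Tensor,
                       lm_head_weight: torch.Tensor) -> torch.Tensor:
        logits = hidden_states @ lm_head_weight.t()
        return self._gather_logits(logits)


proc = LogitsProcessorSketch(vocab_size=32)
print(proc.compute_logits(torch.randn(4, 16), torch.randn(32, 16)).shape)
# torch.Size([4, 32])
```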

return [request.lora_request.lora_int_id]


def generate_block_hash_extra_keys(
Contributor Author:

Refactor for using prefix caching with LoRA.
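
A minimal sketch of the idea (the helper names and hashing scheme here are illustrative, not the PR's kv_cache_utils code): folding lora_int_id into the block hash prevents KV blocks computed with one adapter from being reused for another.

```python
import hashlib
import pickle
from typing import Optional, Tuple


def lora_extra_keys(lora_int_id: Optional[int]) -> Tuple:
    """Extra hash keys for a request: just the LoRA id, if any (sketch)."""
    return (lora_int_id,) if lora_int_id else ()


def block_hash(parent_hash: Optional[bytes], token_ids: Tuple[int, ...],
               extra_keys: Tuple = ()) -> bytes:
    """Hash one KV-cache block from its parent hash, tokens, and extra keys."""
    return hashlib.sha256(
        pickle.dumps((parent_hash, token_ids, extra_keys))).digest()


# The same token block hashes differently for the base model and LoRA id 1,
# so their cached KV blocks are never shared:
tokens = (101, 2023, 2003)
assert block_hash(None, tokens) != block_hash(None, tokens, lora_extra_keys(1))
```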

del hidden_states, logits
self.encoder_cache.clear()
# For the profile run, use the maximum number of requests that,
# collectively, have the maximum number of tokens.
Contributor Author:

Set up num_scheduled_tokens for initializing LoRA in profile_run. @ywang96, will this change interfere with the multi-modal setup above? Can you point me to a test / command I should use to confirm that it works? Thanks.
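
For context, one possible way to build the num_scheduled_tokens array for the profile run (variable names and the exact split are assumptions, not necessarily what this PR does): spread max_num_tokens as evenly as possible over max_num_reqs requests.

```python
import numpy as np


def profile_num_scheduled_tokens(max_num_tokens: int,
                                 max_num_reqs: int) -> np.ndarray:
    """Per-request token counts such that the profiling batch has the maximum
    number of requests and, collectively, the maximum number of tokens."""
    base = max_num_tokens // max_num_reqs
    counts = np.full(max_num_reqs, base, dtype=np.int32)
    counts[:max_num_tokens % max_num_reqs] += 1  # distribute the remainder
    assert counts.sum() == max_num_tokens
    return counts


print(profile_num_scheduled_tokens(max_num_tokens=8192, max_num_reqs=256))
```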

Contributor Author:

bump.

I'd like some review on this part please.

Member:

@varun-sundar-rabindranath Hey sorry for the delayed review, but this should be okay since you're just moving self.encoder_cache.clear() later.

@comaniac (Collaborator) left a comment:

v1/core LGTM

vllm/v1/core/kv_cache_utils.py (outdated, resolved) ×4
vllm/v1/core/scheduler.py (resolved)
mergify bot commented Jan 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 4, 2025
@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch from d04d56d to 4fc158c Compare January 4, 2025 05:42
@mergify mergify bot removed the needs-rebase label Jan 4, 2025
# test in a package
pass


@pytest.mark.xfail(
Collaborator:

Contributor Author:

I see. Thanks for the callout 👍. I was hoping to catch these errors when the PR goes /ready.

@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch from b57ca04 to 5fc59ef Compare January 10, 2025 09:18
@simon-mo simon-mo mentioned this pull request Jan 10, 2025
36 tasks
@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch 3 times, most recently from 6576b44 to 6e81bd8 Compare January 17, 2025 03:51
@WoosukKwon (Collaborator) left a comment:

@varun-sundar-rabindranath Thanks for doing this! The code looks very clean to me. Left some minor comments and questions. Please take a look!

vllm/v1/worker/gpu_input_batch.py (outdated, resolved)
@@ -182,6 +181,14 @@ def schedule(self) -> "SchedulerOutput":
self.encoder_cache_manager.allocate(request, i)
encoder_budget = new_encoder_budget

# Record the LoRAs in scheduled_running_reqs
requested_loras: Set[int] = set()
Collaborator:

nit: why don't we cache this state and update it incrementally?

Contributor Author:

Having a cached requested_loras was more invasive and very cumbersome. Incremental updates to this state would require updating it whenever requests move to or from the running queue. Those updates happen in many places in the file, and tacking an update to requested_loras onto all of them was cumbersome and seemed bug-prone.

The idea was to keep these LoRA changes localized and to optimize later if necessary.

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)
Comment on lines 268 to 269
# only update request_lora_mapping. Defer the updates
# to lora_requests to prepare_lora_inputs.
Collaborator:

Why do we do so? I think we can maintain an inverse index like Dict[lora_id, Set[request_id]]?

Contributor Author:

I did not want to introduce too many data structures, and lora_requests was used only in prepare_lora_inputs.
I have updated the code to include lora_id_to_lora_request and lora_id_to_request_ids dicts to track removal properly. This is probably better for consistency / guarantees.
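
A minimal sketch of the inverse-index bookkeeping described above (illustrative only; the PR's input batch tracks more state than this):

```python
from collections import defaultdict
from typing import Dict, Set


class LoRAIndex:
    """Tracks which requests use which LoRA id, so an adapter can be dropped
    as soon as its last request is removed (illustrative sketch)."""

    def __init__(self) -> None:
        self.lora_id_to_request_ids: Dict[int, Set[str]] = defaultdict(set)

    def add_request(self, req_id: str, lora_id: int) -> None:
        if lora_id > 0:  # lora_id 0 means "no LoRA"
            self.lora_id_to_request_ids[lora_id].add(req_id)

    def remove_request(self, req_id: str, lora_id: int) -> None:
        if lora_id > 0:
            self.lora_id_to_request_ids[lora_id].discard(req_id)
            if not self.lora_id_to_request_ids[lora_id]:
                # Last user of this adapter is gone; it is no longer active.
                del self.lora_id_to_request_ids[lora_id]

    def active_lora_ids(self) -> Set[int]:
        return set(self.lora_id_to_request_ids)


index = LoRAIndex()
index.add_request("req-0", lora_id=1)
index.add_request("req-1", lora_id=1)
index.remove_request("req-0", lora_id=1)
print(index.active_lora_ids())  # {1}
```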

Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@varun-sundar-rabindranath varun-sundar-rabindranath force-pushed the varun/v1-lora-support-attempt-2 branch from 3ba33fd to 7037c91 Compare January 19, 2025 23:05