[LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728
+367 −74
This PR added the option `--lora-backend` to choose between the triton and flashinfer backends.

Items before merging this PR:

The triton kernels for shrink and 2-D segmented gemm will come up in follow-up PRs. See the example below:
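A minimal launch sketch showing how the new backend option is selected. Only `--lora-backend` is introduced by this PR; the model path, `--lora-paths`, and the values below are illustrative assumptions rather than part of this change:

```bash
# Hypothetical launch command; paths and flag values are placeholders.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths lora0=/path/to/adapter_0 lora1=/path/to/adapter_1 \
  --max-loras-per-batch 4 \
  --lora-backend triton
```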
For multi-LoRA serving, what has been done:

This PR gives initial multi-LoRA serving support. Currently, it supports LoRA on attention (`qkvo`) and mlp (`gate`, `up`, `down`) linear layers. It supports dynamic loading and offloading, but it does not support unified memory. The memory pool for LoRA adapters is pre-allocated, so please use a smaller `--mem-frac` to launch the server with a larger `--max-loras-per-batch`.
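To make the expand step concrete, below is a deliberately simplified Triton sketch of what the gemm expand computes: each program handles one token and one tile of output columns, looks up that token's adapter index, and accumulates the LoRA delta `B @ (x A)` into the base layer's output. The actual kernel in this PR uses a blocked 2-D segmented-gemm formulation; the names, shapes, and row-major layout below are assumptions for illustration only.

```python
import triton
import triton.language as tl


@triton.jit
def lora_expand_sketch(
    x_ptr,      # (num_tokens, rank): "shrink" output, i.e. x @ A^T
    b_ptr,      # (num_adapters, output_dim, rank): stacked LoRA B weights
    out_ptr,    # (num_tokens, output_dim): base-layer output, updated in place
    idx_ptr,    # (num_tokens,): adapter index per token
    rank, output_dim,
    BLOCK_N: tl.constexpr,
    BLOCK_R: tl.constexpr,
):
    token = tl.program_id(0)
    pid_n = tl.program_id(1)
    adapter = tl.load(idx_ptr + token)

    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_r = tl.arange(0, BLOCK_R)
    mask_n = offs_n < output_dim
    mask_r = offs_r < rank

    # Rank-sized intermediate for this token.
    x = tl.load(x_ptr + token * rank + offs_r, mask=mask_r, other=0.0)

    # (BLOCK_N, BLOCK_R) tile of this token's adapter B matrix.
    b_tile = tl.load(
        b_ptr + adapter * output_dim * rank
        + offs_n[:, None] * rank + offs_r[None, :],
        mask=mask_n[:, None] & mask_r[None, :],
        other=0.0,
    )

    # out[token, offs_n] += B_tile @ x_r
    delta = tl.sum(b_tile * x[None, :], axis=1)
    prev = tl.load(out_ptr + token * output_dim + offs_n, mask=mask_n, other=0.0)
    tl.store(out_ptr + token * output_dim + offs_n, prev + delta, mask=mask_n)


def lora_expand(x, b_stacked, out, adapter_ids, BLOCK_N=64):
    """Apply the per-token LoRA expand update in place (sketch; contiguous tensors assumed)."""
    num_tokens, rank = x.shape
    output_dim = out.shape[1]
    grid = (num_tokens, triton.cdiv(output_dim, BLOCK_N))
    lora_expand_sketch[grid](
        x, b_stacked, out, adapter_ids,
        rank, output_dim,
        BLOCK_N=BLOCK_N, BLOCK_R=triton.next_power_of_2(rank),
    )
    return out
```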
What is in progress:

You can expect the items below in the follow-up PRs.
References:

- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- Punica: Multi-Tenant LoRA Serving