[LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728

Draft · Ying1123 wants to merge 1 commit into main
Conversation

@Ying1123 (Member) commented Oct 20, 2024

This PR adds the option --lora-backend to choose between the triton and flashinfer backends.

Items to complete before merging this PR:

  • Accuracy test

The triton kernels for shrink and 2-D segmented gemm will come in follow-up PRs.
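
For context, the LoRA delta for one adapter is computed as two gemms; the sketch below (plain PyTorch with illustrative shapes and variable names, not SGLang internals) shows which half "shrink" and "expand" refer to:

import torch

d_in, d_out, r = 4096, 4096, 16
x = torch.randn(8, d_in)      # 8 tokens of hidden states
A = torch.randn(r, d_in)      # LoRA A weight: projects down to rank r
B = torch.randn(d_out, r)     # LoRA B weight: projects back up to d_out

t = x @ A.T                   # "shrink" gemm: (8, d_in) -> (8, r)
delta = t @ B.T               # "expand" gemm: (8, r) -> (8, d_out), the kernel in this PR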

See the usage example below:

# launch server
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora lora1=/home/ying/test_lora_1 lora2=/home/ying/test_lora_2 --disable-radix --disable-cuda-graph --max-loras-per-batch 4 --lora-backend triton
# send requests
# lora_path[i] specifies the LoRA used for text[i], so the two lists must have the same length
# use None for a base-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
import json
import requests

url = "http://127.0.0.1:30000"
json_data = {
        "text": ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5", "prompt 6", "prompt7"],
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": ["/home/ying/test_lora", "lora1", "lora2", "lora1", "lora2", None, None],
}
response = requests.post(
        url + "/generate",
        json=json_data,
)
print(json.dumps(response.json()))

For multi-LoRA serving, what has been done:

What is in progress:

  • Add triton backend and performance optimizations
    • expand kernel for segmented gemm (this PR; see the reference sketch after this list)
    • shrink kernel for segmented gemm
    • 2-D segmented gemm
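
To make the segmented aspect concrete, here is a minimal PyTorch reference of what the expand step computes over a batch that mixes adapters. The function name, argument layout, and the stacked (num_adapters, r, H) weight tensor are assumptions for illustration, not the actual kernel interface:

import torch

def segmented_expand_reference(x_rank, seg_lens, seg_adapter_ids, lora_b, output):
    # x_rank:          (total_tokens, r)     shrink output for the flattened batch
    # seg_lens:        tokens per request, in batch order
    # seg_adapter_ids: adapter index per request (-1 means base-only, no LoRA delta)
    # lora_b:          (num_adapters, r, H)  stacked LoRA B weights
    # output:          (total_tokens, H)     base-model output, updated in place
    start = 0
    for length, adapter in zip(seg_lens, seg_adapter_ids):
        if adapter >= 0:
            output[start:start + length] += x_rank[start:start + length] @ lora_b[adapter]
        start += length
    return output

# Toy usage: 3 requests of 4, 2, and 3 tokens; rank 16; hidden size 64; 2 adapters.
r, H = 16, 64
x_rank = torch.randn(9, r)
lora_b = torch.randn(2, r, H)
out = torch.zeros(9, H)
segmented_expand_reference(x_rank, [4, 2, 3], [0, 1, -1], lora_b, out)

The triton kernel replaces this per-segment Python loop with a single fused launch, so requests that use different adapters in the same batch do not serialize on separate gemm calls.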

You can expect the items below in follow-up PRs.

  • OpenAI compatible API
  • compatibility with cuda graph
  • compatibility with radix attention
  • fully sharded LoRAs with tensor parallelism
  • performance optimization
  • memory optimization
  • support LoRAs with different ranks
  • test case enhancements

References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving
