[LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728

Draft · Ying1123 wants to merge 1 commit into main
Conversation

@Ying1123 (Member) commented Oct 20, 2024

This PR adds the option --lora-backend to choose between the triton and flashinfer backends.

Items to complete before merging this PR:

  • Accuracy test

The triton kernels for shrink and 2-D segmented gemm will come in follow-up PRs.
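
For context, the LoRA delta for one adapter is computed as two gemms; the sketch below (plain PyTorch with illustrative shapes and variable names, not SGLang internals) shows which half "shrink" and "expand" refer to:

import torch

d_in, d_out, r = 4096, 4096, 16
x = torch.randn(8, d_in)      # 8 tokens of hidden states
A = torch.randn(r, d_in)      # LoRA A weight: projects down to rank r
B = torch.randn(d_out, r)     # LoRA B weight: projects back up to d_out

t = x @ A.T                   # "shrink" gemm: (8, d_in) -> (8, r)
delta = t @ B.T               # "expand" gemm: (8, r) -> (8, d_out), the kernel in this PR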

See the usage example below:

# launch server
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora lora1=/home/ying/test_lora_1 lora2=/home/ying/test_lora_2 --disable-radix --disable-cuda-graph --max-loras-per-batch 4 --lora-backend triton
# send requests
# lora_path[i] specifies the LoRA used for text[i], so the two lists must have the same length
# use None for a base-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
import json
import requests

url = "http://127.0.0.1:30000"
json_data = {
        "text": ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5", "prompt 6", "prompt7"],
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": ["/home/ying/test_lora", "lora1", "lora2", "lora1", "lora2", None, None],
}
response = requests.post(
        url + "/generate",
        json=json_data,
)
print(json.dumps(response.json()))

For multi-LoRA serving, what has been done:

What is in progress:

  • Add triton backend and performance optimizations
    • expand kernel for segmented gemm (this PR; see the reference sketch after this list)
    • shrink kernel for segmented gemm
    • 2-D segmented gemm
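
To make the segmented aspect concrete, here is a minimal PyTorch reference of what the expand step computes over a batch that mixes adapters. The function name, argument layout, and the stacked (num_adapters, r, H) weight tensor are assumptions for illustration, not the actual kernel interface:

import torch

def segmented_expand_reference(x_rank, seg_lens, seg_adapter_ids, lora_b, output):
    # x_rank:          (total_tokens, r)     shrink output for the flattened batch
    # seg_lens:        tokens per request, in batch order
    # seg_adapter_ids: adapter index per request (-1 means base-only, no LoRA delta)
    # lora_b:          (num_adapters, r, H)  stacked LoRA B weights
    # output:          (total_tokens, H)     base-model output, updated in place
    start = 0
    for length, adapter in zip(seg_lens, seg_adapter_ids):
        if adapter >= 0:
            output[start:start + length] += x_rank[start:start + length] @ lora_b[adapter]
        start += length
    return output

# Toy usage: 3 requests of 4, 2, and 3 tokens; rank 16; hidden size 64; 2 adapters.
r, H = 16, 64
x_rank = torch.randn(9, r)
lora_b = torch.randn(2, r, H)
out = torch.zeros(9, H)
segmented_expand_reference(x_rank, [4, 2, 3], [0, 1, -1], lora_b, out)

The triton kernel replaces this per-segment Python loop with a single fused launch, so requests that use different adapters in the same batch do not serialize on separate gemm calls.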

You can expect the items below in follow-up PRs.

  • OpenAI compatible API
  • compatibility with cuda graph
  • compatibility with radix attention
  • fully sharded LoRAs with tensor parallelism
  • performance optimization
  • memory optimization
  • support LoRAs with different ranks
  • test case enhancements

References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving
