
Supporting Multi-LoRA inferencing via JetStream server #221


Merged: 1 commit merged into main from amangu-lora on Apr 14, 2025

Conversation

@aman2930 (Collaborator) commented Mar 6, 2025

Supporting Multi-LoRA inferencing via JetStream server, following the [LLM Inference gateway API protocols](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).

  • Implemented an adapter_tensorstore to load, store, manage, and unload the adapter weights (a hedged sketch follows this list).
  • Added and exposed the required metrics at the Prometheus endpoint.
  • Added a multi_lora_decoding service with the corresponding APIs, as per the protocol requirements.
  • Implemented single-LoRA functionality support.
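
For context, here is a minimal sketch of what such an adapter tensorstore could look like, assuming an in-memory registry with a fixed memory budget and LRU eviction; the class, field, and method names are hypothetical, not JetStream's actual implementation:

```python
# Hypothetical sketch of an adapter tensorstore; names and the eviction
# policy are assumptions, not JetStream's actual implementation.
import threading
import time
from dataclasses import dataclass, field


@dataclass
class AdapterEntry:
    weights: dict  # tensor name -> array (e.g. a jax.Array)
    size_bytes: int
    last_used: float = field(default_factory=time.time)


class AdapterTensorStore:
    """Loads, caches, and evicts LoRA adapter weights under a memory budget."""

    def __init__(self, budget_bytes: int):
        self._budget = budget_bytes
        self._entries: dict[str, AdapterEntry] = {}
        self._lock = threading.Lock()

    def load(self, adapter_id: str, weights: dict, size_bytes: int) -> None:
        with self._lock:
            self._entries.pop(adapter_id, None)  # replace if already loaded
            self._evict_until_fits(size_bytes)
            self._entries[adapter_id] = AdapterEntry(weights, size_bytes)

    def get(self, adapter_id: str) -> dict:
        with self._lock:
            entry = self._entries[adapter_id]  # KeyError if never loaded
            entry.last_used = time.time()
            return entry.weights

    def unload(self, adapter_id: str) -> None:
        with self._lock:
            self._entries.pop(adapter_id, None)

    def _evict_until_fits(self, incoming_bytes: int) -> None:
        # Evict least-recently-used adapters until the new one fits.
        used = sum(e.size_bytes for e in self._entries.values())
        while self._entries and used + incoming_bytes > self._budget:
            lru_id = min(self._entries, key=lambda k: self._entries[k].last_used)
            used -= self._entries.pop(lru_id).size_bytes
```

Decoding would then call `get(adapter_id)` per request, keeping hot adapters resident while rarely used ones are evicted.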

@aman2930 requested a review from vipannalla as a code owner March 6, 2025 22:24
@aman2930 requested reviews from yixinshi, vipannalla, and gangji and removed the request for vipannalla March 6, 2025 22:28
@mailvijayasingh (Collaborator) left a comment

Looked at it at a high level and left some comments. Will take a deeper look again.

@vipannalla (Collaborator) left a comment

Thanks for the PR. It's a bit long; since adapter_tensorstore.py is isolated enough, I'd have preferred you send it and the related code, along with the unit tests, as a separate PR before sending the PR that integrates it into the orchestrator.

I have some initial comments.

@vipannalla (Collaborator) left a comment

Looks good for the initial version.

@github-actions bot added the pull ready label ("This label is needed if we want the copybara service to auto sync it to g3.") Apr 14, 2025
@jyj0w0 merged commit 082c0ac into main Apr 14, 2025
4 of 5 checks passed
@jyj0w0 deleted the amangu-lora branch April 14, 2025 18:58
jyj0w0 pushed a commit that referenced this pull request Apr 16, 2025
Change to ubuntu-latest since ubuntu-20.04 is deprecated

Move prefix cache from MaxText (#239)

Retry grpc async request (#240)

Exceptions raised by the asyncio task were not caught properly. If the
server was not ready, benchmark serving blocked forever without any notice.
This change retries the connection to the server (a hedged retry sketch follows).
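
A rough sketch of that retry behavior, assuming a grpc.aio-based client; `send_request`, the attempt count, and the backoff are illustrative assumptions, not the benchmark's actual code:

```python
# Hedged sketch: surface connection errors from the async request and retry
# with backoff instead of letting the benchmark hang forever.
import asyncio

from grpc import aio


async def request_with_retry(send_request, max_attempts: int = 5,
                             backoff_s: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return await send_request()
        except aio.AioRpcError as e:
            # Server not ready yet: report and retry rather than block silently.
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt} failed ({e.code()}); retrying")
            await asyncio.sleep(backoff_s * attempt)
```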

Adding PyTests in JetStream unit test workflow for code coverage. (#242)

Supporting Multi-LoRA inferencing via JetStream server (#221)

Supporting Multi-LoRA inferencing via JetStream server following the [LLM Inference gateway API protocols](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).

- Implemented an adapter_tensorstore to load, store, manage, and unload the adapter weights
- Added and exposed the [required metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#metrics-reporting) at the Prometheus endpoint (a metrics sketch follows this list)
- Added a multi_lora_decoding service with the corresponding APIs, as per the [requirement](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol)
- Implemented single-LoRA functionality support
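
As a rough illustration of the metrics-reporting side, a minimal sketch using prometheus_client; the metric names here are assumptions for illustration, not the ones mandated by the model server protocol:

```python
# Minimal sketch of exposing adapter metrics with prometheus_client.
from prometheus_client import Gauge, start_http_server

loaded_adapters = Gauge(
    "jetstream_loaded_lora_adapters",  # assumed metric name
    "Number of LoRA adapters currently loaded in the tensorstore",
)
adapter_memory_bytes = Gauge(
    "jetstream_lora_adapter_memory_bytes",  # assumed metric name
    "Total bytes of adapter weights held in memory",
)

start_http_server(9090)  # serve /metrics on port 9090
loaded_adapters.set(3)
adapter_memory_bytes.set(256 * 1024 * 1024)
```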