Supporting Multi-LoRA inferencing via JetStream server #221
Conversation
Looked at it at a high level and left some comments. Will take a deeper look again.
Thanks for the PR. It's a bit long, and I'd have preferred that you send adapter_tensorstore.py and the related code (along with its unit tests) as a separate PR, since it's isolated enough, before sending the PR that integrates it into the orchestrator.
I've left some initial comments.
Looks good for an initial version.
Supporting Multi-LoRA inferencing via JetStream server, following the [LLM Inference gateway API protocols](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).
- Implemented an adapter_tensorstore to load, store, manage, and unload adapter weights (a rough sketch of the idea follows this list).
- Added and exposed the [required metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#metrics-reporting) at the Prometheus endpoint (see the metrics sketch further below).
- Added the multi_lora_decoding service with corresponding APIs as per the [requirement](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).
- Implemented single-LoRA functionality support.
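For context, here is a minimal sketch of what an in-memory adapter store along these lines might look like. The class name, method names, and LRU eviction policy are illustrative assumptions, not the actual adapter_tensorstore.py implementation in this PR.

```python
# Illustrative sketch only -- names and eviction policy are assumptions,
# not the PR's actual adapter_tensorstore implementation.
import threading
from collections import OrderedDict

import numpy as np


class AdapterTensorStore:
  """Keeps LoRA adapter weights in memory, evicting least-recently-used ones."""

  def __init__(self, max_adapters: int = 8):
    self._max_adapters = max_adapters
    self._adapters: OrderedDict[str, dict[str, np.ndarray]] = OrderedDict()
    self._lock = threading.Lock()

  def load(self, adapter_id: str, weights: dict[str, np.ndarray]) -> None:
    """Registers an adapter's weights, evicting the LRU entry if full."""
    with self._lock:
      if adapter_id in self._adapters:
        self._adapters.move_to_end(adapter_id)
        return
      if len(self._adapters) >= self._max_adapters:
        self._adapters.popitem(last=False)  # Evict the least recently used adapter.
      self._adapters[adapter_id] = weights

  def get(self, adapter_id: str) -> dict[str, np.ndarray]:
    """Returns an adapter's weights and marks it as recently used."""
    with self._lock:
      weights = self._adapters[adapter_id]
      self._adapters.move_to_end(adapter_id)
      return weights

  def unload(self, adapter_id: str) -> None:
    """Removes an adapter from the store if present."""
    with self._lock:
      self._adapters.pop(adapter_id, None)

  def list_adapters(self) -> list[str]:
    """Returns the IDs of all currently loaded adapters."""
    with self._lock:
      return list(self._adapters)
```

Under these assumptions, the orchestrator would look up `store.get("my-lora")` per request and apply the returned low-rank deltas during decoding; the actual PR may manage device placement and async loading differently.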
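And a rough sketch of how per-adapter metrics could be surfaced with the `prometheus_client` library. The metric name, labels, and port here are placeholders, not necessarily what the gateway protocol or this PR defines.

```python
# Illustrative sketch only -- metric names, labels, and port are placeholders.
from prometheus_client import Gauge, start_http_server

# Gauge reporting which LoRA adapters are currently loaded on the server.
LORA_INFO = Gauge(
    "jetstream_lora_request_info",
    "Currently running and waiting LoRA adapters",
    ["running_lora_adapters", "waiting_lora_adapters"],
)


def report_lora_state(running: list[str], waiting: list[str]) -> None:
  """Publishes the current adapter state as label values on the gauge."""
  LORA_INFO.labels(
      running_lora_adapters=",".join(running),
      waiting_lora_adapters=",".join(waiting),
  ).set(1)


if __name__ == "__main__":
  start_http_server(9100)  # Expose /metrics on port 9100.
  report_lora_state(running=["lora-a"], waiting=["lora-b"])
```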