It should be possible to talk to a single model that is backed by different GPUs. In addition, users may want to mix spot, on-demand, and reserved capacity, which may have to be expressed as different resource profiles.
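For illustration, spot and on-demand capacity could be captured as two resource profiles that differ only in their node selectors. This is a minimal sketch against KubeAI's resource-profile config; the profile names and the cloud.google.com/gke-spot label are illustrative assumptions, not an existing convention:

resourceProfiles:
  nvidia-gpu-l4-spot:
    limits:
      nvidia.com/gpu: "1"
    nodeSelector:
      cloud.google.com/gke-spot: "true"   # illustrative: land on spot nodes
  nvidia-gpu-l4-ondemand:
    limits:
      nvidia.com/gpu: "1"
    nodeSelector: {}                      # illustrative: no spot constraint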
One idea was to have a higher level Custom Resource: ModelAlias that serves as the single endpoint.
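To make the ModelAlias idea concrete, a minimal sketch of such a resource might look like the following. Note that no such CRD exists in KubeAI today; the kind and every spec field here are hypothetical:

apiVersion: kubeai.org/v1
kind: ModelAlias                 # hypothetical kind, not part of KubeAI today
metadata:
  name: llama-3.1-8b
spec:
  models:                        # hypothetical field: backing Models in preference order
    - name: llama-3.1-8b-h100
      priority: 1
    - name: llama-3.1-8b-l4
      priority: 2

Requests addressed to the alias (model: llama-3.1-8b) would then be routed to whichever backing Model currently has capacity.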
Another idea from @liaddrori1:
Provide flexibility in deploying LLMs across various GPU configurations, such as 1x L4, 2x L4, or 1x A100. This would facilitate efficient resource utilization and scalability.
However, instead of introducing an additional CRD for ModelAlias, consider extending the existing resourceProfile field to accept multiple configurations. By allowing resourceProfile to hold a map of profiles with assigned priorities or weights, the KubeAI controller could attempt to schedule models according to the specified preferences. This approach streamlines the configuration and leverages the current architecture.
Proposed Configuration Example:
resourceProfile:
  h100-1gpu:
    priority: 1
  l4-2gpu:
    priority: 2
    args: ["--tensor-parallel-size", "2"]  # args override to use 2 GPUs
  l4-1gpu:
    priority: 3
In this setup, the controller would first try to schedule on h100-1gpu. If that capacity is unavailable, it would fall back to l4-2gpu, and then l4-1gpu, so the model still comes up on the best available hardware without requiring a new CRD.
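Embedded in a full Model manifest, the proposed extension might look like the following. To be clear, this is a sketch of the proposal, not current behavior: today spec.resourceProfile is a single string such as nvidia-gpu-l4:1, and the prioritized map is the suggested change:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Meta-Llama-3.1-8B-Instruct
  engine: VLLM
  resourceProfile:               # proposed: multiple profiles with priorities
    h100-1gpu:
      priority: 1
    l4-2gpu:
      priority: 2
      args: ["--tensor-parallel-size", "2"]  # per-profile args override
    l4-1gpu:
      priority: 3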