It should be possible to talk to a single model that is backed by different GPUs. In addition, users may want to mix spot, on-demand, and reserved capacity, which may have to be expressed as different resource profiles.
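For illustration, spot and on-demand capacity could be captured as two resource profiles that differ only in their node selectors. This is a minimal sketch against KubeAI's resource-profile config; the profile names and the cloud.google.com/gke-spot label are illustrative assumptions, not an existing convention:

resourceProfiles:
  nvidia-gpu-l4-spot:
    limits:
      nvidia.com/gpu: "1"
    nodeSelector:
      cloud.google.com/gke-spot: "true"   # illustrative: land on spot nodes
  nvidia-gpu-l4-ondemand:
    limits:
      nvidia.com/gpu: "1"
    nodeSelector: {}                      # illustrative: no spot constraint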
One idea was to have a higher level Custom Resource: ModelAlias that serves as the single endpoint.
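To make the ModelAlias idea concrete, a minimal sketch of such a resource might look like the following. Note that no such CRD exists in KubeAI today; the kind and every spec field here are hypothetical:

apiVersion: kubeai.org/v1
kind: ModelAlias                 # hypothetical kind, not part of KubeAI today
metadata:
  name: llama-3.1-8b
spec:
  models:                        # hypothetical field: backing Models in preference order
    - name: llama-3.1-8b-h100
      priority: 1
    - name: llama-3.1-8b-l4
      priority: 2

Requests addressed to the alias (model: llama-3.1-8b) would then be routed to whichever backing Model currently has capacity.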
Another idea from @liaddrori1:
Provide flexibility in deploying LLMs across various GPU configurations, such as 1x L4, 2x L4, or 1x A100. This would facilitate efficient resource utilization and scalability.
However, instead of introducing an additional CRD for ModelAlias, consider extending the existing resourceProfile field to accept multiple configurations. By allowing resourceProfile to hold a map of profiles with assigned priorities or weights, the KubeAI controller could attempt to schedule models according to the specified preferences. This approach streamlines the configuration and leverages the current architecture.
Proposed Configuration Example:
resourceProfile:
  h100-1gpu:
    priority: 1
  l4-2gpu:
    priority: 2
    args: ["--tensor-parallel-size", "2"]  # args override to use 2 GPUs
  l4-1gpu:
    priority: 3
In this setup, the controller would first try to schedule on h100-1gpu. If that capacity is unavailable, it would fall back to l4-2gpu, and then l4-1gpu, so the model still comes up on the best available hardware without requiring a new CRD.
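Embedded in a full Model manifest, the proposed extension might look like the following. To be clear, this is a sketch of the proposal, not current behavior: today spec.resourceProfile is a single string such as nvidia-gpu-l4:1, and the prioritized map is the suggested change:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Meta-Llama-3.1-8B-Instruct
  engine: VLLM
  resourceProfile:               # proposed: multiple profiles with priorities
    h100-1gpu:
      priority: 1
    l4-2gpu:
      priority: 2
      args: ["--tensor-parallel-size", "2"]  # per-profile args override
    l4-1gpu:
      priority: 3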