8 changes: 5 additions & 3 deletions kubeai/README.md
@@ -13,7 +13,7 @@ The following features are available at the moment.
- Persistent Volume cache for models - tested/working
- Model downloading & inference engine deployment - tested/working
- Scaling pods to/from zero - tested/working
- Load based autoscaling - not tested/included
- Load based autoscaling - tested/working
- Integration with OPEA application - missing

The following models are included.
@@ -60,7 +60,7 @@ kubectl explain models.kubeai.org

# Deploying the Models

This section describes how to deploy various models. All the examples below use Kubernetes Persistent Volumes and Claims (PV/PVC) to store the models. The Kubernetes Storage Class (SC) is called `standard`. You can tune the storage configuration to match your environment during the installation (see `opea-values.yaml`, `cacheProfiles` for more information).
This section describes how to deploy various models. All the examples below use Kubernetes Persistent Volumes and Claims (PV/PVC) to store the models. The Kubernetes Storage Class (SC) is called `standard`. You can tune the storage configuration to match your environment during the installation (see `cacheProfiles` in `opea-values.yaml`).
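As a rough sketch, a cache profile in `opea-values.yaml` could look like the following (the profile name and storage class here are illustrative, not taken verbatim from this repository):

```yaml
cacheProfiles:
  standard:
    sharedFilesystem:
      storageClassName: "standard"
```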

The models in the examples below are deployed to `$NAMESPACE`. Please set that according to your needs.
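For example (the namespace name below is only an illustration):

```bash
export NAMESPACE=kubeai-models
kubectl create namespace $NAMESPACE
```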

@@ -98,7 +98,9 @@ kubectl apply -f models/llama-3.1-8b-instruct-gaudi.yaml -n $NAMESPACE
kubectl apply -f models/llama-3.3-70b-instruct-gaudi.yaml -n $NAMESPACE
```

The rest is the same as in the previous example. You should see a pod running with the name `model-llama-3.1-8b-instruct-gpu-xxxx` and/or `model-llama-3.3-70b-instruct-gpu-xxxx`.
The rest is the same as in the previous example. You should see a pod running with the name `model-llama-3.1-8b-instruct-gaudi-xxxx`. When the request load for that model increases enough, KubeAI automatically deploys more instances (the model's `maxReplicas` is greater than its `minReplicas`).

The latter model is set to scale from zero (`minReplicas` = 0), so `model-llama-3.3-70b-instruct-gaudi-xxxx` pod(s) are present only while KubeAI is receiving requests for that model. This avoids multiple devices being exclusively reserved for an idle pod, but significantly slows down the first response.
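A rough way to exercise both behaviors (this sketch assumes the default `kubeai` service on port 80 and the model names used in the manifests above; adjust the namespace if KubeAI itself is installed elsewhere):

```bash
# Reach KubeAI's OpenAI-compatible API from the workstation
kubectl port-forward svc/kubeai 8000:80 -n $NAMESPACE &

# First request to the scale-from-zero model; the response is slow while the pod starts up
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.3-70b-instruct-gaudi", "messages": [{"role": "user", "content": "Hello"}]}'

# Watch pods being created and removed as the load changes
kubectl get pods -n $NAMESPACE -w
```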

## Text Embeddings with BGE on CPU

6 changes: 5 additions & 1 deletion kubeai/models/llama-3.1-8b-instruct-gaudi.yaml
@@ -17,6 +17,10 @@ spec:
    - --max-num-seqs=256
    - --max-seq-len-to-capture=2048
  env:
    OMPI_MCA_btl_vader_single_copy_mechanism: none
    OMPI_MCA_btl_vader_single_copy_mechanism: "none"
    # vLLM startup takes too long for autoscaling, especially with Gaudi
    VLLM_SKIP_WARMUP: "true"
  minReplicas: 1
  maxReplicas: 4
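  # autoscaling target: average number of in-flight requests per replica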
  targetRequests: 120
  resourceProfile: gaudi-for-text-generation:1
4 changes: 3 additions & 1 deletion kubeai/models/llama-3.3-70b-instruct-gaudi.yaml
@@ -19,8 +19,10 @@ spec:
  env:
    OMPI_MCA_btl_vader_single_copy_mechanism: none
    PT_HPU_ENABLE_LAZY_COLLECTIVES: "true"
    # vLLM startup takes too long for autoscaling, especially with Gaudi
    VLLM_SKIP_WARMUP: "true"

  minReplicas: 1
  # scale-from-zero avoids idle instance occupying half a node, but causes long delay
  minReplicas: 0
  maxReplicas: 2
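  # ':4' multiplies the gaudi-for-text-generation profile, so each replica reserves 4 Gaudi devices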
  resourceProfile: gaudi-for-text-generation:4
2 changes: 2 additions & 0 deletions kubeai/opea-values.yaml
@@ -28,3 +28,5 @@ resourceProfiles:
    requests:
      cpu: "2"
      memory: "2Gi"
    nodeSelector:
      #kubeai-inference: "true"