This repo hosts a Kubernetes operator that is responsible for creating and managing Llama Stack servers.
- Automated deployment of Llama Stack servers
- Support for multiple distributions (including Ollama, vLLM, and others)
- Customizable server configurations
- Volume management for model storage
- Kubernetes-native resource management
You can install the operator directly from a released version or the latest main branch using `kubectl apply -f`.
To install the latest version from the main branch:
```bash
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/main/release/operator.yaml
```

To install a specific released version (e.g., v1.0.0), replace `main` with the desired tag:

```bash
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/v1.0.0/release/operator.yaml
```

- Deploy the inference provider server (ollama, vllm)
Ollama Examples:
Deploy Ollama with the default model (`llama3.2:1b`):

```bash
./hack/deploy-quickstart.sh
```

Deploy Ollama with a different model:

```bash
./hack/deploy-quickstart.sh --provider ollama --model llama3.2:7b
```

vLLM Examples:
This requires a secret named `hf-token-secret` containing your Hugging Face token (needed to download models) to be created in advance in the `vllm-dist` namespace.
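For example, it could be created like this (a minimal sketch; the key name `token` inside the secret is an assumption, so check the quickstart script for the key it actually expects):

```bash
# Create the namespace first if it does not exist yet
kubectl create namespace vllm-dist
kubectl create secret generic hf-token-secret \
  --from-literal=token=<your-hf-token> \
  -n vllm-dist
```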
Deploy vLLM with the default model (`meta-llama/Llama-3.2-1B`):

```bash
./hack/deploy-quickstart.sh --provider vllm
```

Deploy vLLM with GPU support:

```bash
./hack/deploy-quickstart.sh --provider vllm --runtime-env "VLLM_TARGET_DEVICE=gpu,CUDA_VISIBLE_DEVICES=0"
```

- Create a LlamaStackDistribution CR to get the server running. Example:
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastackdistribution-sample
spec:
  replicas: 1
  server:
    distribution:
      name: starter
    containerSpec:
      env:
        - name: OLLAMA_INFERENCE_MODEL
          value: "llama3.2:1b"
        - name: OLLAMA_URL
          value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
    storage:
      size: "20Gi"
      mountPath: "/home/lls/.lls"
```
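Save the manifest to a file and apply it (the filename here is just an example):

```bash
kubectl apply -f llamastackdistribution-sample.yaml
```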
- Verify the server pod is running in the user-defined namespace.
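For example (replace `<namespace>` with wherever you created the CR; the lowercase plural resource name is the usual way kubectl exposes a CRD):

```bash
kubectl get llamastackdistributions -n <namespace>
kubectl get pods -n <namespace>
```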
A ConfigMap can be used to store the `run.yaml` configuration for each LlamaStackDistribution. Updates to the ConfigMap will restart the Pod to load the new data.

Example that creates a `run.yaml` ConfigMap and a LlamaStackDistribution referencing it:

```bash
kubectl apply -f config/samples/example-with-configmap.yaml
```
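For orientation, the wiring might look roughly like the sketch below. The `userConfig.configMapName` field name is an assumption here; check `config/samples/example-with-configmap.yaml` and the API documentation for the exact schema.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-config
data:
  run.yaml: |
    # your Llama Stack run configuration goes here
---
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastack-with-config
spec:
  replicas: 1
  server:
    distribution:
      name: starter
    userConfig:
      configMapName: llama-stack-config  # assumed field name; see the sample for the real schema
```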
The operator can create an ingress-only NetworkPolicy for every LlamaStackDistribution to ensure traffic is limited to:
- Other pods in the same namespace that are part of the Llama Stack deployment (`app.kubernetes.io/part-of: llama-stack`)
- Components that run inside the operator namespace (default: `llama-stack-k8s-operator-system`)
This behavior is guarded by a feature flag and is disabled by default to avoid interfering with existing cluster-level policies. To enable it:
- Identify the namespace where the operator is running. If you used the provided manifests, it is `llama-stack-k8s-operator-system`.
- Create or update the `llama-stack-operator-config` ConfigMap in that namespace so that the `featureFlags` entry enables the network policy flag.
```bash
cat <<'EOF' > feature-flags.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-operator-config
  namespace: llama-stack-k8s-operator-system
data:
  featureFlags: |
    enableNetworkPolicy:
      enabled: true
EOF
kubectl apply -f feature-flags.yaml
```

Within the next reconciliation loop, the operator will begin creating a `<name>-network-policy` resource for each distribution.
Set `enabled: false` (or remove the block) to turn the feature back off; the operator will delete the previously managed policies.
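To confirm that the policies were created (or later removed), list them in a distribution's namespace:

```bash
kubectl get networkpolicy -n <namespace>
```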
The operator supports ConfigMap-driven image updates for LLS Distribution images. This allows independent patching for security fixes or bug fixes without requiring a new operator version.
Create or update the operator ConfigMap with an `image-overrides` key:
```yaml
image-overrides: |
  starter-gpu: quay.io/custom/llama-stack:starter-gpu
  starter: quay.io/custom/llama-stack:starter
```

Use the distribution name directly as the key (e.g., `starter-gpu`, `starter`). The operator will apply these overrides automatically.
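In context, the full ConfigMap might look like this sketch, reusing the operator ConfigMap name and namespace shown elsewhere in this document:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-operator-config
  namespace: llama-stack-k8s-operator-system
data:
  image-overrides: |
    starter-gpu: quay.io/custom/llama-stack:starter-gpu
    starter: quay.io/custom/llama-stack:starter
```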
To update the LLS Distribution image for all `starter` distributions:

```bash
kubectl patch configmap llama-stack-operator-config -n llama-stack-k8s-operator-system --type merge -p '{"data":{"image-overrides":"starter: quay.io/opendatahub/llama-stack:latest"}}'
```

This will cause all LlamaStackDistribution resources using the `starter` distribution to restart with the new image.
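You can then confirm that the pods restarted with the new image (adjust the namespace to wherever your distributions run):

```bash
kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```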
- Kubernetes cluster (v1.20 or later)
- Go 1.24
- operator-sdk v1.39.2 (v4 layout) or newer
- kubectl configured to access your cluster
- A running inference server:
  - For local development, you can use the provided script: `./hack/deploy-quickstart.sh`
- Prepare release files with specific versions:

  ```bash
  make release VERSION=0.2.1 LLAMASTACK_VERSION=0.2.12
  ```

  This command updates distribution configurations and generates release manifests with the specified versions.
- A custom operator image can be built from your local repository:

  ```bash
  make image IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  The default image is `quay.io/llamastack/llama-stack-k8s-operator:latest` when no argument is supplied to `make image`. You can create a local file `local.mk` with environment variables to overwrite the default values set in the `Makefile`.
- Building multi-architecture images (ARM64, AMD64, etc.)

  The operator supports building for multiple architectures, including ARM64. To build and push multi-arch images:

  ```bash
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  By default, this builds for `linux/amd64,linux/arm64`. You can customize the platforms by setting the `PLATFORMS` variable:

  ```bash
  # Build for specific platforms
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag> PLATFORMS=linux/amd64,linux/arm64

  # Add more architectures (e.g., for future support)
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag> PLATFORMS=linux/amd64,linux/arm64,linux/s390x,linux/ppc64le
  ```

  Note:

  - The `image-buildx` target works with both Docker and Podman. It will automatically detect which tool is being used.
  - Native cross-compilation: The Dockerfile uses `--platform=$BUILDPLATFORM` to run Go compilation natively on the build host, avoiding QEMU emulation for the build process. This dramatically improves build speed and reliability. Only the minimal final stage (package installation) runs under QEMU for cross-platform builds.
  - FIPS adherence: Native builds use `CGO_ENABLED=1` with full OpenSSL FIPS support. Cross-compiled builds use `CGO_ENABLED=0` with pure Go FIPS (via `GOEXPERIMENT=strictfipsruntime`). Both approaches are Designed for FIPS.
  - For Docker: Multi-arch builds require Docker Buildx. Ensure Docker Buildx is set up:

    ```bash
    docker buildx create --name x-builder --use
    ```

  - For Podman: Podman 4.0+ supports `podman buildx` (experimental). If buildx is unavailable, the Makefile will automatically fall back to Podman's native manifest-based multi-arch build approach.
  - The resulting images are multi-arch manifest lists, which means Kubernetes will automatically select the correct architecture when pulling the image.
- Building ARM64-only images

  To build a single ARM64 image (useful for testing or ARM-native systems):

  ```bash
  make image-build-arm IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  make image-push IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  This works with both Docker and Podman.
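To verify that a pushed image is a multi-arch manifest list, inspect it with Docker Buildx's `imagetools` subcommand (`podman manifest inspect` is the Podman equivalent):

```bash
docker buildx imagetools inspect quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
```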
Once the image is created, the operator can be deployed directly. For each deployment method, a kubeconfig should be exported:

```bash
export KUBECONFIG=<path to kubeconfig>
```
Deploying operator locally
- Deploy the created image in your cluster using the following command:

  ```bash
  make deploy IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

- To remove the resources created during installation, use:

  ```bash
  make undeploy
  ```
The operator includes end-to-end (E2E) tests that verify its complete functionality. To run the E2E tests:
- Ensure you have a running Kubernetes cluster
- Run the E2E tests using one of the following commands:
  - If you want to deploy the operator and run tests: `make deploy test-e2e`
  - If the operator is already deployed: `make test-e2e`
The make target will handle prerequisites, including deploying the Ollama server.
Please refer to the API documentation.