Infrastructure-as-code for deploying a multi-tenant GenAIOps learning environment on OpenShift.
This repository provides a complete deployment solution for the AI501 GenAIOps lab environment. It enables instructors to rapidly provision a multi-tenant AI/ML lab where students can experiment with Large Language Models, prompt engineering, model optimization, and AI governance.
- OpenShift 4.119+ cluster with cluster-admin access
- Helm 3.x installed
ocCLI configured and authenticated- (Optional) AWS credentials for GPU machine provisioning
This lab requires 3 GPU nodes with two different taints:
| Component | GPU Count | Instance Type | Taint |
|---|---|---|---|
| Docling Serve | 1 | g4dn (T4) | nvidia.com/gpu=g4dn |
| Llama 3.2 (cloud-model) | 1 | g5 (A10G) | nvidia.com/gpu=g5 |
| Llama 3.2 FP8 (quantized-model) | 1 | g5 (A10G) | nvidia.com/gpu=g5 |
If you have different taints, make sure you update the necessary template for the deployments' tolerations.
-
Clone the repository
git clone <repository-url> cd deploy-lab
-
Configure your deployment (Required)
Edit
student-content/values.yamlbefore running the installation:cluster_domain: apps.your-cluster.example.com # Your OpenShift apps domain attendees: 20 # Number of students to create
To find your cluster domain:
oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}' -
Run the installation
./install.sh
Edit student-content/values.yaml before installation:
| Parameter | Description | Default |
|---|---|---|
cluster_domain |
OpenShift apps domain (e.g., apps.mycluster.example.com) |
apps.example.com |
modelName |
Default LLM model name | llama32 |
attendees |
Number of student environments to create | 20 |
deploy-lab/
├── install.sh # Main deployment script
├── operators/ # Helm chart: OpenShift operators
├── toolings/ # Helm chart: shared infrastructure
├── student-content/ # Helm chart: student environments
│ ├── values.yaml # Configuration values
│ └── templates/ # Kubernetes manifests
│ ├── tinyllama/ # TinyLlama model serving
│ ├── cloud-model/ # Larger model serving
│ ├── exercise-app/ # Learning portal
│ ├── gradio/ # Interactive LLM demos
│ ├── canopy-board/ # Gamification dashboard
│ ├── argocd-instance/ # Per-user GitOps
│ └── ...
├── custom-codeserver/ # Custom workbench image
├── exercise-app/ # FastAPI exercise portal
├── gradio-app/ # Gradio LLM interfaces
├── canopy-board/ # Gamification system
├── user-signup-app/ # Student registration
├── prompt-tracker/ # Git monitoring dashboard
└── rdu-website/ # Demo marketing site
Installs required OpenShift operators:
- Red Hat OpenShift AI (RHOAI)
- OpenShift Pipelines (Tekton)
- OpenShift GitOps (ArgoCD)
- NVIDIA GPU Operator
- Node Feature Discovery
Deploys shared infrastructure:
- MinIO (S3-compatible storage)
- TrustyAI (AI explainability)
- Observability stack (Prometheus, Grafana, Tempo)
- Custom workbench templates
Per-student resources:
- ArgoCD instance for GitOps workflows
- Data Science project workspace
- Access to shared LLM models
- Interactive learning applications
| Application | Description | Path |
|---|---|---|
| Exercise Portal | FastAPI gateway to learning modules | exercise-app/ |
| Gradio App | Interactive LLM experimentation | gradio-app/ |
| Canopy Board | Gamification with leaderboards | canopy-board/ |
| User Signup | Student registration system | user-signup-app/ |
| Prompt Tracker | Git-based prompt versioning | prompt-tracker/ |
Pre-configured model deployments using KServe and vLLM:
- TinyLlama 1.1B: CPU-optimized for basic inference
- Llama 3.2: Full-featured model for production use
- Quantized Models: FP8 compressed for efficiency
After running install.sh:
-
Verify operators are ready:
oc get csv -n openshift-operators
-
Check student namespaces:
oc get namespaces | grep user -
Access the signup app to distribute credentials to students
-
Monitor ArgoCD for GitOps sync status:
oc get applications -A
For GPU workloads, use the machine provisioning script:
./machineset.shSupports: T4, A10G, A100, H100, L40 instances
helm list -n ai501oc get pods -n ai501