Dynamic Accelerator Slicer (DAS) is an operator that dynamically partitions GPU accelerators in Kubernetes and OpenShift. It currently ships with a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to support additional technologies such as NVIDIA MPS or GPUs from other vendors.
Minimum supported OpenShift versions: 4.18.21 and 4.19.6.
- Dynamic Accelerator Slicer (DAS) Operator
- On-demand partitioning of GPUs via a custom Kubernetes operator.
- Scheduler integration that allocates NVIDIA MIG slices through a plugin located at `pkg/scheduler/plugins/mig/mig.go`.
- `AllocationClaim` custom resource to track slice reservations (`pkg/apis/dasoperator/v1alpha1/allocation_types.go`).
- Emulated mode to exercise the workflow without real hardware.
This project uses `just` for task automation. Install `just` first:

```bash
# On macOS
brew install just

# On Fedora/RHEL
dnf install just

# On Ubuntu/Debian
apt install just

# Or via cargo
cargo install just
```
- Configure your images by editing `related_images.your-username.json` with your registry:

  ```json
  [
    {"name": "instaslice-operator-next", "image": "quay.io/your-username/instaslice-operator:latest"},
    {"name": "instaslice-webhook-next", "image": "quay.io/your-username/instaslice-webhook:latest"},
    {"name": "instaslice-scheduler-next", "image": "quay.io/your-username/instaslice-scheduler:latest"},
    {"name": "instaslice-daemonset-next", "image": "quay.io/your-username/instaslice-daemonset:latest"}
  ]
  ```
- Build and push all images:

  ```bash
  just build-push-parallel
  ```
- Deploy to OpenShift (with emulated mode for development):

  ```bash
  export EMULATED_MODE=enabled
  export RELATED_IMAGES=related_images.your-username.json
  just deploy-das-ocp
  ```
- Test the installation:

  ```bash
  kubectl apply -f test/test-pod-emulated.yaml
  ```
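To confirm the emulated workload was admitted, you can reuse the inspection commands shown in the troubleshooting section further down:

```bash
# The test pod should reach Running (adjust -n to wherever the pod was
# created), and a corresponding AllocationClaim should appear.
kubectl get pods
kubectl get allocationclaims -n das-operator
```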
For OpenShift clusters with GPU hardware:
- Deploy prerequisites:

  ```bash
  just deploy-cert-manager-ocp
  just deploy-nfd-ocp
  just deploy-nvidia-ocp
  ```
- Deploy DAS operator:

  ```bash
  export EMULATED_MODE=disabled
  export RELATED_IMAGES=related_images.your-username.json
  just deploy-das-ocp
  ```
- Test with GPU workload:

  ```bash
  kubectl apply -f test/test-pod.yaml
  ```
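For reference, a sliced workload is an ordinary pod that requests a MIG-sized extended resource. The sketch below is illustrative only — the profile name and container image are assumptions, and `test/test-pod.yaml` remains the authoritative example:

```yaml
# Hedged sketch of a MIG-slice workload. The resource name and image below
# are assumptions for illustration; use test/test-pod.yaml as the reference.
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubi9   # assumed image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1          # assumed MIG profile name
```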
For local development:
- Run operator locally (requires scheduler, webhook, and daemonset images to be built and pushed beforehand):

  ```bash
  # Build and push images first
  just build-push-parallel

  # Run operator locally
  # Set EMULATED_MODE to control hardware emulation
  EMULATED_MODE=enabled just run-local
  ```
- Run tests:

  ```bash
  just test-e2e
  ```
- Check code quality:

  ```bash
  just lint
  ```
- Log in to podman and have a repository created for the operator bundle.
- Set `BUNDLE_IMAGE` to point to your repository and tag of choice.
- Run `just bundle-generate` to generate the bundle manifests.
- Run `just build-push-bundle` to build and push the bundle image to your repository.
- Run `just deploy-cert-manager-ocp` to install cert-manager on OpenShift.
- Run `just deploy-nfd-ocp` to install Node Feature Discovery (NFD) on OpenShift.
- Run `just deploy-nvidia-ocp` to install the NVIDIA GPU operator on OpenShift.
- Run `operator-sdk run bundle --namespace <namespace> ${BUNDLE_IMAGE}` to deploy the operator.
- Apply the `DASOperator` custom resource to initialize the operator:

  ```bash
  kubectl apply -f deploy/03_instaslice_operator.cr.yaml
  ```
Running generate bundle is the first step to publishing an operator to a catalog
and deploying it with OLM. A CSV manifest is generated by collecting data from the
set of manifests passed to this command, such as CRDs, RBAC, etc., and applying
that data to a "base" CSV manifest.
The steps to provide a base CSV:
- Create a base CSV file that contains the desired metadata. The file name can be arbitrary; by convention it is `{operator-name}.base.clusterserviceversion.yaml`.
- Put the base CSV file in the `deploy` folder. This is the folder from which the `generate bundle` command collects the Kubernetes manifests. Note that the base CSV file can be placed inside a sub-directory within the `deploy` folder.
- Make sure that the `metadata.name` of the base CSV matches the package name provided to the `generate bundle` command, otherwise the `generate bundle` command will ignore the base CSV and generate from an empty CSV.
Layout of an example deploy folder:
```
tree deploy/
deploy/
├── crds
│   └── foo-operator.crd.yaml
├── base-csv
│   └── foo-operator.base.clusterserviceversion.yaml
├── deployment.yaml
├── role.yaml
├── role_binding.yaml
├── service_account.yaml
└── webhooks.yaml
```

The bundle generation command:

```bash
operator-sdk generate bundle --input-dir deploy --version 0.1.0 --output-dir=bundle --package foo-operator
```

The base CSV yaml:
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.base
  annotations:
    alm-examples:
    # other annotations can be placed here
spec:
  displayName: Instaslice
  version: 0.0.2
  apiservicedefinitions:
  customresourcedefinitions:
  install:
  installModes:
  - supported: false
    type: OwnNamespace
  - supported: false
    type: SingleNamespace
  - supported: false
    type: MultiNamespace
  - supported: true
    type: AllNamespaces
  maturity: alpha
  minKubeVersion: 1.16.0
  provider:
    name: Codeflare
    url: https://github.com/openshift/instaslice-operator
  relatedImages:
  keywords:
  - Foo
  links:
  - name: My Operator
    url: https://github.com/foo/bar
  maintainers:
  description:
  icon:
```

- There is no need to provide any permission or deployment spec inside the base CSV.
- Note that the `metadata.name` of the base CSV has a prefix of `foo-operator.`, which adheres to the format `{package name}`.
- If there are multiple CSV files inside the deploy folder, the one encountered first in lexical order will be selected as the base CSV.
The CSV generation details can be found by inspecting the bundle generation code here: https://github.com/operator-framework/operator-sdk/blob/0eefc52889ff3dfe4af406038709e6c5ba7398e5/internal/generate/clusterserviceversion/clusterserviceversion.go#L148-L159
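After generating the bundle it can be sanity-checked with the standard operator-sdk validator, assuming the `--output-dir=bundle` used in the example above:

```bash
# Validate the generated bundle manifests before pushing the bundle image
operator-sdk bundle validate ./bundle
```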
Emulated mode allows the operator to publish synthetic GPU capacity and skip NVML calls. This is handy for development and CI environments with no hardware. Emulated mode is controlled via the EMULATED_MODE environment variable.
The EMULATED_MODE environment variable is read by the operator at startup and determines how the daemonset components behave:
- `disabled` (default): normal operation that requires real MIG-capable GPU hardware and makes NVML calls
- `enabled`: emulated mode that simulates MIG-capable GPU capacity without requiring actual hardware
For local development:

```bash
# Run operator locally with emulation
EMULATED_MODE=enabled just run-local
```

For deployment:

```bash
# Deploy with emulated mode enabled
export EMULATED_MODE=enabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
```

For production with MIG-capable GPUs:

```bash
# Deploy with emulated mode disabled (default)
export EMULATED_MODE=disabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
```

The operator reads the EMULATED_MODE environment variable at startup and passes this configuration to the daemonset pods running on each node. When emulated mode is enabled:
- The daemonset skips hardware detection and NVML library calls
- Synthetic GPU resources are published to simulate hardware capacity
- MIG slicing operations are simulated rather than performed on real hardware
This allows for testing and development of the operator functionality without requiring physical GPU hardware.
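One way to confirm that synthetic capacity is being advertised is to inspect a node's allocatable resources; the exact MIG resource names published by the daemonset are not listed here, so treat the expected output as an assumption:

```bash
# Dump allocatable resources on a node; with emulated mode enabled, MIG
# slice entries should appear (exact resource names may vary).
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```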
This project includes a Justfile for convenient task automation. The Justfile provides several commands for building, pushing, and deploying the operator components.
Install the just command runner:

```bash
# On macOS
brew install just

# On Fedora/RHEL
dnf install just

# On Ubuntu/Debian
apt install just

# Or via cargo
cargo install just
```

List all available commands:
```bash
just
```

View current configuration:

```bash
just info
```

Run the operator locally for development:

```bash
# Set EMULATED_MODE to 'enabled' for simulated GPUs or 'disabled' for real hardware
EMULATED_MODE=enabled just run-local
```

Run end-to-end tests:

```bash
just test-e2e
```

Run tests with a specific focus:

```bash
just test-e2e focus="GPU slices"
```

Generate operator bundle:

```bash
just bundle-generate
```

Build and push bundle image:

```bash
just build-push-bundle
```

Build and push developer bundle:

```bash
just build-push-developer-bundle
```

Deploy NVIDIA GPU operator to OpenShift:

```bash
just deploy-nvidia-ocp
```

Remove NVIDIA GPU operator from OpenShift:

```bash
just undeploy-nvidia-ocp
```

Deploy cert-manager for OpenShift:

```bash
just deploy-cert-manager-ocp
```

Remove cert-manager from OpenShift:

```bash
just undeploy-cert-manager-ocp
```

Deploy cert-manager for Kubernetes:

```bash
just deploy-cert-manager
```

Deploy Node Feature Discovery (NFD) operator for OpenShift:

```bash
just deploy-nfd-ocp
```

Run all linting (markdown and Go):

```bash
just lint
```

Run all linting with automatic fixes:

```bash
just lint-fix
```

Run only Go linting:

```bash
just lint-go
```

Run only markdown linting:

```bash
just lint-md
```

Run Go linting and automatically fix issues:

```bash
just lint-go-fix
```

Run markdown linting and automatically fix issues:

```bash
just lint-md-fix
```

Clean up all deployed Kubernetes resources:

```bash
just undeploy
```

Build and push individual component images:
```bash
just build-push-scheduler   # Build and push scheduler image
just build-push-daemonset   # Build and push daemonset image
just build-push-operator    # Build and push operator image
just build-push-webhook     # Build and push webhook image
```

Build and push all images in parallel:

```bash
just build-push-parallel
```

Deploy DAS on OpenShift Container Platform:

```bash
just deploy-das-ocp
```

Generate CRDs and clients:

```bash
just regen-crd          # Generate CRDs into manifests directory
just regen-crd-k8s      # Generate CRDs directly into deploy directory
just generate-clients   # Generate client code
just verify-codegen     # Verify generated client code is up to date
just generate           # Generate all - CRDs and clients
```

Copy related_images.developer.json to related_images.username.json to use as a template and modify it to contain the target developer image repositories to use:

```bash
cp related_images.developer.json related_images.username.json
# Edit related_images.username.json with your registry, e.g. quay.io/username/image:latest
```

Then set the RELATED_IMAGES environment variable to related_images.username.json:

```bash
RELATED_IMAGES=related_images.username.json just
```

The Justfile uses environment variables for configuration. You can customize these by setting them in your environment or creating a .env file:

- `PODMAN` - Container runtime (default: `podman`)
- `KUBECTL` - Kubernetes CLI (default: `oc`)
- `EMULATED_MODE` - Enable emulated mode (default: `disabled`)
- `RELATED_IMAGES` - Path to related images JSON file (default: `related_images.json`)
- `DEPLOY_DIR` - Deployment directory (default: `deploy`)
- `OPERATOR_SDK` - Operator SDK binary (default: `operator-sdk`)
- `OPERATOR_VERSION` - Operator version for bundle generation (default: `0.1.0`)
- `GOLANGCI_LINT` - Golangci-lint binary (default: `golangci-lint`)
Example:

```bash
export EMULATED_MODE=enabled
just deploy-das-ocp
```

The diagram below summarizes how the operator components interact. Pods requesting GPU slices are mutated by a webhook to use the `mig.das.com` extended resource. The scheduler plugin tracks slice availability and creates `AllocationClaim` objects processed by the device plugin on each node.
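To observe this mutation on an admitted workload, dump the container resources of the pod; with the webhook in place they should reference a `mig.das.com` extended resource (the exact profile suffix depends on the slice requested):

```bash
# Inspect the (mutated) resource requests of an admitted pod; expect a
# mig.das.com/* extended resource in the limits.
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
```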
The plugin integrates with the Kubernetes scheduler and runs through three framework phases:
- Filter – ensures the node is MIG capable and stages `AllocationClaim`s for suitable GPUs.
- Score – prefers nodes with the most free MIG slice slots after considering existing and staged claims.
- PreBind – promotes staged claims on the selected node to `created` and removes the rest.
Once promoted, the device plugin provisions the slices.
The daemonset advertises GPU resources only after the NVIDIA GPU Operator's
ClusterPolicy reports a Ready state. This prevents the scheduler from
scheduling pods on a node before the GPU Operator has initialized the drivers.
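One way to check this precondition is to query the ClusterPolicy status directly; the field path below follows common NVIDIA GPU Operator releases and may differ between versions:

```bash
# The daemonset advertises GPU resources only once this reports ready
kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[*].status.state}'
```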
AllocationClaim is a namespaced CRD that records which MIG slice will be prepared for a pod. Claims start in the
staged state and transition to created once all requests are satisfied. Each claim stores the GPU UUID, slice
position and pod reference.
Example:

```console
$ kubectl get allocationclaims -n das-operator
NAME                                          AGE
8835132e-8a7a-4766-a78f-0cb853d165a2-busy-0   61s

$ kubectl get allocationclaims -n das-operator -o yaml
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
...
```

All components run in the das-operator namespace:

```bash
kubectl get pods -n das-operator
```

Inspect the active claims:

```bash
kubectl get allocationclaims -n das-operator
```

On the node, verify that the CDI devices were created:

```bash
ls -l /var/run/cdi/
```

Increase verbosity by editing the DASOperator resource and setting `operatorLogLevel` to `Debug` or `Trace`.
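As a rough illustration, the relevant fragment of the `DASOperator` spec looks like the snippet below; treat the exact shape as an assumption and use `deploy/03_instaslice_operator.cr.yaml` as the reference manifest:

```yaml
# Fragment only -- see deploy/03_instaslice_operator.cr.yaml for the full
# DASOperator custom resource.
spec:
  operatorLogLevel: Debug   # or Trace for even more verbose logging
```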
Run all unit tests for the project:
```bash
make test
```

Run unit tests with verbose output:

```bash
go test -v ./pkg/...
```

Run unit tests with coverage:

```bash
go test -cover ./pkg/...
```

A running cluster with a valid KUBECONFIG is required:

```bash
just test-e2e
```

You can focus on specific tests:

```bash
just test-e2e focus="GPU slices"
```

Due to kubernetes/kubernetes#128043
pods may enter an UnexpectedAdmissionError state if admission fails. Pods
managed by higher level controllers such as Deployments will be recreated
automatically. Naked pods, however, must be cleaned up manually with
kubectl delete pod. Using controllers is recommended until the upstream issue
is resolved.
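Affected naked pods can be found and removed with standard kubectl commands:

```bash
# List pods stuck in UnexpectedAdmissionError, then delete them manually
kubectl get pods -A | grep UnexpectedAdmissionError
kubectl delete pod <pod-name> -n <namespace>
```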
Remove the deployed resources with:
```bash
just undeploy
```

Contributions are welcome! Please open issues or pull requests.
This project is licensed under the Apache 2.0 License.
