Kueue Installation with ODH

0. Prereqs:

0.1. You need to have an OpenShift cluster. Either a medium or a large QuickBurn Fyre cluster will work.

0.2. You need to have a default storage class; otherwise, you can install Portworx. You can check with the command below.
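
For example, a quick check (the default storage class, if one exists, shows "(default)" next to its name):

oc get storageclass
# The default class carries the annotation storageclass.kubernetes.io/is-default-class: "true"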

0.3. You're going to need your oc login info for your cluster so you can log in from your laptop or from the cluster terminal.

For example:

oc login --token=sha256~lamzJ-exoR16UsbltkT-l0nKCL7XTSvLqqB4i54psBM --server=https://api.jimmed414.cp.fyre.ibm.com:6443

Step 1. Install Kueue

More info about Kueue here: https://github.com/kubernetes-sigs/kueue

oc apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml

Check that it started:

oc get pods -n kueue-system
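
If you'd rather wait for the controller to be fully up before moving on, something like this should work (assuming the release manifests create a deployment named kueue-controller-manager, which is the name used upstream):

oc wait deployment/kueue-controller-manager -n kueue-system --for=condition=Available --timeout=300s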

Step 2. Install the Open Data Hub Operator

Using the OpenShift UI, navigate to:

Operators --> OperatorHub

and search for the Open Data Hub Operator and install it using the fast channel. (It should be version 2.Y.Z.)

You can check it with:

oc get pods -n openshift-operators
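
You can also confirm that the operator's ClusterServiceVersion reached the Succeeded phase (the exact CSV name depends on the installed version, so the grep below is just an illustration):

oc get csv -n openshift-operators | grep -i opendatahub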

Step 3. Create the ODH namespace

ODH_NS=opendatahub  # Note, you can change this as you need it for other namespaces
oc new-project ${ODH_NS}
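
Note that oc new-project also switches your current context to the new project; you can double-check with:

oc project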

Step 4. Deploy the DataScienceCluster with:

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: example-dsc
  namespace: ${ODH_NS}
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    workbenches:
      managementState: Managed
EOF

You'll end up with the KubeRay operator, the notebook controllers, and the ODH dashboard, like this:

oc get pods -n ${ODH_NS}

Returns

NAME                                               READY   STATUS    RESTARTS   AGE
kuberay-operator-5d9567bdf4-gshxm                  1/1     Running   0          79s
notebook-controller-deployment-6468bbf669-89gr8    1/1     Running   0          91s
odh-dashboard-649fdc86bb-4n9xb                     2/2     Running   0          93s
odh-dashboard-649fdc86bb-5t7k4                     2/2     Running   0          93s
odh-notebook-controller-manager-86d9b47b54-8jql7   1/1     Running   0          92s
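
You can also inspect the DataScienceCluster resource itself; the exact status fields vary by operator version, but describing it will show whether each component reconciled cleanly:

oc get dsc example-dsc
oc describe dsc example-dsc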

Step 5. Access the spawner page by going to your Open Data Hub dashboard. It'll be in the format of:

https://odh-dashboard-${ODH_NS}.apps.<your cluster's uri>

You can find it with this command:

oc get route -n ${ODH_NS} |grep dash

For example: https://odh-dashboard-odh.apps.jimbig412.cp.fyre.ibm.com/
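
If you'd rather print the full URL in one shot (assuming the route is named odh-dashboard, as in the example above), something like this should work:

echo "https://$(oc get route odh-dashboard -n ${ODH_NS} -o jsonpath='{.spec.host}')"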

- If prompted, give it your kubeadmin user and password
- If prompted, grant it access as well

5.1 On the far left, click on "Data Science Projects" and then click on "Create a Data Science Project". (This will be the name of a new namespace.)

for example:

Name: demo-dsp
Description: Demo's DSP

Then press "Create"

5.2 Within your new Data Science Project, select "Create workbench"

  • give it a name, like "demo-wb"
  • choose "Jupyter Data Science" for the image
  • click "Create workbench" at the bottom.

5.3 You'll see the status as "Starting" initially.

  • Once it's in the running status, click on the blue "Open" link in the workbench to get access to the notebook.

5.4 Click on the black "Terminal" icon under the "Other" section to open up a terminal window.

Inside this terminal, do an "oc login" so that terminal has access to your OpenShift Cluster. For example:

oc login --token=sha256~lamzJ-exoR16UsbltkT-l0nKCL7XTSvLqqB4i54psBM --server=https://api.jimmed414.cp.fyre.ibm.com:6443
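
You can verify that the terminal is logged in and pointed at the right cluster with:

oc whoami
oc whoami --show-server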

5.5 Now you should be able to see the pods on your OpenShift cluster. For example:

oc get pods

Will return the pods in your newly created namespace:

NAME       READY   STATUS    RESTARTS   AGE
demo-wb-0   2/2     Running   0          14m

Step 6. Create the Kueue cluster-wide ResourceFlavor and ClusterQueue.

Some quick definitions:

ResourceFlavor:

An object that you can define to describe what resources are available in a cluster.

In this case, because the resources in our cluster are homogeneous, we use an empty ResourceFlavor.

Note: To associate a ResourceFlavor with a subset of nodes of your cluster, you can configure the .spec.nodeLabels field with matching node labels that uniquely identify the nodes.
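
For example, a hypothetical ResourceFlavor tied to a labeled set of nodes might look like this (the label key and value are placeholders; use whatever labels actually identify your nodes):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "gpu-nodes"
spec:
  nodeLabels:
    example.com/node-type: "gpu"   # placeholder label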

ClusterQueue:

A ClusterQueue is a cluster-scoped object that governs a pool of resources such as Pods, CPU, memory, and hardware accelerators. The ClusterQueue object below defines the available quotas for the default-flavor ResourceFlavor that the cluster queue manages.

6.1 Apply yaml to create the Kueue cluster-wide ResourceFlavor and ClusterQueue:

cat << EOF | oc apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
EOF

Two items get created at the cluster scope:

resourceflavor.kueue.x-k8s.io/default-flavor created
clusterqueue.kueue.x-k8s.io/cluster-queue created

Step 7. Create the local Kueue customization (LocalQueue)

LocalQueue

A LocalQueue is a namespaced object that groups closely related Workloads that belong to a single namespace. A LocalQueue points to one ClusterQueue from which resources are allocated to run its Workloads.

To associate a Job with the LocalQueue in its namespace, add the kueue.x-k8s.io/queue-name label to the Job's metadata.labels and create the Job in that namespace (as shown in Step 8).

7.1 In the notebook terminal, switch to your Data Science Project namespace (for example, oc project demo-dsp) and create a Kueue LocalQueue:

cat << EOF | oc apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: "demo-queue"
spec:
  clusterQueue: "cluster-queue"
EOF

You will end up with one item created, for example:

localqueue.kueue.x-k8s.io/demo-queue created

The ResourceFlavor and ClusterQueue are cluster-scoped, while the LocalQueue is created in the current project (demo-dsp in this example), since the YAML above doesn't specify a namespace:

oc get resourceflavor -A
NAME             AGE
default-flavor   104s

oc get clusterqueue -A
NAME            COHORT   PENDING WORKLOADS
cluster-queue            0

oc get localqueue -A
NAMESPACE   NAME         CLUSTERQUEUE    PENDING WORKLOADS   ADMITTED WORKLOADS
demo-dsp     demo-queue   cluster-queue   0                   0
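
To see how much of the ClusterQueue's quota is currently in use, describing it is usually the quickest way (the exact fields in the output depend on your Kueue version):

oc describe clusterqueue cluster-queue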

Step 8. Using Kueue

8.1 Run a Kueue sample Job against your new LocalQueue (adjust the queue name and namespace to reflect your own names):

cat << EOF | oc apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job-1
  namespace: demo-dsp 
  labels:
    kueue.x-k8s.io/queue-name: demo-queue
spec:
  parallelism: 3
  completions: 3
  suspend: true
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["30s"]
        resources:
          requests:
            cpu: 1
            memory: "200Mi"
      restartPolicy: Never
EOF

8.2. Check that the job and pods start:

watch oc get jobs,pods

The pods will go to Completed status after about 30 seconds, for example:

Every 2.0s: oc get jobs,pods                                                                                         jim-wb-0: Wed Jan 17 19:30:03 2024

NAME                     COMPLETIONS   DURATION   AGE
job.batch/sample-job-1   3/3           34s        40s

NAME                     READY   STATUS      RESTARTS   AGE
pod/demo-wb-0             2/2     Running     0          22m
pod/sample-job-1-jc2z5   0/1     Completed   0          40s
pod/sample-job-1-pctkv   0/1     Completed   0          40s
pod/sample-job-1-tk88f   0/1     Completed   0          40s
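
Kueue tracks each admitted Job through a Workload object in the same namespace; listing them is a handy way to confirm the Job really went through Kueue (the generated Workload names will differ in your cluster):

oc get workloads -n demo-dsp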

8.3 Remove the job when you're done with it:

oc delete job sample-job-1

and it'll return:

job.batch "sample-job-1" deleted

Step 9. Cleanup Steps

9.1 Cleanup your jobs, for example:

oc delete job sample-job-1

9.2 Delete your localqueue, for example:

oc delete localqueue demo-queue 

9.3 Delete your cluster-wide Kueue items, for example:

oc delete clusterqueue cluster-queue
oc delete resourceflavor default-flavor

9.4 Exit out of the notebook and delete the notebook resources, for example:

oc delete notebook demo-wb
oc delete pvc demo-wb
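
If you're not sure of the exact resource names, list the notebook and its PVC first (the PVC name usually matches the workbench name, but it can differ):

oc get notebook,pvc -n demo-dsp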

9.5 Delete the dsc, for example:

oc delete dsc example-dsc

9.6 Delete the Kueue installation, for example:

oc delete -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.5.1/manifests.yaml

9.7 Find the subscription and csv for ODH and delete them. For example:

oc get csv,sub -n openshift-operators

and then delete them:

oc delete csv opendatahub-operator.v2.4.0 -n openshift-operators; oc delete sub opendatahub-operator -n openshift-operators

9.8 Delete your Data Science Project, for example:

oc delete ns demo-dsp