Rancher-Kubeflow-Harvester-RKE2.adoc
Current state of this doc (05/24/23): Seems pretty stable. Recent deployments had a few issues with Python failures in Kubeflow pods, but these were almost certainly not due to this procedure.

Important
This guide assumes Rancher Management Server is installed and configured with Harvester under Rancher management, and an AWS Route53 DNS domain is available.

Overall steps to complete this task

  1. Deploy an RKE2 cluster on Harvester

  2. Deploy the MetalLB load balancer

  3. Verify the correct operation of MetalLB and the Harvester/Longhorn CSI

  4. Install kustomize 5.0.0 or higher

  5. Deploy Kubeflow

  6. Verify the Kubeflow installation

  7. Update Istio to use the MetalLB load balancer

  8. Enable HTTPS on the Kubeflow Istio Gateway

  9. Update AWS Route53

  10. Configure cert-manager to manage Let’s Encrypt certificates, using Route 53 DNS records

  11. Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Staging certificate

  12. Update the configuration to use a Let’s Encrypt production certificate

  13. Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Production certificate

  14. Optionally, set the kubeflow-gateway to redirect HTTP to HTTPS

Note
Many of the following steps will be performed from an AMD/Intel Linux workstation with access to Rancher, Harvester, and the Internet.

Deploy an RKE2 cluster on Harvester

Prepare the RKE2 Kubernetes cluster
  • For this effort, we used Harvester, managed through Rancher, to create a five-node RKE2 cluster

  • A project in Harvester named kubeflow-on-harvester contains a namespace named kubeflow-cluster

Note
Create a cluster through the Rancher UI
The resource allocations used here were for basic testing purposes. It is likely that more CPU and RAM would be required for the workload-plane VMs to support a useful Kubeflow workload.
  • The cluster name set to:

kubeflow-on-harvester
  • Three instances make up the control-plane and etcd pool:

    • Configure resources of 4 vCPU, 8GB RAM, 40GB boot drive

  • Two VMs make up the workload-plane pool:

    • Configure resources of 8 vCPU, 16GB RAM, 40GB boot drive

  • The VMs can be created in any Harvester namespace (E.g. kubeflow-cluster)

  • The Image Volume operating system for all nodes is the SUSE Linux Enterprise 15 SP4 minimal QCOW2 image with cloud-init enabled (previously known as the OpenStack image)

  • All nodes are connected to a Harvester network that is attached to a VLAN with DHCP, DNS, and routing to the Internet

  • The SSH user for the O/S image is

sles
  • The following User Data cloud-config (under Show Advanced) was applied to all nodes during RKE2 cluster creation:

### cloud-init
#cloud-config
chpasswd:
  list: |
    root:SUSE
    sles:SUSE
  expire: false
ssh_authorized_keys:
  - >-
    <REPLACE WITH SSH PUBLIC KEY OF THE WORKSTATION>
runcmd:
#  - SUSEConnect --url <REPLACE WITH RMT SERVER ADDRESS>                                               # Uncomment if using an RMT server
#  - SUSEConnect -e <REPLACE WITH REGISTERED EMAIL ADDRESS> -r <REPLACE WITH SCC SUBSCRIPTION KEY>    # Uncomment if using an SCC subscription key
  - zypper -n in -t pattern apparmor
  - zypper -n up
  - zypper in --force-resolution --no-confirm --force kernel-default
  - zypper rm --no-confirm kernel-default-base
  • Select the tick-box to Install guest agent

Important
These instructions are currently only applicable for Kubernetes versions earlier than 1.25
The Kubernetes Cluster Configuration is as follows:
  • On the Basic tab:

    • Kubernetes version v1.24.9+rke2r2 (currently deprecated, but needed for Harvester Cloud Provider support)

    • Enable the Harvester Cloud Provider CSI driver

    • Container Network Interface is Calico

    • Ensure the Default Pod Security Policy is set to Default - RKE2 Embedded

    • Leave Pod Security Admission Configuration Template set to (None)

    • Disable the Nginx Ingress controller under System Services

  • On the Labels and Annotations tab:

    • Apply a cluster label where the key is platform and the value is kubeflow

  • Click Create

Verify and reboot the RKE2 nodes
  • After the cluster has been created, SSH to each node as the user sles

    • Verify that the kernel-default kernel has been installed and the kernel-default-base kernel has been removed:

sudo zypper se kernel-default
  • If needed, remove the kernel-default-base kernel with:

sudo zypper rm --no-confirm kernel-default-base
  • Verify that all operating system software has been patched to the latest update:

sudo zypper up
  • Reboot each node, in turn, to enable the kernel-default kernel:

sudo reboot

After the RKE2 cluster has been created, gather the KUBECONFIG data from the Rancher Management server and provide it to a workstation with kubectl and helm installed
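  • As a quick sanity check (a sketch, assuming the kubeconfig was saved to a hypothetical path such as ~/.kube/kubeflow-on-harvester.yaml), confirm the workstation can reach the new cluster:

export KUBECONFIG=~/.kube/kubeflow-on-harvester.yaml   # hypothetical path; use wherever the kubeconfig was saved
kubectl get nodes -o wide                              # all five nodes should report a STATUS of Ready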

Deploy the MetalLB load balancer

Note
The referenced MetalLB installation instructions include a section for testing MetalLB after deployment. This can be omitted as both MetalLB and the Harvester CSI will be tested in a later step.
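
  • A minimal sketch of the MetalLB deployment follows (assuming the official MetalLB Helm chart and an available address range of 192.168.100.240-192.168.100.250 on the VM VLAN; substitute values for your environment). Refer to the MetalLB documentation at https://metallb.universe.tf/installation/ for the authoritative procedure:

helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb -n metallb-system --create-namespace

# Wait for the MetalLB pods to be Running, then define an address pool and an L2 advertisement
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.240-192.168.100.250   # assumption: replace with a free range on the VM network
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
EOF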

Verify the correct operation of MetalLB and the Harvester/Longhorn CSI

  • Set this variable with the target namespace:

NAMESPACE="metallb-harvester-csi-test"
  • Create the namespace:

kubectl create namespace ${NAMESPACE}
  • Create the manifest for an nginx pod, PVC, and load balancer service:

cat <<EOF> nginx-metallb-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1
        ports:
        - name: http
          containerPort: 80
        volumeMounts:
        - mountPath: /mnt/test-vol
          name: test-vol
      volumes:
      - name: test-vol
        persistentVolumeClaim:
          claimName: nginx-pvc


---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-pvc
  namespace: ${NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi


---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
EOF
  • Create the pod, service, and the PVC:

kubectl apply -f nginx-metallb-test.yaml
  • Verify the pod is "Running", the harvester StorageClass is the (default), the persistentvolumeclaim is "Bound", and the service has an "EXTERNAL-IP":

kubectl get pod,sc,pvc,svc -n ${NAMESPACE}
  • Verify that the service is reachable through the load balancer IP address from outside the cluster:

IPAddr=$(kubectl get svc -n ${NAMESPACE} | grep -w nginx | awk '{print$4":"$5}' | awk -F: '{print$1":"$2}')
curl http://${IPAddr} 2>/dev/null | grep "Thank you for using nginx"
  • An HTML encoded output should display the phrase "Thank you for using nginx."

    • Verify that the volume is mounted in the test pod:

TEST_POD=$(kubectl get pods -n ${NAMESPACE} | awk '/nginx/ {print$1}')
kubectl exec -it ${TEST_POD} -n ${NAMESPACE} -- mount | grep test-vol
  • The output should show that the volume is mounted at the location /mnt/test-vol

    • When finished with testing, delete the pod and service:

kubectl delete -f nginx-metallb-test.yaml
sleep 5
kubectl delete namespace ${NAMESPACE}

Install kustomize 5.0.0 or higher

Note
The instructions for installing Kubeflow can be found at: https://github.com/kubeflow/manifests#installation
Important
At the time of writing, Kubeflow requires kustomize version 5.0.0 or higher
Install kustomize 5.0.0 or higher on the Linux workstation:
VERSION="v5.0.0"   # kustomize release to install; 5.0.0 or any later 5.x release should work
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz
tar xvfz kustomize_${VERSION}_linux_amd64.tar.gz
sudo mv kustomize /usr/bin

  • Verify the kustomize version:

kustomize version

Deploy Kubeflow

Note
The remainder of the procedure will require installing Kubeflow according to the instructions on the Kubeflow GitHub site, then returning to this document to enable TLS for HTTPS connections to the Kubeflow Dashboard.
Important
Before running the first installation command, it is recommended to run git status in the manifests directory to ensure no unexpected changes have been made to this copy of the git repo. Additionally, it is recommended to remove the manifests directory and re-clone the repo between installation efforts.
  • Clone the repository at https://github.com/kubeflow/manifests, change into the manifests directory, then follow the instructions to either install all of the Kubeflow components with a single command, or install individual components

Note
The remainder of this procedure has only been tested with a full installation (i.e. https://github.com/kubeflow/manifests#install-with-a-single-command)
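
  • At the time of writing, the single-command installation from the kubeflow/manifests README looks like the following (run from the root of the cloned manifests directory; check the README for the current form, as it may have changed):

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done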

Verify the Kubeflow installation

  • Ensure all pods have a STATUS of Running and that all containers in each pod are ready (E.g. 1/1, not 1/2 or 0/1):

for EACH in auth cert-manager istio-system knative-eventing knative-serving kubeflow kubeflow-user-example-com; do kubectl get pods -n ${EACH}; read -p "<Enter to continue>"; echo ""; done
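  • As an alternative quick check (a convenience command, not part of the Kubeflow documentation), list any pods cluster-wide that are not yet Running or Completed; no output means all pods are healthy:

kubectl get pods -A --no-headers | grep -vE "Running|Completed"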
  • Enable kubectl port-forwarding and ensure the Kubeflow UI permits login:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Note
In the following step, ensure the connection uses HTTP, not HTTPS
  • In a browser on the Linux workstation, connect to:

http://127.0.0.1:8080
  • Login with the default credentials:

    • Email address: user@example.com

    • Password: 12341234
  • Use Ctrl+c to close the kubectl port-forward session

Troubleshooting Kubeflow installation

  • Some things that could prevent connecting or logging into the Kubeflow dashboard include:

    1. The local copy of the https://github.com/kubeflow/manifests git repo doesn’t match the origin

      • While in the manifests directory, run git status to see if any files are different from the origin repo

      • Remove the manifests directory and clone the repo again

    2. Using a web browser that is not running on the Linux desktop

      • The kubectl port-forwarding opens a tunnel from the Linux workstation to the Kubeflow gateway service that only a web browser running on the same system can utilize.

    3. The Kubeflow installation has not completed or failed to complete

      • Return to the beginning of this Verify the Kubeflow installation section and ensure all containers and pods are running correctly

      • A high number of container restarts can indicate other issues preventing the installation from completing successfully

    4. The cluster’s resources are saturated

      • Use the Linux top command on the worker nodes to ensure the system’s CPU/memory are not overburdened

      • Check the Harvester dashboard to ensure the physical Harvester nodes are not overburdened or experiencing failures

Update Istio to use the MetalLB load balancer

  • Verify the current istio-ingressgateway service type (Likely ClusterIP):

kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.spec.type}' ; echo ""
  • Patch the service to change the type to LoadBalancer:

kubectl -n istio-system patch svc istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'
  • Verify the service is a type of LoadBalancer and take note of the IP address:

kubectl -n istio-system get svc istio-ingressgateway
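  • Optionally, capture the load balancer IP address into a variable for use in the DNS update step later (a small convenience; the jsonpath assumes MetalLB assigned an IP address, not a hostname):

KF_LB_IP=$(kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${KF_LB_IP}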

Enable HTTPS on the Kubeflow Istio Gateway

  • Edit the kubeflow-gateway resource to add HTTPS routing:

kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
  • Add this portion to the bottom of the spec: section:

    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret
  • The entire spec: section should look like this:

spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret

Update AWS Route53

  • Update the AWS Route53 DNS provider with the Kubeflow load balancer IP address and the desired Fully Qualified Domain Name for the Kubeflow UI
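
  • One way to do this from the workstation is with the AWS CLI (a sketch; the hosted zone ID, FQDN, and IP address below are placeholders for your own values):

cat <<EOF> kubeflow-dns-record.json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<REPLACE WITH KUBEFLOW FQDN>",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [ { "Value": "<REPLACE WITH ISTIO LOAD BALANCER IP>" } ]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets --hosted-zone-id <REPLACE WITH HOSTED ZONE ID> --change-batch file://kubeflow-dns-record.json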

Use a browser to connect, with HTTP (not HTTPS), to Kubeflow UI at the FQDN

  • The browser should redirect to the Dex login prompt

  • (Optional) Login with the default credentials:

    • Email address: user@example.com

    • Password: 12341234
Important
Proceed to the next section only after being able to connect to, and optionally, log into the Kubeflow UI

Configure cert-manager to manage Let’s Encrypt certificates, using Route 53 DNS records

Note
cert-manager can manage certificates from any public DNS provider. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/ for more information.
Note
An AWS user with appropriate IAM policies and API access keys is needed for cert-manager to access the Route53 DNS records. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/dns01/route53/ for more information.
Create a cert-manager Issuer for Let’s Encrypt:
  • Set these variables:

# aws_access_key_id and aws_secret_access_key for the configured AWS user:
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_REGION="" # E.g. "us-west-2"
export DNSZONE="" # E.g. "suse.com"
export FQDN="" # E.g. "kubeflow.suse.com"
export EMAIL_ADDR="" # valid email address for the Let's Encrypt certificate
Note
When initially creating the cert-manager Issuer, ensure the server: https://acme-staging-v02 line is uncommented and the server: https://acme-v02 line is commented out. After verifying that the certificate can be issued correctly, we will reverse this to obtain the valid, production certificate.
  • Create the cert-manager Issuer file:

cat <<EOF> letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-issuer
  namespace: istio-system
spec:
  acme:
    email: ${EMAIL_ADDR}
    server: https://acme-staging-v02.api.letsencrypt.org/directory # Use this line to test the process of issuing a certificate to avoid the Let's Encrypt production rate limits
#    server: https://acme-v02.api.letsencrypt.org/directory # Use this line after the certificate issues correctly
    privateKeySecretRef:
      name: letsencrypt-issuer-priv-key # K8s secret that will contain the private key for this specific issuer
    solvers:
    - selector:
        dnsZones:
          - "${DNSZONE}"
      dns01:
        route53:
          region: ${AWS_REGION}
          accessKeyID: ${AWS_ACCESS_KEY_ID}
          secretAccessKeySecretRef:
            name: route53-credentials-secret
            key: secret-access-key
EOF
Important
Review the letsencrypt-issuer.yaml file for accuracy before continuing
  • Verify the contents of the file:

cat letsencrypt-issuer.yaml
  • Create the letsencrypt-issuer resource:

kubectl apply -f letsencrypt-issuer.yaml
  • Create the Kubernetes secret containing the aws_secret_access_key for the AWS user:

kubectl create -n istio-system secret generic route53-credentials-secret --from-literal=secret-access-key=${AWS_SECRET_ACCESS_KEY}

  • Verify the contents of the secret:

kubectl get -n istio-system secret route53-credentials-secret -o jsonpath={.data.secret-access-key} | base64 -d; echo ""


  • Verify the hostname for the certificate resolves correctly:

getent hosts ${FQDN}
  • Create the cert-manager Certificate resource file:

cat <<EOF> kubeflow-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-certificate
  namespace: istio-system
spec:
  secretName: kubeflow-certificate-secret # Kubernetes secret that will contain the tls.key and tls.crt of the new cert
  commonName: ${FQDN}
  dnsNames:
    - ${FQDN}
  issuerRef:
    name: letsencrypt-issuer
    kind: Issuer
EOF
  • Verify the Certificate resource file:

cat kubeflow-certificate.yaml
  • Create the Certificate resource:

kubectl apply -f kubeflow-certificate.yaml
  • Check the status of the certificate:

kubectl get -w -n istio-system certificate
  • Use Ctrl+c to exit the kubectl -w (watch) command

Note
The certificate commonly takes 100 seconds to be issued but can take up to three minutes. The READY status will change to True when it is issued.
  • If needed, check the progress of the certificate:

kubectl describe -n istio-system certificate kubeflow-certificate
Important
If the certificate seems to be taking a long time to be issued, review the cert-manager logs for clues. Common errors are related to DNS resolution, credentials, and IAM policies. Keep checking back for the status of the certificate since it will likely keep working in the background.
  • If needed, review the cert-manager logs:

kubectl logs -n cert-manager -l app=cert-manager
Important
Proceed to the next section only after the certificate shows a READY status of True

Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Staging certificate

Note
Since the certificate was issued by the Let’s Encrypt Staging servers, the browser will report it as untrusted.
  • Click the lock icon in the browser’s URL pane, then continue selecting appropriate options until you are able to review the connection certificate. It should say that the certificate was issued by Let’s Encrypt (Staging)
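
  • The certificate issuer can also be checked from the workstation command line (a convenience check using openssl, not part of the original procedure; assumes the FQDN variable from earlier is still set):

echo | openssl s_client -connect ${FQDN}:443 -servername ${FQDN} 2>/dev/null | openssl x509 -noout -issuer -enddate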

Update the configuration to use a Let’s Encrypt production certificate

  • Edit the letsencrypt-issuer.yaml file to comment out the staging server line and uncomment the production server line, then re-apply the Issuer:

kubectl apply -f letsencrypt-issuer.yaml
  • Remove the certificate and its associated secret:

kubectl -n istio-system delete secret kubeflow-certificate-secret
kubectl -n istio-system delete certificate kubeflow-certificate
  • Recreate the certificate:

kubectl apply -f kubeflow-certificate.yaml
  • Check the status of the certificate:

kubectl get -w -n istio-system certificate
  • Use Ctrl+c to exit the kubectl watch (-w) command

Note
The certificate can take up to three minutes to be issued, as indicated by the READY status becoming True
  • Refresh the istio-gateway deployment to use the new certificate:

kubectl rollout restart deployment -n istio-system istio-ingressgateway
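  • Optionally, wait for the rollout to complete before reconnecting:

kubectl rollout status deployment -n istio-system istio-ingressgateway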

Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Production certificate

  • Close and reopen the browser to verify the publicly signed certificate at the Kubeflow UI’s HTTPS URL

Optionally, set the kubeflow-gateway to redirect HTTP to HTTPS

  • Edit the kubeflow-gateway resource:

kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
  • In the HTTP server entry, change httpsRedirect: false to httpsRedirect: true
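
  • To confirm the redirect (a quick check with curl; assumes the FQDN variable from earlier is still set), an HTTP request should now return a 301 redirect to the HTTPS URL:

curl -sI http://${FQDN} | grep -iE "HTTP/|location"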