Current state of this doc (05/24/23): Seems pretty stable. Recent deployments had a few issues with Python failures in Kubeflow pods, but these were almost certainly not due to this procedure.
Important
|
This guide assumes Rancher Management Server is installed and configured with Harvester under Rancher management, and an AWS Route53 DNS domain is available. |
-
Deploy an RKE2 cluster on Harvester
-
Deploy the MetalLB load balancer
-
Verify the correct operation of MetalLB and the Harvester/Longhorn CSI
-
Install kustomize 5.0.0 or higher
-
Deploy Kubeflow
-
Verify the Kubeflow installation
-
Update Istio to use the MetalLB load balancer
-
Enable HTTPS on the Kubeflow Istio Gateway
-
Update AWS Route53
-
Configure cert-manager to manage Let’s Encrypt certificates, using Route 53 DNS records
-
Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Staging certificate
-
Update the configuration to use a Let’s Encrypt production certificate
-
Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Production certificate
-
Optionally, set the kubeflow-gateway to redirect HTTP to HTTPS
Note
|
Many of the following steps will be performed from an AMD/Intel Linux workstation with access to Rancher, Harvester, and the Internet. |
-
For this effort, we used Harvester, managed through Rancher, to create a five-node RKE2 cluster
-
A project in Harvester named kubeflow-on-harvester contains a namespace named kubeflow-cluster
Note
|
Create a cluster through the Rancher UI
The resource allocations used here were for basic testing purposes. It is likely that more CPU and RAM would be required for the workload-plane VMs to support a useful Kubeflow workload.
|
-
The cluster name is set to:
kubeflow-on-harvester
-
Three VMs make up the control-plane and etcd pool:
-
Each is configured with 4 vCPU, 8GB RAM, and a 40GB boot drive
-
Two VMs make up the workload-plane pool:
-
Each is configured with 8 vCPU, 16GB RAM, and a 40GB boot drive
-
-
-
Any Harvester namespace can be used for the cluster VMs
-
The
Image Volume
operating system for all nodes is the SUSE Linux Enterprise 15 SP4 minimal QCOW2 image with cloud-init enabled (previously known as the OpenStack image)
-
All nodes are connected to a Harvester network that is attached to a VLAN with DHCP, DNS, and routing to the Internet
-
The SSH user for the O/S image is
sles
-
The following
User Data
cloud-config (under Show Advanced) was applied to all nodes during RKE2 cluster creation:
### cloud-init
#cloud-config
chpasswd:
  list: |
    root:SUSE
    sles:SUSE
  expire: false
ssh_authorized_keys:
  - >-
    <REPLACE WITH SSH PUBLIC KEY OF THE WORKSTATION>
runcmd:
  # - SUSEConnect --url <REPLACE WITH RMT SERVER ADDRESS> # Uncomment if using an RMT server
  # - SUSEConnect -e <REPLACE WITH REGISTERED EMAIL ADDRESS> -r <REPLACE WITH SCC SUBSCRIPTION KEY> # Uncomment if using an SCC subscription key
  - zypper -n in -t pattern apparmor
  - zypper -n up
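  # The next two commands replace the reduced kernel-default-base kernel with the full kernel-default kernel,
  # since kernel-default-base omits many kernel modules that the Harvester CSI / Longhorn storage stack may need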
  - zypper in --force-resolution --no-confirm --force kernel-default
  - zypper rm --no-confirm kernel-default-base
-
Select the tick-box to
Install guest agent
Important
|
These instructions are currently only applicable for Kubernetes versions earlier than 1.25 |
Cluster Configuration
is as follows:
-
On the
Basic
tab:
-
Kubernetes version v1.24.9+rke2r2 (currently deprecated, but needed for Harvester Cloud Provider support)
-
Enable the Harvester Cloud Provider CSI driver
-
Container Network
Interface is Calico
-
Ensure the
Default Security Pod Policy
is set to Default - RKE2 Embedded
-
Leave
Pod Security Admission Configuration Template
set to (None)
-
-
Disable the
Nginx Ingress
controller under System Services
-
-
On the
Labels and Annotations
tab:
-
Apply a cluster label where the key is
platform
and the value is kubeflow
-
-
Click
Create
-
After the cluster has been created, SSH to each node as the user
sles
-
Verify that the
kernel-default
kernel has been installed and the
kernel-default-base
kernel has been removed:
-
sudo zypper se kernel-default
-
If needed, remove the
kernel-default-base
kernel with:
sudo zypper rm --no-confirm kernel-default-base
-
Verify that all operating system software has been patched to the latest updates (zypper up applies any that remain):
sudo zypper up
-
Reboot each node, in turn, to boot into the kernel-default kernel:
sudo reboot
After the RKE2 cluster has been created, gather the KUBECONFIG data from the Rancher Management server and provide it to a workstation with kubectl and helm installed
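For example, assuming the kubeconfig file for the kubeflow-on-harvester cluster has been downloaded from the Rancher UI (the file path below is a placeholder):
export KUBECONFIG=${HOME}/Downloads/kubeflow-on-harvester.yaml
kubectl get nodes -o wide
All five nodes should report a STATUS of Ready before continuing.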
Note
|
The instructions linked below include a section for testing MetalLB after deployment. That testing can be omitted, as both MetalLB and the Harvester CSI will be verified in a later step.
|
-
Use these instructions to deploy MetalLB on the RKE2 cluster: https://gist.github.com/alexarnoldy/24dd06d8c4291d04c5d7065b520bcb15
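For reference only, a minimal MetalLB Layer 2 configuration resembles the sketch below; it is not a substitute for the linked instructions. The metallb-system namespace and the address range are assumptions and must be adjusted to an unused range on the RKE2 node VLAN:
helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb -n metallb-system --create-namespace
kubectl get pods -n metallb-system   # wait until the controller and speaker pods are Running
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.100.240-192.168.100.250   # placeholder range; replace with addresses reserved for load balancers
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
EOF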
-
Set this variable with the target namespace:
NAMESPACE="metallb-harvester-csi-test"
-
Create the namespace:
kubectl create namespace ${NAMESPACE}
-
Create the manifest for an nginx pod, PVC, and load balancer service:
cat <<EOF> nginx-metallb-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1
        ports:
        - name: http
          containerPort: 80
        volumeMounts:
        - mountPath: /mnt/test-vol
          name: test-vol
      volumes:
      - name: test-vol
        persistentVolumeClaim:
          claimName: nginx-pvc
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-pvc
  namespace: ${NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
EOF
-
Create the pod, service, and the PVC:
kubectl apply -f nginx-metallb-test.yaml
-
Verify the pod is "Running", the
harvester
StorageClass is the (default), the persistentvolumeclaim is "Bound", and the service has an "EXTERNAL-IP":
kubectl get pod,sc,pvc,svc -n ${NAMESPACE}
-
Verify that the service is reachable through the load balancer IP address from outside the cluster:
IPAddr=$(kubectl get svc -n ${NAMESPACE} | grep -w nginx | awk '{print$4":"$5}' | awk -F: '{print$1":"$2}')
curl http://${IPAddr} 2>/dev/null | grep "Thank you for using nginx"
-
An HTML encoded output should display the phrase "Thank you for using nginx."
-
Verify that the volume is mounted in the test pod:
-
TEST_POD=$(kubectl get pods -n ${NAMESPACE} | awk '/nginx/ {print$1}')
kubectl exec -it ${TEST_POD} -n ${NAMESPACE} -- mount | grep test-vol
-
The output should show that the volume is mounted at the location
/mnt/test-vol
-
When finished with testing, delete the pod and service:
-
kubectl delete -f nginx-metallb-test.yaml
sleep 5
kubectl delete namespace ${NAMESPACE}
Note
|
The instructions for installing Kubeflow can be found at: https://github.com/kubeflow/manifests#installation
|
Important
|
At the time of writing, Kubeflow requires kustomize version 5.0.0 or higher |
-
Find the latest release of kustomize at https://github.com/kubernetes-sigs/kustomize/releases/
-
Adjust this variable for the appropriate release:
VERSION="v5.0.0"
-
Use the following commands to download and install kustomize for a Linux AMD/Intel workstation:
-
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz
tar xvfz kustomize_${VERSION}_linux_amd64.tar.gz
sudo mv kustomize /usr/bin
-
Verify the kustomize version:
kustomize version
Note
|
The remainder of the procedure will require installing Kubeflow according to the instructions on the Kubeflow GitHub site, then returning to this document to enable TLS for HTTPS connections to the Kubeflow Dashboard. |
Important
|
Before running the first installation command, it is recommended to run git status in the manifests directory to ensure no unexpected changes have been made to this copy of the git repo. Additionally, it is recommended to remove the manifests directory and re-clone the repo between installation efforts.
|
-
Clone the repository at https://github.com/kubeflow/manifests, change into the manifests directory, then follow the instructions to either install all of the Kubeflow components with a single command, or install individual components
Note
|
The remainder of this procedure has only been tested with a full installation (i.e., https://github.com/kubeflow/manifests#install-with-a-single-command) |
-
Ensure all pods have a
STATUS
of Running
and all of the containers in each pod are ready (e.g., 1/1, not 1/2 or 0/1):
for EACH in auth cert-manager istio-system knative-eventing knative-serving kubeflow kubeflow-user-example-com; do kubectl get pods -n ${EACH}; read -p "<Enter to continue>"; echo ""; done
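As an optional, quicker check, the following one-liner lists any pod in the cluster that is not yet Running or Completed (an empty result, aside from the header line, means all pods are healthy):
kubectl get pods -A | grep -vE 'Running|Completed'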
-
Enable kubectl port-forwarding and ensure the Kubeflow UI permits login:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Note
|
In the following step, ensure the connection uses HTTP, not HTTPS |
-
In a browser on the Linux workstation, connect to:
http://127.0.0.1:8080
-
Log in with the credentials:
Email address: user@example.com
Password: 12341234
-
Use
Ctrl+c
to close the kubectl port-forward session
-
Some things that could prevent connecting or logging into the Kubeflow dashboard include:
-
The local copy of the https://github.com/kubeflow/manifests git repo doesn’t match the origin
-
While in the
manifests
directory, run git status
to see if any files are different from the origin repo
-
Remove the
manifests
directory and clone the repo again
-
-
Using a web browser that is not running on the Linux workstation
-
The kubectl port-forwarding opens a tunnel from the Linux workstation to the Kubeflow gateway service that only a web browser running on the same system can utilize.
-
-
The Kubeflow installation has not yet completed, or has failed to complete
-
Return to the beginning of this
Verify the Kubeflow installation
section and ensure all containers and pods are running correctly
-
A high number of container restarts can indicate other issues preventing the installation from completing successfully
-
-
The cluster’s resources are saturated
-
Use the Linux
top
command on the worker nodes to ensure the system’s CPU/memory are not overburdened
-
Check the Harvester dashboard to ensure the physical Harvester nodes are not overburdened or experiencing failures
-
-
-
Verify the current
istio-ingressgateway
service type (likely ClusterIP):
kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.spec.type}' ; echo ""
-
Patch the service to change the type to LoadBalancer:
kubectl -n istio-system patch svc istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'
-
Verify the service is a type of
LoadBalancer
and take note of the IP address:
kubectl -n istio-system get svc istio-ingressgateway
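Optionally, capture the load balancer address into a shell variable for the DNS update later in this guide (the variable name is arbitrary):
KUBEFLOW_IP=$(kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${KUBEFLOW_IP}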
-
Edit the kubeflow-gateway resource to add HTTPS routing:
kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
-
Add this portion to the bottom of the
spec:
section:
    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret
-
The entire
spec:
section should look like this:
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret
-
Update the AWS Route53 DNS provider with the Kubeflow load balancer IP address and the desired Fully Qualified Domain Name for the Kubeflow UI (for example, with the AWS CLI as shown below)
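For example, with the AWS CLI installed and configured for the account hosting the DNS zone (the hosted zone ID, FQDN, and IP address below are placeholders), an A record can be created or updated as follows:
aws route53 list-hosted-zones-by-name --dns-name <REPLACE WITH DNS ZONE>   # note the hosted zone ID
aws route53 change-resource-record-sets --hosted-zone-id <REPLACE WITH HOSTED ZONE ID> --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "<REPLACE WITH FQDN>",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{"Value": "<REPLACE WITH istio-ingressgateway EXTERNAL-IP>"}]
    }
  }]
}'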
-
Use a web browser to connect to the Kubeflow UI at the new FQDN (over HTTP); the screen should redirect to dex and offer a login prompt
-
(Optional) Log in with the credentials:
Email address: user@example.com
Password: 12341234
Important
|
Proceed to the next section only after being able to connect to, and optionally, log into the Kubeflow UI |
Note
|
cert-manager can manage certificates from any public DNS provider. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/ for more information. |
Note
|
An AWS user with appropriate IAM policies and API access keys is needed for cert-manager to access the Route53 DNS records. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/dns01/route53/ for more information. |
-
Set these variables:
# aws_access_key_id and aws_secret_access_key for the configured AWS user:
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_REGION="" # E.g. "us-west-2"
export DNSZONE="" # E.g. "suse.com"
export FQDN="" # E.g. "kubeflow.suse.com"
export EMAIL_ADDR="" # valid email address for the Let's Encrypt certificate
Note
|
When initially creating the cert-manager Issuer, ensure the server: https://acme-staging-v02 line is uncommented and the server: https://acme-v02 line is commented out. After verifying that the certificate can be issued correctly, we will reverse this to obtain the valid, production certificate.
|
-
Create the cert-manager Issuer file:
cat <<EOF> letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-issuer
  namespace: istio-system
spec:
  acme:
    email: ${EMAIL_ADDR}
    server: https://acme-staging-v02.api.letsencrypt.org/directory # Use this line to test the process of issuing a certificate to avoid the Let's Encrypt production rate limits
    # server: https://acme-v02.api.letsencrypt.org/directory # Use this line after the certificate issues correctly
    privateKeySecretRef:
      name: letsencrypt-issuer-priv-key # K8s secret that will contain the private key for this specific issuer
    solvers:
    - selector:
        dnsZones:
        - "${DNSZONE}"
      dns01:
        route53:
          region: ${AWS_REGION}
          accessKeyID: ${AWS_ACCESS_KEY_ID}
          secretAccessKeySecretRef:
            name: route53-credentials-secret
            key: secret-access-key
EOF
Important
|
Review the letsencrypt-issuer.yaml file for accuracy before continuing |
-
Verify the contents of the file (the variables should have been expanded):
cat letsencrypt-issuer.yaml
-
Create the
letsencrypt-issuer
resource:
kubectl apply -f letsencrypt-issuer.yaml
-
Create the Kubernetes secret containing the aws_secret_access_key for the AWS user:
kubectl create -n istio-system secret generic route53-credentials-secret --from-literal=secret-access-key=${AWS_SECRET_ACCESS_KEY}
-
Verify the contents of the secret:
kubectl get -n istio-system secret route53-credentials-secret -o jsonpath={.data.secret-access-key} | base64 -d; echo ""
-
Verify the hostname for the certificate resolves correctly:
getent hosts ${FQDN}
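If the lookup does not return the load balancer address, querying a public resolver directly can help distinguish a stale local cache from a missing Route53 record (assumes dig is installed):
dig +short ${FQDN} @8.8.8.8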
-
Create the cert-manager Certificate resource file:
cat <<EOF> kubeflow-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-certificate
  namespace: istio-system
spec:
  secretName: kubeflow-certificate-secret # Kubernetes secret that will contain the tls.key and tls.crt of the new cert
  commonName: ${FQDN}
  dnsNames:
  - ${FQDN}
  issuerRef:
    name: letsencrypt-issuer
    kind: Issuer
EOF
-
Verify the Certificate resource file:
cat kubeflow-certificate.yaml
-
Create the Certificate resource:
kubectl apply -f kubeflow-certificate.yaml
-
Check the status of the certificate:
kubectl get -w -n istio-system certificate
-
Use
Ctrl+c
to exit the kubectl -w (watch) command
Note
|
The certificate commonly takes 100 seconds to be issued but can take up to three minutes. The READY status will change to True when it is issued.
|
-
If needed, check the progress of the certificate:
kubectl describe -n istio-system certificate kubeflow-certificate
Important
|
If the certificate seems to be taking a long time to be issued, review the cert-manager logs for clues. Common errors are related to DNS resolution, credentials, and IAM policies. Keep checking back for the status of the certificate since it will likely keep working in the background. |
-
If needed, review the cert-manager logs:
kubectl logs -n cert-manager -l app=cert-manager
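cert-manager also records ACME progress in CertificateRequest, Order, and Challenge resources in the same namespace as the Certificate; inspecting them often shows exactly where a DNS01 challenge is stuck:
kubectl get certificaterequests,orders,challenges -n istio-system
kubectl describe challenges -n istio-system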
Important
|
Proceed to the next section only after the certificate shows a READY status of True
|
Note
|
Since the certificate was issued by the Let’s Encrypt Staging servers, the browser will warn that it is untrusted. |
-
Connect to the Kubeflow UI at its HTTPS URL, click the lock icon in the browser’s URL pane, then continue selecting appropriate options until you are able to review the connection certificate. It should say that the certificate was issued by Let’s Encrypt (Staging)
-
Update the letsencrypt-issuer.yaml file to comment out the
server: https://acme-staging-v02
line and uncomment the server: https://acme-v02
line
-
Update the
letsencrypt-issuer
resource:
kubectl apply -f letsencrypt-issuer.yaml
-
Remove the certificate and its associated secret:
kubectl -n istio-system delete secret kubeflow-certificate-secret
kubectl -n istio-system delete certificate kubeflow-certificate
-
Recreate the certificate:
kubectl apply -f kubeflow-certificate.yaml
-
Check the status of the certificate:
kubectl get -w -n istio-system certificate
-
Use
Ctrl+c
to exit the kubectl watch (-w) command
Note
|
The certificate can take up to three minutes to be issued, as indicated by the READY status becoming True
|
-
Refresh the istio-gateway deployment to use the new certificate:
kubectl rollout restart deployment -n istio-system istio-ingressgateway
-
Close and reopen the browser to verify the publicly signed certificate at the Kubeflow UI’s HTTPS URL
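The issued certificate can also be inspected from the workstation command line (assumes openssl is installed); the issuer should now show Let's Encrypt rather than the staging CA:
echo | openssl s_client -connect ${FQDN}:443 -servername ${FQDN} 2>/dev/null | openssl x509 -noout -issuer -dates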
-
Edit the kubeflow-gateway resource:
kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
-
Change
httpsRedirect: false
to httpsRedirect: true
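For reference, after this change the HTTP server entry in the kubeflow-gateway spec shown earlier should read:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: true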