Rancher-Kubeflow-Harvester-RKE2.adoc
Current state of this doc (05/24/23): Seems pretty stable. Recent deployments had a few issues with Python failures in Kubeflow pods, but these were almost certainly not due to this procedure.

Important
This guide assumes Rancher Management Server is installed and configured with Harvester under Rancher management, and an AWS Route53 DNS domain is available.

Overall steps to complete this task

  1. Deploy an RKE2 cluster on Harvester

  2. Deploy the MetalLB load balancer

  3. Verify the correct operation of MetalLB and the Harvester/Longhorn CSI

  4. Install kustomize 5.0.0 or higher

  5. Deploy Kubeflow

  6. Verify the Kubeflow installation

  7. Update Istio to use the MetalLB load balancer

  8. Enable HTTPS on the Kubeflow Istio Gateway

  9. Update AWS Route53

  10. Configure cert-manager to manage Let’s Encrypt certificates, using Route 53 DNS records

  11. Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Staging certificate

  12. Update the configuration to use a Let’s Encrypt production certificate

  13. Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Production certificate

  14. Optionally, set the kubeflow-gateway to redirect HTTP to HTTPS

Note
Many of the following steps will be performed from an AMD/Intel Linux workstation with access to Rancher, Harvester, and the Internet.

Deploy an RKE2 cluster on Harvester

Prepare the RKE2 Kubernetes cluster
  • For this effort, we used Harvester, managed through Rancher, to create a five-node RKE2 cluster

  • A project in Harvester named kubeflow-on-harvester contains a namespace named kubeflow-cluster

Note
Create a cluster through the Rancher UI
The resource allocations used here were for basic testing purposes. It is likely that more CPU and RAM would be required for the workload-plane VMs to support a useful Kubeflow workload.
  • The cluster name set to:

kubeflow-on-harvester
  • Three instances make up the control-plane and etcd pool:

    • Configure resources of 4 vCPU, 8GB RAM, 40GB boot drive

  • Two VMs make up the workload-plane pool:

    • Configure resources of 8 vCPU, 16GB RAM, 40GB boot drive

  • The VMs can be created in any Harvester namespace (E.g. kubeflow-cluster)

  • The Image Volume operating system for all nodes is the SUSE Linux Enterprise 15 SP4 minimal QCOW2 image with cloud-init enabled (previously known as the OpenStack image)

  • All nodes are connected to a Harvester network that is attached to a VLAN with DHCP, DNS, and routing to the Internet

  • The SSH user for the O/S image is

sles
  • The following User Data cloud-config (under Show Advanced) was applied to all nodes during RKE2 cluster creation:

### cloud-init
#cloud-config
chpasswd:
  list: |
    root:SUSE
    sles:SUSE
  expire: false
ssh_authorized_keys:
  - >-
    <REPLACE WITH SSH PUBLIC KEY OF THE WORKSTATION>
runcmd:
#  - SUSEConnect --url <REPLACE WITH RMT SERVER ADDRESS>                                               # Uncomment if using an RMT server
#  - SUSEConnect -e <REPLACE WITH REGISTERED EMAIL ADDRESS> -r <REPLACE WITH SCC SUBSCRIPTION KEY>    # Uncomment if using an SCC subscription key
  - zypper -n in -t pattern apparmor
  - zypper -n up
  - zypper in --force-resolution --no-confirm --force kernel-default
  - zypper rm --no-confirm kernel-default-base
  • Select the tick-box to Install guest agent

Important
These instructions are currently only applicable for Kubernetes versions earlier than 1.25
The Kubernetes Cluster Configuration is as follows:
  • On the Basic tab:

    • Kubernetes version v1.24.9+rke2r2 (currently deprecated, but needed for Harvester Cloud Provider support)

    • Enable the Harvester Cloud Provider CSI driver

    • Container Network Interface is Calico

    • Ensure the Default Pod Security Policy is set to Default - RKE2 Embedded

    • Leave Pod Security Admission Configuration Template set to (None)

    • Disable the Nginx Ingress controller under System Services

  • On the Labels and Annotations tab:

    • Apply a cluster label where the key is platform and the value is kubeflow

  • Click Create

Verify and reboot the RKE2 nodes
  • After the cluster has been created, SSH to each node as the user sles

    • Verify that the kernel-default kernel has been installed and the kernel-default-base kernel has been removed:

sudo zypper se kernel-default
  • If needed, remove the kernel-default-base kernel with:

sudo zypper rm --no-confirm kernel-default-base
  • Verify that all operating system software has been patched to the latest update:

sudo zypper up
  • Reboot each node, in turn, to enable the kernel-default kernel:

sudo reboot

After the RKE2 cluster has been created, gather the KUBECONFIG data from the Rancher Management server and provide it to a workstation with kubectl and helm installed
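  • As a quick sanity check (a sketch, assuming the kubeconfig was saved to a hypothetical path such as ~/.kube/kubeflow-on-harvester.yaml), confirm the workstation can reach the new cluster:

export KUBECONFIG=~/.kube/kubeflow-on-harvester.yaml   # hypothetical path; use wherever the kubeconfig was saved
kubectl get nodes -o wide                              # all five nodes should report a STATUS of Ready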

Deploy the MetalLB load balancer

Note
The referenced MetalLB installation instructions include a section for testing MetalLB after deployment. This can be omitted as both MetalLB and the Harvester CSI will be tested in a later step.
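
  • A minimal sketch of the MetalLB deployment follows (assuming the official MetalLB Helm chart and an available address range of 192.168.100.240-192.168.100.250 on the VM VLAN; substitute values for your environment). Refer to the MetalLB documentation at https://metallb.universe.tf/installation/ for the authoritative procedure:

helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb -n metallb-system --create-namespace

# Wait for the MetalLB pods to be Running, then define an address pool and an L2 advertisement
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.240-192.168.100.250   # assumption: replace with a free range on the VM network
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
EOF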

Verify the correct operation of MetalLB and the Harvester/Longhorn CSI

  • Set this variable with the target namespace:

NAMESPACE="metallb-harvester-csi-test"
  • Create the namespace:

kubectl create namespace ${NAMESPACE}
  • Create the manifest for an nginx pod, PVC, and load balancer service:

cat <<EOF> nginx-metallb-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1
        ports:
        - name: http
          containerPort: 80
        volumeMounts:
        - mountPath: /mnt/test-vol
          name: test-vol
      volumes:
      - name: test-vol
        persistentVolumeClaim:
          claimName: nginx-pvc


---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nginx-pvc
  namespace: ${NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi


---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: ${NAMESPACE}
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
EOF
  • Create the pod, service, and the PVC:

kubectl apply -f nginx-metallb-test.yaml
  • Verify the pod is "Running", the harvester StorageClass is the (default), the persistentvolumeclaim is "Bound", and the service has an "EXTERNAL-IP":

kubectl get pod,sc,pvc,svc -n ${NAMESPACE}
  • Verify that the service is reachable through the load balancer IP address from outside the cluster:

IPAddr=$(kubectl get svc -n ${NAMESPACE} | grep -w nginx | awk '{print$4":"$5}' | awk -F: '{print$1":"$2}')
curl http://${IPAddr} 2>/dev/null | grep "Thank you for using nginx"
  • An HTML encoded output should display the phrase "Thank you for using nginx."

    • Verify that the volume is mounted in the test pod:

TEST_POD=$(kubectl get pods -n ${NAMESPACE} | awk '/nginx/ {print$1}')
kubectl exec -it ${TEST_POD} -n ${NAMESPACE} -- mount | grep test-vol
  • The output should show that the volume is mounted at the location /mnt/test-vol

    • When finished with testing, delete the pod and service:

kubectl delete -f nginx-metallb-test.yaml
sleep 5
kubectl delete namespace ${NAMESPACE}

Install kustomize 5.0.0 or higher

Note
The instructions for installing Kubeflow can be found at: https://github.com/kubeflow/manifests#installation
Important
At the time of writing, Kubeflow requires kustomize version 5.0.0 or higher
Install kustomize 5.0.0 or higher on the Linux workstation:
VERSION="v5.0.0"   # kustomize release to install; 5.0.0 or any later 5.x release should work
wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2F${VERSION}/kustomize_${VERSION}_linux_amd64.tar.gz
tar xvfz kustomize_${VERSION}_linux_amd64.tar.gz
sudo mv kustomize /usr/bin

  • Verify the kustomize version:

kustomize version

Deploy Kubeflow

Note
The remainder of the procedure will require installing Kubeflow according to the instructions on the Kubeflow GitHub site, then returning to this document to enable TLS for HTTPS connections to the Kubeflow Dashboard.
Important
Before running the first installation command, it is recommended to run git status in the manifests directory to ensure no unexpected changes have been made to this copy of the git repo. Additionally, it is recommended to remove the manifests directory and re-clone the repo between installation efforts.
  • Clone the repository at https://github.com/kubeflow/manifests, change into the manifests directory, then follow the instructions to either install all of the Kubeflow components with a single command, or install individual components

Note
The remainder of this procedure has only been tested with a full installation (i.e. https://github.com/kubeflow/manifests#install-with-a-single-command)
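
  • At the time of writing, the single-command installation from the kubeflow/manifests README looks like the following (run from the root of the cloned manifests directory; check the README for the current form, as it may have changed):

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done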

Verify the Kubeflow installation

  • Ensure all pods have a STATUS of Running and that all containers in each pod are ready (E.g. 1/1, not 1/2 or 0/1):

for EACH in auth cert-manager istio-system knative-eventing knative-serving kubeflow kubeflow-user-example-com; do kubectl get pods -n ${EACH}; read -p "<Enter to continue>"; echo ""; done
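  • As an alternative quick check (a convenience command, not part of the Kubeflow documentation), list any pods cluster-wide that are not yet Running or Completed; no output means all pods are healthy:

kubectl get pods -A --no-headers | grep -vE "Running|Completed"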
  • Enable kubectl port-forwarding and ensure the Kubeflow UI permits login:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Note
In the following step, ensure the connection uses HTTP, not HTTPS
  • In a browser on the Linux workstation, connect to:

http://127.0.0.1:8080
  • Login with the default credentials:

    • Email address: user@example.com

    • Password: 12341234
  • Use Ctrl+c to close the kubectl port-forward session

Troubleshooting Kubeflow installation

  • Some things that could prevent connecting or logging into the Kubeflow dashboard include:

    1. The local copy of the https://github.com/kubeflow/manifests git repo doesn’t match the origin

      • While in the manifests directory, run git status to see if any files are different from the origin repo

      • Remove the manifests directory and clone the repo again

    2. Using a web browser that is not running on the Linux desktop

      • The kubectl port-forwarding opens a tunnel from the Linux workstation to the Kubeflow gateway service that only a web browser running on the same system can utilize.

    3. The Kubeflow installation has not completed or failed to complete

      • Return to the beginning of this Verify the Kubeflow installation section and ensure all containers and pods are running correctly

      • A high number of container restarts can indicate other issues preventing the installation from completing successfully

    4. The cluster’s resources are saturated

      • Use the Linux top command on the worker nodes to ensure the system’s CPU/memory are not overburdened

      • Check the Harvester dashboard to ensure the physical Harvester nodes are not overburdened or experiencing failures

Update Istio to use the MetalLB load balancer

  • Verify the current istio-ingressgateway service type (Likely ClusterIP):

kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.spec.type}' ; echo ""
  • Patch the service to change the type to LoadBalancer:

kubectl -n istio-system patch svc istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'
  • Verify the service is a type of LoadBalancer and take note of the IP address:

kubectl -n istio-system get svc istio-ingressgateway
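  • Optionally, capture the load balancer IP address into a variable for use in the DNS update step later (a small convenience; the jsonpath assumes MetalLB assigned an IP address, not a hostname):

KF_LB_IP=$(kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo ${KF_LB_IP}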

Enable HTTPS on the Kubeflow Istio Gateway

  • Edit the kubeflow-gateway resource to add HTTPS routing:

kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
  • Add this portion to the bottom of the spec: section:

    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret
  • The entire spec: section should look like this:

spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: false
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: kubeflow-certificate-secret

Update AWS Route53

  • Update the AWS Route53 DNS provider with the Kubeflow load balancer IP address and the desired Fully Qualified Domain Name for the Kubeflow UI
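
  • One way to do this from the workstation is with the AWS CLI (a sketch; the hosted zone ID, FQDN, and IP address below are placeholders for your own values):

cat <<EOF> kubeflow-dns-record.json
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<REPLACE WITH KUBEFLOW FQDN>",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [ { "Value": "<REPLACE WITH ISTIO LOAD BALANCER IP>" } ]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets --hosted-zone-id <REPLACE WITH HOSTED ZONE ID> --change-batch file://kubeflow-dns-record.json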

Use a browser to connect, with HTTP (not HTTPS), to Kubeflow UI at the FQDN

  • The browser should redirect to the Dex login prompt

  • (Optional) Login with the default credentials:

    • Email address: user@example.com

    • Password: 12341234
Important
Proceed to the next section only after being able to connect to, and optionally, log into the Kubeflow UI

Configure cert-manager to manage Let’s Encrypt certificates, using Route 53 DNS records

Note
cert-manager can manage certificates from any public DNS provider. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/ for more information.
Note
An AWS user with appropriate IAM policies and API access keys is needed for cert-manager to access the Route53 DNS records. See the cert-manager documentation at https://cert-manager.io/docs/configuration/acme/dns01/route53/ for more information.
Create a cert-manager Issuer for Let’s Encrypt:
  • Set these variables:

# aws_access_key_id and aws_secret_access_key for the configured AWS user:
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_REGION="" # E.g. "us-west-2"
export DNSZONE="" # E.g. "suse.com"
export FQDN="" # E.g. "kubeflow.suse.com"
export EMAIL_ADDR="" # valid email address for the Let's Encrypt certificate
Note
When initially creating the cert-manager Issuer, ensure the server: https://acme-staging-v02 line is uncommented and the server: https://acme-v02 line is commented out. After verifying that the certificate can be issued correctly, we will reverse this to obtain the valid, production certificate.
  • Create the cert-manager Issuer file:

cat <<EOF> letsencrypt-issuer.yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-issuer
  namespace: istio-system
spec:
  acme:
    email: ${EMAIL_ADDR}
    server: https://acme-staging-v02.api.letsencrypt.org/directory # Use this line to test the process of issuing a certificate to avoid the Let's Encrypt production rate limits
#    server: https://acme-v02.api.letsencrypt.org/directory # Use this line after the certificate issues correctly
    privateKeySecretRef:
      name: letsencrypt-issuer-priv-key # K8s secret that will contain the private key for this specific issuer
    solvers:
    - selector:
        dnsZones:
          - "${DNSZONE}"
      dns01:
        route53:
          region: ${AWS_REGION}
          accessKeyID: ${AWS_ACCESS_KEY_ID}
          secretAccessKeySecretRef:
            name: route53-credentials-secret
            key: secret-access-key
EOF
Important
Review the letsencrypt-issuer.yaml file for accuracy before continuing
  • Verify the contents of the file:

cat letsencrypt-issuer.yaml
  • Create the letsencrypt-issuer resource:

kubectl apply -f letsencrypt-issuer.yaml
  • Create the Kubernetes secret containing the aws_secret_access_key for the AWS user:

kubectl create -n istio-system secret generic route53-credentials-secret --from-literal=secret-access-key=${AWS_SECRET_ACCESS_KEY}

  • Verify the contents of the secret:

kubectl get -n istio-system secret route53-credentials-secret -o jsonpath={.data.secret-access-key} | base64 -d; echo ""


  • Verify the hostname for the certificate resolves correctly:

getent hosts ${FQDN}
  • Create the cert-manager Certificate resource file:

cat <<EOF> kubeflow-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-certificate
  namespace: istio-system
spec:
  secretName: kubeflow-certificate-secret # Kubernetes secret that will contain the tls.key and tls.crt of the new cert
  commonName: ${FQDN}
  dnsNames:
    - ${FQDN}
  issuerRef:
    name: letsencrypt-issuer
    kind: Issuer
EOF
  • Verify the Certificate resource file:

cat kubeflow-certificate.yaml
  • Create the Certificate resource:

kubectl apply -f kubeflow-certificate.yaml
  • Check the status of the certificate:

kubectl get -w -n istio-system certificate
  • Use Ctrl+c to exit the kubectl -w (watch) command

Note
The certificate commonly takes 100 seconds to be issued but can take up to three minutes. The READY status will change to True when it is issued.
  • If needed, check the progress of the certificate:

kubectl describe -n istio-system certificate kubeflow-certificate
Important
If the certificate seems to be taking a long time to be issued, review the cert-manager logs for clues. Common errors are related to DNS resolution, credentials, and IAM policies. Keep checking back for the status of the certificate since it will likely keep working in the background.
  • If needed, review the cert-manager logs:

kubectl logs -n cert-manager -l app=cert-manager
Important
Proceed to the next section only after the certificate shows a READY status of True

Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Staging certificate

Note
Since the certificate was issued by the Let’s Encrypt Staging servers, the browser will report it as untrusted.
  • Click the lock icon in the browser’s URL pane, then continue selecting appropriate options until you are able to review the connection certificate. It should say that the certificate was issued by Let’s Encrypt (Staging)
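
  • The certificate issuer can also be checked from the workstation command line (a convenience check using openssl, not part of the original procedure; assumes the FQDN variable from earlier is still set):

echo | openssl s_client -connect ${FQDN}:443 -servername ${FQDN} 2>/dev/null | openssl x509 -noout -issuer -enddate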

Update the configuration to use a Let’s Encrypt production certificate

  • Edit the letsencrypt-issuer.yaml file to comment out the staging server line and uncomment the production server line, then re-apply the Issuer:

kubectl apply -f letsencrypt-issuer.yaml
  • Remove the certificate and its associated secret:

kubectl -n istio-system delete secret kubeflow-certificate-secret
kubectl -n istio-system delete certificate kubeflow-certificate
  • Recreate the certificate:

kubectl apply -f kubeflow-certificate.yaml
  • Check the status of the certificate:

kubectl get -w -n istio-system certificate
  • Use Ctrl+c to exit the kubectl watch (-w) command

Note
The certificate can take up to three minutes to be issued, as indicated by the READY status becoming True
  • Refresh the istio-gateway deployment to use the new certificate:

kubectl rollout restart deployment -n istio-system istio-ingressgateway
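  • Optionally, wait for the rollout to complete before reconnecting:

kubectl rollout status deployment -n istio-system istio-ingressgateway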

Use a web browser to connect to Kubeflow UI with the Let’s Encrypt Production certificate

  • Close and reopen the browser to verify the publicly signed certificate at the Kubeflow UI’s HTTPS URL

Optionally, set the kubeflow-gateway to redirect HTTP to HTTPS

  • Edit the kubeflow-gateway resource:

kubectl edit -n kubeflow gateways.networking.istio.io kubeflow-gateway
  • In the HTTP server entry, change httpsRedirect: false to httpsRedirect: true
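
  • To confirm the redirect (a quick check with curl; assumes the FQDN variable from earlier is still set), an HTTP request should now return a 301 redirect to the HTTPS URL:

curl -sI http://${FQDN} | grep -iE "HTTP/|location"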