
Add docs explaining how to deploy the NVIDIA operator #203

Merged: 1 commit into rancher:main from gpudocs on May 10, 2024

Conversation

manuelbuil (Contributor)

Adds information about how to deploy the NVIDIA operator


@manuelbuil manuelbuil requested a review from a team as a code owner May 3, 2024 07:57
docs/advanced.md Outdated
Comment on lines 306 to 334
Once the OS is ready and RKE2 is running, add the NVIDIA Helm repository:
```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
```

And install the GPU Operator:
```
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.env[0].name=CONTAINERD_SOCKET --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```
:::warning
The NVIDIA operator restarts containerd with a hangup call, which restarts RKE2.
:::
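
(Not part of the proposed docs text: as a quick sanity check after the install, mirroring the reviewer's verification further down in this thread, the operator's pods can be listed. The `gpu-operator` namespace is the one created by the command above.)
```
kubectl -n gpu-operator get pods
```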
@dereknola (Member), May 3, 2024

Since this is helm, can we include an option of installing it via manifests? Use another Tabs section?


Member

We have a helm operator built into RKE2 though. We don't ever talk about using helm CLI anywhere in the current docs.

Wouldn't the helm chart just be

```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  chart: nvidia/gpu-operator
  targetNamespace: gpu-operator
  set:
    - toolkit.env[0].name=CONTAINERD_SOCKET
    - toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```

Member

👍🏻 to suggesting we use the built-in helm controller for this

@brandond (Member), May 6, 2024

I would also probably encourage `valuesContent` instead of `set` though, since that's more extensible if they want to continue to set chart values.

Also, you need to specify the repo:

```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```
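
(For context: with RKE2's built-in Helm controller, a HelmChart manifest like the one above is typically dropped into the server's auto-deploy directory rather than applied with the helm CLI. A minimal sketch, assuming a default server install; the file name is only illustrative:)
```
# Assumes a default RKE2 server install; files in this directory are applied
# automatically by the built-in Helm controller, so no `helm` CLI call is needed.
# The file name "gpu-operator-helmchart.yaml" is hypothetical.
sudo cp gpu-operator-helmchart.yaml /var/lib/rancher/rke2/server/manifests/
```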

manuelbuil (Contributor Author)

> We have a helm operator built into RKE2 though. We don't ever talk about using helm CLI anywhere in the current docs.
>
> Wouldn't the helm chart just be
>
> ```
> apiVersion: helm.cattle.io/v1
> kind: HelmChart
> metadata:
>   name: gpu-operator
>   namespace: kube-system
> spec:
>   chart: nvidia/gpu-operator
>   targetNamespace: gpu-operator
>   set:
>     - toolkit.env[0].name=CONTAINERD_SOCKET
>     - toolkit.env[0].value=/run/k3s/containerd/containerd.sock
> ```

Good point! That makes everything more cohesive.

@manuelbuil (Contributor Author), May 7, 2024

@brandond When installing a chart using kind: HelmChart, there is no way to automate the creation of the targetNamespace, right? I see the error:

Error: INSTALLATION FAILED: create: failed to create: namespaces "gpu-operator" not found

I guess I need to add the namespace creation to the yaml

Member

No, you can just do createNamespace: true.

https://docs.k3s.io/helm#helmchart-field-definitions
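
Putting the thread's suggestions together, a minimal sketch of the full manifest with automatic namespace creation (same chart and values as above, `createNamespace` per the linked k3s docs) could be:
```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  # Lets the Helm controller create the target namespace instead of failing
  # with 'namespaces "gpu-operator" not found'.
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```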

@e-minguez (Contributor)

Just in case, for SLE Micro -> https://documentation.suse.com/suse-edge/3.0/html/edge/id-nvidia-gpus-on-sle-micro.html
And also, https://bugzilla.suse.com/show_bug.cgi?id=1222725 for SLE Micro with SELinux enabled

docs/advanced.md Outdated
<Tabs groupId = "GPU Operating System">
<TabItem value="SLES" default>

The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
Member

I'm having a hard time understanding this statement. Is this what you're getting at?

Suggested change
The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
The NVIDIA operator cannot automatically install kernel drivers on SLES. NVIDIA drivers must be manually installed on all GPU nodes before deploying the operator to the cluster.

manuelbuil (Contributor Author)

Thanks Brad. It probably would have been better as: "The NVIDIA operator is capable of installing the required drivers when not pre-installed correctly." Your suggestion works as well.
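
(For reference, pre-installing the drivers on a SUSE-family system might look roughly like the following sketch; the package names are taken from the openSUSE verification later in this thread and are assumptions for SLES, where the exact repositories and packages can differ.)
```
# Assumed G06 driver packages, matching the openSUSE Tumbleweed check further
# down in this conversation; SLES may need different repositories/packages.
sudo zypper install nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
```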

docs/advanced.md Outdated
</TabItem>
<TabItem value="RHEL" default>

The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
Member

Suggested change
The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
The NVIDIA operator can automatically install kernel drivers on RHEL using the `nvidia-driver-daemonset`. You would only need to create the symlink:

@mgfritch left a comment

Confirmed this actually works while using my super cool non-enterprise GPU! 😆

```
$ lsb-release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20240506
Codename:       n/a

$ uname -r -m 
6.8.8-1-default x86_64

$ rpm -q nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
nvidia-gl-G06-550.78-22.1.x86_64
nvidia-video-G06-550.78-22.1.x86_64
nvidia-compute-utils-G06-550.78-22.1.x86_64

$ nvidia-smi --query-gpu=name --format=csv,noheader | head -1
Quadro K620
$ kubectl get nodes
NAME       STATUS   ROLES                       AGE     VERSION
host1   Ready    control-plane,etcd,master   6h37m   v1.29.0+rke2r1

$ kubectl -n gpu-operator get pods
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-pjvct                                   1/1     Running     0          6h32m
gpu-operator-7bbf8bb6b7-qqhm7                                 1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-gc-79d6d968bb-grxdq       1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-master-6d9f8d497c-jdtsz   1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-worker-w6cgr              1/1     Running     0          6h33m
nvidia-container-toolkit-daemonset-rrnng                      1/1     Running     0          6h32m
nvidia-cuda-validator-zljq7                                   0/1     Completed   0          6h32m
nvidia-dcgm-exporter-6wdmj                                    1/1     Running     0          6h32m
nvidia-device-plugin-daemonset-6n67d                          1/1     Running     0          6h32m
nvidia-operator-validator-wp8tm                               1/1     Running     0          6h32m

$ kubectl logs nbody-gpu-benchmark
<snip>
> Compute 5.0 CUDA device: [Quadro K620]
3072 bodies, total time for 10 iterations: 3.705 ms
= 25.470 billion interactions per second
= 509.396 single-precision GFLOP/s at 20 flops per interaction
```
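
(For anyone reproducing this check: the nbody-gpu-benchmark pod in the log above is not defined in this PR. A minimal sketch of such a test pod, assuming NVIDIA's public CUDA nbody sample image, might look like the following.)
```
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      # Image, args, and resource limit are assumptions based on NVIDIA's
      # public CUDA sample images, not taken from this PR. Depending on the
      # containerd/runtime configuration, runtimeClassName: nvidia may also
      # be required.
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
```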

Comment on lines +282 to +285
Finally, create the symlink:
```
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```


hopefully this workaround can be removed when NVIDIA/nvidia-container-toolkit#147 is resolved.

Member

That seems like a trivial thing to fix, I'm really surprised they haven't yet.

manuelbuil (Contributor Author)

thanks Mike!

@manuelbuil manuelbuil merged commit 404791c into rancher:main May 10, 2024
1 check passed
@manuelbuil manuelbuil deleted the gpudocs branch May 10, 2024 05:32
5 participants