
Add docs explaining how to deploy the NVIDIA operator #203

Merged: 1 commit into rancher:main from gpudocs on May 10, 2024

Conversation

manuelbuil (Contributor)

Adds information about how to deploy the NVIDIA operator


@manuelbuil manuelbuil requested a review from a team as a code owner May 3, 2024 07:57
docs/advanced.md Outdated
Comment on lines 306 to 334
Once the OS is ready and RKE2 is running, add the NVIDIA Helm repository:
```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
```

And install the GPU Operator:
```
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.env[0].name=CONTAINERD_SOCKET --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```
:::warning
The NVIDIA operator restarts containerd with a hangup call, which restarts RKE2.
:::
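
(Not part of the proposed docs text: as a quick sanity check after the install, mirroring the reviewer's verification further down in this thread, the operator's pods can be listed. The `gpu-operator` namespace is the one created by the command above.)
```
kubectl -n gpu-operator get pods
```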
@dereknola (Member), May 3, 2024

Since this is helm, can we include an option of installing it via manifests? Use another Tabs section?


Member

We have a helm operator built into RKE2 though. We don't ever talk about using helm CLI anywhere in the current docs.

Wouldn't the helm chart just be

```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  chart: nvidia/gpu-operator
  targetNamespace: gpu-operator
  set:
    - toolkit.env[0].name=CONTAINERD_SOCKET
    - toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```

Member

👍🏻 to suggesting we use the built-in helm controller for this

@brandond (Member), May 6, 2024

I would also probably encourage `valuesContent` instead of `set` though, since that's more extensible if they want to continue to set chart values.

Also, you need to specify the repo:

```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```
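
(For context: with RKE2's built-in Helm controller, a HelmChart manifest like the one above is typically dropped into the server's auto-deploy directory rather than applied with the helm CLI. A minimal sketch, assuming a default server install; the file name is only illustrative:)
```
# Assumes a default RKE2 server install; files in this directory are applied
# automatically by the built-in Helm controller, so no `helm` CLI call is needed.
# The file name "gpu-operator-helmchart.yaml" is hypothetical.
sudo cp gpu-operator-helmchart.yaml /var/lib/rancher/rke2/server/manifests/
```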

manuelbuil (Contributor Author)

> We have a helm operator built into RKE2 though. We don't ever talk about using helm CLI anywhere in the current docs.
>
> Wouldn't the helm chart just be
>
> ```
> apiVersion: helm.cattle.io/v1
> kind: HelmChart
> metadata:
>   name: gpu-operator
>   namespace: kube-system
> spec:
>   chart: nvidia/gpu-operator
>   targetNamespace: gpu-operator
>   set:
>     - toolkit.env[0].name=CONTAINERD_SOCKET
>     - toolkit.env[0].value=/run/k3s/containerd/containerd.sock
> ```

Good point! That makes everything more cohesive.

@manuelbuil (Contributor Author), May 7, 2024

@brandond When installing a chart using kind: HelmChart, there is no way to automate the creation of the targetNamespace, right? I see the error:

Error: INSTALLATION FAILED: create: failed to create: namespaces "gpu-operator" not found

I guess I need to add the namespace creation to the yaml

Member

No, you can just do createNamespace: true.

https://docs.k3s.io/helm#helmchart-field-definitions
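
Putting the thread's suggestions together, a minimal sketch of the full manifest with automatic namespace creation (same chart and values as above, `createNamespace` per the linked k3s docs) could be:
```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  # Lets the Helm controller create the target namespace instead of failing
  # with 'namespaces "gpu-operator" not found'.
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```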

@e-minguez (Contributor)

Just in case, for SLE Micro -> https://documentation.suse.com/suse-edge/3.0/html/edge/id-nvidia-gpus-on-sle-micro.html
And also, https://bugzilla.suse.com/show_bug.cgi?id=1222725 for SLE Micro with SELinux enabled

docs/advanced.md Outdated
<Tabs groupId = "GPU Operating System">
<TabItem value="SLES" default>

The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
Member

I'm having a hard time understanding this statement. Is this what you're getting at?

Suggested change
The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
The NVIDIA operator cannot automatically install kernel drivers on SLES. NVIDIA drivers must be manually installed on all GPU nodes before deploying the operator to the cluster.

manuelbuil (Contributor Author)

Thanks Brad. It probably would have been better as: "The NVIDIA operator is capable of installing the required drivers when not pre-installed correctly." Your suggestion works as well.
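
(For reference, pre-installing the drivers on a SUSE-family system might look roughly like the following sketch; the package names are taken from the openSUSE verification later in this thread and are assumptions for SLES, where the exact repositories and packages can differ.)
```
# Assumed G06 driver packages, matching the openSUSE Tumbleweed check further
# down in this conversation; SLES may need different repositories/packages.
sudo zypper install nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
```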

docs/advanced.md Outdated
</TabItem>
<TabItem value="RHEL" default>

The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
Member

Suggested change
The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
The NVIDIA operator can automatically install kernel drivers on RHEL using the `nvidia-driver-daemonset`. You would only need to create the symlink:

@mgfritch left a comment

Confirmed this actually works while using my super cool non-enterprise GPU! 😆

```
$ lsb-release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20240506
Codename:       n/a

$ uname -r -m 
6.8.8-1-default x86_64

$ rpm -q nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
nvidia-gl-G06-550.78-22.1.x86_64
nvidia-video-G06-550.78-22.1.x86_64
nvidia-compute-utils-G06-550.78-22.1.x86_64

$ nvidia-smi --query-gpu=name --format=csv,noheader | head -1
Quadro K620
$ kubectl get nodes
NAME       STATUS   ROLES                       AGE     VERSION
host1   Ready    control-plane,etcd,master   6h37m   v1.29.0+rke2r1

$ kubectl -n gpu-operator get pods
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-pjvct                                   1/1     Running     0          6h32m
gpu-operator-7bbf8bb6b7-qqhm7                                 1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-gc-79d6d968bb-grxdq       1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-master-6d9f8d497c-jdtsz   1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-worker-w6cgr              1/1     Running     0          6h33m
nvidia-container-toolkit-daemonset-rrnng                      1/1     Running     0          6h32m
nvidia-cuda-validator-zljq7                                   0/1     Completed   0          6h32m
nvidia-dcgm-exporter-6wdmj                                    1/1     Running     0          6h32m
nvidia-device-plugin-daemonset-6n67d                          1/1     Running     0          6h32m
nvidia-operator-validator-wp8tm                               1/1     Running     0          6h32m

$ kubectl logs nbody-gpu-benchmark
<snip>
> Compute 5.0 CUDA device: [Quadro K620]
3072 bodies, total time for 10 iterations: 3.705 ms
= 25.470 billion interactions per second
= 509.396 single-precision GFLOP/s at 20 flops per interaction
```
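
(For anyone reproducing this check: the nbody-gpu-benchmark pod in the log above is not defined in this PR. A minimal sketch of such a test pod, assuming NVIDIA's public CUDA nbody sample image, might look like the following.)
```
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      # Image, args, and resource limit are assumptions based on NVIDIA's
      # public CUDA sample images, not taken from this PR. Depending on the
      # containerd/runtime configuration, runtimeClassName: nvidia may also
      # be required.
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
```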

Comment on lines +282 to +285
Finally, create the symlink:
```
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```


hopefully this workaround can be removed when NVIDIA/nvidia-container-toolkit#147 is resolved.

Member

That seems like a trivial thing to fix, I'm really surprised they haven't yet.

manuelbuil (Contributor Author)

thanks Mike!

@manuelbuil manuelbuil merged commit 404791c into rancher:main May 10, 2024
1 check passed
@manuelbuil manuelbuil deleted the gpudocs branch May 10, 2024 05:32
5 participants