Add docs explaining how to deploy the NVIDIA operator #203
Conversation
docs/advanced.md
Outdated
Once the OS is ready and RKE2 is running, add the NVIDIA Helm repository:
```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
```

And install the GPU Operator:
```
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set toolkit.env[0].name=CONTAINERD_SOCKET --set toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```
:::warning
The NVIDIA operator restarts containerd with a hangup call, which restarts RKE2.
:::
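Not part of the diff, but a quick way to sanity-check that the install converged (pod names will vary with the chart version):
```
# All pods in the gpu-operator namespace should end up Running or Completed
kubectl -n gpu-operator get pods

# The device plugin should advertise the GPU as an allocatable node resource
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```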
Since this is helm, can we include an option of installing it via manifests? Use another Tabs section?
The GPU Operator's only supported installation method is via Helm: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#prerequisites
We have a helm operator built into RKE2 though. We don't ever talk about using the helm CLI anywhere in the current docs.
Wouldn't the helm chart just be
```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  chart: nvidia/gpu-operator
  targetNamespace: gpu-operator
  set:
    - toolkit.env[0].name=CONTAINERD_SOCKET
    - toolkit.env[0].value=/run/k3s/containerd/containerd.sock
```
👍🏻 to suggesting we use the built-in helm controller for this
I would also probably encourage `valuesContent` instead of `set` though, since that's more extensible if they want to continue to set chart values.
Also, you need to specify the repo:
```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```
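For context on how this would be consumed: RKE2's built-in helm controller applies any manifest placed in the server's manifests directory, so the `HelmChart` above can be installed without the `helm` CLI. A rough sketch, assuming default RKE2 paths and a hypothetical file name `gpu-operator-helmchart.yaml`:
```
# Drop the HelmChart manifest where RKE2's helm controller picks it up
# (default server manifests directory; the file name is just an example)
sudo cp gpu-operator-helmchart.yaml /var/lib/rancher/rke2/server/manifests/

# Or apply it to a running cluster directly
kubectl apply -f gpu-operator-helmchart.yaml
```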
We have a helm operator built into RKE2 though. We don't ever talk about using the helm CLI anywhere in the current docs. Wouldn't the helm chart just be the `HelmChart` manifest shown above?
Good point! That makes everything more cohesive
@brandond When installing a chart using `kind: HelmChart`, there is no way to automate the creation of the `targetNamespace`, right? I see the error:
```
Error: INSTALLATION FAILED: create: failed to create: namespaces "gpu-operator" not found
```
I guess I need to add the namespace creation to the yaml.
No, you can just do `createNamespace: true`.
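Putting the two together, the manifest from earlier in this thread would just gain one extra field, roughly:
```
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  createNamespace: true   # lets the helm controller create the target namespace
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
```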
Just in case, for SLE Micro: https://documentation.suse.com/suse-edge/3.0/html/edge/id-nvidia-gpus-on-sle-micro.html
docs/advanced.md
Outdated
<Tabs groupId = "GPU Operating System">
<TabItem value="SLES" default>

The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
I'm having a hard time understanding this statement. Is this what you're getting at?
The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.
Suggested: The NVIDIA operator cannot automatically install kernel drivers on SLES. NVIDIA drivers must be manually installed on all GPU nodes before deploying the operator to the cluster.
Thanks Brad. Probably it would have been better: `The NVIDIA operator is capable of installing the required drivers when not pre-installed correctly`. Your suggestion works as well.
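As a rough illustration of that pre-installation step (not from the PR itself): a minimal sketch for SLES/openSUSE, assuming the NVIDIA repository is already configured and using the G06 driver packages that appear in the verification output further down this thread; the exact package set may differ per driver generation:
```
# Assumes the NVIDIA repository for your SUSE release is already added;
# package names follow the G06 driver stream seen later in this thread
sudo zypper refresh
sudo zypper install -y nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06

# Reboot so the new kernel modules are loaded before deploying the operator
sudo reboot
```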
docs/advanced.md
Outdated
</TabItem>
<TabItem value="RHEL" default>

The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
The NVIDIA operator can automatically install kernel drivers on Ubuntu using the `nvidia-driver-daemonset`. You would only need to create the symlink:
Suggested: The NVIDIA operator can automatically install kernel drivers on RHEL using the `nvidia-driver-daemonset`. You would only need to create the symlink:
Signed-off-by: Manuel Buil <[email protected]>
Confirmed this actually works while using my super cool non-enterprise GPU! 😆
```
$ lsb-release -a
LSB Version: n/a
Distributor ID: openSUSE
Description: openSUSE Tumbleweed
Release: 20240506
Codename: n/a
$ uname -r -m
6.8.8-1-default x86_64
$ rpm -q nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
nvidia-gl-G06-550.78-22.1.x86_64
nvidia-video-G06-550.78-22.1.x86_64
nvidia-compute-utils-G06-550.78-22.1.x86_64
$ nvidia-smi --query-gpu=name --format=csv,noheader | head -1
Quadro K620
$ kubectl get nodes
NAME    STATUS   ROLES                       AGE     VERSION
host1   Ready    control-plane,etcd,master   6h37m   v1.29.0+rke2r1
$ kubectl -n gpu-operator get pods
NAME                                                           READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-pjvct                                    1/1     Running     0          6h32m
gpu-operator-7bbf8bb6b7-qqhm7                                  1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-gc-79d6d968bb-grxdq        1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-master-6d9f8d497c-jdtsz    1/1     Running     0          6h33m
gpu-operator-node-feature-discovery-worker-w6cgr               1/1     Running     0          6h33m
nvidia-container-toolkit-daemonset-rrnng                       1/1     Running     0          6h32m
nvidia-cuda-validator-zljq7                                    0/1     Completed   0          6h32m
nvidia-dcgm-exporter-6wdmj                                     1/1     Running     0          6h32m
nvidia-device-plugin-daemonset-6n67d                           1/1     Running     0          6h32m
nvidia-operator-validator-wp8tm                                1/1     Running     0          6h32m
$ kubectl logs nbody-gpu-benchmark
<snip>
> Compute 5.0 CUDA device: [Quadro K620]
3072 bodies, total time for 10 iterations: 3.705 ms
= 25.470 billion interactions per second
= 509.396 single-precision GFLOP/s at 20 flops per interaction
```
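The `nbody-gpu-benchmark` pod used for that log isn't shown in the thread; a typical shape for it, adapted from NVIDIA's public CUDA sample (image tag and `runtimeClassName` are assumptions, not taken from this PR):
```
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia          # assumes the operator created the "nvidia" RuntimeClass
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1         # request one GPU from the device plugin
```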
Finally, create the symlink:
```
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```
Hopefully this workaround can be removed when NVIDIA/nvidia-container-toolkit#147 is resolved.
That seems like a trivial thing to fix, I'm really surprised they haven't yet.
Thanks Mike!
Adds information about how to deploy the NVIDIA operator