```yaml
kube-apiserver-extra-env:
  - "MY_BAR=BAR"
kube-scheduler-extra-env: "TZ=America/Los_Angeles"
```

## Deploy NVIDIA operator (experimental)

The [NVIDIA operator](https://github.com/NVIDIA/gpu-operator) allows administrators of Kubernetes clusters to manage GPUs just like CPUs. It includes everything needed for pods to be able to use GPUs.

Depending on the underlying OS, a few preparation steps are required:

<Tabs groupId="GPU Operating System">
<TabItem value="SLES" default>

The NVIDIA operator is capable of installing the required drivers when they are missing, but only on a very limited set of operating systems, which unfortunately does not include SLES. Therefore, we need to install the NVIDIA drivers before deploying the operator.

```
# Assuming you are using sle15sp5; if different, change the URL accordingly
sudo zypper addrepo --refresh 'https://download.nvidia.com/suse/sle15sp5' NVIDIA
sudo zypper --gpg-auto-import-keys refresh
sudo zypper install -y --auto-agree-with-licenses nvidia-gl-G06 nvidia-video-G06 nvidia-compute-utils-G06
```
Then reboot.
If everything worked correctly, after the reboot you should see the NVRM and GCC versions of the driver when executing the command:
```
cat /proc/driver/nvidia/version
```
Finally, create the symlink:
```
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```
</TabItem>
<TabItem value="Ubuntu" default>
The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include Ubuntu. Therefore, we need to install the NVIDIA drivers before deploying the operator.
```
sudo apt install nvidia-driver-535-server
```
Then reboot.
If everything worked correctly, after the reboot you should see a correct output when executing the command:
```
cat /proc/driver/nvidia/version
```
</TabItem>
<TabItem value="RHEL" default>
The NVIDIA operator is capable of installing the required drivers when not installed correctly in a very limited set of OS, which unfortunately, does not include RHEL9. Therefore, you need to use RHEL8 or RHEL7.
You would only need to create the symlink:
```
sudo ln -s /sbin/ldconfig /sbin/ldconfig.real
```
</TabItem>
</Tabs>
Once the OS is ready and RKE2 is running, install the GPU Operator with the following YAML manifest:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
```
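One way to deploy it (a minimal sketch; the file name `gpu-operator.yaml` is just an example) is to apply the manifest with `kubectl`, or to copy it to the RKE2 manifests directory on a server node, from where RKE2 deploys it automatically:
```
# Apply the manifest directly...
kubectl apply -f gpu-operator.yaml

# ...or let RKE2 deploy it from the manifests directory on a server node
sudo cp gpu-operator.yaml /var/lib/rancher/rke2/server/manifests/
```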
:::warning
The NVIDIA operator restarts containerd with a hangup call, which also restarts RKE2.
:::

After approximately one minute, you can run the following checks to verify that everything worked as expected:

1 - Check if the operator detected the driver and GPU correctly:
```
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | jq | grep "nvidia.com"
```
You should see labels specifying the driver and GPU (e.g. `nvidia.com/gpu.machine` or `nvidia.com/cuda.driver.major`).

2 - Check if the GPU was added (by `nvidia-device-plugin-daemonset`) as an allocatable resource on the node:
```
kubectl get node $NODENAME -o jsonpath='{.status.allocatable}' | jq
```
You should see `"nvidia.com/gpu":` followed by the number of GPUs in the node.

3 - Check that the container runtime binary was installed by the operator (in particular by the `nvidia-container-toolkit-daemonset`):
```
ls /usr/local/nvidia/toolkit/nvidia-container-runtime
```

4 - Verify that the containerd config was updated to include the nvidia container runtime:
```
grep nvidia /etc/containerd/config.toml
```

5 - Run a pod to verify that the GPU resource can be successfully scheduled and that the pod can detect it:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
```
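As a usage sketch (assuming the manifest above is saved as `nbody-gpu-benchmark.yaml`, an example file name), apply it and read the benchmark results from the pod logs:
```
kubectl apply -f nbody-gpu-benchmark.yaml

# Once the pod has completed, the benchmark output appears in its logs
kubectl logs nbody-gpu-benchmark -n default
```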
