Update https://docs.rke2.io/advanced#deploy-nvidia-operator for SLES OS #263
Comments
When I checked it, it was not necessary because that was the default path, but I need to double-check. I was trying on SLE 15 SP6 but I am getting a missing dependency when installing the RPMs from …
That does seem like the wrong path, unless you're using the system containerd for some reason. Our managed containerd config is at /var/lib/rancher/rke2/agent/etc/containerd/config.toml.
I have just checked and the docs are OK; I can see the NVIDIA operator working.
Closing with #264.
I had added the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml in the gpu-operator ClusterPolicy, but NVIDIA driver libs/bins were still not mounted inside the GPU pod.
As soon as I replaced the file path /var/lib/rancher/rke2/agent/etc/containerd/config.toml with /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl in the gpu-operator ClusterPolicy, all NVIDIA driver libraries and binaries were mounted inside the GPU pod. See the output of the command:
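For context, that change amounts to the following fragment of the GPU Operator's ClusterPolicy (a minimal sketch; the surrounding fields are omitted and the exact layout is assumed from the operator's chart values):

```yaml
# Sketch (assumed layout) of the relevant gpu-operator ClusterPolicy fragment;
# pointing CONTAINERD_CONFIG at the .tmpl file is the change described above.
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
```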
@manuelbuil anything referring to …? In addition to that, the correct path needs to be added to the env var values in the HelmChart here, as mentioned above:
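The HelmChart in question isn't quoted in the thread; a sketch of what setting those env var values could look like, assuming the helm.cattle.io/v1 HelmChart resource that RKE2's docs use to install the operator (the chart repo and namespaces are assumptions):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia   # assumed chart repo
  chart: gpu-operator
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
        - name: CONTAINERD_CONFIG
          value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock   # RKE2's containerd socket
```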
I'm now confused for different reasons. Let's take them one at a time. First one: either our testing is not complete or there is something I can't replicate in my env. Let me explain what I do to test this: I follow the instructions in our link: https://docs.rke2.io/advanced#deploy-nvidia-operator. And I can see the nvidia operator running:
Everything seems correct. Then I deploy the test pod which uses the GPU:
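(The manifest isn't shown above; a sketch of the kind of benchmark pod this refers to, assuming the CUDA nbody sample image that the RKE2 docs use for this check — the pod name and runtime class are assumptions:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark   # assumed name
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia    # assumed: the runtime class registered for the NVIDIA runtime
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ["nbody", "-gpu", "-benchmark"]   # runs the small GPU benchmark
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
```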
And the logs seem correct: they detect the GPU and even run a small benchmark:
Therefore, my guess is that the NVIDIA driver is correctly exposed to that pod; otherwise the logs would be different, wouldn't they? @sandipt, is the test I am describing not good enough? Can you help us understand what would be a good test to verify things are correct?
I think the problem is not that it doesn't work, it's that the validation steps show some incorrect paths.
Go inside your GPU pod using, e.g., `kubectl exec -it <pod-name> -- sh` (the pod name is a placeholder):
@sandipt There is a bug in our code and we need to add the nvidia cdi runtime to k3s and rke2. I'll update the docs once the code is merged and ping you so that you can quickly test it if you have time. Thanks for helping us! |
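(For reference, exposing an additional runtime to pods is done through a RuntimeClass; a purely hypothetical sketch of an nvidia-cdi runtime class — the name and handler are assumptions, not the merged change:)

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-cdi    # hypothetical name for the CDI-mode runtime class
handler: nvidia-cdi   # must match the runtime name in containerd's config
```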
Hey @sandipt, do you use the following envs in your pod, and pass …?
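The env list being asked about is not reproduced above; purely as an illustration (an assumption about what is usually checked in this situation, not the commenter's actual list):

```yaml
# Illustrative only (assumed): envs that commonly control whether the NVIDIA
# runtime exposes driver libs/bins to a container.
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
```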
I have not set them myself; I think they are set by default. I see the below envs set on my pod:
In the gpu-operator CRD clusterpolicies.nvidia.com I have the below settings for the container toolkit:
Hi RKE2 team,
I see that NVIDIA libraries and binaries are not mounted inside the GPU pod when using the GPU Operator installation method at https://docs.rke2.io/advanced#deploy-nvidia-operator.
Between https://docs.rke2.io/advanced#deploy-nvidia-operator and https://documentation.suse.com/suse-ai/1.0/html/NVIDIA-Operator-installation/index.html, the only difference I see is the file name /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl in the env CONTAINERD_CONFIG.
I quickly tested this on one of my SLES 15 SP5 nodes: I replaced config.toml with config.toml.tmpl in CONTAINERD_CONFIG, and all NVIDIA libs/bins mounted inside containers. I think you need to test and update https://docs.rke2.io/advanced#deploy-nvidia-operator for SLES OS.