Following gpu-operator documentation will break RKE2 cluster after reboot #992
Comments
Does anyone on the NVIDIA team object to replacing our sample command with a reference to the RKE2 docs? That's my preference.
I'm using Ubuntu 22.04 with an NVIDIA RTX A2000 12GB and K8s v1.27.11+rke2r1. Is there any problem with using driver version 560 rather than 535 as indicated in the RKE2 doc?
I'm fairly confident that using the 560 driver, or any driver covered in the product docs, is OK. However, I'd like SME input from my teammates. When I followed the RKE2 doc, I found that I needed to specify additional settings.
@mikemckiernan I think it's due to the gpu-operator's containerd settings. If the gpu-operator worked normally with RKE2, it would create a valid config.
I am not sure what version of the GPU operator you are using, but would the following values file work for you, @aiicore?
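The values file attached to this comment was not preserved in the thread. As an illustrative sketch only (not the commenter's actual file), a values file that points the container toolkit at RKE2's containerd socket via the gpu-operator chart's `toolkit.env` mechanism might look like:

```yaml
# Sketch: point the NVIDIA container toolkit at RKE2's containerd socket
# instead of letting it rewrite an OS-level containerd config.
# The socket path is RKE2's default; adjust for your installation.
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
```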
Is it possible to configure the NVIDIA toolkit not to restart/configure containerd at all? RKE2 configures the nvidia runtime as well.
The RKE2 docs only mention passing RKE2's internal CONTAINERD_SOCKET: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator
NVIDIA's docs additionally set CONTAINERD_CONFIG: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#rancher-kubernetes-engine-2
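Concretely, the difference between the two docs comes down to which `toolkit.env` entries are set. A sketch, with values inferred from the paths quoted elsewhere in this issue:

```yaml
# RKE2 docs: only override the socket so the toolkit talks to
# RKE2's embedded containerd.
toolkit:
  env:
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock

# NVIDIA docs additionally point the toolkit at RKE2's config template,
# which is the file the toolkit then rewrites:
#   - name: CONTAINERD_CONFIG
#     value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
```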
Following the gpu-operator documentation, the following things happen:
/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
/var/lib/rancher/rke2/agent/etc/containerd/config.toml
The most significant errors in the logs would be:
Following the RKE2 docs and passing only CONTAINERD_SOCKET works, since the gpu-operator will write its config (which doesn't work with RKE2) into
/etc/containerd/config.toml
, even though containerd is not installed at the OS level. It looks like the containerd config provided by the gpu-operator doesn't matter with RKE2, since RKE2 is able to detect
nvidia-container-runtime
and configure its own containerd config with the nvidia runtime class. Steps to reproduce on Ubuntu 22.04:
Following NVIDIA's docs breaks the RKE2 cluster after reboot:
Following the RKE2 docs works fine:
Could someone verify the docs?