Failed to initialize NVML: could not load NVML library #444
Comments
After changing the plugin image to nvcr.io/nvidia/k8s-device-plugin:v0.14.1, the error changed to the following:
I'm assuming you use Meaning you need to configure
Instructions for reference: I added the additional |
Sorry for the late reply, and thanks for the answer. I tried several times, and the cause was indeed the default runtime setting. When I used Rancher to set up the k8s cluster and then modified this setting, it broke the cluster. I had to switch to a different cluster installation method in order to change the default runtime, which solved the problem. Thanks again. @klueska
I checked issues #19 and #47, but they did not help.
When I run k8s-device-plugin I get the error below, and the k8s node does not list the nvidia/gpu type in its resource pool. My system is Ubuntu 22.04, the image version is nvidia/k8s-device-plugin:v0.11.0-ubuntu20.04, and my GPU is an RTX 3080 Ti. I tried reinstalling the driver and restarting the system, but that did not help. Are there any other suggestions or methods to resolve this issue? Thanks.
2023-10-19T11:11:54.957587844+08:00 stderr F 2023/10/19 03:11:54 Loading NVML
2023-10-19T11:11:54.957723331+08:00 stderr F 2023/10/19 03:11:54 Failed to initialize NVML: could not load NVML library.
2023-10-19T11:11:54.957728922+08:00 stderr F 2023/10/19 03:11:54 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2023-10-19T11:11:54.957732401+08:00 stderr F 2023/10/19 03:11:54 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023-10-19T11:11:54.957734456+08:00 stderr F 2023/10/19 03:11:54 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
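As far as I understand, the "could not load NVML library" message means the plugin process could not open the NVML shared library from inside its container. Here is a minimal sketch for checking the same condition by hand. It assumes Python is available in the environment being tested and that the library is exposed under its usual soname `libnvidia-ml.so.1`; the helper name `nvml_loadable` is hypothetical and not part of the plugin.

```python
import ctypes


def nvml_loadable(library_name="libnvidia-ml.so.1"):
    """Return True if the dynamic loader can open the given shared library.

    The default name is the usual NVML soname (an assumption; adjust it
    if the driver installs the library under a different name).
    """
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


if __name__ == "__main__":
    if nvml_loadable():
        print("NVML library can be loaded here")
    else:
        print("NVML library cannot be loaded here")
```

If this reports that the library loads on the host but not inside a container, the driver libraries are probably not being injected into the container, which points at the container runtime configuration rather than at the driver install.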
The NVIDIA driver is already installed, see below:
Docker daemon config with the nvidia runtime:
kubectl version:
docker version:
root@localhost:/var/log/pods/kube-system_nvidia-device-plugin-daemonset-lkpwq_a83d3b15-5886-41c2-ae24-3aa6173dc7b3/nvidia-device-plugin-ctr# docker info
Client: Docker Engine - Community
Version: 24.0.6
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.21.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 3
Running: 1
Paused: 0
Stopped: 2
Images: 28
Server Version: 24.0.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
runc version: v1.1.9-0-gccaecfc
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-60-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 47.15GiB
Name: localhost
ID: d11e5bc6-1f47-47a2-849e-e2e6fff9509e
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: [email protected]
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false