Failed to initialize NVML: could not load NVML library #444

Closed
Frank-Zeng opened this issue Oct 19, 2023 · 3 comments
Frank-Zeng commented Oct 19, 2023

I checked issues #19 and #47, but they did not help. When I use the k8s-device-plugin I get an error, and the nvidia.com/gpu resource type does not appear in the node's resource pool. My system is Ubuntu 22.04, the image version is nvidia/k8s-device-plugin:v0.11.0-ubuntu20.04, and my GPU is an RTX 3080 Ti. I tried reinstalling the driver and restarting the system, but that did not work. Are there any other suggestions or methods to resolve this issue? Thanks.
2023-10-19T11:11:54.957587844+08:00 stderr F 2023/10/19 03:11:54 Loading NVML
2023-10-19T11:11:54.957723331+08:00 stderr F 2023/10/19 03:11:54 Failed to initialize NVML: could not load NVML library.
2023-10-19T11:11:54.957728922+08:00 stderr F 2023/10/19 03:11:54 If this is a GPU node, did you set the docker default runtime to nvidia?
2023-10-19T11:11:54.957732401+08:00 stderr F 2023/10/19 03:11:54 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023-10-19T11:11:54.957734456+08:00 stderr F 2023/10/19 03:11:54 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

The NVIDIA driver is already installed, see below:
[screenshot omitted]

Docker daemon config with the nvidia runtime:
[screenshot omitted]
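Since the screenshot did not come through, for reference a typical /etc/docker/daemon.json that registers the nvidia runtime and sets it as the default looks roughly like this (a sketch; the path assumes the standard nvidia-container-toolkit install location):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

After editing, restart Docker (e.g. sudo systemctl restart docker) for the change to take effect.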

kubectl version:
[screenshot omitted]

docker info output:
root@localhost:/var/log/pods/kube-system_nvidia-device-plugin-daemonset-lkpwq_a83d3b15-5886-41c2-ae24-3aa6173dc7b3/nvidia-device-plugin-ctr# docker info
Client: Docker Engine - Community
Version: 24.0.6
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.21.0
Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
Containers: 3
Running: 1
Paused: 0
Stopped: 2
Images: 28
Server Version: 24.0.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
runc version: v1.1.9-0-gccaecfc
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-60-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 47.15GiB
Name: localhost
ID: d11e5bc6-1f47-47a2-849e-e2e6fff9509e
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: [email protected]
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false


Frank-Zeng commented Oct 19, 2023

After changing the plugin image to nvcr.io/nvidia/k8s-device-plugin:v0.14.1, the error changed to the following:
2023-10-19T14:25:03.801614288+08:00 stderr F I1019 06:25:03.801505 1 main.go:154] Starting FS watcher.
2023-10-19T14:25:03.801644662+08:00 stderr F I1019 06:25:03.801595 1 main.go:161] Starting OS watcher.
2023-10-19T14:25:03.802161792+08:00 stderr F I1019 06:25:03.801902 1 main.go:176] Starting Plugins.
2023-10-19T14:25:03.802167118+08:00 stderr F I1019 06:25:03.801939 1 main.go:234] Loading configuration.
2023-10-19T14:25:03.802169753+08:00 stderr F I1019 06:25:03.802087 1 main.go:242] Updating config with default resource matching patterns.
2023-10-19T14:25:03.80234749+08:00 stderr F I1019 06:25:03.802298 1 main.go:253]
2023-10-19T14:25:03.802351226+08:00 stderr F Running with config:
2023-10-19T14:25:03.802353289+08:00 stderr F {
2023-10-19T14:25:03.802355449+08:00 stderr F "version": "v1",
2023-10-19T14:25:03.802357605+08:00 stderr F "flags": {
2023-10-19T14:25:03.802360031+08:00 stderr F "migStrategy": "none",
2023-10-19T14:25:03.80236218+08:00 stderr F "failOnInitError": false,
2023-10-19T14:25:03.802364254+08:00 stderr F "nvidiaDriverRoot": "/",
2023-10-19T14:25:03.802366444+08:00 stderr F "gdsEnabled": false,
2023-10-19T14:25:03.802368522+08:00 stderr F "mofedEnabled": false,
2023-10-19T14:25:03.802370594+08:00 stderr F "plugin": {
2023-10-19T14:25:03.802372949+08:00 stderr F "passDeviceSpecs": false,
2023-10-19T14:25:03.802375992+08:00 stderr F "deviceListStrategy": [
2023-10-19T14:25:03.80237887+08:00 stderr F "envvar"
2023-10-19T14:25:03.802382516+08:00 stderr F ],
2023-10-19T14:25:03.802385985+08:00 stderr F "deviceIDStrategy": "uuid",
2023-10-19T14:25:03.80238917+08:00 stderr F "cdiAnnotationPrefix": "cdi.k8s.io/",
2023-10-19T14:25:03.802392506+08:00 stderr F "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
2023-10-19T14:25:03.802395771+08:00 stderr F "containerDriverRoot": "/driver-root"
2023-10-19T14:25:03.802398419+08:00 stderr F }
2023-10-19T14:25:03.802401526+08:00 stderr F },
2023-10-19T14:25:03.802404362+08:00 stderr F "resources": {
2023-10-19T14:25:03.80240769+08:00 stderr F "gpus": [
2023-10-19T14:25:03.802410951+08:00 stderr F {
2023-10-19T14:25:03.802413976+08:00 stderr F "pattern": "*",
2023-10-19T14:25:03.802416622+08:00 stderr F "name": "nvidia.com/gpu"
2023-10-19T14:25:03.802418971+08:00 stderr F }
2023-10-19T14:25:03.802420974+08:00 stderr F ]
2023-10-19T14:25:03.802422954+08:00 stderr F },
2023-10-19T14:25:03.802425007+08:00 stderr F "sharing": {
2023-10-19T14:25:03.802427254+08:00 stderr F "timeSlicing": {}
2023-10-19T14:25:03.802429318+08:00 stderr F }
2023-10-19T14:25:03.80243137+08:00 stderr F }
2023-10-19T14:25:03.802433402+08:00 stderr F I1019 06:25:03.802314 1 main.go:256] Retreiving plugins.
2023-10-19T14:25:03.802748485+08:00 stderr F W1019 06:25:03.802699 1 factory.go:31] No valid resources detected, creating a null CDI handler
2023-10-19T14:25:03.803225483+08:00 stderr F I1019 06:25:03.802746 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023-10-19T14:25:03.803231948+08:00 stderr F I1019 06:25:03.802796 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023-10-19T14:25:03.803234731+08:00 stderr F E1019 06:25:03.802804 1 factory.go:115] Incompatible platform detected
2023-10-19T14:25:03.803236869+08:00 stderr F E1019 06:25:03.802807 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2023-10-19T14:25:03.803239396+08:00 stderr F E1019 06:25:03.802811 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023-10-19T14:25:03.803241583+08:00 stderr F E1019 06:25:03.802814 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2023-10-19T14:25:03.803244111+08:00 stderr F E1019 06:25:03.802817 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2023-10-19T14:25:03.803261463+08:00 stderr F I1019 06:25:03.802825 1 main.go:287] No devices found. Waiting indefinitely.


klueska commented Oct 19, 2023

I'm assuming you use containerd as your runtime for Kubernetes, not Docker (containerd is the default).

Meaning you need to configure containerd to work with the nvidia-container-toolkit:

sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

Instructions for reference:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd

I added the additional --set-as-default flag since that is how you had configured Docker as well.
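For reference, after running that command the containerd config (typically /etc/containerd/config.toml) should register the nvidia runtime and set it as the default, roughly like this (a sketch for containerd's v2 config format; the exact section layout can vary by containerd version):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

Remember to restart containerd afterwards (e.g. sudo systemctl restart containerd) so the kubelet picks up the new default runtime.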


Frank-Zeng commented Nov 16, 2023

Sorry for the late reply, and thanks for the help. I tried several times, and it was indeed a default-runtime configuration problem. I had set up the k8s cluster with Rancher, and modifying this setting broke the cluster, so I had to change the cluster installation method in order to change the default runtime and solve the problem. Thanks again. @klueska
