The plugin can get the number of GPUs, but it cannot get the GPU memory, so shared scheduling does not work #47

Open
gxwangit opened this issue Oct 26, 2021 · 2 comments

Comments

@gxwangit

Capacity:
aliyun.com/gpu-count: 8
aliyun.com/gpu-mem: 0
GPU: Tesla V100

The logs are as follows:
[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet
Has anyone else run into this? k8s 1.16.3, nvidia-runtime 1.1-dev

@happy2048
Contributor

Are you reporting the GPU memory resource in MiB? By default we report GPU memory in GiB. Your GPUs have roughly 32 GB of memory each; if you report in MiB, that generates 32509 device IDs per GPU to report to the kubelet, which breaks the gRPC communication between the device plugin and the kubelet. Try switching to reporting in GiB.
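To make the arithmetic concrete, here is a minimal sketch (a hypothetical helper, not the plugin's actual code) of how per-unit fake device IDs blow up the device list: the plugin advertises GPU memory by exposing one fake device ID per memory unit, so a 32510 MiB V100 reported in MiB yields 32510 IDs per GPU (matching the "Add last device ID: ...--32509" lines above), while reporting in GiB yields only 31.

```go
package main

import "fmt"

// fakeDeviceIDs is a hypothetical illustration of the gpushare scheme:
// one fake device ID per unit of GPU memory, so the kubelet can hand out
// memory-sized slices of a shared GPU.
func fakeDeviceIDs(uuid string, memMiB uint64, unit string) []string {
	count := memMiB // MiB unit: one ID per MiB
	if unit == "GiB" {
		count = memMiB / 1024 // GiB unit: one ID per GiB
	}
	ids := make([]string, 0, count)
	for i := uint64(0); i < count; i++ {
		ids = append(ids, fmt.Sprintf("%s--%d", uuid, i))
	}
	return ids
}

func main() {
	uuid := "GPU-60805828-8ab0-6124-67c4-9baff56d087b"                // first GPU from the log
	fmt.Println("MiB unit:", len(fakeDeviceIDs(uuid, 32510, "MiB")))  // 32510 IDs per GPU
	fmt.Println("GiB unit:", len(fakeDeviceIDs(uuid, 32510, "GiB")))  // 31 IDs per GPU
}
```

If I remember correctly, the unit is controlled by a --memory-unit argument in the device plugin DaemonSet spec (set it to GiB); please double-check the exact flag name against the DaemonSet template shipped with this repo.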

@chenwenyan

Hi, I'm also hitting this issue. The capacity of aliyun.com/gpu-mem is 46, but the allocatable aliyun.com/gpu-mem is 0. When I run kubectl inspect gpushare, no node is displayed. How can I solve this problem?
I also ran journalctl -xefu kubelet on the node; the output is:

11月 30 17:12:17 slave6 kubelet[1483]: E1130 17:12:17.560765 1483 endpoint.go:62] Can't create new endpoint with path /var/lib/kubelet/device-plugins/gpushare.sock err failed to dial device plugin: context deadline exceeded
11月 30 17:12:17 slave6 kubelet[1483]: E1130 17:12:17.560805 1483 manager.go:485] Failed to dial device plugin with request &RegisterRequest{Version:v1beta1,Endpoint:gpushare.sock,ResourceName:gpushare/gpu-mem,Options:nil,}: failed to dial device plugin: context deadline exceeded
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607164 1483 manager.go:411] Got registration request from device plugin with resource name "gpushare/gpu-mem"
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607507 1483 endpoint.go:179] parsed scheme: ""
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607553 1483 endpoint.go:179] scheme "" not registered, fallback to default scheme
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607594 1483 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}] <nil> <nil>}
11月 30 17:12:17 slave6 kubelet[1483]: I1130 17:12:17.607614 1483 clientconn.go:933] ClientConn switching balancer to "pick_first"
11月 30 17:12:17 slave6 kubelet[1483]: W1130 17:12:17.607878 1483 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/gpushare.sock: connect: no such file or directory". Reconnecting...
11月 30 17:12:17 slave6 kubelet[1483]: W1130 17:12:17.653596 1483 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/gpushare.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/gpushare.sock: connect: no such file or directory". Reconnecting...
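In the kubelet log above, the plugin registers the endpoint gpushare.sock, but when kubelet dials back to /var/lib/kubelet/device-plugins/gpushare.sock it gets "connect: no such file or directory", i.e. the plugin's socket is gone by the time kubelet connects (often because the plugin pod crashed or restarted right after registering). The snippet below is a quick, hypothetical check, not part of the plugin or of kubelet; the socket path is taken from the log, everything else is an assumption for illustration.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Socket path taken from the kubelet log above.
	sock := "/var/lib/kubelet/device-plugins/gpushare.sock"

	// If the file is missing, kubelet's "no such file or directory" error is expected:
	// check whether the gpushare-device-plugin pod on this node is running and healthy.
	if _, err := os.Stat(sock); err != nil {
		fmt.Println("socket missing:", err)
		return
	}

	// The file exists; see whether anything is actually listening on it.
	conn, err := net.DialTimeout("unix", sock, 2*time.Second)
	if err != nil {
		fmt.Println("socket exists but is not accepting connections:", err)
		return
	}
	conn.Close()
	fmt.Println("socket is present and dialable; look for a plugin/kubelet version mismatch instead")
}
```

Run it on the affected node (slave6) while the device plugin pod is supposed to be up.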
