-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
插件能获取GPU的个数,但是获取不了GPU的显存,共享无法调度 #47
Comments
你是不是按MiB为单位上报的GPU显存资源呀?我们默认是按照GiB为单位上报的显存资源,你的节点有大约32G显存,如果按MiB上报,那么会产生32509个Device ID上报给kubelet,造成device plugin和kubelet之间的GRPC通信出现问题。换成按GiB上报试试吧。 |
Hi, I also meet this issue. The capacity of aliyun.com/gu-mem is 46, but the Allocatable aliyun.com/gpu-mem is 0. I use kubectl inspect gpushare, there is no node to display. How can I solve this problem?
|
Capacity:
aliyun.com/gpu-count: 8
aliyun.com/gpu-mem: 0
gpu tesla V100
日志如下
[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet
有没有人遇见过 k8s 1.16.3 nvidia-runtime 1.1-dev
The text was updated successfully, but these errors were encountered: