
nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown #23

Open
zhaogaolong opened this issue Apr 30, 2020 · 3 comments

Comments


zhaogaolong commented Apr 30, 2020

Version info:

k8s: 1.17
gpushare-device-plugin: v2-1.11-aff8a23
nvidia-smi: 440.36

kubectl describe pod <pod name> -n zhaogaolong
Pod error events:

Events:
  Type     Reason     Age                From                      Message
  ----     ------     ----               ----                      -------
  Normal   Scheduled  <unknown>          default-scheduler         Successfully assigned zhaogaolong/gpu-demo-gpushare-659fd6cbb7-6fc8v to gpu-node
  Normal   Pulling    32s (x4 over 70s)  kubelet, gpu-node  Pulling image "hub.xxxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
  Normal   Pulled     32s (x4 over 70s)  kubelet, gpu-node  Successfully pulled image "hub.xxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
  Normal   Created    31s (x4 over 70s)  kubelet, gpu-node  Created container gpu
  Warning  Failed     31s (x4 over 70s)  kubelet, gpu-node  Error: failed to start container "gpu": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown
  Warning  BackOff    10s (x5 over 68s)  kubelet, gpu-node  Back-off restarting failed container

Same issue:

NVIDIA/nvidia-docker#1042
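
For reference, no-gpu-has-256MiB-to-run is not a real GPU ID: it appears to be the placeholder device that the gpushare device plugin returns when it cannot find a GPU assignment from the gpushare scheduler extender for the pod, and nvidia-container-cli then rejects it as an unknown device. Below is a minimal sketch of the kind of pod spec involved, assuming the default gpushare resource name aliyun.com/gpu-mem and a device plugin configured to count memory in MiB (the 256 comes from the error message; the pod/namespace names mirror the events above, the image tag is elided, and the exact manifest is an assumption):

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-demo
    namespace: zhaogaolong
  spec:
    containers:
    - name: gpu
      image: hub.xxxx.com/zhaogaolong/gpu-demo.build.build:<tag>   # tag elided
      resources:
        limits:
          aliyun.com/gpu-mem: 256   # shared GPU memory handled by the gpushare extender, not nvidia.com/gpu

If the scheduler extender never filters/binds such a pod (for example because kube-scheduler is not configured to call it), the device plugin has nothing to map the request to, and the placeholder ID ends up being passed to the NVIDIA runtime hook, which fails as shown above.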

@cheyang

@Joseph516

Has anybody fixed this? I have the same problem here. AliyunContainerService/gpushare-scheduler-extender#120 (comment)


vio-f commented Jun 23, 2022

I encountered the same issue today. Can anybody help please?


Lanyujiex commented Aug 9, 2022

Update your scheduler config to include the gpushare-sch-extender and restart the scheduler; that should fix it. @vio-f
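
For completeness, on Kubernetes 1.17 that means adding the extender to the kube-scheduler policy config, roughly as in the install guide of AliyunContainerService/gpushare-scheduler-extender. The extender URL and port below are the defaults from that guide and may differ in your cluster, so treat this as a sketch:

  # scheduler-policy-config.json (sketch)
  {
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
      {
        "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
        "filterVerb": "filter",
        "bindVerb": "bind",
        "enableHttps": false,
        "nodeCacheCapable": true,
        "managedResources": [
          { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
        ],
        "ignorable": false
      }
    ]
  }

kube-scheduler then has to be started with --policy-config-file pointing at this file and restarted, and the failing pod recreated so it gets scheduled through the extender and bound to a concrete GPU.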
