SR-IOV Device Plugin Pods Keep Restarting with 'terminated' Signal on All Nodes #610
Comments
Hi @koh-hr, can you please attach the logs from the sriov-network-config-daemon on the same node? That daemon kills and restarts the device plugin whenever it completes a configuration, so it is probably stuck in a configuration loop.
@zeeke
It seems the label was applied correctly, but I would appreciate further assistance with any additional troubleshooting steps. Let me know if there are other configurations I should check or if anything else needs to be adjusted.
SriovNetworkNodeState resources are created by the sriov-network-operator, for each node with labels:
Can you please check the node labels and the sriov-network-operator logs?
It seems that the resource was successfully created.
The logs from the sriov-network-config-daemon are as follows
Please, attach the full config-daemon logs or search for |
@zeeke
I think the NVIDIA network-operator is the one that restarts the device. Maybe @adrianchiris @ykulazhenkov @e0ne can have a look at the relevant logs?
@koh-hr from what I read, you are deploying 3 device plugins
2 and 3 are unfortunately named the same (and are deployed in the same namespace), so they are probably overriding each other. Do you need all 3? What is your use case? Usually folks just go with one.
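To illustrate the clash described in this comment: two objects of the same kind with the same name in the same namespace resolve to a single Kubernetes object, so whichever deployer applies last overwrites the other. Below is a minimal Python sketch that flags such collisions in a set of manifests. The helper `find_name_clashes`, the deployer labels, the namespace `example-ns`, and the names `plugin-a` and `plugin-b` are all hypothetical placeholders, not values taken from this cluster.

```python
from collections import defaultdict


def find_name_clashes(manifests):
    """Return {(kind, namespace, name): [sources]} for every key claimed by
    more than one source. `manifests` is a list of (source, manifest_dict)."""
    owners = defaultdict(list)
    for source, manifest in manifests:
        meta = manifest.get("metadata", {})
        key = (
            manifest.get("kind"),
            meta.get("namespace", "default"),
            meta.get("name"),
        )
        owners[key].append(source)
    # Keep only the keys that more than one source tries to own.
    return {key: srcs for key, srcs in owners.items() if len(srcs) > 1}


# Hypothetical input: three deployers, two of which use the same name.
manifests = [
    ("deployer-1", {"kind": "DaemonSet",
                    "metadata": {"namespace": "example-ns", "name": "plugin-a"}}),
    ("deployer-2", {"kind": "DaemonSet",
                    "metadata": {"namespace": "example-ns", "name": "plugin-b"}}),
    ("deployer-3", {"kind": "DaemonSet",
                    "metadata": {"namespace": "example-ns", "name": "plugin-b"}}),
]

clashes = find_name_clashes(manifests)
for key, sources in clashes.items():
    print(key, "is claimed by", sources)
```

In this sketch only the `plugin-b` DaemonSet is reported, because two deployers claim it in the same namespace.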
Thank you for your response! As for the use case, |
OK, so I would:
Then you will be left with the instance of sriov-device-plugin from sriov-network-operator. FYI: we pushed a fix for the DaemonSet naming clash to network-operator [1]
After editing the If that is the case, how can I configure it to specify only VFs for host-dev? The following logs were collected from a test environment, but the observed behavior was the same.
What happened?
The sriov-device-plugin Pod keeps restarting repeatedly with the following log messages. This issue occurs on all Pods in the DaemonSet across the targeted nodes, not just a specific Pod.
Looking at the logs of the Operator does not provide any useful information. I'm stuck and would appreciate your help.
What did you expect to happen?
The Pods should remain running without restarting.
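For anyone trying to confirm the same symptom, here is a minimal Python sketch that summarizes restart counts and the last termination state per container. It assumes the input is the parsed JSON from `kubectl get pods -n <namespace> -o json`. The helper name `restart_summary` and every value in the sample (pod names, node names, counts, reasons) are made up for illustration and are not taken from this issue.

```python
def restart_summary(pod_list):
    """Return one row per container, sorted by restart count (highest first)."""
    rows = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # lastState.terminated is only present after a container has exited.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            rows.append({
                "pod": pod["metadata"]["name"],
                "node": pod.get("spec", {}).get("nodeName"),
                "container": cs["name"],
                "restarts": cs.get("restartCount", 0),
                "last_reason": terminated.get("reason"),
                "last_exit_code": terminated.get("exitCode"),
            })
    return sorted(rows, key=lambda r: r["restarts"], reverse=True)


# Hypothetical sample shaped like a `kubectl get pods -o json` response.
sample = {
    "items": [
        {
            "metadata": {"name": "pod-a"},
            "spec": {"nodeName": "node-1"},
            "status": {"containerStatuses": [
                {"name": "main", "restartCount": 0, "lastState": {}},
            ]},
        },
        {
            "metadata": {"name": "pod-b"},
            "spec": {"nodeName": "node-2"},
            "status": {"containerStatuses": [
                {"name": "main", "restartCount": 12,
                 "lastState": {"terminated": {"reason": "Completed",
                                              "exitCode": 0}}},
            ]},
        },
    ],
}

summary = restart_summary(sample)
for row in summary:
    print(row)
```

A restart count that keeps climbing on every node, as described above, would point to something external restarting the Pods rather than a fault on a single node.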
What are the minimal steps needed to reproduce the bug?
Deploy the sriov-device-plugin to Kubernetes using the NVIDIA Network Operator v24.1.0, following the guide below:
https://docs.nvidia.com/networking/display/kubernetes2410/getting+started+with+kubernetes#src-2494425587_GettingStartedwithKubernetes-NetworkOperatorDeploymentforGPUDirectWorkloads
Anything else we need to know?
This issue is occurring in an on-premises environment.
The same issue happens in two separate clusters.
Component Versions
Please fill in the below table with the version numbers of components used.
Config Files
Config file locations may be config dependent.
Device pool config file location (Try '/etc/pcidp/config.json')
Command executed on the host:
Multus config (Try '/etc/cni/multus/net.d')
Command executed on the host:
CNI config (Try '/etc/cni/net.d/')
Command executed on the host:
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
・kubeadm
SR-IOV Network Custom Resource Definition
The configuration for the NicClusterPolicy is as follows:
Logs
The following logs are from a node targeted by SR-IOV. The same issue occurs on non-targeted nodes.
SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)