VFs may get reset after being allocated by another pod #219
Comments
Script and manifests I used to reproduce this: https://gist.github.com/ormergi/3ddbf901ddc95baf316b604994285a69
SR-IOV tests execution on CI has shown flakes where the VF configuration has been modified after assigning the device to the pod. The primary suspect is a CNI DELETE instruction from a previous pod which is somehow racing with the re-assignment of the device to the new pod. In order to mitigate this suspected issue (and as side effect, prove it) this change assures the VMI/s used with SR-IOV are deleted and absent before starting the next test. The issue is tracked on the sriov-cni project: k8snetworkplumbingwg/sriov-cni#219 Signed-off-by: Edward Haas <[email protected]>
@ormergi will you be able to provide the logs from Multus? I think that is the only way we will be able to trace the issue.
@SchSeba are these logs good enough? https://pastebin.com/eGhyrevk EDIT: FWIW, it's much easier to review these logs in VS Code.
This commit add a type of lock to allocated pci address Fixes: k8snetworkplumbingwg#219 Signed-off-by: Sebastian Sch <[email protected]>
…i addresses Fixes: k8snetworkplumbingwg#219 Signed-off-by: Sebastian Sch <[email protected]>
@adrianchiris is there a planned release so we could consume the fixed version? 🙂
Place a delay between tests to assure resources (VF/s) are fully released before reused again on a new VMI. (results show that waiting for VMI/s to disappear is not enough) Ref: k8snetworkplumbingwg/sriov-cni#219 This workaround should be temporary until the fix [1] can be consumed. [1] k8snetworkplumbingwg/sriov-cni#220 Signed-off-by: Edward Haas <[email protected]>
Following the flakes we have all over SR-IOV lanes due to [1] and [2], bump sriov-cni to v2.7.0 in order to consume the fix [3]. [1] kubevirt/kubevirt#6776 [2] k8snetworkplumbingwg/sriov-cni#219 [3] k8snetworkplumbingwg/sriov-cni#220 Signed-off-by: Or Mergi <[email protected]>
* Create allocation interface and implementation. This is needed to lock the allocation of the same PCI address until the cmdDel is called or the kernel remove the network namespace. Signed-off-by: Sebastian Sch <[email protected]> * Use the allocator interface to prevent allocation of still in used pci addresses Fixes: k8snetworkplumbingwg/sriov-cni#219 Signed-off-by: Sebastian Sch <[email protected]> Signed-off-by: Sebastian Sch <[email protected]>
What happened?
When pods with an SR-IOV interface from the same resource pool are created and deleted a few times,
the underlying VF may end up with the default configuration instead of the desired one (e.g., no MAC address, no VLAN).
What did you expect to happen?
The pod's underlying VF should be configured correctly.
What are the minimal steps needed to reproduce the bug?
Terminal 1: Monitor the SR-IOV VFs state on node worker1.
Terminal 2:
1. Create pod test with: nodeSelector to node worker1, MAC address 02:02:02:02:02.
2. Delete pod test in the background and immediately create a similar pod test2.
3. Wait for pod test2 to be ready.
After a few iterations, the pod is Running, but the VF state on the node
shows that the VF is not configured with the desired MAC address.
Anything else we need to know?
Scripts and manifests I used for reproducing the issue:
https://gist.github.com/ormergi/3ddbf901ddc95baf316b604994285a69
It seems that when pod (1) is deleted and the CNI cmdDel command is executed, it resets the VF whether or not the VF has already been allocated to another pod.
Also, it is not guaranteed that the underlying VF is reset as soon as a pod is disposed.
It takes 1-2 seconds for the VF to reset, which seems odd because I would expect all of the pod's resources to be freed once it is gone.
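The suspected race can be illustrated with a minimal ownership guard. This is a sketch under stated assumptions, not the sriov-cni implementation: the type `vfAllocator`, its methods, the PCI address, and the container IDs are all hypothetical. It keeps state in memory for brevity, whereas CNI plugins run as a separate process per command, so a real guard would need state that persists across invocations.

```go
package main

import (
	"fmt"
	"sync"
)

// vfAllocator is a hypothetical registry that maps a VF PCI address
// to the ID of the container currently holding it.
type vfAllocator struct {
	mu     sync.Mutex
	owners map[string]string
}

func newVFAllocator() *vfAllocator {
	return &vfAllocator{owners: make(map[string]string)}
}

// Allocate claims pciAddr for containerID. It fails while another
// container still holds the address, i.e. before its delete finished.
func (a *vfAllocator) Allocate(pciAddr, containerID string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if owner, held := a.owners[pciAddr]; held && owner != containerID {
		return fmt.Errorf("VF %s is still held by container %s", pciAddr, owner)
	}
	a.owners[pciAddr] = containerID
	return nil
}

// Release frees pciAddr only if containerID is the current owner.
// It reports whether the caller may go on to reset the VF.
func (a *vfAllocator) Release(pciAddr, containerID string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.owners[pciAddr] != containerID {
		return false
	}
	delete(a.owners, pciAddr)
	return true
}

func main() {
	alloc := newVFAllocator()
	const vf = "0000:04:02.0" // hypothetical VF PCI address

	// Pod "test" gets the VF.
	fmt.Println(alloc.Allocate(vf, "test"))
	// Pod "test2" asks for the same VF before "test" was cleaned up.
	fmt.Println(alloc.Allocate(vf, "test2"))
	// The delete for "test" completes; only now may the VF be reset.
	fmt.Println(alloc.Release(vf, "test"))
	// "test2" can now take the VF safely.
	fmt.Println(alloc.Allocate(vf, "test2"))
	// A late or repeated delete for "test" must not reset the VF of "test2".
	fmt.Println(alloc.Release(vf, "test"))
}
```

With a guard like this, the add for the new pod is rejected until the delete for the old pod has finished, and a stale delete cannot wipe the new pod's MAC or VLAN settings.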
Component Versions
Config Files
Config file locations may be config dependent.
CNI config (Try '/etc/cni/net.d/')
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
see ##### CNI config
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kind deployment
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
Journal log including Multus logs from when the issue occurred: https://pastebin.com/eGhyrevk