failed to add IP addr <IP> to "net1-0": address already in use #20

Closed
sunya-ch opened this issue Sep 27, 2022 · 0 comments

Multi-NIC IPAM CNI allocates an address that is already in use to a Pod.

Expected Behavior

The IP address should be successfully added to the interface.

Current Behavior

error adding pod <pod name> to CNI network "multus-cni-network": 
plugin type="multus" name="multus-cni-network" 
failed (add): [default/..:multinic-ipvlanl3]: error adding container to network "multinic-ipvlanl3": 
failed to add IP addr <IP> to "net1-0": address already in use

Troubleshooting Steps

  • daemon log
03:29:55[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:29:55[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:30:12[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:30:12[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:30:12[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:30:12[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:30:30[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:30:30[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:30:30[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:30:30[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:30:45[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:30:45[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:30:45[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:30:45[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:30:59[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:30:59[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-2","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:30:59[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:30:59[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:31:14[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-3","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:31:14[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-3","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:31:14[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:31:14[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
03:31:18[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"},{"pod":"cluster-head-3","namespace":"default","index":2,"address":"192.168.34.66"}]}]
03:31:18[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"},{"pod":"cluster-head-3","namespace":"default","index":2,"address":"192.168.98.66"}]}]
03:31:18[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.98.65"}]}]
03:31:18[allocation:[{"pod":"cluster-head-1","namespace":"default","index":1,"address":"192.168.34.65"}]}]
  • The pod log looks normal, but the address 192.168.98.66 is still held in the kernel.
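
To make the pattern in the daemon log easier to see, the snapshots can be scanned for addresses that were handed to more than one pod. This is only a rough sketch: it assumes each log line has the shape shown above (a timestamp followed by a list of JSON objects), and the function names `parse_line`, `owners_by_address`, and `reused_addresses` are made up for illustration, not part of the daemon.

```python
import json
import re
from collections import defaultdict

# Matches one flat JSON object such as {"pod":"...","namespace":"...",...}.
ENTRY_RE = re.compile(r"\{[^{}]*\}")


def parse_line(line):
    """Split one daemon log line into (timestamp, list of allocation dicts)."""
    timestamp, _, rest = line.partition("[")
    entries = [json.loads(match) for match in ENTRY_RE.findall(rest)]
    return timestamp.strip(), entries


def owners_by_address(lines):
    """Map each address to the (namespace, pod) pairs seen holding it, in order."""
    owners = defaultdict(list)
    for line in lines:
        _, entries = parse_line(line)
        for entry in entries:
            key = (entry["namespace"], entry["pod"])
            if key not in owners[entry["address"]]:
                owners[entry["address"]].append(key)
    return dict(owners)


def reused_addresses(lines):
    """Return only the addresses that were held by more than one pod."""
    return {addr: pods for addr, pods in owners_by_address(lines).items()
            if len(pods) > 1}
```

Run over the excerpt above, this would flag 192.168.98.66 and 192.168.34.66, which go to cluster-head-2 first and then to cluster-head-3 at 03:31:14.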

Steps to Reproduce

(not trivial)

Context (Environment)

  • 48-node A100 GPU cluster
  • 3 RayClusters
  • Operator Image:
    res-cpe-team-docker-local.artifactory.swg-devops.com/multi-nic-cni-operator/controller:v1.0.2-alpha
  • Daemon Image:
    res-cpe-team-docker-local.artifactory.swg-devops.com/net/multi-nic-cni-daemon:v1.0.1-alpha

Possible Implementation

  1. Check whether the IP is in use before allocation --> Is there any way to check without trying to assign it to a network device?
  2. Detect anomalous requests in the daemon (multiple requests from the same pod)
  3. Report the failure from the CNI
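
For the first option, one possible way to probe an address without assigning it is to try binding a socket to it: on Linux the bind fails with EADDRNOTAVAIL when the address is not configured in the current network namespace (unless `ip_nonlocal_bind` is enabled). The sketch below is an assumption, not something the daemon does today, and `is_assigned_locally` is a hypothetical helper name. It only sees the namespace it runs in, so it is not established that it would catch the stale address in this issue if that address lives in another pod's namespace.

```python
import errno
import socket


def is_assigned_locally(ip):
    """Return True if `ip` is configured in the current network namespace.

    Binding a UDP socket to port 0 does not send any traffic and does not
    modify any interface; it only asks the kernel whether the address is local.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((ip, 0))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            raise  # Any other error is unexpected; surface it to the caller.
        return True
```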

Considering current practicality, I will work on the second option.
However, if we can find a way to implement the first option, it would be more straightforward.
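
A minimal sketch of what the second option could look like: count allocation requests per pod inside a sliding time window and flag the pod once it exceeds a threshold. The class name `RequestAnomalyDetector`, the threshold, and the window length are all illustrative assumptions; the actual daemon is not written in Python and may need a different criterion.

```python
from collections import defaultdict, deque


class RequestAnomalyDetector:
    """Flags a pod that requests an allocation too many times in a short window."""

    def __init__(self, max_requests=3, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # (namespace, pod) -> timestamps of recent requests, oldest first.
        self._seen = defaultdict(deque)

    def record(self, namespace, pod, now):
        """Record one request at time `now` (seconds); return True if anomalous."""
        recent = self._seen[(namespace, pod)]
        recent.append(now)
        # Drop requests that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests
```

In the log above, cluster-head-2 reappears at 03:30:12, 03:30:30, 03:30:45, and 03:30:59. Assuming each reappearance corresponds to a new request, feeding those four timestamps into a detector with `max_requests=3` and a 60-second window would flag the pod on the fourth one.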
