-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CFP for delegated IPAM with cilium-agent IPs #13
CFP for delegated IPAM with cilium-agent IPs #13
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing! I'll cc @squeed for awareness here. Notably just for the use case in general, and more specifically around the CNI spec interactions, GC operations and just general IPAM interaction.
I'm not doing a detailed review, but I did wonder about a couple of things while reading through the proposal.
|
||
(The above table is highly simplified, see the [CNI spec](https://www.cni.dev/docs/spec) for full details.) | ||
|
||
The semantics of the above operations differ significantly from how other Cilium IPAM implementations work. In particular, Cilium's `ipam.IPAM` struct supports idempotent allocation of a specific IP using [AllocateIP](https://github.com/cilium/cilium/blob/70ae8d0ef536de807aab849291e5a68758cb8d4d/pkg/ipam/allocator.go#L47). This is used to restore IPs on cilium-agent restart, ensuring that the IP doesn't change and potentially disrupt the dataplane. This isn't possible with delegated IPAM, because: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to restore IPs on cilium-agent restart, ensuring that the IP doesn't change and potentially disrupt the dataplane. This isn't possible with delegated IPAM, because [...]
Does cilium-agent need to reallocate on every startup? The IP is already allocated and from the delegated IPAM implementation's perspective, it can continue to be actively used. The delegated IPAM isn't aware of the state of its caller.
Cilium-agent does a bunch of "store to disk + restore on startup" sort of logic already, so as long as cilium-agent has ever allocated the IP once, it could perhaps continue to use that IP as long as it is sure that the IP is still allocated. The CHECK
operation can help with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, that's a good question. The main point I was trying to make here is that it's challenging to shoehorn delegated IPAM into the existing IPAM allocator interface. Maybe there's a way to implement Allocate(ip)
as a CNI CHECK operation, but I'm not sure what arguments cilium-agent would send to the IPAM plugin (maybe there's some mapping from IP address to containerID/netNS/iface?)
|
||
### Complications and caveats | ||
|
||
* **CNI state**: Some IPAM plugins store state on-disk (example: host-local writes to files in /var/lib/cni/networks by default, but this can be overridden in the CNI config). These directories *must* be mounted read-write in the cilium-agent pod, otherwise IPs could be leaked or double-allocated. Since this depends on the specific delegated IPAM plugin used, the user must configure this in the Cilium chart using `extraHostPathMounts`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point that this raises is that it implies the Delegated IPAM plugin is running in context of the cilium-agent container (same netns as host/root netns, but other namespaces may differ, for instance filesystem, user). Typically it's running directly on the host as it's invoked by kubelet->ContainerRuntime->cilium-cni.
I agree that in that context the specific delegated IPAM needs access to all filesystem resources it depends upon. Unfortunately this brushes up against the goal (avoid assumptions about specifics plugins/platforms)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is messy. And -- even worse -- it's easy to misconfigure and not notice, because things will seem to work until IPs leak.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think that this is the sort of thing that CNI 1.1's GC operation is intended to assist with, at least to mitigate the worst case scenarios.
As a side-note, this also raises the point that well-planned lifecycle testing for this feature would help to also mitigate these corner cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is turning Cilium in to both a CNI plugin as well as a runtime. It is definitely complicated.
CNI plugins "expect" to be executed in the same context as the container runtime. Most notably, this means the real root filesystem, but it also means the same SELinux context. Now, I don't expect IPAM plugins to need significant privileges, but SELinux has tripped us up before.
This is a strong motivating example for a dumb "like CNI but gRPC" translation of the protocol; I'd been putting that off as not worth the effort, but this is a good motivating example.
There are a number of other potential stumbling blocks here, but I wont enumerate them all. It does mean that the administrator does have to have an intimate understanding of how the IPAM plugin is expected to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CNI plugins "expect" to be executed in the same context as the container runtime. Most notably, this means the real root filesystem, but it also means the same SELinux context. Now, I don't expect IPAM plugins to need significant privileges, but SELinux has tripped us up before.
Ah, yes, I didn't even think about SELinux.
It does mean that the administrator does have to have an intimate understanding of how the IPAM plugin is expected to run.
Totally agree. For AKS I'd expect things are simpler since we use an IPAM plugin that just calls an HTTP server without storing any state. But it's hard to solve the general case.
|
||
* **Cilium config change**: Suppose a user first configures cilium with endpoint health checking, then disables it. This will leak one IP per IP family per node, since cilium-agent won't execute `CNI DEL` on every possible IP it might have allocated in previous configurations. I'd argue this is acceptable as long as it's documented: the IPs would eventually be released as nodes are deleted and replaced. | ||
|
||
* **Cilium CNI version**: Current default Cilium CNI version is 0.3.1, but the `CNI CHECK` operation isn't supported until 0.4.0. The Cilium CNI code is compatible with 0.4.0, so I think it's safe to set 0.4.0 in the conflist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yep, the plugin supports 1.0.0. We probably just need to update the examples to bump it up. Filed cilium/cilium#30363 to follow up.
|
||
* **Cilium CNI version**: Current default Cilium CNI version is 0.3.1, but the `CNI CHECK` operation isn't supported until 0.4.0. The Cilium CNI code is compatible with 0.4.0, so I think it's safe to set 0.4.0 in the conflist. | ||
|
||
* **CNI Spec 1.1 GC operation**: [CNI spec 1.1 introduces a new "GC" operation](https://github.com/containernetworking/cni/pull/1022). The idea is that the container runtime calls GC with a list of all known attachments, and the CNI plugin cleans up any attachments not in the list. The cleanup includes invoking delegated IPAM plugins to release IPs. This is a problem, since the container runtime won't know about IPs that cilium-agent allocated for itself by invoking the IPAM plugin directly. One possible solution would be for Cilium CNI's GC operation to inject IPs allocated by cilium-agent before Cilium CNI invokes the delegated IPAM plugin's GC. Unclear if this is allowed or forbidden by the CNI spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
➕ Yeah since the runtime would call CHECK on cilium-cni and cilium-cni would have the ability to know about cilium-agent's allocations (via local agent API call), that path makes the most sense to me. My guess is that the spec doesn't consider this case, but given the implementation details I think it makes sense for the CNI to inject these (as long as it does it correctly, avoid leaks, etc).
|
||
* **CNI Spec 1.1 GC operation**: [CNI spec 1.1 introduces a new "GC" operation](https://github.com/containernetworking/cni/pull/1022). The idea is that the container runtime calls GC with a list of all known attachments, and the CNI plugin cleans up any attachments not in the list. The cleanup includes invoking delegated IPAM plugins to release IPs. This is a problem, since the container runtime won't know about IPs that cilium-agent allocated for itself by invoking the IPAM plugin directly. One possible solution would be for Cilium CNI's GC operation to inject IPs allocated by cilium-agent before Cilium CNI invokes the delegated IPAM plugin's GC. Unclear if this is allowed or forbidden by the CNI spec. | ||
|
||
* **CNI conflist installation**: cilium-agent needs to read the CNI conflist, which might not yet exist if it's installed by another daemonset (e.g. when Cilium is configured with `cni.install=false`). Easy thing to do is exit with an error, but it would be better to retry or watch the conflist directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this problem different from the general case? I guess it's a bootstrapping problem since cilium-agent would now directly invoke the CNI at "some time" (could be very quick after startup), vs. the existing case where cilium-cni binary is only invoked after this conflist is already known to be installed.
Could be mitigated with some sort of retry logic or something I suppose. Watches can often be a bit of a pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference is that if Cilium exits with error on startup, it could look to the user like Cilium is broken, even though waiting for the conflist is expected behavior.
Contrast that with what happens in the normal CNI flow: kubelet/containerd keep the node "NotReady" until the conflist file is created.
### Key Question: Possible to move envoy to pod network? | ||
|
||
If envoy were running in pod network as a separate daemonset, then it would get assigned an IP by the container runtime automatically. I think ingress controller / envoy is the most important feature unblocked by this CFP. I suspect moving envoy out of the host netns would greatly complicate the datapath, however. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One proposal floating around is for there to be a dummy for the purpose of allocating the IP, with some annotation that cilium-agent could use to hijack the IP. Bit ugly though. I agree that there's a fair bit of work to go through and try to migrate the Envoy into hostns and ensure all existing features continue to work.
|
||
* **Cilium CNI version**: Current default Cilium CNI version is 0.3.1, but the `CNI CHECK` operation isn't supported until 0.4.0. The Cilium CNI code is compatible with 0.4.0, so I think it's safe to set 0.4.0 in the conflist. | ||
|
||
* **CNI Spec 1.1 GC operation**: [CNI spec 1.1 introduces a new "GC" operation](https://github.com/containernetworking/cni/pull/1022). The idea is that the container runtime calls GC with a list of all known attachments, and the CNI plugin cleans up any attachments not in the list. The cleanup includes invoking delegated IPAM plugins to release IPs. This is a problem, since the container runtime won't know about IPs that cilium-agent allocated for itself by invoking the IPAM plugin directly. One possible solution would be for Cilium CNI's GC operation to inject IPs allocated by cilium-agent before Cilium CNI invokes the delegated IPAM plugin's GC. Unclear if this is allowed or forbidden by the CNI spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more nit here, looks like GC doesn't arrive until 1.2. The linked PR's commit isn't tagged in 1.1 and the CNI webpages don't publish anything about the GC operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is my fault. The CNI repository is at v1.1, but that's the version of the go library. The spec version is v1.0. If we continue like this, libcni v1.2 will support CNI spec v1.1. Oof.
I need to fix this. We will probably move the go library to its own repository.
(apologies, I've been totally slammed, but reviewing this is very high up on my to-do list) |
The semantics of the above operations differ significantly from how other Cilium IPAM implementations work. In particular, Cilium's `ipam.IPAM` struct supports idempotent allocation of a specific IP using [AllocateIP](https://github.com/cilium/cilium/blob/70ae8d0ef536de807aab849291e5a68758cb8d4d/pkg/ipam/allocator.go#L47). This is used to restore IPs on cilium-agent restart, ensuring that the IP doesn't change and potentially disrupt the dataplane. This isn't possible with delegated IPAM, because: | ||
|
||
* The required inputs do not include the IP address. By convention, some [IPAM plugins support an additional "ips" argument](https://www.cni.dev/docs/spec), but this is not universal. | ||
* The CNI ADD operation is not idempotent. According to [the spec](https://www.cni.dev/docs/spec/#add-add-container-to-network-or-apply-modifications): "A runtime should not call ADD twice (without an intervening DEL) for the same (`CNI_CONTAINERID`, `CNI_IFNAME`) tuple." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CNI author here. That's why RFCs use "SHOULD" :-p.
But seriously, this is a good motivation to relax the no-add-without-del restriction. We've been idly discussing making ADD idempotent in CNI. I'll take this as another data point.
This is a clever solution, but I have concerns about the complicated dance that is Cilium acting as both a CNI plugin and a CNI runtime -- especially the details around lifecycle and making sure we don't leak IPs. CNI GC can save us, to some extent -- but only when runtime (containerd) implements it, which is in progress but not done yet :-). What would it be like if health and ingress IPs were allocated by an external controller and set as annotations on the Node? Cilium could have a mode where it waits for these annotations before proceeding with startup. Alternatively, what if we created a "placeholder" daemonset? Then there would be two noop pods scheduled to the node, but then the lifecycle is crystal clear. |
Thanks for reviewing this @squeed !
Yeah, after thinking this through I tend to agree. Even if CNI GC is able to clean up leaked IPs, I think there's a risk of double-allocating IPs if the volume mounts are misconfigured (cilium restarts, CNI IPAM state isn't preserved due to misconfigured volume mounts, and IPAM plugin returns an IP that was already assigned to a pod).
That seems feasible, although I'd worry about moving this responsibility to the cloud provider. The nice thing about delegated IPAM today is that Azure IPAM doesn't need to know that it's running Cilium, and Cilium doesn't need to know that it's using Azure IPAM. In theory I guess someone could write a controller that tells Azure IPAM to allocate an IP, then annotates the nodes; however, there isn't an easy way to do this in AKS today. In most networking configurations, IPAM allocation happens by invoking an endpoint on a daemonset on the node where the IP is allocated; there isn't an API available that a controller could call to allocate an IP for any node.
I like this approach better. No-op daemonset pods seem like a reasonable tradeoff to reduce the risk of misconfiguration. I think there's a cyclic dependency to figure out though: (1) daemonset pod needs cilium-agent running to start and get an IP, (2) cilium-agent needs the IP from the daemonset pod to start cilium-agent. How hard would it be to let cilium-agent start without the IPs? For ingress/healthcheck IPs I guess cilium-agent could set these up later, but I'm less sure about the router IP. |
I take that back: it is possible to assign additional IPs to the VM through Azure APIs, which is what Cilium's legacy Azure IPAM implementation does: https://github.com/cilium/cilium/blob/4820ebe7b372977c351837eb1ce5ac6024f712aa/pkg/azure/ipam/instances.go#L23 So that might be an option. It's a bit sad that it's cloud-provider specific, but at least someone could implement this controller outside the Cilium codebase.
One other quick thought on this: there's a security risk to granting a controller permission to update k8s nodes, since an attacker could manipulate labels to control where pods get scheduled (https://kccnceu2022.sched.com/event/ytlb/trampoline-pods-node-to-admin-privesc-built-into-popular-k8s-platforms-yuval-avrahami-shaul-ben-hai-palo-alto-networks). Maybe permission to mutate CiliumNode is safer? |
@wedaly We now have status for the CFPs https://github.com/cilium/design-cfps#status. Would you consider moving this to dormant so it could be picked up again in the future? |
Yep, that works for me. Where do I set the status? |
Something like this is good enough: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to #7, I think that it's OK to merge dormant proposals in the state that they're in, even if there are remaining outstanding items to address (discussion also in #48). The CFP can be subsequently picked up by someone in the community, and at that time they can make subsequent proposals to change the CFP and move the status to "Implementable". The proposal's acceptance would be subject to review at that time.
Prior to merge I wanted to take a fresh look across the proposal to try to give a high level take on where it's at right now, so that anyone attempting to pick this up again in future can get some guidance about what to address next. The comments above in this PR provide more specific feedback, but I'll try to summarize below.
I think the overall open question is whether we need more structure from the CNI spec (or successor) to formalize a solution. There is an alternative for a more limited scope, less clean solution through existing mechanisms. There's also the question of whether Cilium is assuming more of the role traditionally associated with the runtime rather than the CNI, and what hurdles that may introduce (including for instance, running in the host rather container context and being compatible with SELinux).
To be merged into the repository in dormant state, we would need two more things:
|
Signed-off-by: Will Daly <[email protected]>
7a3fc60
to
72efc30
Compare
thanks, updated! |
I started this draft CFP and prototype for getting delegated IPAM to work with endpoint health checking and ingress controller back in November. Unfortunately, I'm not going to have time to work further on this, so I'm publishing it as a draft PR to share what I found.