CFP for delegated IPAM with cilium-agent IPs (#13)

Merged (1 commit) on Aug 19, 2024.
File: `cilium/CFP-TODO-delegated-ipam-cilium-ips.md` (119 additions)
# CFP-003: Delegated IPAM with cilium-agent IPs

**SIG: SIG-Agent, SIG-CNI**

**Begin Design Discussion:** 2023-11-17

**Cilium Release:** X.XX

**Authors:** Will Daly <[email protected]>

**Status:** Dormant

## Summary

Enable features such as endpoint health checking and ingress controller that are currently incompatible with Cilium's delegated IPAM mode.


## Motivation

Cilium has an IPAM mode called "delegated plugin". In this mode, Cilium CNI invokes another CNI plugin to allocate and release IP addresses (see ["Plugin Delegation" in the CNI spec](https://www.cni.dev/docs/spec/#section-4-plugin-delegation) for details).

Unlike other IPAM modes, the cilium-agent daemonset is NOT involved in IPAM. However, several Cilium features require Cilium to assign itself an IP, outside the context of a CNI invocation. These features include endpoint health checking (`endpointHealthChecking.enabled=true`) and ingress controller (`ingressController.enabled=true`). When using delegated IPAM, these features are unavailable and [blocked by validation on cilium-agent startup](https://github.com/cilium/cilium/blob/70ae8d0ef536de807aab849291e5a68758cb8d4d/pkg/option/config.go#L3782).


## Goals

* Support endpoint health checking and ingress controller when using Cilium's delegated IPAM mode.
* The solution should work with any conformant CNI IPAM plugin (avoid assumptions about specific plugins/platforms).
* The solution should *not* leak IPs, even if cilium-agent crashes and restarts.


## Non-Goals

* This CFP does not propose any changes to other IPAM modes, just to delegated IPAM.


## Proposal

### Overview

When it needs to allocate IPs for itself, cilium-agent invokes the delegated IPAM plugin directly.


### IPAM Plugin Operations

The delegated IPAM plugin supports these three operations (as of CNI spec 0.4.0):

| Operation | Usage | Input | Output |
|------------|--------------------------------|----------------------------------------|------------------------------|
| ADD | Allocate an IP | CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME | IPs (possibly IPv4 and IPv6) |
| DEL | Release an IP | CNI_CONTAINERID, CNI_IFNAME | Success/failure |
| CHECK | Verify that an IP is allocated | CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME | Success/failure |

(The above table is highly simplified, see the [CNI spec](https://www.cni.dev/docs/spec) for full details.)
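For context, a delegated IPAM invocation is just an exec of the plugin binary with `CNI_*` variables in the environment and the network configuration JSON on stdin. Below is a minimal sketch of how cilium-agent might construct such a call; the helper name, plugin path, and config contents are hypothetical, not Cilium's actual implementation.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// buildIPAMCmd constructs the command for one delegated IPAM invocation.
// The CNI_* environment variables carry the operation and identifiers;
// the network configuration is passed as JSON on stdin.
func buildIPAMCmd(pluginPath, op, containerID, netNS, ifName, netConf string) *exec.Cmd {
	cmd := exec.Command(pluginPath)
	cmd.Env = append(cmd.Env,
		"CNI_COMMAND="+op, // ADD, DEL, or CHECK
		"CNI_CONTAINERID="+containerID,
		"CNI_NETNS="+netNS,
		"CNI_IFNAME="+ifName,
		"CNI_PATH=/opt/cni/bin", // where plugins may find other plugins
	)
	cmd.Stdin = strings.NewReader(netConf)
	return cmd
}

func main() {
	conf := `{"cniVersion":"0.4.0","name":"cilium","ipam":{"type":"host-local"}}`
	cmd := buildIPAMCmd("/opt/cni/bin/host-local", "ADD", "cilium-agent-health", "host", "eth0", conf)
	fmt.Println(cmd.Env[0], cmd.Env[1])
}
```

The command is only constructed here, not run; the plugin's JSON result on stdout would then be parsed per the CNI spec.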

The semantics of the above operations differ significantly from how other Cilium IPAM implementations work. In particular, Cilium's `ipam.IPAM` struct supports idempotent allocation of a specific IP using [AllocateIP](https://github.com/cilium/cilium/blob/70ae8d0ef536de807aab849291e5a68758cb8d4d/pkg/ipam/allocator.go#L47). This is used to restore IPs on cilium-agent restart, ensuring that the IP doesn't change and potentially disrupt the dataplane. This isn't possible with delegated IPAM, because:

* The required inputs do not include the IP address. By convention, some [IPAM plugins support an additional "ips" argument](https://www.cni.dev/docs/spec), but this is not universal.
* The CNI ADD operation is not idempotent. According to [the spec](https://www.cni.dev/docs/spec/#add-add-container-to-network-or-apply-modifications): "A runtime should not call ADD twice (without an intervening DEL) for the same (`CNI_CONTAINERID`, `CNI_IFNAME`) tuple."

> **Review comment (Member):**
> > This is used to restore IPs on cilium-agent restart, ensuring that the IP doesn't change and potentially disrupt the dataplane. This isn't possible with delegated IPAM, because [...]
>
> Does cilium-agent need to reallocate on every startup? The IP is already allocated and, from the delegated IPAM implementation's perspective, it can continue to be actively used. The delegated IPAM isn't aware of the state of its caller.
>
> Cilium-agent already does a lot of "store to disk + restore on startup" logic, so as long as cilium-agent has ever allocated the IP once, it could perhaps continue to use that IP as long as it is sure the IP is still allocated. The CHECK operation can help with this.

> **Review comment (Member, author):** Hmm, that's a good question. The main point I was trying to make here is that it's challenging to shoehorn delegated IPAM into the existing IPAM allocator interface. Maybe there's a way to implement Allocate(ip) as a CNI CHECK operation, but I'm not sure what arguments cilium-agent would send to the IPAM plugin (maybe there's some mapping from IP address to containerID/netNS/iface?).

> **Review comment (Contributor):** CNI author here. That's why RFCs use "SHOULD" :-p.
>
> But seriously, this is a good motivation to relax the no-add-without-del restriction. We've been idly discussing making ADD idempotent in CNI. I'll take this as another data point.


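To make the "ips" convention concrete, here is an illustrative stdin network configuration for a plugin that honors the well-known `args.cni.ips` field (the `static` IPAM plugin does; support varies by plugin, and the address shown is a placeholder):

```json
{
  "cniVersion": "0.4.0",
  "name": "cilium",
  "ipam": {
    "type": "static"
  },
  "args": {
    "cni": {
      "ips": ["10.0.0.5/24"]
    }
  }
}
```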

### IP Leakage

Another challenge with delegated IPAM is releasing IPs that are no longer in use. Once CNI ADD completes successfully, the IP is allocated. In a cloud environment, this may involve configuring the cloud network to route the IP to the node. If cilium-agent repeatedly allocates IPs (for example, crashing on startup before recording that it allocated the IP), these IPs would be unavailable for pods. This can be a serious problem in some environments.

Note that it's acceptable for cilium-agent to allocate an IP without releasing it before the node is deleted. This is equivalent to someone "pulling the plug" on the node (or, in a cloud environment, deleting the VM), so any real IPAM implementation will need to handle this case anyway.


### Process for cilium-agent to invoke delegated IPAM

Given the above constraints, how can cilium-agent safely invoke the delegated IPAM plugin?

First, note that cilium-agent allocates a small number of IPs for itself. For example, if both endpoint health checking and ingress controller are enabled in a single-stack cluster, then cilium-agent needs to allocate exactly two IPv4 addresses.

Each "kind" of address that cilium-agent needs to allocate can be assigned a unique CNI_CONTAINERID, known in advance. For example, endpoint health checking might use `CNI_CONTAINERID="cilium-agent-health"`, and ingress controller might use `CNI_CONTAINERID="cilium-agent-ingress"`. This allows cilium-agent to refer to an address that may have been allocated previously without knowing the exact IP address.

The other two parameters (`CNI_NETNS` and `CNI_IFNAME`) can be set to dummy values (perhaps `CNI_NETNS="host"` and `CNI_IFNAME="eth0"`?). These are required by the CNI spec (since a delegated IPAM plugin implements the same interface as a "full" CNI plugin), but are not used by any IPAM plugins that I'm aware of.

The protocol for cilium-agent to call delegated IPAM is then relatively simple:

1. If there is an IP to restore, invoke `CNI CHECK` to ensure that the IP is still allocated. If `CNI CHECK` succeeds, then return success.
2. `CNI DEL` to ensure any previously-allocated IP is released. Continue to step 3 even if `CNI DEL` errors.
3. `CNI ADD` to allocate a new IP. If it succeeds, then use the returned IP; otherwise, return failure.
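The three steps above can be sketched as follows. The `ipamOps` interface and `fake` implementation are hypothetical stand-ins for this sketch; in cilium-agent, each method would exec the delegated plugin as described earlier.

```go
package main

import "fmt"

// ipamOps abstracts the three delegated IPAM operations (hypothetical
// interface for this sketch; the real calls exec the plugin binary).
type ipamOps interface {
	Check(containerID string) error         // CNI CHECK: is the IP still allocated?
	Del(containerID string) error           // CNI DEL: release any previous allocation
	Add(containerID string) (string, error) // CNI ADD: allocate and return the IP
}

// restoreOrAllocate implements the three-step protocol described above.
func restoreOrAllocate(ops ipamOps, containerID, restoredIP string) (string, error) {
	// Step 1: if restoring, CHECK that the IP is still allocated.
	if restoredIP != "" && ops.Check(containerID) == nil {
		return restoredIP, nil
	}
	// Step 2: DEL any previously-allocated IP; continue even if it errors.
	_ = ops.Del(containerID)
	// Step 3: ADD a fresh allocation.
	ip, err := ops.Add(containerID)
	if err != nil {
		return "", fmt.Errorf("delegated IPAM ADD failed: %w", err)
	}
	return ip, nil
}

// fake is a toy in-memory implementation used only to demonstrate the flow.
type fake struct{ checkErr error }

func (f fake) Check(string) error         { return f.checkErr }
func (f fake) Del(string) error           { return nil }
func (f fake) Add(string) (string, error) { return "10.0.0.9", nil }

func main() {
	// Restored IP still valid: CHECK succeeds, keep the old IP.
	ip, _ := restoreOrAllocate(fake{}, "cilium-agent-health", "10.0.0.5")
	fmt.Println(ip)
	// Nothing to restore: DEL then ADD yields a fresh IP.
	ip, _ = restoreOrAllocate(fake{}, "cilium-agent-health", "")
	fmt.Println(ip)
}
```

Note that ignoring DEL errors in step 2 matches the spec's expectation that DEL on a nonexistent allocation is not fatal.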


### Complications and caveats

* **CNI state**: Some IPAM plugins store state on-disk (example: host-local writes to files in /var/lib/cni/networks by default, but this can be overridden in the CNI config). These directories *must* be mounted read-write in the cilium-agent pod, otherwise IPs could be leaked or double-allocated. Since this depends on the specific delegated IPAM plugin used, the user must configure this in the Cilium chart using `extraHostPathMounts`.
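For a plugin like host-local (used here purely as an example), the mount could be wired up via the chart's `extraHostPathMounts`. This is an untested sketch; the paths must match the state directory configured for the actual plugin in use.

```yaml
# Illustrative only: expose host-local's default state directory
# (/var/lib/cni/networks) read-write inside the cilium-agent pod, so the
# agent's direct IPAM invocations see the same allocation state that
# kubelet-driven CNI invocations do.
extraHostPathMounts:
  - name: cni-ipam-state
    mountPath: /var/lib/cni/networks
    hostPath: /var/lib/cni/networks
    hostPathType: DirectoryOrCreate
```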
> **Review comment (Member):** An interesting point this raises is that it implies the delegated IPAM plugin is running in the context of the cilium-agent container (same netns as the host/root netns, but other namespaces may differ, for instance filesystem and user). Typically it's running directly on the host, as it's invoked by kubelet -> container runtime -> cilium-cni.
>
> I agree that in that context the specific delegated IPAM needs access to all filesystem resources it depends upon. Unfortunately this brushes up against the goal (avoid assumptions about specific plugins/platforms).

> **Review comment (Member, author):** Yeah, this is messy. And, even worse, it's easy to misconfigure and not notice, because things will seem to work until IPs leak.

> **Review comment (Member, @joestringer, Jan 22, 2024):** Yeah. I think that this is the sort of thing that CNI 1.1's GC operation is intended to assist with, at least to mitigate the worst-case scenarios.
>
> As a side note, this also raises the point that well-planned lifecycle testing for this feature would help to mitigate these corner cases.

> **Review comment (Contributor):** Right, this is turning Cilium into both a CNI plugin as well as a runtime. It is definitely complicated.
>
> CNI plugins "expect" to be executed in the same context as the container runtime. Most notably, this means the real root filesystem, but it also means the same SELinux context. Now, I don't expect IPAM plugins to need significant privileges, but SELinux has tripped us up before.
>
> This is a strong motivating example for a dumb "like CNI but gRPC" translation of the protocol; I'd been putting that off as not worth the effort, but this is a good motivating example.
>
> There are a number of other potential stumbling blocks here, but I won't enumerate them all. It does mean that the administrator has to have an intimate understanding of how the IPAM plugin is expected to run.

> **Review comment (Member, author):**
> > CNI plugins "expect" to be executed in the same context as the container runtime. Most notably, this means the real root filesystem, but it also means the same SELinux context.
>
> Ah, yes, I didn't even think about SELinux.
>
> > It does mean that the administrator has to have an intimate understanding of how the IPAM plugin is expected to run.
>
> Totally agree. For AKS I'd expect things are simpler, since we use an IPAM plugin that just calls an HTTP server without storing any state. But it's hard to solve the general case.


* **Cilium config change**: Suppose a user first configures cilium with endpoint health checking, then disables it. This will leak one IP per IP family per node, since cilium-agent won't execute `CNI DEL` on every possible IP it might have allocated in previous configurations. I'd argue this is acceptable as long as it's documented: the IPs would eventually be released as nodes are deleted and replaced.

* **Cilium CNI version**: The current default Cilium CNI version is 0.3.1, but the `CNI CHECK` operation isn't supported until 0.4.0. The Cilium CNI code is compatible with 0.4.0, so I think it's safe to set 0.4.0 in the conflist.
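For illustration, the version bump is a one-line change to the `cniVersion` field of the installed conflist. Everything else below is a placeholder; the real Cilium conflist carries more settings.

```json
{
  "cniVersion": "0.4.0",
  "name": "cilium",
  "plugins": [
    {
      "type": "cilium-cni",
      "ipam": {
        "type": "host-local"
      }
    }
  ]
}
```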
> **Review comment (Member):** 👍 Yep, the plugin supports 1.0.0. We probably just need to update the examples to bump it up. Filed cilium/cilium#30363 to follow up.


* **CNI Spec 1.1 GC operation**: [CNI spec 1.1 introduces a new "GC" operation](https://github.com/containernetworking/cni/pull/1022). The idea is that the container runtime calls GC with a list of all known attachments, and the CNI plugin cleans up any attachments not in the list. The cleanup includes invoking delegated IPAM plugins to release IPs. This is a problem, since the container runtime won't know about IPs that cilium-agent allocated for itself by invoking the IPAM plugin directly. One possible solution would be for Cilium CNI's GC operation to inject IPs allocated by cilium-agent before Cilium CNI invokes the delegated IPAM plugin's GC. Unclear if this is allowed or forbidden by the CNI spec.
> **Review comment (Member):** ➕ Yeah, since the runtime would call CHECK on cilium-cni, and cilium-cni would have the ability to know about cilium-agent's allocations (via a local agent API call), that path makes the most sense to me. My guess is that the spec doesn't consider this case, but given the implementation details I think it makes sense for the CNI to inject these (as long as it does it correctly, avoids leaks, etc.).

> **Review comment (Member):** One more nit here: it looks like GC doesn't arrive until 1.2. The linked PR's commit isn't tagged in 1.1, and the CNI webpages don't publish anything about the GC operation.

> **Review comment (Contributor, @squeed, Jan 26, 2024):** No, this is my fault. The CNI repository is at v1.1, but that's the version of the Go library. The spec version is v1.0. If we continue like this, libcni v1.2 will support CNI spec v1.1. Oof.
>
> I need to fix this. We will probably move the Go library to its own repository.


* **CNI conflist installation**: cilium-agent needs to read the CNI conflist, which might not yet exist if it's installed by another daemonset (e.g. when Cilium is configured with `cni.install=false`). The easy thing to do is exit with an error, but it would be better to retry or watch the conflist directory.
> **Review comment (Member):** How is this problem different from the general case? I guess it's a bootstrapping problem, since cilium-agent would now directly invoke the CNI at "some time" (could be very quickly after startup), vs. the existing case where the cilium-cni binary is only invoked after this conflist is already known to be installed.
>
> Could be mitigated with some sort of retry logic, I suppose. Watches can often be a bit of a pain.

> **Review comment (Member, author):** The main difference is that if Cilium exits with an error on startup, it could look to the user like Cilium is broken, even though waiting for the conflist is expected behavior.
>
> Contrast that with what happens in the normal CNI flow: kubelet/containerd keep the node "NotReady" until the conflist file is created.



### Prototype

I wrote a small, hacky prototype to demonstrate that the proposed approach is possible:

https://github.com/cilium/cilium/compare/main...wedaly:cilium:delegated-ipam-cilium-agent-prototype


## Impacts / Key Questions

### Key Question: Is this compliant with the CNI spec?

The goal of the CNI spec is to define the interface between the container runtime and the CNI plugin. Invoking it directly from cilium-agent probably isn't something the spec writers ever had in mind. The main concern is that as the CNI ecosystem evolves, assumptions in this proposal will be broken.

### Key Question: Possible to move envoy to pod network?

If envoy were running in pod network as a separate daemonset, then it would get assigned an IP by the container runtime automatically. I think ingress controller / envoy is the most important feature unblocked by this CFP. I suspect moving envoy out of the host netns would greatly complicate the datapath, however.
> **Review comment (Member):** One proposal floating around is for there to be a dummy for the purpose of allocating the IP, with some annotation that cilium-agent could use to hijack the IP. Bit ugly, though. I agree that there's a fair bit of work to go through and try to migrate the Envoy into hostns and ensure all existing features continue to work.



## Future Milestones

N/A