# RFC 009: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service with backend storage that can't be shared.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where the nodes are replaced periodically.

## Requirements

* High availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step.
  * A newly started Coordinator that has recovered peers must eventually recover automatically.
* Consistency: the Coordinator state must be strongly consistent.

## Design

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
To allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
    Get(key string) ([]byte, error)
    Set(key string, value []byte) error
    CompareAndSwap(key string, oldVal, newVal []byte) error
    Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to perform an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
It can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.
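
To make the swap concrete, here is a minimal sketch of a ConfigMap-backed implementation, assuming a client-go clientset. The `kvStore` type, the naming scheme, the data-key layout, and the error handling are illustrative assumptions, not the final implementation (`Set` is elided):

```golang
package kvstore

import (
    "context"
    "fmt"
    "path"
    "strings"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

type kvStore struct {
    client    kubernetes.Interface
    namespace string
}

// name maps a store key to a ConfigMap name. Simplified here; the
// truncation-aware scheme is sketched in the storage layout section below.
func (s *kvStore) name(key string) string {
    return "contrast-store-" + strings.ReplaceAll(key, "/", "-")
}

func (s *kvStore) Get(key string) ([]byte, error) {
    cm, err := s.client.CoreV1().ConfigMaps(s.namespace).Get(context.Background(), s.name(key), metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    return []byte(cm.Data[path.Base(key)]), nil
}

// CompareAndSwap is atomic because the update carries the resourceVersion
// observed by the preceding Get: the apiserver rejects the write with a
// Conflict error if another writer modified the object in between.
func (s *kvStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
    cms := s.client.CoreV1().ConfigMaps(s.namespace)
    cm, err := cms.Get(context.Background(), s.name(key), metav1.GetOptions{})
    if err != nil {
        return err
    }
    if cm.Data[path.Base(key)] != string(oldVal) {
        return fmt.Errorf("compare-and-swap of %q: value changed", key)
    }
    cm.Data[path.Base(key)] = string(newVal)
    _, err = cms.Update(context.Background(), cm, metav1.UpdateOptions{})
    if apierrors.IsConflict(err) {
        return fmt.Errorf("compare-and-swap of %q: concurrent update: %w", key, err)
    }
    return err
}

// Watch subscribes to a single object via the native watch mechanism and
// invokes the callback for every event concerning it.
func (s *kvStore) Watch(key string, callback func(newVal []byte)) error {
    w, err := s.client.CoreV1().ConfigMaps(s.namespace).Watch(context.Background(), metav1.ListOptions{
        FieldSelector: "metadata.name=" + s.name(key),
    })
    if err != nil {
        return err
    }
    go func() {
        for ev := range w.ResultChan() {
            if cm, ok := ev.Object.(*corev1.ConfigMap); ok {
                callback([]byte(cm.Data[path.Base(key)]))
            }
        }
    }()
    return nil
}
```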

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, to keep the implementation simple, we implement the KV store on top of content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting their length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition is not content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5 MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100 KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```
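
The name mangling from consideration 1 could look roughly like the following sketch; the prefix and the exact truncation rule are illustrative assumptions:

```golang
package kvstore

import (
    "path"
    "strings"
)

// configMapNameFor derives a valid DNS label from a store key such as
// "transitions/<digest>". Keys whose truncated names collide simply become
// two data keys in the same ConfigMap, since the full, untruncated basename
// is kept as the data key.
func configMapNameFor(key string) string {
    dir := strings.ReplaceAll(path.Dir(key), "/", "-")
    name := "contrast-store-" + dir + "-" + path.Base(key)
    if len(name) > 63 { // object names must be valid DNS labels
        name = name[:63]
    }
    return strings.TrimRight(name, "-") // a DNS label must not end in '-'
}
```

Colliding entries sharing one ConfigMap is also why consideration 3 asks for a collision probability low enough to stay below etcd's value size limit.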

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset.
Recovery mode is entered by:

* Starting up: a Coordinator always begins in recovery mode.
* Receiving a watch event for the latest manifest when that manifest isn't the current one.
* Syncing the latest manifest during an API call and discovering a new manifest.

Recovery mode exits after the state has been set by:

* A successful [peer recovery](#peer-recovery).
* A successful recovery through the `userapi.Recover` RPC.
* A successful initial `userapi.SetManifest` RPC.
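
A sketch of how the state object and the watch callback could interact; the `State` and `Coordinator` shapes are assumptions for illustration:

```golang
package coordinator

import (
    "bytes"
    "sync"
)

// State bundles the latest manifest and the seed engine (details elided).
type State struct {
    LatestManifest []byte
}

type Coordinator struct {
    mu    sync.Mutex
    state *State // nil while in recovery mode
}

// recoveryMode reports whether the state object is unset.
func (c *Coordinator) recoveryMode() bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.state == nil
}

// onLatestTransition is the Watch callback for transitions/latest: observing
// a manifest other than the current one drops the state, re-entering
// recovery mode until peer recovery or a userapi call sets it again.
func (c *Coordinator) onLatestTransition(newVal []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.state != nil && !bytes.Equal(c.state.LatestManifest, newVal) {
        c.state = nil
    }
}
```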

### Service changes

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode.
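
A minimal sketch of the new handlers; the `prober` interface stands in for whatever checks the Coordinator actually exposes and is an assumption of this sketch:

```golang
package coordinator

import "net/http"

// prober abstracts the three checks above; the method set is illustrative.
type prober interface {
    PortsServing() bool
    PeerRecoveryAttempted() bool
    Recovered() bool
    StoreReadable() bool
    RecoveryMode() bool
}

// registerProbes wires the probe paths into the existing metrics server mux.
func registerProbes(mux *http.ServeMux, p prober) {
    probe := func(ok func() bool) http.HandlerFunc {
        return func(w http.ResponseWriter, _ *http.Request) {
            if !ok() {
                w.WriteHeader(http.StatusServiceUnavailable) // 503 on failure
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }
    mux.Handle("/probe/startup", probe(func() bool { return p.PortsServing() && p.PeerRecoveryAttempted() }))
    mux.Handle("/probe/liveness", probe(func() bool { return !p.Recovered() || p.StoreReadable() }))
    mux.Handle("/probe/readiness", probe(func() bool { return !p.RecoveryMode() }))
}
```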

Using these probes, we expose the Coordinators in different ways, depending on audience:

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from.
  Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
  The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).
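
For illustration, the `coordinator-peers` service could look like this; the selector, port number, and labels are placeholders, not the shipped manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: coordinator-peers
spec:
  clusterIP: None # headless: DNS resolves to the pod IPs of ready backends
  selector:
    app.kubernetes.io/name: coordinator
  ports:
    - name: meshapi
      port: 7777
```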

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field in the manifest that specifies the roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
    SANs             []string
    WorkloadSecretID string   `json:",omitempty"`
    Roles            []string `json:",omitempty"` // <-- New in RFC009.
}
```
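
In the manifest JSON, a policy entry carrying the new role might look like the following fragment; the hash and SAN values are placeholders:

```json
"Policies": {
  "<coordinator-policy-hash>": {
    "SANs": ["coordinator"],
    "Roles": ["coordinator"]
  }
}
```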

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry's roles.
If there is no Coordinator among the resources, we add a policy entry for the embedded Coordinator policy.
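
The annotation that `generate` evaluates would sit on the pod template, roughly like this (the resource kind and names are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: coordinator
spec:
  template:
    metadata:
      annotations:
        contrast.edgeless.systems/pod-role: coordinator
```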

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [recovery mode](#recovery-mode):

1. Once after startup, to speed up initialization.
2. Once as a callback to `Watch` events.
3. Periodically from a goroutine, to reconcile missed watch events.

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method; a sketch of the discovery loop follows the RPC definition below.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```
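
As referenced above, the discovery loop could look like this; `recoverFrom` stands in for the actual meshapi client call and is passed in to keep the sketch self-contained:

```golang
package coordinator

import (
    "context"
    "errors"
    "fmt"
    "net"
)

// peerRecovery resolves the headless service and tries each ready peer until
// one recovery attempt succeeds. recoverFrom dials meshapi.Recover on addr.
func peerRecovery(ctx context.Context, recoverFrom func(ctx context.Context, addr string) error) error {
    ips, err := net.DefaultResolver.LookupHost(ctx, "coordinator-peers")
    if err != nil {
        return fmt.Errorf("discovering ready peers: %w", err)
    }
    for _, ip := range ips {
        if err := recoverFrom(ctx, ip); err != nil {
            continue // try the next ready peer
        }
        return nil // recovered successfully
    }
    return errors.New("peer recovery failed: no peers left to try")
}
```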

When this method is called, the serving Coordinator:

0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
1. Verifies that the client identity is allowed to recover by the state's manifest (i.e., has the `coordinator` role).
2. Fills the response with values from the current state.

After receiving the `RecoverResponse`, the client Coordinator:

1. Sets up a temporary `SeedEngine` with the received parameters.
2. Verifies that the received manifest is the current latest.
   If not, it stops and enters recovery mode again.
3. Updates the state with the current manifest and the temporary seed engine.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.
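
Putting the client steps together as a sketch: the `meshapi.RecoverResponse` type is the generated proto message from above, while `newSeedEngine`, `c.store`, and `c.setState` are assumptions standing in for the real helpers.

```golang
// applyRecoverResponse implements the three client steps above.
func (c *Coordinator) applyRecoverResponse(resp *meshapi.RecoverResponse) error {
    // 1. Set up a temporary SeedEngine from the received parameters.
    se, err := newSeedEngine(resp.Seed, resp.Salt)
    if err != nil {
        return fmt.Errorf("setting up seed engine: %w", err)
    }
    // 2. The received manifest must still be the latest transition.
    latest, err := c.store.Get("transitions/latest")
    if err != nil {
        return fmt.Errorf("reading latest transition: %w", err)
    }
    if !bytes.Equal(latest, resp.LatestManifest) {
        return errors.New("received manifest is not the latest") // stay in recovery mode
    }
    // 3. Promote the manifest and the temporary seed engine to current state.
    c.setState(resp.LatestManifest, se)
    return nil
}
```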

## Open issues

* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered Coordinator exists. Fix candidates:
  * Store the mesh cert alongside the manifest.
  * Sign the latest transition with the mesh key (and re-sign on `userapi.Recover`).

## Alternatives considered

* TODO(burgerdev): different service structure
* TODO(burgerdev): push vs. pull
* TODO(burgerdev): active mesh CA key distribution
* TODO(burgerdev): nested CAs, reuse `NewMeshCert`