rfc: distributed coordinator #1078

Merged (1 commit, Feb 25, 2025)
249 changes: 249 additions & 0 deletions rfc/010-distributed-coordinator.md
# RFC 010: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service whose backend storage can't be shared between instances.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where nodes are replaced periodically.

## Requirements

* High availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Automatic recovery: node failure, pod migration, etc. must not require a manual recovery step.
* A newly started Coordinator must eventually recover automatically if at least one of its peers is recovered.
* Consistency: the Coordinator state must be strongly consistent.
* Data-owner security: the mesh CA key must only ever be exposed to Coordinator instances.

## Design

### Security considerations

In order to allow reasoning about the chain of trust, we adhere to the following invariants:

1. If a Coordinator has seed share owners $S_1,...,S_n$ in its active manifest, one of the following is true:
* It created its secret seed locally and only shared it in encrypted form with these seed share owners.
* It received its secret seed from an authenticated seed share owner $S_i$.
* It received its secret seed from an authenticated Coordinator with an active manifest `M` listing exactly $S_1,...,S_n$ as seed share owners.
2. For the active mesh CA key at a Coordinator with active manifest `M`, one of the following is true:
* It was locally generated by the Coordinator when `M` was set.
* It was received from an authenticated Coordinator with active manifest `M`.
3. A Coordinator only shares its secret seed and its mesh CA key with an authenticated Coordinator.

Taken together, these invariants ensure:

* Data owner security: a mesh CA key is only ever used with a single manifest and isn't known to non-Coordinators.
* Seed share owner security: the seed configured for an active manifest can be traced back to the seed share owners in that manifest.

See the [section on manifest changes](#manifest-changes) below regarding authentication of Coordinators.

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
	CompareAndSwap(key string, oldVal, newVal []byte) error
	Watch(key string, updates chan<- []byte) error // <-- New in RFC010.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (`Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.
While the `Watch` method isn't strictly required, it prevents entering recovery mode in the hot path of user API and mesh API calls.
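
A minimal sketch of `CompareAndSwap` on top of a `ConfigMap`-backed store, assuming a `client-go` clientset; the store type, the simplified key-to-name mapping, and the fixed `value` data key are illustrative assumptions, while the conflict handling relies on the `resourceVersion` mechanism described above.

```golang
package store

import (
	"bytes"
	"context"
	"fmt"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// k8sStore is a hypothetical ConfigMap-backed implementation of the Store interface.
type k8sStore struct {
	client    kubernetes.Interface
	namespace string
}

// CompareAndSwap atomically replaces oldVal with newVal for key. Atomicity
// comes from Kubernetes' optimistic concurrency control: the Update call
// carries the resourceVersion observed by Get, and the API server rejects it
// with a conflict error if another writer modified the object in between.
func (s *k8sStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
	ctx := context.Background()
	// Illustrative name mapping only; the real scheme truncates names to fit
	// the 63-character DNS-label limit (see the considerations below).
	name := "contrast-store-" + strings.ToLower(strings.ReplaceAll(key, "/", "-"))

	cm, err := s.client.CoreV1().ConfigMaps(s.namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !bytes.Equal([]byte(cm.Data["value"]), oldVal) {
		return fmt.Errorf("compare-and-swap of %q: old value doesn't match", key)
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["value"] = string(newVal)
	_, err = s.client.CoreV1().ConfigMaps(s.namespace).Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return fmt.Errorf("compare-and-swap of %q: concurrent update: %w", key, err)
	}
	return err
}
```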

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition isn't content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same `ConfigMap`.
4. The store content is unlikely to be useful after deletion of the Coordinator, and leaving it in the cluster prevents new Coordinators from being deployed.
   Thus, the Coordinator should set the `ownerReference`[^2] of the `ConfigMap`s to its `StatefulSet`.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>
[^2]: <https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing the manifest and the seed engine) doesn't correspond to the latest state in persistence.
Recovery mode is entered when:

* The Coordinator starts up and the history already contains a latest transition.
* The Coordinator receives a watch event for the latest manifest, and that manifest isn't its current one.
* The Coordinator syncs with the state persistence during an API call and discovers a new manifest (updated by another Coordinator but not yet propagated to this one).

The Coordinator leaves recovery mode once its state has been updated by either:

* a successful [peer recovery](#peer-recovery), or
* a successful recovery through the `userapi.Recover` RPC.

### Kubernetes changes

#### Coordinator service

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP metrics server, supporting the following checks (each returning status code 503 on failure):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator has an active manifest and isn't in recovery mode.

Using these probes, we expose the Coordinators in different ways, depending on audience:

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).
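
The following sketch shows how the probes could be wired into the Coordinator container; the port number and timing values are placeholders rather than the actual deployment defaults.

```yaml
# Illustrative probe wiring for the Coordinator container; port and timings
# are placeholders.
startupProbe:
  httpGet:
    path: /probe/startup
    port: 9102
  periodSeconds: 1
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /probe/liveness
    port: 9102
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /probe/readiness
    port: 9102
  periodSeconds: 5
```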

#### Peer discovery

A recovering Coordinator needs to know potential peers to recover from.
We add a background watcher for k8s pods that monitors existing coordinator pods, filters for those that are ready and populates a local lookup table.
Such functionality is core to projects like operator-sdk and kubebuilder.
However, those are very opinionated and rather monolithic, so it may be better to implement this using the watch primitives in `client-go`.
The implementation could probably share some code with `KubeClient.WaitFor`.
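
A possible shape of this watcher, built directly on the `client-go` watch primitives; the label selector and the lookup table layout are assumptions, and a real implementation would also need to re-establish expired watches.

```golang
package discovery

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// PeerTracker keeps a lookup table of ready Coordinator pods.
type PeerTracker struct {
	mu    sync.Mutex
	peers map[string]string // pod name -> pod IP
}

// Run watches Coordinator pods and updates the lookup table until the watch
// channel closes or the context is cancelled.
func (t *PeerTracker) Run(ctx context.Context, client kubernetes.Interface, namespace string) error {
	w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=coordinator", // assumed label
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		t.mu.Lock()
		if t.peers == nil {
			t.peers = make(map[string]string)
		}
		if event.Type == watch.Deleted || !podReady(pod) {
			delete(t.peers, pod.Name)
		} else {
			t.peers[pod.Name] = pod.Status.PodIP
		}
		t.mu.Unlock()
	}
	return ctx.Err()
}

// podReady reports whether the pod's Ready condition is true, that is, its
// readiness probe (recovered, active manifest) currently passes.
func podReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```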

#### Coordinator pods

The Coordinator pods should have a soft anti-affinity[^3] to each other.

[^3]: <https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity>
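
For illustration, the pod template could carry a preferred (soft) anti-affinity of roughly this shape; the label selector is an assumption.

```yaml
# Illustrative soft anti-affinity; the label selector is an assumption.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: coordinator
```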

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
	SANs             []string
	WorkloadSecretID string   `json:",omitempty"`
	Roles            []string `json:",omitempty"` // <-- New in RFC010.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the `Roles` of the corresponding policy entry.
If there is no Coordinator among the resources, we add a policy entry for the embedded Coordinator policy.
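
For illustration, a Coordinator pod template annotated as follows (the annotation name is taken from above, the rest of the metadata is omitted) would get the `coordinator` role appended to its policy entry during `generate`.

```yaml
# Illustrative pod template metadata for the Coordinator.
metadata:
  annotations:
    contrast.edgeless.systems/pod-role: "coordinator"
```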

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode):

1. Once after entering recovery mode.
2. Periodically from a goroutine, to reconcile missed watch events.

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC010.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

When this method is called, the serving Coordinator:

1. Provides an attestation statement to the client (this isn't the case for the mesh API right now).
2. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
3. Verifies that the client identity is allowed to be recovered by the state's manifest (has the `coordinator` role).
4. Fills the response with values from the current state.

The complete flow from the recovering Coordinator's perspective:

1. Load the latest manifest from persistence without verifying its signature.
2. Build an aTLS client from the unverified manifest:
   * Populate `ValidateOpts` with the reference values.
   * Add a callback that verifies the policy hash against the roles in the unverified manifest.
   * Add standard client attestation.
3. Call the serving Coordinator's `Recover` endpoint.
4. Compare the received manifest with the temporary manifest.
5. Set up a temporary `SeedEngine` with the received parameters.
6. Load the manifest history from persistence, verifying the signature with the temporary seed engine.
7. Compare the loaded manifest with the received manifest.
   If they match, update the state with the temporary seed engine and the fetched history.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.

It's expected that recovery can fail transiently, for example due to concurrent `SetManifest` calls.
If the entire process in this section fails, it should be restarted from the beginning with an appropriate backoff.
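
A sketch of this retry loop from the recovering Coordinator's perspective; `listPeers` and `recoverFromPeer` are placeholders standing in for the discovery step and for steps 1-7 above, and the backoff value is illustrative.

```golang
package recovery

import (
	"context"
	"log"
	"time"
)

// listPeersFunc returns addresses of currently ready Coordinators, for example
// via the DNS lookup for coordinator-peers described above or via the peer
// discovery lookup table. It's a placeholder for this sketch.
type listPeersFunc func(ctx context.Context) ([]string, error)

// recoverFromPeerFunc dials a peer over aTLS, calls meshapi.Recover and, on
// success, installs the received seed engine and manifest history
// (steps 1-7 above). It's a placeholder for this sketch.
type recoverFromPeerFunc func(ctx context.Context, addr string) error

// runPeerRecovery retries the whole peer recovery process until one peer
// succeeds or the context is cancelled. A production version would likely use
// exponential backoff with jitter instead of a fixed interval.
func runPeerRecovery(ctx context.Context, listPeers listPeersFunc, recoverFromPeer recoverFromPeerFunc) error {
	const backoff = 5 * time.Second // illustrative value
	for {
		addrs, err := listPeers(ctx)
		if err != nil {
			log.Printf("listing ready peers: %v", err)
		}
		for _, addr := range addrs {
			// A single successful peer ends the recovery process.
			if err := recoverFromPeer(ctx, addr); err != nil {
				log.Printf("recovering from %s: %v", addr, err)
				continue
			}
			return nil // recovered; the Coordinator leaves recovery mode.
		}
		// Transient failures (for example due to concurrent SetManifest calls)
		// are expected; restart from the beginning after a backoff.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```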

## Open issues

### Inconsistent state

After a `userapi.Recover` operation, the Coordinator needs to generate a new mesh key because it can't accept one from the recovering seed share owner.
If another Coordinator is still serving, they will both have the same active manifest and history, but different mesh keys.
Options for dealing with this situation:

* Treat user recovery like a manifest update (that is, write a new transition on recovery).
  This should force other Coordinators into recovery mode.
* Don't deal with this situation at all.
  Coordinators are expected to recover from peers on the order of seconds, whereas recovery by seed share owners is at least on the order of minutes.
  To be sure, we could
  * try to verify that all Coordinators are in recovery mode, or
  * try peer recovery in the hot path of user recovery.

## Alternatives considered

### Recovery initiated by the serving Coordinator

We could make each ready Coordinator try to discover unready peers and start recovering them.
Superficially, this approach would be closer to the existing recovery flow, but we'd still need to change:

* the authorization of the initiating party (seed share owner vs. attested Coordinator), and
* the handling of the mesh cert (generating a new one vs. accepting the one from the initiator).

One upside would be that we wouldn't need to watch manifest changes and could avoid recovering in the hot path of `GetManifest` or `NewMeshCert`.

### Services

1. We could define a headless service `coordinator-peers` for peer discovery.
   This would have the upside of not requiring permissions for the k8s API, but all the downsides of relying on DNS for discovery (freshness and TTLs, mostly).
2. The idea behind `coordinator-ready` is to provide clients that only ever need to talk to a ready Coordinator with an endpoint that's guaranteed to be ready.
   However, if there is at least one ready Coordinator, this proposal should ensure that the other Coordinators become ready, too, after a short time.
   Relying on that assumption would require adding recovery to the `GetManifest` and `NewMeshCert` handlers, though.
6 changes: 6 additions & 0 deletions tools/vale/styles/config/vocabularies/edgeless/accept.txt

Added vocabulary entries: backoff, CA, goroutine, IP, kubebuilder, TTL