rfc: distributed coordinator #1078

Merged (1 commit, Feb 25, 2025)
249 changes: 249 additions & 0 deletions rfc/010-distributed-coordinator.md
# RFC 010: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service whose backend storage can't be shared between instances.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where nodes are replaced periodically.

## Requirements

* High availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Automatic recovery: node failure, pod migration, etc. must not require a manual recovery step.
* A newly started Coordinator must eventually recover automatically if at least one of its peers is recovered.
* Consistency: the Coordinator state must be strongly consistent.
* Data-owner security: the mesh CA key must only ever be exposed to Coordinator instances.

## Design

### Security considerations

In order to allow reasoning about the chain of trust, we adhere to the following invariants:

1. If a Coordinator has seed share owners $S_1,...,S_n$ in its active manifest, one of the following is true:
* It created its secret seed locally and only shared it in encrypted form with these seed share owners.
* It received its secret seed from an authenticated seed share owner $S_i$.
* It received its secret seed from an authenticated Coordinator with an active manifest `M` listing exactly $S_1,...,S_n$ as seed share owners.
2. For the active mesh CA key at a Coordinator with active manifest `M`, one of the following is true:
* It was locally generated by the Coordinator when `M` was set.
* It was received from an authenticated Coordinator with active manifest `M`.
3. A Coordinator only shares its secret seed and its mesh CA key with an authenticated Coordinator.

Taken together, these invariants ensure:

* Data owner security: a mesh CA key is only ever used with a single manifest and isn't known to non-Coordinators.
* Seed share owner security: the seed configured for an active manifest can be traced back to the seed share owners in that manifest.

See the [section on manifest changes](#manifest-changes) below regarding authentication of Coordinators.

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
	CompareAndSwap(key string, oldVal, newVal []byte) error
	Watch(key string, updates chan<- []byte) error // <-- New in RFC010.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (`Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.
While the `Watch` method isn't strictly required, it prevents entering recovery mode in the hot path of user API and mesh API calls.
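
A minimal sketch of `CompareAndSwap` on top of a `ConfigMap`-backed store, assuming a `client-go` clientset; the store type, the simplified key-to-name mapping, and the fixed `value` data key are illustrative assumptions, while the conflict handling relies on the `resourceVersion` mechanism described above.

```golang
package store

import (
	"bytes"
	"context"
	"fmt"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// k8sStore is a hypothetical ConfigMap-backed implementation of the Store interface.
type k8sStore struct {
	client    kubernetes.Interface
	namespace string
}

// CompareAndSwap atomically replaces oldVal with newVal for key. Atomicity
// comes from Kubernetes' optimistic concurrency control: the Update call
// carries the resourceVersion observed by Get, and the API server rejects it
// with a conflict error if another writer modified the object in between.
func (s *k8sStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
	ctx := context.Background()
	// Illustrative name mapping only; the real scheme truncates names to fit
	// the 63-character DNS-label limit (see the considerations below).
	name := "contrast-store-" + strings.ToLower(strings.ReplaceAll(key, "/", "-"))

	cm, err := s.client.CoreV1().ConfigMaps(s.namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !bytes.Equal([]byte(cm.Data["value"]), oldVal) {
		return fmt.Errorf("compare-and-swap of %q: old value doesn't match", key)
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["value"] = string(newVal)
	_, err = s.client.CoreV1().ConfigMaps(s.namespace).Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return fmt.Errorf("compare-and-swap of %q: concurrent update: %w", key, err)
	}
	return err
}
```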

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition isn't content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same `ConfigMap`.
4. The store content is unlikely to be useful after deletion of the Coordinator, and leaving it in the cluster prevents new Coordinators from being deployed.
   Thus, the Coordinator should set the `ownerReference`[^2] of the `ConfigMap`s to its `StatefulSet`.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>
[^2]: <https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing the manifest and the seed engine) doesn't correspond to the latest state in persistence.
Recovery mode is entered when:

* The Coordinator starts up and the history already contains a latest transition.
* The Coordinator receives a watch event for the latest manifest, and that manifest isn't its current one.
* The Coordinator syncs with the state persistence during an API call and discovers a new manifest (updated by another Coordinator but not yet propagated to this one).

The Coordinator leaves recovery mode once its state has been updated by either:

* a successful [peer recovery](#peer-recovery), or
* a successful recovery through the `userapi.Recover` RPC.

### Kubernetes changes

#### Coordinator service

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP metrics server, supporting the following checks (each returning status code 503 on failure):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator has an active manifest and isn't in recovery mode.

Using these probes, we expose the Coordinators in different ways, depending on audience:

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).
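
The following sketch shows how the probes could be wired into the Coordinator container; the port number and timing values are placeholders rather than the actual deployment defaults.

```yaml
# Illustrative probe wiring for the Coordinator container; port and timings
# are placeholders.
startupProbe:
  httpGet:
    path: /probe/startup
    port: 9102
  periodSeconds: 1
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /probe/liveness
    port: 9102
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /probe/readiness
    port: 9102
  periodSeconds: 5
```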

#### Peer discovery

A recovering Coordinator needs to know potential peers to recover from.
We add a background watcher for k8s pods that monitors existing coordinator pods, filters for those that are ready and populates a local lookup table.
Such functionality is core to projects like operator-sdk and kubebuilder.
However, those are very opinionated and rather monolithic, so it may be better to implement this using the watch primitives in `client-go`.
The implementation could probably share some code with `KubeClient.WaitFor`.
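
A possible shape of this watcher, built directly on the `client-go` watch primitives; the label selector and the lookup table layout are assumptions, and a real implementation would also need to re-establish expired watches.

```golang
package discovery

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// PeerTracker keeps a lookup table of ready Coordinator pods.
type PeerTracker struct {
	mu    sync.Mutex
	peers map[string]string // pod name -> pod IP
}

// Run watches Coordinator pods and updates the lookup table until the watch
// channel closes or the context is cancelled.
func (t *PeerTracker) Run(ctx context.Context, client kubernetes.Interface, namespace string) error {
	w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=coordinator", // assumed label
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		t.mu.Lock()
		if t.peers == nil {
			t.peers = make(map[string]string)
		}
		if event.Type == watch.Deleted || !podReady(pod) {
			delete(t.peers, pod.Name)
		} else {
			t.peers[pod.Name] = pod.Status.PodIP
		}
		t.mu.Unlock()
	}
	return ctx.Err()
}

// podReady reports whether the pod's Ready condition is true, that is, its
// readiness probe (recovered, active manifest) currently passes.
func podReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```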

#### Coordinator pods

The Coordinator pods should have a soft anti-affinity[^3] to each other.

[^3]: <https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity>
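
For illustration, the pod template could carry a preferred (soft) anti-affinity of roughly this shape; the label selector is an assumption.

```yaml
# Illustrative soft anti-affinity; the label selector is an assumption.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: coordinator
```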

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
	SANs             []string
	WorkloadSecretID string   `json:",omitempty"`
	Roles            []string `json:",omitempty"` // <-- New in RFC010.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the `Roles` of the corresponding policy entry.
If there is no Coordinator among the resources, we add a policy entry for the embedded Coordinator policy.
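
For illustration, a Coordinator pod template annotated as follows (the annotation name is taken from above, the rest of the metadata is omitted) would get the `coordinator` role appended to its policy entry during `generate`.

```yaml
# Illustrative pod template metadata for the Coordinator.
metadata:
  annotations:
    contrast.edgeless.systems/pod-role: "coordinator"
```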

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode):

1. Once after entering recovery mode.
2. Periodically from a goroutine, to reconcile missed watch events.

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC010.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

When this method is called, the serving Coordinator:

1. Provides an attestation statement to the client (this isn't the case for the mesh API right now).
2. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
3. Verifies that the client identity is allowed to be recovered by the state's manifest (has the `coordinator` role).
4. Fills the response with values from the current state.

The complete flow from the recovering Coordinator's perspective:

1. Load the latest manifest from persistence without verifying its signature.
2. Build an aTLS client from the unverified manifest:
   * Populate `ValidateOpts` with the reference values.
   * Add a callback that verifies the policy hash against the roles in the unverified manifest.
   * Add standard client attestation.
3. Call the serving Coordinator's `Recover` endpoint.
4. Compare the received manifest with the temporary manifest.
5. Set up a temporary `SeedEngine` with the received parameters.
6. Load the manifest history from persistence, verifying the signature with the temporary seed engine.
7. Compare the loaded manifest with the received manifest.
   If they match, update the state with the temporary seed engine and the fetched history.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.

It's expected that recovery can fail transiently, for example due to concurrent `SetManifest` calls.
If the entire process in this section fails, it should be restarted from the beginning with an appropriate backoff.
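
A sketch of this retry loop from the recovering Coordinator's perspective; `listPeers` and `recoverFromPeer` are placeholders standing in for the discovery step and for steps 1-7 above, and the backoff value is illustrative.

```golang
package recovery

import (
	"context"
	"log"
	"time"
)

// listPeersFunc returns addresses of currently ready Coordinators, for example
// via the DNS lookup for coordinator-peers described above or via the peer
// discovery lookup table. It's a placeholder for this sketch.
type listPeersFunc func(ctx context.Context) ([]string, error)

// recoverFromPeerFunc dials a peer over aTLS, calls meshapi.Recover and, on
// success, installs the received seed engine and manifest history
// (steps 1-7 above). It's a placeholder for this sketch.
type recoverFromPeerFunc func(ctx context.Context, addr string) error

// runPeerRecovery retries the whole peer recovery process until one peer
// succeeds or the context is cancelled. A production version would likely use
// exponential backoff with jitter instead of a fixed interval.
func runPeerRecovery(ctx context.Context, listPeers listPeersFunc, recoverFromPeer recoverFromPeerFunc) error {
	const backoff = 5 * time.Second // illustrative value
	for {
		addrs, err := listPeers(ctx)
		if err != nil {
			log.Printf("listing ready peers: %v", err)
		}
		for _, addr := range addrs {
			// A single successful peer ends the recovery process.
			if err := recoverFromPeer(ctx, addr); err != nil {
				log.Printf("recovering from %s: %v", addr, err)
				continue
			}
			return nil // recovered; the Coordinator leaves recovery mode.
		}
		// Transient failures (for example due to concurrent SetManifest calls)
		// are expected; restart from the beginning after a backoff.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```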

## Open issues

### Inconsistent state

After a `userapi.Recover` operation, the Coordinator needs to generate a new mesh key because it can't accept one from the recovering seed share owner.
If another Coordinator is still serving, they will both have the same active manifest and history, but different mesh keys.
Options for dealing with this situation:

* Treat user recovery like a manifest update (that is, write a new transition on recovery).
  This should force other Coordinators into recovery mode.
* Don't deal with this situation at all.
  Coordinators are expected to recover from peers on the order of seconds, whereas recovery by seed share owners is at least on the order of minutes.
  To be sure, we could
  * try to verify that all Coordinators are in recovery mode, or
  * try peer recovery in the hot path of user recovery.

## Alternatives considered

### Recovery initiated by the serving Coordinator

We could make each ready Coordinator try to discover unready peers and start recovering them.
Superficially, this approach would be closer to the existing recovery flow, but we'd still need to change:

* the authorization of the initiating party (seed share owner vs. attested Coordinator), and
* the handling of the mesh cert (generating a new one vs. accepting the one from the initiator).

One upside would be that we wouldn't need to watch manifest changes and could avoid recovering in the hot path of `GetManifest` or `NewMeshCert`.

### Services

1. We could define a headless service `coordinator-peers` for peer discovery.
   This would have the upside of not requiring permissions for the k8s API, but all the downsides of relying on DNS for discovery (freshness and TTLs, mostly).
2. The idea behind `coordinator-ready` is to provide clients that only ever need to talk to a ready Coordinator with an endpoint that's guaranteed to be ready.
   However, if there is at least one ready Coordinator, this proposal should ensure that the other Coordinators become ready, too, after a short time.
   Relying on that assumption would require adding recovery to the `GetManifest` and `NewMeshCert` handlers, though.
6 changes: 6 additions & 0 deletions tools/vale/styles/config/vocabularies/edgeless/accept.txt

Added vocabulary entries: backoff, CA, goroutine, IP, kubebuilder, TTL