Commit d623dcb

rfc: distributed coordinator
1 parent cb9f1bb commit d623dcb

Diff for: rfc/009-distributed-coordinator.md

# RFC 009: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service with backend storage that can't be shared between replicas.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where nodes are replaced periodically.

## Requirements

* High availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step.
  A newly started Coordinator that has recovered peers must eventually recover automatically.
* Consistency: the Coordinator state must be strongly consistent.

## Design

### Persistence changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
	CompareAndSwap(key string, oldVal, newVal []byte) error
	Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting to manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.

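One way to implement `CompareAndSwap` on top of `ConfigMaps` is sketched below, assuming a client-go clientset; the `kvStore` type and the `configMapName` / `dataKey` helpers are illustrative, not part of the current code base.

```golang
import (
	"bytes"
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type kvStore struct {
	client    kubernetes.Interface
	namespace string
}

func (s *kvStore) CompareAndSwap(key string, oldVal, newVal []byte) error {
	ctx := context.Background()
	cms := s.client.CoreV1().ConfigMaps(s.namespace)
	cm, err := cms.Get(ctx, configMapName(key), metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !bytes.Equal([]byte(cm.Data[dataKey(key)]), oldVal) {
		return fmt.Errorf("compare-and-swap of %q: unexpected current value", key)
	}
	cm.Data[dataKey(key)] = string(newVal)
	// The update carries the resourceVersion observed by the Get above, so the
	// API server rejects it with a conflict if the object changed concurrently.
	_, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
	return err
}
```
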
[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
   We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition is not content-addressable.
   We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
   Should a use case arise, we could make that name configurable.
3. Etcd has a limit of 1.5MiB per value.
   This is more than enough for policies, which are the largest objects we store right now at around 100KiB.
   However, we should make sure that the collision probability is low enough that not too many policies end up in the same `ConfigMap`.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```

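A minimal sketch of the name derivation from point 1, assuming a hypothetical `configMapName` helper; the exact truncation rule is an illustration:

```golang
import (
	"fmt"
	"path"
	"strings"
)

// configMapName maps a store key like "transitions/<hash>" to a valid DNS
// label (at most 63 characters). Truncation is acceptable because the full
// basename is kept as the data key inside the ConfigMap.
func configMapName(key string) string {
	dir, base := path.Split(key)
	name := fmt.Sprintf("contrast-store-%s-%s", strings.Trim(dir, "/"), base)
	if len(name) > 63 {
		name = name[:63]
	}
	return strings.TrimRight(name, "-") // DNS labels must not end in '-'.
}
```
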
[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset.
Recovery mode can be entered as follows:

* The Coordinator starts up in recovery mode.
* The Coordinator receives a watch event for the latest manifest, and that manifest is not the current one.
* The Coordinator syncs the latest manifest during an API call and discovers a new manifest.

Recovery mode exits after the state has been set by:

* A successful [Peer recovery](#peer-recovery).
* A successful recovery through the `userapi.Recover` RPC.
* A successful initial `userapi.SetManifest` RPC.

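As an illustration, assuming the Coordinator keeps its state behind an atomic pointer (the names below are hypothetical), recovery mode is simply the absence of state:

```golang
import "sync/atomic"

// State bundles the current manifest and the seed engine.
type State struct{ /* manifest, seed engine, ... */ }

type Coordinator struct {
	state atomic.Pointer[State] // nil while in recovery mode
}

func (c *Coordinator) recoveryMode() bool { return c.state.Load() == nil }
```
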
### Service changes

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode.

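A sketch of how the readiness check could be wired into that server, using the standard library mux; the `Coordinator` methods are the hypothetical ones from the sketch above:

```golang
import "net/http"

func registerProbes(mux *http.ServeMux, c *Coordinator) {
	mux.HandleFunc("/probe/readiness", func(w http.ResponseWriter, _ *http.Request) {
		// Ready means not in recovery mode, i.e., the state object is set.
		if c.recoveryMode() {
			http.Error(w, "coordinator is in recovery mode", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```
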
Using these probes, we expose the Coordinators in different ways, depending on the audience:

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from.
  Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
  The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
	SANs             []string
	WorkloadSecretID string   `json:",omitempty"`
	Roles            []string `json:",omitempty"` // <-- New in RFC009.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry's roles.
If there is no coordinator among the resources, we add a policy entry for the embedded coordinator policy.

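A minimal sketch of that `generate` step; the function signature is a simplified stand-in for the real one:

```golang
// addRole propagates the pod-role annotation into the policy entry.
func addRole(entry *PolicyEntry, annotations map[string]string) {
	if role, ok := annotations["contrast.edgeless.systems/pod-role"]; ok && role != "" {
		entry.Roles = append(entry.Roles, role)
	}
}
```
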
### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode):

1. Once after startup, to speed up initialization.
2. Once as a callback to the `Watch` event.
3. Periodically from a goroutine, to reconcile missed watch events.

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

When this method is called, the serving Coordinator:

0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
1. Verifies that the client identity is allowed to recover by the state's manifest (has the `coordinator` role).
2. Fills the response with values from the current state.

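A sketch of the serving side of these steps, assuming hypothetical helpers `stateFromContext` and `policyEntryForPeer` and hypothetical state fields; the response fields match the proto definition above:

```golang
import (
	"context"
	"slices"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func (s *meshAPIServer) Recover(ctx context.Context, _ *RecoverRequest) (*RecoverResponse, error) {
	// Step 0: the state was attached to the context by middleware.
	state := stateFromContext(ctx)
	// Step 1: look up the policy entry for the verified client identity
	// and check that it carries the coordinator role.
	entry, err := policyEntryForPeer(ctx, state.Manifest)
	if err != nil {
		return nil, status.Error(codes.Unauthenticated, "unknown peer identity")
	}
	if !slices.Contains(entry.Roles, "coordinator") {
		return nil, status.Error(codes.PermissionDenied, "peer is not allowed to recover")
	}
	// Step 2: fill the response from the current state.
	return &RecoverResponse{
		Seed:           state.SeedEngine.Seed,
		Salt:           state.SeedEngine.Salt,
		LatestManifest: state.ManifestBytes,
		// CA keys and certificates are filled in analogously.
	}, nil
}
```
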
After receiving the `RecoverResponse`, the client Coordinator:

1. Sets up a temporary `SeedEngine` with the received parameters.
2. Verifies that the received manifest is the current latest.
   If not, it stops and enters recovery mode again.
3. Updates the state with the current manifest and the temporary seed engine.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.

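The client-side loop could look like the following sketch; the DNS lookup uses the standard `net` resolver, while `callPeerRecover` and `adoptState` are hypothetical helpers wrapping the RPC and the state-update steps above:

```golang
import (
	"context"
	"errors"
	"fmt"
	"net"
)

func (c *Coordinator) recoverFromPeers(ctx context.Context) error {
	// The headless service resolves to one address per ready Coordinator.
	addrs, err := net.DefaultResolver.LookupIPAddr(ctx, "coordinator-peers")
	if err != nil {
		return fmt.Errorf("discovering peers: %w", err)
	}
	for _, addr := range addrs {
		resp, err := c.callPeerRecover(ctx, addr.String()) // meshapi.Recover
		if err != nil {
			continue // try the next ready peer
		}
		if err := c.adoptState(resp); err != nil {
			continue // for example, the received manifest isn't the latest
		}
		return nil // recovered; leave the peer recovery process
	}
	return errors.New("no peer could recover this Coordinator")
}
```
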
## Open issues

* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered coordinator exists. Fix candidates:
  * Store the mesh cert alongside the manifest.
  * Sign the latest transition with the mesh key (and re-sign on `userapi.Recover`).

## Alternatives considered

* TODO(burgerdev): different service structure
* TODO(burgerdev): push vs pull
* TODO(burgerdev): active mesh CA key distribution
* TODO(burgerdev): nested CAs, reuse `NewMeshCert`