Cosmos Operator Architecture

This is a high-level overview of the architecture of the Cosmos Operator. It is intended to be a reference for developers.

Overview

The operator was written with the kubebuilder framework.

Kubebuilder simplifies and provides abstractions for creating a Kubernetes controller.

In a nutshell, an operator watches the custom resources defined by a CRD. Its job is to bring cluster state in line with the desired state declared in those resources. It continually watches for changes and updates the cluster accordingly, which is the "control loop" pattern.

Each controller implements a Reconcile method:

Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)

Unlike the "built-in" controllers for resources like Deployments or StatefulSets, operator controllers are visible in the cluster: they run in one pod backed by a Deployment in the cosmos-operator-system namespace.

A controller can watch resources outside of the CRD it manages. For example, CosmosFullNode watches for pod deletions, so it can spin up new pods if a user deletes one manually.
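As a rough illustration of the control loop, the sketch below uses only the standard library. `Request` and `Result` are simplified stand-ins for `ctrl.Request` and `ctrl.Result`, and `fakeCluster` is a hypothetical in-memory substitute for the Kubernetes API. None of this is the operator's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Request and Result are simplified stand-ins for ctrl.Request and ctrl.Result.
type Request struct{ Namespace, Name string }
type Result struct{ RequeueAfter time.Duration }

// fakeCluster is a hypothetical in-memory substitute for the Kubernetes API.
type fakeCluster struct {
	replicas map[string]int             // desired replica count per custom resource
	pods     map[string]map[string]bool // pod names that currently exist
}

// Reconcile creates every pod the desired state calls for but the cluster lacks.
// It reads only current state, so running it twice in a row changes nothing.
func (c *fakeCluster) Reconcile(ctx context.Context, req Request) (Result, error) {
	want, ok := c.replicas[req.Name]
	if !ok {
		return Result{}, nil // the custom resource is gone; nothing to do
	}
	if c.pods[req.Name] == nil {
		c.pods[req.Name] = map[string]bool{}
	}
	for i := 0; i < want; i++ {
		c.pods[req.Name][fmt.Sprintf("%s-%d", req.Name, i)] = true
	}
	// Ask to be called again later even if no watch event arrives.
	return Result{RequeueAfter: time.Minute}, nil
}

func main() {
	c := &fakeCluster{
		replicas: map[string]int{"cosmoshub": 3},
		pods:     map[string]map[string]bool{},
	}
	req := Request{Namespace: "default", Name: "cosmoshub"}
	ctx := context.Background()

	c.Reconcile(ctx, req)
	delete(c.pods["cosmoshub"], "cosmoshub-1") // a user deletes a pod by hand
	// A pod watch would enqueue the owning resource, which triggers this call.
	if _, err := c.Reconcile(ctx, req); err != nil {
		fmt.Println("reconcile failed:", err)
		return
	}
	fmt.Println("pods after reconcile:", len(c.pods["cosmoshub"]))
}
```

The function never inspects which event triggered it. It only compares desired and observed state, which is why a watch on pod deletions is enough to get the missing pod recreated.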

Each controller registers the resources it watches in this method:

SetupWithManager(ctx context.Context, mgr ctrl.Manager) error

Refer to kubebuilder docs for more info.

Makefile

Kubebuilder generated much of the Makefile. It contains common tasks for developers.

api directory

This directory contains the different CRDs.

Run make generate manifests each time you change a CRD.

A CI job should fail if you forget to run this command after modifying the api structs.

config directory

The config directory contains kustomize files generated by Kubebuilder. Strangelove uses these files to deploy the operator (instead of a Helm chart). A Helm chart is on the roadmap but presents challenges in keeping the kustomize and Helm code in sync.

controllers directory

The controllers directory contains every controller.

This directory is not unit tested. The code in controllers should act like main() functions: it mostly wires up dependencies from internal.

internal directory

Almost all the business logic lives in internal, which also houses the unit and integration tests.

CosmosFullNode

This is the flagship CRD of the Cosmos Operator and contains the most complexity.

Builder, Diff, and Control Pattern

Each resource has its own builder and controller (referred to as "control" in this context). For example, see pvc_builder.go and pvc_control.go, which manage only PVCs. All builders should have the file suffix _builder.go and all control objects the suffix _control.go.

The most complex builder is pod_builder.go. There may be opportunities to refactor it.

The "control" pattern was loosely inspired by Kubernetes source code.

Within the controller's Reconcile(...) method, the controller determines the order of operations of the separate Control objects.

On process start, each Control is initialized with a Diff and a Builder.

On each reconcile loop:

  1. The Builder builds the desired resources from the CRD.
  2. Control fetches a list of existing resources.
  3. Control uses Diff to compute a diff of the existing to the desired.
  4. Control makes changes based on what Diff reports.
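The four steps above can be sketched in plain Go. This is a simplified illustration, not the operator's generic `Diff[T client.Object]`: the `Resource` type, `NewDiff`, and `build` are hypothetical names, and the revision is a plain string field rather than an annotation.

```go
package main

import "fmt"

// Resource is a hypothetical stand-in for a Kubernetes object such as a PVC.
type Resource struct {
	Name     string
	Revision string // stored as an annotation in the real operator
}

// Diff holds the outcome of comparing existing resources to desired ones.
type Diff struct {
	Creates, Updates, Deletes []Resource
}

// NewDiff matches resources by name, then compares revisions.
func NewDiff(existing, desired []Resource) Diff {
	current := make(map[string]Resource, len(existing))
	for _, r := range existing {
		current[r.Name] = r
	}
	var d Diff
	wanted := make(map[string]bool, len(desired))
	for _, r := range desired {
		wanted[r.Name] = true
		old, found := current[r.Name]
		switch {
		case !found:
			d.Creates = append(d.Creates, r)
		case old.Revision != r.Revision:
			d.Updates = append(d.Updates, r)
		}
	}
	for _, r := range existing {
		if !wanted[r.Name] {
			d.Deletes = append(d.Deletes, r)
		}
	}
	return d
}

// build plays the Builder role: it derives desired resources from the spec.
func build(name string, replicas int, revision string) []Resource {
	out := make([]Resource, 0, replicas)
	for i := 0; i < replicas; i++ {
		out = append(out, Resource{Name: fmt.Sprintf("pvc-%s-%d", name, i), Revision: revision})
	}
	return out
}

func main() {
	// Step 1: build the desired resources.
	desired := build("cosmoshub", 3, "rev2")
	// Step 2: list existing resources (normally an API call).
	existing := []Resource{
		{Name: "pvc-cosmoshub-0", Revision: "rev2"},
		{Name: "pvc-cosmoshub-1", Revision: "rev1"},
		{Name: "pvc-cosmoshub-9", Revision: "rev1"},
	}
	// Step 3: compute the diff.
	d := NewDiff(existing, desired)
	// Step 4: Control would now issue Create, Update, and Delete calls.
	fmt.Println(len(d.Creates), len(d.Updates), len(d.Deletes)) // prints "1 1 1"
}
```

Keeping the diff a pure function of two lists is what lets the Control tests mock only the API calls while exercising the real Builder and Diff.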

The Control tests are integration tests where we mock out the Kubernetes API, but not the Builder or Diff. The tests run quickly (like unit tests) because we do not make any network calls.

The Diff object (type Diff[T client.Object] struct) took several iterations to get right. There is probably little need to tweak it further.

The hardest problem with diffing is detecting updates. Essentially, Diff looks for a Revision() string method on the resource and stores its value in a revision annotation. The revision is a simple FNV hash. Diff compares the desired resource's Revision to the annotation on the existing resource. If they differ, we know it's an update. We cannot compare existing and desired resources for equality directly because Kubernetes adds its own annotations and fields.

Builders return a diff.Resource[T] which Diff can use. Therefore, Control does not need to adapt resources.

The FNV hash is computed from a resource's JSON representation, which has proven to be stable.
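A minimal sketch of such a revision hash follows. It assumes the 32-bit FNV-1a variant and a hex encoding, both chosen here only for illustration. `podSpec`, the `revision` helper, the image name, and the annotation keys are hypothetical.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// podSpec is a hypothetical, trimmed-down resource used only for this sketch.
type podSpec struct {
	Name  string            `json:"name"`
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
}

// revision hashes the JSON encoding of v with 32-bit FNV-1a.
// encoding/json emits struct fields in declaration order and sorts map keys,
// so equal values always produce the same bytes and therefore the same hash.
func revision(v any) (string, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(raw) // hash.Hash.Write never returns an error
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	desired := podSpec{Name: "cosmoshub-0", Image: "example/gaia:v2"}
	rev, err := revision(desired)
	if err != nil {
		fmt.Println("hashing failed:", err)
		return
	}
	// Annotations on the live object, including one the cluster added itself.
	existing := map[string]string{
		"example.com/revision":        "0badc0de",
		"example.com/added-by-server": "true",
	}
	// Only the revision annotation is compared; extra annotations are ignored.
	fmt.Println("needs update:", existing["example.com/revision"] != rev)
}
```

Because only the stored hash is compared, annotations and fields that Kubernetes adds to the live object never register as a change.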

Special Note on Updating Status

There are several controllers that update a CosmosFullNode's status subresource:

  • CosmosFullNode
  • ScheduledVolumeSnapshot
  • SelfHealing

Each update to the status subresource triggers another reconcile loop. We found that multiple controllers updating the status caused race conditions: updates were either not applied or applied incorrectly. Some controllers read the status to take action, so it's important to preserve the integrity of the status.

Therefore, you must use the special SyncUpdate(...) method from fullnode.StatusClient. It ensures updates are performed serially per CosmosFullNode.
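The idea of serializing updates per resource can be sketched with one mutex per key. This is an assumption about the general technique, not the actual fullnode.StatusClient, which works against the Kubernetes API. The `Status` fields, the `entry` type, and `updateConcurrently` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a hypothetical, simplified status object.
type Status struct {
	Height    int
	Snapshots int
}

// entry pairs one resource's status with the lock that guards it.
type entry struct {
	mu     sync.Mutex
	status Status
}

// StatusClient serializes status updates per key.
type StatusClient struct {
	mu      sync.Mutex // guards the entries map only
	entries map[string]*entry
}

func NewStatusClient() *StatusClient {
	return &StatusClient{entries: make(map[string]*entry)}
}

// entryFor returns the entry for key, creating it on first use.
func (c *StatusClient) entryFor(key string) *entry {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	return e
}

// SyncUpdate applies update to the current status while holding that key's
// lock, so concurrent callers for the same key run one after another.
func (c *StatusClient) SyncUpdate(key string, update func(*Status)) {
	e := c.entryFor(key)
	e.mu.Lock()
	defer e.mu.Unlock()
	update(&e.status)
}

// Get returns a copy of the status for key.
func (c *StatusClient) Get(key string) Status {
	e := c.entryFor(key)
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.status
}

// updateConcurrently simulates n controllers updating the same status at once.
func updateConcurrently(c *StatusClient, key string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.SyncUpdate(key, func(s *Status) { s.Height++ })
		}()
	}
	wg.Wait()
}

func main() {
	c := NewStatusClient()
	updateConcurrently(c, "default/cosmoshub", 100)
	fmt.Println(c.Get("default/cosmoshub").Height) // prints "100"
}
```

Passing a mutation function instead of a full status object means each caller modifies the latest state rather than overwriting it with a stale copy, and locking per key keeps unrelated resources from blocking each other.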

Sentries

Sentries are special because you should not include a readiness probe due to the way Tendermint/Comet remote signing works.

The remote signer reaches out to the sentry on the privval port. This is the inverse of what you might expect, which is the sentry reaching out to the remote signer.

If the sentry does not detect a remote signer connection, it crashes. The stable way to connect to a pod is through a Kube Service. So we have a chicken-and-egg problem: the sentry must be "ready" to be added to the Service, but the remote signer must connect to the sentry through the Service so that the sentry doesn't crash.

Therefore, the CosmosFullNode controller inspects Tendermint/Comet status as part of its rolling update strategy, not just pod readiness state.
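One way such an inspection could look is sketched below. It assumes the decision is based on the `sync_info` section of the Comet RPC `/status` response, where heights are encoded as JSON strings. The `inSync` and `canUpdateNext` helpers and the max-unavailable budget are hypothetical, not the operator's actual rollout logic.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusResponse models only the fields of the /status response used here.
type statusResponse struct {
	Result struct {
		SyncInfo struct {
			LatestBlockHeight string `json:"latest_block_height"`
			CatchingUp        bool   `json:"catching_up"`
		} `json:"sync_info"`
	} `json:"result"`
}

// inSync reports whether a node has produced or synced at least one block
// and is no longer catching up.
func inSync(body []byte) (bool, error) {
	var resp statusResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode status: %w", err)
	}
	height, err := strconv.ParseInt(resp.Result.SyncInfo.LatestBlockHeight, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse height: %w", err)
	}
	return height > 0 && !resp.Result.SyncInfo.CatchingUp, nil
}

// canUpdateNext reports whether one more pod may be taken down without
// exceeding the (hypothetical) budget of unavailable pods.
func canUpdateNext(inSyncCount, total, maxUnavailable int) bool {
	return total-inSyncCount < maxUnavailable
}

func main() {
	body := []byte(`{"result":{"sync_info":{"latest_block_height":"1200","catching_up":false}}}`)
	ok, err := inSync(body)
	if err != nil {
		fmt.Println("status check failed:", err)
		return
	}
	fmt.Println("in sync:", ok, "can update next:", canUpdateNext(3, 3, 1))
}
```

A check like this works for sentries even though they have no readiness probe, because it asks the node itself rather than relying on the pod's ready condition.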

CacheController

The CacheController is special in that it does not manage a CRD.

It periodically polls every pod for its Tendermint/Comet status, such as block height. The polling is done in the background. It is a controller because it relies on the reconcile loop to keep its list of pods to poll up to date.

The CacheController prevents slow reconcile loops. Previously, we queried this status on every reconcile loop.

When other controllers want Comet status, they always read it from the CacheController.
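The caching idea can be sketched as follows. This is a simplified model with hypothetical names (`StatusCache`, `SetPods`, `Refresh`, `Get`). It assumes a background goroutine would call `Refresh` on a timer, and it injects the fetch function so no network access is needed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// CometStatus is a hypothetical subset of what a pod reports.
type CometStatus struct {
	Height     int64
	CatchingUp bool
}

// StatusCache stores the most recent status per pod.
type StatusCache struct {
	mu       sync.RWMutex
	pods     map[string]bool
	statuses map[string]CometStatus
	fetch    func(pod string) (CometStatus, error) // e.g. an RPC call
}

func NewStatusCache(fetch func(pod string) (CometStatus, error)) *StatusCache {
	return &StatusCache{
		pods:     map[string]bool{},
		statuses: map[string]CometStatus{},
		fetch:    fetch,
	}
}

// SetPods is what a reconcile loop would call to update which pods get polled.
// Cached entries for pods that are no longer tracked are dropped.
func (c *StatusCache) SetPods(names []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods = make(map[string]bool, len(names))
	for _, n := range names {
		c.pods[n] = true
	}
	for pod := range c.statuses {
		if !c.pods[pod] {
			delete(c.statuses, pod)
		}
	}
}

// Refresh polls every tracked pod once.
func (c *StatusCache) Refresh() {
	c.mu.RLock()
	names := make([]string, 0, len(c.pods))
	for pod := range c.pods {
		names = append(names, pod)
	}
	c.mu.RUnlock()

	for _, pod := range names {
		st, err := c.fetch(pod) // the slow call happens outside the lock
		c.mu.Lock()
		switch {
		case err != nil:
			delete(c.statuses, pod) // do not serve stale data for a failing pod
		case c.pods[pod]: // skip pods removed while the fetch was in flight
			c.statuses[pod] = st
		}
		c.mu.Unlock()
	}
}

// Get reads from the cache only; it never triggers a fetch.
func (c *StatusCache) Get(pod string) (CometStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	st, ok := c.statuses[pod]
	return st, ok
}

func main() {
	cache := NewStatusCache(func(pod string) (CometStatus, error) {
		if pod == "cosmoshub-1" {
			return CometStatus{}, errors.New("connection refused")
		}
		return CometStatus{Height: 1200}, nil
	})
	cache.SetPods([]string{"cosmoshub-0", "cosmoshub-1"})
	cache.Refresh()

	st, ok := cache.Get("cosmoshub-0")
	fmt.Println(st.Height, ok) // prints "1200 true"
	_, ok = cache.Get("cosmoshub-1")
	fmt.Println(ok) // prints "false"
}
```

Readers pay only for a map lookup, so the cost of a slow or unreachable pod is confined to the background refresh instead of stalling a reconcile loop.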

Scheduled Volume Snapshot

Scheduled Volume Snapshot takes periodic backups.

To preserve data integrity, it temporarily deletes a pod so that it can capture a PVC snapshot while no process is writing to the volume.

It uses a finite state machine pattern in the main reconcile loop.
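A finite state machine of this kind might look like the sketch below. The phase names, the `observation` fields, and the transition rules are hypothetical and chosen only to mirror the description above. They are not the controller's actual phases.

```go
package main

import "fmt"

// Phase is the current state of one snapshot cycle.
type Phase string

const (
	PhaseWaitingForNext   Phase = "WaitingForNext"
	PhaseDeletingPod      Phase = "DeletingPod"
	PhaseCreatingSnapshot Phase = "CreatingSnapshot"
	PhaseRestoringPod     Phase = "RestoringPod"
)

// observation is what a reconcile pass would learn from the cluster.
type observation struct {
	Due           bool // the schedule says a snapshot should start
	PodGone       bool // the candidate pod has been deleted
	SnapshotReady bool // the snapshot is ready to use
	PodRunning    bool // the pod is back
}

// next advances at most one transition per reconcile pass. If the condition
// for leaving the current phase is not met, the phase is unchanged and the
// controller simply tries again on the next pass.
func next(p Phase, o observation) Phase {
	switch p {
	case PhaseWaitingForNext:
		if o.Due {
			return PhaseDeletingPod
		}
	case PhaseDeletingPod:
		if o.PodGone {
			return PhaseCreatingSnapshot
		}
	case PhaseCreatingSnapshot:
		if o.SnapshotReady {
			return PhaseRestoringPod
		}
	case PhaseRestoringPod:
		if o.PodRunning {
			return PhaseWaitingForNext
		}
	}
	return p
}

func main() {
	p := PhaseWaitingForNext
	passes := []observation{
		{Due: true},
		{}, // pod deletion still in progress
		{PodGone: true},
		{PodGone: true, SnapshotReady: true},
		{PodRunning: true},
	}
	for _, o := range passes {
		p = next(p, o)
		fmt.Println(p)
	}
}
```

Persisting the phase (for example in the resource's status) lets each reconcile pass resume where the previous one stopped, which suits multi-step work that spans many passes.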

StatefulJob

StatefulJob runs a job on a fixed interval (crontab syntax is not supported yet). Its purpose is to run a job that attaches to a PVC created from a VolumeSnapshot.

It's the least developed of the CRDs.