From 4cb6e5cd5cdf92225c2c45770c790e3df090005c Mon Sep 17 00:00:00 2001
From: Ishan Tyagi <42602577+ishan16696@users.noreply.github.com>
Date: Tue, 7 May 2024 12:41:50 +0530
Subject: [PATCH] DEP: design doc for operator out-of-band tasks. (#757)

* DEP: Design doc for operator out-of-band tasks.

* Removed some validations.

* Add author name.

* Add meta in task config.

* restructuring changes to the proposal

* Address restructuring document request.

* Address review comments.

* Update docs/proposals/05-etcd-operator-tasks.md

Co-authored-by: Ashwani Kumar

* Update docs/proposals/05-etcd-operator-tasks.md

Co-authored-by: Anvesh Reddy Pinnapureddy

* addressed review comments

* Apply suggestions from code review

Co-authored-by: Shreyas Rao <42259948+shreyas-s-rao@users.noreply.github.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Shreyas Rao <42259948+shreyas-s-rao@users.noreply.github.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Shreyas Rao <42259948+shreyas-s-rao@users.noreply.github.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>

* addressed review comments

* Address review comments.

* Address review comments 2.

* Address review comments 3.

* adderessed some review comments

* Addressed review comments.

* Fixed some typos.

* Addressed minor review comments.

* Addressed minor review comments.

* Added kind,apiversion in .spec.config of task

* Addressed review comments.

* Addressed nits..

* Addressed review comments.

* Addressed review comments.

---------

Co-authored-by: Madhav Bhargava
Co-authored-by: Ashwani Kumar
Co-authored-by: Anvesh Reddy Pinnapureddy
Co-authored-by: Shreyas Rao <42259948+shreyas-s-rao@users.noreply.github.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
---
 docs/proposals/05-etcd-operator-tasks.md | 422 +++++++++++++++++++++++
 1 file changed, 422 insertions(+)
 create mode 100644 docs/proposals/05-etcd-operator-tasks.md

diff --git a/docs/proposals/05-etcd-operator-tasks.md b/docs/proposals/05-etcd-operator-tasks.md
new file mode 100644
index 000000000..78e448e13
--- /dev/null
+++ b/docs/proposals/05-etcd-operator-tasks.md
@@ -0,0 +1,422 @@
+---
+title: operator out-of-band tasks
+dep-number: 05
+creation-date: 6th Dec'2023
+status: implementable
+authors:
+- "@ishan16696"
+- "@unmarshall"
+- "@seshachalam-yv"
+reviewers:
+- "etcd-druid-maintainers"
+---
+
+# DEP-05: Operator Out-of-band Tasks
+
+## Table of Contents
+
+* [DEP-05: Operator Out-of-band Tasks](#dep-05-operator-out-of-band-tasks)
+  * [Table of Contents](#table-of-contents)
+  * [Summary](#summary)
+  * [Terminology](#terminology)
+  * [Motivation](#motivation)
+  * [Goals](#goals)
+  * [Non-Goals](#non-goals)
+  * [Proposal](#proposal)
+  * [Custom Resource Golang API](#custom-resource-golang-api)
+  * [Spec](#spec)
+  * [Status](#status)
+  * [Custom Resource YAML API](#custom-resource-yaml-api)
+  * [Lifecycle](#lifecycle)
+  * [Creation](#creation)
+  * [Execution](#execution)
+  * [Deletion](#deletion)
+  * [Use cases](#use-cases)
+  * [Recovery from permanent quorum loss](#recovery-from-permanent-quorum-loss)
+  * [Trigger on-demand snapshot compaction](#trigger-on-demand-snapshot-compaction)
+  * [Trigger on-demand full/delta snapshot](#trigger-on-demand-fulldelta-snapshot)
+  * [Trigger on-demand maintenance of etcd cluster](#trigger-on-demand-maintenance-of-etcd-cluster)
+  * [Copy Backups Task](#copy-backups-task)
+  * [Metrics](#metrics)
+
+## Summary
+
+This DEP proposes an enhancement to `etcd-druid`'s capabilities to handle [out-of-band](#terminology) tasks, which are presently performed manually or invoked programmatically via suboptimal APIs. The document proposes a unified interface, defined by a well-structured API, that harmonizes the initiation of any `out-of-band` task, allows its status to be monitored, and simplifies adding new tasks and managing their lifecycles.
+
+## Terminology
+
+* **etcd-druid:** [etcd-druid](https://github.com/gardener/etcd-druid) is an operator that manages etcd clusters.
+
+* **backup-sidecar:** The etcd-backup-restore sidecar container running in each etcd-member pod of an etcd cluster.
+
+* **leading-backup-sidecar:** A backup-sidecar that is associated with the etcd leader of an etcd cluster.
+
+* **out-of-band task:** Any on-demand task/operation that can be executed on an etcd cluster without modifying the [Etcd custom resource spec](https://github.com/gardener/etcd-druid/blob/9c5f8254e3aeb24c1e3e88d17d8d1de336ce981b/api/v1alpha1/types_etcd.go#L272-L273) (desired state).
+
+## Motivation
+
+Today, [etcd-druid](https://github.com/gardener/etcd-druid) mainly acts as an etcd cluster provisioner (creation, maintenance and deletion). In the future, the capabilities of etcd-druid will be enhanced via the [etcd-member](https://github.com/gardener/etcd-druid/blob/8ac70d512969c2e12e666d923d7d35fdab1e0f8e/docs/proposals/04-etcd-member-custom-resource.md) proposal, which provides it with access to much more detailed information about each etcd cluster member. While we enhance the reconciliation and monitoring capabilities of etcd-druid, it still lacks the ability to allow users to invoke `out-of-band` tasks on an existing etcd cluster.
+
+Operating etcd clusters at scale has yielded new learnings. In particular, we regularly need to trigger `out-of-band` tasks that fall outside the purview of a regular etcd reconciliation run. Many of these tasks are multi-step processes, and performing them manually is error-prone, even if an operator follows a well-written step-by-step guide. Thus, there is a need to automate these tasks.
+Some examples of `on-demand/out-of-band` tasks:
+
+* Recover from a permanent quorum loss of an etcd cluster.
+* Trigger an on-demand full/delta snapshot.
+* Trigger an on-demand snapshot compaction.
+* Trigger an on-demand maintenance of an etcd cluster.
+* Copy the backups from one object store to another object store.
+
+## Goals
+
+* Establish a unified interface for operator tasks by defining a single dedicated custom resource for `out-of-band` tasks.
+* Define a contract (in terms of prerequisites) that any task implementation must adhere to.
+* Facilitate the easy addition of new `out-of-band` task(s) through this custom resource.
+* Provide CLI capabilities to operators, making it easy to invoke supported `out-of-band` tasks.
+
+## Non-Goals
+
+* In the current scope, the capability to abort or suspend an `out-of-band` task will not be provided. This could be considered as an enhancement based on demand.
+* Ordering (by establishing dependencies) of `out-of-band` tasks submitted for the same etcd cluster has not been considered in the first increment. In a future version, based on how operator tasks are used, we will enhance this proposal and its implementation.
+
+## Proposal
+
+The authors propose creating a single, dedicated custom resource to represent an `out-of-band` task. Etcd-druid will be enhanced to process the task requests and update the task status, which can then be tracked/observed.
+
+### Custom Resource Golang API
+
+`EtcdOperatorTask` is the new custom resource that will be introduced. This API will be in version `v1alpha1` and will be subject to change. We will respect the [Kubernetes Deprecation Policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/).
+
+```go
+// EtcdOperatorTask represents an out-of-band operator task resource.
+type EtcdOperatorTask struct {
+  metav1.TypeMeta
+  metav1.ObjectMeta
+
+  // Spec is the specification of the EtcdOperatorTask resource.
+  Spec EtcdOperatorTaskSpec `json:"spec"`
+  // Status is the most recently observed status of the EtcdOperatorTask resource.
+  Status EtcdOperatorTaskStatus `json:"status,omitempty"`
+}
+```
+
+#### Spec
+
+The authors propose that the following fields should be specified in the spec (desired state) of the `EtcdOperatorTask` custom resource.
+
+* To capture the type of `out-of-band` operator task to be performed, a `.spec.type` field should be defined. It can take the name of any supported `out-of-band` task, e.g. "OnDemandSnapshotTask", "QuorumLossRecoveryTask" etc.
+* To capture the configuration specific to each task, a `.spec.config` field of type `string` should be defined, since each task can have a different input configuration.
+
+```go
+// EtcdOperatorTaskSpec is the spec for an EtcdOperatorTask resource.
+type EtcdOperatorTaskSpec struct {
+
+  // Type specifies the type of out-of-band operator task to be performed.
+  Type string `json:"type"`
+
+  // Config is a task-specific configuration.
+  Config string `json:"config,omitempty"`
+
+  // TTLSecondsAfterFinished is the time-to-live to garbage collect the
+  // related resource(s) of task once it has been completed.
+  // +optional
+  TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
+
+  // OwnerEtcdReference refers to the name and namespace of the corresponding
+  // Etcd owner for which the task has been invoked.
+  OwnerEtcdRefrence types.NamespacedName `json:"ownerEtcdRefrence"`
+}
+```
+
+#### Status
+
+The authors propose the following fields for the Status (current state) of the `EtcdOperatorTask` custom resource to monitor the progress of the task.
+
+```go
+// EtcdOperatorTaskStatus is the status for an EtcdOperatorTask resource.
+type EtcdOperatorTaskStatus struct {
+  // ObservedGeneration is the most recent generation observed for the resource.
+  ObservedGeneration *int64 `json:"observedGeneration,omitempty"`
+  // State is the last known state of the task.
+  State TaskState `json:"state"`
+  // InitiatedAt is the time at which the task moved from the "Pending" state to any other state.
+  InitiatedAt metav1.Time `json:"initiatedAt"`
+  // LastErrors represents the errors encountered while processing the task.
+  // +optional
+  LastErrors []LastError `json:"lastErrors,omitempty"`
+  // LastOperation captures the status of the last operation if the task involves multiple stages.
+  // +optional
+  LastOperation *LastOperation `json:"lastOperation,omitempty"`
+}
+
+type LastOperation struct {
+  // Name is the name of the last operation.
+  Name opsName `json:"name"`
+  // State is the state of the last operation, one of Pending, InProgress, Completed, Failed.
+  State OperationState `json:"state"`
+  // LastTransitionTime is the time at which the operation state last transitioned from one state to another.
+  LastTransitionTime metav1.Time `json:"lastTransitionTime"`
+  // Reason is a human-readable message indicating details about the last operation.
+  Reason string `json:"reason"`
+}
+
+// LastError stores details of the most recent error encountered for the task.
+type LastError struct {
+  // Code is an error code that uniquely identifies an error.
+  Code ErrorCode `json:"code"`
+  // Description is a human-readable message indicating details of the error.
+  Description string `json:"description"`
+  // ObservedAt is the time at which the error was observed.
+  ObservedAt metav1.Time `json:"observedAt"`
+}
+
+// TaskState represents the state of the task.
+type TaskState string
+
+const (
+  TaskStateFailed     TaskState = "Failed"
+  TaskStatePending    TaskState = "Pending"
+  TaskStateRejected   TaskState = "Rejected"
+  TaskStateSucceeded  TaskState = "Succeeded"
+  TaskStateInProgress TaskState = "InProgress"
+)
+
+// OperationState represents the state of the last operation.
+type OperationState string
+
+const (
+  OperationStateFailed     OperationState = "Failed"
+  OperationStatePending    OperationState = "Pending"
+  OperationStateCompleted  OperationState = "Completed"
+  OperationStateInProgress OperationState = "InProgress"
+)
+```
+
+### Custom Resource YAML API
+
+```yaml
+apiVersion: druid.gardener.cloud/v1alpha1
+kind: EtcdOperatorTask
+metadata:
+  name:
+  namespace:
+  generation:
+spec:
+  type:
+  ttlSecondsAfterFinished:
+  config:
+  ownerEtcdRefrence:
+status:
+  observedGeneration:
+  state:
+  initiatedAt: