diff --git a/multi-cluster-orchestrator/README.md b/multi-cluster-orchestrator/README.md index 2af0ae7..c17d912 100644 --- a/multi-cluster-orchestrator/README.md +++ b/multi-cluster-orchestrator/README.md @@ -2,9 +2,11 @@ ## Project Status -**Release Target:** The project is starting a public preview in mid-late April -2025. Partners interested in collaborting can email mco-interest@google.com for -more information. +**Release Target:** The project is currently available as an early public +preview. + +Anyone interested in collaborating can email mco-interest@google.com for more +information. An initial open source beta release will follow later in 2025. @@ -25,40 +27,9 @@ out of available GPUs at times. Multi-Cluster Orchestrator can help mitigate this scenario by automatically scaling the workload out to another cluster in region which still has available resources then scaling it back in later. +See the [overview page](docs/overview.md) for more details about the project and +its design. + ## Samples - [Hello World (deploy on Google Cloud using Terraform and Argo CD](./samples/hello_world/README.md) - -## Design - -The Multi-Cluster Orchestrator controller runs in a hub cluster and provides an -API allowing the user to define how their workload should be scheduled and -scaled across clusters. - -Based on these parameters the system continuously monitors metrics specified by -the user and dynamically makes decisions about which clusters should be used to -run the workload at a given time. - -The system builds on the [Cluster Inventory -API](https://github.com/kubernetes-sigs/cluster-inventory-api?tab=readme-ov-file#cluster-inventory-api) -developed by the Kubernetes [SIG -Multicluster](https://multicluster.sigs.k8s.io/), with the [ClusterProfile -API](https://github.com/kubernetes/enhancements/blob/master/keps/sig-multicluster/4322-cluster-inventory/README.md) -defining the available clusters to schedule workloads on. The `ClusterProfiles` -may be provisioned manually by the user, or automatically using a plugin to sync -them from an external source-of-truth such as a cloud provider API. - -While Multi-Cluster Orchestrator determines which clusters should run the -workload at a given time, the actual delivery of workloads to clusters is -decoupled from the core Multi-Cluster Orchestrator controller. This makes it -possible for users to integrate their preferred existing systems for application -delivery. For example, Multi-Cluster Orchestrator can be integrated with [Argo -CD](https://argo-cd.readthedocs.io) via an -[ApplicationSet](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/) -[generator -plugin](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Plugin/). - -The system can be used in tandem with multi-cluster load balancers such as -Google Cloud's [Multi-Cluster -Gateway](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-multi-cluster-gateways) -to dynamically route incoming traffic to the various clusters. 
diff --git a/multi-cluster-orchestrator/docs/images/mco_architecture.svg b/multi-cluster-orchestrator/docs/images/mco_architecture.svg new file mode 100644 index 0000000..7eebcb0 --- /dev/null +++ b/multi-cluster-orchestrator/docs/images/mco_architecture.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/multi-cluster-orchestrator/docs/images/mco_cluster_state_machine.svg b/multi-cluster-orchestrator/docs/images/mco_cluster_state_machine.svg new file mode 100644 index 0000000..149e973 --- /dev/null +++ b/multi-cluster-orchestrator/docs/images/mco_cluster_state_machine.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/multi-cluster-orchestrator/docs/overview.md b/multi-cluster-orchestrator/docs/overview.md new file mode 100644 index 0000000..ea59593 --- /dev/null +++ b/multi-cluster-orchestrator/docs/overview.md @@ -0,0 +1,266 @@ +# Overview of Multi-Cluster Orchestrator + +Multi-Cluster Orchestrator (MCO) dynamically schedules an application’s +multi-cluster/multi-region deployment, allowing the application to automatically +scale out to more clusters or scale in to fewer clusters as needed. + +MCO schedules an application workload onto clusters based on: + +* The preferences specified by the user, such as which clusters are eligible + to run the workload and which clusters/regions are more desirable to use +* Information about cluster/region capacity, such as which clusters/regions + currently have certain machine types available +* Current load on the application + +These scheduling decisions are consumed by the CD tool (a.k.a. workload delivery +system), which provisions the workload onto clusters or removes the workload from +clusters as required. + +MCO requires a [hub cluster](https://github.com/kubernetes/community/pull/8210) +which contains metadata about the other clusters in the fleet (in the form of +[ClusterProfile](https://multicluster.sigs.k8s.io/concepts/cluster-profile-api/) resources). The hub cluster is used to run multicluster +controllers that manage the rest of the fleet, such as MCO itself or CD tools +like Argo CD. It is recommended not to run normal application workloads on +the hub cluster. + +The application workload is deployed to other clusters in the fleet. MCO can be +used along with a multi-cluster load balancer such as a Google Cloud +[multi-cluster +Gateway](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-multi-cluster-gateways) +to dynamically route incoming traffic to the various workload +clusters. + +# Architecture + +## Design Principles + +Other multi-cluster scheduling efforts have tried to do a lot of things all at +once. MCO intentionally avoids this and instead focuses on solving +the relatively narrow problem of selecting *which* clusters should be running +the workload at a given point in time (including scaling to more/fewer +clusters). + +Closely related to this, one of the core design principles of MCO is that the +MCO controller does not interact directly with the clusters that it is managing +except for consuming metrics which they publish. This means that MCO does not +directly provision workloads on clusters or scale the size of the workload on +each cluster. + +Instead, the delivery of workloads to clusters is decoupled from the core of +MCO. This allows workload delivery to be pluggable such that it can be +integrated/delegated to existing systems such as Argo CD or Flux (which the user +runs in the same hub cluster as the MCO controller).
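For reference, the per-cluster metadata in the hub cluster takes the form of ClusterProfile resources. The sketch below shows roughly what one looks like, based on the upstream ClusterProfile API; the namespace and name follow the examples used later in this document, and the `clusterManager` value is illustrative rather than prescribed by MCO:

```
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ClusterProfile
metadata:
  # The namespace/name pair is what the Placement rules shown later match against.
  namespace: fleet-cluster-inventory
  name: mco-worker-us-west1
spec:
  displayName: mco-worker-us-west1
  clusterManager:
    name: gke-fleet   # illustrative; set by whatever registers the cluster
```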
+ +Locally scaling the workload within each cluster is left to the standard +Kubernetes [HorizontalPodAutoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/). + +## Architecture Diagram + +This diagram shows the high-level architecture of Multi-Cluster Orchestrator +when running on Google Cloud. The various components are explained below. + +![Diagram of MCO Architecture](images/mco_architecture.svg) + +## Placement API + +The core of Multi-Cluster Orchestrator is the Placement API. The Placement API +is a CRD which defines a workload to be scheduled and autoscaled onto multiple +clusters. + +The spec of the Placement API defines which clusters are eligible to run the +workload and the autoscaling parameters for the workload (see below for +more details about autoscaling). + +The status of the Placement API contains the set of clusters where the workload +is currently scheduled. This is consumed by the workload delivery system. + +Here’s an example of an autoscaling Placement. The clusters that are eligible +to run the workload are defined by the rules section. We start with all of the +clusters for which ClusterProfiles exist, then filter them using a regex +matching the ClusterProfile’s namespace/name. + +``` +apiVersion: orchestra.multicluster.x-k8s.io/v1alpha1 +kind: MultiKubernetesClusterPlacement +metadata: + name: cluster-regex-placement +spec: + rules: + - type: "all-clusters" + - type: "cluster-name-regex" + arguments: + regex: "^fleet-cluster-inventory/mco-worker-.*" + scaling: + autoscaleForCapacity: + minClustersBelowCapacityCeiling: 1 + workloadDetails: + namespace: {APP_DEPLOYMENT_NAMESPACE} + deploymentName: {APP_DEPLOYMENT_NAME} + hpaName: {APP_HPA_NAME} +``` + +And here’s another example using a hardcoded list of three eligible clusters on +which the workload will be autoscaled. The workload will be scaled out to the +clusters in the order in which they are specified: + +``` +apiVersion: orchestra.multicluster.x-k8s.io/v1alpha1 +kind: MultiKubernetesClusterPlacement +metadata: + name: cluster-list-placement +spec: + rules: + - type: "cluster-list" + arguments: + clusters: "fleet-cluster-inventory/mco-worker-us-west1,fleet-cluster-inventory/mco-worker-us-east1,fleet-cluster-inventory/mco-worker-europe-west4" + scaling: + autoscaleForCapacity: + minClustersBelowCapacityCeiling: 1 + workloadDetails: + namespace: {APP_DEPLOYMENT_NAMESPACE} + deploymentName: {APP_DEPLOYMENT_NAME} + hpaName: {APP_HPA_NAME} +``` + +## MCO Controller + +The MCO controller reads the spec of the Placement, makes decisions about +scheduling and scaling the workload onto clusters, and updates the Placement’s +status. For a given workload, each cluster will transition through the following +state machine: + +![Diagram of the placement cluster state machine](images/mco_cluster_state_machine.svg) + +Initially, we have a set of eligible clusters as defined by the rules in the +Placement. The MCO Controller will start by scaling out the workload to the +first eligible cluster — adding that cluster to the Placement in the ACTIVE +state. + +If a cluster currently running the workload becomes ineligible, it transitions to +the EVICTING state and the workload is drained and removed from it. + +Part of the MCO Controller is the autoscaler, which makes scaling decisions based +on metrics from the clusters (see the Autoscaling section below for details).
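Throughout these transitions, the Placement status tracks the clusters currently selected for the workload. The snippet below is a hypothetical sketch of what that might look like, including each cluster's state; the field names are illustrative and not taken from the actual CRD, but they show the kind of information the controller maintains and a delivery tool consumes:

```
# Hypothetical status layout, for illustration only; the real CRD fields may differ.
status:
  clusters:
  - name: fleet-cluster-inventory/mco-worker-us-west1
    state: ACTIVE
  - name: fleet-cluster-inventory/mco-worker-us-east1
    state: EVICTING   # became ineligible; the workload is being drained
```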
+ +If the autoscaler decides that fewer clusters are needed, it moves one of the +ACTIVE clusters into the SCALING_IN state and the workload starts draining +from that cluster. Once the workload gets down to one remaining pod, it is +removed from the cluster entirely. + +If the autoscaler decides that more clusters are needed, it first checks if +there is a SCALING_IN cluster available and, if so, moves it back to ACTIVE. +Otherwise, it picks the next eligible cluster and adds it to the set of ACTIVE +clusters. + +Health checking is a planned feature but is not currently implemented. + +> [!NOTE] +> Proactively draining pods from clusters is coming soon but not currently +> available. This means that clusters will only be removed when they have +> naturally scaled down to very few pods (via their HPAs). + +## Workload Delivery + +As mentioned above, workload delivery is decoupled from the core MCO Controller, +which allows for multiple implementations. + +Each implementation should watch the status field of each Placement and ensure +that the workload is provisioned on clusters when they are added to the status +field and removed from clusters when they are dropped from it. + +The only workload delivery system that is currently available is an Argo CD +[ApplicationSet](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/) +[generator](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators/) +[plugin](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Generators-Plugin/) +which allows users to dynamically target Argo CD Applications to clusters. +Cluster profiles are automatically synced to Argo CD cluster secrets using the +[Argo CD ClusterProfile Syncer](https://github.com/GoogleCloudPlatform/gke-fleet-management/tree/main/argocd-clusterprofile-syncer). + +## Cluster Profiles and Syncer + +MCO uses +[ClusterProfile](https://github.com/kubernetes/enhancements/blob/master/keps/sig-multicluster/4322-cluster-inventory/README.md) +resources from the [SIG Multicluster](https://multicluster.sigs.k8s.io/) +[Cluster Inventory +API](https://github.com/kubernetes-sigs/cluster-inventory-api?tab=readme-ov-file#cluster-inventory-api) +to represent the clusters within the fleet. When running on Google Cloud, +cluster profiles for all clusters in the fleet can be [automatically generated +in the hub cluster by a syncer running in Google +Cloud](https://cloud.google.com/kubernetes-engine/fleet-management/docs/create-inventory). +When a cluster is added to or removed from the fleet, the list of cluster profiles +is updated automatically by the syncer. + +Using the syncer is optional; the cluster profiles could also be created and +managed manually. + +# Autoscaling + +The autoscaler is part of the MCO Controller. + +**In the current implementation, the autoscaler tries to maintain a target +number of clusters which are below their capacity ceiling — i.e. clusters on +which more pods can still be scheduled.** This will likely evolve in the future +to be more flexible/sophisticated. + +Metrics published by each cluster are used to determine which clusters are +currently at their capacity ceiling. To accomplish this, the autoscaler uses the following +query logic: + +1. Check if the cluster is at its capacity ceiling due to the current nodes + being at capacity and the cluster-autoscaler being unable to obtain more + nodes.
+ + To check this, we check whether: a) there are pending pods for the workload + and b) there have been `FailedScaleUp` events from the cluster-autoscaler (the + latter requires creating and using a log-based metric). + +2. Check if the workload’s HorizontalPodAutoscaler is unable to scale up + further due to reaching its configured maximum number of replicas. + + To check this, we check whether: a) the HPA’s number of replicas is equal to its + configured maximum and b) the HPA’s ScalingLimited condition is true. + +When the number of clusters below their capacity ceiling falls below the target, +the workload is scaled out to a new cluster. + +Currently there are several limitations related to autoscaling: + +* The queries are hardcoded, and the Placement contains parameters which are + used in the queries, such as the name and namespace of the Deployment for the + workload. The queries will be made user-configurable soon. +* Scaling in is currently implemented in a very naive way. Scaling in a + cluster is only triggered when the cluster’s HPA wants to scale the workload + down further than is possible due to the HPA’s configured minimum number of + replicas. This means that in practice clusters won’t be scaled in until + there are very few pods remaining. We plan to improve scale-in soon to make + it more responsive. +* The hardcoded queries currently require the user to set up a Google Cloud + log-based metric with a specific name and definition. Here’s an example of + how to set up the log-based metric with the gcloud CLI: + +``` +cat > /tmp/metric_definition.yaml << EOF +filter: jsonPayload.reason=FailedScaleUp AND jsonPayload.reportingComponent=cluster-autoscaler +labelExtractors: + cluster: EXTRACT(resource.labels.cluster_name) + location: EXTRACT(resource.labels.location) + namespace_name: EXTRACT(resource.labels.namespace_name) + pod_name: EXTRACT(resource.labels.pod_name) + project_id: EXTRACT(resource.labels.project_id) +metricDescriptor: + displayName: metric reflecting occurrences of FailedScaleUp + labels: + - key: project_id + - key: pod_name + - key: namespace_name + - key: cluster + - key: location + metricKind: DELTA + type: logging.googleapis.com/user/ClusterAutoscalerFailedScaleUp + unit: '1' + valueType: INT64 +EOF + +gcloud logging metrics create ClusterAutoscalerFailedScaleUp \ + --config-from-file=/tmp/metric_definition.yaml +```
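For completeness, the HPA referenced by `hpaName` in the Placement is a standard `autoscaling/v2` HorizontalPodAutoscaler. The sketch below uses illustrative names and values: check 2 above compares the HPA's current replica count against `maxReplicas` (together with its ScalingLimited condition), and scale-in of a cluster is only triggered once the HPA is pinned at `minReplicas`.

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa           # referenced as hpaName in the Placement
  namespace: my-app          # referenced as namespace in the Placement
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app             # referenced as deploymentName in the Placement
  minReplicas: 1             # cluster scale-in only triggers once the HPA is stuck at this floor
  maxReplicas: 10            # capacity check 2 fires when replicas reach this ceiling and ScalingLimited is true
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```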