Managed Infrastructure Maintenance Operator - Milestone 1 #3571

Draft
wants to merge 67 commits into
base: master
1cebc86
add MIMO to dev
hawkowl Feb 29, 2024
c733dd1
initial maint API
hawkowl Jul 15, 2024
733dabd
update mimo DB to have a fetch which isn't just pending
hawkowl Jul 18, 2024
0c97cf6
update db fakes for maintmanifests
hawkowl Jul 18, 2024
2b59864
mimo API conversions for admin API
hawkowl Jul 18, 2024
727ece2
query + tests for maintenancemanifests
hawkowl Jul 18, 2024
ca2c62f
update with get
hawkowl Jul 18, 2024
95e95d8
test fixes
hawkowl Jul 19, 2024
a55e649
cancellation tests and impl
hawkowl Jul 19, 2024
a75fcd5
renaming and tweaking
hawkowl Jul 22, 2024
c1743a9
static validation
hawkowl Jul 22, 2024
505a210
update frontend to have create
hawkowl Jul 22, 2024
ea27dd1
code for deleting
hawkowl Jul 22, 2024
488a7c7
add deleting endpoint
hawkowl Jul 22, 2024
41d72f7
move clusteroperator check code for reuse
hawkowl Jul 26, 2024
eaa0f41
MIMO task/set cleanups
hawkowl Jul 26, 2024
f1916f8
mimo error code
hawkowl Jul 26, 2024
48129d6
more work on sets
hawkowl Jul 26, 2024
37c2f7e
tls tasks work
hawkowl Jul 26, 2024
fb2301e
update for cleanups
hawkowl Aug 5, 2024
c05b871
move into the main CLI endpoint
hawkowl Aug 16, 2024
559caea
add a task for updating the operator flags, for testing
hawkowl Sep 17, 2024
2e81eac
add healthz endpoints for MIMO actuator
hawkowl Sep 18, 2024
ae325ad
makefile target for running actuator locally
hawkowl Sep 18, 2024
9075618
add mimo actuator steps in e2e helper
hawkowl Sep 18, 2024
90832a0
start mimo in e2e
hawkowl Sep 18, 2024
38c6f11
fix build
hawkowl Sep 18, 2024
4655b11
go generate
hawkowl Sep 18, 2024
721d8f3
lint
hawkowl Sep 18, 2024
34f5839
updates for basic mimo e2e
hawkowl Sep 19, 2024
a3b1356
e2e testing
hawkowl Sep 19, 2024
72baacf
try and see what e2e is breaking with
hawkowl Sep 23, 2024
1045abf
initial doc frame
hawkowl Sep 24, 2024
4cd13e3
ARO-9263: Add ACR Token expiry
edisonLcardenas Aug 19, 2024
58ed3db
ARO-9263: Add ACR Token Expiry Checker
edisonLcardenas Aug 20, 2024
ac096bc
ARO-9263: Add unit test for checker
edisonLcardenas Aug 21, 2024
d7cc1bb
ARE-9263: Rename function
edisonLcardenas Aug 22, 2024
407b1ba
ARO-9263: Restore missing "ProvisioningStateMaintenance"
edisonLcardenas Aug 22, 2024
083d4f5
ARO-9263: Fix import CI check failures
edisonLcardenas Aug 26, 2024
3bf3b4d
ARO-9263: Refactoring tests and revising logic to check expiry date.
edisonLcardenas Sep 12, 2024
95ccde9
ARO-9263: Add another condition to check if expiry date is nil
edisonLcardenas Sep 12, 2024
a9e02de
refactor: update package groupings and error messages to resolve issu…
edisonLcardenas Sep 12, 2024
4abdcc0
ARO-9263: Change expiry to the date the token was issued.
edisonLcardenas Sep 16, 2024
449fc41
ARO-9263: Revise logic to check issue date instead of expiry
edisonLcardenas Sep 17, 2024
1a6810e
ARO-9263: Add constants to reduce redunant values
edisonLcardenas Sep 17, 2024
ebf3c4e
ARO-9263: Update test to check issue date in constant time to avoid f…
edisonLcardenas Sep 18, 2024
1973281
ARO-9263: Change or remove any references about expiry to issue date.
edisonLcardenas Sep 19, 2024
f472e78
ARO-9263: Fix lint issues
edisonLcardenas Sep 20, 2024
101db01
ARO-9263: Revise error message and reorder return statement
edisonLcardenas Sep 24, 2024
409d8a9
ARO-9263: Fix unit test
edisonLcardenas Sep 24, 2024
40f1d9e
fix e2e, hopefully
hawkowl Sep 26, 2024
79a39db
pls
hawkowl Sep 26, 2024
2155b80
API OperatorFlagsMergeStrategy
SrinivasAtmakuri Mar 1, 2023
c82f356
operator flags patches + tests
hawkowl Oct 2, 2024
4ae6d67
add the maintmanifests client to the RP frontend/backend in dev
hawkowl Oct 2, 2024
a985949
reset the cluster flags to stop other tests failing
hawkowl Oct 2, 2024
637bda9
Bump test file
hawkowl Oct 2, 2024
46f13be
Update actuator_test.go
hawkowl Oct 2, 2024
2e80206
fix the ARM resource deploying the partition key
hawkowl Oct 3, 2024
dc7692f
regen
hawkowl Oct 3, 2024
b1c0829
lint fix
hawkowl Oct 3, 2024
af70386
fixes for e2e
hawkowl Oct 3, 2024
470aa41
add the ability to add a debug flag
hawkowl Oct 4, 2024
6dcb1e5
e2e fix
hawkowl Oct 4, 2024
1380fd2
renames and fixes
hawkowl Oct 9, 2024
efc24b4
go mod tidy
hawkowl Oct 10, 2024
3176e49
add some documentation
hawkowl Oct 11, 2024
6 changes: 6 additions & 0 deletions .pipelines/e2e.yml
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,8 @@ jobs:

- script: |
export CI=true
# Tell the E2E binary to run the MIMO tests
export ARO_E2E_MIMO=true
. secrets/env
. ./hack/e2e/run-rp-and-e2e.sh

Expand All @@ -84,6 +86,9 @@ jobs:
run_selenium
validate_selenium_running

run_mimo_actuator
validate_mimo_actuator_running

run_rp
validate_rp_running

Expand Down Expand Up @@ -128,6 +133,7 @@ jobs:

delete_e2e_cluster
kill_rp
kill_mimo_actuator
kill_selenium
kill_podman
kill_vpn
10 changes: 7 additions & 3 deletions Makefile
Expand Up @@ -2,7 +2,7 @@ SHELL = /bin/bash
TAG ?= $(shell git describe --exact-match 2>/dev/null)
COMMIT = $(shell git rev-parse --short=7 HEAD)$(shell [[ $$(git status --porcelain) = "" ]] || echo -dirty)
ARO_IMAGE_BASE = ${RP_IMAGE_ACR}.azurecr.io/aro
E2E_FLAGS ?= -test.v --ginkgo.v --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
E2E_FLAGS ?= -test.v --ginkgo.vv --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
E2E_LABEL ?= !smoke&&!regressiontest
GO_FLAGS ?= -tags=containers_image_openpgp,exclude_graphdriver_btrfs,exclude_graphdriver_devicemapper

Expand Down Expand Up @@ -67,7 +67,7 @@ aro: check-release generate

.PHONY: runlocal-rp
runlocal-rp:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro rp
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} rp

.PHONY: az
az: pyenv
Expand Down Expand Up @@ -196,7 +196,11 @@ proxy:

.PHONY: runlocal-portal
runlocal-portal:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro portal
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} portal

.PHONY: runlocal-actuator
runlocal-actuator:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} mimo-actuator

.PHONY: build-portal
build-portal:
4 changes: 4 additions & 0 deletions cmd/aro/main.go
Expand Up @@ -28,6 +28,7 @@ func usage() {
fmt.Fprintf(flag.CommandLine.Output(), " %s operator {master,worker}\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-versions\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-role-sets\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s mimo-actuator\n", os.Args[0])
flag.PrintDefaults()
}

Expand Down Expand Up @@ -74,6 +75,9 @@ func main() {
case "update-role-sets":
checkArgs(1)
err = updatePlatformWorkloadIdentityRoleSets(ctx, log)
case "mimo-actuator":
checkArgs(1)
err = mimoActuator(ctx, log)
default:
usage()
os.Exit(2)
98 changes: 98 additions & 0 deletions cmd/aro/mimoactuator.go
@@ -0,0 +1,98 @@
package main

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

import (
"context"
"os"
"os/signal"
"syscall"

"github.com/sirupsen/logrus"

"github.com/Azure/ARO-RP/pkg/database"
"github.com/Azure/ARO-RP/pkg/env"
"github.com/Azure/ARO-RP/pkg/metrics/statsd"
"github.com/Azure/ARO-RP/pkg/metrics/statsd/golang"
"github.com/Azure/ARO-RP/pkg/mimo/actuator"
"github.com/Azure/ARO-RP/pkg/mimo/tasks"
"github.com/Azure/ARO-RP/pkg/proxy"
"github.com/Azure/ARO-RP/pkg/util/service"
)

func mimoActuator(ctx context.Context, log *logrus.Entry) error {
stop := make(chan struct{})

_env, err := env.NewEnv(ctx, log, env.COMPONENT_MIMO_ACTUATOR)
if err != nil {
return err
}

var keys []string
if _env.IsLocalDevelopmentMode() {
keys = []string{}
} else {
keys = []string{
"MDM_ACCOUNT",
"MDM_NAMESPACE",
}
}

if err = env.ValidateVars(keys...); err != nil {
return err
}

m := statsd.New(ctx, log.WithField("component", "actuator"), _env, os.Getenv("MDM_ACCOUNT"), os.Getenv("MDM_NAMESPACE"), os.Getenv("MDM_STATSD_SOCKET"))

g, err := golang.NewMetrics(_env.Logger(), m)
if err != nil {
return err
}
go g.Run()

dbc, err := service.NewDatabase(ctx, _env, log, m, true)
if err != nil {
return err
}

dbName, err := service.DBName(_env.IsLocalDevelopmentMode())
if err != nil {
return err
}

clusters, err := database.NewOpenShiftClusters(ctx, dbc, dbName)
if err != nil {
return err
}

manifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
if err != nil {
return err
}

dbg := database.NewDBGroup().
WithOpenShiftClusters(clusters).
WithMaintenanceManifests(manifests)

dialer, err := proxy.NewDialer(_env.IsLocalDevelopmentMode())
if err != nil {
return err
}

a := actuator.NewService(_env, _env.Logger(), dialer, dbg, m)
a.SetMaintenanceTasks(tasks.DEFAULT_MAINTENANCE_SETS)

sigterm := make(chan os.Signal, 1)
done := make(chan struct{})
signal.Notify(sigterm, syscall.SIGTERM)

go a.Run(ctx, stop, done)

<-sigterm
log.Print("received SIGTERM")
close(stop)
<-done

return nil
}
9 changes: 9 additions & 0 deletions cmd/aro/rp.go
Expand Up @@ -172,6 +172,15 @@ func rp(ctx context.Context, log, audit *logrus.Entry) error {
WithPlatformWorkloadIdentityRoleSets(dbPlatformWorkloadIdentityRoleSets).
WithSubscriptions(dbSubscriptions)

// MIMO only activated in development for now
if _env.IsLocalDevelopmentMode() {
dbMaintenanceManifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
if err != nil {
return err
}
dbg.WithMaintenanceManifests(dbMaintenanceManifests)
}

f, err := frontend.NewFrontend(ctx, audit, log.WithField("component", "frontend"), _env, dbg, api.APIs, metrics, clusterm, feAead, hiveClusterManager, adminactions.NewKubeActions, adminactions.NewAzureActions, adminactions.NewAppLensActions, clusterdata.NewParallelEnricher(metrics, _env))
if err != nil {
return err
22 changes: 22 additions & 0 deletions docs/mimo/README.md
@@ -0,0 +1,22 @@
# MIMO Documentation

The Managed Infrastructure Maintenance Operator, or MIMO, is a component of the Azure Red Hat OpenShift Resource Provider (ARO-RP) which is responsible for automated maintenance of clusters provisioned by the platform.
MIMO specifically focuses on "managed infrastructure", the parts of ARO that are deployed and maintained by the RP and ARO Operator instead of by OCP (in-cluster) or Hive (out-of-cluster).

MIMO consists of two main components: the [Actuator](./actuator.md) and the [Scheduler](./scheduler.md). It is primarily driven through the [Admin API](./admin-api.md).

## A Primer On MIMO

The smallest thing that you can tell MIMO to run is a **Task** (see [`pkg/mimo/tasks/`](../../pkg/mimo/tasks/)).
A Task is composed of reusable **Steps** (see [`pkg/mimo/steps/`](../../pkg/mimo/steps/)), reusing the framework utilised by the AdminUpdate/Update/Install methods in `pkg/cluster/`.
A Task only runs in the scope of a single cluster.
These Steps are run in sequence and can return either **Terminal** errors (which cause the running Task to fail and not be retried) or **Transient** errors (which indicate that the Task can be retried later).

The **Actuator** executes a Task when a **Maintenance Manifest** is created for it.
This Manifest is created with the cluster ID (which is elided from the cluster-scoped Admin APIs), the Task ID (which is currently a UUID), and an optional priority, "start after" time, and "start before" time, each of which is filled in with a default if not provided.
The Actuator treats these Maintenance Manifests as a work queue, taking the ones which are past their "start after" time and executing them in order of earliest start-after and priority.
After each run, the Actuator writes the result of the Task into the Manifest as a state, with optional free-form status text.
Manifests past their start-before times are marked with a "timed out" state and are not run.
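
The work-queue behaviour described above can be sketched like this. The `manifest` type and `triage` function are hypothetical, the timestamps are assumed to be Unix seconds, and "lower priority value runs first" is an assumption that the text does not confirm.

```go
package main

import (
	"fmt"
	"sort"
)

// manifest is a hypothetical, trimmed-down stand-in for a Maintenance Manifest.
type manifest struct {
	ID        string
	Priority  int
	RunAfter  int // assumed to be Unix seconds
	RunBefore int // assumed to be Unix seconds
}

// triage splits pending manifests into those that have timed out and those
// that are ready to run now. Manifests whose RunAfter is still in the future
// end up in neither slice. Ready manifests are ordered by earliest RunAfter,
// then by Priority (assumption: lower value first).
func triage(pending []manifest, now int) (ready, timedOut []manifest) {
	for _, m := range pending {
		switch {
		case now > m.RunBefore:
			timedOut = append(timedOut, m)
		case now >= m.RunAfter:
			ready = append(ready, m)
		}
	}

	sort.SliceStable(ready, func(i, j int) bool {
		if ready[i].RunAfter != ready[j].RunAfter {
			return ready[i].RunAfter < ready[j].RunAfter
		}
		return ready[i].Priority < ready[j].Priority
	})

	return ready, timedOut
}

func main() {
	pending := []manifest{
		{ID: "c", Priority: 1, RunAfter: 200, RunBefore: 900},
		{ID: "a", Priority: 5, RunAfter: 100, RunBefore: 900},
		{ID: "b", Priority: 0, RunAfter: 200, RunBefore: 900},
		{ID: "late", Priority: 0, RunAfter: 50, RunBefore: 250},
		{ID: "future", Priority: 0, RunAfter: 800, RunBefore: 900},
	}

	ready, timedOut := triage(pending, 300)
	for _, m := range ready {
		fmt.Println("run:", m.ID)
	}
	for _, m := range timedOut {
		fmt.Println("timed out:", m.ID)
	}
}
```

With `now = 300`, this runs `a`, `b`, `c` in that order, marks `late` as timed out, and leaves `future` in the queue.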

Currently, Manifests are created by the Admin API.
In the future, the Scheduler will create some of these Manifests depending on cluster state/version and wall-clock time, providing the ability to perform tasks like rotations of secrets autonomously.
30 changes: 30 additions & 0 deletions docs/mimo/actuator.md
@@ -0,0 +1,30 @@
# Managed Infrastructure Maintenance Operator: Actuator

The Actuator is the MIMO component that executes Tasks.
The process of running tasks looks like this:

```mermaid
graph TD;
START((Start))-->QUERY;
QUERY[Fetch all State = Pending] -->SORT;
SORT[Sort tasks by RUNAFTER and PRIORITY]-->ITERATE[Iterate over tasks];
ITERATE-- Per Task -->ISEXPIRED;
subgraph PerTask[ ]
ISEXPIRED{{Is now > RUNBEFORE?}}-- Yes --> STATETIMEDOUT([State = TimedOut]) --> CONTINUE[Continue];
ISEXPIRED-- No --> DEQUEUECLUSTER;
DEQUEUECLUSTER[Claim lease on OpenShiftClusterDocument] --> DEQUEUE;
DEQUEUE[Actuator dequeues task]--> ISRETRYLIMIT;
ISRETRYLIMIT{{Have we retried the task too many times?}} -- Yes --> STATETIMEDOUT;
ISRETRYLIMIT -- No -->STATEINPROGRESS;
STATEINPROGRESS([State = InProgress]) -->RUN[[Task is run]];
RUN -- Success --> SUCCESS
RUN-- Terminal Error-->TERMINALERROR;
RUN-- Transient Error-->TRANSIENTERROR;
SUCCESS([State = Completed])-->DELEASECLUSTER
TERMINALERROR([State = Failed])-->DELEASECLUSTER;
TRANSIENTERROR([State = Pending])-->DELEASECLUSTER;
DELEASECLUSTER[Release Lease on OpenShiftClusterDocument] -->CONTINUE;
end
CONTINUE-->ITERATE;
ITERATE-- Finished -->END;
```
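
The per-task branch of the flow above can be written as a small state function. This is a sketch under stated assumptions rather than the actuator's code: `manifest`, `outcome`, `process`, and `maxDequeues` are hypothetical, leasing of the cluster document and persistence are left out, and a manifest is treated as expired once the current time is past its run-before time, as the README describes.

```go
package main

import "fmt"

type state string

const (
	statePending    state = "Pending"
	stateInProgress state = "InProgress"
	stateCompleted  state = "Completed"
	stateFailed     state = "Failed"
	stateTimedOut   state = "TimedOut"
)

// outcome is the result of running a task (hypothetical type).
type outcome int

const (
	success outcome = iota
	terminalError
	transientError
)

// maxDequeues is a hypothetical retry limit; the real value is not stated here.
const maxDequeues = 3

// manifest holds only the fields this sketch needs.
type manifest struct {
	State     state
	RunBefore int // assumed to be Unix seconds
	Dequeues  int
}

// process applies one pass of the per-task flow to m.
func process(m *manifest, now int, run func() outcome) {
	if now > m.RunBefore {
		m.State = stateTimedOut
		return
	}

	m.Dequeues++
	if m.Dequeues > maxDequeues {
		m.State = stateTimedOut
		return
	}

	// A real implementation would persist InProgress before running the task.
	m.State = stateInProgress
	switch run() {
	case success:
		m.State = stateCompleted
	case terminalError:
		m.State = stateFailed
	case transientError:
		m.State = statePending // left in the queue for a later pass
	}
}

func main() {
	m := &manifest{State: statePending, RunBefore: 1000}
	alwaysTransient := func() outcome { return transientError }

	for i := 0; i < 4; i++ {
		process(m, 500, alwaysTransient)
		fmt.Println(m.Dequeues, m.State)
	}
}
```

A task that keeps failing transiently stays `Pending` until the retry limit is exceeded, at which point it ends up `TimedOut`, matching the "retried too many times" edge in the diagram.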
Empty file added docs/mimo/admin-api.md
Empty file.
3 changes: 3 additions & 0 deletions docs/mimo/scheduler.md
@@ -0,0 +1,3 @@
# MIMO Scheduler

The MIMO Scheduler is a planned component, but is not yet implemented.
1 change: 1 addition & 0 deletions docs/mimo/writing-tasks.md
@@ -0,0 +1 @@
# Writing MIMO Tasks
2 changes: 1 addition & 1 deletion go.mod
Expand Up @@ -77,6 +77,7 @@ require (
github.com/vincent-petithory/dataurl v1.0.0
go.uber.org/mock v0.4.0
golang.org/x/crypto v0.28.0
golang.org/x/exp v0.0.0-20240222234643-814bf88cf225
golang.org/x/net v0.30.0
golang.org/x/oauth2 v0.18.0
golang.org/x/sync v0.8.0
Expand Down Expand Up @@ -260,7 +261,6 @@ require (
go.opentelemetry.io/otel/metric v1.22.0 // indirect
go.opentelemetry.io/otel/trace v1.22.0 // indirect
go.starlark.net v0.0.0-20220328144851-d1966c6b9fcd // indirect
golang.org/x/exp v0.0.0-20240222234643-814bf88cf225 // indirect
golang.org/x/mod v0.17.0 // indirect
golang.org/x/sys v0.26.0 // indirect
golang.org/x/term v0.25.0 // indirect
37 changes: 37 additions & 0 deletions hack/e2e/run-rp-and-e2e.sh
Expand Up @@ -91,6 +91,43 @@ kill_portal() {
wait $rppid
}

run_mimo_actuator() {
echo "########## 🚀 Run MIMO Actuator in background ##########"
export AZURE_ENVIRONMENT=AzurePublicCloud
./aro mimo-actuator &
}

kill_mimo_actuator() {
echo "########## Kill the MIMO Actuator running in background ##########"
rppid=$(lsof -t -i :8445)
kill $rppid
wait $rppid
}

validate_mimo_actuator_running() {
echo "########## ?Checking MIMO Actuator Status ##########"
ELAPSED=0
while true; do
sleep 5
http_code=$(curl -k -s -o /dev/null -w '%{http_code}' http://localhost:8445/healthz/ready)
case $http_code in
"200")
echo "########## ✅ ARO MIMO Actuator Running ##########"
break
;;
*)
echo "Attempt $ELAPSED - local MIMO Actuator is NOT up. Code : $http_code, waiting"
sleep 2
# give up after 20 attempts (about 140 seconds, counting both sleeps) so CI is not blocked
ELAPSED=$((ELAPSED + 1))
if [ $ELAPSED -eq 20 ]; then
exit 1
fi
;;
esac
done
}

run_vpn() {
echo "########## 🚀 Run OpenVPN in background ##########"
echo "Using Secret secrets/$VPN"
40 changes: 40 additions & 0 deletions pkg/api/admin/mimo.go
@@ -0,0 +1,40 @@
package admin

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

type MaintenanceManifestState string

const (
MaintenanceManifestStatePending MaintenanceManifestState = "Pending"
MaintenanceManifestStateInProgress MaintenanceManifestState = "InProgress"
MaintenanceManifestStateCompleted MaintenanceManifestState = "Completed"
MaintenanceManifestStateFailed MaintenanceManifestState = "Failed"
MaintenanceManifestStateTimedOut MaintenanceManifestState = "TimedOut"
MaintenanceManifestStateCancelled MaintenanceManifestState = "Cancelled"
)

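// MaintenanceManifest is the admin API representation of a maintenance task
// queued for a cluster.
type MaintenanceManifest struct {
// The ID for the resource.
ID string `json:"id,omitempty"`

// State is the current state of the manifest.
State MaintenanceManifestState `json:"state,omitempty"`
// StatusText is optional free-form text describing the result of the task.
StatusText string `json:"statusText,omitempty"`

// MaintenanceTaskID identifies the task that this manifest runs.
MaintenanceTaskID string `json:"maintenanceTaskID,omitempty"`
// Priority is used together with RunAfter to order pending manifests.
Priority int `json:"priority,omitempty"`

// RunAfter defines the earliest that this manifest should start running
RunAfter int `json:"runAfter,omitempty"`
// RunBefore defines the latest that this manifest should start running
RunBefore int `json:"runBefore,omitempty"`
}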

// MaintenanceManifestList represents a list of MaintenanceManifests.
type MaintenanceManifestList struct {
// The list of MaintenanceManifests.
MaintenanceManifests []*MaintenanceManifest `json:"value"`

// The link used to get the next page of operations.
NextLink string `json:"nextLink,omitempty"`
}