PUCM Maintenance Signals #3021

niontive · 2023-07-12T15:27:08Z

Which issue this PR addresses:

Implement maintenance signals for this: https://issues.redhat.com/browse/ARO-1933

What this PR does / why we need it:

Add a pucmPending flag to the cluster properties.
Add a maintenance task for the PUCM pending update. The task is set by the admin PATCH Geneva action and avoids the normal "admin update" code path by setting the provisioning state to "Updating".
Use a state machine in the cluster monitoring stack to output PUCM signals.
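
For readers skimming the description, here is a minimal sketch of how a flag plus a small state machine could be turned into a metric. It is not the code merged in this PR. The type names, the state labels ("none", "pending", "ongoing"), the metric name, the emitter interface, and the use of "AdminUpdating" as the marker for maintenance in progress are all assumptions made for illustration.

```go
package main

import "fmt"

// maintenanceState is a hypothetical label attached to the emitted metric.
type maintenanceState string

const (
	stateNone    maintenanceState = "none"
	statePending maintenanceState = "pending"
	stateOngoing maintenanceState = "ongoing"
)

// clusterProperties is a simplified stand-in for the cluster document.
// Only the two fields the sketch needs are modelled.
type clusterProperties struct {
	ProvisioningState string // assumed values: "Succeeded", "Updating", "AdminUpdating"
	PucmPending       bool   // the flag set through the admin PATCH action
}

// getMaintenanceState maps the stored properties to one state.
// Assumption: an admin update in progress takes priority over the pending flag.
func getMaintenanceState(p clusterProperties) maintenanceState {
	switch {
	case p.ProvisioningState == "AdminUpdating":
		return stateOngoing
	case p.PucmPending:
		return statePending
	default:
		return stateNone
	}
}

// gaugeEmitter is a hypothetical metrics interface, not the real RP emitter.
type gaugeEmitter interface {
	EmitGauge(name string, value int64, dims map[string]string)
}

// printEmitter writes each gauge to stdout so the sketch can be run locally.
type printEmitter struct{}

func (printEmitter) EmitGauge(name string, value int64, dims map[string]string) {
	fmt.Printf("%s=%d %v\n", name, value, dims)
}

// emitMaintenanceState emits one gauge whose "state" dimension carries the signal.
func emitMaintenanceState(e gaugeEmitter, p clusterProperties) {
	state := getMaintenanceState(p)
	e.EmitGauge("example.maintenance.state", 1, map[string]string{"state": string(state)})
}

func main() {
	e := printEmitter{}
	emitMaintenanceState(e, clusterProperties{ProvisioningState: "Succeeded", PucmPending: true})
	emitMaintenanceState(e, clusterProperties{ProvisioningState: "AdminUpdating"})
	emitMaintenanceState(e, clusterProperties{ProvisioningState: "Succeeded"})
}
```

Keeping the mapping in a pure function such as getMaintenanceState makes it easy to cover with table-driven unit tests, which matches the test plan below.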

Test plan for issue:

Unit tests
Test in INT

Is there any documentation that needs to be updated for this PR?

dem4gus

Suggestion for the test cases

pkg/monitor/cluster/maintenance_test.go

pkg/monitor/cluster/maintenance.go

cadenmarchese

Looks great, just one adjustment suggested. Thanks!

pkg/monitor/cluster/cluster.go

dem4gus

Small suggestion but otherwise looks good.

pkg/monitor/cluster/maintenance_test.go

hawkowl · 2023-07-17T04:27:04Z

@niontive We're planning on moving away from the "PUCM" terminology once the automation is going. Could we change this to a more generic "maintenancepending"?

facchettos

I don't think this is the way it should be done. I believe we should emit the metrics directly when we are updating the cluster, rather than storing the state in the database and then polling it. Polling will increase the load on the monitor unnecessarily. We have access to the metrics emitter directly from the frontend (and most likely the backend too), and we should use it from there directly, imho.

pkg/monitor/cluster/maintenance.go

AldoFusterTurpin

Thank you for this. Great work so far!

I left some suggestions to simplify a few things, in case you find them useful.

Thanks! 😃

pkg/api/admin/openshiftcluster_validatestatic.go

pkg/backend/openshiftcluster.go

pkg/frontend/openshiftcluster_putorpatch.go

pkg/monitor/cluster/maintenance.go

niontive · 2023-07-17T17:50:04Z

@facchettos I reviewed this design with @bennerv and @s-amann. Below is the justification for choosing the cluster monitoring stack over just the frontend:

For the resource health check, these metrics get picked up by a Geneva monitor. That Geneva monitor must emit resource-impacting events once a minute. According to the resource health check docs, this is a "must" from the resource health team.
For the maintenance pending signal, we'll start emitting it two weeks before performing PUCM. This means we need to generate that metric continuously over the two-week period. Geneva monitors are not designed to behave like step functions; they need the metric to be present continuously to emit the resource-impacting event.
PUCM can fail. After speaking with Spencer, Ben and Puneet, we decided to leave the cluster in a "pucm ongoing" state while ARO SREs finish maintenance work. This means that if we rely solely on the frontend, we'll need some logic to check whether the PUCM failed and to keep emitting the metrics.
Any metrics emitted from ARO-RP would probably need a separate goroutine to emit the metric as the maintenance progresses. This puts additional load on the frontend and/or backend. The cluster monitor threads didn't get overloaded during INT testing, so I think we should be OK here.
RP and monitoring services can crash. In that event, it's helpful to have the state saved so that, upon restart, the monitoring service knows the last maintenance state and can emit the appropriate metric.
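
To illustrate the last two points, here is a rough sketch of the polling pattern: the monitor holds no maintenance state of its own, re-reads the persisted state on every tick, and so resumes emitting the same signal after a restart. The stateStore and monitor types are hypothetical stand-ins for the database document and the cluster monitor, not the actual RP types.

```go
package main

import (
	"fmt"
	"sync"
)

// stateStore is a hypothetical stand-in for the persisted cluster document.
type stateStore struct {
	mu    sync.Mutex
	state string
}

// Set records the maintenance state, as the admin PATCH action would.
func (s *stateStore) Set(v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = v
}

// Get returns the last persisted maintenance state.
func (s *stateStore) Get() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state
}

// monitor is a hypothetical cluster monitor that keeps no maintenance state
// in memory. It records what it emitted so the sketch can be inspected.
type monitor struct {
	store   *stateStore
	emitted []string
}

// tick represents one monitoring pass (once a minute in the discussion above).
// It reads the persisted state and emits it as the metric's state value.
func (m *monitor) tick() {
	m.emitted = append(m.emitted, m.store.Get())
}

func main() {
	store := &stateStore{}
	store.Set("pending")

	first := &monitor{store: store}
	first.tick()
	first.tick()

	// Simulate a crash and restart: a fresh monitor, the same persisted store.
	restarted := &monitor{store: store}
	restarted.tick()

	fmt.Println(first.emitted, restarted.emitted) // prints "[pending pending] [pending]"
}
```

In a real service, tick would be driven by a time.Ticker and the store would be a database read; both are left out here to keep the sketch deterministic.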

petrkotas

Hi @niontive, I see that all remarks are answered, the design is thought through, and the code is OK. Thanks for the PR.

AldoFusterTurpin

LGTM! 🚀 Thanks for applying the suggestions.

Nicolas Ontiveros added 5 commits July 12, 2023 08:26

PUCM Maintenance Signals

1aa328c

fiix

4eef647

lint

81c0912

gofmt

8700955

more lint

0cd2c92

niontive marked this pull request as ready for review July 12, 2023 16:41

niontive requested review from jewzaam, bennerv, hawkowl, rogbas, petrkotas, jharrington22, cblecker, facchettos, cadenmarchese, UlrichSchlueter, s-amann, SudoBrendan, ellis-johnson, yjst2012 and anshulvermapatel as code owners July 12, 2023 16:41

cleanup test

0cddd29

dem4gus requested changes Jul 12, 2023

pkg/monitor/cluster/maintenance_test.go (resolved)

s-amann reviewed Jul 12, 2023

pkg/monitor/cluster/maintenance.go (outdated, resolved)

niontive added the ready-for-review and go labels Jul 12, 2023

Nicolas Ontiveros added 4 commits July 12, 2023 13:25

address comments

b0302d6

lint

e10d121

validate

dc64d2c

add loggiing

be9abca

Nicolas Ontiveros added 2 commits July 13, 2023 06:03

unit test

3d1bd6c

fix test

dc19e21

cadenmarchese reviewed Jul 13, 2023

pkg/monitor/cluster/cluster.go (outdated, resolved)

Nicolas Ontiveros added 3 commits July 13, 2023 07:04

add to bottom

6ed96c3

update logging

037fcf1

set no pucm pending properly

9aec743

niontive dismissed SrinivasAtmakuri’s stale review via 9aec743 July 13, 2023 14:33

Nicolas Ontiveros added 3 commits July 13, 2023 07:35

fix error checking

26ad167

fix lint

0a62e4e

lint

66d7ea8

dem4gus previously approved these changes Jul 13, 2023

pkg/monitor/cluster/maintenance_test.go (outdated, resolved)

cadenmarchese previously approved these changes Jul 14, 2023

Merge branch 'master' into niontive/maintenance-pucm

ed20519

facchettos suggested changes Jul 17, 2023

pkg/monitor/cluster/maintenance.go (outdated, resolved)

pkg/monitor/cluster/maintenance.go (outdated, resolved)

pkg/monitor/cluster/maintenance.go (resolved)

nit fix for unit test

3775b36

niontive dismissed stale reviews from cadenmarchese and dem4gus via 3775b36 July 17, 2023 17:02

niontive added 3 commits July 17, 2023 10:06

refactor getPucmState

2cefc95

Merge branch 'master' into niontive/maintenance-pucm

f1de256

remove ongoing

d8379b6

AldoFusterTurpin suggested changes Jul 17, 2023

niontive added 3 commits July 17, 2023 11:08

validate static refactor

b13ddf9

add comment for setUpdateProvisioningState

9cd1264

minor refactor

caf98c7

facchettos approved these changes Jul 18, 2023

petrkotas approved these changes Jul 18, 2023

AldoFusterTurpin approved these changes Jul 18, 2023

petrkotas merged commit 782d5fa into Azure:master Jul 18, 2023
18 checks passed
