PUCM Maintenance Signals #3021

Merged Jul 18, 2023 (28 commits)

@niontive niontive commented Jul 12, 2023

Which issue this PR addresses:

Implement maintenance signals for this: https://issues.redhat.com/browse/ARO-1933

What this PR does / why we need it:

  1. Add pucmPending flag to cluster properties
  2. Add a maintenance task for the PUCM pending update. It is set by the admin PATCH Geneva action and avoids the normal "admin update" code path by setting the provisioning state to "Updating"
  3. Use state machine in cluster monitoring stack to output PUCM signals
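
For readers unfamiliar with the approach in item 3, here is a minimal sketch of what such a state machine could look like. It is not the code from this PR: the type `clusterSnapshot`, its fields `UpdateInProgress` and `LastUpdateFailed`, the three state names, and the transition rules are all assumptions made for illustration. Only the idea of a pending flag on the cluster properties comes from the description above.

```go
package main

import "fmt"

// maintenanceState is a hypothetical set of states; the PR's real names may differ.
type maintenanceState string

const (
	maintenanceNone    maintenanceState = "none"
	maintenancePending maintenanceState = "pending"
	maintenanceOngoing maintenanceState = "ongoing"
)

// clusterSnapshot is a hypothetical, trimmed-down view of the cluster document.
type clusterSnapshot struct {
	PucmPending      bool // the pending flag from item 1
	UpdateInProgress bool // assumed: a maintenance update is currently running
	LastUpdateFailed bool // assumed: the most recent maintenance update failed
}

// nextState computes the state to report from the previous state and the
// current snapshot. A failed update keeps the cluster in "ongoing".
func nextState(prev maintenanceState, c clusterSnapshot) maintenanceState {
	switch {
	case c.UpdateInProgress:
		return maintenanceOngoing
	case prev == maintenanceOngoing && c.LastUpdateFailed:
		return maintenanceOngoing
	case c.PucmPending:
		return maintenancePending
	default:
		return maintenanceNone
	}
}

func main() {
	state := maintenanceNone
	for _, c := range []clusterSnapshot{
		{PucmPending: true},
		{PucmPending: true, UpdateInProgress: true},
		{LastUpdateFailed: true},
		{},
	} {
		state = nextState(state, c)
		fmt.Println(state)
	}
}
```

Run as written, this prints `pending`, `ongoing`, `ongoing`, `none`. The monitor would then emit one signal per state on each pass.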

Test plan for issue:

  • Unit tests
  • Test in INT

Is there any documentation that needs to be updated for this PR?


@dem4gus dem4gus left a comment

Suggestion for the test cases

pkg/monitor/cluster/maintenance_test.go (resolved)
@niontive added the ready-for-review and go labels Jul 12, 2023
Nicolas Ontiveros added 2 commits July 13, 2023 06:03

@cadenmarchese cadenmarchese left a comment

Looks great, just one adjustment suggested. Thanks!

pkg/monitor/cluster/cluster.go (outdated, resolved)
dem4gus previously approved these changes Jul 13, 2023

@dem4gus dem4gus left a comment

Small suggestion but otherwise looks good.

pkg/monitor/cluster/maintenance_test.go (outdated, resolved)
cadenmarchese previously approved these changes Jul 14, 2023
hawkowl commented Jul 17, 2023

@niontive We're planning on moving away from the "PUCM" terminology once the automation is going. Could we change this to a more generic "maintenancepending"?


@facchettos facchettos left a comment

I don't think this is the right approach. I believe we should emit the metrics directly when we are updating the cluster, rather than storing the state in the database and then polling it. Polling increases the load on the monitor when it doesn't need to. We have access to the metrics emitter directly from the frontend (and most likely the backend too), and we should use it from there, IMHO.

pkg/monitor/cluster/maintenance.go (outdated, resolved)
pkg/monitor/cluster/maintenance.go (outdated, resolved)
pkg/monitor/cluster/maintenance.go (resolved)
@niontive niontive dismissed stale reviews from cadenmarchese and dem4gus via 3775b36 July 17, 2023 17:02
@AldoFusterTurpin AldoFusterTurpin left a comment

Thank you for this. Great work so far!

I left some suggestions to simplify a few things, in case you find them useful.

Thanks! 😃

pkg/api/admin/openshiftcluster_validatestatic.go (outdated, resolved)
pkg/backend/openshiftcluster.go (resolved)
pkg/backend/openshiftcluster.go (resolved)
pkg/frontend/openshiftcluster_putorpatch.go (resolved)
pkg/monitor/cluster/maintenance.go (resolved)
pkg/monitor/cluster/maintenance.go (outdated, resolved)
@niontive (author) commented

@facchettos I reviewed this design with @bennerv and @s-amann. Below is the justification for using the cluster monitoring stack rather than just the frontend:

  • Resource health check: these metrics are picked up by a Geneva monitor, which must emit resource-impacting events once a minute. According to the resource health check docs, this is a "must" from the resource health team.
  • Maintenance pending signal: we'll emit it starting two weeks before performing PUCM, so the metric must be generated continuously over that two-week period. Geneva monitors are not designed to behave like step functions; they need the metric to be present continuously to emit the resource-impacting event.
  • PUCM can fail. After speaking with Spencer, Ben and Puneet, we decided to leave the cluster in a "pucm ongoing" state while ARO SREs finish the maintenance work. If we relied solely on the frontend, we would need extra logic to check whether the PUCM failed and keep emitting the metrics.
  • Any metrics emitted from ARO-RP would probably need a separate goroutine to emit the metric as the maintenance progresses, which puts additional load on the frontend and/or backend. The cluster monitor threads didn't get overloaded during INT testing, so I think we should be OK here.
  • RP and monitoring services can crash. In that event, it helps to have the state saved so that, on restart, the monitoring service knows the last maintenance state and can emit the appropriate metric.
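
To make the trade-off above concrete, here is a rough sketch of the polling pattern being argued for: the monitor reads the persisted maintenance state on every tick and emits a data point while maintenance is pending or ongoing. Everything named here is hypothetical, including the `stateStore` and `gaugeEmitter` interfaces, the metric name `maintenance.signal`, and the state strings. None of it is the ARO-RP or Geneva API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stateStore is a hypothetical stand-in for the cluster database.
type stateStore interface {
	MaintenanceState(clusterID string) (string, error)
}

// gaugeEmitter is a hypothetical stand-in for a metrics emitter.
type gaugeEmitter interface {
	EmitGauge(name string, value int64, dims map[string]string)
}

// memStore is an in-memory store used only for this example.
type memStore map[string]string

func (m memStore) MaintenanceState(id string) (string, error) {
	s, ok := m[id]
	if !ok {
		return "", fmt.Errorf("cluster %q not found", id)
	}
	return s, nil
}

// recordingEmitter records emitted points instead of sending them anywhere.
type recordingEmitter struct{ calls []string }

func (r *recordingEmitter) EmitGauge(name string, value int64, dims map[string]string) {
	r.calls = append(r.calls, name+":"+dims["state"])
}

// emitOnce reads the persisted state and emits one data point for it.
// Because the state lives in the store, a restarted monitor resumes correctly.
func emitOnce(s stateStore, e gaugeEmitter, clusterID string) error {
	state, err := s.MaintenanceState(clusterID)
	if err != nil {
		return err
	}
	if state == "none" {
		return nil // nothing to report outside of maintenance
	}
	e.EmitGauge("maintenance.signal", 1, map[string]string{"state": state})
	return nil
}

// runMonitor emits immediately and then once per interval until ctx is done.
func runMonitor(ctx context.Context, interval time.Duration, s stateStore, e gaugeEmitter, clusterID string) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := emitOnce(s, e, clusterID); err != nil {
			fmt.Println("emit failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	store := memStore{"cluster-a": "pending"}
	em := &recordingEmitter{}
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()
	// A real monitor would use a one-minute interval; this one is short for the demo.
	runMonitor(ctx, 10*time.Millisecond, store, em, "cluster-a")
	fmt.Println("data points emitted:", len(em.calls))
}
```

The frontend-only alternative would need to keep an equivalent loop and its own failure tracking alive for the full two-week window.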


@petrkotas petrkotas left a comment

Hi @niontive, I see that all remarks are answered, the design is well thought through, and the code is OK. Thanks for the PR.

@AldoFusterTurpin AldoFusterTurpin left a comment

LGTM! 🚀 Thanks for applying the suggestions.

@petrkotas petrkotas merged commit 782d5fa into Azure:master Jul 18, 2023
18 checks passed
Labels: go, ready-for-review