update predeploy to restart old VMSS when service secrets rotated #2946

Merged

Conversation

@rajdeepc2792 (Collaborator) commented Jun 8, 2023

Which issue this PR addresses:

Jira - https://issues.redhat.com/browse/ARO-3240

What this PR does / why we need it:

  • The error scenario:
    Deployment

    • Updates the encryption secrets
    • Deploys a new VMSS, but the deployment fails due to RHUI

    Cluster creation

    • Old VMs without the updated encryption secrets handle creation
    • Old VMs hand off to Hive
    • Hive runs the aro-installer image
    • The aro-installer image picks up the new encryption secrets
    • The aro-installer image uses the latest encryption secrets to update the openshiftcluster doc
    • RP waits for aro-installer to finish
    • aro-installer finishes
    • RP assumes control
    • RP attempts to generate kubeconfigs
    • RP cannot decrypt the existing cluster doc because it is encrypted with the old keys
  • Suggested fix:

    • Reduce the window of error during encryption key rotation by restarting the aro-gateway and aro-rp services on the old VMSSes after the keys are rotated (see the sketch after this section).

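A minimal sketch of what that restart step could look like with the track-1 Azure SDK for Go. The function name, parameters, and API version are illustrative assumptions, not the actual predeploy.go code:

```go
package main

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute"
	"github.com/Azure/go-autorest/autorest/to"
)

// restartServiceOnVMSS restarts a systemd unit (aro-rp or aro-gateway) on
// every instance of an existing VMSS via the RunCommand API, so the old
// instances reload the rotated encryption secrets.
func restartServiceOnVMSS(ctx context.Context, vms compute.VirtualMachineScaleSetVMsClient, resourceGroup, vmssName, unit string) error {
	page, err := vms.List(ctx, resourceGroup, vmssName, "", "", "")
	if err != nil {
		return err
	}
	for page.NotDone() {
		for _, vm := range page.Values() {
			future, err := vms.RunCommand(ctx, resourceGroup, vmssName, *vm.InstanceID, compute.RunCommandInput{
				CommandID: to.StringPtr("RunShellScript"),
				Script:    &[]string{"systemctl restart " + unit},
			})
			if err != nil {
				return err
			}
			// block until the run command finishes on this instance
			if err := future.WaitForCompletionRef(ctx, vms.Client); err != nil {
				return err
			}
		}
		if err := page.NextWithContext(ctx); err != nil {
			return err
		}
	}
	return nil
}
```

The key detail is targeting the old VMSSes (the instances still serving traffic) rather than the just-deployed ones.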
Test plan for issue:

  • Reproduced the issue in dev using a full-RP deployment:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys and skipped the new VMSS deployment
    • Triggered cluster creation; it failed with the same chacha20poly1305 error.
  • Performed the above steps again with a partial fix:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys, restarted aro-rp on the old rp VMSS, and skipped the new VMSS deployment
    • Triggered cluster creation; it failed with the same chacha20poly1305 error, this time because aro-gateway was not restarted.
  • Performed the above steps again with the complete fix:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys, restarted aro-gateway on the old gateway VMSS, restarted aro-rp on the old rp VMSS, and skipped the new VMSS deployment
    • Triggered cluster creation
    • Ran make deploy again while cluster creation was in progress, restarting the aro-rp and aro-gateway services
      • The services restarted quickly and successfully.
    • Cluster creation completed successfully.
  • Unit tests added for the predeploy.go functions.

Is there any documentation that needs to be updated for this PR?

No.
We need to watch for lag in the rp restart during the prod release because of #157.
If the lag is frequent and impacts release time, we will need an alternative approach to this PR.

@SudoBrendan (Collaborator) left a comment


Just a quick comment I'd like some discussion on.

pkg/deploy/predeploy.go: 4 review threads (outdated, resolved)
@rajdeepc2792 force-pushed the fix-encryption-keys-rotation-mechanism branch from 4783d36 to 4ff4541 on June 14, 2023 21:34
```go
}

// wait for load balancer probe to change the health status
time.Sleep(30 * time.Second)
```

Collaborator

Do we need to sleep here, or is the context.WithTimeout on line 546 sufficient?

Collaborator (Author)

As I understand it, context.WithTimeout is there to exit the long-running wait when the instance keeps reporting unhealthy. Here, though, right after the aro-rp restart there is a chance the rp probe still reports healthy, because the probe runs at a 15-second interval with a threshold of 2. That is, the probe will only start reflecting the VMSS's real health status about 30 seconds after the restart.
Please correct me if there is something wrong with that understanding.
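
For contrast, this is the shape of the loop that the context.WithTimeout would bound, as I read the discussion. The real waitForReadiness differs; isHealthy, the 10-minute timeout, and the 15-second poll interval are assumptions for illustration:

```go
package main

import (
	"context"
	"time"
)

// The context timeout bounds how long we keep polling an instance that
// stays unhealthy; the separate up-front sleep gives the LB probe time to
// notice the restart at all. isHealthy is a hypothetical stand-in for the
// real probe-status getter.
func waitForReadiness(ctx context.Context, isHealthy func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
	defer cancel()
	for {
		// without the earlier sleep, this could read a stale "healthy"
		// status and return before the probe has observed the restart
		if ok, err := isHealthy(ctx); err == nil && ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(15 * time.Second):
		}
	}
}
```

In other words, the timeout and the sleep guard against two different failure modes, so one does not replace the other.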

Member

As I mentioned, I am not entirely sure we need this check. What we should check is the output of the command we are running.

Contributor

I get what @rajdeepc2792 is saying, and I see why we would need this time.Sleep. I think we may need to adjust the amount of time, though, for two reasons:

  1. I don't think this accounts for the "timeout period" mentioned here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#probe-interval. The docs aren't very clear about what the timeout period really is, but it seems safe to assume it is at most 10 seconds. Reusing the logic that got to 30 seconds, we would need a sleep of (15 second interval + 10 second timeout period) * 2 probes = 50 seconds to account for it.
  2. I was checking out the health probe on the rp-lb in prod eastus to better understand how it is configured, and I noticed a warning about an Azure bug where numberOfProbes is always treated as 1 regardless of the configured value. The bug is documented here: https://learn.microsoft.com/en-us/azure/load-balancer/whats-new#known-issues. With that in mind, we would actually only need 15 second interval + 10 second timeout period = 25 seconds before the health status of the RP is accurately reflected.

So in summary, I think 25 seconds is long enough, but keeping it at 30 seconds won't hurt.
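
The same arithmetic written out as Go constants (the names are mine, and the fragment assumes `import "time"`):

```go
const (
	probeInterval  = 15 * time.Second // rp-lb probe interval
	probeTimeout   = 10 * time.Second // assumed worst-case per-probe timeout period
	numberOfProbes = 2                // configured threshold; effectively 1 due to the Azure bug
)

// As configured: (15s + 10s) * 2 = 50s.
// While the numberOfProbes bug is in effect: (15s + 10s) * 1 = 25s.
var probeSettleTime = (probeInterval + probeTimeout) * numberOfProbes
```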

Collaborator (Author)

Thanks @kimorris27 for the detailed reasoning; I agree with your analysis. To add to it: as pointed out in the comment, the rp-probe configuration is set here, and as the known-issues link suggests, the Azure product team is working on a fix. Once the bug is fixed, the waitForReadiness check might stop reflecting the real probe status if the timeout is kept at ~30 seconds.

Contributor

Yeah, good point. It may be worth leaving a comment in this code with a link to info about the bug. Or maybe there's a better way to document this.
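
For instance, the sleep could carry the reasoning and the links inline; the comment wording here is only a suggestion:

```go
// Wait for the load balancer probe to change the health status.
// The rp-lb probe runs every 15s with an up-to-10s timeout period, and
// numberOfProbes is configured as 2 but is effectively 1 due to a known
// Azure bug, so ~25s is the real worst case today:
// https://learn.microsoft.com/en-us/azure/load-balancer/whats-new#known-issues
// If the bug is fixed, the worst case becomes (15s + 10s) * 2 = 50s.
time.Sleep(30 * time.Second)
```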

pkg/deploy/predeploy.go: 2 review threads (outdated, resolved)
dem4gus previously approved these changes Jun 21, 2023

@dem4gus (Collaborator) left a comment


Small suggestion, PR looks good.

pkg/deploy/predeploy.go: 1 review thread (outdated, resolved)
kimorris27 previously approved these changes Jun 22, 2023

@petrkotas (Member) left a comment


Hi @rajdeepc2792, this is nice work! I have a small find that should improve the last check. Would you please look into it?

pkg/deploy/predeploy.go: 1 review thread (outdated, resolved)

@facchettos (Contributor) left a comment

No unit tests; you should add some.
The code needs some simplification, and I am not sure we want to restart all services concurrently (a sequential alternative is sketched below).
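
A sequential variant, sketched with a hypothetical per-instance restart helper: rolling one instance at a time and waiting out the probe window between instances keeps at least one healthy backend behind the load balancer throughout.

```go
package main

import (
	"context"
	"time"
)

// restartSequentially restarts the service on one VMSS instance at a time,
// waiting for the LB probe to re-mark each instance healthy before moving
// on. restart is a hypothetical stand-in for the real per-instance helper.
func restartSequentially(ctx context.Context, instanceIDs []string, restart func(context.Context, string) error) error {
	for _, id := range instanceIDs {
		if err := restart(ctx, id); err != nil {
			return err
		}
		// give the LB probe time to observe the restarted instance
		time.Sleep(30 * time.Second)
	}
	return nil
}
```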

pkg/deploy/predeploy.go: 5 review threads (outdated, resolved)
@cadenmarchese (Collaborator) left a comment

Looks good to me, but agreed it would be nice to have some tests before merge.

pkg/deploy/predeploy.go: 2 review threads (outdated, resolved)
@cadenmarchese added the chainsaw label (Pull requests or issues owned by Team Chainsaw) and removed the next-release label (To be included in the next RP release rollout) on Jul 13, 2023
@rajdeepc2792 force-pushed the fix-encryption-keys-rotation-mechanism branch from 59b7f43 to a1078f9 on July 18, 2023 18:30
pkg/deploy/predeploy_test.go: 1 review thread (outdated, resolved)
@cadenmarchese (Collaborator) left a comment

Thanks @rajdeepc2792 for all of your work on the tests. This test coverage goes above and beyond. I think some of the repetition could be reduced, and I would like others who are better at reviewing tests to take a look as well.

pkg/deploy/predeploy_test.go: 2 review threads (resolved)
@cadenmarchese added the next-release label (To be included in the next RP release rollout) on Jul 27, 2023
@cadenmarchese dismissed stale reviews from facchettos and petrkotas on July 28, 2023 15:01

Comments addressed

@cadenmarchese merged commit f6129d9 into Azure:master on Jul 28, 2023. 18 checks passed.
Labels: chainsaw (Pull requests or issues owned by Team Chainsaw), next-release (To be included in the next RP release rollout), ready-for-review