update predeploy to restart old VMSS when service secrets rotated #2946

Merged

Conversation

@rajdeepc2792 (Collaborator) commented Jun 8, 2023

Which issue this PR addresses:

Jira - https://issues.redhat.com/browse/ARO-3240

What this PR does / why we need it:

  • The error scenario:
    Deployment

    • Updates the encryption secrets
    • Deploys a new VMSS, but the deployment fails due to RHUI

    Cluster creation

    • Old VMs without the updated encryption secrets handle creation
    • Old VMs hand off to Hive
    • Hive runs the aro-installer image
    • The aro-installer image picks up the new encryption secrets
    • The aro-installer image uses the latest encryption secrets to update the openshiftcluster doc
    • RP waits for aro-installer to finish
    • aro-installer finishes
    • RP assumes control
    • RP attempts to generate kubeconfigs
    • RP cannot decrypt the existing cluster doc because it is encrypted with the old keys
  • Suggested fix:

    • Reduce the window of error during encryption key rotation by restarting the aro-gateway and aro-rp services on the old VMSSes after the keys are rotated (see the sketch after this section).

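A minimal sketch of what that restart step could look like with the track-1 Azure SDK for Go. The function name, parameters, and API version are illustrative assumptions, not the actual predeploy.go code:

```go
package main

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute"
	"github.com/Azure/go-autorest/autorest/to"
)

// restartServiceOnVMSS restarts a systemd unit (aro-rp or aro-gateway) on
// every instance of an existing VMSS via the RunCommand API, so the old
// instances reload the rotated encryption secrets.
func restartServiceOnVMSS(ctx context.Context, vms compute.VirtualMachineScaleSetVMsClient, resourceGroup, vmssName, unit string) error {
	page, err := vms.List(ctx, resourceGroup, vmssName, "", "", "")
	if err != nil {
		return err
	}
	for page.NotDone() {
		for _, vm := range page.Values() {
			future, err := vms.RunCommand(ctx, resourceGroup, vmssName, *vm.InstanceID, compute.RunCommandInput{
				CommandID: to.StringPtr("RunShellScript"),
				Script:    &[]string{"systemctl restart " + unit},
			})
			if err != nil {
				return err
			}
			// block until the run command finishes on this instance
			if err := future.WaitForCompletionRef(ctx, vms.Client); err != nil {
				return err
			}
		}
		if err := page.NextWithContext(ctx); err != nil {
			return err
		}
	}
	return nil
}
```

The key detail is targeting the old VMSSes (the instances still serving traffic) rather than the just-deployed ones.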
Test plan for issue:

  • Reproduced the issue in dev using a full-RP deployment:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys and skipped the new VMSS deployment
    • Triggered cluster creation; it failed with the same chacha20poly1305 error.
  • Performed the above steps again with a partial fix:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys, restarted aro-rp on the old rp VMSS, and skipped the new VMSS deployment
    • Triggered cluster creation; it failed with the same chacha20poly1305 error, this time because aro-gateway was not restarted.
  • Performed the above steps again with the complete fix:

    • Deployed the full RP
    • Triggered make deploy with changes that rotated the encryption keys, restarted aro-gateway on the old gateway VMSS, restarted aro-rp on the old rp VMSS, and skipped the new VMSS deployment
    • Triggered cluster creation
    • Ran make deploy again while cluster creation was in progress, restarting the aro-rp and aro-gateway services
      • The services restarted quickly and successfully.
    • Cluster creation completed successfully.
  • Unit tests added for the predeploy.go functions.

Is there any documentation that needs to be updated for this PR?

No.
We need to watch for lag in the rp restart during the prod release because of #157.
If the lag is frequent and impacts release time, we will need an alternative approach to this PR.

@SudoBrendan (Collaborator) left a comment


Just a quick comment I'd like some discussion on.

pkg/deploy/predeploy.go: 4 review threads (outdated, resolved)
@rajdeepc2792 force-pushed the fix-encryption-keys-rotation-mechanism branch from 4783d36 to 4ff4541 on June 14, 2023 21:34
```go
}

// wait for load balancer probe to change the health status
time.Sleep(30 * time.Second)
```

Collaborator

Do we need to sleep here, or is the context.WithTimeout on line 546 sufficient?

Collaborator (Author)

As I understand it, context.WithTimeout is there to exit the long-running wait when the instance keeps reporting unhealthy. Here, though, right after the aro-rp restart there is a chance the rp probe still reports healthy, because the probe runs at a 15-second interval with a threshold of 2. That is, the probe will only start reflecting the VMSS's real health status about 30 seconds after the restart.
Please correct me if there is something wrong with that understanding.
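
For contrast, this is the shape of the loop that the context.WithTimeout would bound, as I read the discussion. The real waitForReadiness differs; isHealthy, the 10-minute timeout, and the 15-second poll interval are assumptions for illustration:

```go
package main

import (
	"context"
	"time"
)

// The context timeout bounds how long we keep polling an instance that
// stays unhealthy; the separate up-front sleep gives the LB probe time to
// notice the restart at all. isHealthy is a hypothetical stand-in for the
// real probe-status getter.
func waitForReadiness(ctx context.Context, isHealthy func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
	defer cancel()
	for {
		// without the earlier sleep, this could read a stale "healthy"
		// status and return before the probe has observed the restart
		if ok, err := isHealthy(ctx); err == nil && ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(15 * time.Second):
		}
	}
}
```

In other words, the timeout and the sleep guard against two different failure modes, so one does not replace the other.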

Member

As I mentioned, I am not entirely sure we need this check. What we should check is the output of the command we are running.

Contributor

I get what @rajdeepc2792 is saying, and I see why we would need this time.Sleep. I think we may need to adjust the amount of time, though, for two reasons:

  1. I don't think this accounts for the "timeout period" mentioned here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#probe-interval. The docs aren't very clear about what the timeout period really is, but it seems safe to assume it is at most 10 seconds. Reusing the logic that got to 30 seconds, we would need a sleep of (15 second interval + 10 second timeout period) * 2 probes = 50 seconds to account for it.
  2. I was checking out the health probe on the rp-lb in prod eastus to better understand how it is configured, and I noticed a warning about an Azure bug where numberOfProbes is always treated as 1 regardless of the configured value. The bug is documented here: https://learn.microsoft.com/en-us/azure/load-balancer/whats-new#known-issues. With that in mind, we would actually only need 15 second interval + 10 second timeout period = 25 seconds before the health status of the RP is accurately reflected.

So in summary, I think 25 seconds is long enough, but keeping it at 30 seconds won't hurt.
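
The same arithmetic written out as Go constants (the names are mine, and the fragment assumes `import "time"`):

```go
const (
	probeInterval  = 15 * time.Second // rp-lb probe interval
	probeTimeout   = 10 * time.Second // assumed worst-case per-probe timeout period
	numberOfProbes = 2                // configured threshold; effectively 1 due to the Azure bug
)

// As configured: (15s + 10s) * 2 = 50s.
// While the numberOfProbes bug is in effect: (15s + 10s) * 1 = 25s.
var probeSettleTime = (probeInterval + probeTimeout) * numberOfProbes
```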

Collaborator (Author)

Thanks @kimorris27 for the detailed reasoning; I agree with your analysis. To add to it: as pointed out in the comment, the rp-probe configuration is set here, and as the known-issues link suggests, the Azure product team is working on a fix. Once the bug is fixed, the waitForReadiness check might stop reflecting the real probe status if the timeout is kept at ~30 seconds.

Contributor

Yeah, good point. It may be worth leaving a comment in this code with a link to info about the bug. Or maybe there's a better way to document this.
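
For instance, the sleep could carry the reasoning and the links inline; the comment wording here is only a suggestion:

```go
// Wait for the load balancer probe to change the health status.
// The rp-lb probe runs every 15s with an up-to-10s timeout period, and
// numberOfProbes is configured as 2 but is effectively 1 due to a known
// Azure bug, so ~25s is the real worst case today:
// https://learn.microsoft.com/en-us/azure/load-balancer/whats-new#known-issues
// If the bug is fixed, the worst case becomes (15s + 10s) * 2 = 50s.
time.Sleep(30 * time.Second)
```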

pkg/deploy/predeploy.go: 2 review threads (outdated, resolved)
dem4gus previously approved these changes Jun 21, 2023

@dem4gus (Collaborator) left a comment


Small suggestion, PR looks good.

pkg/deploy/predeploy.go: 1 review thread (outdated, resolved)
kimorris27 previously approved these changes Jun 22, 2023

@petrkotas (Member) left a comment


Hi @rajdeepc2792, this is nice work! I have a small find that should improve the last check. Would you please look into it?

pkg/deploy/predeploy.go: 1 review thread (outdated, resolved)

@facchettos (Contributor) left a comment

No unit tests; you should add some.
The code needs some simplification, and I am not sure we want to restart all services concurrently (a sequential alternative is sketched below).
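
A sequential variant, sketched with a hypothetical per-instance restart helper: rolling one instance at a time and waiting out the probe window between instances keeps at least one healthy backend behind the load balancer throughout.

```go
package main

import (
	"context"
	"time"
)

// restartSequentially restarts the service on one VMSS instance at a time,
// waiting for the LB probe to re-mark each instance healthy before moving
// on. restart is a hypothetical stand-in for the real per-instance helper.
func restartSequentially(ctx context.Context, instanceIDs []string, restart func(context.Context, string) error) error {
	for _, id := range instanceIDs {
		if err := restart(ctx, id); err != nil {
			return err
		}
		// give the LB probe time to observe the restarted instance
		time.Sleep(30 * time.Second)
	}
	return nil
}
```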

pkg/deploy/predeploy.go: 5 review threads (outdated, resolved)
@cadenmarchese (Collaborator) left a comment

Looks good to me, but agreed it would be nice to have some tests before merge.

pkg/deploy/predeploy.go: 2 review threads (outdated, resolved)
@cadenmarchese added the chainsaw label (Pull requests or issues owned by Team Chainsaw) and removed the next-release label (To be included in the next RP release rollout) on Jul 13, 2023
@rajdeepc2792 force-pushed the fix-encryption-keys-rotation-mechanism branch from 59b7f43 to a1078f9 on July 18, 2023 18:30
pkg/deploy/predeploy_test.go: 1 review thread (outdated, resolved)
@cadenmarchese (Collaborator) left a comment

Thanks @rajdeepc2792 for all of your work on the tests. This test coverage goes above and beyond. I think some of the repetition could be reduced, and I would like others who are better at reviewing tests to take a look as well.

pkg/deploy/predeploy_test.go: 2 review threads (resolved)
@cadenmarchese added the next-release label (To be included in the next RP release rollout) on Jul 27, 2023
@cadenmarchese dismissed stale reviews from facchettos and petrkotas on July 28, 2023 15:01

Comments addressed

@cadenmarchese merged commit f6129d9 into Azure:master on Jul 28, 2023. 18 checks passed.
Labels: chainsaw (Pull requests or issues owned by Team Chainsaw), next-release (To be included in the next RP release rollout), ready-for-review