increase machine health check node unready timeout to 15m #3133
lgtm
The 15m change is due to customers having issues during upgrades, where nodes would be NotReady while they reboot and run rpm-ostree commands to upgrade to the latest RHCOS version.
MHC was killing worker nodes in that state, causing disruption during upgrades. This should prevent that. The trade-off is that genuinely NotReady nodes will take longer to be replaced.
We should confirm that the Node Not Ready monitor does not fire within this 15m timeframe, only a bit after it (30m or so).
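To make the timing argument concrete, here is a rough sketch of the reasoning, not the actual MHC or monitor implementation. It models MHC as remediating a node once it has been NotReady for longer than the timeout, and checks that the monitor threshold sits after that window. The function names and the 30m alert value are assumptions for illustration (30m is only the "or so" figure from the comment above), and the 10m reboot duration in the usage note is hypothetical.

```python
from datetime import timedelta

# Node unready timeout introduced by this PR.
NODE_UNREADY_TIMEOUT = timedelta(minutes=15)

# Assumed monitor threshold ("30m or so" from the review comment);
# not a confirmed setting.
NODE_NOT_READY_ALERT_AFTER = timedelta(minutes=30)


def mhc_would_remediate(not_ready_for: timedelta,
                        timeout: timedelta = NODE_UNREADY_TIMEOUT) -> bool:
    """Simplified model: remediate once NotReady lasts longer than the timeout."""
    return not_ready_for > timeout


def alert_fires_after_mhc(alert_after: timedelta = NODE_NOT_READY_ALERT_AFTER,
                          timeout: timedelta = NODE_UNREADY_TIMEOUT) -> bool:
    """True when the monitor threshold falls outside the MHC timeout window."""
    return alert_after > timeout
```

For example, a node that is NotReady for a hypothetical 10 minutes during an upgrade reboot gives `mhc_would_remediate(timedelta(minutes=10)) == False`, so it would be left alone, while a node stuck for 16 minutes would still be replaced.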
/azp run ci
Azure Pipelines successfully started running 1 pipeline(s).
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
* increase machine health check node unready timeout to 15m
* update mhc docs
* increase machine health check node startup timeout to 25m
Which issue this PR addresses:
Fixes https://issues.redhat.com/browse/ARO-4040
What this PR does / why we need it:
Test plan for issue:
Is there any documentation that needs to be updated for this PR?