CIF: OSProvisioningTimedOut #3677
Conversation
Force-pushed from ff64863 to 3ac6592
Force-pushed from 4d10d3f to 42df7bf
Force-pushed from bc94d23 to 6ded794
I'd suggest testing this change by applying the Hive ConfigMap to a dev/full-service Hive cluster and attempting to reproduce these installation failures.
pkg/hive/failure/reasons.go (outdated)

var AzureOSProvisioningTimedOut = InstallFailingReason{
	Name:    "AzureOSProvisioningTimedOut",
	Reason:  "AzureOSProvisioningTimedOut",
	Message: "OS Provisioning for VM, didn't finished in the allotted time. Please check provisioning state later.",
This error is halting and results in cluster failure. We should not be telling users to check the state later; it will not remediate or self-heal, so they will need to delete and retry.
Error message updated
pkg/hive/failure/reasons.go (outdated)

	Reason:  "AzureOSProvisioningTimedOut",
	Message: "OS Provisioning for VM, didn't finished in the allotted time. Please check provisioning state later.",
	SearchRegexes: []*regexp.Regexp{
		regexp.MustCompile(`"code\\":\W\\"OSProvisioningTimedOut\\"`),
Same feedback as in #3691 (review), we should ensure that this regex pattern matches against the real logs emitted by the installer in a Hive cluster.
Regex format updated; added three test cases covering multiple matching strings.
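As a quick offline sanity check before verifying against real installer logs in a Hive cluster, the pattern can be exercised against a hand-built log line in Go. Note the sample line below is an assumption for illustration, not a real installer log:

```go
package main

import (
	"fmt"
	"regexp"
)

// The regex from pkg/hive/failure/reasons.go: matches the ARM error code as it
// appears with escaped JSON quotes in the raw install log output.
var osProvisioningTimedOut = regexp.MustCompile(`"code\\":\W\\"OSProvisioningTimedOut\\"`)

// Hypothetical sample log line; the inner quotes are backslash-escaped as in
// the serialized JSON the installer emits.
const sampleLog = `level=error msg=failed to provision: {\"code\": \"OSProvisioningTimedOut\", \"message\": \"...\"}`

func main() {
	fmt.Println(osProvisioningTimedOut.MatchString(sampleLog)) // prints "true"
}
```

The `\W` in the pattern tolerates whatever single non-word character (space, tab) sits between the colon and the escaped value.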
Force-pushed from 6ded794 to 572493f
I like the code changes so far. We should add logic to the RP code so that this specific error is not returned to customers; it should likely result in a 500 error to the customer and create some kind of data point for SREs to action.
@@ -50,6 +50,16 @@ func HandleProvisionFailed(ctx context.Context, cd *hivev1.ClusterDeployment, co
		AzureInvalidTemplateDeployment.Message,
		*armError,
	)
case AzureOSProvisioningTimedOut.Reason:
While I think it's good that we now have a well-defined error for this case, we should also add code modifications in the RP to handle this specific error, because it should not be returned to customers. OS provisioning timeouts are rarely something customers can fix, but they may be something we wish to be alerted on.
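The suggested RP-side handling could look roughly like the sketch below: swallow the customer-unactionable reason behind a generic 500 while recording a data point for SREs. All names here (the metric sink, the error shape, the metric name) are illustrative assumptions, not the RP's actual API:

```go
package main

import (
	"fmt"
	"net/http"
)

// cloudError is a stand-in for whatever error shape the RP returns to callers.
type cloudError struct {
	StatusCode int
	Code       string
	Message    string
}

// emitMetric stands in for the RP's metrics sink (assumed name).
func emitMetric(name string) { fmt.Println("metric:", name) }

// classifyInstallFailure maps a Hive-detected failure reason to the error the
// customer sees. AzureOSProvisioningTimedOut is hidden behind a generic 500
// and surfaced to SREs via a metric instead.
func classifyInstallFailure(reason string) cloudError {
	switch reason {
	case "AzureOSProvisioningTimedOut":
		emitMetric("cluster.install.osprovisioningtimedout")
		return cloudError{
			StatusCode: http.StatusInternalServerError,
			Code:       "InternalServerError",
			Message:    "Internal server error.",
		}
	default:
		// Other well-defined reasons may be safe to forward verbatim.
		return cloudError{StatusCode: http.StatusBadRequest, Code: reason, Message: reason}
	}
}

func main() {
	fmt.Println(classifyInstallFailure("AzureOSProvisioningTimedOut").StatusCode) // prints "500"
}
```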
Another alternative feature we may choose to implement, with approval from BU and buy-in from SREs, is to extend the timeout duration of cluster installs for this specific case, if we find it results in a higher install success rate.

Granted, there are some rare cases where this error is terminal and caused by the customer (e.g. a black-hole route table, or a misconfigured NSG that blocks the image pull port). Anecdotally, though, I have also seen platform problems with OS image pulling in general for RHCOS in a region: either it requires a fix from us (someone forgot to publish the RHCOS image, and we see this error in Canary or FFINT), or in some cases we need to submit a ticket to another MSFT team because our images aren't being served properly. So at the very least, we need to somehow track this to catch systemic issues that may arise with our service. I would consider the domain of this error to be "shared", since both the customer's BYO VNet/subnets and our platform can fail in ways that cause it.
(Sorry @SudoBrendan, I shouldn't have pinged you for this one)
My previous review was accidentally dismissed - see comments above.
Please rebase the pull request.
Which issue this PR addresses:
ARO-5419
What this PR does / why we need it:
Following the pattern established in PR #3116 for handling Hive installation failures in the RP, this PR addresses a specific case (Azure OSProvisioningTimedOut) by forwarding the resulting errors to the requester. The sequence of patterns in the Hive ConfigMap ensures that more specific errors take precedence over general ones.
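The precedence the description relies on is first-match-wins over an ordered list of patterns: more specific regexes are listed before general ones, so the first one that matches the install log determines the reported reason. A minimal sketch, with illustrative names and patterns (not the repo's actual list):

```go
package main

import (
	"fmt"
	"regexp"
)

// installFailingReason pairs a reason name with the regex that detects it.
type installFailingReason struct {
	Name  string
	Regex *regexp.Regexp
}

// Ordered: the specific reason comes first, so it wins even though the
// broader pattern below would also match the same log line.
var orderedReasons = []installFailingReason{
	{Name: "AzureOSProvisioningTimedOut", Regex: regexp.MustCompile(`OSProvisioningTimedOut`)},
	{Name: "GenericProvisioningError", Regex: regexp.MustCompile(`ProvisioningError|Timed ?Out`)},
}

// classify returns the first reason whose regex matches the install log.
func classify(log string) string {
	for _, r := range orderedReasons {
		if r.Regex.MatchString(log) {
			return r.Name
		}
	}
	return "Unknown"
}

func main() {
	fmt.Println(classify(`"code": "OSProvisioningTimedOut"`)) // prints "AzureOSProvisioningTimedOut"
}
```

Because `Timed ?Out` in the general pattern would also match this log line, only the ordering keeps the specific reason in front.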
Test plan for issue:
With this code running on e.g. a local RP setup that installs via Hive, with the included additional-install-log-regexes ConfigMap applied in the Hive cluster:
- Create a policy that blocks e.g. creation of the bootstrap/master VMs created by the installer
- Attempt a cluster creation
Unit tests have been added for the specific Hive install failure cases implemented within the PR.
Is there any documentation that needs to be updated for this PR?
Hive Cluster Provision Pod Not Ready SOP (todo)
How do you know this will function as expected in production?