Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIF: OSProvisioningTimedOut #3677

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

rhamitarora
Copy link
Collaborator

@rhamitarora rhamitarora commented Jul 10, 2024

Which issue this PR addresses:

ARO-5419

What this PR does / why we need it:

Following the pattern established in the old PR #3116 for handling Hive installation failures in the RP, this PR addresses a specific case (Azure OSProvisioningTimedOut) by forwarding the policy errors to the requester. The sequence of patterns in the Hive ConfigMap ensures that specific errors take precedence over this general one.

Test plan for issue:

With this code running on e.g. a local RP setup that installs via Hive, with the included additional-install-log-regexes ConfigMap applied in the Hive cluster:

Create a policy that blocks e.g. creation of the bootstrap/master VMs created by the installer
Attempt a cluster creation
Unit tests have been added for the specific Hive install failure cases implemented within the PR.

Is there any documentation that needs to be updated for this PR?

Hive Cluster Provision Pod Not Ready SOP (todo)

How do you know this will function as expected in production?

@rhamitarora rhamitarora force-pushed the osprovisioningtimedout branch 3 times, most recently from ff64863 to 3ac6592 Compare July 12, 2024 16:27
@rhamitarora rhamitarora added ready-for-review go Pull requests that update Go code firefly Issues or Pull requests owned by Team Firefly labels Jul 12, 2024
@rhamitarora rhamitarora marked this pull request as ready for review July 12, 2024 16:28
@rhamitarora rhamitarora force-pushed the osprovisioningtimedout branch 3 times, most recently from 4d10d3f to 42df7bf Compare July 13, 2024 09:46
@rhamitarora rhamitarora force-pushed the osprovisioningtimedout branch 6 times, most recently from bc94d23 to 6ded794 Compare July 15, 2024 06:02
bitoku
bitoku previously approved these changes Jul 15, 2024
Copy link
Collaborator

@tsatam tsatam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest testing this change by applying the Hive ConfigMap to a dev/full-service Hive cluster and attempting to reproduce these installation failures.

var AzureOSProvisioningTimedOut = InstallFailingReason{
Name: "AzureOSProvisioningTimedOut",
Reason: "AzureOSProvisioningTimedOut",
Message: "OS Provisioning for VM, didn't finished in the allotted time. Please check provisioning state later.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is halting and results in cluster failure. We should not be telling users to check the state later, as it will not remediate and self-heal, they will need to delete and retry.

Copy link
Collaborator Author

@rhamitarora rhamitarora Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message updated

Reason: "AzureOSProvisioningTimedOut",
Message: "OS Provisioning for VM, didn't finished in the allotted time. Please check provisioning state later.",
SearchRegexes: []*regexp.Regexp{
regexp.MustCompile(`"code\\":\W\\"OSProvisioningTimedOut\\"`),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feedback as in #3691 (review), we should ensure that this regex pattern matches against the real logs emitted by the installer in a Hive cluster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex format updated, added 3 different test cases to match multiple regex strings.

Copy link
Collaborator

@SudoBrendan SudoBrendan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the code changes so far, we should add logic to the RP code to not return this specific error to customers - it likely should result in a 500 error to the customer, and create some kind of data point for SREs to action.

@@ -50,6 +50,16 @@ func HandleProvisionFailed(ctx context.Context, cd *hivev1.ClusterDeployment, co
AzureInvalidTemplateDeployment.Message,
*armError,
)
case AzureOSProvisioningTimedOut.Reason:
Copy link
Collaborator

@SudoBrendan SudoBrendan Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I think that it's good that we now have a well-defined error for this case, we should also add code modifications in the RP to handle this specific error, because it should not be returned to customers. OS provisioning timeouts are rarely something customers can fix, but may be something we wish to be alerted on.

Copy link
Collaborator

@SudoBrendan SudoBrendan Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative feature we may choose to implement with approval from BU and buy-in from SREs is to extend the timeout duration of cluster installs for this specific case if we find it results in a higher install success rate. Granted, there are some rare reasons this error may be terminal and caused by the customer (e.g. a black-hole route table or a misconfigured NSG that blocks the image pull port), but anecdotally, I have also seen platform problems with OS Image Pulling in general for RHCOS in a region that either requires a fix from us (someone forgot to publish the RHCOS image and we get this error in Canary or FFINT) or in some cases it means we need to submit a ticket to another MSFT team because our images aren't being served properly... so at the very least, we need to somehow track this to catch systemic issues that may arise with our service. I would consider the domain of this error to be "shared" since both the customer's BYO VNet/Subnets and our platform can fail in ways that cause this error.

@mociarain
Copy link
Collaborator

(Sorry @SudoBrendan, I shouldn't have pinged you for this one)

Copy link
Collaborator

@SudoBrendan SudoBrendan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my previous review was accidentally dismissed - see comments above.

@github-actions github-actions bot added needs-rebase branch needs a rebase and removed ready-for-review labels Jul 24, 2024
Copy link

Please rebase pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
firefly Issues or Pull requests owned by Team Firefly go Pull requests that update Go code needs-rebase branch needs a rebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants