CIF: OSProvisioningTimedOut #3677

Open
6 changes: 6 additions & 0 deletions hack/hive-config/hive-additional-install-log-regexes.yaml
@@ -12,6 +12,12 @@ data:
      name: AzureInvalidTemplateDeployment
      searchRegexStrings:
      - '"code":\w?"InvalidTemplateDeployment"'
    - installFailingMessage: OS Provisioning for VM didn't finish in the allotted time.
        Please see details for more information.
      installFailingReason: AzureOSProvisioningTimedOut
      name: AzureOSProvisioningTimedOut
      searchRegexStrings:
      - '"code\W*":\W*"OSProvisioningTimedOut\W*"'
kind: ConfigMap
metadata:
  creationTimestamp: null
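
The `\W*` runs in the new pattern matter because the nested ARM error arrives JSON-escaped inside a `message` string, so the install log contains `\"code\": \"OSProvisioningTimedOut\"` rather than plain quotes. The following is a minimal Go sketch of that behavior, not part of the PR: only the regular expression is taken from the diff, and `matchesTimeout` is a hypothetical helper name used for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as the searchRegexStrings entry added in this PR.
var osProvisioningTimedOut = regexp.MustCompile(`"code\W*":\W*"OSProvisioningTimedOut\W*"`)

// matchesTimeout (hypothetical helper) reports whether an install log
// contains the OSProvisioningTimedOut error code.
func matchesTimeout(installLog string) bool {
	return osProvisioningTimedOut.MatchString(installLog)
}

func main() {
	samples := []string{
		// Nested ARM error, JSON-escaped inside a "message" string.
		`\"details\": [\r\n {\r\n \"code\": \"OSProvisioningTimedOut\",\r\n \"message\": ...`,
		// The same error with the inner JSON left unescaped.
		`"details": [{"code":"OSProvisioningTimedOut","message":"..."}]`,
		// A different error code, which must not match.
		`{"code":"InvalidTemplateDeployment","message":"..."}`,
	}
	for _, s := range samples {
		fmt.Println(matchesTimeout(s))
	}
}
```

The first two samples match and the third does not, which mirrors the escaped and unescaped variants covered by the test cases in `reasons_test.go`.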
10 changes: 10 additions & 0 deletions pkg/hive/failure/handler.go
@@ -50,6 +50,16 @@ func HandleProvisionFailed(ctx context.Context, cd *hivev1.ClusterDeployment, co
			AzureInvalidTemplateDeployment.Message,
			*armError,
		)
	case AzureOSProvisioningTimedOut.Reason:
@SudoBrendan (Collaborator) commented on Jul 17, 2024

While I think it's good that we now have a well-defined error for this case, we should also add code modifications in the RP to handle this specific error, because it should not be returned to customers. OS provisioning timeouts are rarely something customers can fix, but they may be something we wish to be alerted on.

@SudoBrendan (Collaborator) commented on Jul 23, 2024

Another feature we may choose to implement, with approval from BU and buy-in from SREs, is to extend the timeout of cluster installs for this specific case if we find it results in a higher install success rate. Granted, there are some rare reasons this error may be terminal and caused by the customer (e.g. a black-hole route table, or a misconfigured NSG that blocks the image pull port). Anecdotally, though, I have also seen platform problems with OS image pulling in general for RHCOS in a region. Some require a fix from us (someone forgot to publish the RHCOS image and we get this error in Canary or FFINT); in other cases we need to submit a ticket to another MSFT team because our images aren't being served properly. So at the very least, we need to somehow track this to catch systemic issues that may arise with our service. I would consider the domain of this error to be "shared", since both the customer's BYO VNet/Subnets and our platform can fail in ways that cause this error.

		armError, err := parseDeploymentFailedJson(*installLog)
		if err != nil {
			return err
		}

		return wrapArmError(
			AzureOSProvisioningTimedOut.Message,
			*armError,
		)
	default:
		return genericErr
	}
10 changes: 10 additions & 0 deletions pkg/hive/failure/reasons.go
@@ -17,6 +17,7 @@ var Reasons = []InstallFailingReason{
	// priority over later ones.
	AzureRequestDisallowedByPolicy,
	AzureInvalidTemplateDeployment,
	AzureOSProvisioningTimedOut,
}

var AzureRequestDisallowedByPolicy = InstallFailingReason{
@@ -36,3 +37,12 @@ var AzureInvalidTemplateDeployment = InstallFailingReason{
		regexp.MustCompile(`"code":\w?"InvalidTemplateDeployment"`),
	},
}

var AzureOSProvisioningTimedOut = InstallFailingReason{
	Name:    "AzureOSProvisioningTimedOut",
	Reason:  "AzureOSProvisioningTimedOut",
	Message: "OS Provisioning for VM didn't finish in the allotted time. Please see details for more information.",
	SearchRegexes: []*regexp.Regexp{
		regexp.MustCompile(`"code\W*":\W*"OSProvisioningTimedOut\W*"`),
	},
}
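
The comment in `Reasons` says earlier entries take priority over later ones. A minimal Go sketch of that first-match behavior follows. It is an illustration under stated assumptions, not Hive's or the RP's implementation: the `reason` type and `firstMatch` function are hypothetical, and the `RequestDisallowedByPolicy` pattern is assumed by analogy because its definition is not visible in this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// reason is a trimmed-down, hypothetical stand-in for InstallFailingReason.
type reason struct {
	Name          string
	SearchRegexes []*regexp.Regexp
}

// reasons is ordered: earlier entries take priority over later ones.
var reasons = []reason{
	{
		Name: "AzureRequestDisallowedByPolicy",
		// Assumed pattern; the real one is not shown in this diff.
		SearchRegexes: []*regexp.Regexp{regexp.MustCompile(`"code":\w?"RequestDisallowedByPolicy"`)},
	},
	{
		Name:          "AzureInvalidTemplateDeployment",
		SearchRegexes: []*regexp.Regexp{regexp.MustCompile(`"code":\w?"InvalidTemplateDeployment"`)},
	},
	{
		Name:          "AzureOSProvisioningTimedOut",
		SearchRegexes: []*regexp.Regexp{regexp.MustCompile(`"code\W*":\W*"OSProvisioningTimedOut\W*"`)},
	},
}

// firstMatch returns the name of the first reason whose regexes match the
// install log, or false if none do.
func firstMatch(installLog string) (string, bool) {
	for _, r := range reasons {
		for _, re := range r.SearchRegexes {
			if re.MatchString(installLog) {
				return r.Name, true
			}
		}
	}
	return "", false
}

func main() {
	// Contains both InvalidTemplateDeployment and RequestDisallowedByPolicy;
	// the earlier list entry wins.
	log := `{"code":"InvalidTemplateDeployment","details":[{"code":"RequestDisallowedByPolicy"}]}`
	name, ok := firstMatch(log)
	fmt.Println(name, ok)
}
```

Appending `AzureOSProvisioningTimedOut` at the end of the list, as the PR does, means it only applies when neither of the earlier reasons matches the same log.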
62 changes: 62 additions & 0 deletions pkg/hive/failure/reasons_test.go
@@ -63,6 +63,68 @@ level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Insta
level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"InvalidTemplateDeployment","message":"The template deployment failed with multiple errors. Please see details for more information.","details":[{"additionalInfo":[],"code":"RequestDisallowedByPolicy","message":"Resource 'test-bootstrap' was disallowed by policy. Policy identifiers: ''.","target":"test-bootstrap"}]}`,
			want: AzureRequestDisallowedByPolicy,
		},
		{
			name: "OSProvisioningTimedOut-1",
			installLog: `Message: level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS) level=info msg=InstanceMetadata: running on AzurePublicCloud level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func1] level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func2]
level=info msg=resolving graph level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func3] level=info msg=checking if graph exists level=info msg=save graph Generates the Ignition Config asset level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS)
level=info msg=InstanceMetadata: running on AzurePublicCloud level=info msg=running step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] level=info msg=load persisted graph level=info msg=deploying resources template level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm]
encountered error: 400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,
"details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"OSProvisioningTimedOut\",\r\n \"message\": \"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time.
The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null} level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed.
Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"OSProvisioningTimedOut\",\r\n \"message\": \"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}`,
			want: AzureOSProvisioningTimedOut,
		},
		{
			name: "OSProvisioningTimedOut-2",
			installLog: `Message: level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS)
level=info msg=InstanceMetadata: running on AzurePublicCloud
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func1]
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func2]
level=info msg=resolving graph
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func3]
level=info msg=checking if graph exists
level=info msg=save graph Generates the Ignition Config asset
level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS)
level=info msg=InstanceMetadata: running on AzurePublicCloud
level=info msg=running step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] level=info msg=load persisted graph
level=info msg=deploying resources template
level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] encountered error: 400:
DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"OSProvisioningTimedOut\",\r\n \"message\": \"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time. The VM may still finish provisioning successfully.
Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}
level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"OSProvisioningTimedOut\",\r\n \"message\": \"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later.
For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}`,
			want: AzureOSProvisioningTimedOut,
		},
		{
			name: "OSProvisioningTimedOut-3",
			installLog: `Message: level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS)
level=info msg=InstanceMetadata: running on AzurePublicCloud
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func1]
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func2]
level=info msg=resolving graph
level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func3]
level=info msg=checking if graph exists
level=info msg=save graph Generates the Ignition Config asset
level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS)
level=info msg=InstanceMetadata: running on AzurePublicCloud
level=info msg=running step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] level=info msg=load persisted graph
level=info msg=deploying resources template
level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] encountered error: 400:
DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{"status":"Failed","error": {"code":"ResourceDeploymentFailure","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.","details":
[ {"code":"OSProvisioningTimedOut","message":"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time. The VM may still finish provisioning successfully.
Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle."}]}}"}],"innererror":null,"additionalInfo":null}
level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{"status":"Failed","error": {"code":"ResourceDeploymentFailure","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.","details":
[{"code":"OSProvisioningTimedOut","message":"OS Provisioning for VM 'aro-test-j57nv-master-2' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later.
For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle."}]}}"}],"innererror":null,"additionalInfo":null}`,
			want: AzureOSProvisioningTimedOut,
		},
	} {
		t.Run(tt.name, func(t *testing.T) {
			// This test uses a "mock" version of Hive's real implementation for matching install logs against regex patterns.