Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIF - ServerError: ZonalAllocationFailed #3691

Merged
merged 1 commit into from
Jul 24, 2024
Merged

Conversation

rhamitarora
Copy link
Collaborator

@rhamitarora rhamitarora commented Jul 15, 2024

Which issue this PR addresses:

ARO-4867

What this PR does / why we need it:

Following the pattern established in the old PR #3116 for handling Hive installation failures in the RP, this PR addresses a specific case (Azure ZonalAllocationFailed) by forwarding the policy errors to the requester. The sequence of patterns in the Hive ConfigMap ensures that specific errors take precedence over this general one.

Test plan for issue:

With this code running on e.g. a local RP setup that installs via Hive, with the included additional-install-log-regexes ConfigMap applied in the Hive cluster:

Create a policy that blocks e.g. creation of the bootstrap/master VMs created by the installer
Attempt a cluster creation
Unit tests have been added for the specific Hive install failure cases implemented within the PR.

Is there any documentation that needs to be updated for this PR?

Hive Cluster Provision Pod Not Ready SOP (todo)

How do you know this will function as expected in production?

@bitoku
Copy link
Collaborator

bitoku commented Jul 15, 2024

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Comment on lines 66 to 128
{
name: "ZonalAllocationFailed",
installLog: `Message: level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS) level=info msg=InstanceMetadata: running on AzurePublicCloud level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func1] level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func2] level=info msg=resolving graph level=info
msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func3] level=info msg=checking if graph exists level=info msg=save graph Generates the Ignition Config asset level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS) level=info msg=InstanceMetadata: running on AzurePublicCloud level=info
msg=running step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] level=info msg=load persisted graph level=info msg=deploying resources template level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm]
encountered error: 400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"ZonalAllocationFailed\",\r\n \"message\": \"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}
level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"ZonalAllocationFailed\",\r\n \"message\": \"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}`,
want: AzureZonalAllocationFailed,
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why the escaping rule is different from other tests.
Is either one wrong probably?

Copy link
Collaborator Author

@rhamitarora rhamitarora Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No other's are not wrong, the Installlog data is what we get from hive pod log I run test case on that raw data for more accuracy. This log data also attached in JIRA.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these need to be consistent. The logs will be emitted the same way by Hive regardless of error type. We should confirm that the same regex format applies to any permutation of the logs (whether they have explicit newlines or literal \n or \r\n within them, escaped quotes or not, etc)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple test cases added to confirm regex format with or without \n or \r\n

@rhamitarora rhamitarora requested a review from bitoku July 15, 2024 11:30
bitoku
bitoku previously approved these changes Jul 15, 2024
Copy link
Collaborator

@tsatam tsatam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good, we should validate that the regexes match against the log format we get from Hive.

I'd suggest testing this by applying the Hive ConfigMap to either a dev or full-service Hive cluster, and validating that the behavior is as expected.

Comment on lines 66 to 128
{
name: "ZonalAllocationFailed",
installLog: `Message: level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS) level=info msg=InstanceMetadata: running on AzurePublicCloud level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func1] level=info msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func2] level=info msg=resolving graph level=info
msg=running step [Action github.com/openshift/ARO-Installer/pkg/installer.(*manager).Manifests.func3] level=info msg=checking if graph exists level=info msg=save graph Generates the Ignition Config asset level=info msg=creating InstanceMetadata from Azure Instance Metadata Service (AIMS) level=info msg=InstanceMetadata: running on AzurePublicCloud level=info
msg=running step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm] level=info msg=load persisted graph level=info msg=deploying resources template level=error msg=step [AuthorizationRetryingAction github.com/openshift/ARO-Installer/pkg/installer.(*manager).deployResourceTemplate-fm]
encountered error: 400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"ZonalAllocationFailed\",\r\n \"message\": \"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}
level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : {"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","target":null,"details":
[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.\",\r\n \"details\":
[\r\n {\r\n \"code\": \"ZonalAllocationFailed\",\r\n \"message\": \"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\"\r\n }\r\n ]\r\n }\r\n}"}],"innererror":null,"additionalInfo":null}`,
want: AzureZonalAllocationFailed,
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these need to be consistent. The logs will be emitted the same way by Hive regardless of error type. We should confirm that the same regex format applies to any permutation of the logs (whether they have explicit newlines or literal \n or \r\n within them, escaped quotes or not, etc)

installFailingReason: AzureZonalAllocationFailed
name: AzureZonalAllocationFailed
searchRegexStrings:
- '"code\\":\W\\"ZonalAllocationFailed\\"'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feedback as on the unit test below - we should ensure this, as well as other search regex strings, are compatible with either "format" of escaped logs we get, rather than building an implementation that assumes just one way. Thankfully, we can pass multiple regex strings in for each failing reason here so if we need to just duplicate the regex strings instead of building a single pattern that matches both, we can.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex format updated, added 3 different test cases to match multiple regex strings.

Copy link
Collaborator

@mociarain mociarain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes

@mociarain
Copy link
Collaborator

/azp run ci

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mociarain mociarain merged commit eefe6e8 into master Jul 24, 2024
20 checks passed
@mociarain mociarain deleted the zonalallocationfailed branch July 24, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
firefly Issues or Pull requests owned by Team Firefly go Pull requests that update Go code ready-for-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants