Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Hive ProvisionFailed=True conditions within RP and pass errors in API response #3116

Merged

Conversation

tsatam
Copy link
Collaborator

@tsatam tsatam commented Aug 23, 2023

Which issue this PR addresses:

Part of ARO-3220

What this PR does / why we need it:

Implements a pattern for handling Hive install failures in the RP, and handles one specific case (Azure RequestDisallowedByPolicy) by passing the policy errors along to the requester.

This PR also handles the generic Azure "InvalidTemplateDeployment" error, as this error contains many potential underlying errors including the above RequestDisallowedByPolicy error. The intent here is to cascade these generic errors to customers, but also alert SREs that a more specific error case ought to be implemented and handled as well. The ordering of patterns in the Hive ConfigMap ensures that more specific errors will take precedence over this generic one.

Test plan for issue:

With this code running on e.g. a local RP setup that installs via Hive, with the included additional-install-log-regexes ConfigMap applied in the Hive cluster:

  • Create a policy that blocks e.g. creation of the bootstrap/master VMs created by the installer
  • Attempt a cluster creation

Unit tests have been added for the specific Hive install failure cases implemented within the PR.

Is there any documentation that needs to be updated for this PR?

  • Hive Cluster Provision Pod Not Ready SOP (todo)

Additional Notes

This PR is soft-dependent on a PR to our Hive deployment pipeline in order to add the additional-install-log-regexes ConfigMap to its deployment artifacts. An RP with this code can be deployed without that, but the functionality present here will not go into effect as Hive will report all failed installs as UnknownError until the ConfigMap is present.

Copy link
Collaborator

@gvanderpotte gvanderpotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
This is great fundations work for better handling of installation errors happening during hive controlled ocp installation, and will help with customer experience for a set of well identified scenarios

Copy link
Collaborator

@SudoBrendan SudoBrendan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! Very exciting, can't wait to see a merge on this. I had a few questions/nits. Thanks for your hard work!

Comment on lines +62 to +64
if path != "" {
return os.WriteFile(path, configmapRaw, 0666)
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we can always genhiveconfig.go | oc apply -f - is there a reason we'd need to write to FS, or can we remove this logic for simplicity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't plumbed this file through to our actual deployment pipelines. As of right now it's manually copied from this repository into the pipelines and applied statically alongside other static Hive resources.

name: AzureInvalidTemplateDeployment
searchRegexStrings:
- '"code":\w?"InvalidTemplateDeployment"'
kind: ConfigMap
Copy link
Collaborator

@SudoBrendan SudoBrendan Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok now I'm wondering why we need genhiveconfig.go at all? What's wrong with us keeping a static resource in source control somewhere we apply via GitOps/etc? I guess I don't quite see the value of that script given this resource's existence since it's 99% of the way there, though I'm sure there's a reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary benefit here is to get unit tests for these regex patterns, as well as to tie our RP-side implementation to the same strings used in the ConfigMap.

We could just maintain the CM directly, just like Hive itself does, though they also maintain unit tests that send that ConfigMap through their implementation and assert that the regexes do as they're expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where this is this going to be deployed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another point to consider - when we want to add a regex (perhaps for 423s on storage account api calls) we need to do a full RP release vs a config+hive release.

@@ -0,0 +1,96 @@
package failure
Copy link
Collaborator

@SudoBrendan SudoBrendan Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: I'm not sure I like the noun handler for this, since my assumption would be that this pkg is meant to implement golang's standard lib's net/http.Handler, but it might just be me :p

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, open to better name ideas for this "ThingThatTakesHiveProvisionFailed=TrueConditionsAndGeneratesAUsefulUserErrorFromThem"er (transformer? enricher?)

Comment on lines +43 to +52
case AzureInvalidTemplateDeployment.Reason:
armError, err := parseDeploymentFailedJson(*installLog)
if err != nil {
return err
}

return wrapArmError(
AzureInvalidTemplateDeployment.Message,
*armError,
)
Copy link
Collaborator

@SudoBrendan SudoBrendan Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the PR description to include this too? It looks like we're processing two cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this extra case and reasoning. Open to suggestions on changing how we handle this case as well (e.g. below comment)

}

func parseDeploymentFailedJson(installLog string) (*mgmtfeatures.ErrorResponse, error) {
regex := regexp.MustCompile(`level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : (\{.*\})`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to run my head through this... Is there ever a case where we might leak something like our credentials through an error like this to a customer? Should we perform sanitation of data like service principal IDs/tokens/passwords to be safe?

Copy link
Collaborator Author

@tsatam tsatam Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that the previous RP implementation (before Hive) would directly return this object to the customer, but I can validate exactly what was done then and ensure we do at least as thorough a job as the previous implementation did.

I'm also open to implementing specific transformations for specific sub-errors here, and not returning the generic error at all, and I think if we want to pursue that path, we should align other parts of the RP that handle RequestDisallowedByPolicy errors (e.g.

if detailedErr, ok := err.(autorest.DetailedError); ok {
if strings.Contains(detailedErr.Original.Error(), "RequestDisallowedByPolicy") {
return &api.CloudError{
StatusCode: http.StatusBadRequest,
CloudErrorBody: &api.CloudErrorBody{
Code: api.CloudErrorCodeRequestDisallowedByPolicy,
Message: fmt.Sprintf("Resource %s was disallowed by policy.",
subnetId[strings.LastIndex(subnetId, "/")+1:],
),
Details: []api.CloudErrorBody{
{
Code: api.CloudErrorCodeRequestDisallowedByPolicy,
Message: fmt.Sprintf("Policy definition : %s\nPolicy Assignment : %s",
regexp.MustCompile(`policyDefinitionName":"([^"]+)"`).FindStringSubmatch(detailedErr.Original.Error())[1],
regexp.MustCompile(`policyAssignmentName":"([^"]+)"`).FindStringSubmatch(detailedErr.Original.Error())[1],
),
},
},
},
}
}
}
) to return errors the same way.

pkg/hive/manager.go Outdated Show resolved Hide resolved
@tsatam tsatam force-pushed the ARO-3220/rp-handles-hive-provisionfailed branch from bd7e1a6 to 5a5cd93 Compare September 20, 2023 16:51
Copy link
Collaborator

@cadenmarchese cadenmarchese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this, returning some pre-defined common errors to customers will be a big upgrade. I just have a few comments going through this, but mostly this LGTM!

Comment on lines +16 to +17
// Order within this array determines precedence. Earlier entries will take
// priority over later ones.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment. Isn't this just establishing a list of errors that we range over? If not, how is priority decided?

"github.com/Azure/ARO-RP/pkg/api"
)

var genericErr = &api.CloudError{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth exporting this / other err definitions and use them in your tests. I see we redeclare this in manager_test.go. Thoughts?

Reason: failure.AzureRequestDisallowedByPolicy.Reason,
},
),
cp: makeClusterProvision(`level=info msg=running in local development mode
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the whole cluster provision log, or can we just specify the error that matches your regex?

@cadenmarchese cadenmarchese added the next-release To be included in the next RP release rollout label Oct 19, 2023
@@ -0,0 +1,38 @@
package failure
Copy link
Contributor

@s-amann s-amann Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have to have this file? it is essentially a duplicate of our configuration. I understand the benefit of testing the regex, but we are testing a copy of it by duplicating the yaml into a go file.

hack/hive-config/hive-additional-install-log-regexes.yaml

edit: I suppose since the file going to production isn't the one above either, this is less of an issue but the same regex are now in triplicate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The YAML is a generated artifact, which make generate will keep up-to-date based on the source-of-truth in this file.

As is, we would need to manually copy this YAML into our deployment pipeline repository, but I believe it would be possible to automate that as well.

Copy link
Contributor

@s-amann s-amann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I think we need to figure out where to store the regex so we can test, deploy, and update one copy, but this is a great step towards the goal of providing better errors.

@cadenmarchese cadenmarchese removed the next-release To be included in the next RP release rollout label Oct 25, 2023
Copy link
Member

@petrkotas petrkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I think this is a good start. LGTM

Copy link
Member

@petrkotas petrkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I think this is a good start. LGTM

@cadenmarchese cadenmarchese merged commit f421bac into Azure:master Nov 17, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants