-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle Hive ProvisionFailed=True conditions within RP and pass errors in API response #3116
Handle Hive ProvisionFailed=True conditions within RP and pass errors in API response #3116
Conversation
b8a7ec7
to
5e44eeb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
This is great fundations work for better handling of installation errors happening during hive controlled ocp installation, and will help with customer experience for a set of well identified scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR! Very exciting, can't wait to see a merge on this. I had a few questions/nits. Thanks for your hard work!
if path != "" { | ||
return os.WriteFile(path, configmapRaw, 0666) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we can always genhiveconfig.go | oc apply -f -
is there a reason we'd need to write to FS, or can we remove this logic for simplicity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't plumbed this file through to our actual deployment pipelines. As of right now it's manually copied from this repository into the pipelines and applied statically alongside other static Hive resources.
name: AzureInvalidTemplateDeployment | ||
searchRegexStrings: | ||
- '"code":\w?"InvalidTemplateDeployment"' | ||
kind: ConfigMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok now I'm wondering why we need genhiveconfig.go
at all? What's wrong with us keeping a static resource in source control somewhere we apply via GitOps/etc? I guess I don't quite see the value of that script given this resource's existence since it's 99% of the way there, though I'm sure there's a reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The primary benefit here is to get unit tests for these regex patterns, as well as to tie our RP-side implementation to the same strings used in the ConfigMap.
We could just maintain the CM directly, just like Hive itself does, though they also maintain unit tests that send that ConfigMap through their implementation and assert that the regexes do as they're expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where this is this going to be deployed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another point to consider - when we want to add a regex (perhaps for 423s on storage account api calls) we need to do a full RP release vs a config+hive release.
@@ -0,0 +1,96 @@ | |||
package failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit: I'm not sure I like the noun handler
for this, since my assumption would be that this pkg is meant to implement golang's standard lib's net/http.Handler, but it might just be me :p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, open to better name ideas for this "ThingThatTakesHiveProvisionFailed=TrueConditionsAndGeneratesAUsefulUserErrorFromThem"er (transformer
? enricher
?)
case AzureInvalidTemplateDeployment.Reason: | ||
armError, err := parseDeploymentFailedJson(*installLog) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
return wrapArmError( | ||
AzureInvalidTemplateDeployment.Message, | ||
*armError, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the PR description to include this too? It looks like we're processing two cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this extra case and reasoning. Open to suggestions on changing how we handle this case as well (e.g. below comment)
} | ||
|
||
func parseDeploymentFailedJson(installLog string) (*mgmtfeatures.ErrorResponse, error) { | ||
regex := regexp.MustCompile(`level=error msg=400: DeploymentFailed: : Deployment failed. Details: : : (\{.*\})`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to run my head through this... Is there ever a case where we might leak something like our credentials through an error like this to a customer? Should we perform sanitation of data like service principal IDs/tokens/passwords to be safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding was that the previous RP implementation (before Hive) would directly return this object to the customer, but I can validate exactly what was done then and ensure we do at least as thorough a job as the previous implementation did.
I'm also open to implementing specific transformations for specific sub-errors here, and not returning the generic error at all, and I think if we want to pursue that path, we should align other parts of the RP that handle RequestDisallowedByPolicy errors (e.g.
ARO-RP/pkg/cluster/deploybaseresources.go
Lines 252 to 273 in 3b370ae
if detailedErr, ok := err.(autorest.DetailedError); ok { | |
if strings.Contains(detailedErr.Original.Error(), "RequestDisallowedByPolicy") { | |
return &api.CloudError{ | |
StatusCode: http.StatusBadRequest, | |
CloudErrorBody: &api.CloudErrorBody{ | |
Code: api.CloudErrorCodeRequestDisallowedByPolicy, | |
Message: fmt.Sprintf("Resource %s was disallowed by policy.", | |
subnetId[strings.LastIndex(subnetId, "/")+1:], | |
), | |
Details: []api.CloudErrorBody{ | |
{ | |
Code: api.CloudErrorCodeRequestDisallowedByPolicy, | |
Message: fmt.Sprintf("Policy definition : %s\nPolicy Assignment : %s", | |
regexp.MustCompile(`policyDefinitionName":"([^"]+)"`).FindStringSubmatch(detailedErr.Original.Error())[1], | |
regexp.MustCompile(`policyAssignmentName":"([^"]+)"`).FindStringSubmatch(detailedErr.Original.Error())[1], | |
), | |
}, | |
}, | |
}, | |
} | |
} | |
} |
a4c3380
to
b7f819b
Compare
b7f819b
to
bd7e1a6
Compare
bd7e1a6
to
5a5cd93
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this, returning some pre-defined common errors to customers will be a big upgrade. I just have a few comments going through this, but mostly this LGTM!
// Order within this array determines precedence. Earlier entries will take | ||
// priority over later ones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this comment. Isn't this just establishing a list of errors that we range over? If not, how is priority decided?
"github.com/Azure/ARO-RP/pkg/api" | ||
) | ||
|
||
var genericErr = &api.CloudError{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth exporting this / other err definitions and use them in your tests. I see we redeclare this in manager_test.go
. Thoughts?
Reason: failure.AzureRequestDisallowedByPolicy.Reason, | ||
}, | ||
), | ||
cp: makeClusterProvision(`level=info msg=running in local development mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the whole cluster provision log, or can we just specify the error that matches your regex?
@@ -0,0 +1,38 @@ | |||
package failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have to have this file? it is essentially a duplicate of our configuration. I understand the benefit of testing the regex, but we are testing a copy of it by duplicating the yaml into a go file.
hack/hive-config/hive-additional-install-log-regexes.yaml
edit: I suppose since the file going to production isn't the one above either, this is less of an issue but the same regex are now in triplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The YAML is a generated artifact, which make generate
will keep up-to-date based on the source-of-truth in this file.
As is, we would need to manually copy this YAML into our deployment pipeline repository, but I believe it would be possible to automate that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, I think we need to figure out where to store the regex so we can test, deploy, and update one copy, but this is a great step towards the goal of providing better errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I think this is a good start. LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I think this is a good start. LGTM
Which issue this PR addresses:
Part of ARO-3220
What this PR does / why we need it:
Implements a pattern for handling Hive install failures in the RP, and handles one specific case (Azure RequestDisallowedByPolicy) by passing the policy errors along to the requester.
This PR also handles the generic Azure "InvalidTemplateDeployment" error, as this error contains many potential underlying errors including the above RequestDisallowedByPolicy error. The intent here is to cascade these generic errors to customers, but also alert SREs that a more specific error case ought to be implemented and handled as well. The ordering of patterns in the Hive ConfigMap ensures that more specific errors will take precedence over this generic one.
Test plan for issue:
With this code running on e.g. a local RP setup that installs via Hive, with the included
additional-install-log-regexes
ConfigMap applied in the Hive cluster:Unit tests have been added for the specific Hive install failure cases implemented within the PR.
Is there any documentation that needs to be updated for this PR?
Additional Notes
This PR is soft-dependent on a PR to our Hive deployment pipeline in order to add the additional-install-log-regexes ConfigMap to its deployment artifacts. An RP with this code can be deployed without that, but the functionality present here will not go into effect as Hive will report all failed installs as
UnknownError
until the ConfigMap is present.