node-group-auto-discovery support for oci #7403
Conversation
Hi @gvnc. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
cluster-autoscaler/FAQ.md (Outdated)
| `debugging-snapshot-enabled` | Whether the debugging snapshot of cluster autoscaler feature is enabled. | false |
| `node-delete-delay-after-taint` | How long to wait before deleting a node after tainting it. | 5 seconds |
| `enable-provisioning-requests` | Whether the clusterautoscaler will be handling the ProvisioningRequest CRs. | false |
| Parameter | Description | Default |
I don't think we want to edit/reformat this file, since it is outside the provider directory.
I reverted this file to its initial state. Instead, I updated the README under the oci folder.
@@ -153,8 +153,8 @@ func BuildOCI(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover
 	if err != nil {
 		klog.Fatalf("Failed to get pool type: %v", err)
 	}
-	if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) {
+	if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || opts.NodeGroupAutoDiscovery != nil {
 		manager, err := nodepools.CreateNodePoolManager(opts.CloudConfig, do, createKubeClient(opts))
We have two implementations of this provider based on whether `ocid1.nodepool...` or `ocid1.instancepool...` resources were specified via the `--nodes` param.
We want to give ourselves the option of supporting auto-discovery for both the `nodepool` and `instancepool` implementations, which means we need to be able to differentiate between the two in the `--node-group-auto-discovery` format.
Other cloud providers have added a label in the auto-discovery string to differentiate between different scaling group types, e.g. AWS => `asg`, GCE => `mig`, etc.
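For reference, those providers' label formats look roughly like this (drawn from the upstream cluster-autoscaler docs; exact syntax may vary by version):
```
# AWS: the asg label selects the auto-scaling-group implementation
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
# GCE: the mig label selects the managed-instance-group implementation
--node-group-auto-discovery=mig:namePrefix=<prefix>,min=<min>,max=<max>
```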
Maybe `clusterId` and `nodepoolTags` are already sufficient to clue us in that the implementation is OKE / nodepools. If that's the case, it's better to explicitly check for that here rather than assuming, e.g.
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || hasNodeGroupAutoDiscovery() {
Later, `instancepool` could follow the pattern and do something like this:
else if strings.HasPrefix(ocidType, ipconsts.OciInstancePoolResourceIdent) || hasInstancePoolAutoDiscovery() {
Thanks for highlighting this. I was unaware of the instancepool part and was more focused on nodepools; I will come up with a fix for the instancepool.
To clarify, we don't have to actually implement auto-discovery for instance-pools in this PR. We just want to make sure to account for each implementation, since `hasNodeGroupAutoDiscovery()` could be true with either.
I've added the suggested validation. Even though this change doesn't include the implementation for instancepools at the moment, I assumed there would be a parameter called `instancepoolTags`. The validation method then checks that either `nodepoolTags` or `instancepoolTags` was used in `nodeGroupAutoDiscovery`, but not both at the same time.
_, nodepoolTagsFound, err := ocicommon.HasNodeGroupTags(opts.NodeGroupAutoDiscovery)
if err != nil {
	klog.Fatalf("Failed to get auto discovery tags: %v", err)
}
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) && nodepoolTagsFound {
	klog.Fatalf("-nodes and -node-group-auto-discovery parameters can not be used together.")
} else if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || nodepoolTagsFound {
	// return oci cloud provider
}
Since I see the comments below regarding instancepool, I didn't add an if statement for it.
// theoretically the only other possible value is no value (if no node groups are passed in)
// or instancepool, but either way, we'll just default to the instance pool implementation
		return false, reqErr
	}
	for _, nodePoolSummary := range resp.Items {
		klog.V(5).Infof("found nodepool %v", nodePoolSummary)
The `NodePoolSummary` contains semi-sensitive fields that we probably shouldn't log unless we have a reason to.
Also, it might be confusing to log `found nodepool ...`, since at this point in the code we don't know whether it has the tags we require.
I removed the log line.
	}
	for _, nodePoolSummary := range resp.Items {
		klog.V(5).Infof("found nodepool %v", nodePoolSummary)
		if validateNodepoolTags(nodeGroup.tags, nodePoolSummary.FreeformTags, nodePoolSummary.DefinedTags) {
There are a few types of tags, including defined tags and free-form tags.
As I understand it, user-defined tags on a Node Pool resource would appear in the free-form tags (i.e. `nodePoolSummary.FreeformTags`, not `nodePoolSummary.DefinedTags`). Is there a reason we're not checking all the tag namespaces for a match?
It is not only Freeform tags. Users can also create their own namespace and defined tags. We check both of them to make sure we don't miss a tag applied by the user.
A Defined tag holds a namespace, but a Freeform tag does not:
- Defined tag: namespace.tagKey=tagValue
- Freeform tag: tagKey=tagValue
When we query the Nodepool through the API, the response returns them in separate fields:
- FreeformTags is a map[string]string
- DefinedTags is a map[string]map[string]string
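A minimal sketch of the matching logic described above, assuming those map shapes and that defined-tag keys arrive as `namespace.tagKey` (the actual `validateNodepoolTags` in this PR may differ in its details):
```go
package nodepools

import "strings"

// validateNodepoolTags reports whether every tag requested in the
// auto-discovery spec is present on the node pool, checking freeform
// tags (tagKey=tagValue) and defined tags (namespace.tagKey=tagValue).
func validateNodepoolTags(nodeGroupTags, freeformTags map[string]string, definedTags map[string]map[string]string) bool {
	for key, value := range nodeGroupTags {
		if ns, tagKey, found := strings.Cut(key, "."); found {
			// Defined tag: resolve the namespace map first; a missing
			// namespace yields a nil inner map, and indexing it returns "".
			if definedTags[ns][tagKey] != value {
				return false
			}
		} else if freeformTags[key] != value {
			// Freeform tag: flat key/value lookup.
			return false
		}
	}
	return true
}
```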
I see.
	manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
	autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
}
It seems like node-pools that were explicitly configured via `--nodes` should be added before (and take precedence over) node-pools that were discovered via `--node-group-auto-discovery`.
Do you agree? That also raises the question of the expected behavior of, say, the `max` or `min` node setting when a pool is specified via `--nodes=2:5:ocid1.nodepool.oc1.np-a` and also discovered via `--node-group-auto-discovery=clusterId:ocid1.cluster.oc1.c-1,compartmentId:ocid1.compartment.oc1..c1,nodepoolTags:cluster-autoscaler-also/enabled=true,min:0,max:10`?
`nodeGroupAutoDiscovery` actually overrides the `nodes` parameter with this implementation, which means the `nodes` parameter is ignored if `nodeGroupAutoDiscovery` is provided.
What I can think of as a solution:
- We could force the user to provide only one of them in the config, so the CA would fail on startup if both parameters were provided, and we would also log an error line stating the reason for the end-user to see. The end-user should then fix the configuration by removing one of them.
- If we want both parameters to work together, we need to decide which one has higher priority over the other. I would say the `nodes` parameter should override the `nodeGroupAutoDiscovery` min/max values.
Please let me know your thoughts and I will proceed accordingly.
The convention seems to be to warn against using it in the docs [1,2], and/or disallow it [1] in the code.
I'm fine with either documenting it and/or erroring out. As you mentioned, the code currently overrides any static node-pools quietly while also logging messages as it processes each static node-pool, which could cause confusion.
I added extra checks to prevent using both parameters together, and also documented it in oci/README.
I raised a few issues that need to be resolved.
Additionally, an update to `oci/README.md` should also be part of this change, documenting the expected format of the discovery string `clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>`, clarifying which types of tags are expected on the node pool (i.e. free-form, Oracle-Recommended-Tags, or OracleInternalReserved), and adding any other information the user needs or that would be helpful to them.
OK. Changes look good to me. How about you, @trungng92?
	nodeGroup.kubeClient = kubeClient

	manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
	autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
Given that the auto discovery happens in `CreateNodePoolManager`, I assume that auto discovery only happens during startup. Is that right?
Never mind; below, in the `forceRefresh` function, we also call `autoDiscoverNodeGroups`.
if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == true {
	t.Errorf("validateNodepoolTags shouldn't return true for not matching tags")
}
Any reason not to do something like a traditional table-driven test like this?
https://go.dev/wiki/TableDrivenTests
My main concern with the current test is that if someone adds a new test case in the middle of this, it will mess up every test that comes after it.
I refactored this test to meet the table-driven test requirements. It now runs with test cases given in a map.
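A minimal sketch of that map-based layout (the case names, and the `validateNodepoolTags` signature it assumes, are illustrative rather than the exact code from the PR):
```go
package nodepools

import "testing"

func TestValidateNodepoolTags(t *testing.T) {
	// Keyed by description, so cases can be added or reordered freely
	// without renumbering anything that comes after them.
	testCases := map[string]struct {
		nodeGroupTags map[string]string
		freeformTags  map[string]string
		definedTags   map[string]map[string]string
		want          bool
	}{
		"matching freeform tag": {
			nodeGroupTags: map[string]string{"ca-managed": "true"},
			freeformTags:  map[string]string{"ca-managed": "true"},
			want:          true,
		},
		"matching defined tag": {
			nodeGroupTags: map[string]string{"namespace.foo": "bar"},
			definedTags:   map[string]map[string]string{"namespace": {"foo": "bar"}},
			want:          true,
		},
		"missing tag": {
			nodeGroupTags: map[string]string{"ca-managed": "true"},
			want:          false,
		},
	}
	for name, tc := range testCases {
		t.Run(name, func(t *testing.T) {
			if got := validateNodepoolTags(tc.nodeGroupTags, tc.freeformTags, tc.definedTags); got != tc.want {
				t.Errorf("expected %v, got %v", tc.want, got)
			}
		})
	}
}
```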
@@ -384,3 +388,70 @@ func TestRemoveInstance(t *testing.T) {
		}
	}
}

func TestNodeGroupFromArg(t *testing.T) {
	var nodeGroupArg = "clusterId:testClusterId,compartmentId:testCompartmentId,nodepoolTags:ca-managed=true&namespace.foo=bar,min:1,max:5"
Minor nit, but can we update the IDs to look like real OCIDs, just in case there are weird parsing bugs? For example:
ocid1.cluster.oc1.test-region.test
ocid1.compartment.oc1.test-region.test
done.
parts := strings.Split(pair, "=")
if len(parts) == 2 {
	spec.tags[parts[0]] = parts[1]
}
Should we be returning an error if the length is not 2? Or is that a valid use case? Right now we will just silently continue on.
Yes, this would actually be a formatting error; I've added an else statement to fix it.
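Presumably something along these lines (the names `spec` and `pair` come from the hunk above; the exact error message is illustrative):
```go
// Reject malformed entries instead of silently skipping them: every
// entry in nodepoolTags must be a tagKey=tagValue pair.
parts := strings.Split(pair, "=")
if len(parts) == 2 {
	spec.tags[parts[0]] = parts[1]
} else {
	return nil, fmt.Errorf("invalid tag %q, expected tagKey=tagValue format", pair)
}
```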
	return nil, fmt.Errorf("failed to set %s size: %s, expected integer", max, parametersMap[max])
}

if parametersMap[nodepoolTags] != "" {
Is it valid not to specify node pool tags? Or are they required? This will silently continue on if there are no nodePoolTags specified.
I made nodepoolTags optional at the beginning, but then I noticed we may also need to support instancepoolTags, after @jlamillan's feedback.
In the final state, nodepool tags are not optional, as I already perform a validation to decide whether an instancepoolmanager or a nodepoolmanager should be initiated.
In short, no, it's not optional, and I added a failure case to address your feedback.
### Node Group Auto Discovery
`--node-group-auto-discovery` can be given in the pattern below. It will discover the nodepools under the given compartment by matching the nodepool tags (whether they are Freeform or Defined tags):
```
clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>
```
Are all of the fields required, or are any optional? Can we specify that in the comment string above?
All of the fields are mandatory. I've added a statement to the README to make that clear.
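For illustration, a fully-populated discovery string following that pattern could look like this (the OCIDs and tag values below are placeholders):
```
clusterId:ocid1.cluster.oc1.test-region.test,compartmentId:ocid1.compartment.oc1.test-region.test,nodepoolTags:ca-managed=true&namespace.foo=bar,min:1,max:5
```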
Okay, new changes look good to me.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gvnc, jlamillan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR provides OCI support for the node-group-auto-discovery parameter. The Cluster Autoscaler will look for nodepools in the given compartment and match the nodepool tags. If the tags match, the nodepool will be used for autoscaling; if they do not, the nodepool will be ignored.
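As a sketch of how this would be wired up (placeholder values, not the exact configuration used in testing), the autoscaler could be launched with:
```
./cluster-autoscaler \
  --cloud-provider=oci \
  --node-group-auto-discovery=clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:ca-managed=true,min:1,max:5
```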
Which issue(s) this PR fixes:
Special notes for your reviewer:
This functionality was tested on OCI. Please find extended logs from the test.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: