node-group-auto-discovery support for oci #7403
Conversation
Hi @gvnc. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
cluster-autoscaler/FAQ.md (Outdated)
| `debugging-snapshot-enabled` | Whether the debugging snapshot of cluster autoscaler feature is enabled. | false |
| `node-delete-delay-after-taint` | How long to wait before deleting a node after tainting it. | 5 seconds |
| `enable-provisioning-requests` | Whether the clusterautoscaler will be handling the ProvisioningRequest CRs. | false |
| Parameter | Description | Default |
I don't think we want to edit/reformat this file, since it is outside the provider directory.
I reverted this file to its initial state. Instead, I updated the README under the oci folder.
@@ -153,8 +153,8 @@ func BuildOCI(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover
 	if err != nil {
 		klog.Fatalf("Failed to get pool type: %v", err)
 	}
-	if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) {
+	if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || opts.NodeGroupAutoDiscovery != nil {
 		manager, err := nodepools.CreateNodePoolManager(opts.CloudConfig, do, createKubeClient(opts))
We have two implementations of this provider based on whether `ocid1.nodepool...` or `ocid1.instancepool...` resources were specified via the `--nodes` param.
We want to give ourselves the option of supporting auto-discovery for both the `nodepool` and `instancepool` implementations, which means we need to be able to differentiate between the two in the `--node-group-auto-discovery` format.
Other cloud providers have added a label in the auto-discovery string to differentiate between different scaling group types, e.g. AWS => `asg`, GCE => `mig`, etc.
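For reference, those providers' label formats look roughly like this (drawn from the upstream cluster-autoscaler docs; exact syntax may vary by version):
```
# AWS: the asg label selects the auto-scaling-group implementation
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
# GCE: the mig label selects the managed-instance-group implementation
--node-group-auto-discovery=mig:namePrefix=<prefix>,min=<min>,max=<max>
```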
Maybe `clusterId` and `nodepoolTags` are already sufficient to clue us in that the implementation is OKE / nodepools. If that's the case, it's better to explicitly check for that here rather than assuming, e.g.
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || hasNodeGroupAutoDiscovery() {
Later, `instancepool` could follow the pattern and do something like this:
else if strings.HasPrefix(ocidType, ipconsts.OciInstancePoolResourceIdent) || hasInstancePoolAutoDiscovery() {
Thanks for highlighting this. I was unaware of the instancepool part and was more focused on nodepools; I will come up with a fix for the instancepool.
To clarify, we don't have to actually implement auto-discovery for instance-pools in this PR. We just want to make sure to account for each implementation, since `hasNodeGroupAutoDiscovery()` could be true with either.
I've added the suggested validation. Even though this change doesn't include the implementation for instancepools at the moment, I assumed there would be a parameter called `instancepoolTags`. The validation method then checks that either `nodepoolTags` or `instancepoolTags` was used in `nodeGroupAutoDiscovery`, but not both at the same time.
_, nodepoolTagsFound, err := ocicommon.HasNodeGroupTags(opts.NodeGroupAutoDiscovery)
if err != nil {
	klog.Fatalf("Failed to get auto discovery tags: %v", err)
}
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) && nodepoolTagsFound {
	klog.Fatalf("-nodes and -node-group-auto-discovery parameters can not be used together.")
} else if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || nodepoolTagsFound {
	// return oci cloud provider
}
Since I see the comments below regarding instancepool, I didn't add an if statement for it.
// theoretically the only other possible value is no value (if no node groups are passed in)
// or instancepool, but either way, we'll just default to the instance pool implementation
		return false, reqErr
	}
	for _, nodePoolSummary := range resp.Items {
		klog.V(5).Infof("found nodepool %v", nodePoolSummary)
The `NodePoolSummary` contains semi-sensitive fields that we probably shouldn't log unless we have a reason to.
Also, it might be confusing to log `found nodepool ...`, since at this point in the code we don't know whether it has the tags we require.
I removed the log line.
	}
	for _, nodePoolSummary := range resp.Items {
		klog.V(5).Infof("found nodepool %v", nodePoolSummary)
		if validateNodepoolTags(nodeGroup.tags, nodePoolSummary.FreeformTags, nodePoolSummary.DefinedTags) {
There are a few types of tags, including defined tags and free-form tags.
As I understand it, user-defined tags on a Node Pool resource would appear in the free-form tags (i.e. `nodePoolSummary.FreeformTags`, not `nodePoolSummary.DefinedTags`). Is there a reason we're not checking all the tag namespaces for a match?
It is not only Freeform tags. Users can also create their own namespace and defined tags. We check both of them to make sure we don't miss a tag applied by the user.
A Defined tag holds a namespace, but a Freeform tag does not:
- Defined tag: namespace.tagKey=tagValue
- Freeform tag: tagKey=tagValue
When we query the Nodepool through the API, the response returns them in separate fields:
- FreeformTags is a map[string]string
- DefinedTags is a map[string]map[string]string
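A minimal sketch of the matching logic described above, assuming those map shapes and that defined-tag keys arrive as `namespace.tagKey` (the actual `validateNodepoolTags` in this PR may differ in its details):
```go
package nodepools

import "strings"

// validateNodepoolTags reports whether every tag requested in the
// auto-discovery spec is present on the node pool, checking freeform
// tags (tagKey=tagValue) and defined tags (namespace.tagKey=tagValue).
func validateNodepoolTags(nodeGroupTags, freeformTags map[string]string, definedTags map[string]map[string]string) bool {
	for key, value := range nodeGroupTags {
		if ns, tagKey, found := strings.Cut(key, "."); found {
			// Defined tag: resolve the namespace map first; a missing
			// namespace yields a nil inner map, and indexing it returns "".
			if definedTags[ns][tagKey] != value {
				return false
			}
		} else if freeformTags[key] != value {
			// Freeform tag: flat key/value lookup.
			return false
		}
	}
	return true
}
```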
I see.
	manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
	autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
}
It seems like node-pools that were explicitly configured via `--nodes` should be added before (and take precedence over) node-pools that were discovered via `--node-group-auto-discovery`.
Do you agree? That also raises the question of the expected behavior of, say, the `max` or `min` node setting when a pool is specified via `--nodes=2:5:ocid1.nodepool.oc1.np-a` and also discovered via `--node-group-auto-discovery=clusterId:ocid1.cluster.oc1.c-1,compartmentId:ocid1.compartment.oc1..c1,nodepoolTags:cluster-autoscaler-also/enabled=true,min:0,max:10`?
`nodeGroupAutoDiscovery` actually overrides the `nodes` parameter with this implementation, which means the `nodes` parameter is ignored if `nodeGroupAutoDiscovery` is provided.
What I can think of as a solution:
- We could force the user to provide only one of them in the config, so the CA would fail on startup if both parameters were provided, and we would also log an error line stating the reason for the end-user to see. The end-user should then fix the configuration by removing one of them.
- If we want both parameters to work together, we need to decide which one has higher priority over the other. I would say the `nodes` parameter should override the `nodeGroupAutoDiscovery` min/max values.
Please let me know your thoughts and I will proceed accordingly.
The convention seems to be to warn against using it in the docs [1,2], and/or disallow it [1] in the code.
I'm fine with either documenting it and/or erroring out. As you mentioned, the code currently overrides any static node-pools quietly while also logging messages as it processes each static node-pool, which could cause confusion.
I added extra checks to prevent using both parameters together, and also documented it in oci/README.
I raised a few issues that need to be resolved.
Additionally, an update to `oci/README.md` should also be part of this change, documenting the expected format of the discovery string `clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>`, clarifying which types of tags are expected on the node pool (i.e. free-form, Oracle-Recommended-Tags, or OracleInternalReserved), and adding any other information the user needs or that would be helpful to them.
OK. Changes look good to me. How about you, @trungng92?
	nodeGroup.kubeClient = kubeClient

	manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
	autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
Given that the auto discovery happens in `CreateNodePoolManager`, I assume that auto discovery only happens during startup. Is that right?
Never mind; below, in the `forceRefresh` function, we also call `autoDiscoverNodeGroups`.
if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == true {
	t.Errorf("validateNodepoolTags shouldn't return true for not matching tags")
}
Any reason not to do something like a traditional table-driven test like this?
https://go.dev/wiki/TableDrivenTests
My main concern with the current test is that if someone adds a new test case in the middle of this, it will mess up every test that comes after it.
I refactored this test to meet the table-driven test requirements. It now runs with test cases given in a map.
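A minimal sketch of that map-based layout (the case names, and the `validateNodepoolTags` signature it assumes, are illustrative rather than the exact code from the PR):
```go
package nodepools

import "testing"

func TestValidateNodepoolTags(t *testing.T) {
	// Keyed by description, so cases can be added or reordered freely
	// without renumbering anything that comes after them.
	testCases := map[string]struct {
		nodeGroupTags map[string]string
		freeformTags  map[string]string
		definedTags   map[string]map[string]string
		want          bool
	}{
		"matching freeform tag": {
			nodeGroupTags: map[string]string{"ca-managed": "true"},
			freeformTags:  map[string]string{"ca-managed": "true"},
			want:          true,
		},
		"matching defined tag": {
			nodeGroupTags: map[string]string{"namespace.foo": "bar"},
			definedTags:   map[string]map[string]string{"namespace": {"foo": "bar"}},
			want:          true,
		},
		"missing tag": {
			nodeGroupTags: map[string]string{"ca-managed": "true"},
			want:          false,
		},
	}
	for name, tc := range testCases {
		t.Run(name, func(t *testing.T) {
			if got := validateNodepoolTags(tc.nodeGroupTags, tc.freeformTags, tc.definedTags); got != tc.want {
				t.Errorf("expected %v, got %v", tc.want, got)
			}
		})
	}
}
```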
@@ -384,3 +388,70 @@ func TestRemoveInstance(t *testing.T) {
		}
	}
}

func TestNodeGroupFromArg(t *testing.T) {
	var nodeGroupArg = "clusterId:testClusterId,compartmentId:testCompartmentId,nodepoolTags:ca-managed=true&namespace.foo=bar,min:1,max:5"
Minor nit, but can we update the IDs to look like real OCIDs, just in case there are weird parsing bugs? For example:
ocid1.cluster.oc1.test-region.test
ocid1.compartment.oc1.test-region.test
done.
parts := strings.Split(pair, "=")
if len(parts) == 2 {
	spec.tags[parts[0]] = parts[1]
}
Should we be returning an error if the length is not 2? Or is that a valid use case? Right now we will just silently continue on.
Yes, this would actually be a formatting error; I've added an else statement to fix it.
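Presumably something along these lines (the names `spec` and `pair` come from the hunk above; the exact error message is illustrative):
```go
// Reject malformed entries instead of silently skipping them: every
// entry in nodepoolTags must be a tagKey=tagValue pair.
parts := strings.Split(pair, "=")
if len(parts) == 2 {
	spec.tags[parts[0]] = parts[1]
} else {
	return nil, fmt.Errorf("invalid tag %q, expected tagKey=tagValue format", pair)
}
```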
	return nil, fmt.Errorf("failed to set %s size: %s, expected integer", max, parametersMap[max])
}

if parametersMap[nodepoolTags] != "" {
Is it valid not to specify node pool tags? Or are they required? This will silently continue on if there are no nodePoolTags specified.
I made nodepoolTags optional at the beginning, but then I noticed we may also need to support instancepoolTags, after @jlamillan's feedback.
In the final state, nodepool tags are not optional, as I already perform a validation to decide whether an instancepoolmanager or a nodepoolmanager should be initiated.
In short, no, it's not optional, and I added a failure case to address your feedback.
### Node Group Auto Discovery
`--node-group-auto-discovery` can be given in the pattern below. It will discover the nodepools under the given compartment by matching the nodepool tags (whether they are Freeform or Defined tags):
```
clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>
```
Are all of the fields required, or are any optional? Can we specify that in the comment string above?
All of the fields are mandatory. I've added a statement to the README to make that clear.
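For illustration, a fully-populated discovery string following that pattern could look like this (the OCIDs and tag values below are placeholders):
```
clusterId:ocid1.cluster.oc1.test-region.test,compartmentId:ocid1.compartment.oc1.test-region.test,nodepoolTags:ca-managed=true&namespace.foo=bar,min:1,max:5
```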
Okay, new changes look good to me.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gvnc, jlamillan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR provides OCI support for the node-group-auto-discovery parameter. The Cluster Autoscaler will look for nodepools in the given compartment and match the nodepool tags. If the tags match, the nodepool will be used for autoscaling; if they do not, the nodepool will be ignored.
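As a sketch of how this would be wired up (placeholder values, not the exact configuration used in testing), the autoscaler could be launched with:
```
./cluster-autoscaler \
  --cloud-provider=oci \
  --node-group-auto-discovery=clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:ca-managed=true,min:1,max:5
```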
Which issue(s) this PR fixes:
Special notes for your reviewer:
This functionality was tested on OCI. Please find extended logs from the test.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: