Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for persistent node labels for node stop/restart #222

Open
Mossaka opened this issue Sep 17, 2024 · 3 comments
Open

Support for persistent node labels for node stop/restart #222

Mossaka opened this issue Sep 17, 2024 · 3 comments
Assignees

Comments

@Mossaka
Copy link
Member

Mossaka commented Sep 17, 2024

Context: In AKS clusters, when the cluster is stopped and restarted, new nodes are provisioned as part of the VMSS design. This leads to the loss of Kubernetes annotations on the nodes (i.e. kwasm.sh/kwasm-node=true), which impacts configurations that rely on those annotations. The current behavior of runtime-class-manager is to mutate containerd configurations based on annotations that are manually applied to nodes. However, since annotations are not persistent across cluster restarts, this causes issues with retaining the necessary configuration.

ref: spinkube/azure#24

Proposal:

Labels such as kwasm.sh/kwasm-node=true should be used instead of annotations for the runtime-class-manager to identify and mutate node configurations. Unlike annotations, labels are persistent across cluster restarts and node reallocation.

The runtime-class-manager should be enhanced to detect and reapply configurations based on node labels automatically, without requiring manual intervention. This would ensure configurations persist across node restarts.

Lastly, the legacy code in node_controller.go needs to be cleaned up. cc @vdice @voigt

@vdice
Copy link
Collaborator

vdice commented Sep 17, 2024

  1. rcm does utilize labels today (see job_controller.go). (The legacy kwasm-operator project uses annotations.)
    a. However, re: AKS node turnover: I just tested and am not currently seeing any labels I've applied to one node persist to the next that has taken its place after an AKS stop/start cycle. This seems like a blocker for the purposes of this ticket, right? My mistake: for nodes to inherit labels on AKS, they need to be applied to the nodepool via https://learn.microsoft.com/en-us/azure/aks/use-labels#create-a-node-pool-with-a-label. Thanks @Mossaka!
  2. I'd be happy to pick the node_controller.go work up if I can get clarity on the intended functionality. I assume its intention is to watch for new nodes and see if they have relevant labels (eg <shim name>=true) and if so install the corresponding shim(s)? So, this in combination with 1a means we should have successful app restarts during node turnover?

@vdice vdice self-assigned this Oct 23, 2024
@vdice
Copy link
Collaborator

vdice commented Oct 23, 2024

Adding myself as an assignee as I had started on the second item mentioned in #222 (comment): unlocking the node_controller.go functionality.

@vdice
Copy link
Collaborator

vdice commented Nov 7, 2024

I finally got back to uncommenting and updating the node_controller.go code but quickly discovered that node_controller and the current shim_controller step on each other's toes. The shim controller already watches for node label changes which also apparently includes new nodes registering with a cluster. So, today, without any node_controller.go code, I believe we have the functionality to handle this issue, i.e. support new nodes already having the appropriate label per a given Shim's nodeSelector config.

I added a GH workflow test to help verify this; see a sample run here.

I've also verified that RCM works today in conjunction with AKS VMSS. When I label a node pool with the appropriate shim nodeSelector (in my case, spin=true) and I either restart the cluster/node pool or scale up the node pool, rcm installs the shim on the nodes. (Though I sometimes see cases of #66)

If RCM today watches for node label events, including new nodes, I wonder if we can fully do without node_controller.go. Perhaps this was what the original authors might have been thinking, hence being commented out. Is there any other node management functionality that RCM needs that would warrant a node_controller? cc @voigt @Mossaka @phyrog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants