Update controller and agent to kube-rs client 0.91.0 #702

Open
wants to merge 7 commits into main

Conversation

kate-goldenring
Contributor

What this PR does / why we need it:
Our controller uses a very outdated Kubernetes client (kube-rs) API. This PR updates it to a newer version. A follow-up PR can bump us to the latest release; that jump would require more changes to the agent, which is why I avoided making it here.
Much of the controller code was written years ago, when kube-rs did a lot less of the heavy lifting. Switching to a newer version with a better reconciliation model meant I was able to delete a lot of helper code.
All the tests had to be rewritten for the new API.

Special notes for your reviewer:
This is a lot. I am hoping the tests can confirm the functionality, but we may also need to do a good bug bash.

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@diconico07 (Contributor) left a comment

Did a first pass here; I didn't look at the tests for now.

shared/Cargo.toml (outdated; resolved)
controller/src/main.rs (outdated; resolved)
controller/src/util/node_watcher.rs (outdated; resolved)
/// This attempts to remove nodes from the nodes list and deviceUsage
/// map in an Instance. An attempt is made to update
/// the instance in etcd; any failure is returned.
async fn try_remove_nodes_from_instance(
Contributor

Does this play well with the Server Side Apply used by the agent? I fear this would completely mess with field ownership and prevent agents from removing themselves from an Instance afterwards (on device disappearance).
It would also need to remove the finalizer of the vanished node's agent.
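
For context, here is a minimal sketch (not Akri's actual code) of how an agent's server-side apply establishes field ownership in kube-rs. The Instance import path, the instance name, and the "akri-agent" field-manager name are assumptions for illustration.

```rust
use kube::api::{Api, Patch, PatchParams};
use kube::Client;
use serde_json::json;

use akri_shared::akri::instance::Instance; // assumed import path for Akri's Instance CRD

async fn agent_applies_node(client: Client) -> Result<(), kube::Error> {
    let instances: Api<Instance> = Api::namespaced(client, "default");

    // Every server-side apply is attributed to a field manager; the API server
    // records that manager as the owner of the fields it applied (e.g. this
    // node's entry in the Instance). If the controller later rewrites those
    // fields under a different manager, ownership transfers or conflicts arise,
    // which is the concern about agents no longer being able to remove
    // themselves afterwards.
    let patch = json!({
        "apiVersion": "akri.sh/v0",
        "kind": "Instance",
        "metadata": { "name": "example-instance" },
        "spec": { "nodes": ["node-a"] }
    });
    instances
        .patch(
            "example-instance",
            &PatchParams::apply("akri-agent"),
            &Patch::Apply(&patch),
        )
        .await?;
    Ok(())
}
```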

Contributor Author

Server-side apply is fairly new to me. It sounds like the idea is that only one owner should manage an object, or there could be conflicts. The owner of the Instance objects should be the agent. Therefore, should I switch this back to patching the way we used to?

Contributor

If we remove one node at a time, then I think we should impersonate that node's agent to do so, and do what the missing node's agent would do if it considered the device to have disappeared.

@kate-goldenring (Contributor Author) Oct 1, 2024

I don't like the idea of impersonation, but I don't see removing a finalizer as impersonation. Does d9689ae address your concern?

controller/src/util/pod_watcher.rs (3 outdated comment threads; resolved)
Comment on lines +514 to +521
let owner_references: Vec<OwnerReference> = vec![OwnerReference {
    api_version: ownership.get_api_version(),
    kind: ownership.get_kind(),
    controller: ownership.get_controller(),
    block_owner_deletion: ownership.get_block_owner_deletion(),
    name: ownership.get_name(),
    uid: ownership.get_uid(),
}];
Contributor

It would be easier to implement From<OwnershipInfo> for OwnerReference than to repeat this block every time.
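
A sketch of that suggestion, assuming the getter names shown in the snippet above (illustrative only, not the code in this PR; the impl would have to live in the crate that defines OwnershipInfo because of Rust's orphan rule):

```rust
use k8s_openapi::apimachinery::pkg::apis::meta::v1::OwnerReference;

impl From<OwnershipInfo> for OwnerReference {
    fn from(ownership: OwnershipInfo) -> Self {
        OwnerReference {
            api_version: ownership.get_api_version(),
            kind: ownership.get_kind(),
            controller: ownership.get_controller(),
            block_owner_deletion: ownership.get_block_owner_deletion(),
            name: ownership.get_name(),
            uid: ownership.get_uid(),
        }
    }
}

// Call sites then shrink to:
// let owner_references: Vec<OwnerReference> = vec![ownership.into()];
```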

controller/src/util/instance_action.rs (outdated; resolved)
@kate-goldenring
Contributor Author

Thinking about this more, I think we want to avoid using the finalizer API for the instance_watcher: since we don't do anything on Instance deletes, a finalizer shouldn't be necessary.

For node_watcher, we could also remove the finalizer, given that we don't care whether the node is fully deleted before reconciling it.
Even the pod_watcher is debatable. For all of these, reverting to a watcher or reflector may be better.

@diconico07
Contributor

@kate-goldenring the Controller system doesn't have any "delete" event, partly because deletes can easily be missed; e.g., for the node watcher, if the controller was running on the now-missing node, it will never get a delete event for that node.

Overall, the Akri controller currently uses an event-based, imperative mechanism ("a node got deleted, so I remove it from Instances"). But the kube-rs "Controller" system is a bit different in that it works with: "I have a resource (in our case the Instance); I should ensure everything is set up correctly for it."

So if we were to follow the full kube-rs flow here, we would have an Instance controller that ensures all broker Pods/Jobs/... exist and that all referenced nodes are healthy. It would trigger on an Instance change to reconcile that Instance, trigger on an owned Pod/Job/... change (to ensure it is always according to spec), and also trigger (for all Instances) on a node state change.
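
A rough sketch of that shape using the kube-rs 0.91 runtime (illustrative only; the Instance import path, the empty reconcile body, and the error policy are placeholders/assumptions):

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::{Node, Pod};
use kube::runtime::controller::{Action, Controller};
use kube::runtime::{reflector::ObjectRef, watcher};
use kube::{Api, Client};

use akri_shared::akri::instance::Instance; // assumed import path for Akri's Instance CRD

async fn run(client: Client) -> anyhow::Result<()> {
    let instances: Api<Instance> = Api::all(client.clone());
    let pods: Api<Pod> = Api::all(client.clone());
    let nodes: Api<Node> = Api::all(client.clone());

    let ctrl = Controller::new(instances, watcher::Config::default())
        // Owned broker Pods: a change maps back to the owning Instance.
        .owns(pods, watcher::Config::default());

    // Reader of the controller's own Instance cache, used below to fan a
    // node change out to every Instance.
    let instance_cache = ctrl.store();

    ctrl.watches(nodes, watcher::Config::default(), move |_node: Node| {
        // Any node state change re-reconciles all Instances, since any of
        // them may reference the affected node.
        instance_cache
            .state()
            .iter()
            .map(|i| ObjectRef::from_obj(i.as_ref()))
            .collect::<Vec<_>>()
    })
    .run(reconcile, error_policy, Arc::new(()))
    .for_each(|_| async {})
    .await;
    Ok(())
}

async fn reconcile(_instance: Arc<Instance>, _ctx: Arc<()>) -> Result<Action, kube::Error> {
    // Ensure broker Pods/Jobs/... exist and referenced nodes are healthy.
    Ok(Action::await_change())
}

fn error_policy(_instance: Arc<Instance>, _err: &kube::Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(30))
}
```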

We could also have a reflector for Configurations to maintain a cache of them and avoid fetching a Configuration from the API server every time we reconcile an Instance.
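
For example, a minimal reflector-based cache might look like the following (a sketch only; the Configuration import path is assumed, and the store lookup in the comment is just a hint at how reconcile() would use it):

```rust
use futures::StreamExt;
use kube::runtime::{reflector, watcher, WatchStreamExt};
use kube::{Api, Client};

use akri_shared::akri::configuration::Configuration; // assumed import path

fn start_configuration_cache(client: Client) -> reflector::Store<Configuration> {
    let configs: Api<Configuration> = Api::all(client);
    let (reader, writer) = reflector::store::<Configuration>();

    // Drive the reflector in the background; `reader` is a cheap handle the
    // Instance reconciler can use instead of GETting the Configuration.
    let stream = reflector(writer, watcher(configs, watcher::Config::default()))
        .applied_objects()
        .for_each(|_| async {});
    tokio::spawn(stream);

    // In reconcile():
    //   reader.get(&reflector::ObjectRef::new("my-config").within("my-namespace"))
    reader
}
```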

I didn't comment on this, as your goal was to make the minimal changes necessary to upgrade to the latest kube-rs.

@kate-goldenring
Contributor Author

@diconico07 that vision makes sense. I started down the path of minimal changes, and it obviously grew so much that a full rewrite with just an Instance controller would have been easier. At this point, I don't have much more time to devote to this, so could we look at this through the lens of what needs to change here for parity on an upgraded kube-rs, and then track an issue for rewriting the controller?

@kate-goldenring
Contributor Author

Because we are no longer reconciling Instance deletes, the BROKER_POD_COUNT metric is becoming more of a BROKER_POD_CREATED metric.

@kate-goldenring
Contributor Author

I think we need to bump the Rust version in a separate PR, given that the failing e2e tests are probably due to too low a Rust version in the cross containers:

#19 6.424 error: package `opcua-discovery-handler v0.13.2 (/app/discovery-handler-modules/opcua-discovery-handler)` cannot be built because it requires rustc 1.75.0 or newer, while the currently active rustc version is 1.74.1

@kate-goldenring
Contributor Author

It would trigger on an Instance change to reconcile that Instance, trigger on an owned Pod/Job/... change (to ensure it is always according to spec), and also trigger (for all Instances) on a node state change.

@diconico07 clarifying here: are you saying the controller should still watch Instance, Pod, and Node resources? Or just Instance resources, verifying the existence of the other resources when an Instance changes?

@diconico07
Contributor

It should reconcile Instances (Controller mode), watch Nodes (IIRC there is a way to get a trigger only when a specific field gets updated), and watch Pods, Services, and Jobs (if we want to always keep these in line with the spec in the Configuration).

The last two would be light watches (or probably a reflector for Nodes) that just trigger an Instance reconciliation.
