Update controller and agent to kube-rs client 0.91.0 #702

Open
wants to merge 7 commits into main

Conversation

kate-goldenring
Contributor

What this PR does / why we need it:
Our controller uses a very outdated Kubernetes client (kube-rs) API. This PR updates it to a newer version. A follow-up PR can bump us to the latest release; that jump would require more changes to the agent, which is why I avoided making it here.
Much of the controller code was written years ago, when kube-rs did a lot less of the heavy lifting. Switching to a newer version with a better reconciliation model meant I was able to delete a lot of helper code.
All the tests had to be rewritten for the new API.

Special notes for your reviewer:
This is a lot. I am hoping the tests can confirm the functionality, but we may also need to do a good bug bash.

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

@diconico07 (Contributor) left a comment

Did a first pass here; I didn't look at the tests for now.

shared/Cargo.toml (outdated; resolved)
controller/src/main.rs (outdated; resolved)
controller/src/util/node_watcher.rs (outdated; resolved)
/// This attempts to remove nodes from the nodes list and deviceUsage
/// map in an Instance. An attempt is made to update
/// the instance in etcd; any failure is returned.
async fn try_remove_nodes_from_instance(
Contributor

Does this play well with the Server Side Apply used by the agent? I fear this would completely mess with field ownership and prevent agents from removing themselves from an Instance afterwards (on device disappearance).
It would also need to remove the finalizer of the vanished node's agent.
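
For context, here is a minimal sketch (not Akri's actual code) of how an agent's server-side apply establishes field ownership in kube-rs. The Instance import path, the instance name, and the "akri-agent" field-manager name are assumptions for illustration.

```rust
use kube::api::{Api, Patch, PatchParams};
use kube::Client;
use serde_json::json;

use akri_shared::akri::instance::Instance; // assumed import path for Akri's Instance CRD

async fn agent_applies_node(client: Client) -> Result<(), kube::Error> {
    let instances: Api<Instance> = Api::namespaced(client, "default");

    // Every server-side apply is attributed to a field manager; the API server
    // records that manager as the owner of the fields it applied (e.g. this
    // node's entry in the Instance). If the controller later rewrites those
    // fields under a different manager, ownership transfers or conflicts arise,
    // which is the concern about agents no longer being able to remove
    // themselves afterwards.
    let patch = json!({
        "apiVersion": "akri.sh/v0",
        "kind": "Instance",
        "metadata": { "name": "example-instance" },
        "spec": { "nodes": ["node-a"] }
    });
    instances
        .patch(
            "example-instance",
            &PatchParams::apply("akri-agent"),
            &Patch::Apply(&patch),
        )
        .await?;
    Ok(())
}
```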

Contributor Author

Server-side apply is fairly new to me. It sounds like the idea is that only one owner should manage an object, or there could be conflicts. The owner of the Instance objects should be the agent. Therefore, should I switch this back to patching the way we used to?

Contributor

If we remove one node at a time, then I think we should impersonate that node's agent to do so, and do what the missing node's agent would do if it considered the device to have disappeared.

@kate-goldenring (Contributor Author) Oct 1, 2024

I don't like the idea of impersonation, but I don't see removing a finalizer as impersonation. Does d9689ae address your concern?

controller/src/util/pod_watcher.rs (3 outdated comment threads; resolved)
Comment on lines +514 to +521
let owner_references: Vec<OwnerReference> = vec![OwnerReference {
    api_version: ownership.get_api_version(),
    kind: ownership.get_kind(),
    controller: ownership.get_controller(),
    block_owner_deletion: ownership.get_block_owner_deletion(),
    name: ownership.get_name(),
    uid: ownership.get_uid(),
}];
Contributor

It would be easier to implement From<OwnershipInfo> for OwnerReference than to repeat this block every time.
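
A sketch of that suggestion, assuming the getter names shown in the snippet above (illustrative only, not the code in this PR; the impl would have to live in the crate that defines OwnershipInfo because of Rust's orphan rule):

```rust
use k8s_openapi::apimachinery::pkg::apis::meta::v1::OwnerReference;

impl From<OwnershipInfo> for OwnerReference {
    fn from(ownership: OwnershipInfo) -> Self {
        OwnerReference {
            api_version: ownership.get_api_version(),
            kind: ownership.get_kind(),
            controller: ownership.get_controller(),
            block_owner_deletion: ownership.get_block_owner_deletion(),
            name: ownership.get_name(),
            uid: ownership.get_uid(),
        }
    }
}

// Call sites then shrink to:
// let owner_references: Vec<OwnerReference> = vec![ownership.into()];
```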

controller/src/util/instance_action.rs (outdated; resolved)
@kate-goldenring
Contributor Author

Thinking about this more, I think we want to avoid using the finalizer API for the instance_watcher: since we don't do anything on Instance deletes, a finalizer shouldn't be necessary.

For node_watcher, we could also remove the finalizer, given that we don't care whether the node is fully deleted before reconciling it.
Even the pod_watcher is debatable. For all of these, reverting to a watcher or reflector may be better.

@diconico07
Contributor

@kate-goldenring the Controller system doesn't have any "delete" event, partly because deletes can easily be missed; e.g., for the node watcher, if the controller was running on the now-missing node, it will never get a delete event for that node.

Overall, the Akri controller currently uses an event-based, imperative mechanism ("a node got deleted, so I remove it from Instances"). But the kube-rs "Controller" system is a bit different in that it works with: "I have a resource (in our case the Instance); I should ensure everything is set up correctly for it."

So if we were to follow the full kube-rs flow here, we would have an Instance controller that ensures all broker Pods/Jobs/... exist and that all referenced nodes are healthy. It would trigger on an Instance change to reconcile that Instance, trigger on an owned Pod/Job/... change (to ensure it is always according to spec), and also trigger (for all Instances) on a node state change.
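
A rough sketch of that shape using the kube-rs 0.91 runtime (illustrative only; the Instance import path, the empty reconcile body, and the error policy are placeholders/assumptions):

```rust
use std::{sync::Arc, time::Duration};

use futures::StreamExt;
use k8s_openapi::api::core::v1::{Node, Pod};
use kube::runtime::controller::{Action, Controller};
use kube::runtime::{reflector::ObjectRef, watcher};
use kube::{Api, Client};

use akri_shared::akri::instance::Instance; // assumed import path for Akri's Instance CRD

async fn run(client: Client) -> anyhow::Result<()> {
    let instances: Api<Instance> = Api::all(client.clone());
    let pods: Api<Pod> = Api::all(client.clone());
    let nodes: Api<Node> = Api::all(client.clone());

    let ctrl = Controller::new(instances, watcher::Config::default())
        // Owned broker Pods: a change maps back to the owning Instance.
        .owns(pods, watcher::Config::default());

    // Reader of the controller's own Instance cache, used below to fan a
    // node change out to every Instance.
    let instance_cache = ctrl.store();

    ctrl.watches(nodes, watcher::Config::default(), move |_node: Node| {
        // Any node state change re-reconciles all Instances, since any of
        // them may reference the affected node.
        instance_cache
            .state()
            .iter()
            .map(|i| ObjectRef::from_obj(i.as_ref()))
            .collect::<Vec<_>>()
    })
    .run(reconcile, error_policy, Arc::new(()))
    .for_each(|_| async {})
    .await;
    Ok(())
}

async fn reconcile(_instance: Arc<Instance>, _ctx: Arc<()>) -> Result<Action, kube::Error> {
    // Ensure broker Pods/Jobs/... exist and referenced nodes are healthy.
    Ok(Action::await_change())
}

fn error_policy(_instance: Arc<Instance>, _err: &kube::Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(30))
}
```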

We could also have a reflector for Configurations to maintain a cache of them and avoid fetching a Configuration from the API server every time we reconcile an Instance.
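
For example, a minimal reflector-based cache might look like the following (a sketch only; the Configuration import path is assumed, and the store lookup in the comment is just a hint at how reconcile() would use it):

```rust
use futures::StreamExt;
use kube::runtime::{reflector, watcher, WatchStreamExt};
use kube::{Api, Client};

use akri_shared::akri::configuration::Configuration; // assumed import path

fn start_configuration_cache(client: Client) -> reflector::Store<Configuration> {
    let configs: Api<Configuration> = Api::all(client);
    let (reader, writer) = reflector::store::<Configuration>();

    // Drive the reflector in the background; `reader` is a cheap handle the
    // Instance reconciler can use instead of GETting the Configuration.
    let stream = reflector(writer, watcher(configs, watcher::Config::default()))
        .applied_objects()
        .for_each(|_| async {});
    tokio::spawn(stream);

    // In reconcile():
    //   reader.get(&reflector::ObjectRef::new("my-config").within("my-namespace"))
    reader
}
```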

I didn't comment on this, as your goal was to make the minimal changes necessary to upgrade to the latest kube-rs.

@kate-goldenring
Contributor Author

@diconico07 that vision makes sense. I started down the path of minimal changes, and it obviously grew so much that a full rewrite with just an Instance controller would have been easier. At this point, I don't have much more time to devote to this, so could we look at this through the lens of what needs to change here for parity on an upgraded kube-rs, and then track an issue for rewriting the controller?

@kate-goldenring
Contributor Author

Because we are no longer reconciling Instance deletes, the BROKER_POD_COUNT metric is becoming more of a BROKER_POD_CREATED metric.

@kate-goldenring
Contributor Author

I think we need to bump the Rust version in a separate PR, given that the failing e2e tests are probably due to too low a Rust version in the cross containers:

#19 6.424 error: package `opcua-discovery-handler v0.13.2 (/app/discovery-handler-modules/opcua-discovery-handler)` cannot be built because it requires rustc 1.75.0 or newer, while the currently active rustc version is 1.74.1

@kate-goldenring
Contributor Author

It would trigger on an Instance change to reconcile that Instance, trigger on an owned Pod/Job/... change (to ensure it is always according to spec), and also trigger (for all Instances) on a node state change.

@diconico07 clarifying here: are you saying the controller should still watch Instance, Pod, and Node resources? Or just Instance resources, verifying the existence of the other resources when an Instance changes?

@diconico07
Contributor

It should reconcile Instances (Controller mode), watch Nodes (IIRC there is a way to get a trigger only when a specific field gets updated), and watch Pods, Services, and Jobs (if we want to always keep these in line with the spec in the Configuration).

The last two would be light watches (or probably a reflector for Nodes) that just trigger an Instance reconciliation.
