Improve ClusterCacheTracker TryLock behavior #10819

Closed
sbueringer opened this issue Jul 1, 2024 · 6 comments
Labels
area/clustercachetracker: Issues or PRs related to the clustercachetracker
kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer (Member)

ClusterCacheTracker today allows only one controller worker at a time to retrieve a client. If a second controller worker tries at the same time, it gets an ErrClusterLocked error, which usually results in a log message like "Requeuing because another worker has the lock on the ClusterCacheTracker" (log level 5) and a requeue.
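
For context, a minimal sketch of how a reconciler typically handles this today, assuming the current controllers/remote API (ClusterCacheTracker.GetClient and remote.ErrClusterLocked); the helper name is made up and the exact call sites differ per controller:

```go
package controllers

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
)

// reconcileWithRemoteClient sketches the handling described above: if another
// worker holds the lock, log at verbosity 5 and requeue after a minute.
func reconcileWithRemoteClient(ctx context.Context, tracker *remote.ClusterCacheTracker, cluster *clusterv1.Cluster) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	remoteClient, err := tracker.GetClient(ctx, client.ObjectKeyFromObject(cluster))
	if err != nil {
		if errors.Is(err, remote.ErrClusterLocked) {
			logger.V(5).Info("Requeuing because another worker has the lock on the ClusterCacheTracker")
			return ctrl.Result{RequeueAfter: time.Minute}, nil
		}
		return ctrl.Result{}, err
	}

	// Use remoteClient to reconcile resources in the workload cluster.
	_ = remoteClient
	return ctrl.Result{}, nil
}
```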

This was introduced for the case where a workload cluster is not reachable. In that case, client creation with the CCT times out after 10 seconds, and we wanted to block at most one worker instead of deadlocking entire controllers.

I think the current behavior is not ideal insofar as TryLock just fails and returns immediately. Ideally we would try to get the lock for a short period of time, so we end up with the following results (see the sketch after the two scenarios below):

Happy path (cluster is reachable, we can create a client):

  • 1st controller creates the client
  • all other controllers retry TryLock for a few ms and eventually everyone gets a client without a full requeue (today the requeue is done via RequeueAfter: 1m)

Un-happy path (cluster is not reachable, client creation times out after 10s):

  • 1st controller tries to create a client and times out after 10s
  • all other controllers retry TryLock for a few ms, but eventually give up (ideally, in this scenario the retry duration decreases over time so that the average reconcile duration of our controllers doesn't degrade too much)

Or maybe we should re-think the whole mechanism and come up with something different :)
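
A very rough sketch of the bounded-retry idea. All names here are hypothetical and this is not existing ClusterCacheTracker code: keep retrying TryLock for a small window before giving up, so on the happy path other workers get through after a few ms, and on the un-happy path they still give up quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// keyedTryLock is a hypothetical stand-in for the per-cluster lock inside the
// ClusterCacheTracker.
type keyedTryLock struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func newKeyedTryLock() *keyedTryLock {
	return &keyedTryLock{locks: map[string]struct{}{}}
}

// tryLock attempts to take the lock for key exactly once.
func (k *keyedTryLock) tryLock(key string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, held := k.locks[key]; held {
		return false
	}
	k.locks[key] = struct{}{}
	return true
}

func (k *keyedTryLock) unlock(key string) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.locks, key)
}

// tryLockFor retries tryLock every pollInterval until maxWait elapses instead
// of failing on the first attempt. Happy path: other workers get the lock
// after a few ms. Un-happy path: they give up after maxWait instead of
// requeuing immediately.
func (k *keyedTryLock) tryLockFor(key string, maxWait, pollInterval time.Duration) bool {
	deadline := time.Now().Add(maxWait)
	for {
		if k.tryLock(key) {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(pollInterval)
	}
}

func main() {
	l := newKeyedTryLock()
	if l.tryLockFor("ns/cluster-1", 50*time.Millisecond, 5*time.Millisecond) {
		defer l.unlock("ns/cluster-1")
		fmt.Println("got the lock, creating/fetching the client")
	} else {
		fmt.Println("could not get the lock, requeue")
	}
}
```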

sbueringer added the area/clustercachetracker label on Jul 1, 2024
k8s-ci-robot added the needs-priority, needs-kind, and needs-triage labels on Jul 1, 2024
fabriziopandini added the kind/feature, priority/important-soon, and triage/accepted labels on Jul 3, 2024
k8s-ci-robot removed the needs-kind, needs-priority, and needs-triage labels on Jul 3, 2024
@fabriziopandini (Member)

/assign

Added to my pipeline; please reach out if someone wants to give it a try before I get to it.

@fabriziopandini (Member)

Another idea: we could take the opportunity to think about remote connection management in a more holistic way.

E.g. we could make the CCT responsible for creating remote clients and let the other controllers use them if available. This would probably remove the locking concern from the other controllers (the client is either there or not, no need to lock); this approach most probably requires some mechanism to trigger reconciliations when a remote client becomes available.
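
A purely illustrative sketch of that idea, with made-up names (ClusterClientProvider, ErrClusterNotConnected): the tracker owns client creation in the background, and controllers only read.

```go
package sketch

import (
	"context"
	"errors"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ErrClusterNotConnected is a made-up sentinel error: the tracker currently
// has no working connection to this workload cluster.
var ErrClusterNotConnected = errors.New("no connection to the workload cluster available")

// ClusterClientProvider is a made-up, read-only view of the tracker. The
// tracker itself creates and refreshes remote clients in the background;
// controllers only ask for the current one and never take a creation lock.
type ClusterClientProvider interface {
	// GetClient returns the cached client for the given cluster, or
	// ErrClusterNotConnected if none is available at the moment.
	GetClient(ctx context.Context, cluster client.ObjectKey) (client.Client, error)
}
```

Controllers would handle ErrClusterNotConnected roughly like ErrClusterLocked today (requeue), and some source/watch on the tracker would enqueue Clusters once a connection is established.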

@sbueringer (Member, Author) commented Jul 8, 2024

Absolutely in favor!

> the client is there or not, no need to lock

I think we still have to lock, but we can split between read and write locks, which should help a lot (the huge improvement is that no regular controller will be blocked for up to 10s when trying to create a client and timing out); see the sketch below.
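
A minimal sketch of the read/write lock split, using a hypothetical clientHolder type rather than the current ClusterCacheTracker internals: reads of an existing client take only a read lock, the expensive client creation happens outside the lock, and only the short map update needs the write lock, so regular workers never wait for the 10s timeout.

```go
package sketch

import (
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clientHolder is a made-up type illustrating the read/write lock split.
type clientHolder struct {
	lock    sync.RWMutex
	clients map[client.ObjectKey]client.Client
}

// get returns an existing client under a read lock; any number of workers can
// call this concurrently without waiting for an in-flight client creation.
func (h *clientHolder) get(key client.ObjectKey) (client.Client, bool) {
	h.lock.RLock()
	defer h.lock.RUnlock()
	c, ok := h.clients[key]
	return c, ok
}

// set stores a newly created client; only this short map update takes the
// write lock. Creation itself can still be serialized per cluster with a
// separate keyed lock, but readers no longer wait for it.
func (h *clientHolder) set(key client.ObjectKey, c client.Client) {
	h.lock.Lock()
	defer h.lock.Unlock()
	if h.clients == nil {
		h.clients = map[client.ObjectKey]client.Client{}
	}
	h.clients[key] = c
}
```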

> most probably requires some mechanism to trigger reconciliations

Today we just requeue after 1m if we hit ErrClusterLocked; we could do something similar if the client is not available.

@sbueringer (Member, Author)

/assign

@sbueringer (Member, Author)

/close

in favor of #11272

@k8s-ci-robot (Contributor)

@sbueringer: Closing this issue.

In response to this:

/close

in favor of #11272

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
