Improve ClusterCacheTracker TryLock behavior #10819
Comments
/assign
Added to my pipeline; please reach out if someone wants to give it a try before I get to it.
Another idea: we could take the opportunity to think about remote connection management in a more holistic way, e.g. assign CCT the responsibility for creating remote clients and let the other controllers use them if available.
Absolutely in favor!
I think we have to lock, but we could split between read and write locks, which should help a lot.
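A minimal sketch of what such a read/write lock split could look like, assuming a hypothetical per-cluster accessor cache (none of these names are from the actual ClusterCacheTracker code): lookups of an existing client take only a read lock, while storing a newly created client takes the write lock.

```go
package tracker

import "sync"

// accessor stands in for a cached per-cluster client/cache (hypothetical).
type accessor struct{}

// clusterAccessors guards a map of per-cluster accessors with an RWMutex so
// that looking up an existing client only needs a read lock.
type clusterAccessors struct {
	mu        sync.RWMutex
	accessors map[string]*accessor
}

func newClusterAccessors() *clusterAccessors {
	return &clusterAccessors{accessors: map[string]*accessor{}}
}

// get returns a cached accessor under a read lock; many workers can do this
// concurrently without blocking each other.
func (c *clusterAccessors) get(cluster string) (*accessor, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	a, ok := c.accessors[cluster]
	return a, ok
}

// create stores a new accessor under the write lock. The slow part (dialing
// the workload cluster) is done before taking the lock, otherwise readers
// would still be blocked for its whole duration.
func (c *clusterAccessors) create(cluster string, build func() *accessor) *accessor {
	newAccessor := build()
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.accessors[cluster]; ok {
		return existing
	}
	c.accessors[cluster] = newAccessor
	return newAccessor
}
```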
Today we just requeue after 1m if we hit ErrClusterLocked. We could do something similar if the client is not available.
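For context, a hedged sketch of that requeue pattern as a caller might write it; the Reconciler struct and helper name are made up for illustration, and the type/error names (remote.ClusterCacheTracker, remote.ErrClusterLocked, GetClient) are my best recollection of the cluster-api API at the time of this issue.

```go
package controllers

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	"sigs.k8s.io/cluster-api/controllers/remote"
)

// Reconciler is a hypothetical reconciler that needs a workload cluster client.
type Reconciler struct {
	Tracker *remote.ClusterCacheTracker
}

func (r *Reconciler) reconcileWithRemoteClient(ctx context.Context, clusterKey client.ObjectKey) (ctrl.Result, error) {
	remoteClient, err := r.Tracker.GetClient(ctx, clusterKey)
	if err != nil {
		if errors.Is(err, remote.ErrClusterLocked) {
			// Another worker holds the lock; requeue after a minute instead of surfacing an error.
			log.FromContext(ctx).V(5).Info("Requeuing because another worker has the lock on the ClusterCacheTracker")
			return ctrl.Result{RequeueAfter: time.Minute}, nil
		}
		return ctrl.Result{}, err
	}
	_ = remoteClient // use the client for the actual reconciliation work
	return ctrl.Result{}, nil
}
```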
/assign
/close in favor of #11272
@sbueringer: Closing this issue. In response to this: "/close in favor of #11272"
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
ClusterCacheTracker today allows only one controller worker at a time to retrieve a client. If a second controller worker tries at the same time, it gets an ErrClusterLocked error. This usually leads to a log line like "Requeuing because another worker has the lock on the ClusterCacheTracker" (log level 5) and a requeue.
This was introduced for the case where a workload cluster is not reachable. In that case, when we try to create a client with the CCT, the client creation times out after 10 seconds. In this scenario we wanted to block at most one worker and not deadlock entire controllers.
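To illustrate the behavior described above, here is a rough, hypothetical sketch of a non-blocking per-cluster TryLock (not the real implementation): the first worker acquires the lock, and any concurrent worker for the same cluster fails immediately and has to requeue.

```go
package tracker

import "sync"

// keyedTryLock is a rough stand-in for the per-cluster lock inside
// ClusterCacheTracker; it is not the real implementation.
type keyedTryLock struct {
	mu     sync.Mutex
	locked map[string]struct{}
}

func newKeyedTryLock() *keyedTryLock {
	return &keyedTryLock{locked: map[string]struct{}{}}
}

// TryLock returns false immediately if another worker already holds the lock
// for this cluster; it never waits, which is what leads to ErrClusterLocked
// and the requeue described above.
func (l *keyedTryLock) TryLock(cluster string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, held := l.locked[cluster]; held {
		return false
	}
	l.locked[cluster] = struct{}{}
	return true
}

// Unlock releases the per-cluster lock so the next worker can proceed.
func (l *keyedTryLock) Unlock(cluster string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.locked, cluster)
}
```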
I think the current behavior is not ideal insofar as TryLock just immediately fails/returns. Ideally we would try to get the lock for a small period of time, so we end up with the following results (a rough sketch follows the two paths below):
Happy path (cluster is reachable, we can create a client):
Un-happy path (cluster is not reachable, client creation times out after 10s):
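Building on the hypothetical keyedTryLock sketch above (and additionally assuming the "context" and "time" imports), a bounded-wait variant could look roughly like this; the grace period and the 50ms polling interval are made-up values for illustration.

```go
// TryLockWithTimeout keeps retrying the non-blocking TryLock for a short,
// configurable grace period before giving up, instead of failing immediately.
func (l *keyedTryLock) TryLockWithTimeout(ctx context.Context, cluster string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if l.TryLock(cluster) {
			// Happy path: the lock was free or the holder released it quickly.
			return true
		}
		if ctx.Err() != nil || time.Now().After(deadline) {
			// Un-happy path: the holder is likely stuck in a ~10s client creation,
			// so we give up after the grace period and let the caller requeue.
			return false
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```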
Or maybe we should re-think the whole mechanism and come up with something different :)