Race condition occurs during sharding when `application-controller-0` pod is restarted
#20620
Describe the bug
We have a lot of applications (one per cluster) and noticed a race condition when `application-controller-0` is restarted: some of the in-progress syncing applications that do not belong to `application-controller-0` suddenly go to the `Error` state.

On deeper investigation, I noticed there is a race condition during the applicationController startup:
1. `newApplicationInformerAndLister` is called in here.
2. `newApplicationInformerAndLister` registers Kubernetes resource event handlers that add the corresponding objects (Argo CD Apps) to the associated work queue. To determine whether an app belongs to this controller, it uses `canProcessApp` in here.
   - `canProcessApp` internally uses `IsManagedCluster` to determine whether the cluster is managed by the associated ApplicationController.
   - `IsManagedCluster` has a fault: if the applicationController sharding is empty and the application controller shard is 0, then `application-controller-0` owns everything.
3. In the `Run` function, it first spawns the ApplicationInformer object (from step 2) in a goroutine.
4. It then initializes the sharding and runs `processAppOperationQueueItem`. Since the queue already holds all the apps, the unrelated apps are processed as well. At a certain point, it checks `IsManagedCluster` again (by which time the sharding has been built up), finds that the app does not belong to `application-controller-0`, and therefore puts the app into the `common.OperationError` state.

Our applications all of a sudden have `controller is configured to ignore cluster` (reported from here) and end up reporting `updated 'argocd/foo-cluster' operation (phase: Error)` in the log. The applications stay stuck syncing forever until we manually hard refresh them in the Argo CD UI.

I am proposing to fix this by switching step 3 and step 4, so that we initialize the sharding before we launch the AppInformer in a goroutine.
I can push a PR pretty soon.
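
To make the ordering problem concrete, here is a minimal, deterministic sketch of the interleaving described above. It is not Argo CD code: `app`, `sharding`, `initShards`, `isManagedCluster`, and `startup` are hypothetical stand-ins, and the goroutines are replaced by a fixed sequence of steps so that the worst case is reproduced every time.

```go
package main

import "fmt"

// app is a hypothetical stand-in for an Argo CD Application and its target cluster.
type app struct {
	name    string
	cluster string
}

// sharding is a hypothetical stand-in for the controller's sharding state.
type sharding struct {
	shard  int
	owners map[string]int // cluster name -> owning shard; empty until initShards runs
}

func (s *sharding) initShards(owners map[string]int) { s.owners = owners }

// isManagedCluster mimics the fault described in the issue: while the
// sharding data is still empty, shard 0 claims every cluster.
func (s *sharding) isManagedCluster(cluster string) bool {
	if len(s.owners) == 0 {
		return s.shard == 0
	}
	owner, ok := s.owners[cluster]
	return ok && owner == s.shard
}

// startup simulates controller shard 0 starting up and returns the names of
// apps that were queued but later fail the ownership check.
func startup(initFirst bool, apps []app, owners map[string]int) []string {
	s := &sharding{shard: 0}
	if initFirst {
		s.initShards(owners) // proposed order: sharding before the informer
	}
	// Informer event handlers: enqueue every app that passes the check.
	var queue []app
	for _, a := range apps {
		if s.isManagedCluster(a.cluster) {
			queue = append(queue, a)
		}
	}
	if !initFirst {
		s.initShards(owners) // current order: sharding after the informer
	}
	// Queue worker: the ownership check runs again while processing.
	var failed []string
	for _, a := range queue {
		if !s.isManagedCluster(a.cluster) {
			failed = append(failed, a.name)
		}
	}
	return failed
}

func main() {
	apps := []app{{"app-a", "cluster-a"}, {"app-b", "cluster-b"}}
	owners := map[string]int{"cluster-a": 0, "cluster-b": 1}
	fmt.Println("informer first:", startup(false, apps, owners))
	fmt.Println("sharding first:", startup(true, apps, owners))
}
```

With the informer started first, `app-b` (owned by shard 1) is queued on shard 0 and then rejected during processing. With the sharding initialized first, it is never queued on shard 0 at all.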
To Reproduce
1. Restart the `application-controller-0` pod.
2. `controller is configured to ignore cluster` is reported from the `application-controller-0` pod.

Expected behavior
`application-controller-0` should handle only its own clusters, regardless of the situation.

Screenshots
Version
Paste the output from `argocd version` here.