Race condition occurs during sharding when application-controller-0 pod is restarted #20620

Open · kahou82 opened this issue Oct 31, 2024 · 0 comments
Labels: bug (Something isn't working), component:sharding

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

We have a lot of applications (one per cluster) and noticed a race condition when application-controller-0 is restarted: some of the in-progress syncing applications that do not belong to application-controller-0 suddenly go to the Error state.

After deeper investigation, I noticed a race condition during ApplicationController startup:

  1. When the ApplicationController starts, it initializes appInformer from newApplicationInformerAndLister in here.
  2. newApplicationInformerAndLister registers the k8s resource event handlers, which add the correlated objects (Argo CD Apps) to the associated work queue. To determine whether an app belongs to that controller, it uses canProcessApp in here.
    • canProcessApp internally uses IsManagedCluster to determine whether the cluster is managed by the associated ApplicationController.
    • However, IsManagedCluster has a fault: if the application controller's shard map is still empty and the controller's shard is 0, then shard 0 owns everything (see the sketch after this list).
  3. The application controller then enters the Run function, which first spawns the ApplicationInformer (from step 2) in a goroutine.
  4. The ApplicationInformer receives event notifications, and at that point the ApplicationController has not yet built up the shard-to-cluster correlation, so it enqueues apps from every cluster (see here).
  5. Only then does the application controller initialize the sharding in here.
  6. The application controller then starts processing items from the queue via processAppOperationQueueItem. Since the queue contains all the apps, unrelated apps are also processed. At a certain point it checks IsManagedCluster again (by which time the sharding has been built up), finds that the app does not belong to application-controller-0, and the app therefore goes to the common.OperationError state.
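
To make the fault in step 2 concrete, here is a minimal standalone sketch of how I understand the shard lookup to behave. The type and field names are paraphrased for illustration, not the exact code from the controller/sharding package:

```go
package sharding

import "sync"

// ClusterSharding is a simplified stand-in for the controller's shard state.
type ClusterSharding struct {
	Shard  int            // shard index of this controller replica
	Shards map[string]int // cluster server URL -> assigned shard
	lock   sync.RWMutex
}

// IsManagedCluster reports whether this replica should process apps that
// live on the given cluster.
func (s *ClusterSharding) IsManagedCluster(server string) bool {
	s.lock.RLock()
	defer s.lock.RUnlock()

	clusterShard := 0 // default when the cluster has no assigned shard yet
	if shard, ok := s.Shards[server]; ok {
		clusterShard = shard
	}
	// Before the shard map is populated, Shards is empty and clusterShard
	// stays 0. On application-controller-0 (Shard == 0) this returns true
	// for every cluster, which is why shard 0 temporarily "owns everything".
	return clusterShard == s.Shard
}
```

Once the sharding is initialized, the same check starts returning false for the foreign clusters, which is what flips the already-enqueued apps into the error path described in step 6.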

Our applications suddenly report controller is configured to ignore cluster (reported from here) and end up logging updated 'argocd/foo-cluster' operation (phase: Error). The applications stay stuck syncing forever until we manually hard refresh them in the Argo CD UI.

I am proposing to swap step 3 and step 5, so that we initialize the sharding before we launch the AppInformer goroutine.
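
As a rough illustration of the proposed ordering, here is a compilable sketch; the types and names below are hypothetical stubs, not the actual Run implementation. The idea is simply to populate the shard map before the informer goroutine can enqueue anything:

```go
package controller

import "context"

// Hypothetical stubs standing in for the real controller pieces; they exist
// only to show the proposed startup ordering.
type clusterSharding struct{ shards map[string]int }

func (s *clusterSharding) Init(clusters map[string]int) { s.shards = clusters }

type appInformer struct{}

// Run blocks until the stop channel is closed, like a shared informer would.
func (appInformer) Run(stop <-chan struct{}) { <-stop }

type applicationController struct {
	sharding *clusterSharding
	informer appInformer
	clusters map[string]int // cluster server URL -> assigned shard
}

// Run sketches the fix: initialize the sharding first (today's step 5), then
// launch the informer goroutine (today's step 3), so the informer's event
// handlers never consult an empty shard map.
func (ctrl *applicationController) Run(ctx context.Context) {
	ctrl.sharding.Init(ctrl.clusters) // build the shard/cluster correlation first
	go ctrl.informer.Run(ctx.Done())  // only now start enqueuing app events

	<-ctx.Done() // worker loops and the rest of startup omitted
}
```

With this ordering, any event the informer handles sees the populated shard map, so canProcessApp filters out foreign clusters from the start.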

I can push a PR pretty soon.

To Reproduce

  1. Start the application-controller StatefulSet with multiple replicas to enable sharding.
  2. Start several applications whose sync process takes a very long time.
  3. Restart the application-controller-0 pod.
  4. Intermittently, you will see controller is configured to ignore cluster being reported from the application-controller-0 pod.

Expected behavior

  1. application-controller-0 should only handle its own clusters, regardless of the situation.

Screenshots

Version

Paste the output from `argocd version` here.

[Screenshot: 2024-10-31 at 12:08:15 PM]

Paste any relevant application logs here.