Race condition occurs during sharding when application-controller-0 pod is restarted #20620

Open · kahou82 opened this issue Oct 31, 2024 · 0 comments
Labels: bug (Something isn't working), component:sharding

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

We have a lot of applications (one per cluster) and noticed a race condition when application-controller-0 is restarted: some of the in-progress syncing applications that do not belong to application-controller-0 suddenly go to the Error state.

After deeper investigation, I noticed a race condition during ApplicationController startup:

  1. When the ApplicationController starts, it initializes appInformer from newApplicationInformerAndLister in here.
  2. newApplicationInformerAndLister registers the k8s resource event handlers, which add the correlated objects (Argo CD Apps) to the associated work queue. To determine whether an app belongs to that controller, it uses canProcessApp in here.
    • canProcessApp internally uses IsManagedCluster to determine whether the cluster is managed by the associated ApplicationController.
    • However, IsManagedCluster has a fault: if the application controller's shard map is still empty and the controller's shard is 0, then shard 0 owns everything (see the sketch after this list).
  3. The application controller then enters the Run function, which first spawns the ApplicationInformer (from step 2) in a goroutine.
  4. The ApplicationInformer receives event notifications, and at that point the ApplicationController has not yet built up the shard-to-cluster correlation, so it enqueues apps from every cluster (see here).
  5. Only then does the application controller initialize the sharding in here.
  6. The application controller then starts processing items from the queue via processAppOperationQueueItem. Since the queue contains all the apps, unrelated apps are also processed. At a certain point it checks IsManagedCluster again (by which time the sharding has been built up), finds that the app does not belong to application-controller-0, and the app therefore goes to the common.OperationError state.
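
To make the fault in step 2 concrete, here is a minimal standalone sketch of how I understand the shard lookup to behave. The type and field names are paraphrased for illustration, not the exact code from the controller/sharding package:

```go
package sharding

import "sync"

// ClusterSharding is a simplified stand-in for the controller's shard state.
type ClusterSharding struct {
	Shard  int            // shard index of this controller replica
	Shards map[string]int // cluster server URL -> assigned shard
	lock   sync.RWMutex
}

// IsManagedCluster reports whether this replica should process apps that
// live on the given cluster.
func (s *ClusterSharding) IsManagedCluster(server string) bool {
	s.lock.RLock()
	defer s.lock.RUnlock()

	clusterShard := 0 // default when the cluster has no assigned shard yet
	if shard, ok := s.Shards[server]; ok {
		clusterShard = shard
	}
	// Before the shard map is populated, Shards is empty and clusterShard
	// stays 0. On application-controller-0 (Shard == 0) this returns true
	// for every cluster, which is why shard 0 temporarily "owns everything".
	return clusterShard == s.Shard
}
```

Once the sharding is initialized, the same check starts returning false for the foreign clusters, which is what flips the already-enqueued apps into the error path described in step 6.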

Our applications suddenly report controller is configured to ignore cluster (reported from here) and end up logging updated 'argocd/foo-cluster' operation (phase: Error). The applications stay stuck syncing forever until we manually hard refresh them in the Argo CD UI.

I am proposing to swap step 3 and step 5, so that we initialize the sharding before we launch the AppInformer goroutine.
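
As a rough illustration of the proposed ordering, here is a compilable sketch; the types and names below are hypothetical stubs, not the actual Run implementation. The idea is simply to populate the shard map before the informer goroutine can enqueue anything:

```go
package controller

import "context"

// Hypothetical stubs standing in for the real controller pieces; they exist
// only to show the proposed startup ordering.
type clusterSharding struct{ shards map[string]int }

func (s *clusterSharding) Init(clusters map[string]int) { s.shards = clusters }

type appInformer struct{}

// Run blocks until the stop channel is closed, like a shared informer would.
func (appInformer) Run(stop <-chan struct{}) { <-stop }

type applicationController struct {
	sharding *clusterSharding
	informer appInformer
	clusters map[string]int // cluster server URL -> assigned shard
}

// Run sketches the fix: initialize the sharding first (today's step 5), then
// launch the informer goroutine (today's step 3), so the informer's event
// handlers never consult an empty shard map.
func (ctrl *applicationController) Run(ctx context.Context) {
	ctrl.sharding.Init(ctrl.clusters) // build the shard/cluster correlation first
	go ctrl.informer.Run(ctx.Done())  // only now start enqueuing app events

	<-ctx.Done() // worker loops and the rest of startup omitted
}
```

With this ordering, any event the informer handles sees the populated shard map, so canProcessApp filters out foreign clusters from the start.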

I can push a PR pretty soon.

To Reproduce

  1. Start the application-controller StatefulSet with multiple replicas to enable sharding.
  2. Start several applications whose sync process takes a very long time.
  3. Restart the application-controller-0 pod.
  4. Intermittently, you will see controller is configured to ignore cluster being reported from the application-controller-0 pod.

Expected behavior

  1. application-controller-0 should only handle its own clusters, regardless of the situation.

Screenshots

Version

Paste the output from `argocd version` here.

[Screenshot: 2024-10-31 at 12:08:15 PM]

Paste any relevant application logs here.