Sync waves skipping & Application controller performance #19666
FlorisFeddema started this conversation in General
Replies: 1 comment
-
I'm seeing a similar issue when installing apps into a new cluster. I have 10 sync waves and have restored the Application health check as described in the documentation, yet sync just rips through the apps even while apps from the previous wave are still pending (not even progressing; they haven't started syncing at all).
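For context, the health check restoration I mean is the argocd-cm customization from the docs, roughly like this (trimmed to the relevant key):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Restores the health check for the Application CRD, so an app-of-apps
  # only considers a wave done once its child Applications report healthy.
  resource.customizations.health.argoproj.io_Application: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
      end
    end
    return hs
```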
-
While looking into some odd behavior where our sync waves are sometimes skipped, we feel like we are missing some key information to get them working reliably again.
This behavior started happening a lot more after we upgraded to Argo CD 2.12.
We noticed that a few waves are sometimes skipped when we sync our application. This application consists of 4 other Applications that need to be synced in order. Our suspicion is that the child Applications are not yet "processing" when the parent Application checks whether the wave is complete, so the wave is considered finished prematurely.
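For illustration, the child Applications are ordered with the sync-wave annotation, roughly like this (names, repo URL, and paths are placeholders, not our real setup):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: child-app-1                        # placeholder name
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"      # lower waves are synced first
spec:
  project: default
  source:
    repoURL: https://example.com/deployments.git   # placeholder repo
    targetRevision: HEAD
    path: apps/child-app-1
  destination:
    server: https://kubernetes.default.svc
    namespace: child-app-1
```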
According to the docs there is a delay between sync waves to prevent exactly this behavior. We have already set ARGOCD_SYNC_WAVE_DELAY to 10 seconds, but the issue still happens.
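We set the variable on the application controller, roughly like this (trimmed to the relevant fields; the plain-integer seconds value is how we understand it should be set):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # Delay (in seconds) the controller waits between sync waves.
            - name: ARGOCD_SYNC_WAVE_DELAY
              value: "10"
```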
When we set it to 30 seconds the problem goes away, but it looks like Argo waits 30 seconds before the first wave starts, while there seems to be no delay between the waves themselves: when the first wave finishes we see the second wave start immediately.
Does anyone know how this variable works and why it "fixes" our problem?
While setting that variable to a high value might fix the problem, we wonder if there are other things we can change to make our setup more stable.
We currently run 3 application controller pods on different nodes; each node has 4 CPU cores.
The Argo CD environment has about 800 Applications and 30 clusters.
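For reference, the controller replicas are sharded over the clusters roughly like this (a minimal sketch of the relevant StatefulSet fields, assuming the default sharding setup):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # Must match spec.replicas so each shard knows how many peers
            # exist; the clusters are distributed across the shards.
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
```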
In this chart we see the CPU usage of one of the Application Controller pods:
There are moments when the controller uses all available CPU on the node. We suspect these spikes come from the timed reconciliation that runs roughly every 3 minutes, and because the controller consumes all available CPU at those moments, we suspect other Argo processes get throttled.
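The interval of that timed reconciliation is controlled by timeout.reconciliation in argocd-cm, which we could raise to spread the load, roughly like this (the jitter key is our assumption that it is supported in our version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Interval between periodic refreshes of every Application (default 180s).
  timeout.reconciliation: 300s
  # Spreads the periodic refreshes out instead of one big spike
  # (assumption: available in our Argo CD version).
  timeout.reconciliation.jitter: 60s
```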
While the docs state that the status and operation processors can be increased to get more performance out of the environment, we are afraid that setting these higher might throttle our environment even more, since it is already CPU constrained.
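For completeness, these are the knobs we mean, in argocd-cmd-params-cm (the numbers are just examples, not a recommendation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Number of application status processors (default 20).
  controller.status.processors: "50"
  # Number of application operation (sync) processors (default 10).
  controller.operation.processors: "25"
```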
We are also looking into ignoring the status field on all resources for status updates, but we are worried about side effects of this setting. While it feels like it could improve our performance a bit, it is not obvious to us why it is not enabled by default (https://argo-cd.readthedocs.io/en/stable/operator-manual/reconcile/#system-level-configuration).
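Concretely, we mean the system-level configuration from that page, roughly like this (our reading of the docs, not yet applied in our environment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Ignore status changes on all resources when deciding whether
  # an Application needs to be reconciled again.
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
```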