Flux performance on large repos #3380
Replies: 2 comments 10 replies
-
I suggest you set the number of parallel reconciliations to something like 100 or even higher. Each reconciliation runs in a goroutine, which is a lightweight thread; a single CPU core can handle hundreds of those, so this may considerably speed up the whole reconciliation.
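For reference, a sketch of how this could be done with a kustomize patch in the cluster's `flux-system/kustomization.yaml` (the `--concurrent` flag is the kustomize-controller setting for parallel reconciliations; the value `100` here is just an illustration, tune it for your cluster):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Raise the number of concurrent reconciliations on kustomize-controller
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=100
    target:
      kind: Deployment
      name: kustomize-controller
```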
Due to the nature of Kustomize, where a change in a base overlay can affect all the other overlays, Flux can't skip changes by looking at files; it needs to ask the server. So we do a server-side apply dry-run and only apply the objects that changed. Also, even if nothing changed in Git, things may have drifted in-cluster (e.g. from a kubectl apply). To correct such drift, we decided to always run drift detection via server-side apply dry-run on Git revision changes. If each tenant had its definition in an OCI artifact, then Flux would only reconcile the tenant that changed.
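A minimal sketch of the per-tenant OCI artifact setup described above, assuming one artifact per tenant pushed to a registry (the registry URL, tenant name, and tag are hypothetical placeholders):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: tenant-a
  namespace: flux-system
spec:
  interval: 10m
  # Hypothetical registry and path; replace with your own
  url: oci://registry.example.com/tenants/tenant-a
  ref:
    tag: latest
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-a
  namespace: flux-system
spec:
  interval: 60m
  sourceRef:
    kind: OCIRepository
    name: tenant-a
  path: ./
  prune: true
```

With this layout, a push to one tenant's artifact only triggers that tenant's Kustomization, instead of re-queuing all of them on every Git revision change.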
-
Did you ever add that to the cheatsheet @stefanprodan? We've run into this issue, in particular in terms of AWS EBS gp2 burst exhaustion. We've moved to gp3, which has no burst issues, but a memory-backed disk would still be better. We've proved it manually (not using the Helm chart, see below) in our cluster and it works great. Do you still recommend it? FWIW, here is our change that works for the Flux Helm chart v2.14.1 (which I know is community-driven):

```yaml
kustomizeController:
  volumes:
    - name: temp
      emptyDir:
        sizeLimit: 1000Mi
        medium: Memory
  volumeMounts:
    - name: temp
      mountPath: /tmp
```
-
First of all I'd like to say I love working with Flux, thanks for this great piece of software.
We have a single-tenant setup for our solution. At the moment, we run about 1,200 tenants on a single Kubernetes cluster, which works fine.
In our gitops repo that Flux monitors, we have 1,200 directories containing the required Kubernetes manifests (about 20 objects).
Because a full reconciliation run took quite long, we made the following performance improvements to the kustomize-controller Deployment:

- Set `medium: Memory` for the `emptyDir` mount on `/tmp`, to avoid disk throttling.
- Increased the `interval` of the Kustomizations to 60m.

Now, a full run takes about 13 minutes, which is doable.
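For anyone applying the same `/tmp` change without the Helm chart, a sketch of a strategic merge patch in `flux-system/kustomization.yaml` could look like this (assuming the controller's `/tmp` volume is named `temp`, as in the default Flux manifests; the 1000Mi size limit is illustrative):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Back the controller's /tmp volume with memory instead of node disk
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: kustomize-controller
        namespace: flux-system
      spec:
        template:
          spec:
            volumes:
              - name: temp
                emptyDir:
                  medium: Memory
                  sizeLimit: 1000Mi
```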
However, we face the following problem. When we update the manifests of a single tenant and the Flux source-controller fetches the change, a full reconciliation run is initiated (meaning all 1,200 Kustomizations). This means we need to wait up to 13 minutes for a simple configuration update (the source-controller fetches the repo every 13 minutes). Second, when we update another configuration, all 1,200 Kustomizations are added to the kustomize-controller's queue again.
Why doesn't it only reconcile the Kustomizations that have changes in the repo? If this is by design, do you have any recommendations on how to improve this setup?