Describe the bug
I currently manage hundreds of Kubernetes clusters, all configured using Flux from a single GitOps repository. We use flux_bootstrap_git to manage the Flux installation for each cluster. On average, this repository receives a new commit every minute during daytime hours.
This high commit frequency causes problems when updating the flux_bootstrap_git resource. Whenever the Flux Terraform provider attempts to push a commit to our GitOps repository, the push is usually rejected because the branch has moved in the meantime, and the apply almost always ends in a timeout with the following error:
│ failed to push manifests: failed to push to remote: command error on
│ refs/heads/main: cannot lock ref 'refs/heads/main': is at
│ da2267035aa50139f41df052947da4e85202c0f0 but expected
│ 71853c6197a6a7f222db0f1978c7cb232b87c5ee
To mitigate this, we’ve increased the timeouts, which has helped to some extent. However, we’ve observed that on every retry, the Terraform provider performs a full clone of the entire repository. This process is time-consuming, given that the repository has over 300,000 commits, and new commits are often added within the retry window.
A potential improvement could involve modifying the func (prd *providerResourceData) CloneRepository(ctx context.Context) function in internal/provider/provider_resource_data.go to use a shallow clone. Here’s an example of the proposed change:
func (prd *providerResourceData) CloneRepository(ctx context.Context) (*gogit.Client, error) {
	tmpDir, err := manifestgen.MkdirTempAbs("", "flux-bootstrap-")
	if err != nil {
		return nil, fmt.Errorf("could not create temporary working directory for git repository: %w", err)
	}
	gitClient, err := prd.GetGitClient(tmpDir)
	if err != nil {
		return nil, fmt.Errorf("could not create git client: %w", err)
	}
	// TODO: Need to conditionally clone here. If repository is empty this will fail.
	_, err = gitClient.Clone(ctx, prd.GetRepositoryURL().String(), repository.CloneConfig{
		CheckoutStrategy: repository.CheckoutStrategy{
			Branch: prd.git.Branch.ValueString(),
		},
+		ShallowClone: true,
	})
	if err != nil {
		return nil, fmt.Errorf("could not clone git repository: %w", err)
	}
	return gitClient, nil
}
Testing this change locally has shown an improvement in performance. It reduces the time required to clone the repository and should decrease the likelihood of timeouts when applying our Terraform configuration.
Would this be a reasonable proposal for a pull request? Let me know if there are other considerations I should account for.
Steps to reproduce
Note
The failure is transient, so reproducing it may be tricky.
1. Bootstrap a repository using flux_bootstrap_git by running terraform apply.
2. Modify a property in flux_bootstrap_git, which triggers a new commit to be pushed to the bootstrapped repository.
3. Reapply the Terraform configuration (terraform apply).
4. While Terraform is applying, continuously push new commits to the bootstrapped repository to intentionally disrupt the process. (This is easier to trigger if the repository is large and slow to clone.)
Expected behavior
Ideally, flux_bootstrap_git should be designed to scale efficiently and remain resilient under high-frequency repository operations, avoiding timeouts.
Screenshots and recordings
No response
Terraform and provider versions
Terraform v1.10.4
on linux_amd64
+ provider registry.terraform.io/fluxcd/flux v1.2.3
Terraform provider configurations
flux_bootstrap_git resource
Flux version
v2.2.3
Additional context
No response
Code of Conduct
Would you like to implement a fix?
Yes