[Bug]: Help Mitigate cannot lock ref Error with Shallow Clones #735

Open
1 task done
theMagicalKarp opened this issue Jan 24, 2025 · 0 comments

Describe the bug

I currently manage hundreds of Kubernetes clusters, all configured using Flux from a single GitOps repository. We use flux_bootstrap_git to manage the Flux installation for each cluster. On average, this repository receives new commits every minute during daytime hours.

This commit frequency causes problems when updating the flux_bootstrap_git resource. Whenever the Flux Terraform provider attempts to push a commit to our GitOps repository, it almost always times out with the following error:

│ failed to push manifests: failed to push to remote: command error on
│ refs/heads/main: cannot lock ref 'refs/heads/main': is at
│ da2267035aa50139f41df052947da4e85202c0f0 but expected
│ 71853c6197a6a7f222db0f1978c7cb232b87c5ee

To mitigate this, we’ve increased the timeouts, which has helped to some extent. However, we’ve observed that on every retry, the Terraform provider performs a full clone of the entire repository. This process is time-consuming, given that the repository has over 300,000 commits, and new commits are often added within the retry window.

A potential improvement could involve modifying the func (prd *providerResourceData) CloneRepository(ctx context.Context) function in internal/provider/provider_resource_data.go to use a shallow clone. Here’s an example of the proposed change:

func (prd *providerResourceData) CloneRepository(ctx context.Context) (*gogit.Client, error) {
	tmpDir, err := manifestgen.MkdirTempAbs("", "flux-bootstrap-")
	if err != nil {
		return nil, fmt.Errorf("could not create temporary working directory for git repository: %w", err)
	}
	gitClient, err := prd.GetGitClient(tmpDir)
	if err != nil {
		return nil, fmt.Errorf("could not create git client: %w", err)
	}
	// TODO: Need to conditionally clone here. If repository is empty this will fail.
	_, err = gitClient.Clone(ctx, prd.GetRepositoryURL().String(), repository.CloneConfig{
		CheckoutStrategy: repository.CheckoutStrategy{
			Branch: prd.git.Branch.ValueString(),
		},
+		ShallowClone: true,
	})
	if err != nil {
		return nil, fmt.Errorf("could not clone git repository: %w", err)
	}
	return gitClient, nil
}

In local testing, this change reduced the time required to clone the repository, which should decrease the likelihood of timeouts when applying our Terraform configuration.

Would this be a reasonable proposal for a pull request? Let me know if there are other considerations I should account for.

Steps to reproduce

Note

The failure is transient, so reproducing it may be tricky.

  1. Bootstrap a repository using flux_bootstrap_git by running terraform apply.
  2. Modify a property in flux_bootstrap_git, which triggers a new commit to be pushed to the bootstrapped repository.
  3. Reapply the Terraform configuration (terraform apply).
  4. While Terraform is applying, continuously push new commits to the bootstrapped repository to intentionally disrupt the process. (This is especially helpful if the repository is large and slow to clone.)

Expected behavior

Ideally, flux_bootstrap_git should be designed to scale efficiently and remain resilient under high-frequency repository operations, avoiding timeouts.

Screenshots and recordings

No response

Terraform and provider versions

Terraform v1.10.4
on linux_amd64
+ provider registry.terraform.io/fluxcd/flux v1.2.3

Terraform provider configurations

provider "flux" {
  kubernetes = {
    host                   = var.kubernetes.host
    cluster_ca_certificate = base64decode(var.kubernetes.ca_certificate)
    token                  = var.kubernetes.token
  }
  git = {
    branch = var.branch
    url    = "ssh://git@github.com/${var.github_owner}/${var.repository_name}.git"
    ssh = {
      username    = "git"
      private_key = tls_private_key.main.private_key_pem
    }
  }
}

flux_bootstrap_git resource

resource "flux_bootstrap_git" "this" {
  depends_on = [github_repository_deploy_key.main]

  path             = var.target_path
  components_extra = var.components_extra
  kustomization_override = templatefile("${path.module}/kustomization.tftpl.yaml", {
    // .....
  })

  timeouts = {
    create = "30m"
    delete = "30m"
    update = "30m"
    read   = "10m"
  }
}

Flux version

v2.2.3

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Would you like to implement a fix?

Yes
