
Horizontally Scalable providers #739

Open
tchinmai7 opened this issue Jun 23, 2024 · 2 comments
Labels: enhancement (New feature or request)


@tchinmai7

What problem are you facing?

Crossplane providers are currently deployed as single instances of the provider binary, and that single instance is responsible for managing multiple Managed Resources. This model works well for providers that perform one-off or quick operations, such as creating a cloud resource, but it becomes tricky for long-running operations, as is the case with provider-ansible.

Provider-ansible runs user-provided Ansible playbooks as its reconciliation loop, and the existing concurrency-control mechanisms, such as --max-reconcile-rate and timeouts, don't help much here.

How could Crossplane help solve your problem?

One potential solution is to enable horizontally scalable providers, in which multiple provider pods share responsibility for the managed resources. Each pod attempts to acquire a lease on a Managed Resource and reconciles it if and only if the lease was successfully acquired. Another option is a leader/worker approach, with the leader assigning resources to workers.

To implement the lease mechanism in crossplane-runtime we could:
a) Introduce a new method, Lock, that provider implementers can optionally call before Connect to acquire a lease on the MR.
b) Modify Connect to do the lease acquisition by default. For providers running as a single pod this is essentially a no-op, and for horizontally scaled providers it is transparent.

Slack discussion in provider-ansible channel - https://crossplane.slack.com/archives/C043WMY9UJE/p1718382530846819

@tchinmai7 tchinmai7 added the enhancement New feature or request label Jun 23, 2024
bobh66 (Contributor) commented Jun 23, 2024

There have been similar discussions from the provider-terraform community - see upbound/provider-terraform#189

There is definitely a difference between horizontal scaling of the provider/controller, which would require sharding of the resource UIDs and changes in controller-runtime, and moving to a job-dispatch model as discussed here and in the other issue.

provider-terraform uses Terraform's native state locking to prevent multiple commands from running at the same time on the same workspace, so theoretically a dispatch model would not need to change that.

The workaround in provider-terraform is basically to throw more CPUs at the provider and increase --max-reconcile-rate accordingly. That approach has other issues with shared provider caches and cache-locking problems.

I wonder if this is a case where a different provider deployment model would be appropriate, for example something that dispatches Jobs instead of using the standard provider Deployment. That's probably a stretch but maybe something to think about. Maybe @negz has some ideas as well.

tchinmai7 (Author) commented

Thanks for the link to the other discussion! The idea proposed in that issue was another approach I considered and discussed with the provider-ansible maintainers, but it seemed like a much larger architectural change than the locking model.

In our usage of provider-ansible, we've done pretty much the same thing: increased --max-reconcile-rate and thrown more CPU at it. Our use case has reached a point where this isn't sustainable, and some Ansible runs take more than two hours. Hence the motivation for this idea: it would allow multiple provider pods to do the work and spread the load.

Can you elaborate a bit more on this?

I wonder if this is a case where a different provider deployment model would be appropriate, for example something that dispatches Jobs instead of using the standard provider Deployment.
