Increase provider performance - move to no-fork/forkless architecture (free from Terraform CLI) #226

Open
denniskniep opened this issue Jan 27, 2025 · 4 comments

Comments

@denniskniep
Contributor

Currently the provider runs into limitations when managing many resources (thousands). These thresholds are reached quickly with users, roles and groups.

Therefore it would make sense to increase provider performance (at least for certain resources) by moving to the no-fork/forkless architecture, which does not depend on the Terraform CLI.

Currently the no-fork architecture cannot be produced by upjet code generation; the migration process is specific to each provider.
A prerequisite is that we use Upjet v1.0.0 or above.

As a PoC we should choose one resource to move to the no-fork architecture:

  • Put the resource into a different list in externalname.go. For an example, see externalname.go in provider-upjet-aws (TerraformPluginFrameworkExternalNameConfigs and TerraformPluginSDKExternalNameConfigs)
  • Adapt ExternalNameConfigurations() to match the one in the AWS provider
  • Configure the Keycloak client for the no-fork architecture. For a simple approach, see provider-upjet-gcp. If that does not work, use the advanced approach implemented in provider-upjet-aws.
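The first two steps could look roughly like the sketch below. It is a simplified, self-contained illustration and not the actual upjet API: the `ExternalName` struct, the `cliExternalNameConfigs` map and the `lookupExternalName` helper are hypothetical stand-ins, the map entries are examples only, and the lookup precedence (Framework, then SDK, then CLI) is an assumption. Only the two exported map names are taken from provider-upjet-aws.

```go
package main

// ExternalName is a hypothetical stand-in for upjet's external-name
// configuration type; the real type carries more fields.
type ExternalName struct {
	Strategy string
}

// The two exported map names follow provider-upjet-aws.
// All entries below are illustrative only.
var TerraformPluginFrameworkExternalNameConfigs = map[string]ExternalName{}

var TerraformPluginSDKExternalNameConfigs = map[string]ExternalName{
	// The PoC resource is moved into this list.
	"keycloak_group": {Strategy: "identifier-from-provider"},
}

// cliExternalNameConfigs is a hypothetical name for the list of
// resources that have not been migrated yet.
var cliExternalNameConfigs = map[string]ExternalName{
	"keycloak_user": {Strategy: "identifier-from-provider"},
}

// lookupExternalName returns the configuration for a resource and the
// architecture it is reconciled with. Assumed precedence:
// framework, then sdk, then cli.
func lookupExternalName(name string) (ExternalName, string, bool) {
	if e, ok := TerraformPluginFrameworkExternalNameConfigs[name]; ok {
		return e, "framework", true
	}
	if e, ok := TerraformPluginSDKExternalNameConfigs[name]; ok {
		return e, "sdk", true
	}
	if e, ok := cliExternalNameConfigs[name]; ok {
		return e, "cli", true
	}
	return ExternalName{}, "", false
}
```

With these example entries, `lookupExternalName("keycloak_group")` reports the `sdk` architecture, while `keycloak_user` would still be reconciled through the CLI path until it is moved to another list.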

Another benefit of the no-fork architecture, besides increased performance, is that the Terraform license change no longer affects us, because the Terraform CLI is no longer used.

More Information:

@Breee
Collaborator

Breee commented Jan 27, 2025

Hm, I thought that's what I did with #124.

@denniskniep
Contributor Author

denniskniep commented Feb 2, 2025

Yeah, you are right. I also checked the stack trace (not for every resource), and no Terraform CLI is invoked; the upjet controller calls the Go code of the Terraform provider directly.

Image

I ran some very simple performance tests with a single Keycloak provider pod at version v1.10.1.

I scripted a loop that creates, per run:

  • 4 Users
  • 2 Groups
  • 2 Memberships for the created groups containing all users as members

I ran it ~750 times in a single thread. These are the results:

Image Image

Image
(0.5 quantile)

Image Image

(0.5 quantile)

All of that took ~40min for ~6000 Resources

The pod used up to 2.5 GB of RAM.

That seems a bit too slow and resource-intensive compared to the other providers that moved to forkless (mentioned in this Blog Post).

Am I missing something? Any idea what exactly slows it down?

@Breee
Collaborator

Breee commented Feb 3, 2025

Do the resources depend on each other?

In the past I have seen many Keycloak API calls take a long time under certain circumstances (especially when using user federation via AD/LDAP).
I have also seen it get really slow when the resources are created in the wrong order.

There should also be reconcile settings for this case. We can experiment with those; maybe we are missing something there.

@denniskniep
Contributor Author

denniskniep commented Feb 6, 2025

Thanks for the hint. I removed AD/LDAP user federation from Keycloak.
Furthermore, I checked that the resources are created in the correct order (with regard to ref props); that was already the case.
I ran the script again, but saw no real improvement. Here are the results:

Image

Image

Image

Image

Image

All of that took again ~40min for ~6000 Resources

The pod used up to 3.3 GB of RAM.

I see a lot of these errors in the log:

"error":"cannot update status of the resource group.keycloak.crossplane.io/v1alpha1, Kind=Group/idp-2-group-3-752 after an async create: Operation cannot be fulfilled on groups.group.keycloak.crossplane.io "idp-2-group-3-752": the object has been modified; please apply your changes to the latest version and try again"}

There are a lot of retries per second, probably due to that error:
Image

Regarding your proposal to modify the reconcile settings, are you referring to these?
https://docs.crossplane.io/v1.18/concepts/pods/#reconcile-loop

And do you propose changing the values here?

Options: xpcontroller.Options{
	Logger:                  log,
	GlobalRateLimiter:       ratelimiter.NewGlobal(*maxReconcileRate),
	PollInterval:            *pollInterval,
	MaxConcurrentReconciles: *maxReconcileRate,
	Features:                featureFlags,
	MetricOptions: &xpcontroller.MetricOptions{
		PollStateMetricInterval: *pollStateMetricInterval,
		MRMetrics:               metricRecorder,
		MRStateMetrics:          stateMetrics,
	},
},
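Note that in the snippet above a single value, *maxReconcileRate, feeds both the global rate limiter and MaxConcurrentReconciles. The sketch below illustrates that coupling using only the standard library `flag` package. The `reconcileOptions` type and `parseReconcileOptions` function are hypothetical, and the real provider may parse its flags with a different library; the flag name and its default of 10 are the ones mentioned in this thread.

```go
package main

import "flag"

// reconcileOptions is a hypothetical, reduced stand-in for the two
// fields of xpcontroller.Options that depend on the reconcile rate.
type reconcileOptions struct {
	GlobalRateLimit         int
	MaxConcurrentReconciles int
}

// parseReconcileOptions parses --max-reconcile-rate and derives both
// settings from it, mirroring how the snippet above uses one flag twice.
func parseReconcileOptions(args []string) (reconcileOptions, error) {
	fs := flag.NewFlagSet("provider", flag.ContinueOnError)
	maxReconcileRate := fs.Int("max-reconcile-rate", 10,
		"global reconcile rate limit and max concurrent reconciles")
	if err := fs.Parse(args); err != nil {
		return reconcileOptions{}, err
	}
	return reconcileOptions{
		GlobalRateLimit:         *maxReconcileRate,
		MaxConcurrentReconciles: *maxReconcileRate,
	}, nil
}
```

Under this assumption, raising the flag increases the number of concurrent workers and the allowed request rate at the same time, so the two cannot be tuned independently without changing the code.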

If I increase --max-reconcile-rate from 10 (default) to 30, I get even more of these errors:

the object has been modified; please apply your changes to the latest version and try again

and furthermore these errors:

request.go:697] Waited for 1.362800865s due to client-side throttling, not priority and fairness, request

I think it would make sense to dig into why the error "the object has been modified; please apply your changes to the latest version and try again" occurs.

Is this happening when two controllers (or controller threads) race to update a resource?
