@naemono
Contributor

@naemono naemono commented Dec 4, 2025

Resolves #8789

What is this change?

This adds a new CRD AutoOpsAgentPolicy that allows Elastic AutoOps to be integrated into self-managed ECK clusters.

TODO

  • Allow parts of the configuration (configmap) to be overridden
  • Allow cleanup of orphaned Agents and their relevant data.

Implementation Notes

  • For each ES cluster, the CA is copied to the namespace of the AutoOps policy, an API key is created in the ES cluster for communication, and an additional secret containing that API key is created.
  • Currently, if the policy is in the same namespace as the ECK operator, the query for ES clusters is cluster-scoped; if it is outside the operator namespace, the query is namespace-scoped. This follows what we did for SSP, but recent discussions are questioning this behavior. (It could soon change to always default to cluster-scoped, which seems to make sense.)
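
The scoping rule in the last note can be sketched as a small helper. This is a minimal sketch, not the code in this PR: `listNamespace` and its parameter names are hypothetical, and `namespaceAll` only mirrors the Kubernetes convention that an empty namespace means "all namespaces".

```go
package main

import "fmt"

// namespaceAll mirrors metav1.NamespaceAll from k8s.io/apimachinery:
// an empty namespace means "list across all namespaces".
const namespaceAll = ""

// listNamespace (hypothetical helper) returns the namespace to use when
// listing Elasticsearch clusters for a policy. A policy living in the
// operator namespace sees the whole cluster; any other policy only sees
// its own namespace.
func listNamespace(policyNamespace, operatorNamespace string) string {
	if policyNamespace == operatorNamespace {
		return namespaceAll
	}
	return policyNamespace
}

func main() {
	fmt.Printf("%q\n", listNamespace("elastic-system", "elastic-system"))
	fmt.Printf("%q\n", listNamespace("team-a", "elastic-system"))
}
```

If the behavior changes to always cluster-scoped, the helper collapses to returning `namespaceAll` unconditionally.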

Needs testing

  • All Helm Charts

naemono added 26 commits October 7, 2025 13:25
Signed-off-by: Michael Montgomery <[email protected]>
@prodsecmachine
Collaborator

prodsecmachine commented Dec 4, 2025

Snyk checks have passed. No issues have been found so far.

Scanner               Critical  High  Medium  Low  Total (0)
Open Source Security  0         0     0       0    0 issues
Licenses              0         0     0       0    0 issues

@github-actions

github-actions bot commented Dec 4, 2025

🔍 Preview links for changed docs

	SSLEnabled: sslEnabled,
	CACertPath: caCertPath,
}
if err := tmpl.Execute(&configBuf, templateData); err != nil {
Contributor

Just checking my understanding: if SSL is not enabled, the CA cert path is going to be "". Is that fine from an AutoOps config standpoint?
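
One way to make the empty-path case harmless is to emit the CA setting only when SSL is enabled and a path is set. The sketch below assumes this approach and is not taken from the PR: only the field names `SSLEnabled` and `CACertPath` come from the diff, while the config keys, the template text, and the `render` helper are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// templateData uses the two field names visible in the diff.
type templateData struct {
	SSLEnabled bool
	CACertPath string
}

// configTmpl is a hypothetical stand-in for the real agent config template.
// The CA line is rendered only when SSL is enabled AND the path is non-empty.
const configTmpl = `ssl.enabled: {{ .SSLEnabled }}
{{- if and .SSLEnabled .CACertPath }}
ssl.certificate_authorities: ["{{ .CACertPath }}"]
{{- end }}
`

// render executes the template into a string.
func render(data templateData) (string, error) {
	tmpl, err := template.New("config").Parse(configTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, d := range []templateData{
		{SSLEnabled: false},
		{SSLEnabled: true, CACertPath: "/mnt/ca.crt"},
	} {
		out, err := render(d)
		if err != nil {
			panic(err)
		}
		fmt.Print(out, "---\n")
	}
}
```

With this shape, a disabled-SSL config never contains an empty CA path, so the agent never has to interpret "".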


if err := cleanupAutoOpsESAPIKey(ctx, r.Client, r.esClientProvider, r.params.Dialer, obj.Namespace, obj.Name, es); err != nil {
	log.Error(err, "Failed to cleanup API key for Elasticsearch cluster", "es_namespace", esNamespace, "es_name", esName)
	continue
Contributor

Why do we continue on errors in this function instead of requeuing? Sorry if I'm missing something 🤔

Contributor Author

I've compared the functionality of this onDelete with the other controllers, and this seems to be the pattern we follow. The onDelete is returned from the primary Reconcile function, which does an automatic retry on error.

Contributor

But then do we need a finalizer to ensure that the AutoOps policy is not deleted until we have cleaned up the API keys? Sorry if we are already adding a finalizer and I missed it.

Contributor Author

@kvalliyurnatt I think the "Todo" item in the readme covers this. Peter mentioned this previously I believe:

Also I don't think it handles the case where a user changes the label selector on the policy. See this section from the design:

Redefinition of selector. Problem: we now have orphaned AutoOps agents
Same approach as outlined below for deletion: add extra label metadata to track origin of deployments, diff existing vs expected, delete unwanted, remove unnecessary API keys/users from no longer instrumented clusters.

Do you agree?
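
The "diff existing vs expected" step from the quoted design could look roughly like the sketch below. It is illustrative only: `orphaned` is a hypothetical helper, and the `namespace/name` strings stand in for whatever origin labels would actually identify a deployment.

```go
package main

import (
	"fmt"
	"sort"
)

// orphaned (hypothetical) returns the entries that currently exist but are
// no longer expected, e.g. agents for clusters the selector stopped matching.
func orphaned(existing, expected []string) []string {
	want := make(map[string]struct{}, len(expected))
	for _, n := range expected {
		want[n] = struct{}{}
	}
	var out []string
	for _, n := range existing {
		if _, ok := want[n]; !ok {
			out = append(out, n)
		}
	}
	// Sort for deterministic output and stable logs.
	sort.Strings(out)
	return out
}

func main() {
	existing := []string{"ns1/es-a", "ns2/es-c", "ns1/es-b"}
	expected := []string{"ns1/es-a"}
	fmt.Println(orphaned(existing, expected))
}
```

Each returned entry would then need its deployment deleted and its API key removed from the corresponding cluster.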

	wantErr bool
}{
	{
		name: "cleanup API keys for single ES cluster",
Contributor

Is there a way for us to test the negative cases here, such as the ES API failing with an error other than not found?


// Only increment readyCount if there are no errors in previous ES instances.
if isDeploymentReady(reconciledDeployment) && errorCount == 0 {
	readyCount++
Contributor

@kvalliyurnatt kvalliyurnatt Dec 11, 2025

I did not understand this. Say we have three ES instances: the first had no error, the second had an error, and the third had no errors. In this case, shouldn't the ready count be 2? Since we never reset errorCount between iterations, the ready count will be 1, right?
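
The three-instance scenario can be reproduced with a small standalone sketch. It assumes that a failed instance increments `errorCount` and skips the readiness check; the `result` type and both counting functions are hypothetical simplifications of the reconcile loop.

```go
package main

import "fmt"

// result is a hypothetical per-instance outcome.
type result struct {
	ready  bool
	failed bool
}

// countGated mirrors the quoted condition: errorCount is cumulative, so
// any earlier failure stops later ready instances from being counted.
func countGated(rs []result) int {
	readyCount, errorCount := 0, 0
	for _, r := range rs {
		if r.failed {
			errorCount++
			continue
		}
		if r.ready && errorCount == 0 {
			readyCount++
		}
	}
	return readyCount
}

// countPerInstance counts every instance that is ready on its own,
// regardless of failures elsewhere.
func countPerInstance(rs []result) int {
	readyCount := 0
	for _, r := range rs {
		if !r.failed && r.ready {
			readyCount++
		}
	}
	return readyCount
}

func main() {
	rs := []result{{ready: true}, {failed: true}, {ready: true}}
	fmt.Println("gated:", countGated(rs))              // gated: 1
	fmt.Println("per instance:", countPerInstance(rs)) // per instance: 2
}
```

Which count is right depends on whether the status is meant to report individual agent readiness or to stay conservative whenever any instance has failed.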

	return r.internalReconcile(ctx, policy, results, state)
}

func (r *ReconcileAutoOpsAgentPolicy) internalReconcile(
Contributor

I think we should add a unit test for this function.

Move unit tests

use apikey constant

Labels

>feature Adds or discusses adding a feature to the product

Development

Successfully merging this pull request may close these issues.

[cloud connected mode] Manage autoOps agents in a native manner in ECK

5 participants