Support remote clusters using API keys #8089

barkbay · 2024-10-09T07:52:00Z

This PR allows defining API keys for remote cluster connections.

I'm creating this PR as a draft to get initial feedback on the overall architecture and on how the API keys are managed.

The examples in this comment refer to the recipe available in config/recipes/remoteclusters/elasticsearch.yaml.

No new controller, `remoteca` ➡️ `remotecluster-controller`

API key reconciliation has been added to the existing remoteca controller, which previously handled only certificate management.
The controller has therefore been renamed remotecluster-controller and now also reconciles API keys in both:

The Elasticsearch API of the remote cluster
The secure settings of the client cluster

API Key management

In the remote Elasticsearch cluster

API keys created in the remote cluster by the operator use the following naming convention: eck-<client_cluster_ns>-<client_cluster_name>-<remote_cluster_name>.
Some metadata, such as the client cluster name or the hash of the expected key specification, is attached to the API keys to facilitate their reconciliation:

{
  "api_keys": [
    {
      "id": "-yvMb5IBPP7lgi3Dtn2m",
      "name": "eck-ns1-cluster1-to-ns2-cluster2",
      "metadata": {
        "elasticsearch.k8s.elastic.co/config-hash": "660014403",
        "elasticsearch.k8s.elastic.co/name": "cluster1",
        "elasticsearch.k8s.elastic.co/namespace": "ns1",
        "elasticsearch.k8s.elastic.co/uid": "831b535e-c031-4b4d-aba5-3ad30271f5cc",
        "elasticsearch.k8s.elastic.co/managed-by": "eck"
      },
      "access": {
        "search": [
          {
            "names": [
              "kibana_sample_data_ecommerce"
            ]
          }
        ]
      }
    }
  ]
}
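As a rough illustration of the naming convention and the config-hash metadata above, here is a minimal Go sketch. The helper names (apiKeyName, specHash, needsUpdate) and the choice of FNV-32a over the JSON-encoded access section are assumptions made for this example, not necessarily what the operator does; the remote cluster name to-ns2-cluster2 comes from the recipe.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"strconv"
)

// search and access mirror the "access" section of the API key shown above.
type search struct {
	Names []string `json:"names"`
}

type access struct {
	Search []search `json:"search,omitempty"`
}

// apiKeyName builds eck-<client_cluster_ns>-<client_cluster_name>-<remote_cluster_name>.
func apiKeyName(clientNamespace, clientName, remoteClusterName string) string {
	return fmt.Sprintf("eck-%s-%s-%s", clientNamespace, clientName, remoteClusterName)
}

// specHash returns a stable hash of the expected key specification.
// Hypothetical: FNV-32a over the JSON encoding is an assumption for this sketch.
func specHash(a access) (string, error) {
	raw, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	_, _ = h.Write(raw)
	return strconv.FormatUint(uint64(h.Sum32()), 10), nil
}

// needsUpdate reports whether the hash stored in the key metadata differs
// from the hash of the expected specification.
func needsUpdate(storedHash string, expected access) (bool, error) {
	expectedHash, err := specHash(expected)
	if err != nil {
		return false, err
	}
	return storedHash != expectedHash, nil
}

func main() {
	spec := access{Search: []search{{Names: []string{"kibana_sample_data_ecommerce"}}}}
	hash, err := specHash(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(apiKeyName("ns1", "cluster1", "to-ns2-cluster2"), hash)
}
```

Comparing a hash stored in the key metadata avoids a field-by-field comparison of the access definition on every reconciliation.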

In the client Elasticsearch cluster

Encoded keys used by the client cluster to authenticate against the remote one are stored in a new dedicated Secret on the client cluster (this Secret is handled using the APIKeyStore in pkg/controller/remotecluster/keystore.go):

> kubectl get secret -l common.k8s.elastic.co/type=remote-cluster-api-keys -A
NAMESPACE   NAME                          TYPE     DATA   AGE
ns1         cluster1-es-remote-api-keys   Opaque   1      13m

We track the API key IDs in an annotation, which is used to detect when a key has been externally invalidated:

>kubectl get secret cluster1-es-remote-api-keys -n ns1 -o=jsonpath='{.metadata.annotations.elasticsearch\.k8s\.elastic\.co\/remote-cluster-api-keys}'  | jq

{
  "to-ns2-cluster2": {
    "namespace": "ns2",
    "name": "cluster2",
    "id": "-yvMb5IBPP7lgi3Dtn2m"
  }
}
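A minimal Go sketch of how such an annotation could be used to detect externally invalidated keys: the tracked IDs are compared with the IDs of the keys that are still active in the remote cluster. The type and function names (trackedKey, staleAliases) are hypothetical and only serve this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// trackedKey mirrors one entry of the annotation shown above.
type trackedKey struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	ID        string `json:"id"`
}

// staleAliases returns the remote cluster aliases whose tracked key ID is not
// among the active key IDs reported by the remote cluster.
func staleAliases(annotation string, activeIDs map[string]bool) ([]string, error) {
	tracked := map[string]trackedKey{}
	if err := json.Unmarshal([]byte(annotation), &tracked); err != nil {
		return nil, err
	}
	var stale []string
	for alias, key := range tracked {
		if !activeIDs[key.ID] {
			stale = append(stale, alias)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale, nil
}

func main() {
	annotation := `{"to-ns2-cluster2":{"namespace":"ns2","name":"cluster2","id":"-yvMb5IBPP7lgi3Dtn2m"}}`
	// No active key is reported here, so the tracked key is considered invalidated.
	stale, err := staleAliases(annotation, map[string]bool{})
	if err != nil {
		panic(err)
	}
	fmt.Println(stale)
}
```

Any alias returned by staleAliases would need a new key to be generated and stored in the Secret.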

This Secret is:

Loaded as part of the existing secure setting mechanism (which triggers a restart when updated, refer to section "To be discussed in dedicated issues/prs").
Not created until there are some keys to be stored.

Is it possible to go back to the "legacy" remote cluster mode once API keys are enabled?

Yes, you just have to delete the API key from the custom resource. However, this may result in downtime.

Todo

There are still a few things I'm not happy with. For example, I'm wondering if we need an expectation mechanism for the Secrets that hold the encoded keys: if a key and its Secret are created but the Secret is not observed in the next reconciliation, we may invalidate the key that has just been created.

Other things to complete before merging or going GA:

Documentation: Add documentation for remote clusters using API keys #8167
E2E Tests
Evaluate how this works with custom certificates on the transport layer, if there is any new requirement.

To be discussed in dedicated issues/prs

Avoid the restart of the whole cluster when a new API key is generated.

Testing

As previously mentioned there is an example in config/recipes/remoteclusters/elasticsearch.yaml.

Here are a few API calls that you may find useful while testing this PR:

GET /_remote/info to get the current remote cluster status. Note that the connected field may not be immediately refreshed.
GET /_security/api_key?active_only=true&name=eck-* to get the active keys created by the operator.
GET /to-ns2-cluster2:kibana_sample_data_ecommerce/_search: when using the manifest in the recipe and run on ns1/cluster1, this should return the data from ns2/cluster2.

Fixes #7818

barkbay · 2024-10-09T11:33:22Z

Please do not spend time on the case where node transport certificates are provided by a third-party tool (https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-transport-settings.html#k8s-transport-third-party-tools). This needs some adjustments; I'm working on it.

barkbay · 2024-10-10T08:15:10Z

pkg/controller/elasticsearch/settings/merged_config.go

-		// cluster is going to try to connect to the remote cluster service using the Service and each specific Pod IP.
-		cfg[esv1.RemoteClusterPublishHost] = "${" + EnvRemoteClusterService + "}.${" + EnvNamespace + "}.svc"
-		cfg[esv1.RemoteClusterBindHost] = "0.0.0.0"
+		cfg[esv1.RemoteClusterPublishHost] = "${" + EnvPodName + "}.${" + HeadlessServiceName + "}.${" + EnvNamespace + "}.svc"


By default, the published host is the Pod's IP address. While that IP address is automatically added to the ECK-managed transport certificate, it is not possible to include it when using cert-manager (cert-manager/csi-driver#17).

That's why I decided to use the Pod hostname as available through the existing headless Service (so it can be resolved by other Pods).

It still has the downside that when using cert-manager CSI driver, the csi.cert-manager.io/dns-names is now a bit involved, something along the lines of:

- name: transport-certs
  csi:
    driver: csi.cert-manager.io
    readOnly: true
    volumeAttributes:
      csi.cert-manager.io/issuer-name: ca-cluster-issuer
      csi.cert-manager.io/issuer-kind: ClusterIssuer
      csi.cert-manager.io/dns-names: "${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local,${POD_NAME}.<cluster-name>-es-<nodeset-name>.${POD_NAMESPACE}.svc,<cluster-name>-es-remote-cluster.${POD_NAMESPACE}.svc"

${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local is the existing, recommended DNS name, from our documentation (I guess it only works because of verification_mode: certificate in the transport configuration)

${POD_NAME}.<cluster-name>-es-<nodeset-name>.${POD_NAMESPACE}.svc is to match the published host.

<cluster-name>-es-remote-cluster.${POD_NAMESPACE}.svc is to match the remote cluster service.

(I think an alternative would be to try to use verification_mode: certificate for the remote cluster server, this is something I wanted to avoid)
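To make the composition of the three DNS names easier to follow, here is a small Go sketch that assembles the dns-names value for a given cluster and nodeSet. The function name transportDNSNames is hypothetical; the ${POD_NAME} and ${POD_NAMESPACE} placeholders are kept verbatim, as in the volume definition above.

```go
package main

import (
	"fmt"
	"strings"
)

// transportDNSNames returns a value for csi.cert-manager.io/dns-names.
// The ${POD_NAME} and ${POD_NAMESPACE} placeholders are intentionally not expanded here.
func transportDNSNames(clusterName, nodeSetName string) string {
	names := []string{
		// Existing DNS name recommended in the documentation.
		"${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local",
		// Matches the published host (Pod hostname through the headless Service).
		fmt.Sprintf("${POD_NAME}.%s-es-%s.${POD_NAMESPACE}.svc", clusterName, nodeSetName),
		// Matches the remote cluster Service.
		fmt.Sprintf("%s-es-remote-cluster.${POD_NAMESPACE}.svc", clusterName),
	}
	return strings.Join(names, ",")
}

func main() {
	fmt.Println(transportDNSNames("cluster2", "default"))
}
```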

I think this is fine

barkbay · 2024-10-11T11:07:51Z

buildkite test this -f p=gke,t=TestRemoteClusterWithAPIKeys -m s=8.9.2,s=8.15.2,s=8.16.0-SNAPSHOT

pebrc

I spent a bit of time today testing your PR and it is really impressive how well it worked. 🚀 I am not sure an expectation mechanism on the Secrets is needed: IIUC, the worst case is that we re-create an API key?

What I found slightly difficult to reason about is the reconciliation in the remote cluster controller. I know that this is largely existing code into which the new functionality was inserted. But maybe we can do something about the naming. IIUC the local cluster is always the remote server, and I think this threw me off because in my head local meant the client that makes calls to a remote using an API key.

pkg/apis/elasticsearch/v1/remote_cluster.go

pkg/controller/elasticsearch/client/remote_cluster.go

pebrc · 2024-10-11T12:35:23Z

pkg/controller/elasticsearch/settings/merged_config.go

-		// cluster is going to try to connect to the remote cluster service using the Service and each specific Pod IP.
-		cfg[esv1.RemoteClusterPublishHost] = "${" + EnvRemoteClusterService + "}.${" + EnvNamespace + "}.svc"
-		cfg[esv1.RemoteClusterBindHost] = "0.0.0.0"
+		cfg[esv1.RemoteClusterPublishHost] = "${" + EnvPodName + "}.${" + HeadlessServiceName + "}.${" + EnvNamespace + "}.svc"


I think this is fine

pkg/controller/remotecluster/apikey.go

pkg/controller/remotecluster/controller.go

pkg/controller/remotecluster/apikey.go

pebrc · 2024-10-11T15:01:21Z

pkg/controller/remotecluster/apikey.go

+	}
+
+	// Save the generated keys in the keystore.
+	if err := clientClusterAPIKeyStore.Save(ctx, c, clientES); err != nil {


If I understand correctly, we reconcile from the server's perspective, which means that if a cluster is a client of multiple remotes, multiple threads of reconciliation will try to update the same Secret? Potentially at the same time, depending on the parallelism the controller is configured with? Do you think this will be a problem in practice? Did you consider doing it the other way round?

Did you consider doing it the other way round?

Yes. Since we need an Elasticsearch client to reconcile the API keys, it seemed more "natural" to me to create the ES client "once" and then reconcile/propagate the API keys for all the client clusters in the same loop, when the remote cluster is reconciled (as opposed to creating n Elasticsearch clients, one for each of the remote clusters, to reconcile the API keys). My feeling was that, in case of a conflict on the Secret, the consequence would be that we would "just" need to create a new API key during the next iteration and try to store it again, which seemed acceptable.

In any case one thing we may want to solve is when the API key is not immediately observed in the Secret (because of the client's cache). This would require a kind of "expectation mechanism": the generated keys would be stored in memory until they are persisted. This would prevent the conflict side effect when the Secret is created/updated since we would no longer need to generate a new key because we still have it in memory.
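A minimal Go sketch of the kind of "expectation mechanism" described above, assuming the generated keys are kept in an in-memory map until the Secret is seen to contain them. The pendingKeys type and its methods are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingKey is a generated API key that may not be visible in the cached Secret yet.
type pendingKey struct {
	ID      string
	Encoded string
}

// pendingKeys keeps generated keys in memory until they are observed in the Secret.
// Hypothetical type for this sketch.
type pendingKeys struct {
	mu   sync.Mutex
	keys map[string]pendingKey // indexed by remote cluster alias
}

func newPendingKeys() *pendingKeys {
	return &pendingKeys{keys: map[string]pendingKey{}}
}

// Expect records a freshly generated key for the given remote cluster alias.
func (p *pendingKeys) Expect(alias string, key pendingKey) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.keys[alias] = key
}

// Observe is called with the key ID currently visible in the Secret ("" if none).
// It forgets the pending entry once the Secret has caught up; otherwise it returns
// the pending key, which can be stored again instead of generating a new one.
func (p *pendingKeys) Observe(alias, observedID string) (pendingKey, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	key, ok := p.keys[alias]
	if !ok {
		return pendingKey{}, false
	}
	if key.ID == observedID {
		delete(p.keys, alias)
		return pendingKey{}, false
	}
	return key, true
}

func main() {
	p := newPendingKeys()
	p.Expect("to-ns2-cluster2", pendingKey{ID: "id1", Encoded: "encoded-key"})
	// The cache has not caught up yet: reuse the in-memory key.
	key, pending := p.Observe("to-ns2-cluster2", "")
	fmt.Println(key.ID, pending)
}
```

With such a structure, a stale cache or a conflict on the Secret would lead to another attempt to persist the same key rather than to the creation of a new one.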

Makes sense. It is probably computationally more expensive to create many ES clients with certificate pools than the current approach which minimises the number of clients created at the cost of potential conflicts.

One downside of the current approach that occurred to me during testing is that the reconciliation, which optimises for creating fewer clients for the remote server clusters, is more disruptive to the client clusters, as every new remote cluster connection requires a full restart of the cluster to update the keystores. I don't think this is a problem in practice; I can only think of very rare scenarios where this might be observable (adding remote clusters within very short intervals but not at once).

pkg/controller/remotecluster/keystore.go

barkbay · 2024-10-16T12:59:03Z

I removed the draft status. I still want to work on the documentation and add some additional unit tests, but happy to get additional feedback in the meantime.

pkg/apis/elasticsearch/v1/elasticsearch_types.go

pkg/controller/remotecluster/controller.go

pkg/controller/remotecluster/keystore/changes_tracker.go

pebrc · 2024-10-20T08:28:39Z

pkg/controller/remotecluster/controller.go

+		// Check that the API is available
+		esClient = newEsClient
+		// Get all the API Keys, for that specific client, on the reconciled cluster.
+		getCrossClusterAPIKeys, err := esClient.GetCrossClusterAPIKeys(ctx, "eck-*")


Should we try to handle the case where a new cluster is not reachable yet to avoid noisy error logs like this:

manager.eck-operator Reconciler error {"service.version": "2.15.0-SNAPSHOT+65769109", "controller": "remotecluster-controller", "object": {"name":"cluster3","namespace":"ns2"}, "namespace": "ns2", "name": "cluster3", "reconcileID": "8aebea3f-118d-474e-b4c8-4b55efc26141", "error": "elasticsearch client failed for https://cluster3-es-default-0.cluster3-es-default.ns2:9200/_security/api_key?active_only=true&error_trace=true&name=eck-%2A: Get \"https://cluster3-es-default-0.cluster3-es-default.ns2:9200/_security/api_key?active_only=true&error_trace=true&name=eck-%2A\": EOF"} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler /Users/pebrc/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem /Users/pebrc/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263

This should be the case:

cloud-on-k8s/pkg/controller/remotecluster/controller.go

Lines 181 to 186 in 6576910

// Check if the ES API is available. We need it to create, update and invalidate
// API keys in this cluster.
if !services.NewElasticsearchURLProvider(*localEs, r.Client).HasEndpoints() {
	log.Info("Elasticsearch API is not available yet")
	return results.WithResult(defaultRequeue).Aggregate()
}

Maybe there's a bug, this EOF seems a bit odd though 🤔 , I would expect a connection refused or another error, EOF feels like the connection was initiated but unexpectedly closed.

Could you check if this error comes from our port-forwarder "hack" used in dev mode please, maybe by looking at the full stacktrace?

Yes that was it

pebrc · 2024-10-20T08:34:05Z

pkg/controller/remotecluster/keystore/keystore.go

+		},
+		Data: data,
+	}
+	if _, err := reconciler.ReconcileSecret(ctx, c, expected, owner); err != nil {


I was seeing a few noisy conflicts during testing, I assume because Save is called in two places during one reconciliation?

2024-10-20T10:29:43.425+0200 ERROR manager.eck-operator Reconciler error {"service.version": "2.15.0-SNAPSHOT+65769109", "controller": "remotecluster-controller", "object": {"name":"cluster1","namespace":"ns1"}, "namespace": "ns1", "name": "cluster1", "reconcileID": "316f64b0-454e-4285-8e89-34c18f0da6a2", "error": "Operation cannot be fulfilled on secrets \"cluster1-es-remote-api-keys\": the object has been modified; please apply your changes to the latest version and try again", "errorCauses": [{"error": "Operation cannot be fulfilled on secrets \"cluster1-es-remote-api-keys\": the object has been modified; please apply your changes to the latest version and try again"}]} sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler /Users/pebrc/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem /Users/pebrc/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2 /Users/pebrc/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224
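One way to make such conflicts less noisy is to treat them as a reason to requeue rather than as a reconciliation error. The sketch below only uses the standard library: errConflict and result are stand-ins introduced for this example. In the operator, the check would presumably rely on apierrors.IsConflict from k8s.io/apimachinery/pkg/api/errors, and the result would be a controller-runtime reconcile.Result; the 10 second delay is an arbitrary value.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for a Kubernetes 409 Conflict error in this sketch.
var errConflict = errors.New("the object has been modified")

// result is a simplified stand-in for a reconcile result.
type result struct {
	RequeueAfter time.Duration
}

// saveOrRequeue calls save and turns a conflict into a quiet requeue instead of
// an error. Any other error is returned unchanged.
func saveOrRequeue(save func() error) (result, error) {
	err := save()
	switch {
	case err == nil:
		return result{}, nil
	case errors.Is(err, errConflict):
		// Another reconciliation updated the Secret first: retry later.
		return result{RequeueAfter: 10 * time.Second}, nil
	default:
		return result{}, err
	}
}

func main() {
	res, err := saveOrRequeue(func() error {
		return fmt.Errorf("saving keystore: %w", errConflict)
	})
	fmt.Println(res.RequeueAfter, err)
}
```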

pebrc · 2024-10-20T08:36:16Z

pkg/controller/remotecluster/apikey.go

+	}
+
+	// Save the generated keys in the keystore.
+	if err := clientClusterAPIKeyStore.Save(ctx, c, clientES); err != nil {


One downside of the current approach that occurred to me during testing is that the reconciliation, which optimises for creating fewer clients for the remote server clusters, is more disruptive to the client clusters, as every new remote cluster connection requires a full restart of the cluster to update the keystores. I don't think this is a problem in practice; I can only think of very rare scenarios where this might be observable (adding remote clusters within very short intervals but not at once).

pkg/apis/elasticsearch/v1/remote_cluster.go

pkg/controller/elasticsearch/driver/driver.go

pkg/controller/remotecluster/keystore/keystore.go

pkg/apis/elasticsearch/v1/name.go

pebrc

Nice work! 🚢

Small aside: I noticed we don't watch rolebindings in the ES namespaces so when using the RBAC association restrictions we don't re-reconcile if those change. Not sure if there is an easy fix though because users could also use cluster role bindings. Also it has been like this for a very long time and was not introduced in your PR.

pkg/apis/elasticsearch/v1/elasticsearch_types.go

pebrc · 2024-10-27T08:40:47Z

pkg/controller/elasticsearch/client/v8.go

+	return response, err
+}
+
+func (c *clientV8) InvalidateCrossClusterAPIKey(ctx context.Context, name string) error {


Curious why this is not just called DeleteCrossClusterAPIKey?

Because the key is actually not deleted:

This API invalidates API keys created by the create API key or grant API key APIs. Invalidated API keys fail authentication, but they can still be viewed using the get API key information and query API key information APIs, for at least the configured retention period, until they are automatically deleted.

https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-invalidate-api-key.html
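For illustration, the invalidate API referenced above is a DELETE on /_security/api_key with a JSON body that selects the keys, for instance by name. The following Go sketch builds such a request with the standard library only. It is not the clientV8 implementation from this PR: the helper names (invalidateBody, newInvalidateRequest) and the base URL are assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// invalidateBody builds the request body that selects API keys by name.
func invalidateBody(name string) ([]byte, error) {
	return json.Marshal(struct {
		Name string `json:"name"`
	}{Name: name})
}

// newInvalidateRequest builds a DELETE /_security/api_key request for the given key name.
// The request is only built here, not sent; authentication is left out of this sketch.
func newInvalidateRequest(baseURL, name string) (*http.Request, error) {
	body, err := invalidateBody(name)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodDelete, baseURL+"/_security/api_key", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newInvalidateRequest("https://localhost:9200", "eck-ns1-cluster1-to-ns2-cluster2")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Since an invalidated key remains visible for the retention period, listing keys with active_only=true, as the operator does, filters it out.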

pkg/controller/elasticsearch/driver/driver.go

test/e2e/es/remote_cluster_test.go

pkg/controller/remotecluster/secret.go

pkg/controller/remotecluster/controller.go

Co-authored-by: Peter Brachwitz <[email protected]>

barkbay · 2024-10-30T08:00:03Z

buildkite test this -f p=gke,t=TestRemoteClusterWithAPIKeys -m s=8.9.2,s=8.15.2,s=8.16.0-SNAPSHOT,s=8.17.0-SNAPSHOT

barkbay · 2024-10-31T07:19:49Z

Small aside: I noticed we don't watch rolebindings in the ES namespaces so when using the RBAC association restrictions we don't re-reconcile if those change. Not sure if there is an easy fix though because users could also use cluster role bindings. Also it has been like this for a very long time and was not introduced in your PR.

Sorry, just realized I forgot to reply to your comment. We are actually reconciling every 15 minutes in the current implementation:

cloud-on-k8s/pkg/controller/remotecluster/controller.go

Line 323 in ae1f62c

    
           return results.WithResult(association.RequeueRbacCheck(r.accessReviewer)).Aggregate()

// RequeueRbacCheck returns a reconcile result depending on the implementation of the AccessReviewer.
// It is mostly used when using the subjectAccessReviewer implementation in which case a next reconcile loop should be
// triggered later to keep the association in sync with the RBAC roles and bindings.
// See https://github.com/elastic/cloud-on-k8s/issues/2468#issuecomment-579157063
func RequeueRbacCheck(accessReviewer rbac.AccessReviewer) reconcile.Result {
	switch accessReviewer.(type) {
	case *rbac.SubjectAccessReviewer:
		return reconcile.Result{RequeueAfter: 15 * time.Minute}
	default:
		return reconcile.Result{}
	}
}

pebrc · 2024-10-31T09:33:51Z

We are actually reconciling every 15 minutes in the current implementation:

I had completely forgotten about that. Obviously I did not wait long enough during testing.

Remote Clusters using API Keys

851219b

barkbay added the >feature and release-highlight labels Oct 9, 2024

Cosmetic changes

fb34ee7

barkbay added 2 commits October 10, 2024 08:10

Publish the remote cluster Service to the client, not the Pod IP

ed29168

Use existing headless service

4c2c6fe

barkbay commented Oct 10, 2024

barkbay added 3 commits October 11, 2024 08:22

[DOC] Update issuing node transport certificates with third-party tools

8735592

[E2E] Add end-to-end test

5ec1e8d

[E2E] Also attempt to search + more details about errors

875c169

Only delete the keystore which has been initially loaded

c1223bd

barkbay force-pushed the rcs2-pr branch from 6fdd2db to c1223bd Compare October 11, 2024 13:44

pebrc reviewed Oct 11, 2024

barkbay added 5 commits October 15, 2024 10:05

Add API keystore Secret expectations

ab442d1

Update from review

2a8cd2d

Add support for access.search.query

b6bf462

Add support for allow_restricted_indices

3225307

Fix unit tests

6576910

barkbay marked this pull request as ready for review October 16, 2024 12:56

pebrc added the v2.16.0 label Oct 18, 2024

pebrc reviewed Oct 20, 2024

barkbay added 3 commits October 21, 2024 14:55

Apply Peter's suggestions

e9fd909

Add ForgetChangeFor

8c0a6fb

Handle conflict

c42c3f1

thbkrkr reviewed Oct 21, 2024

barkbay added 2 commits October 21, 2024 16:55

Merge remote-tracking branch 'origin/main' into rcs2-pr

ba62848

typos

9af83ee

pebrc approved these changes Oct 28, 2024

barkbay and others added 5 commits October 30, 2024 07:58

Apply suggestions from code review

823ee1e

Co-authored-by: Peter Brachwitz <[email protected]>

Update comments

e4e053a

Merge remote-tracking branch 'origin/main' into rcs2-pr

c1dab18

make generate

a35d056

Fix expected license

610423a

barkbay merged commit ae1f62c into elastic:main Oct 31, 2024
5 checks passed

barkbay deleted the rcs2-pr branch October 31, 2024 07:11

barkbay mentioned this pull request Oct 31, 2024

Add documentation for remote clusters using API keys #8167

Open

barkbay mentioned this pull request Nov 4, 2024

Cross-Cluster API keys rotation #8176

Open
