[Enhancement][opensearch] Semantic health probes #474

Open
lieberlois opened this issue Sep 25, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@lieberlois
Contributor

lieberlois commented Sep 25, 2023

Is your feature request related to a problem? Please describe.
It's frustrating that, although the OpenSearch API distinguishes between different cluster states (red/yellow/green), the liveness / readiness probes do not take them into account. That makes the probes semantically useless during a rolling restart: the next OpenSearch node is restarted before all shards from the previous restart have been moved back.

Describe the solution you'd like
I want an endpoint that only responds OK when the cluster health is green (or maybe yellow). One idea would be to add a small sidecar container as an adapter that interprets the response from GET _cluster/health, which looks like this:

{
  "cluster_name": "opensearch-cluster",
  "status": "yellow",
  ...
}

I would suggest implementing a small Go service, since it can be written in very few LoC and is very lightweight. Alternatively we could use a curl command along the lines of GET _cluster/health?wait_for_status=green&timeout=1s, since it times out when the cluster state is not green and hence fails the readiness probe check. In both cases we somehow need to add a user that is allowed to call the endpoint (or use a dedicated client certificate, which probably complicates the setup).
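To make the curl variant concrete, here is a rough sketch of how it could be wired into the chart's readinessProbe. This is only an illustration, not a tested configuration: the credential environment variables, the Secret they come from, and the CA path are hypothetical and would need to match however the security plugin is actually set up.

readinessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # Succeeds only if the cluster reaches green within the timeout.
        # OPENSEARCH_PROBE_USER / OPENSEARCH_PROBE_PASSWORD are assumed to be
        # injected from a Secret; the CA path below is illustrative.
        curl --fail --silent --show-error \
          --cacert /usr/share/opensearch/config/certs/ca.crt \
          -u "$OPENSEARCH_PROBE_USER:$OPENSEARCH_PROBE_PASSWORD" \
          "https://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s"
  periodSeconds: 10
  failureThreshold: 3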

Describe alternatives you've considered
Using the Kubernetes Operator for OpenSearch is an option.

Additional context
I contributed #329 and #333 and can assist with the implementation if others also think this is a good idea.

@lieberlois added the enhancement (New feature or request) and untriaged (Issues that have not yet been triaged) labels on Sep 25, 2023
@smlx
Contributor

smlx commented Sep 27, 2023

There was an extensive discussion about this issue on this PR: #172

Please take a look and see if the discussion covers this issue.

@lieberlois
Contributor Author

Thanks @smlx, I read through the discussion :) I don't think using client certificates was evaluated there. I would suggest a client cert whose CN / DN is only allowed to call GET _cluster/health. This way we should be able to achieve a secure, passwordless health check. What do you think?
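For illustration, such a restricted role might look roughly like this in the security plugin configuration, assuming client certificate authentication is enabled in config.yml and the probe certificate carries the DN used below (both the role name and the DN are made up for this example):

# roles.yml: a role that may only read cluster health
health_probe:
  reserved: false
  cluster_permissions:
    - "cluster:monitor/health"

# roles_mapping.yml: map the probe certificate's DN to that role
health_probe:
  users:
    - "CN=health-probe,OU=ops,O=example,L=test,C=de"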

@smlx
Contributor

smlx commented Sep 29, 2023

One problem with using the _cluster/health endpoint for a readiness probe is that if the cluster goes unhealthy for some reason then Kubernetes will stop sending traffic to any node. This makes it very difficult to recover the cluster, since you can't hit the API through the service endpoint.

@lieberlois
Contributor Author

@smlx Good point, but we could fix this by only using a startupProbe, which only runs when the container starts 🤔 I see why it's a tricky topic; it's just that we sometimes get a red cluster state because of the rolling restart behaviour (e.g. 3 nodes: node 1 is restarted and not fully initialized yet, then node 2 also reboots -> cluster red).

@smlx
Contributor

smlx commented Sep 29, 2023

we can fix this by only using a startupProbe

But in that case imagine that enough pods go offline for the cluster to go red (e.g. nodes crash). How will the replacement pods rejoin the cluster, given that they will wait on the startup probe for the cluster to go green, which it never will?

@lieberlois
Contributor Author

Mhm, good point. I thought about using this startup probe in our project:

# Client certificate used to authenticate the probe against the security plugin
CERTS="/usr/share/opensearch/config/certs/admin"
# Succeeds only if the cluster reaches green within 1s; otherwise the probe fails
curl --fail-with-body --cert "$CERTS/tls.crt" --key "$CERTS/tls.key" -k "https://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s"
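Wrapped into the StatefulSet container spec, that could look roughly like the sketch below (the cert mount path comes from the snippet above; the thresholds are placeholders):

startupProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # Hold back readiness/liveness probing until the cluster reaches green once.
        CERTS="/usr/share/opensearch/config/certs/admin"
        curl --fail-with-body --cert "$CERTS/tls.crt" --key "$CERTS/tls.key" -k \
          "https://localhost:9200/_cluster/health?wait_for_status=green&timeout=1s"
  periodSeconds: 10
  # Allow up to ~10 minutes for the cluster to become green on startup.
  failureThreshold: 60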

How do you guys restart your cluster (also @smlx)? In our case, the cluster will at some point go red, because K8s doesn't wait for the cluster to be green again after restarting a node; it just restarts the next StatefulSet pod.

@lieberlois
Contributor Author

lieberlois commented Oct 6, 2023

Another idea: we could also add a preStop lifecycle hook that either drains the OpenSearch node or makes sure that the cluster health is green before restarting it. What do you think of that? Basically I suggest adding sane defaults for this preStop hook (or at least an example of how to do it) here:

https://github.com/opensearch-project/helm-charts/blob/main/charts/opensearch/values.yaml#L409

Edit: an idea would be to just provide it as an example of how somebody might implement this.
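As a rough illustration of that idea (not a proposal for concrete chart defaults), such a preStop hook could look something like this; paths and timings are placeholders, with the same assumptions as the probe snippets above:

lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        - |
          # Delay pod termination until the cluster reports green (or the
          # terminationGracePeriodSeconds deadline kills the hook), so the
          # next StatefulSet pod is not taken down while shards are still recovering.
          CERTS="/usr/share/opensearch/config/certs/admin"
          until curl --fail --silent -k \
              --cert "$CERTS/tls.crt" --key "$CERTS/tls.key" \
              "https://localhost:9200/_cluster/health?wait_for_status=green&timeout=5s" \
              > /dev/null; do
            sleep 5
          done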

@smlx
Contributor

smlx commented Oct 7, 2023

Yeah, that might work, I like the idea. There also might be something we could do as a pre-upgrade hook in the Helm chart, maybe in combination with a pod postStart or preStop hook.

Honestly speaking, in the interest of a speedy upgrade we just accept a short window of the cluster going red.
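For reference, a Helm pre-upgrade hook along those lines might look roughly like the Job below. Everything here is illustrative: the service name assumes the chart's default opensearch-cluster-master, and with the security plugin enabled the request would still need credentials or a client certificate as discussed above.

apiVersion: batch/v1
kind: Job
metadata:
  name: opensearch-preupgrade-health
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: health-check
          image: curlimages/curl:8.4.0
          command:
            - sh
            - -c
            - |
              # Abort the upgrade unless the cluster reaches green within the timeout.
              curl --fail --silent -k \
                "https://opensearch-cluster-master:9200/_cluster/health?wait_for_status=green&timeout=60s"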
