
Implement additional metrics for counting failed HealthChecks #953

Open
1 task done
vlast3k opened this issue Aug 12, 2024 · 1 comment

Comments

vlast3k commented Aug 12, 2024

Proposed Change

As an operator
I want to monitor the rate of failing liveness healthchecks via metrics
So that I can be alerted in case of irregularities

Problem Details

We have observed that during diego-cell evacuation, in some cases an exceptionally large number of liveness healthchecks time out.
After closer investigation we discovered the following:

  • this happens when the diego-cells are updated: with each batch (10% of the workload), the remaining 90% of the cells have to start the replacement LRPs
  • starting each LRP results in high disk IO, since the droplets / docker layers are being downloaded
  • if the disk performance (EBS volumes) is not high enough, this leads to high CPU wait time
  • CPU wait tends to block a core from executing commands from other threads
  • as a result, liveness healthchecks configured with 1-5 second timeouts tend to time out, even if the container is idling
  • this happens mostly on overloaded landscapes, and increasing the disk performance from the default 125 MB/s to 500 MB/s solves the problem

Currently we are monitoring the CPU wait as reported by BOSH, but this is sometimes misleading because:

  • on VMs with few cores (e.g. 4), one waiting core already shows up as 25% CPU wait
  • on VMs with 128+ cores, one waiting core shows up as < 1% CPU wait

So it is hard to define a consistent threshold for when to trigger an alert that something needs to be scaled.
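
A trivial sketch of the arithmetic behind the two bullets above (illustration only, hypothetical numbers):

```go
package main

import "fmt"

// Illustration only: the same single waiting core produces very different
// CPU-wait percentages depending on the VM's core count, which is why a
// fixed percentage threshold is hard to tune across VM sizes.
func waitPercent(waitingCores, totalCores float64) float64 {
	return waitingCores / totalCores * 100
}

func main() {
	fmt.Printf("4-core VM,   1 core waiting: %5.2f%% CPU wait\n", waitPercent(1, 4))   // 25.00%
	fmt.Printf("128-core VM, 1 core waiting: %5.2f%% CPU wait\n", waitPercent(1, 128)) //  0.78%
}
```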

On the other hand, monitoring the number of failing healthchecks, and especially sharp increases (e.g. from 10 to 1000 per minute), is a very consistent indicator.

Currently we are doing this by counting the occurrences of the log line

rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-run.node-run.liveness-check.run-step.run-step-failed-with-nonzero-status-code

in a Kibana dashboard, but triggering alerts from Kibana comes with other operational challenges.
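
As an illustration of the counting idea (not a production tool; in practice we do this in the Kibana dashboard), the same count can be produced offline from exported rep logs:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Counts occurrences of the liveness-check failure log key in rep log lines
// read from stdin (e.g. an exported log file). This only illustrates the
// counting idea behind the Kibana dashboard.
const failedCheckKey = "liveness-check.run-step.run-step-failed-with-nonzero-status-code"

func main() {
	count := 0
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), failedCheckKey) {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("failed liveness checks: %d\n", count)
}
```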

Solution Proposal

Therefore our proposal is to modify the executor so that it emits a counter metric with the number of failed healthchecks. This way an alert (e.g. via Riemann) can be configured for exceptionally high values.

For discussion we did a POC in this PR:
cloudfoundry/executor#102
It solves the problem and allows us to monitor the healthchecks.
It allows choosing for which checks the counter should be emitted; so far this is not operator-configurable, because for most checks a failure counter does not make sense.
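
To make the proposal concrete, here is a rough sketch of the shape of the change (this is not the code from the PR; the MetricsClient interface, the runStep type and the metric name below are placeholders standing in for the executor's existing metrics client and run step):

```go
package main

import (
	"fmt"
	"log"
)

// MetricsClient stands in for the executor's existing metrics client, which
// exposes an IncrementCounter-style method.
type MetricsClient interface {
	IncrementCounter(name string) error
}

// FailedLivenessCheckCount is an illustrative metric name, not necessarily
// the one used in the PR.
const FailedLivenessCheckCount = "FailedLivenessChecks"

type runStep struct {
	metrics           MetricsClient
	emitFailureMetric bool // chosen per check; off for checks where a failure counter makes no sense
}

// reportCheckResult is a hypothetical hook called when a health check exits.
func (s *runStep) reportCheckResult(exitStatus int) {
	if exitStatus == 0 || !s.emitFailureMetric {
		return
	}
	if err := s.metrics.IncrementCounter(FailedLivenessCheckCount); err != nil {
		log.Printf("failed to emit %s: %v", FailedLivenessCheckCount, err)
	}
}

// stdoutMetrics is a stand-in emitter so the example runs on its own.
type stdoutMetrics struct{}

func (stdoutMetrics) IncrementCounter(name string) error {
	fmt.Println("counter++:", name)
	return nil
}

func main() {
	step := &runStep{metrics: stdoutMetrics{}, emitFailureMetric: true}
	step.reportCheckResult(1) // a liveness check timed out / exited non-zero
}
```

The real change would presumably hook into the same place in the run step where the run-step-failed-with-nonzero-status-code log line is emitted today.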

Depending on the discussion here, we may also extend or change it so that it suits the community.

Acceptance criteria

Scenario: Diego cell update is performed
Given I have enabled emitting metrics for failing healthchecks
When the performance of the Diego cells' EBS volumes is insufficient
Then I receive the metrics in the monitoring stack and can act on them

Related links

ebroberson (Contributor) commented

I think this looks like a good change, but definitely want someone else on the team to review and comment.
