As an operator
I want to monitor the rate of failing liveness healthchecks via metrics
So that I can get alerted in case there are irregularities
Problem Details
We have observed that during diego-cell evacuation, in some cases an exceptionally large number of liveness healthchecks time out.
After closer investigation we discovered the following:
this happens when the diego-cells are updated: with each batch (10% of the workload), the remaining 90% of the cells have to start the replacement LRPs
starting each LRP results in high disk IO, since the droplets / docker layers are being downloaded
if the disk performance (EBS volumes) is not high enough, this leads to high CPU wait time
CPU wait tends to block a core from executing instructions from other threads
as a result, liveness healthchecks configured with timeouts of 1-5 seconds tend to time out, even if the container is idling
this happens mostly on overloaded landscapes, and increasing the disk performance from the default 125 MB/s to 500 MB/s solves the problem
Currently we are monitoring the CPU wait as reported by BOSH, but this is sometimes misleading because:
on VMs with few cores, e.g. 4, one core waiting is 25% cpu wait
on VMs with 128+ cores, one core waiting is < 1%
So it is hard to define a consistent threshold for when to trigger an alert that something needs to be scaled
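The arithmetic behind the two examples above can be sketched as follows. This is only an illustration: the function name `iowaitShare` is hypothetical, and it assumes the reported CPU wait is simply the fraction of cores that are fully blocked on IO.

```go
package main

import "fmt"

// iowaitShare returns the VM-wide CPU wait percentage under the simplifying
// assumption that `blocked` out of `cores` cores spend all their time waiting on IO.
// The name and the model are illustrative, not taken from BOSH.
func iowaitShare(blocked, cores int) float64 {
	if cores <= 0 {
		return 0
	}
	return 100 * float64(blocked) / float64(cores)
}

func main() {
	// One blocked core looks very different depending on the VM size.
	for _, cores := range []int{4, 128} {
		fmt.Printf("%3d cores: %.2f%% CPU wait\n", cores, iowaitShare(1, cores))
	}
}
```

With one blocked core this gives 25% on a 4-core VM and about 0.78% on a 128-core VM, so a single fixed alert threshold cannot cover both.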
On the other hand, monitoring the failing healthchecks, and especially sharp increases (e.g. from 10 to 1000 per minute), is a very consistent indicator
Currently we do this by counting those log lines in a Kibana dashboard, but triggering alerts from Kibana has other operational challenges.
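As a rough sketch of what alerting on a sharp increase could look like, the check below compares two consecutive per-minute counts. The function name `isSpike`, the growth factor, and the minimum count are all assumptions chosen for illustration; they are not part of the executor or of any existing alerting rule.

```go
package main

import "fmt"

// isSpike reports whether the current per-minute failure count is a sharp
// increase over the previous minute. `factor` is the growth ratio that counts
// as a spike and `floor` suppresses alerts on very small absolute numbers.
// Both parameters are hypothetical tuning knobs.
func isSpike(previous, current uint64, factor float64, floor uint64) bool {
	if current < floor {
		return false // too few failures to be interesting
	}
	if previous == 0 {
		return true // any jump from zero above the floor counts as a spike
	}
	return float64(current) >= factor*float64(previous)
}

func main() {
	// The example from the text: 10 failures per minute rising to 1000.
	fmt.Println(isSpike(10, 1000, 10, 100))
	// A small fluctuation that should not alert.
	fmt.Println(isSpike(10, 12, 10, 100))
}
```

Because the check works on the ratio between minutes rather than on a share of CPU time, it behaves the same regardless of how many cores a cell has.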
Solution Proposal
Therefore our proposal is to modify the executor so that it emits a counter metric for the number of failed healthchecks. This way an alert (e.g. via Riemann) can be configured in case of exceptionally high values.
For discussion we have done a POC in this PR: cloudfoundry/executor#102
It solves the problem and allows us to monitor the healthchecks.
It allows choosing for which checks the counter should be emitted. So far this is not configurable, because for most of the checks it does not make sense.
Depending on our discussions here, we may also extend it or change it so that it suits the community
Acceptance criteria
Scenario: Diego cell update is performed
Given I have enabled emitting metrics for failing healthchecks
When the performance of the Diego EBS volumes is insufficient
Then I receive the metrics in the monitoring stack and can act on them