-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sabakan-state-setter: detect machine failures using Prometheus alerts #2788
base: main
Are you sure you want to change the base?
sabakan-state-setter: detect machine failures using Prometheus alerts #2788
Conversation
1051b2a
to
dffeebf
Compare
case !ok || newState == m.State || (newState == stateUnhealthyImmediate && m.State == sabakan.StateUnhealthy): | ||
c.ClearUnhealthy(m) | ||
continue | ||
case newState == stateUnhealthyImmediate: | ||
c.ClearUnhealthy(m) | ||
newState = sabakan.StateUnhealthy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why are you using the stateUnhealthyImmediate
variable and executing ClearUnhealthy
? I'm not quite clear on their purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroalphat
It's because I want to skip the grace period.
Currently sabakan-state-setter
changes a machine's state to unhealthy
only after waiting for the grace period.
This seems reasonable when the cause of the change is serf
status or monitor-hw
metrics.
In this pull request, I add state changes caused by Prometheus
alerts.
In my design, the grace period is not necessary for this new state change because we can configure Prometheus alerts to wait a sufficient amount of time before becoming active.
https://github.com/cybozu-go/neco/pull/2788/files#diff-819895dee5e92a2356096850816be02af8e2b41ea1f6f69832d71967303465f3
So I added a new state candidate stateUnhealthyImmediate
for the new state change.
The controller manages the start timestamps of to-be-unhealthy machines.
If a machine is newly evaluated to StateUnhealthy
(not stateUnhealthyImmediate
), the controller registers the timestamp via Controller.RegisterUnhealthy()
.
The controller continuously checks the timestamp while the machine is evaluated to StateUnhealthy
and the current status is not StateUnhealthy
.
Once the controller has changed the status to StateUnhealthy
, it deletes the timestamp via Controller.ClearUnhealthy()
.
The controller also deletes the timestamp when it becomes unnecessary.
If the new state candidate is stateUnhealthyImmediate
, we don't need the timestamp any more.
So I add c.ClearUnhealthy(m)
.
We can call Controller.ClearUnhealthy()
without checking that a timestamp is registered actually.
}, | ||
}, | ||
{ | ||
Name: "LLPDDown", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this LLDPDown?
"00000003": sabakan.StateUnreachable, // serf status is "failed" | ||
"00000004": sabakan.StateUnhealthy, // alert "DiskNotRecognized" is firing; grace period is ignored | ||
"00000005": sabakan.StateUnreachable, // alert "LLPDDown" is firing; alert "DiskNotRecognized" is ignored because it is less severe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
Signed-off-by: morimoto-cybozu <[email protected]>
60400a0
to
1258bc4
Compare
I've rebased to reflect new |
See-also: https://github.com/cybozu-private/neco-tasks/issues/466
Signed-off-by: morimoto-cybozu [email protected]