sabakan-state-setter: detect machine failures using Prometheus alerts #2788

Open
wants to merge 1 commit into main

Conversation

@morimoto-cybozu (Contributor) commented on Jan 29, 2025

@morimoto-cybozu self-assigned this on Jan 29, 2025
@morimoto-cybozu force-pushed the detect-machine-failures-using-prometheus-alerts branch from 1051b2a to dffeebf on January 31, 2025
@morimoto-cybozu marked this pull request as ready for review on February 4, 2025
Comment on lines +291 to +296
case !ok || newState == m.State || (newState == stateUnhealthyImmediate && m.State == sabakan.StateUnhealthy):
c.ClearUnhealthy(m)
continue
case newState == stateUnhealthyImmediate:
c.ClearUnhealthy(m)
newState = sabakan.StateUnhealthy
@zeroalphat (Contributor)

Could you please explain why you are using the stateUnhealthyImmediate variable and calling ClearUnhealthy? I'm not quite clear on their purpose.

@morimoto-cybozu (Contributor, Author)

@zeroalphat
It's because I want to skip the grace period.

Currently, sabakan-state-setter changes a machine's state to unhealthy only after waiting for the grace period.
This seems reasonable when the cause of the change is the serf status or monitor-hw metrics.
This pull request adds state changes caused by Prometheus alerts.
In my design, the grace period is not necessary for these new state changes, because we can configure the Prometheus alert rules to wait a sufficient amount of time before they become active.
https://github.com/cybozu-go/neco/pull/2788/files#diff-819895dee5e92a2356096850816be02af8e2b41ea1f6f69832d71967303465f3
So I added a new state candidate, stateUnhealthyImmediate, for this new kind of state change.
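To make the intent concrete, here is a minimal sketch of what such an internal candidate could look like. Only the name stateUnhealthyImmediate and the unhealthy state come from this PR; the package name, constant values, and the finalState helper are assumptions for illustration.

package statesetter // hypothetical package name, for illustration only

// Internal state candidates used while evaluating a machine.
// stateUnhealthyImmediate is never written to sabakan as-is; it means
// "set the machine to unhealthy without waiting for the grace period",
// because the Prometheus alert rule is expected to have already waited
// (via its `for:` duration) before the alert started firing.
const (
	stateUnhealthy          = "unhealthy"
	stateUnhealthyImmediate = "unhealthy-immediate"
)

// finalState maps an internal candidate to the state actually written to
// sabakan and reports whether the grace period still applies.
func finalState(candidate string) (state string, needsGracePeriod bool) {
	if candidate == stateUnhealthyImmediate {
		return stateUnhealthy, false
	}
	return candidate, candidate == stateUnhealthy
}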

The controller manages the start timestamps of to-be-unhealthy machines.
If a machine is newly evaluated as StateUnhealthy (not stateUnhealthyImmediate), the controller registers a timestamp via Controller.RegisterUnhealthy().
The controller keeps checking the timestamp while the machine is evaluated as StateUnhealthy and its current state is not yet StateUnhealthy.
Once the controller has changed the state to StateUnhealthy, it deletes the timestamp via Controller.ClearUnhealthy().
The controller also deletes the timestamp when it is no longer needed.
If the new state candidate is stateUnhealthyImmediate, the timestamp is not needed at all, so I added c.ClearUnhealthy(m) there.
Controller.ClearUnhealthy() can be called safely without first checking that a timestamp is actually registered.
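For illustration, here is a minimal sketch of that bookkeeping, assuming a map keyed by machine serial. RegisterUnhealthy and ClearUnhealthy are the method names discussed in this thread; the Machine type, the Controller fields, and GracePeriodElapsed are assumptions, not the actual implementation.

package statesetter // hypothetical package name, for illustration only

import "time"

// Machine is a simplified stand-in for the sabakan machine being evaluated.
type Machine struct {
	Serial string
}

// Controller tracks when each machine was first judged to-be-unhealthy.
type Controller struct {
	unhealthyStart map[string]time.Time // keyed by machine serial
	gracePeriod    time.Duration
}

// RegisterUnhealthy records the first time a machine is evaluated as
// StateUnhealthy; later evaluations keep the original timestamp.
func (c *Controller) RegisterUnhealthy(m *Machine, now time.Time) {
	if _, ok := c.unhealthyStart[m.Serial]; !ok {
		c.unhealthyStart[m.Serial] = now
	}
}

// GracePeriodElapsed reports whether the machine has stayed
// to-be-unhealthy for at least the grace period.
func (c *Controller) GracePeriodElapsed(m *Machine, now time.Time) bool {
	start, ok := c.unhealthyStart[m.Serial]
	return ok && now.Sub(start) >= c.gracePeriod
}

// ClearUnhealthy drops the recorded timestamp. Deleting a missing map key
// is a no-op, so callers need not check whether a timestamp was registered.
func (c *Controller) ClearUnhealthy(m *Machine) {
	delete(c.unhealthyStart, m.Serial)
}

With a shape like this, the extra c.ClearUnhealthy(m) in the stateUnhealthyImmediate branch is just a cheap, idempotent cleanup of any leftover timestamp.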

},
},
{
Name: "LLPDDown",
@zeroalphat (Contributor)

Is this LLDPDown?

"00000003": sabakan.StateUnreachable, // serf status is "failed"
"00000004": sabakan.StateUnhealthy, // alert "DiskNotRecognized" is firing; grace period is ignored
"00000005": sabakan.StateUnreachable, // alert "LLPDDown" is firing; alert "DiskNotRecognized" is ignored because it is less severe
@zeroalphat (Contributor)

Ditto

@morimoto-cybozu force-pushed the detect-machine-failures-using-prometheus-alerts branch from 60400a0 to 1258bc4 on February 7, 2025
@morimoto-cybozu (Contributor, Author)

I've rebased to reflect the new artifacts.go.
