sabakan-state-setter: detect machine failures using Prometheus alerts #2788

Open
wants to merge 1 commit into main

Conversation

@morimoto-cybozu (Contributor) commented on Jan 29, 2025

@morimoto-cybozu self-assigned this on Jan 29, 2025
@morimoto-cybozu force-pushed the detect-machine-failures-using-prometheus-alerts branch from 1051b2a to dffeebf on January 31, 2025
@morimoto-cybozu marked this pull request as ready for review on February 4, 2025
Comment on lines +291 to +296
case !ok || newState == m.State || (newState == stateUnhealthyImmediate && m.State == sabakan.StateUnhealthy):
c.ClearUnhealthy(m)
continue
case newState == stateUnhealthyImmediate:
c.ClearUnhealthy(m)
newState = sabakan.StateUnhealthy
@zeroalphat (Contributor)

Could you please explain why you are using the stateUnhealthyImmediate variable and calling ClearUnhealthy? I'm not quite clear on their purpose.

@morimoto-cybozu (Contributor, Author)

@zeroalphat
It's because I want to skip the grace period.

Currently, sabakan-state-setter changes a machine's state to unhealthy only after waiting for the grace period.
This seems reasonable when the cause of the change is the serf status or monitor-hw metrics.
This pull request adds state changes caused by Prometheus alerts.
In my design, the grace period is not necessary for these new state changes, because we can configure the Prometheus alert rules to wait a sufficient amount of time before they become active.
https://github.com/cybozu-go/neco/pull/2788/files#diff-819895dee5e92a2356096850816be02af8e2b41ea1f6f69832d71967303465f3
So I added a new state candidate, stateUnhealthyImmediate, for this new kind of state change.
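To make the intent concrete, here is a minimal sketch of what such an internal candidate could look like. Only the name stateUnhealthyImmediate and the unhealthy state come from this PR; the package name, constant values, and the finalState helper are assumptions for illustration.

package statesetter // hypothetical package name, for illustration only

// Internal state candidates used while evaluating a machine.
// stateUnhealthyImmediate is never written to sabakan as-is; it means
// "set the machine to unhealthy without waiting for the grace period",
// because the Prometheus alert rule is expected to have already waited
// (via its `for:` duration) before the alert started firing.
const (
	stateUnhealthy          = "unhealthy"
	stateUnhealthyImmediate = "unhealthy-immediate"
)

// finalState maps an internal candidate to the state actually written to
// sabakan and reports whether the grace period still applies.
func finalState(candidate string) (state string, needsGracePeriod bool) {
	if candidate == stateUnhealthyImmediate {
		return stateUnhealthy, false
	}
	return candidate, candidate == stateUnhealthy
}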

The controller manages the start timestamps of to-be-unhealthy machines.
If a machine is newly evaluated as StateUnhealthy (not stateUnhealthyImmediate), the controller registers a timestamp via Controller.RegisterUnhealthy().
The controller keeps checking the timestamp while the machine is evaluated as StateUnhealthy and its current state is not yet StateUnhealthy.
Once the controller has changed the state to StateUnhealthy, it deletes the timestamp via Controller.ClearUnhealthy().
The controller also deletes the timestamp when it is no longer needed.
If the new state candidate is stateUnhealthyImmediate, the timestamp is not needed at all, so I added c.ClearUnhealthy(m) there.
Controller.ClearUnhealthy() can be called safely without first checking that a timestamp is actually registered.
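For illustration, here is a minimal sketch of that bookkeeping, assuming a map keyed by machine serial. RegisterUnhealthy and ClearUnhealthy are the method names discussed in this thread; the Machine type, the Controller fields, and GracePeriodElapsed are assumptions, not the actual implementation.

package statesetter // hypothetical package name, for illustration only

import "time"

// Machine is a simplified stand-in for the sabakan machine being evaluated.
type Machine struct {
	Serial string
}

// Controller tracks when each machine was first judged to-be-unhealthy.
type Controller struct {
	unhealthyStart map[string]time.Time // keyed by machine serial
	gracePeriod    time.Duration
}

// RegisterUnhealthy records the first time a machine is evaluated as
// StateUnhealthy; later evaluations keep the original timestamp.
func (c *Controller) RegisterUnhealthy(m *Machine, now time.Time) {
	if _, ok := c.unhealthyStart[m.Serial]; !ok {
		c.unhealthyStart[m.Serial] = now
	}
}

// GracePeriodElapsed reports whether the machine has stayed
// to-be-unhealthy for at least the grace period.
func (c *Controller) GracePeriodElapsed(m *Machine, now time.Time) bool {
	start, ok := c.unhealthyStart[m.Serial]
	return ok && now.Sub(start) >= c.gracePeriod
}

// ClearUnhealthy drops the recorded timestamp. Deleting a missing map key
// is a no-op, so callers need not check whether a timestamp was registered.
func (c *Controller) ClearUnhealthy(m *Machine) {
	delete(c.unhealthyStart, m.Serial)
}

With a shape like this, the extra c.ClearUnhealthy(m) in the stateUnhealthyImmediate branch is just a cheap, idempotent cleanup of any leftover timestamp.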

},
},
{
Name: "LLPDDown",
@zeroalphat (Contributor)

Is this LLDPDown?

"00000003": sabakan.StateUnreachable, // serf status is "failed"
"00000004": sabakan.StateUnhealthy, // alert "DiskNotRecognized" is firing; grace period is ignored
"00000005": sabakan.StateUnreachable, // alert "LLPDDown" is firing; alert "DiskNotRecognized" is ignored because it is less severe
@zeroalphat (Contributor)

Ditto

@morimoto-cybozu force-pushed the detect-machine-failures-using-prometheus-alerts branch from 60400a0 to 1258bc4 on February 7, 2025
@morimoto-cybozu (Contributor, Author)

I've rebased to reflect the new artifacts.go.
