
Improve alarms #1336
Open
yrong wants to merge 5 commits into main
Conversation

yrong (Contributor) commented Nov 20, 2024

  • Move alarm configuration out of the api package
  • Make alarm evaluation variables configurable (rough sketch below)
  • Make the monitor more sensitive by scanning at a shorter SCAN_INTERVAL

Resolves: https://linear.app/snowfork/issue/SNO-1233
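
A minimal sketch of what the configurable evaluation variables could look like, reading overrides from the environment with hard-coded fallbacks; the helper, variable names, and defaults below are illustrative, not the actual code in this PR:

// Sketch only: read a numeric alarm setting from the environment,
// falling back to a default when the variable is unset or not a number.
function numberFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  const value = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(value) ? value : fallback;
}

// Illustrative settings; the names and defaults are assumptions, not the PR's.
export const alarmConfig = {
  // Scan interval in minutes (30 by default).
  scanIntervalMinutes: numberFromEnv("SCAN_INTERVAL", 30),
  // Block latency threshold for the to-Ethereum direction (value is illustrative).
  blockLatencyThresholdToEthereum: numberFromEnv("BLOCK_LATENCY_THRESHOLD_TO_ETHEREUM", 600),
};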

yrong marked this pull request as ready for review November 21, 2024 05:48
yrong (Contributor, Author) commented Nov 21, 2024

Reinitialized the alarms with more sensitive thresholds and restarted the monitor process.

We'll just watch CloudWatch for a while to see if that makes sense.

alistair-singh (Contributor) left a comment

+1, approving. Although I think we should move away from boolean-style metrics and just log numbers where possible. Maybe we can address that in a separate PR.

- # Scan interval (in minutes)
- SCAN_INTERVAL=60
+ # Scan interval (30 minutes by default)
+ SCAN_INTERVAL=30
alistair-singh (Contributor)

Do you think it would be valuable to do 15 minutes?

yrong (Contributor, Author)

I'd prefer to start with 30 minutes first.
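
For context, a rough sketch of how a SCAN_INTERVAL like this could drive the monitor loop; the wiring shown here is an assumption, not the code in this repository:

// Sketch: rerun the monitor every SCAN_INTERVAL minutes (30 by default).
const scanIntervalMinutes = Number(process.env["SCAN_INTERVAL"] ?? "30");

async function monitorLoop(runOnce: () => Promise<void>): Promise<void> {
  for (;;) {
    try {
      await runOnce();
    } catch (err) {
      // A failed scan is logged and retried on the next tick.
      console.error("scan failed", err);
    }
    await new Promise((resolve) => setTimeout(resolve, scanIntervalMinutes * 60 * 1000));
  }
}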

Value: Number(channel.toEthereum.outbound < channel.toEthereum.inbound),
alistair-singh (Contributor)

Instead of logging booleans, maybe we should log the difference. This would then be a metric of undelivered messages.

yrong (Contributor, Author)

785b59b adds metrics for the difference.
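
The real change is in 785b59b; as a rough sketch of what a difference metric could look like, assuming the AWS SDK v3 CloudWatch client (the namespace and metric name below are illustrative):

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cloudWatch = new CloudWatchClient({});

// Sketch: publish the outbound/inbound gap as a plain number, so the alarm
// threshold can be tuned in CloudWatch rather than hard-coded as a boolean check.
async function reportUndeliveredToEthereum(channel: {
  toEthereum: { outbound: number; inbound: number };
}): Promise<void> {
  const undelivered = channel.toEthereum.outbound - channel.toEthereum.inbound;
  await cloudWatch.send(
    new PutMetricDataCommand({
      Namespace: "Snowbridge/Monitor", // illustrative namespace
      MetricData: [
        {
          MetricName: "ToEthereumUndeliveredMessages", // illustrative metric name
          Value: undelivered,
          Unit: "Count",
        },
      ],
    })
  );
}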

-   ? parseInt(process.env["CheckIntervalToEthereum"])
-   : status.BlockLatencyThreshold.ToEthereum) &&
- metrics.bridgeStatus.toEthereum.latestPolkadotBlockOnEthereum <=
+ metrics.bridgeStatus.toEthereum.blockLatency > BlockLatencyThreshold.ToEthereum &&
alistair-singh (Contributor)

Maybe we should also just log the blockLatency as a plain number instead of a boolean. That would make threshold management easier.
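
To illustrate the easier threshold management, a sketch of an alarm defined over a numeric latency metric, assuming the AWS SDK v3 CloudWatch client; the alarm, namespace, and metric names are illustrative:

import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cloudWatch = new CloudWatchClient({});

// Sketch: alarm directly on a numeric block-latency metric, so the threshold
// lives in the alarm definition instead of the monitoring code.
async function createBlockLatencyAlarm(thresholdBlocks: number): Promise<void> {
  await cloudWatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "ToEthereumBlockLatency", // illustrative alarm name
      Namespace: "Snowbridge/Monitor", // illustrative namespace
      MetricName: "ToEthereumBlockLatency", // illustrative metric name
      Statistic: "Maximum",
      Period: 1800, // seconds; roughly matches a 30-minute scan interval
      EvaluationPeriods: 1,
      Threshold: thresholdBlocks,
      ComparisonOperator: "GreaterThanThreshold",
    })
  );
}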

