
Improve alarms #1336
Open
yrong wants to merge 5 commits into main
Conversation

yrong (Contributor) commented Nov 20, 2024

  • Move alarm configuration out of the api package
  • Make alarm evaluation variables configurable (rough sketch below)
  • Make the monitor more sensitive by scanning at a shorter SCAN_INTERVAL

Resolves: https://linear.app/snowfork/issue/SNO-1233
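
A minimal sketch of what the configurable evaluation variables could look like, reading overrides from the environment with hard-coded fallbacks; the helper, variable names, and defaults below are illustrative, not the actual code in this PR:

// Sketch only: read a numeric alarm setting from the environment,
// falling back to a default when the variable is unset or not a number.
function numberFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  const value = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(value) ? value : fallback;
}

// Illustrative settings; the names and defaults are assumptions, not the PR's.
export const alarmConfig = {
  // Scan interval in minutes (30 by default).
  scanIntervalMinutes: numberFromEnv("SCAN_INTERVAL", 30),
  // Block latency threshold for the to-Ethereum direction (value is illustrative).
  blockLatencyThresholdToEthereum: numberFromEnv("BLOCK_LATENCY_THRESHOLD_TO_ETHEREUM", 600),
};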

yrong marked this pull request as ready for review November 21, 2024 05:48
yrong (Contributor, Author) commented Nov 21, 2024

Reinitialized the alarms with more sensitive thresholds and restarted the monitor process.

We'll just watch CloudWatch for a while to see if that makes sense.

alistair-singh (Contributor) left a comment

+1, approving. Although I think we should move away from boolean-style metrics and just log numbers where possible. Maybe we can address that in a separate PR.

- # Scan interval (in minutes)
- SCAN_INTERVAL=60
+ # Scan interval (30 minutes by default)
+ SCAN_INTERVAL=30
alistair-singh (Contributor)

Do you think it would be valuable to do 15 minutes?

yrong (Contributor, Author)

I'd prefer to start with 30 minutes first.
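
For context, a rough sketch of how a SCAN_INTERVAL like this could drive the monitor loop; the wiring shown here is an assumption, not the code in this repository:

// Sketch: rerun the monitor every SCAN_INTERVAL minutes (30 by default).
const scanIntervalMinutes = Number(process.env["SCAN_INTERVAL"] ?? "30");

async function monitorLoop(runOnce: () => Promise<void>): Promise<void> {
  for (;;) {
    try {
      await runOnce();
    } catch (err) {
      // A failed scan is logged and retried on the next tick.
      console.error("scan failed", err);
    }
    await new Promise((resolve) => setTimeout(resolve, scanIntervalMinutes * 60 * 1000));
  }
}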

Value: Number(channel.toEthereum.outbound < channel.toEthereum.inbound),
alistair-singh (Contributor)

Instead of logging booleans, maybe we should log the difference. This would then be a metric of undelivered messages.

yrong (Contributor, Author)

785b59b adds metrics for the difference.
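
The real change is in 785b59b; as a rough sketch of what a difference metric could look like, assuming the AWS SDK v3 CloudWatch client (the namespace and metric name below are illustrative):

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cloudWatch = new CloudWatchClient({});

// Sketch: publish the outbound/inbound gap as a plain number, so the alarm
// threshold can be tuned in CloudWatch rather than hard-coded as a boolean check.
async function reportUndeliveredToEthereum(channel: {
  toEthereum: { outbound: number; inbound: number };
}): Promise<void> {
  const undelivered = channel.toEthereum.outbound - channel.toEthereum.inbound;
  await cloudWatch.send(
    new PutMetricDataCommand({
      Namespace: "Snowbridge/Monitor", // illustrative namespace
      MetricData: [
        {
          MetricName: "ToEthereumUndeliveredMessages", // illustrative metric name
          Value: undelivered,
          Unit: "Count",
        },
      ],
    })
  );
}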

-   ? parseInt(process.env["CheckIntervalToEthereum"])
-   : status.BlockLatencyThreshold.ToEthereum) &&
- metrics.bridgeStatus.toEthereum.latestPolkadotBlockOnEthereum <=
+ metrics.bridgeStatus.toEthereum.blockLatency > BlockLatencyThreshold.ToEthereum &&
alistair-singh (Contributor)

Maybe we should also just log the blockLatency as a plain number instead of a boolean. That would make threshold management easier.
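
To illustrate the easier threshold management, a sketch of an alarm defined over a numeric latency metric, assuming the AWS SDK v3 CloudWatch client; the alarm, namespace, and metric names are illustrative:

import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cloudWatch = new CloudWatchClient({});

// Sketch: alarm directly on a numeric block-latency metric, so the threshold
// lives in the alarm definition instead of the monitoring code.
async function createBlockLatencyAlarm(thresholdBlocks: number): Promise<void> {
  await cloudWatch.send(
    new PutMetricAlarmCommand({
      AlarmName: "ToEthereumBlockLatency", // illustrative alarm name
      Namespace: "Snowbridge/Monitor", // illustrative namespace
      MetricName: "ToEthereumBlockLatency", // illustrative metric name
      Statistic: "Maximum",
      Period: 1800, // seconds; roughly matches a 30-minute scan interval
      EvaluationPeriods: 1,
      Threshold: thresholdBlocks,
      ComparisonOperator: "GreaterThanThreshold",
    })
  );
}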

