Commit 9e90c84

feat: group replication alerts (#680)
* feat: add gr alert rules
* test: alerts unit test
* docs: update docs
* fix: ci alert unit test job
1 parent 6d2662f commit 9e90c84

File tree

7 files changed (+531, -38 lines changed)


.github/workflows/ci.yaml

Lines changed: 14 additions & 0 deletions
@@ -42,6 +42,20 @@ jobs:
       - name: Upload Coverage to Codecov
         uses: codecov/codecov-action@v5
 
+  alert-test:
+    name: Test Prometheus Alert Rules
+    runs-on: ubuntu-latest
+    timeout-minutes: 5
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v5
+      - name: Install prometheus snap
+        run: sudo snap install prometheus
+      - name: Check validity of prometheus alert rules
+        run: promtool check rules src/alert_rules/prometheus/*
+      - name: Run unit tests for prometheus alert rules
+        run: promtool test rules tests/alerts/*.yaml
+
   build:
     name: Build charm
     uses: canonical/data-platform-workflows/.github/workflows/[email protected]
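The new `alert-test` job validates rule syntax with `promtool check rules` and then runs `promtool test rules tests/alerts/*.yaml`. As a rough illustration of the promtool unit-test format such files use, here is a minimal sketch; the relative `rule_files` path, the `mysql-0` instance label, and the scenario are assumptions for illustration, not the repository's actual tests.

```yaml
# Hypothetical tests/alerts example exercising the MySQLDown rule (for: 0m).
rule_files:
  - ../../src/alert_rules/prometheus/metrics_alert_rules.yaml  # assumed relative path

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: the instance is up, then goes down.
      - series: 'mysql_up{instance="mysql-0"}'
        values: '1 0 0'
    alert_rule_test:
      # While mysql_up == 1, nothing should fire.
      - eval_time: 0m
        alertname: MySQLDown
        exp_alerts: []
      # Two minutes in, the expression is true and, with `for: 0m`, the alert is firing.
      - eval_time: 2m
        alertname: MySQLDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: mysql-0
            # exp_annotations must reproduce the rule's rendered annotations exactly,
            # including the templated LABELS line.
            exp_annotations:
              summary: MySQL instance mysql-0 is down.
              description: |
                The MySQL instance is not reachable.
                Please check if the MySQL process is running and the network connectivity.
                LABELS = map[instance:mysql-0].
```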

docs/how-to/monitoring-cos/enable-alert-rules.md

Lines changed: 3 additions & 1 deletion
@@ -71,7 +71,9 @@ juju switch <k8s_cos_controller>:<cos_model_name>
 juju config alertmanager [email protected]
 ```
 
-At this stage, the COS Alert Manager will start sending alert notifications to Pushover. Users can receive them on all supported [Pushover clients/apps](https://pushover.net/clients).
+At this stage, the COS Alert Manager will start sending alert notifications to Pushover. Users can receive them on all supported [Pushover clients/apps](https://pushover.net/clients).
+
+> Some alert rules use `for: 0m`, but may still appear delayed. This is because Prometheus evaluates alert rules at intervals (configured via [`evaluation_interval`](https://charmhub.io/prometheus-k8s/configurations#evaluation_interval), typically every minute) and depends on fresh data scraped at its own intervals (default: 1 min). As a result, the best-case alert delay is: **scrape interval + evaluation interval**.
 
 <details><summary>Screenshot of Pushover web client
 </summary>
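To make the timing in that note concrete, the two intervals involved look like this in a plain Prometheus configuration. The fragment below is purely illustrative; in COS these values come from the prometheus-k8s charm configuration rather than a hand-edited prometheus.yml.

```yaml
# Illustrative prometheus.yml fragment; values mirror the defaults mentioned above.
global:
  scrape_interval: 1m       # how often exporter targets are scraped
  evaluation_interval: 1m   # how often alert rules are evaluated

# Even a `for: 0m` rule only fires after the next scrape *and* the next rule
# evaluation, so the best-case delay is roughly scrape_interval + evaluation_interval
# (about two minutes with the values above).
```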

docs/reference/alert-rules.md

Lines changed: 24 additions & 11 deletions
@@ -1,25 +1,38 @@
 # Alert rules
 
-This page contains a markdown version of the alert rules described in the `mysql-operator` repository.
+This page contains a markdown version of the alert rules described in the `mysql-operator` repository.
 
 See the source of truth on GitHub for the latest information, or an older version:
 
-[`alert_rules/prometheus/metrics_alert_rules.yaml`](https://github.com/canonical/mysql-operator/blob/main/src/alert_rules/prometheus/metrics_alert_rules.yaml)
+[`alert_rules/prometheus/`](https://github.com/canonical/mysql-operator/blob/main/src/alert_rules/prometheus/)
 
-## MySQLExporter
+## MySQL General Alerts
 
 | Alert | Severity | Notes |
-|------|----------|-------|
-| MySQLDown | ![critical] | MySQL instance is down.<br> |
+| ----- | -------- | ----- |
+| MySQLDown | ![critical] | MySQL instance is down.<br>Please check if the MySQL process is running and the network connectivity. |
+| MySQLMetricsScrapeError | ![warning] | MySQL Exporter encountered a metrics scrape error.<br>Check the MySQL Exporter logs for more details. |
 | MySQLTooManyConnections(>90%) | ![warning] | MySQL instance is using > 90% of `max_connections`.<br>Consider checking the client application responsible for generating those additional connections. |
-| MySQLHighThreadsRunning | ![warning] | MySQL instance is actively using > 80% of `max_connections`.<br>Consider reviewing the value of the `max-connections` config parameter or allocate more resources to your database server. |
-| MySQLHighPreparedStatementsUtilization(>80%) | ![warning] | MySQL instance is using > 80% of `max_prepared_stmt_count`.<br>Too many prepared statements might consume a lot of memory. |
-| MySQLSlowQueries | ![info] | MySQL instance has a slow query.<br>Consider optimizing the query by reviewing its execution plan, then rewrite the query and add any relevant indexes. |
-| MySQLInnoDBLogWaits | ![warning] | MySQL instance has long InnoDB log waits.<br>MySQL InnoDB log writes might be stalling. Check I/O activity on your nodes to find the responsible process or query. Consider using `iotop` and the `performance_schema`. |
-| MySQLRestarted | ![info] | MySQL instance restarted.<br>MySQL restarted less than one minute ago. If the restart was unplanned or frequent, check Loki logs (e.g. `error.log`). |
+| MySQLHighThreadsRunning | ![warning] | MySQL instance is actively using > 80% of `max_connections`.<br>Consider reviewing the value of the `max-connections` config parameter or allocate more resources to your database server. |
+| MySQLHighPreparedStatementsUtilization(>80%) | ![warning] | MySQL instance is using > 80% of `max_prepared_stmt_count`.<br>Too many prepared statements might consume a lot of memory. |
+| MySQLSlowQueries | ![info] | MySQL instance has slow queries.<br>Consider optimizing the query by reviewing its execution plan, then rewrite the query and add any relevant indexes. |
+| MySQLInnoDBLogWaits | ![warning] | MySQL instance has long InnoDB log waits.<br>MySQL InnoDB log writes might be stalling. Check I/O activity on your nodes to find the responsible process or query. |
+| MySQLRestarted | ![info] | MySQL instance restarted.<br>MySQL restarted less than one minute ago. If the restart was unplanned or frequent, check Loki logs (e.g. `error.log`). |
+| MySQLConnectionErrors | ![warning] | MySQL instance has connection errors.<br>Connection errors might indicate network issues, authentication problems, or resource limitations. Check the MySQL logs for more details. |
+
+## MySQL Replication Alerts
+
+| Alert | Severity | Notes |
+| ----- | -------- | ----- |
+| MySQLClusterUnitOffline | ![warning] | MySQL cluster member is marked **offline**.<br>The process might still be running, but the member is excluded from the cluster. |
+| MySQLClusterNoPrimary | ![critical] | No **primary** in the cluster.<br>The cluster will likely be in a Read-Only state. Check cluster health and logs. |
+| MySQLClusterTooManyPrimaries | ![critical] | More than one **primary** detected.<br>This can indicate a **split-brain** situation. Refer to troubleshooting docs. |
+| MySQLNoReplication | ![warning] | No **secondary** members in the cluster.<br>The cluster is not redundant and failure of the primary will cause downtime. |
+| MySQLGroupReplicationReduced | ![warning] | The number of ONLINE members in the replication group has reduced compared to the maximum observed in the last 6 hours.<br>Check cluster health and logs. |
+| MySQLGroupReplicationConflicts | ![warning] | Conflicts detected in Group Replication.<br>Indicates concurrent writes to the same rows/keys across members. Investigate logs and cluster status. |
+| MySQLGroupReplicationQueueSizeHigh | ![warning] | High number of transactions in Group Replication queue (>100).<br>May indicate network issues or overloaded nodes. Investigate cluster performance. |
 
 <!-- Badges -->
 [info]: https://img.shields.io/badge/info-blue
 [warning]: https://img.shields.io/badge/warning-yellow
 [critical]: https://img.shields.io/badge/critical-red
-

Lines changed: 41 additions & 26 deletions
@@ -1,20 +1,30 @@
 groups:
-  - name: MySQLExporter
-
+  - name: MySQL General Alert Rules
     rules:
-      # 2.1.1
       - alert: MySQLDown
-        expr: "mysql_up == 0"
+        expr: mysql_up == 0
         for: 0m
         labels:
           severity: critical
         annotations:
-          summary: MySQL instance {{ $labels.instance }} is down.
+          summary: MySQL instance {{ $labels.instance }} is down.
           description: |
+            The MySQL instance is not reachable.
+            Please check if the MySQL process is running and the network connectivity.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLMetricsScrapeError
+        expr: increase(mysql_exporter_last_scrape_error[5m]) > 1
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL instance {{ $labels.instance }} has a metrics scrape error.
+          description: |
+            The MySQL Exporter encountered an error while scraping metrics.
+            Check the MySQL Exporter logs for more details.
             LABELS = {{ $labels }}.
 
-      # 2.1.2
-      # customized: 80% -> 90%
       - alert: MySQLTooManyConnections(>90%)
         expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 90
         for: 2m
@@ -24,10 +34,8 @@ groups:
           summary: MySQL instance {{ $labels.instance }} is using > 90% of `max_connections`.
           description: |
             Consider checking the client application responsible for generating those additional connections.
-            LABELS = {{ $labels }}.
+            LABELS = {{ $labels }}.
 
-      # 2.1.4
-      # customized: 60% -> 80%
       - alert: MySQLHighThreadsRunning
         expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 80
         for: 2m
@@ -36,10 +44,9 @@ groups:
         annotations:
           summary: MySQL instance {{ $labels.instance }} is actively using > 80% of `max_connections`.
           description: |
-            Consider reviewing the value of the `max-connections` config parameter or allocate more resources to your database server.
-            LABELS = {{ $labels }}.
+            Consider reviewing the value of the `max-connections` config parameter or allocate more resources to your database server.
+            LABELS = {{ $labels }}.
 
-      # 2.1.3
       - alert: MySQLHighPreparedStatementsUtilization(>80%)
         expr: max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80
         for: 2m
@@ -48,36 +55,32 @@ groups:
         annotations:
           summary: MySQL instance {{ $labels.instance }} is using > 80% of `max_prepared_stmt_count`.
           description: |
-            Too many prepared statements might consume a lot of memory.
-            LABELS = {{ $labels }}.
+            Too many prepared statements might consume a lot of memory.
+            LABELS = {{ $labels }}.
 
-      # 2.1.8
-      # customized: warning -> info
       - alert: MySQLSlowQueries
         expr: increase(mysql_global_status_slow_queries[1m]) > 0
         for: 2m
         labels:
           severity: info
         annotations:
-          summary: MySQL instance {{ $labels.instance }} has a slow query.
+          summary: MySQL instance {{ $labels.instance }} has slow queries.
           description: |
-            Consider optimizing the query by reviewing its execution plan, then rewrite the query and add any relevant indexes.
+            Consider optimizing the query by reviewing its execution plan, then rewrite the query and add any relevant indexes.
             LABELS = {{ $labels }}.
 
-      # 2.1.9
       - alert: MySQLInnoDBLogWaits
         expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
         for: 0m
         labels:
           severity: warning
         annotations:
-          summary: MySQL instance {{ $labels.instance }} has long InnoDB log waits.
+          summary: MySQL instance {{ $labels.instance }} has long InnoDB log waits.
           description: |
-            MySQL InnoDB log writes might be stalling.
-            Check I/O activity on your nodes to find the responsible process or query. Consider using iotop and the performance_schema.
+            MySQL InnoDB log writes might be stalling.
+            Check I/O activity on your nodes to find the responsible process or query. Consider using iotop and the performance_schema.
             LABELS = {{ $labels }}.
 
-      # 2.1.10
       - alert: MySQLRestarted
         expr: mysql_global_status_uptime < 60
         for: 0m
@@ -86,6 +89,18 @@ groups:
         annotations:
           summary: MySQL instance {{ $labels.instance }} restarted.
           description: |
-            MySQL restarted less than one minute ago.
-            If the restart was unplanned or frequent, check Loki logs (e.g. `error.log`).
+            MySQL restarted less than one minute ago.
+            If the restart was unplanned or frequent, check Loki logs (e.g. `error.log`).
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLConnectionErrors
+        expr: increase(mysql_global_status_connection_errors_total[5m]) > 10
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL instance {{ $labels.instance }} has connection errors.
+          description: |
+            Connection errors might indicate network issues, authentication problems, or resource limitations.
+            Check the MySQL logs for more details.
             LABELS = {{ $labels }}.

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+groups:
+  - name: MySQL Replication Alert Rules
+    rules:
+      - alert: MySQLClusterUnitOffline
+        expr: mysql_perf_schema_replication_group_member_info{member_state="OFFLINE"} == 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL cluster member {{ $labels.instance }} is offline
+          description: |
+            The MySQL member is marked offline in the cluster, although the process might still be running.
+            If this is unexpected, please check the logs.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLClusterNoPrimary
+        expr: absent(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY",member_state="ONLINE"})
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: MySQL cluster reports no primary
+          description: |
+            MySQL has no primaries. The cluster will likely be in a Read-Only state.
+            Please check the cluster health, the logs and investigate.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLClusterTooManyPrimaries
+        expr: count(mysql_perf_schema_replication_group_member_info{member_role="PRIMARY"}) > 1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: MySQL cluster reports more than one primary.
+          description: |
+            MySQL reports more than one primary. This can indicate a split-brain situation.
+            Please refer to the troubleshooting docs.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLNoReplication
+        expr: absent(mysql_perf_schema_replication_group_member_info{member_role="SECONDARY"})
+        for: 15m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL cluster has no secondaries.
+          description: |
+            The MySQL cluster has no secondaries. This means that the cluster is not redundant and a failure of the primary will lead to downtime.
+            Please check the cluster health, the logs and investigate.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLGroupReplicationReduced
+        expr: |
+          count(mysql_perf_schema_replication_group_member_info{member_state="ONLINE"} == 1)
+          <
+          max_over_time(
+            count(mysql_perf_schema_replication_group_member_info{member_state="ONLINE"} == 1)[6h:]
+          )
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL cluster's Group Replication size reduced
+          description: |
+            The number of ONLINE members in the MySQL Group Replication cluster has reduced compared to the maximum observed in the last 6 hours.
+            Please check the cluster health, the logs and investigate.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLGroupReplicationConflicts
+        expr: rate(mysql_perf_schema_conflicts_detected[5m]) > 0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL cluster reports Group Replication conflicts
+          description: |
+            Conflicts indicate concurrent writes to the same rows/keys across members.
+            Please check the cluster health, the logs and investigate.
+            LABELS = {{ $labels }}.
+
+      - alert: MySQLGroupReplicationQueueSizeHigh
+        expr: mysql_perf_schema_transactions_in_queue > 100
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: MySQL cluster reports high Group Replication queue size
+          description: |
+            A high number of transactions in the Group Replication queue might indicate network issues or overloaded nodes.
+            Please check the cluster health, the logs and investigate.
+            LABELS = {{ $labels }}.
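One subtlety in the rules above: `absent()`-based alerts such as MySQLClusterNoPrimary fire on a synthetic series whose labels come from the equality matchers in the selector, not from any real member. The sketch below shows how a promtool unit test for that case might look; the file name, relative path, and input series are assumptions, not the repository's actual tests.

```yaml
# Hypothetical unit test: a secondary is reporting, but no ONLINE PRIMARY series exists,
# so absent(...) produces a synthetic series and MySQLClusterNoPrimary fires.
rule_files:
  - ../../src/alert_rules/prometheus/replication_rules.yaml  # assumed file name and path

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'mysql_perf_schema_replication_group_member_info{member_role="SECONDARY",member_state="ONLINE",instance="mysql-1"}'
        values: '1 1'
    alert_rule_test:
      - eval_time: 1m
        alertname: MySQLClusterNoPrimary
        exp_alerts:
          - exp_labels:
              # absent() copies the equality matchers into the result's labels.
              member_role: PRIMARY
              member_state: ONLINE
              severity: critical
            exp_annotations:
              summary: MySQL cluster reports no primary
              description: |
                MySQL has no primaries. The cluster will likely be in a Read-Only state.
                Please check the cluster health, the logs and investigate.
                LABELS = map[member_role:PRIMARY member_state:ONLINE].
```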
