Commit 5eee6e5

feat(cloudwatch): Add DSQL system diagnostics skill
Add a new skill for diagnosing Aurora DSQL performance via PromQL queries against CloudWatch OTel metrics (db.active_sessions.avg).

Key capabilities:
- Temporal trend analysis with baseline comparison (hour/day/week)
- Wait event distribution shift detection (>30% threshold)
- Top-SQL regression identification (new or growing queries)
- Commit analysis via CW metrics (TotalTransactions/OccConflicts)
- Automatic handoff to the dsql skill for live database investigation

Includes:
- SKILL.md with 6 workflows and handoff protocol
- wait-events.md with canonical DSQL wait event definitions
- promql-patterns.md with reusable query templates
- MCP configuration for CloudWatch server
1 parent e4f10c2 commit 5eee6e5

5 files changed

Lines changed: 823 additions & 0 deletions

Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
---
name: dsql-system-diagnostics
description: Diagnose Aurora DSQL performance issues using PromQL queries against CloudWatch OTel metrics. Detect anomalies in wait event distribution, identify regression points, and hand off to the dsql skill for live database investigation.
---

# DSQL System Diagnostics

Diagnose Aurora DSQL cluster performance by querying Average Active Sessions (AAS) via PromQL, detecting temporal anomalies in wait event distribution, and handing off to the `dsql` skill for root cause analysis.

**Key capabilities:**
- Temporal trend analysis of AAS via the `db.active_sessions.avg` metric
- Wait event distribution shift detection
- Top-SQL regression identification (new or growing queries)
- Automatic handoff to the `dsql` skill for live database investigation

**Important principles:**
- There is no upper bound to AAS in DSQL, so absolute values are not inherently problematic
- What matters is **change over time**: shifts in wait event distribution, new queries appearing, or existing queries consuming disproportionately more time
- This skill observes via CloudWatch only. It does **not** recommend schema changes, indexing strategies, or query rewrites. Those require live database access via the `dsql` skill.

---

## Prerequisites

**MUST** have before starting:
1. A specific `cluster_id` to investigate. Never proceed without one; ask the user if it is not provided.
2. The CloudWatch MCP server configured with PromQL access (see [mcp-setup.md](mcp/mcp-setup.md))

---

## Reference Files

### MCP:
#### [mcp-setup.md](mcp/mcp-setup.md)
**When:** ALWAYS load before starting a diagnostic session
**Contains:** CloudWatch MCP server configuration for PromQL access

#### [.mcp.json](mcp/.mcp.json)
**When:** Load when setting up MCP servers for the first time
**Contains:** Sample MCP configuration for the CloudWatch MCP server

### [wait-events.md](references/wait-events.md)
**When:** ALWAYS load when interpreting AAS results
**Contains:** DSQL wait events with canonical descriptions and investigation guidance

### [promql-patterns.md](references/promql-patterns.md)
**When:** Load when constructing PromQL queries
**Contains:** Reusable PromQL query templates for all workflows

---

## Core Concept: Average Active Sessions (AAS)

The primary metric is `db.active_sessions.avg`: the average number of sessions actively executing or waiting at a given instant.

**Normalized SQL and AAS interpretation:** All SQL in the metric is normalized (parameterized). The `normalized_sql` label groups all executions of the same query shape. A query with high AAS could mean either:
- A slow query (high per-execution cost)
- A fast query called at very high frequency (volume accumulates AAS)

This skill cannot distinguish between these cases; the `dsql` skill can, using call counts and per-execution timing.
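The ambiguity above follows from the general relation between concurrency, arrival rate, and time per request (Little's law). The sketch below is only an illustration with made-up numbers; the function name and the example rates are assumptions, not values taken from DSQL.

```python
def average_active_sessions(calls_per_second: float, avg_seconds_per_call: float) -> float:
    """AAS contributed by one query shape: call rate multiplied by time per call."""
    return calls_per_second * avg_seconds_per_call


# Hypothetical slow query: one call every two seconds, four seconds each.
slow_query = average_active_sessions(calls_per_second=0.5, avg_seconds_per_call=4.0)

# Hypothetical hot query: a thousand calls per second, two milliseconds each.
hot_query = average_active_sessions(calls_per_second=1000.0, avg_seconds_per_call=0.002)

print(slow_query, hot_query)
```

Both query shapes contribute about two average active sessions, so the AAS value alone cannot say which of the two situations applies.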

| Label | Purpose |
|-------|---------|
| `wait_event` | Which wait the session is in (OnCpu, ClientRead, Commit, etc.) |
| `normalized_sql` | SQL fingerprint: groups identical query shapes |
| `query_id` | Correlates with DSQL `EXPLAIN` Query Identifier |
| `application_name` | Client application identifier |
| `iam_role_arn` | IAM role used for the connection |
| `session_state` | Session state (active) |
| `@resource.aws.auroradsql.cluster_id` | Cluster identifier for filtering |
| `@resource.cloud.resource_id` | Full cluster ARN |

---

## Workflow 1: Quick Health Check

**Goal:** Detect whether the cluster's wait event distribution has changed compared to historical baselines.

**Steps:**
1. Confirm you have a specific `cluster_id`; do not proceed without one
2. Query AAS by `wait_event` for the **current hour** in 10-minute chunks (step=60s), and compare the chunks against each other to detect recent shifts
3. Query AAS by `wait_event` for the **same hour yesterday** (baseline 1)
4. Query AAS by `wait_event` for the **same hour last week** (baseline 2)
5. Compute the distribution (% each wait event contributes to total AAS) for each period
6. Flag any wait event where the proportion changed by >30% vs either baseline
7. Load [wait-events.md](references/wait-events.md) and interpret flagged changes

**Critical rules:**
- **MUST** have a specific `cluster_id` before proceeding
- **MUST** filter by cluster using `"@resource.aws.auroradsql.cluster_id"`
- **MUST** compare against temporal baselines; do NOT report absolute AAS values as inherently problematic
- **MUST** split the current hour into 10-minute chunks and compare them to detect intra-hour shifts
- A >30% change in a wait event's share of total AAS warrants flagging to the user
- If total AAS increased but distribution is unchanged, this may be legitimate load growth; report it but do not raise an alarm

**Example:**
```promql
# Current hour (split into chunks for intra-hour comparison)
# NOW is a placeholder; pass RFC 3339 timestamps for start and end
execute_promql_range_query(
    query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})',
    start="NOW-1h", end="NOW", step="60s"
)

# Same hour yesterday
execute_promql_range_query(
    query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})',
    start="NOW-25h", end="NOW-24h", step="60s"
)

# Same hour last week
execute_promql_range_query(
    query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})',
    start="NOW-169h", end="NOW-168h", step="60s"
)
```
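Steps 5 and 6 can be sketched in Python as follows. This is a minimal illustration, not the skill's implementation: it assumes each range-query result has already been reduced to a mapping from wait event to a list of sample values, and it reads the 30% threshold as a relative change in a wait event's share. Both function names are hypothetical.

```python
def distribution(series: dict[str, list[float]]) -> dict[str, float]:
    """Share of total AAS per wait event, using the sum of each series' samples."""
    totals = {event: sum(values) for event, values in series.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {event: 0.0 for event in totals}
    return {event: total / grand_total for event, total in totals.items()}


def flag_shifts(current: dict[str, float], baseline: dict[str, float],
                threshold: float = 0.30) -> dict[str, tuple[float, float]]:
    """Return {wait_event: (baseline_share, current_share)} for flagged events.

    Assumption: "changed by >30%" means a relative change in share.
    A wait event that is absent from the baseline but present now is flagged.
    """
    flagged = {}
    for event in sorted(set(current) | set(baseline)):
        before = baseline.get(event, 0.0)
        now = current.get(event, 0.0)
        if before == 0.0:
            changed = now > 0.0
        else:
            changed = abs(now - before) / before > threshold
        if changed:
            flagged[event] = (before, now)
    return flagged


# Made-up sample values for two wait events.
current = distribution({"OnCpu": [2.0, 2.0], "Commit": [2.0, 2.0]})
yesterday = distribution({"OnCpu": [3.0, 3.0], "Commit": [1.0, 1.0]})
print(flag_shifts(current, yesterday))
```

The same `flag_shifts` call can be repeated against the last-week baseline and between adjacent 10-minute chunks of the current hour.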

---

## Workflow 2: Top-SQL Regression Detection

**Goal:** Identify SQL statements that have become more prominent compared to baseline periods.

**Steps:**
1. Query top-N SQL by AAS for the current period
2. Query top-N SQL for the same period yesterday and last week
3. Identify queries that are **new** in the top-N or have **grown** significantly vs baseline
4. Note which `wait_event` dominates for the regressed queries

**Critical rules:**
- **MUST** include `query_id` in grouping, since it is the stable identifier for handoff to the `dsql` skill
- **MUST** compare top-N across periods; a query being #1 is only notable if it wasn't before
- **MUST NOT** recommend indexing or schema changes; defer to the `dsql` skill

**Example:**
```promql
# Top 5 SQL for current period
execute_promql_query(query='topk(5, sum by (normalized_sql, query_id)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))')

# Top 10 (SQL, wait event) combinations, for wait event context
execute_promql_query(query='topk(10, sum by (normalized_sql, query_id, wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))')
```
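The comparison in step 3 might look like the sketch below. It assumes each top-N result has been reduced to a mapping from `query_id` to AAS, and the `growth_factor` of 1.5 is an arbitrary illustrative cutoff for "grown significantly", not a value defined by this skill.

```python
from typing import Optional


def find_regressions(current: dict[str, float], baseline: dict[str, float],
                     growth_factor: float = 1.5) -> list[tuple[str, str, Optional[float], float]]:
    """Classify current top-N entries as 'new' or 'grown' versus a baseline top-N.

    Returns (query_id, kind, baseline_aas, current_aas) tuples, largest current AAS first.
    """
    findings = []
    ranked = sorted(current.items(), key=lambda item: item[1], reverse=True)
    for query_id, aas in ranked:
        if query_id not in baseline:
            # Not in the baseline top-N: new to the list, not necessarily new to the cluster.
            findings.append((query_id, "new", None, aas))
        elif aas >= baseline[query_id] * growth_factor:
            findings.append((query_id, "grown", baseline[query_id], aas))
    return findings


# Made-up query identifiers and AAS values.
today = {"q1": 4.0, "q2": 1.0, "q3": 0.5}
yesterday = {"q1": 2.0, "q2": 1.0}
print(find_regressions(today, yesterday))
```

Where `query_id` is missing from a series, the same function works with `normalized_sql` as the key.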

---

## Workflow 3: Triage Decision Tree

**Goal:** Guide investigation based on which wait event has shifted.

**Steps:**
1. Run Workflow 1 to identify which wait event(s) have changed
2. Branch on the wait event showing the largest shift:

| Changed Wait Event | Investigation |
|---|---|
| **OnCpu** | Identify which queries grew via Workflow 2. Hand off to `dsql` skill. |
| **ClientRead** | Top IAM role + application. Indicates idle-in-transaction growth, a client-side issue. |
| **ClientWrite** | Top application. Client is slow consuming results; check client health. |
| **SequentialScanRead** | Identify query via Workflow 2. Hand off to `dsql` skill; may be a plan regression or missing index. |
| **ScatteredBatchRead** | Identify query. Hand off to `dsql` skill. |
| **SingleRead** | Identify query. Hand off to `dsql` skill. |
| **FkExistenceCheck** | Identify query. Check whether insert volume increased. |
| **UniqueConstraintCheck** | Identify query. Check whether insert/upsert patterns changed. |
| **Commit** | Run Workflow 6 (Commit Analysis) to distinguish volume increase from OCC conflicts. |
| **PgSleep** | Identify application. Verify it is intentional; may indicate new polling behavior. |

3. Load [wait-events.md](references/wait-events.md) for detailed investigation guidance

---

## Workflow 4: IAM Role and Application Attribution

**Goal:** Identify which roles or applications are driving anomalous changes.

**Critical rules:**
- **MUST** compare against baseline; an application being dominant is only noteworthy if it has changed
- Report the delta: "application X increased from 30% to 55% of total AAS"

**Example:**
```promql
# Top IAM roles
execute_promql_query(query='topk(5, sum by (iam_role_arn)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))')

# Top applications
execute_promql_query(query='topk(5, sum by (application_name)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))')
```

---

## Workflow 5: Time-Series Investigation

**Goal:** Identify when a change occurred and pinpoint the inflection point.

**Critical rules:**
- **SHOULD** use step: 60s (≤ 1h), 300s (1–6h), 900s (6–24h), 3600s (> 24h)
- **MUST** specify `start` and `end` in RFC 3339 format
- Maximum range per query is 7 days; split longer investigations into multiple queries
- Look for: inflection points, step-changes in specific wait events, distribution shifts

**Example:**
```promql
execute_promql_range_query(
    query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})',
    start="2024-01-15T10:00:00Z",
    end="2024-01-15T16:00:00Z",
    step="300s"
)
```
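The step, timestamp, and range rules above can be sketched as small helpers. This is an illustration under assumptions: the helper names are hypothetical, 900s is treated as covering windows between 6 and 24 hours, and inputs are expected to be timezone-aware `datetime` objects.

```python
from datetime import datetime, timedelta, timezone

MAX_RANGE = timedelta(days=7)  # maximum range per query stated above


def choose_step(start: datetime, end: datetime) -> str:
    """Pick a step for the window length, following the SHOULD rule above."""
    span = end - start
    if span <= timedelta(hours=1):
        return "60s"
    if span <= timedelta(hours=6):
        return "300s"
    if span <= timedelta(hours=24):
        return "900s"
    return "3600s"


def rfc3339(ts: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def split_range(start: datetime, end: datetime) -> list[tuple[str, str]]:
    """Split [start, end] into consecutive windows of at most 7 days each."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + MAX_RANGE, end)
        windows.append((rfc3339(cursor), rfc3339(window_end)))
        cursor = window_end
    return windows


begin = datetime(2024, 1, 1, tzinfo=timezone.utc)
finish = datetime(2024, 1, 11, tzinfo=timezone.utc)
for window_start, window_end in split_range(begin, finish):
    print(window_start, window_end)
```

Each `(start, end)` pair can then be passed to a separate `execute_promql_range_query` call.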

---

## Workflow 6: Commit Analysis

**Goal:** Distinguish between increased commit volume (legitimate) and OCC conflict growth (problematic).

**Steps:**
1. Confirm Commit wait has shifted (from Workflow 1 comparison)
2. Query standard CloudWatch metrics for the same period:
   - `AuroraDSQL` namespace, dimension `ClusterId`
   - Metric: `TotalTransactions`, the rate of committed transactions
   - Metric: `OccConflicts`, the rate of optimistic concurrency conflicts
3. Compare the ratios:
   - If TotalTransactions increased proportionally to Commit AAS → legitimate load growth
   - If OccConflicts increased disproportionately → write-write conflict problem
   - If Commit AAS increased but TotalTransactions did not → transactions are taking longer to commit

**Example (standard CW metrics):**
```
get_metric_data(
    namespace="AuroraDSQL",
    metric_name="TotalTransactions",
    dimensions=[{name: "ClusterId", value: "CLUSTER_ID"}],
    statistic="Sum"
)

get_metric_data(
    namespace="AuroraDSQL",
    metric_name="OccConflicts",
    dimensions=[{name: "ClusterId", value: "CLUSTER_ID"}],
    statistic="Sum"
)
```
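The ratio comparison in step 3 could be expressed roughly as below. The function name, the 25% `tolerance`, and the order in which the cases are checked are all assumptions made for illustration; the skill itself does not define what counts as "proportional" or "disproportionate". Each argument is a growth ratio (current value divided by baseline value).

```python
def classify_commit_shift(commit_aas_growth: float, transactions_growth: float,
                          occ_conflicts_growth: float, tolerance: float = 0.25) -> str:
    """Map the three growth ratios onto the outcomes listed in step 3."""
    # OCC conflicts growing clearly faster than transaction volume.
    if occ_conflicts_growth > transactions_growth * (1 + tolerance):
        return "occ_conflict_growth"
    # Transaction volume tracking Commit AAS within the tolerance.
    if abs(commit_aas_growth - transactions_growth) <= tolerance * commit_aas_growth:
        return "load_growth"
    # Commit AAS up without a matching rise in transactions.
    if commit_aas_growth > transactions_growth:
        return "slower_commits"
    return "inconclusive"


# Made-up ratios: everything doubled, conflicts tripled, and only Commit AAS doubled.
print(classify_commit_shift(2.0, 2.0, 2.0))
print(classify_commit_shift(2.0, 1.9, 6.0))
print(classify_commit_shift(2.0, 1.0, 1.0))
```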

---

## Idle Cluster Detection

A cluster is idle when there is no AAS data for a period. Use a range query and look for gaps (missing timestamps) in the time series.

**Pattern: Sporadic workload.** Periods of no data interspersed with periods of AAS > 0 indicate a cluster performing scheduled or batch work.

```promql
execute_promql_range_query(
    query='sum({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})',
    start="NOW-24h", end="NOW", step="300s"
)
```
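Finding the gaps can be sketched as follows, assuming the range-query result has been reduced to a list of Unix timestamps (in seconds) at which samples exist. The function name is hypothetical, and the sketch only finds gaps between samples, not missing data at the very start or end of the window.

```python
def find_idle_gaps(timestamps: list[int], step_seconds: int) -> list[tuple[int, int]]:
    """Return (first_missing, last_missing) timestamp pairs for each gap in the series."""
    gaps = []
    ordered = sorted(timestamps)
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > step_seconds:
            gaps.append((previous + step_seconds, current - step_seconds))
    return gaps


# Made-up series with step=300s: samples at 900, 1200, and 1500 are missing.
samples = [0, 300, 600, 1800, 2100]
print(find_idle_gaps(samples, step_seconds=300))
```

Several such gaps spread across a 24-hour window would match the sporadic workload pattern described above.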

---

## DSQL Skill Handoff

When investigation identifies a query that has become more prominent, hand off to the `dsql` skill for live database analysis. Do not provide specific diagnostics or recommendations; simply describe the observed anomaly.

### Automatic Handoff

Before handing off, verify the `dsql` skill's MCP server is configured for the cluster under investigation:

1. Check if the `aurora-dsql` MCP server is configured (look for it in the active MCP servers)
2. Verify the `--cluster_endpoint` in its args matches the cluster being investigated
3. If not configured or pointing to a different cluster, prompt the user:
   > "The Aurora DSQL MCP server needs to be configured for cluster `{CLUSTER_ID}` to proceed with live database investigation. Please add or update the `aurora-dsql` server in your MCP configuration with `--cluster_endpoint {CLUSTER_ENDPOINT}`."

### Handoff Format

When handing off, describe only the observed anomaly; do not suggest causes or fixes:

> "Query `{NORMALIZED_SQL}` (query_id: `{QUERY_ID}`) is using significantly more system time than it did {TIMEFRAME} ago. Its share of cluster AAS on `{WAIT_EVENT}` has grown from {OLD}% to {NEW}%. Please diagnose what is happening with this query."
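Filling in the handoff template might look like the sketch below. The function and parameter names are hypothetical, the shares are assumed to be fractions between 0 and 1, and the example query is invented.

```python
HANDOFF_TEMPLATE = (
    "Query `{normalized_sql}` (query_id: `{query_id}`) is using significantly more "
    "system time than it did {timeframe} ago. Its share of cluster AAS on "
    "`{wait_event}` has grown from {old:.0f}% to {new:.0f}%. "
    "Please diagnose what is happening with this query."
)


def build_handoff(normalized_sql: str, query_id: str, wait_event: str,
                  timeframe: str, old_share: float, new_share: float) -> str:
    """Render the handoff message, converting fractional shares to whole percentages."""
    return HANDOFF_TEMPLATE.format(
        normalized_sql=normalized_sql,
        query_id=query_id,
        wait_event=wait_event,
        timeframe=timeframe,
        old=old_share * 100,
        new=new_share * 100,
    )


message = build_handoff(
    normalized_sql="SELECT * FROM orders WHERE customer_id = $1",  # invented example
    query_id="12345",                                              # invented example
    wait_event="SequentialScanRead",
    timeframe="one week",
    old_share=0.10,
    new_share=0.35,
)
print(message)
```

The message states only what was observed, which keeps cause analysis with the `dsql` skill.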

### When to Hand Off

- Any query identified in Workflow 2 as newly prominent or significantly grown
- SequentialScanRead, OnCpu, ScatteredBatchRead, or SingleRead shifts where a specific query is responsible
- OCC conflict growth confirmed in Workflow 6

---

## Error Handling

| Situation | Action |
|-----------|--------|
| No cluster_id provided | Ask the user; never proceed without a specific cluster |
| No series found | Verify cluster ID with `get_promql_label_values`. Check region. |
| Empty result (no data) | Cluster is idle for that period. Widen time window. |
| `query_id` missing | Not all queries emit it. Filter by `normalized_sql` instead. |
| PromQL timeout | Reduce cardinality: use fewer labels or a shorter time range. |
| Range > 7 days | Split into multiple 7-day range queries. |
| `dsql` skill not available | Prompt user to install from `awslabs/agent-plugins` (plugin: `databases-on-aws`) |
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
{
  "mcpServers": {
    "cloudwatch": {
      "args": [
        "awslabs.cloudwatch-mcp-server@latest"
      ],
      "command": "uvx",
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    }
  }
}
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# MCP Server Setup Instructions

This skill uses the CloudWatch MCP Server with PromQL tools for querying Aurora DSQL OTel metrics.

## Prerequisites

```bash
uv --version
```

**If missing:** Install from [Astral](https://docs.astral.sh/uv/getting-started/installation/)

## Required Tools

| Tool | Purpose |
|------|---------|
| `execute_promql_query` | Instant point-in-time query |
| `execute_promql_range_query` | Time-series query over a window |
| `get_promql_label_values` | Discover label values (cluster IDs, wait events) |
| `get_promql_series` | Discover available metric series |
| `get_promql_labels` | List available label names |

## MCP Configuration

```json
{
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    }
  }
}
```

### Optional Environment Variables

| Variable | When Needed |
|----------|-------------|
| `AWS_PROFILE` | Non-default AWS profile |
| `AWS_REGION` | Override default region; MUST match the DSQL cluster's region |
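As an illustration of where these optional variables would go, the sketch below builds the same configuration in Python and adds them to the server's `env` map. It assumes the variables are passed through the `env` block like `FASTMCP_LOG_LEVEL`; the helper name and the example region are hypothetical.

```python
import json
from typing import Optional


def build_mcp_config(profile: Optional[str] = None, region: Optional[str] = None) -> dict:
    """Build the CloudWatch MCP server config, adding optional AWS settings to env."""
    env = {"FASTMCP_LOG_LEVEL": "ERROR"}
    if profile:
        env["AWS_PROFILE"] = profile
    if region:
        env["AWS_REGION"] = region  # must match the DSQL cluster's region
    return {
        "mcpServers": {
            "cloudwatch": {
                "command": "uvx",
                "args": ["awslabs.cloudwatch-mcp-server@latest"],
                "env": env,
            }
        }
    }


config = build_mcp_config(region="us-east-1")  # example region only
print(json.dumps(config, indent=2))
```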

## Example: Kiro CLI

Create `~/.kiro/agents/dsql-diagnostics.json`:

```json
{
  "$schema": "https://raw.githubusercontent.com/aws/amazon-q-developer-cli/refs/heads/main/schemas/agent-v1.json",
  "name": "dsql-diagnostics",
  "description": "Diagnose Aurora DSQL performance via CloudWatch PromQL metrics",
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": { "FASTMCP_LOG_LEVEL": "ERROR" }
    }
  },
  "resources": [
    "skill://.kiro/skills/dsql-system-diagnostics/SKILL.md"
  ],
  "tools": ["fs_read", "execute_bash", "@cloudwatch"]
}
```

### Launch

```bash
kiro-cli chat --agent dsql-diagnostics
```

### Verification

Confirm the CloudWatch MCP server is connected, then run:
```
check health of cluster <CLUSTER_ID>
```
