| 1 | +--- |
| 2 | +name: dsql-system-diagnostics |
| 3 | +description: Diagnose Aurora DSQL performance issues using PromQL queries against CloudWatch OTel metrics. Detect anomalies in wait event distribution, identify regression points, and hand off to the dsql skill for live database investigation. |
| 4 | +--- |
| 5 | + |
| 6 | +# DSQL System Diagnostics |
| 7 | + |
| 8 | +Diagnose Aurora DSQL cluster performance by querying Average Active Sessions (AAS) via PromQL, detecting temporal anomalies in wait event distribution, and handing off to the `dsql` skill for root cause analysis. |
| 9 | + |
| 10 | +**Key capabilities:** |
| 11 | +- Temporal trend analysis of AAS via `db.active_sessions.avg` metric |
| 12 | +- Wait event distribution shift detection |
| 13 | +- Top-SQL regression identification (new or growing queries) |
| 14 | +- Automatic handoff to `dsql` skill for live database investigation |
| 15 | + |
| 16 | +**Important principles:** |
| 17 | +- There is no upper bound to AAS in DSQL — absolute values are not inherently problematic |
| 18 | +- What matters is **change over time**: shifts in wait event distribution, new queries appearing, or existing queries consuming disproportionately more time |
| 19 | +- This skill observes via CloudWatch only — it does **not** recommend schema changes, indexing strategies, or query rewrites. Those require live database access via the `dsql` skill. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## Prerequisites |
| 24 | + |
| 25 | +**MUST** have before starting: |
| 26 | +1. A specific `cluster_id` to investigate — never proceed without one. Ask the user if not provided. |
| 27 | +2. The CloudWatch MCP server configured with PromQL access (see [mcp-setup.md](mcp/mcp-setup.md)) |
| 28 | + |
| 29 | +--- |
| 30 | + |
| 31 | +## Reference Files |
| 32 | + |
| 33 | +### MCP: |
| 34 | +#### [mcp-setup.md](mcp/mcp-setup.md) |
| 35 | +**When:** ALWAYS load before starting a diagnostic session |
| 36 | +**Contains:** CloudWatch MCP server configuration for PromQL access |
| 37 | + |
| 38 | +#### [.mcp.json](mcp/.mcp.json) |
| 39 | +**When:** Load when setting up MCP servers for the first time |
| 40 | +**Contains:** Sample MCP configuration for the CloudWatch MCP server |
| 41 | + |
| 42 | +### [wait-events.md](references/wait-events.md) |
| 43 | +**When:** ALWAYS load when interpreting AAS results |
| 44 | +**Contains:** DSQL wait events with canonical descriptions and investigation guidance |
| 45 | + |
| 46 | +### [promql-patterns.md](references/promql-patterns.md) |
| 47 | +**When:** Load when constructing PromQL queries |
| 48 | +**Contains:** Reusable PromQL query templates for all workflows |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## Core Concept: Average Active Sessions (AAS) |
| 53 | + |
| 54 | +The primary metric is `db.active_sessions.avg` — the average number of sessions actively executing or waiting at a given instant. |
| 55 | + |
| 56 | +**Normalized SQL and AAS interpretation:** All SQL in the metric is normalized (parameterized). The `normalized_sql` label groups all executions of the same query shape. A query with high AAS could mean either: |
| 57 | +- A slow query (high per-execution cost) |
| 58 | +- A fast query called at very high frequency (volume accumulates AAS) |
| 59 | + |
| 60 | +This skill cannot distinguish between these — the `dsql` skill can via call counts and per-execution timing. |
| 61 | + |
| 62 | +| Label | Purpose | |
| 63 | +|-------|---------| |
| 64 | +| `wait_event` | Which wait the session is in (OnCpu, ClientRead, Commit, etc.) | |
| 65 | +| `normalized_sql` | SQL fingerprint — groups identical query shapes | |
| 66 | +| `query_id` | Correlates with DSQL `EXPLAIN` Query Identifier | |
| 67 | +| `application_name` | Client application identifier | |
| 68 | +| `iam_role_arn` | IAM role used for the connection | |
| 69 | +| `session_state` | Session state (active) | |
| 70 | +| `@resource.aws.auroradsql.cluster_id` | Cluster identifier for filtering | |
| 71 | +| `@resource.cloud.resource_id` | Full cluster ARN | |
| 72 | + |
| 73 | +--- |
| 74 | + |
| 75 | +## Workflow 1: Quick Health Check |
| 76 | + |
| 77 | +**Goal:** Detect whether the cluster's wait event distribution has changed compared to historical baselines. |
| 78 | + |
| 79 | +**Steps:** |
| 80 | +1. Confirm you have a specific `cluster_id` — do not proceed without one |
| 81 | +2. Query AAS by `wait_event` for the **current hour** in 10-minute chunks (step=60s) — compare the chunks against each other to detect recent shifts |
| 82 | +3. Query AAS by `wait_event` for the **same hour yesterday** (baseline 1) |
| 83 | +4. Query AAS by `wait_event` for the **same hour last week** (baseline 2) |
| 84 | +5. Compute the distribution (% each wait event contributes to total AAS) for each period |
| 85 | +6. Flag any wait event where the proportion changed by >30% vs either baseline |
| 86 | +7. Load [wait-events.md](references/wait-events.md) and interpret flagged changes |
| 87 | + |
| 88 | +**Critical rules:** |
| 89 | +- **MUST** have a specific `cluster_id` before proceeding |
| 90 | +- **MUST** filter by cluster using `"@resource.aws.auroradsql.cluster_id"` |
| 91 | +- **MUST** compare against temporal baselines — do NOT report absolute AAS values as inherently problematic |
| 92 | +- **MUST** split the current hour into 10-minute chunks and compare them to detect intra-hour shifts |
| 93 | +- A >30% change in a wait event's share of total AAS warrants flagging to the user |
| 94 | +- If total AAS increased but distribution is unchanged, this may be legitimate load growth — report but do not alarm |
| 95 | + |
| 96 | +**Example:** |
| 97 | +```promql |
| 98 | +# Current hour (split into chunks for intra-hour comparison); NOW-1h and similar values are shorthand for RFC 3339 timestamps |
| 99 | +execute_promql_range_query( |
| 100 | + query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})', |
| 101 | + start="NOW-1h", end="NOW", step="60s" |
| 102 | +) |
| 103 | + |
| 104 | +# Same hour yesterday |
| 105 | +execute_promql_range_query( |
| 106 | + query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})', |
| 107 | + start="NOW-25h", end="NOW-24h", step="60s" |
| 108 | +) |
| 109 | + |
| 110 | +# Same hour last week |
| 111 | +execute_promql_range_query( |
| 112 | + query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})', |
| 113 | + start="NOW-169h", end="NOW-168h", step="60s" |
| 114 | +) |
| 115 | +``` |
| 116 | + |
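Steps 5 and 6 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the skill: the function names `distribution` and `flag_shifts` are hypothetical, the inputs are assumed to be mappings from wait event to mean AAS for one period (already extracted from the range query results), and ">30%" is read here as a relative change in an event's share, which the workflow text does not specify.

```python
def distribution(aas_by_event):
    """Convert mean AAS per wait event into each event's share of total AAS."""
    total = sum(aas_by_event.values())
    if total == 0:
        return {event: 0.0 for event in aas_by_event}
    return {event: value / total for event, value in aas_by_event.items()}


def flag_shifts(current, baseline, threshold=0.30):
    """Return {event: (baseline_share, current_share)} for events whose share
    changed by more than `threshold` relative to the baseline share.
    The relative interpretation of ">30%" is an assumption."""
    cur, base = distribution(current), distribution(baseline)
    flagged = {}
    for event in sorted(set(cur) | set(base)):
        c, b = cur.get(event, 0.0), base.get(event, 0.0)
        if b == 0.0:
            # Event absent from the baseline: any presence now counts as a shift.
            if c > 0.0:
                flagged[event] = (b, c)
        elif abs(c - b) / b > threshold:
            flagged[event] = (b, c)
    return flagged


# Illustrative numbers only, not real cluster data.
yesterday = {"OnCpu": 6.0, "Commit": 3.0, "ClientRead": 1.0}
doubled = {"OnCpu": 12.0, "Commit": 6.0, "ClientRead": 2.0}  # same mix, more load
shifted = {"OnCpu": 3.0, "Commit": 6.0, "ClientRead": 1.0}   # Commit overtook OnCpu

print(flag_shifts(doubled, yesterday))
print(sorted(flag_shifts(shifted, yesterday)))
```

With these sample inputs, the doubled load produces no flags because the distribution is unchanged, while the swapped mix flags `OnCpu` and `Commit`. This mirrors the rule that load growth alone is reported but not treated as an alarm.
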
| 117 | +--- |
| 118 | + |
| 119 | +## Workflow 2: Top-SQL Regression Detection |
| 120 | + |
| 121 | +**Goal:** Identify SQL statements that have become more prominent compared to baseline periods. |
| 122 | + |
| 123 | +**Steps:** |
| 124 | +1. Query top-N SQL by AAS for the current period |
| 125 | +2. Query top-N SQL for the same period yesterday and last week |
| 126 | +3. Identify queries that are **new** in the top-N or have **grown** significantly vs baseline |
| 127 | +4. Note which `wait_event` dominates for the regressed queries |
| 128 | + |
| 129 | +**Critical rules:** |
| 130 | +- **MUST** include `query_id` in grouping — stable identifier for handoff to `dsql` skill |
| 131 | +- **MUST** compare top-N across periods — a query being #1 is only notable if it wasn't before |
| 132 | +- **MUST NOT** recommend indexing or schema changes — defer to `dsql` skill |
| 133 | + |
| 134 | +**Example:** |
| 135 | +```promql |
| 136 | +# Top 5 SQL for current period |
| 137 | +execute_promql_query(query='topk(5, sum by (normalized_sql, query_id)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))') |
| 138 | + |
| 139 | +# Top 10 (SQL, wait event) combinations for wait event context |
| 140 | +execute_promql_query(query='topk(10, sum by (normalized_sql, query_id, wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))') |
| 141 | +``` |
| 142 | + |
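The cross-period comparison in step 3 can be sketched as below. This is only an illustration: `compare_top_sql` is a hypothetical helper, the inputs are assumed to be mappings from `(query_id, normalized_sql)` to AAS built from the `topk` results for each period, and the `growth_factor` of 2.0 is an arbitrary placeholder because the workflow does not define what "significantly" means.

```python
def compare_top_sql(current, baseline, growth_factor=2.0):
    """Classify queries in the current top-N as new or grown versus a baseline.

    Both arguments map (query_id, normalized_sql) -> AAS for one period.
    `growth_factor` is a hypothetical threshold, not one defined by the skill.
    """
    new, grown = [], []
    for key, aas in current.items():
        if key not in baseline:
            new.append(key)
        elif aas >= growth_factor * baseline[key]:
            grown.append((key, baseline[key], aas))
    return new, grown


# Illustrative data only.
q1 = ("1001", "SELECT * FROM orders WHERE id = $1")
q2 = ("1002", "UPDATE accounts SET balance = $1 WHERE id = $2")
q3 = ("1003", "SELECT count(*) FROM events")

current_top = {q1: 4.0, q2: 1.0, q3: 0.5}
yesterday_top = {q1: 1.5, q2: 0.9}

new_queries, grown_queries = compare_top_sql(current_top, yesterday_top)
print(new_queries)
print(grown_queries)
```

A query that is "new" by this test may simply have sat below the top-N cutoff in the baseline period, so it is worth re-querying the baseline for that specific `query_id` before describing it as new in a handoff.
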
| 143 | +--- |
| 144 | + |
| 145 | +## Workflow 3: Triage Decision Tree |
| 146 | + |
| 147 | +**Goal:** Guide investigation based on which wait event has shifted. |
| 148 | + |
| 149 | +**Steps:** |
| 150 | +1. Run Workflow 1 to identify which wait event(s) have changed |
| 151 | +2. Branch on the wait event showing the largest shift: |
| 152 | + |
| 153 | +| Changed Wait Event | Investigation | |
| 154 | +|---|---| |
| 155 | +| **OnCpu** | Identify which queries grew via Workflow 2. Hand off to `dsql` skill. | |
| 156 | +| **ClientRead** | Identify the top IAM role and application via Workflow 4. Indicates growth in idle-in-transaction sessions, which is a client-side issue. | |
| 157 | +| **ClientWrite** | Identify the top application via Workflow 4. The client is slow to consume results, so check client health. | |
| 158 | +| **SequentialScanRead** | Identify query via Workflow 2. Hand off to `dsql` skill — may be plan regression or missing index. | |
| 159 | +| **ScatteredBatchRead** | Identify query. Hand off to `dsql` skill. | |
| 160 | +| **SingleRead** | Identify query. Hand off to `dsql` skill. | |
| 161 | +| **FkExistenceCheck** | Identify query. Check whether insert volume increased. | |
| 162 | +| **UniqueConstraintCheck** | Identify query. Check whether insert/upsert patterns changed. | |
| 163 | +| **Commit** | Run Workflow 6 (Commit Analysis) to distinguish volume increase from OCC conflicts. | |
| 164 | +| **PgSleep** | Identify application. Verify intentional — may indicate new polling behavior. | |
| 165 | + |
| 166 | +3. Load [wait-events.md](references/wait-events.md) for detailed investigation guidance |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## Workflow 4: IAM Role and Application Attribution |
| 171 | + |
| 172 | +**Goal:** Identify which roles or applications are driving anomalous changes. |
| 173 | + |
| 174 | +**Critical rules:** |
| 175 | +- **MUST** compare against baseline — an application being dominant is only noteworthy if it has changed |
| 176 | +- Report the delta: "application X increased from 30% to 55% of total AAS" |
| 177 | + |
| 178 | +**Example:** |
| 179 | +```promql |
| 180 | +# Top IAM roles |
| 181 | +execute_promql_query(query='topk(5, sum by (iam_role_arn)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))') |
| 182 | + |
| 183 | +# Top applications |
| 184 | +execute_promql_query(query='topk(5, sum by (application_name)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"}))') |
| 185 | +``` |
| 186 | + |
| 187 | +--- |
| 188 | + |
| 189 | +## Workflow 5: Time-Series Investigation |
| 190 | + |
| 191 | +**Goal:** Identify when a change occurred and pinpoint the inflection point. |
| 192 | + |
| 193 | +**Critical rules:** |
| 194 | +- **SHOULD** use step: 60s (range ≤ 1h), 300s (1–6h), 900s (6–24h), 3600s (> 24h) |
| 195 | +- **MUST** specify `start` and `end` in RFC 3339 format |
| 196 | +- Maximum range per query is 7 days — split longer investigations into multiple queries |
| 197 | +- Look for: inflection points, step-changes in specific wait events, distribution shifts |
| 198 | + |
| 199 | +**Example:** |
| 200 | +```promql |
| 201 | +execute_promql_range_query( |
| 202 | + query='sum by (wait_event)({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})', |
| 203 | + start="2024-01-15T10:00:00Z", |
| 204 | + end="2024-01-15T16:00:00Z", |
| 205 | + step="300s" |
| 206 | +) |
| 207 | +``` |
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## Workflow 6: Commit Analysis |
| 212 | + |
| 213 | +**Goal:** Distinguish between increased commit volume (legitimate) and OCC conflict growth (problematic). |
| 214 | + |
| 215 | +**Steps:** |
| 216 | +1. Confirm Commit wait has shifted (from Workflow 1 comparison) |
| 217 | +2. Query standard CloudWatch metrics for the same period: |
| 218 | + - `AuroraDSQL` namespace, dimension `ClusterId` |
| 219 | + - Metric: `TotalTransactions` — rate of committed transactions |
| 220 | + - Metric: `OccConflicts` — rate of optimistic concurrency conflicts |
| 221 | +3. Compare the ratios: |
| 222 | + - If TotalTransactions increased proportionally to Commit AAS → legitimate load growth |
| 223 | + - If OccConflicts increased disproportionately → write-write conflict problem |
| 224 | + - If Commit AAS increased but TotalTransactions did not → transactions are taking longer to commit |
| 225 | + |
| 226 | +**Example (standard CW metrics):** |
| 227 | +``` |
| 228 | +get_metric_data( |
| 229 | + namespace="AuroraDSQL", |
| 230 | + metric_name="TotalTransactions", |
| 231 | + dimensions=[{name: "ClusterId", value: "CLUSTER_ID"}], |
| 232 | + statistic="Sum" |
| 233 | +) |
| 234 | + |
| 235 | +get_metric_data( |
| 236 | + namespace="AuroraDSQL", |
| 237 | + metric_name="OccConflicts", |
| 238 | + dimensions=[{name: "ClusterId", value: "CLUSTER_ID"}], |
| 239 | + statistic="Sum" |
| 240 | +) |
| 241 | +``` |
| 242 | + |
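The three-way comparison in step 3 can be expressed as a small classifier, sketched below. Everything here is illustrative: `growth` and `classify_commit_shift` are hypothetical names, each input is assumed to be a `(baseline, current)` pair for the same time window, the 25% `tolerance` is an arbitrary placeholder, and checking OCC conflicts first is one possible ordering since the three conditions are not mutually exclusive.

```python
def growth(pair):
    """Return current / baseline for a (baseline, current) pair."""
    baseline, current = pair
    if baseline == 0:
        return float("inf") if current > 0 else 1.0
    return current / baseline


def classify_commit_shift(commit_aas, total_txn, occ_conflicts, tolerance=0.25):
    """Label a Commit wait shift using the growth of three signals.

    `tolerance` is a hypothetical slack for "proportional"; tune it per workload.
    """
    aas_g = growth(commit_aas)
    txn_g = growth(total_txn)
    occ_g = growth(occ_conflicts)
    if occ_g > txn_g * (1 + tolerance):
        return "occ_conflict_growth"   # conflicts grew faster than transactions
    if txn_g >= aas_g * (1 - tolerance):
        return "load_growth"           # transactions kept pace with Commit AAS
    return "slower_commits"            # Commit AAS grew without more transactions


# Illustrative numbers only: (baseline, current) pairs.
print(classify_commit_shift((2.0, 4.0), (1000, 2000), (10, 20)))
print(classify_commit_shift((2.0, 4.0), (1000, 1100), (10, 80)))
print(classify_commit_shift((2.0, 4.0), (1000, 1000), (10, 10)))
```

The labels are only a triage aid. Per the handoff rules, the skill should still describe the observed change rather than assert a cause.
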
| 243 | +--- |
| 244 | + |
| 245 | +## Idle Cluster Detection |
| 246 | + |
| 247 | +A cluster is idle when there is no AAS data for a period. Use a range query and look for gaps (missing timestamps) in the time series. |
| 248 | + |
| 249 | +**Pattern: Sporadic workload** — periods of no data interspersed with periods of AAS > 0 indicate a cluster performing scheduled or batch work. |
| 250 | + |
| 251 | +```promql |
| 252 | +execute_promql_range_query( |
| 253 | + query='sum({__name__="db.active_sessions.avg", "@resource.aws.auroradsql.cluster_id"="CLUSTER_ID"})', |
| 254 | + start="NOW-24h", end="NOW", step="300s" |
| 255 | +) |
| 256 | +``` |
| 257 | + |
| 258 | +--- |
| 259 | + |
| 260 | +## DSQL Skill Handoff |
| 261 | + |
| 262 | +When investigation identifies a query that has become more prominent, hand off to the `dsql` skill for live database analysis. Do not provide specific diagnostics or recommendations — simply describe the observed anomaly. |
| 263 | + |
| 264 | +### Automatic Handoff |
| 265 | + |
| 266 | +Before handing off, verify the `dsql` skill's MCP server is configured for the cluster under investigation: |
| 267 | + |
| 268 | +1. Check if the `aurora-dsql` MCP server is configured (look for it in the active MCP servers) |
| 269 | +2. Verify the `--cluster_endpoint` in its args matches the cluster being investigated |
| 270 | +3. If not configured or pointing to a different cluster, prompt the user: |
| 271 | + > "The Aurora DSQL MCP server needs to be configured for cluster `{CLUSTER_ID}` to proceed with live database investigation. Please add or update the `aurora-dsql` server in your MCP configuration with `--cluster_endpoint {CLUSTER_ENDPOINT}`." |
| 272 | + |
| 273 | +### Handoff Format |
| 274 | + |
| 275 | +When handing off, describe only the observed anomaly — do not suggest causes or fixes: |
| 276 | + |
| 277 | +> "Query `{NORMALIZED_SQL}` (query_id: `{QUERY_ID}`) is using significantly more system time than it did {TIMEFRAME} ago. Its share of cluster AAS on `{WAIT_EVENT}` has grown from {OLD}% to {NEW}%. Please diagnose what is happening with this query." |
| 278 | + |
| 279 | +### When to Hand Off |
| 280 | + |
| 281 | +- Any query identified in Workflow 2 as newly prominent or significantly grown |
| 282 | +- SequentialScanRead, OnCpu, ScatteredBatchRead, or SingleRead shifts where a specific query is responsible |
| 283 | +- OCC conflict growth confirmed in Workflow 6 |
| 284 | + |
| 285 | +--- |
| 286 | + |
| 287 | +## Error Handling |
| 288 | + |
| 289 | +| Situation | Action | |
| 290 | +|-----------|--------| |
| 291 | +| No cluster_id provided | Ask the user — never proceed without a specific cluster | |
| 292 | +| No series found | Verify cluster ID with `get_promql_label_values`. Check region. | |
| 293 | +| Empty result (no data) | Cluster is idle for that period. Widen time window. | |
| 294 | +| `query_id` missing | Not all queries emit it. Filter by `normalized_sql` instead. | |
| 295 | +| PromQL timeout | Reduce cardinality — fewer labels or shorter time range. | |
| 296 | +| Range > 7 days | Split into multiple 7-day range queries. | |
| 297 | +| `dsql` skill not available | Prompt user to install from `awslabs/agent-plugins` (plugin: `databases-on-aws`) | |