20 changes: 20 additions & 0 deletions dns-monitoring/datadog-monitor-dns-down-critical.json
@@ -0,0 +1,20 @@
{
  "name": "DNS Server Down - CRITICAL - PageDuty",

⚠️ Potential issue | 🟡 Minor

Typo in monitor name: "PageDuty" → "PagerDuty".

This appears in the Datadog UI and in PagerDuty alert titles.

🐛 Proposed fix
-  "name": "DNS Server Down - CRITICAL - PageDuty",
+  "name": "DNS Server Down - CRITICAL - PagerDuty",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"name": "DNS Server Down - CRITICAL - PageDuty",
"name": "DNS Server Down - CRITICAL - PagerDuty",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-down-critical.json` at line 2, Fix the
typo in the Datadog monitor name by changing the "name" field value that
currently reads "DNS Server Down - CRITICAL - PageDuty" to "DNS Server Down -
CRITICAL - PagerDuty"; locate the JSON "name" key in
datadog-monitor-dns-down-critical.json and update the string so the UI and alert
titles use the correct "PagerDuty" spelling.

  "type": "query alert",
  "query": "max(last_15m):max:pihole.dns_up{*} by {server} < 1",
  "message": "CRITICAL: DNS server {{server.name}} has been down for 15 minutes. Triggering PagerDuty.\n\n@pagerduty",
  "tags": ["dns", "pihole", "critical"],
  "options": {
    "notify_audit": false,
    "locked": false,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": 15,
    "evaluation_delay": 30,
    "new_group_delay": 30,
    "include_tags": false,
    "thresholds": {
      "critical": 1
    }
  }
}
21 changes: 21 additions & 0 deletions dns-monitoring/datadog-monitor-dns-down-warning.json
@@ -0,0 +1,21 @@
{
  "name": "DNS Server Down - WARNING",
  "type": "query alert",
  "query": "max(last_5m):max:pihole.dns_up{*} by {server} < 1",
  "message": "DNS server {{server.name}} has been down for 5 minutes. \n\n@slack-ops",
  "tags": ["dns", "pihole", "warning"],
  "options": {
    "notify_audit": false,
    "locked": false,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": 30,
    "evaluation_delay": 30,
    "new_group_delay": 30,
    "include_tags": false,
    "thresholds": {
      "critical": 1,
Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 23, 2026


P2: warning and critical thresholds are identical, so this monitor will always go critical and never emit a warning state. Either remove the warning threshold or adjust the monitor to reflect critical severity to avoid misleading alert severity.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dns-monitoring/datadog-monitor-dns-down-warning.json, line 17:

<comment>`warning` and `critical` thresholds are identical, so this monitor will always go critical and never emit a warning state. Either remove the warning threshold or adjust the monitor to reflect critical severity to avoid misleading alert severity.</comment>

<file context>
@@ -0,0 +1,21 @@
+    "new_group_delay": 30,
+    "include_tags": false,
+    "thresholds": {
+      "critical": 1,
+      "warning": 1
+    }
</file context>

      "warning": 1
    }
Comment on lines +16 to +19

⚠️ Potential issue | 🟠 Major

Warning and critical thresholds are identical — the WARNING state is unreachable.

There are two threshold types for metric alerts: critical and warning. Critical is defined in the query, but can also be specified in this option. Warning threshold can only be specified using the thresholds option. When both are 1 on a < 1 query, the monitor jumps straight from OK to CRITICAL; the WARNING bucket is never entered. If the intent is to fire a Slack warning on a shorter window (5 min here vs. 15 min on the critical monitor), remove the warning key from thresholds and let the monitor use only the critical threshold:

🐛 Proposed fix
     "thresholds": {
-      "critical": 1,
-      "warning": 1
+      "critical": 1
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-down-warning.json` around lines 16 - 19,
The monitor's thresholds object currently sets "critical" and "warning" both to
1, making the WARNING state unreachable; remove the "warning" key from the
"thresholds" object so the monitor relies solely on the critical threshold
defined in the query (leave "critical": 1 if needed) — update the JSON where the
"thresholds" object appears to delete the "warning" entry and keep only the
"critical" entry to allow the shorter-window Slack warning monitor to behave as
intended.

  }
}
21 changes: 21 additions & 0 deletions dns-monitoring/datadog-monitor-dns-external-latency.json
@@ -0,0 +1,21 @@
{
  "name": "DNS External Latency High",
  "type": "query alert",
  "query": "max(last_5m):max:dns_external_latency{*} > 200",
Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 23, 2026


P2: Query threshold (200) conflicts with critical threshold (500). Datadog monitors require the query threshold to match the critical threshold; otherwise the monitor definition is invalid or behaves unexpectedly. Align the query threshold with the critical value.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dns-monitoring/datadog-monitor-dns-external-latency.json, line 4:

<comment>Query threshold (200) conflicts with critical threshold (500). Datadog monitors require the query threshold to match the critical threshold; otherwise the monitor definition is invalid or behaves unexpectedly. Align the query threshold with the critical value.</comment>

<file context>
@@ -0,0 +1,21 @@
+{
+  "name": "DNS External Latency High",
+  "type": "query alert",
+  "query": "max(last_5m):max:dns_external_latency{*} > 200",
+  "message": "External DNS latency is above 200ms. Check internet connectivity or upstream DNS.\n\n@slack-ops",
+  "tags": ["dns", "external", "latency"],
</file context>


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "datadog-monitor-dns-external-latency.json" -type f

Repository: NickJLange/mkrasberry

Length of output: 123


🏁 Script executed:

# Once file is located, read its full content to check thresholds
cat -n dns-monitoring/datadog-monitor-dns-external-latency.json


Length of output: 804


🏁 Script executed:

# Also check the internal latency monitor mentioned for comparison
find . -name "datadog-monitor-dns-latency*.json" -type f | head -5


Length of output: 122


🏁 Script executed:

cat -n dns-monitoring/datadog-monitor-dns-latency-warning.json


Length of output: 783


🌐 Web query:

Datadog monitor API query threshold critical threshold validation requirement

💡 Result:

When creating/updating a Datadog query-based monitor via the Monitor API, you must define the threshold twice:

  1. In the query string, as part of the comparison, e.g. ... > 200 (the query format is ... operator #). [1]
  2. In options.thresholds.critical, and it must match the numeric value used in the query, otherwise /api/v1/monitor/validate (and monitor create/update) can fail with a validation error like “Alert threshold … does not match that used in the query …”. [1][4]

Notes:

  • warning (and recovery thresholds like critical_recovery, warning_recovery) are optional extras, but critical is the key threshold used for the main alert condition. [1][2][3]
  • For anomaly monitors, Datadog explicitly documents that the critical threshold in options.thresholds should match the threshold used in the query. [2]

Sources: Datadog Monitor API docs (query format + validate examples + thresholds object) [1], Datadog anomaly monitor threshold requirement [2], Datadog recovery thresholds via API [3], community-reported validation error when query/thresholds differ [4].


Query threshold must match the critical threshold for Datadog API validation.

The Datadog Monitor API requires the threshold value in the query string to exactly match options.thresholds.critical, or the monitor creation/update will fail with a validation error. Currently, the query uses > 200 (matching the warning threshold) while thresholds.critical is set to 500.

🐛 Proposed fix
-  "query": "max(last_5m):max:dns_external_latency{*} > 200",
+  "query": "max(last_5m):max:dns_external_latency{*} > 500",

The same issue exists at lines 16–19 in datadog-monitor-dns-latency-warning.json where the query uses > 100 but the critical threshold is 500.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-external-latency.json` at line 4, The
monitor query's threshold (> 200) does not match the configured critical
threshold (options.thresholds.critical = 500), causing Datadog API validation to
fail; update the "query" field in datadog-monitor-dns-external-latency.json to
use > 500 so it exactly matches options.thresholds.critical, and make the
analogous change in datadog-monitor-dns-latency-warning.json (change > 100 to >
500) so each "query" string matches its respective options.thresholds.critical
value.

  "message": "External DNS latency is above 200ms. Check internet connectivity or upstream DNS.\n\n@slack-ops",
  "tags": ["dns", "external", "latency"],
  "options": {
    "notify_audit": false,
    "locked": false,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": 60,
    "evaluation_delay": 30,
    "new_group_delay": 30,
    "include_tags": false,
    "thresholds": {
      "critical": 500,
      "warning": 200
    }
  }
}
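The validation rule discussed in the comments above — the numeric threshold at the end of the query string must equal `options.thresholds.critical` — can be checked locally before submitting a monitor to the API. A minimal sketch (the `check_thresholds` helper is hypothetical, not part of any Datadog client):

```python
import json
import re

def check_thresholds(monitor: dict) -> bool:
    """Return True if the numeric threshold at the end of the query
    matches options.thresholds.critical (the Datadog validation rule)."""
    match = re.search(r"[<>]=?\s*([0-9.]+)\s*$", monitor["query"])
    if not match:
        return False
    query_threshold = float(match.group(1))
    critical = float(monitor["options"]["thresholds"]["critical"])
    return query_threshold == critical

# The external-latency monitor from this PR: the query fires at > 200
# but thresholds.critical is 500, so this check should fail.
monitor = json.loads("""{
  "query": "max(last_5m):max:dns_external_latency{*} > 200",
  "options": {"thresholds": {"critical": 500, "warning": 200}}
}""")
print(check_thresholds(monitor))  # → False
```

Running this over each JSON file in `dns-monitoring/` before calling the API would catch both mismatches flagged in this review.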
21 changes: 21 additions & 0 deletions dns-monitoring/datadog-monitor-dns-latency-warning.json
@@ -0,0 +1,21 @@
{
  "name": "DNS Latency High - WARNING",
  "type": "query alert",
  "query": "max(last_5m):max:dns_internal_latency{*} > 100",
Contributor

@cubic-dev-ai cubic-dev-ai bot Feb 23, 2026


P2: The query threshold should match the critical threshold (critical is defined in the query for metric alerts). With > 100 in the query but critical: 500, the monitor will alert at 100 or fail validation. Update the query to use the critical threshold and keep warning in thresholds.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dns-monitoring/datadog-monitor-dns-latency-warning.json, line 4:

<comment>The query threshold should match the critical threshold (critical is defined in the query for metric alerts). With `> 100` in the query but `critical: 500`, the monitor will alert at 100 or fail validation. Update the query to use the critical threshold and keep warning in `thresholds`.</comment>

<file context>
@@ -0,0 +1,21 @@
+{
+  "name": "DNS Latency High - WARNING",
+  "type": "query alert",
+  "query": "max(last_5m):max:dns_internal_latency{*} > 100",
+  "message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",
+  "tags": ["dns", "latency", "warning"],
</file context>


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n dns-monitoring/datadog-monitor-dns-latency-warning.json


Length of output: 783


🌐 Web query:

Datadog API alert threshold query parameter validation HTTP 400

💡 Result:

An HTTP 400 from the Datadog Monitors API during “alert threshold … validation” is most commonly caused by a mismatch between the threshold embedded in the query string and the thresholds you send in options.thresholds.

What to check / fix

  1. Your query must include operator #
    Datadog monitor queries are expected to end with an operator and a numeric threshold (for example ... > 5). The API docs explicitly describe this query format. [1]

  2. For anomaly monitors, the threshold in the query must match options.thresholds.critical
    Datadog documents that for anomaly monitors the critical threshold should match the threshold used in the query (and related window fields must match too—see #3). If they differ, validation can fail with 400. [2]
    A commonly seen error is: “Alert threshold (n) does not match that used in the query (n2).” [3]

  3. For anomaly monitors, also match the alert window fields
    If you’re creating an anomaly monitor:

    • the query’s alert_window='...' must match options.threshold_windows.trigger_window
    • you must set options.threshold_windows correctly (trigger + recovery)
      Datadog calls this out in the anomaly monitor API docs. [2]
  4. Use the “validate monitor” endpoint to pinpoint the exact validation failure
    Before creating/updating, call Datadog’s validate monitor endpoint with the same body; it’s designed to return validation errors for the query/options you’re sending. [1]

Sources: Datadog Monitors API reference (query format + validate endpoint) [1], Datadog anomaly monitor API requirements (threshold + window matching) [2], example of the threshold-mismatch 400 error message in the wild [3].


🏁 Script executed:

fd -t f "datadog-monitor.*\.json" dns-monitoring/ | head -20


Length of output: 284


🏁 Script executed:

git ls-files dns-monitoring/ | grep -i monitor


Length of output: 284


🏁 Script executed:

cat -n dns-monitoring/datadog-monitor-dns-down-critical.json


Length of output: 795


🏁 Script executed:

cat -n dns-monitoring/datadog-monitor-dns-down-warning.json


Length of output: 776


🏁 Script executed:

cat -n dns-monitoring/datadog-monitor-dns-external-latency.json


Length of output: 804


Query threshold must match thresholds.critical or the Datadog API will reject this monitor with HTTP 400.

The critical threshold value must match the threshold in the query string; a mismatch causes the Datadog API to return HTTP 400 with "The value provided for parameter 'query' is invalid".

Here the query fires at > 100 but thresholds.critical is 500. Since this is a WARNING alert and the intent is to alert at 100 ms, change thresholds.critical to 100:

🐛 Proposed fix
    "thresholds": {
      "critical": 100,

Note: The same threshold mismatch exists in datadog-monitor-dns-external-latency.json (query > 200, critical 500).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-latency-warning.json` at line 4, The
monitor's query uses a 100ms threshold ("query":
"max(last_5m):max:dns_internal_latency{*} > 100") but the monitor payload's
thresholds.critical is set to 500; update thresholds.critical to 100 so the
query and thresholds.critical match for datadog-monitor-dns-latency-warning.json
(and likewise change thresholds.critical to 200 in
datadog-monitor-dns-external-latency.json to match its query > 200); edit the
JSON key "thresholds" -> "critical" in the monitor objects to the matching
numeric values.

  "message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",
Comment on lines +4 to +5

⚠️ Potential issue | 🟠 Major

{{server.name}} won't resolve — query has no by {server} grouping.

If you configure tags or dimensions in your query, these values are available for every group evaluated in the multi alert to dynamically fill in notifications with useful context. Without by {server} in the query the template variable renders empty. Add the grouping to match the pattern used in the other monitors:

🐛 Proposed fix
-  "query": "max(last_5m):max:dns_internal_latency{*} > 500",
+  "query": "max(last_5m):max:dns_internal_latency{*} by {server} > 500",
📝 Committable suggestion


Suggested change
"query": "max(last_5m):max:dns_internal_latency{*} > 100",
"message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",
"query": "max(last_5m):max:dns_internal_latency{*} by {server} > 100",
"message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-latency-warning.json` around lines 4 - 5,
The monitor query uses a template variable {{server.name}} but the query
"max(last_5m):max:dns_internal_latency{*} > 100" lacks the required grouping, so
add the grouping to the query (e.g., include "by {server}" in the aggregation)
so the multi-alert will evaluate per-server and the {{server.name}} template in
the message can resolve; update the "query" string accordingly to match the
pattern used in other monitors.

  "tags": ["dns", "latency", "warning"],
  "options": {
    "notify_audit": false,
    "locked": false,
    "require_full_window": true,
    "notify_no_data": false,
    "renotify_interval": 60,
    "evaluation_delay": 30,
    "new_group_delay": 30,
    "include_tags": false,
    "thresholds": {
      "critical": 500,
      "warning": 100
    }
  }
}
2 changes: 2 additions & 0 deletions openspec/changes/dns-keepalived-analysis/.openspec.yaml
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-02-21

⚠️ Potential issue | 🟡 Minor

Consider quoting the created date to ensure consistent string parsing.

YAML 1.1 (used by PyYAML's default Loader, for example) treats bare YYYY-MM-DD values as native date objects rather than strings. If the consuming tooling reads this field and expects a string, it may encounter a type mismatch or serialization quirk. Quoting the value avoids ambiguity across parsers.

✏️ Suggested fix
-created: 2026-02-21
+created: "2026-02-21"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/.openspec.yaml` at line 2, The YAML
`created` value is unquoted and may be parsed as a date by some YAML loaders;
update the `.openspec.yaml` entry so the `created` field is a quoted string
(e.g., change created: 2026-02-21 to created: "2026-02-21") to ensure consistent
string parsing across YAML parsers and consumers referencing the `created` key.
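The parsing difference behind this finding is easy to demonstrate; a minimal sketch, assuming PyYAML is installed:

```python
import datetime
import yaml  # PyYAML; assumed available

# A bare YYYY-MM-DD scalar is resolved to a native date by YAML 1.1 loaders.
bare = yaml.safe_load("created: 2026-02-21")
print(type(bare["created"]))    # <class 'datetime.date'>

# Quoting the value forces a plain string.
quoted = yaml.safe_load('created: "2026-02-21"')
print(type(quoted["created"]))  # <class 'str'>
```

Whether this matters depends on whether the OpenSpec tooling expects a string or a date for `created`.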

187 changes: 187 additions & 0 deletions openspec/changes/dns-keepalived-analysis/design.md
@@ -0,0 +1,187 @@
## Context

### Current DNS Infrastructure

**Wisconsin (marsAlpha, marsBeta)**
- Network: 192.168.20.0/24
- IPs: 192.168.20.250, 192.168.20.251

**New York (terraOmega, terraPhi)**
- Network: 192.168.100.0/24
- IPs: 192.168.100.112, 192.168.100.110

**Miyagi (alphaCenA, alphaCenB)**
- Network: 192.168.3.0/24
- IPs: 192.168.3.100, 192.168.3.101

Each server runs:
- Pi-hole container (DNS sinkhole on port 53)
- Unbound (upstream resolver on port 5335)

**Current client DNS configuration:**
- Clients point to individual DNS servers or round-robin via DHCP
- No automatic failover; if a server goes down, clients fail until manual intervention

### Constraints
- Want to minimize changes
- Brief downtime is acceptable when I'm at the keyboard
- Prefer open-source, simple solutions
- Current monitoring and restart playbook should continue to work

## Goals / Non-Goals

**Goals:**
- Understand network changes required for DNS HA
- Document failover behavior and timing
- Identify common pitfalls and how to avoid them
- Create implementation roadmap

**Non-Goals:**
- Do NOT implement HA (this is analysis only)
- Do NOT change client configurations unless necessary
- Do NOT modify existing monitoring/alerting

## Decisions

### D1: HA Approach Options

**Option A: Keepalived/VRRP (Recommended for HA)**

| Aspect | Details |
|--------|---------|
| How it works | Virtual IP (VIP) floats between servers using VRRP protocol |
| Failover | Automatic, typically 3-5 seconds |
| Network changes | Need to allow VRRP multicast, may need switch config |
| Complexity | Medium - keepalived daemon on each DNS server |
| Pros | True HA, automatic failover, widely used |
| Cons | Requires network coordination, ARP cache issues |

**Option B: HSRP (Cisco proprietary)**

| Aspect | Details |
|--------|---------|
| How it works | Cisco's proprietary failover protocol |
| Failover | Automatic |
| Network changes | Requires Cisco equipment |
| Complexity | Medium |
| Pros | Integrated with Cisco gear |
| Cons | Proprietary, not applicable to home network |

**Option C: DNS Round-Robin (Simpler, no HA)**

| Aspect | Details |
|--------|---------|
| How it works | Multiple A records for same hostname |
| Failover | None - clients get all IPs, try sequentially |
| Network changes | None |
| Complexity | Low |
| Pros | Simple, no daemon, no network changes |
| Cons | No automatic failover, clients may cache failed IP |
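As a concrete illustration of Option C, round-robin is nothing more than multiple A records for one name; a sketch in BIND zone-file syntax (the hostname `dns.home.lan` is illustrative, not from this repo):

```text
; both Wisconsin resolvers answer for the same name (hostname illustrative)
dns.home.lan.  300  IN  A  192.168.20.250
dns.home.lan.  300  IN  A  192.168.20.251
```

Clients receive both addresses and choose one; nothing detects or skips a dead server, which is the "no automatic failover" con above.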

**Option D: Client-side failover (Alternative)**

| Aspect | Details |
|--------|---------|
| How it works | Configure multiple DNS servers in DHCP/client |
| Failover | Client-dependent (some OSes retry automatically) |
| Network changes | None |
| Complexity | Low |
| Pros | No infrastructure changes |
| Cons | Inconsistent behavior across clients |

### D2: Recommended Approach

**Recommendation: Option A (Keepalived) per region**

- Each region (Wisconsin, New York, Miyagi) gets its own VIP
- Pair servers within each region (e.g., marsAlpha ↔ marsBeta)
- VIP becomes the DNS address for that region
- If one server fails, VIP moves to other within seconds

### D3: Network Architecture

**VIP Address Plan (per region):**
- Reserve .2 for DNS VIP in each subnet
- DHCP range: .11 - .254 (leaving .2 for VIP, .10 for infrastructure)

```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

The code fence at this line has no language identifier, triggering a markdownlint MD040 warning. Use text or plain to silence it.

✏️ Proposed fix
-```
+```text
 Wisconsin (192.168.20.0/24):
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/design.md` at line 107, Add a
language specifier to the fenced code block that contains the line "Wisconsin
(192.168.20.0/24):" to silence markdownlint MD040; replace the opening ``` with
```text (or ```plain) so the block starts as ```text and preserves the existing
content, ensuring the fence around "Wisconsin (192.168.20.0/24):" is updated
accordingly.

Wisconsin (192.168.20.0/24):
  marsAlpha (192.168.20.250) ←→ marsBeta (192.168.20.251)
  VIP: 192.168.20.2
  DHCP: 192.168.20.11 - 192.168.20.254

New York (192.168.100.0/24):
  terraOmega (192.168.100.112) ←→ terraPhi (192.168.100.110)
  VIP: 192.168.100.2
  DHCP: 192.168.100.11 - 192.168.100.254

Miyagi (192.168.3.0/24):
  alphaCenA (192.168.3.100) ←→ alphaCenB (192.168.3.101)
  VIP: 192.168.3.2
  DHCP: 192.168.3.11 - 192.168.3.254
```

Clients would use regional VIP (.2) as their DNS server.
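The per-region pairing could look like the following keepalived sketch for the Wisconsin pair; every value here (interface name, router ID, priorities, password) is illustrative, not a tested configuration:

```text
# /etc/keepalived/keepalived.conf on marsAlpha (MASTER; interface name assumed)
vrrp_instance DNS_VIP {
    state MASTER
    interface eth0           # assumed NIC name
    virtual_router_id 53
    priority 150             # backup (marsBeta) would use e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme   # placeholder
    }
    virtual_ipaddress {
        192.168.20.2/24
    }
}
```

marsBeta would carry the same instance with `state BACKUP` and a lower `priority`; New York and Miyagi would repeat the pattern with their own VIPs and router IDs.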

## Risks / Trade-offs

### Network Risks

- **[Risk]** VRRP multicast blocked by switch
- **Mitigation**: Check switch allows VRRP (multicast 224.0.0.18), or configure unicast

- **[Risk]** ARP cache on switches/clients points to wrong server after failover
- **Mitigation**: Set lower `advertisement_intvl` and use `garp_master_delay`

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

keepalived vrrp_instance advert_int advertisement_intvl parameter

💡 Result:

In Keepalived the VRRP advertisement interval for a vrrp_instance is configured with:

  • advert_int <seconds> — how often the MASTER sends VRRP advertisements (can be fractional, e.g. 0.92). All routers participating in the same VRRP instance should use the same value; using different values is considered a protocol deviation. [1]
    Example:
    vrrp_instance VI_1 {
      advert_int 1
    }
    
    [1][2]

There is no advertisement_intvl (or similarly named) vrrp_instance directive in Keepalived; the correct Keepalived parameter is advert_int. [1][2]

Sources:
[1] Keepalived manpage (keepalived.org)
[2] Keepalived configuration synopsis (keepalived.org)


🏁 Script executed:

# First, let's locate and read the file mentioned in the review
find . -name "design.md" -path "*dns-keepalived*" 2>/dev/null | head -5


Length of output: 117


🏁 Script executed:

# Read the file around line 134 to confirm the context
sed -n '130,140p' ./openspec/changes/dns-keepalived-analysis/design.md


Length of output: 651


Correct the keepalived parameter name from advertisement_intvl to advert_int.

The standard keepalived vrrp_instance option for the advertisement interval is advert_int, not advertisement_intvl. Using the wrong name in a configuration would be silently ignored, leaving the interval at its default.

✏️ Proposed fix
-  - **Mitigation**: Set lower `advertisement_intvl` and use `garp_master_delay`
+  - **Mitigation**: Set lower `advert_int` and use `garp_master_delay`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/design.md` at line 134, Replace the
incorrect keepalived option name `advertisement_intvl` with the correct
`advert_int` in the document text and any examples referencing the vrrp_instance
advertisement interval; update the sentence "Set lower `advertisement_intvl` and
use `garp_master_delay`" to "Set lower `advert_int` and use `garp_master_delay`"
and scan the file for any other occurrences of `advertisement_intvl` to correct
to `advert_int`.


- **[Risk]** Both servers claim VIP (split-brain)
- **Mitigation**: Use `priority` to designate master, `preempt` to ensure master always holds VIP

- **[Risk]** Network segmentation blocks VRRP packets
- **Mitigation**: Ensure firewall allows VRRP (protocol 112) between DNS servers
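The firewall mitigation above could be sketched as iptables rules; the peer address is illustrative and the exact form depends on the firewall in use:

```text
# permit VRRP (IP protocol 112) from the peer and the VRRP multicast group
iptables -A INPUT -p 112 -s 192.168.20.251 -j ACCEPT
iptables -A INPUT -p 112 -d 224.0.0.18 -j ACCEPT
```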

### Operational Risks

- **[Risk]** Keepalived misconfiguration causes failover storms
- **Mitigation**: Test failover in maintenance window, set appropriate timeouts

- **[Risk]** Failover doesn't trigger DNS service restart
- **Mitigation**: Monitor both master and backup, verify DNS service on both

- **[Risk]** Upstream Unbound not configured for VIP
  - **Mitigation**: Unbound binds only to localhost by default; add the VIP (or 0.0.0.0) to its `interface:` settings and verify on both servers

### Performance Trade-offs

- **Latency**: Minor increase during failover (seconds)
- **Complexity**: Additional daemon to monitor and maintain
- **Monitoring**: Need to monitor keepalived state, not just DNS

## Implementation Roadmap

### Phase 1: Network Preparation (Research)
- [ ] Verify network equipment allows VRRP
- [ ] Choose VIP addresses for each region
- [ ] Test VRRP between pairs

### Phase 2: Keepalived Configuration
- [ ] Install keepalived on each DNS server
- [ ] Configure master/backup pairs per region
- [ ] Add firewall rules for VRRP

### Phase 3: DNS Service Integration
- [ ] Ensure pihole/unbound binds to VIP
- [ ] Update monitoring to track keepalived state
- [ ] Update restart playbook for HA (restart VIP holder)

### Phase 4: Client Migration
- [ ] Update DHCP to serve regional VIPs
- [ ] Test failover manually
- [ ] Monitor for issues

## Open Questions

1. ~~VIP addresses~~ - **RESOLVED**: Using .2 for each regional subnet
2. ~~Network equipment~~ - **RESOLVED**: Ubiquiti and commodity switches between servers - need VRRP compatibility check
3. ~~Client migration~~ - **RESOLVED**: Migrate clients gradually
4. ~~Cross-region failover~~ - **RESOLVED**: No - too complex
5. ~~Monitoring~~ - **RESOLVED**: Yes - alerts for keepalived state changes
30 changes: 30 additions & 0 deletions openspec/changes/dns-keepalived-analysis/proposal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
## Why

The current DNS infrastructure (Pi-hole + Unbound) has no high availability. If a DNS server fails, clients lose DNS resolution until manual intervention. This is a single point of failure that impacts all services. We need to understand the architectural changes required to add DNS HA using keepalived/HSRP and/or round-robin DNS.

## What Changes

This is a **planning/analysis** change to document the requirements, risks, and architectural impact of adding DNS high availability.

- Document current DNS infrastructure and dependencies
- Analyze keepalived/VRRP/HSRP options for DNS failover
- Analyze DNS round-robin load balancing as an alternative
- Document network changes required (VIP, ARP, etc.)
- Identify common pitfalls and failure modes
- Create implementation roadmap

## Capabilities

### New Capabilities
- `dns-ha-analysis`: Analysis document covering keepalived/HSRP/round-robin options, network requirements, failover behavior, and common pitfalls for DNS HA deployment.

### Modified Capabilities
- (None - this is a new analysis)

## Impact

- **Network**: May require VIP (Virtual IP) configuration, network switch changes for ARP
- **DNS Servers**: Keepalived daemon, additional network config
- **Clients**: May need updated DNS client configurations to use VIP
- **Monitoring**: Update monitoring to track HA state
- **Existing Playbooks**: Restart playbook should continue to work
---
## ADDED Requirements

### Requirement: Network compatibility analysis
The analysis SHALL document whether the existing network infrastructure supports VRRP/keepalived.

#### Scenario: VRRP multicast allowed
- **WHEN** network switches allow VRRP multicast (224.0.0.18)
- **THEN** keepalived can operate in full multicast mode

#### Scenario: VRRP multicast blocked
- **WHEN** network switches block VRRP multicast
- **THEN** keepalived can be configured for unicast mode

#### Scenario: VRRP not supported
- **WHEN** network equipment does not support VRRP
- **THEN** alternative HA solutions (round-robin DNS or client-side) must be used
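When the switches drop the multicast, keepalived's unicast mode pins adverts to the peer's real address. A sketch, with the node addresses as assumed placeholders from the Wisconsin plan (state, priority, and advert settings as in the multicast configuration):

```conf
vrrp_instance dns_vip {
    # state/interface/priority/advert_int unchanged from the multicast config
    unicast_src_ip 192.168.20.11    # this node's real address (assumed)
    unicast_peer {
        192.168.20.12               # the other node's real address (assumed)
    }
}
```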

### Requirement: Failover behavior specification
The analysis SHALL document expected failover timing and behavior.

#### Scenario: Primary server fails
- **WHEN** the primary DNS server becomes unreachable
- **THEN** the VIP should migrate to backup within 5 seconds

#### Scenario: Primary recovers
- **WHEN** the primary DNS server becomes available again
- **THEN** the VIP should migrate back to primary based on preemption policy

#### Scenario: Both servers fail
- **WHEN** all DNS servers in a region are down
- **THEN** clients should time out and fall back to their secondary DNS server (if configured)
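The 5-second target can be checked against the VRRPv2 timers (RFC 3768, which keepalived implements): a backup declares the master dead after three missed adverts plus a priority-dependent skew. A small sketch of the arithmetic:

```python
# Estimate VRRP failover detection time per RFC 3768 (VRRPv2, as used by keepalived).
def master_down_interval(advert_int: float, backup_priority: int) -> float:
    """Seconds a backup waits without adverts before claiming the VIP."""
    skew_time = (256 - backup_priority) / 256.0
    return 3 * advert_int + skew_time

# With keepalived's default 1s advert interval and a backup priority of 100,
# detection takes ~3.6s -- inside the 5-second failover target above.
print(round(master_down_interval(1.0, 100), 3))  # → 3.609
```

Note this is detection time only; VIP takeover and gratuitous ARP propagation add a little on top, which is why `advert_int 1` leaves useful headroom under the target.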

### Requirement: Network address planning
The analysis SHALL specify VIP addresses for each region.

#### Scenario: Wisconsin VIP
- **WHEN** planning Wisconsin DNS HA
- **THEN** use 192.168.20.2 as VIP, reserve in DHCP (range: .11-.254)

#### Scenario: New York VIP
- **WHEN** planning New York DNS HA
- **THEN** use 192.168.100.2 as VIP, reserve in DHCP (range: .11-.254)

#### Scenario: Miyagi VIP
- **WHEN** planning Miyagi DNS HA
- **THEN** use 192.168.3.2 as VIP, reserve in DHCP (range: .11-.254)
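Since Pi-hole's DHCP server is dnsmasq, the reservation amounts to keeping the pool above the VIP and handing the VIP out as the resolver. A hedged sketch for the Wisconsin subnet (the drop-in filename is hypothetical; the other regions follow the same pattern on their subnets):

```conf
# /etc/dnsmasq.d/10-dhcp-wisconsin.conf (hypothetical filename) -- Pi-hole DHCP
# Pool starts at .11, so the .2 VIP can never be leased to a client.
dhcp-range=192.168.20.11,192.168.20.254,24h
# Advertise the VIP as the DNS server instead of either node's real address.
dhcp-option=option:dns-server,192.168.20.2
```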

### Requirement: Common pitfalls documentation
The analysis SHALL document known pitfalls and mitigations.

#### Scenario: ARP cache stale
- **WHEN** failover occurs
- **THEN** client ARP caches may point to wrong MAC for VIP
- **MITIGATION**: Use a low advertisement interval and gratuitous ARP settings (`garp_master_delay`, `garp_master_repeat`) so the new master refreshes client ARP caches promptly

#### Scenario: Split-brain
- **WHEN** both servers claim VIP
- **THEN** DNS responses are inconsistent
- **MITIGATION**: Set distinct priorities and consistent preempt settings on both nodes; verify VRRP adverts actually pass between them, since split-brain is usually a symptom of blocked adverts

#### Scenario: Firewall blocks VRRP
- **WHEN** firewalls between servers block VRRP protocol
- **THEN** failover cannot occur
- **MITIGATION**: Allow IP protocol 112 (VRRP) and multicast to 224.0.0.18 between the servers
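On the DNS hosts themselves, the mitigation is two accept rules. A sketch in nftables syntax — the table and chain names are assumptions to be adapted to whatever ruleset already exists on the servers:

```conf
# nftables sketch (assumed table/chain names; merge into the existing ruleset).
# VRRP is IP protocol 112 and advertises to the 224.0.0.18 multicast group.
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        ip protocol 112 accept        # VRRP adverts from the peer
        ip daddr 224.0.0.18 accept    # VRRP multicast group
    }
}
```

In unicast mode the multicast rule is unnecessary, but protocol 112 must still be permitted between the two peers.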

### Requirement: Implementation roadmap
The analysis SHALL provide a phased implementation plan.

#### Scenario: Phase 1 - Network prep
- **WHEN** beginning HA implementation
- **THEN** verify network equipment (Ubiquiti + commodity switches) allows VRRP

#### Scenario: Phase 2 - Keepalived config
- **WHEN** network is ready
- **THEN** install and configure keepalived on each server pair

#### Scenario: Phase 3 - Integration
- **WHEN** keepalived is running
- **THEN** integrate with DNS service and monitoring

#### Scenario: Phase 4 - Client migration
- **WHEN** HA is tested
- **THEN** update DHCP to serve VIPs to clients gradually