Add VRRP Smoke Tests for DNS Keepalived HA #40
- Update podmanSetupRaspberryPi submodule to enforce overlay driver in /etc/containers/storage.conf
- Add fuse-overlayfs to stage 5 package install list
- Enforce overlay driver configuration in stage 5 playbook
- Add openspec/changes/podman-overlayfs.md
- Add Telegraf DNS monitoring config (pihole API + dig latency checks)
- Add Datadog monitor JSON definitions (warning/critical alerts)
- Add DNS restart playbook with diagnostic collection
- Add Podman healthcheck for pihole containers
- Add Unbound access-control for monitoring host

Implements:
- OpenTelemetry-compatible DNS health metrics per server
- Datadog → Pagerduty alerting
- Restart playbook for operational response
- Container healthcheck fix for Podman
Analysis document covering:
- Keepalived/VRRP vs round-robin vs client-side options
- VIP address plan (.2 per subnet)
- Network requirements (Ubiquiti + commodity switches)
- Risks: ARP cache, split-brain, firewall
- Implementation roadmap: network check -> keepalived -> integration -> client migration
- Decision: gradual client migration, no cross-region failover, monitor keepalived state
- test-vrrp-unshare.sh: full failover test with master/backup simulation
- test-configs-only.sh: quick syntax validation
- Uses unshare --user for rootless namespace isolation
- Tests VRRP protocol without privileged network access
- Validates config syntax, initial state, and failover behavior
- test-vrrp-podman.sh: containerized test with isolated bridge network
- Uses randomized 10.99.x.x subnets to avoid production conflicts
- Runs keepalived in containers with NET_ADMIN capabilities
- Automatic cleanup of containers, networks, and images
- Safer alternative to unshare mode for production systems
📝 Walkthrough

This PR introduces comprehensive DNS server monitoring infrastructure, high-availability (HA) configuration analysis, and operational tooling. Changes include Datadog monitoring configurations, Telegraf DNS monitoring integration, an Ansible restart playbook, VRRP smoke tests, OpenSpec planning artifacts, and updated roles with OpenTelemetry metrics collection via OTLP.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TG as Telegraf
    participant DD as Datadog Agent
    participant DDmon as Datadog Monitors
    participant PD as Pagerduty
    participant Ops as Operations
    TG->>TG: Poll Pi-hole API & measure DNS latency
    TG->>DD: Send metrics via OTLP (dns_up, latency)
    DD->>DDmon: Scrape & aggregate metrics
    DDmon->>DDmon: Evaluate dns_up thresholds<br/>(5min WARNING, 15min CRITICAL)
    alt Server Down
        DDmon->>PD: Send CRITICAL alert
        PD->>Ops: Create incident notification
        Ops->>Ops: Page on-call engineer
    else Server Recovered
        DDmon->>PD: Send recovery notification
        PD->>Ops: Resolve incident
    end
```
```mermaid
sequenceDiagram
    participant Ops as Operations
    participant AP as Ansible Playbook
    participant SSH as SSH Session
    participant PH as Pi-hole Container
    participant Logs as Diagnostics
    Ops->>AP: Invoke dns-restart playbook
    AP->>SSH: Connect to DNS server
    SSH->>SSH: Escalate to root
    AP->>PH: Stop pihole container
    AP->>PH: Start pihole container
    alt Restart Success
        AP->>AP: Restart unbound service
        AP->>Ops: Display success message
    else Restart Failure
        AP->>PH: Run diagnostic command
        PH->>Logs: Capture diagnostics output
        AP->>Logs: Fetch logs to local machine
        AP->>Ops: Display failure with logs
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

This PR spans multiple domains (monitoring infrastructure, HA analysis, automation, testing) with substantial heterogeneous changes: Datadog/OTLP configurations, comprehensive OpenSpec planning documents, Ansible playbooks with conditional logic and error handling, VRRP test scripts with namespace isolation, and role updates. While individual files are relatively straightforward, the cross-cutting nature of the changes and the density of new infrastructure require detailed review across distinct areas.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 9
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2 (1)
92-143: ⚠️ Potential issue | 🔴 Critical

Add the `[[outputs.influxdb_v2]]` table header before the InfluxDB configuration.

The InfluxDB output section (lines 97–143) is missing the required `[[outputs.influxdb_v2]]` table declaration. Keys like `urls`, `token`, `organization`, `influx_uint_support`, `bucket`, `namedrop`, and `namepass` are orphaned without proper TOML table context, which will cause the configuration to be misparsed or rejected by Telegraf.

Proposed fix

```diff
+[[outputs.influxdb_v2]]
 ## The URLs of the InfluxDB cluster nodes.
 ##
 ## Multiple URLs can be specified for a single cluster, only ONE of the
 ## urls will be written to each interval.
 ## ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
 urls = ["{{ influx_db_url }}"]
```
🟠 Major comments (17)
retire_forks.sh (1)
4-13: ⚠️ Potential issue | 🟠 Major

Return non-zero when any delete fails; don't always print success.

Line 13 prints `Done.` even if a delete command fails. That masks partial failure and makes automation unreliable.

Suggested failure tracking + accurate final status

```diff
 #!/bin/bash
 # Script to retire unused forks with no personal commits

+failures=0
+delete_repo() {
+  local repo="$1"
+  if ! gh repo delete "$repo" --yes; then
+    echo "Failed to delete $repo" >&2
+    failures=$((failures + 1))
+  fi
+}
+
 echo "Deleting nifi fork..."
-gh repo delete NickJLange/nifi --yes
+delete_repo NickJLange/nifi

 echo "Deleting homebridge forks..."
-gh repo delete NickJLange/homebridge-mqttthing --yes
-gh repo delete NickJLange/homebridge-wyze-connected-home --yes
-gh repo delete NickJLange/homebridge-tasmota --yes
-gh repo delete NickJLange/homebridge.container --yes
+delete_repo NickJLange/homebridge-mqttthing
+delete_repo NickJLange/homebridge-wyze-connected-home
+delete_repo NickJLange/homebridge-tasmota
+delete_repo NickJLange/homebridge.container

-echo "Done."
+if [[ "$failures" -gt 0 ]]; then
+  echo "Completed with $failures failure(s)." >&2
+  exit 1
+fi
+echo "Done."
```
playbooks/initial-playbook-stage-5.yml (1)
48-64: ⚠️ Potential issue | 🟠 Major

Guard storage backend mutations or skip on already-initialized hosts.

Changing `driver`/`graphroot` in `storage.conf` requires `podman system reset` to occur first per the Podman documentation; if the config is edited before the reset, cleanup may fail silently and corrupt existing container/image state. For idempotent playbook re-runs on the same host, add a check: either skip these tasks if Podman storage is already initialized, or explicitly run `podman system reset` before mutating the config.
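The guard can be sketched in shell before porting it to an Ansible `when:` condition. Everything below is an illustrative stand-in, not playbook code: `storage_initialised` is a hypothetical helper, and a temporary directory stands in for `/var/lib/containers/storage`.

```shell
# Sketch of the proposed guard, assuming a non-empty graphroot means
# Podman storage is already initialised on this host.
storage_initialised() {
  # true when the directory exists and contains anything at all
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

graphroot=$(mktemp -d)   # stands in for /var/lib/containers/storage on a fresh host
if storage_initialised "$graphroot"; then
  verdict="skip: run 'podman system reset' before changing the driver"
else
  verdict="safe: no existing storage, OK to set driver = overlay"
fi
echo "$verdict"
rm -rf "$graphroot"
```

In Ansible terms, the `storage_initialised` probe would become a `stat`/`find` task whose result gates the `ini_file` edits.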
tests/vrrp-smoke-test/test-vrrp-unshare.sh (1)
243-248: ⚠️ Potential issue | 🟠 Major

Recovery assertion is currently a historical log check, not a stability check.

Line 243 greps the whole backup log for `Entering MASTER STATE`; once failover has happened, this stays true even if the node later regresses. This can mask recovery regressions. Assert on the most recent state transition instead, and fail if any later transition leaves MASTER.
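A last-transition check can be sketched as follows. The log phrasing is assumed from keepalived's usual syslog output, and `last_transition_is_master` is a hypothetical helper, not a function from the test script:

```shell
# Sketch: pass only when the most recent "Entering ... STATE" line is MASTER,
# so a later regression to BACKUP fails the check.
last_transition_is_master() {
  [ "$(grep -o 'Entering [A-Z]* STATE' "$1" | tail -n 1)" = "Entering MASTER STATE" ]
}

# Demo against a synthetic log: failover to MASTER, then a later regression.
log=$(mktemp)
printf '%s\n' \
  "(VI_1) Entering BACKUP STATE" \
  "(VI_1) Entering MASTER STATE" \
  "(VI_1) Entering BACKUP STATE" > "$log"

if last_transition_is_master "$log"; then
  result="recovery stable"
else
  result="regressed"
fi
echo "$result"
rm -f "$log"
```

Because only the final transition is examined, the synthetic log above reports `regressed` even though a MASTER transition appears earlier in the history.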
tests/vrrp-smoke-test/test-vrrp-podman.sh (1)
149-167: ⚠️ Potential issue | 🟠 Major

Container selection/cleanup is nondeterministic and can impact unrelated resources.

Using `ancestor=... | head -1` / `tail -1` (lines 177–179, 208–210) can pick the wrong container. Cleanup by ancestor and a shared image tag (`localhost/vrrp-test:latest`) can also stop/remove containers or images outside this specific run.

Suggested scoping fix

```diff
+IMAGE_TAG="localhost/vrrp-test:${NETWORK_NAME}"
+MASTER_CONTAINER="vrrp-master-${NETWORK_NAME}"
+BACKUP_CONTAINER="vrrp-backup-${NETWORK_NAME}"

-    podman build -t localhost/vrrp-test:latest -f "$TEST_DIR/tmp/Containerfile" "$TEST_DIR/tmp"
+    podman build -t "$IMAGE_TAG" -f "$TEST_DIR/tmp/Containerfile" "$TEST_DIR/tmp"

-    podman run -d --name "vrrp-master-${RANDOM}" \
+    podman run -d --name "$MASTER_CONTAINER" \
         ...
-        localhost/vrrp-test:latest \
+        "$IMAGE_TAG" \
         -f /etc/keepalived/keepalived.conf -n -l -D

-    podman run -d --name "vrrp-backup-${RANDOM}" \
+    podman run -d --name "$BACKUP_CONTAINER" \
         ...
-        localhost/vrrp-test:latest \
+        "$IMAGE_TAG" \
         -f /etc/keepalived/keepalived.conf -n -l -D

-    MASTER_CONTAINER=$(podman ps --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}" | head -1)
-    BACKUP_CONTAINER=$(podman ps --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}" | tail -1)
+    # Reuse known names set during run_containers()

-    for container in $(podman ps -a --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}"); do
-        podman stop "$container" 2>/dev/null || true
-        podman rm "$container" 2>/dev/null || true
-    done
+    podman stop "$MASTER_CONTAINER" 2>/dev/null || true
+    podman stop "$BACKUP_CONTAINER" 2>/dev/null || true
+    podman rm "$MASTER_CONTAINER" 2>/dev/null || true
+    podman rm "$BACKUP_CONTAINER" 2>/dev/null || true

-    podman rmi localhost/vrrp-test:latest 2>/dev/null || true
+    podman rmi "$IMAGE_TAG" 2>/dev/null || true
```

Also applies to: 177-179, 208-210, 234-247
tests/vrrp-smoke-test/Makefile (1)
11-13: ⚠️ Potential issue | 🟠 Major

`make test` currently defaults to the less-isolated path.

Line 12 runs `test-vrrp-unshare.sh`; that makes the default test entrypoint less safe than the Podman mode described as recommended elsewhere in this PR. This can cause accidental host-namespace testing on production-adjacent systems.

Suggested target split

```diff
 test:
 	@echo "Running full VRRP test suite..."
-	@./test-vrrp-unshare.sh
+	@./test-vrrp-podman.sh
+
+test-unshare:
+	@echo "Running rootless unshare VRRP test..."
+	@./test-vrrp-unshare.sh
```
tests/vrrp-smoke-test/test-configs-only.sh (1)
13-22: ⚠️ Potential issue | 🟠 Major

This script does not validate configuration syntax yet.

Lines 13–22 only verify the binary/version, so broken keepalived configs still pass `quick`. That defeats the stated purpose of config validation.

Suggested implementation direction

```diff
-# Would validate actual configs if they exist
-# For now, just check keepalived binary works
-if [ -x "$KEEPALIVED_BIN" ]; then
-    echo " keepalived binary: OK ($KEEPALIVED_BIN)"
-    echo " Version: $($KEEPALIVED_BIN -v 2>&1 | head -1)"
-else
-    echo " ERROR: keepalived not found"
-    exit 1
-fi
+if [ ! -x "$KEEPALIVED_BIN" ]; then
+    echo " ERROR: keepalived not found"
+    exit 1
+fi
+
+echo " keepalived binary: OK ($KEEPALIVED_BIN)"
+echo " Version: $($KEEPALIVED_BIN -v 2>&1 | head -1)"
+
+CONFIG_DIR="${CONFIG_DIR:-$PWD/tmp}"
+shopt -s nullglob
+configs=("$CONFIG_DIR"/*.conf)
+if [ ${#configs[@]} -eq 0 ]; then
+    echo " ERROR: no .conf files found in $CONFIG_DIR"
+    exit 1
+fi
+for conf in "${configs[@]}"; do
+    "$KEEPALIVED_BIN" -t -f "$conf" >/dev/null
+    echo " OK: $(basename "$conf")"
+done
```
tests/vrrp-smoke-test/test-vrrp-podman.sh (1)
53-57: ⚠️ Potential issue | 🟠 Major

Bridge network lacks the `--internal` flag, contradicting stated isolation guarantees.

The file header declares the test "safe to run on production servers" with an "isolated bridge network" (lines 4–5), and line 275 reports "Network isolation: Podman bridge network" in the test output. However, the bridge network created at lines 53–57 omits the `--internal` flag. Without it, containers retain outbound routing to external networks via NAT masquerading, weakening the isolation guarantee claimed in the file's documentation.

To align with the stated safety claim, either add `--internal` to the network creation or update the documentation to clarify that isolation applies only to inbound access.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md (1)
14-24: ⚠️ Potential issue | 🟠 Major

Replace `httpGet` wording with Podman command-based healthcheck semantics.

Lines 15 and 22 currently describe a native `httpGet` method, which is not a Podman healthcheck type. Podman supports only `CMD` and `CMD-SHELL` healthchecks; HTTP validation must be implemented by executing a command (e.g., `curl`) inside the container.

Proposed spec wording update

```diff
-### Requirement: Healthcheck uses appropriate check method
-The healthcheck SHALL use either exec or httpGet method that works reliably with Podman.
+### Requirement: Healthcheck uses a Podman-compatible command method
+The healthcheck SHALL use a command-based method supported by Podman. HTTP validation MAY be implemented by executing curl/wget within that command.

-#### Scenario: HTTP method works
-- **WHEN** healthcheck uses httpGet to ping Pi-hole web interface
-- **THEN** the HTTP request returns 200 OK within the timeout period
+#### Scenario: HTTP probe command works
+- **WHEN** healthcheck executes an HTTP probe command inside the container
+- **THEN** the command returns exit code 0 within the timeout period
```
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md (1)
39-46: ⚠️ Potential issue | 🟠 Major

Add diagnostics data-handling controls (redaction, permissions, retention).

`pihole -d` output can include sensitive host/network details; the spec should require restricted log permissions and retention/redaction rules.

Suggested spec refinement

```diff
 #### Scenario: Diagnostics saved
 - **WHEN** diagnostics are collected
 - **THEN** output is saved to a timestamped file in the playbook logs directory
+- **AND** logs are written with restricted permissions (e.g., owner-read/write only)
+- **AND** sensitive tokens/credentials are redacted
+- **AND** retention period is defined and enforced
```
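The three controls can be sketched in shell terms; the playbook would implement the equivalent in Ansible tasks. The paths, the redaction pattern, and the 30-day window below are illustrative assumptions, not values from the playbook:

```shell
# Sketch of the proposed controls: restrictive permissions,
# token redaction, and age-based retention for diagnostic logs.
LOG_DIR=$(mktemp -d)     # stands in for the playbook logs directory
umask 077                # new files default to owner read/write only

diag="$LOG_DIR/pihole-diag.log"
# redact anything that looks like token=/password=/secret= before storing
printf 'api_token=abc123\nhost=dns1\n' \
  | sed -E 's/(token|password|secret)=[^[:space:]]*/\1=REDACTED/g' > "$diag"
chmod 600 "$diag"

# retention: drop diagnostics older than 30 days
find "$LOG_DIR" -name '*.log' -mtime +30 -delete

redacted=$(cat "$diag")
echo "$redacted"
rm -rf "$LOG_DIR"
```

The same redaction filter would run between collecting `pihole -d` output and writing the timestamped file, so secrets never reach disk unredacted.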
openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md (1)
61-65: ⚠️ Potential issue | 🟠 Major

Firewall mitigation is contradictory with the unicast fallback scenario.

This section requires multicast, but the spec also defines unicast mode when multicast is blocked. The mitigation should be mode-aware to avoid incorrect firewall policy.

Suggested doc fix

```diff
-- **MITIGATION**: Allow protocol 112 and multicast
+- **MITIGATION**: Allow VRRP protocol 112 between peers; if multicast mode is used, allow 224.0.0.18, otherwise permit configured unicast peer traffic.
```
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md (1)
14-24: ⚠️ Potential issue | 🟠 Major

Avoid requiring blanket root escalation; specify least-privilege command execution.

Requiring a root shell for the whole flow broadens the blast radius. The requirement should prefer command-scoped sudo/become for only the operations that need it.

Suggested spec refinement

```diff
-### Requirement: Privilege escalation to root
-The playbook SHALL escalate to root privileges after SSH connection.
+### Requirement: Least-privilege privilege escalation
+The playbook SHALL use least-privilege escalation only for commands that require it.
@@
-- **WHEN** sudo or su is executed
-- **THEN** the shell runs as root user
+- **WHEN** a privileged operation is required
+- **THEN** the playbook uses scoped sudo/become for that command only
```

Based on learnings: For SSH Lockdown & User Access (HOMETECH-17), lock down SSH ports, validate local accounts, and strictly control user access.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md (1)
47-52: ⚠️ Potential issue | 🟠 Major

Current idempotency requirement conflicts with unconditional restart behavior.

A restart operation is inherently disruptive if always executed. Define health-check gating so repeated runs do not bounce healthy DNS services.

Suggested spec refinement

```diff
-### Requirement: Playbook idempotency
-The playbook SHALL be idempotent - running it multiple times produces the same result.
+### Requirement: Health-gated restart behavior
+The playbook SHALL restart DNS only when health checks fail or explicit force-restart is requested.
@@
-- **WHEN** the playbook is run multiple times on a healthy server
-- **THEN** the server remains healthy and no errors are reported
+- **WHEN** the playbook is run multiple times on a healthy server
+- **THEN** restart is skipped and the server remains healthy with no service interruption
```
openspec/specs/pihole-healthcheck-fix/spec.md (1)
14-23: ⚠️ Potential issue | 🟠 Major

Remove or clarify "httpGet" as a Podman healthcheck method.

Podman does not support `httpGet` as a native healthcheck method. Healthchecks in Podman are always command-based; for HTTP health checks, use `exec` with command-line tools like `curl` (e.g., `curl -f http://localhost:80/`). If this spec intends to cover Kubernetes YAML used with `podman kube play`, clarify that `httpGet` probes are converted to `exec` commands and require curl/wget inside the container.
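Command-based healthchecks judge health purely by the command's exit code, which is the contract a `curl -f` probe relies on. A minimal sketch: the `probe` function below is a stub standing in for `curl -f` so the exit-code contract can be exercised without a running container, while the commented `podman run` line shows how it would be wired up via Podman's standard `--health-cmd` option:

```shell
# In a real container this would be configured roughly as:
#   podman run --health-cmd 'curl -f http://localhost:80/admin/ || exit 1' ...
# Podman marks the container healthy iff the command exits 0.
probe() {
  # stub for `curl -f`: succeed only for the known-healthy endpoint
  [ "$1" = "http://localhost:80/admin/" ]
}

if probe "http://localhost:80/admin/"; then
  health="healthy"
else
  health="unhealthy"
fi
echo "$health"
```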
openspec/specs/dns-health-monitoring/spec.md (1)
4-13: ⚠️ Potential issue | 🟠 Major

Resolve server identifier mismatch in the requirement text.

Line 4 uses `new york`, but line 12 constrains labels to `new_york`. Keep one canonical identifier to prevent implementation drift.

Suggested fix

```diff
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miyami) using OpenTelemetry-compatible format.
```
openspec/changes/dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md (1)
4-13: ⚠️ Potential issue | 🟠 Major

Standardize server identifiers to a single canonical format.

Line 4 uses `new york`, while line 12 requires `new_york`. This ambiguity can cause metric label/query mismatches.

Suggested fix

```diff
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miyami) using OpenTelemetry-compatible format.
```
dns-monitoring/datadog-monitor-dns-latency-warning.json (1)
4-5: ⚠️ Potential issue | 🟠 Major

Add server grouping to the query and align the threshold with the critical setting.

The line 4 query uses `> 100` while the `critical` threshold is set to `500` (line 17), and line 5 references `{{server.name}}` without `by {server}` grouping in the query. Missing grouping means the template variable won't resolve per-server context, and the threshold mismatch creates ambiguous alert behavior.

Suggested fix

```diff
-  "query": "max(last_5m):max:dns_internal_latency{*} > 100",
+  "query": "max(last_5m):max:dns_internal_latency{*} by {server} > 500",
   "message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",
```

Note: this changes the alert threshold from 100 to 500 (critical level). Verify this aligns with intended alert behavior.
Also applies to: 16-18
roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2 (1)
4-8: ⚠️ Potential issue | 🟠 Major

Add explicit JSON parsing for Pi-hole API HTTP input.
The Pi-hole endpoint returns JSON, but without explicit `data_format` configuration, Telegraf defaults to InfluxDB line protocol parsing, causing the JSON response to fail parsing and metrics extraction to be incomplete.

Suggested fix

```diff
 [[inputs.http]]
   urls = ["http://{{ hostvars[host].ansible_host }}/admin/api.php?summaryRaw"]
   name_override = "pihole"
   interval = "60s"
   timeout = "10s"
+  data_format = "json"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2` around lines 4 - 8, The Telegraf HTTP input for the Pi-hole endpoint (the [[inputs.http]] block with name_override = "pihole") lacks an explicit data_format, so Telegraf tries to parse the response as Influx line protocol; add data_format = "json" to that inputs.http stanza (and if you only need specific fields, add json_query or json_string_fields as appropriate) so the Pi-hole JSON payload is parsed correctly and metrics are extracted.
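To see why `data_format = "json"` matters, the behavior of Telegraf's JSON parser can be approximated in a few lines: numeric values become metric fields, while strings are dropped unless listed in `json_string_fields`. The payload below is an illustrative stand-in for the Pi-hole `summaryRaw` response, not its full schema:

```python
import json

# Illustrative summaryRaw-style payload (field names are examples only).
payload = '{"domains_being_blocked": 128000, "dns_queries_today": 5423, "ads_blocked_today": 812, "status": "enabled"}'

data = json.loads(payload)

# Approximation of Telegraf's "json" data_format: keep numeric values as
# metric fields; non-numeric values need json_string_fields to survive.
fields = {k: v for k, v in data.items()
          if isinstance(v, (int, float)) and not isinstance(v, bool)}

print(fields)
```

With the default line-protocol parser, none of these fields would be extracted at all, which is the failure mode the finding describes.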
🟡 Minor comments (11)
openspec/changes/dns-server-monitoring-tool/tasks.md-8-8 (1)
8-8: ⚠️ Potential issue | 🟡 Minor

Fix naming/spelling to avoid operational confusion.

Line 8 and Line 25 use `miyami` (likely `miami`), and Line 42 uses `pihole` while the rest of the docs usually use `Pi-hole`. Standardizing these labels helps avoid mismatch across inventory/specs.

Also applies to: 25-25, 42-42
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/tasks.md` at line 8, Update the server label names to correct spelling and standard casing: change any occurrence of "miyami" to "miami" and normalize "pihole" to "Pi-hole" (e.g., in the task description "1.6 Configure server-specific labels (all, new_york, miyami)" and other occurrences). Ensure all references across the document use "miami" and "Pi-hole" consistently to avoid inventory/spec mismatches.

openspec/changes/dns-server-monitoring-tool/proposal.md (1)
3-3: ⚠️ Potential issue | 🟡 Minor

Confirm server name spelling consistency (`miyami` vs `miami`).

Line 3 and Line 15 use `miyami`; if this is not the canonical host identifier, correct it to avoid confusion across specs/playbooks.

Also applies to: 15-15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/proposal.md` at line 3, The document uses the host name "miyami" in multiple places (e.g., the opening paragraph and line ~15); verify the canonical server identifier and standardize all occurrences to the correct spelling (either "miami" or "miyami") across the proposal text and any referenced specs/playbooks, updating every instance to match and running a global search/replace to avoid inconsistent hostnames.

openspec/changes/dns-keepalived-analysis/tasks.md (1)
4-4: ⚠️ Potential issue | 🟡 Minor

Correct vendor spelling for consistency.

Use `Ubiquiti` instead of `Ubiquity` to avoid confusion in follow-up research/tasks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-keepalived-analysis/tasks.md` at line 4, Update the checklist item text that currently reads "Verify Ubiquity switch allows VRRP multicast (224.0.0.18)" to use the correct vendor name "Ubiquiti" so it reads "Verify Ubiquiti switch allows VRRP multicast (224.0.0.18)"; locate and edit the string in the tasks.md checklist entry (the line beginning with "- [ ] 1.2 Verify Ubiquity switch...") and replace "Ubiquity" with "Ubiquiti".

openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md (1)
71-71: ⚠️ Potential issue | 🟡 Minor

Fix vendor spelling (`Ubiquiti`).

This keeps network task references consistent across docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md` at line 71, Update the vendor spelling from "Ubiquity" to the correct "Ubiquiti" in the spec text — specifically change the phrase "verify network equipment (Ubiquity + commodity switches) allows VRRP" to "verify network equipment (Ubiquiti + commodity switches) allows VRRP"; also scan the document for any other occurrences of "Ubiquity" and replace them to keep network task references consistent.

openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md (1)
3-3: ⚠️ Potential issue | 🟡 Minor

Fix inconsistent region naming to avoid ops targeting mistakes.

`miyami` appears inconsistent with the rest of the DNS region naming and can cause confusion in host/group mapping across docs and tooling.

Suggested doc fix

```diff
-The home DNS infrastructure (Pi-hole + Unbound on multiple servers: all, new york, miyami) is experiencing flaky behavior.
+The home DNS infrastructure (Pi-hole + Unbound on multiple servers: all, new york, miami) is experiencing flaky behavior.
-- `dns-health-monitoring`: OpenTelemetry-compatible DNS health metrics collection at individual server level (all, new york, miyami).
+- `dns-health-monitoring`: OpenTelemetry-compatible DNS health metrics collection at individual server level (all, new york, miami).
```

Also applies to: 15-15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md` at line 3, The region name "miyami" is a typo and risks misrouting ops; find and replace the incorrect "miyami" with the correct "miami" wherever it appears (e.g., the region list "all, new york, miyami" in proposal.md and any host/group mapping, inventories, or tooling that refer to that region), then verify host/group names, documentation headings, and any automation (healthchecks, restart scripts, Ansible/Podman labels) still align with the corrected "miami" identifier so mappings and tooling continue to work consistently.

openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md (1)
30-30: ⚠️ Potential issue | 🟡 Minor

Repository path in this requirement appears stale.

The monitor definitions in this PR are under `dns-monitoring/`, so pointing to `monitoring/datadog/` can misdirect reviewers/operators.

Suggested doc fix

```diff
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
```
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md` at line 30, Update the stale repository path in the requirement text: locate the sentence starting with "**THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/" and change the path segment "monitoring/datadog/" to the correct "dns-monitoring/" so the sentence reads that the JSON/YAML definition is under dns-monitoring/.

dns-monitoring/datadog-monitor-dns-down-critical.json (1)
2-2: ⚠️ Potential issue | 🟡 Minor

Fix monitor title typo (`PageDuty` → `PagerDuty`).

Line 2 has a naming typo that will make searching/filtering monitor definitions harder.

Suggested fix

```diff
-  "name": "DNS Server Down - CRITICAL - PageDuty",
+  "name": "DNS Server Down - CRITICAL - PagerDuty",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dns-monitoring/datadog-monitor-dns-down-critical.json` at line 2, The monitor name string "DNS Server Down - CRITICAL - PageDuty" contains a typo; update the value used in the "name" field (the monitor title) to "DNS Server Down - CRITICAL - PagerDuty" and search for any other occurrences of "PageDuty" in the repository to replace with "PagerDuty" so naming/searching stays consistent.

openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md (1)
4-12: ⚠️ Potential issue | 🟡 Minor

Normalize server identifiers and fix typo (`miyami` → `miami`).

Line 4 and Line 12 use inconsistent naming (`new york` vs `new_york`) and misspell `miami`. Keep identifiers consistent and correctly spelled across scenarios.

Suggested fix

```diff
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miami) using OpenTelemetry-compatible format.
...
-- **THEN** each metric includes a `server` label with value matching the server identifier (all, new_york, or miyami)
+- **THEN** each metric includes a `server` label with value matching the server identifier (all, new_york, or miami)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md` around lines 4 - 12, Normalize the server identifiers and correct the typo by replacing occurrences of "miyami" with "miami" and unifying spacing/format to use "new_york" (underscore) instead of "new york"; update both the metrics scenario description that lists servers and the expected `server` label values so all references use the exact identifiers "all", "new_york", and "miami" (ensure strings like `miyami`, `new york`, and any inconsistent `new_york`/`new york` usages are corrected to the canonical forms).

openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md (1)
30-30: ⚠️ Potential issue | 🟡 Minor

Repository path in spec does not match this PR layout.

Line 30 points to `monitoring/datadog/`, but monitors in this PR are under `dns-monitoring/`. Update the spec path to avoid confusion.

Suggested fix

```diff
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md` at line 30, Update the repository path in the spec to match this PR's layout: replace the string "monitoring/datadog/" referenced in specs/dns-alerting/spec.md with "dns-monitoring/" so the sentence reads that the JSON/YAML definition is in the repository under dns-monitoring/; ensure the file spec.md and the exact occurrence of "monitoring/datadog/" are updated and any other mentions in the same spec are adjusted accordingly.

openspec/specs/dns-alerting/spec.md (1)
26-30: ⚠️ Potential issue | 🟡 Minor

Align monitor-definition path with actual repository location.

Line 30 points to `monitoring/datadog/`, but current monitor configs are in `dns-monitoring/`. This can misdirect maintenance/review.

Suggested fix

```diff
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/specs/dns-alerting/spec.md` around lines 26 - 30, Update the scenario text that references the monitor definition location: replace the path string "monitoring/datadog/" with the actual repository directory "dns-monitoring/" in the Scenario: Monitor definition in repo section so reviewers are directed to the correct folder; ensure the sentence reading "they can view the JSON/YAML definition in the repository under monitoring/datadog/" is changed to reference "dns-monitoring/" (the current location of monitor configs).

playbooks/dns-restart.yml (1)
55-59: ⚠️ Potential issue | 🟡 Minor

Diagnostics export discards stderr output.

At line 57, only `pihole_diag.stdout` is persisted. When the diagnostic command runs on failure (line 53), stderr may contain valuable error context that is lost.

📝 Suggested fix

```diff
-        content: "{{ pihole_diag.stdout }}"
+        content: |
+          # STDOUT
+          {{ pihole_diag.stdout | default('') }}
+
+          # STDERR
+          {{ pihole_diag.stderr | default('') }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/dns-restart.yml` around lines 55 - 59, The copy task currently only saves pihole_diag.stdout which discards stderr; update the task that writes the diagnostics (the copy task referencing pihole_diag.stdout) to include both stdout and stderr (e.g., concatenate pihole_diag.stdout and pihole_diag.stderr with a separator and using default('') to avoid undefined errors) and adjust the when condition to check that pihole_diag is defined and at least one of pihole_diag.stdout or pihole_diag.stderr is defined so error output is preserved.
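The merged-stream idea can be illustrated outside Ansible. A small sketch, assuming a registered-result dict with optional `stdout`/`stderr` keys, mirroring Jinja2's `default('')` fallback:

```python
def render_diagnostics(result: dict) -> str:
    # Keep both streams, defaulting to empty strings when a key is missing,
    # just as the Jinja2 default('') filter does in the template.
    stdout = result.get("stdout", "")
    stderr = result.get("stderr", "")
    return f"# STDOUT\n{stdout}\n\n# STDERR\n{stderr}\n"

# A failing command often writes only to stderr; without the merge
# this context would be lost entirely.
report = render_diagnostics({"stderr": "Error: no container with name pihole"})
print(report)
```

The same labeled-section layout keeps the exported diagnostics file readable even when one stream is empty.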
🧹 Nitpick comments (12)
tests/vrrp-smoke-test/test-vrrp-unshare.sh (1)
10-10: Remove or use `TEST_IP_PREFIX`.

Line 10 is currently unused, which adds noise and can mislead future edits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/vrrp-smoke-test/test-vrrp-unshare.sh` at line 10, The variable TEST_IP_PREFIX is declared but unused; either remove the declaration or use it where appropriate (e.g., replace hard-coded IP prefixes in test-vrrp-unshare.sh with TEST_IP_PREFIX) so the script references TEST_IP_PREFIX consistently; update any occurrences of literal "127.99.0" in functions or commands (such as IP_ADDR construction or test setup) to build IPs from TEST_IP_PREFIX, or simply delete the TEST_IP_PREFIX line if not needed.

tests/vrrp-smoke-test/Makefile (2)
1-1: Include `check` in `.PHONY`.

Line 1 omits `check`, so a file named `check` could shadow the target.
Verify each finding against the current code and only fix it if needed. In `@tests/vrrp-smoke-test/Makefile` at line 1, The .PHONY declaration is missing the check target which allows a file named "check" to shadow the Makefile target; update the .PHONY line to include check (e.g., modify the .PHONY token that currently lists "all test quick clean" to also list "check") so that the check target is always treated as phony.
6-8: Wire prerequisite validation into executable targets.

`check` exists but `quick`/`test` do not depend on it, so failures surface later and less clearly.

Also applies to: 11-13, 21-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/vrrp-smoke-test/Makefile` around lines 6 - 8, The Makefile targets 'quick' and 'test' currently run without first invoking the prerequisite 'check', so configuration validation failures are reported later; update the Makefile so that the 'quick' and 'test' targets depend on 'check' (e.g., change their target lines to "quick: check" and "test: check") and apply the same dependency wiring to the other similar targets referenced in the diff (lines around 11-13 and 21-25) so that 'check' runs automatically before those targets execute.

openspec/changes/dns-server-monitoring-tool/tasks.md (1)
3-3: Prefer repo-relative paths in task docs.

Using `~/dev/src/...` makes the checklist machine/user-specific. Consider repository-relative references for portability.

Also applies to: 24-24
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/tasks.md` at line 3, The task list uses a user-specific absolute path ("~/dev/src/mkrasberry/dns-monitoring/") which should be replaced with a repository-relative reference; update the checklist entry (the line starting "- [x] 1.1 Create metrics exporter script in ~/dev/src/mkrasberry/dns-monitoring/") to use a repo-relative path such as "./dns-monitoring/" or "mkrasberry/dns-monitoring/" so the docs are portable across machines and mirrors the same change for the similar occurrence at line 24.

AGENTS.md (1)
1-79: Re-add an explicit Spec-Driven Development requirement.

This rewrite is clear, but it should explicitly preserve the mandatory spec-driven workflow for non-trivial work to avoid process drift.
Based on learnings: All agents MUST adhere to the Spec-Driven Development workflow for non-trivial tasks as outlined in the official AI Agent Standards.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@AGENTS.md` around lines 1 - 79, The document removed an explicit requirement for mandatory Spec-Driven Development; re-add a clear, mandatory clause under the Engineering Preferences or Workflow sections stating "All agents MUST follow the Spec-Driven Development workflow for non-trivial tasks as defined in the official AI Agent Standards," and reference the canonical workflow artifact ("Spec-Driven Development workflow" / "AI Agent Standards") so reviewers and contributors know it's non-optional; update the "BEFORE YOU START" section to require choosing BIG CHANGE vs SMALL CHANGE only after confirming adherence to the Spec-Driven Development requirement and ensure the phrasing uses MUST (uppercase) to indicate mandatory policy.openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md (1)
17-17: Use repository-relative paths instead of `~/...` in shared specs.

Home-directory paths are machine-specific and make the spec harder to follow for other contributors.
Suggested doc fix
```diff
-- `dns-restart-playbook`: Ansible-style playbook (in ~/dev/src/mkrasberry) to SSH to DNS servers, restart the pihole service as the pihole user, and collect diagnostic output if the service is down.
+- `dns-restart-playbook`: Ansible-style playbook (in `playbooks/dns-restart.yml`) to SSH to DNS servers, restart the pihole service as the pihole user, and collect diagnostic output if the service is down.
-- New code in ~/dev/src/mkrasberry (playbook, monitoring config)
+- New code in this repository (playbook, monitoring config)
```

Also applies to: 25-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md` at line 17, The spec entry for `dns-restart-playbook` uses a home-directory path (`~/dev/src/mkrasberry`); replace that with a repository-relative path (e.g., `docs/playbooks/dns-restart-playbook` or `tools/playbooks/dns-restart-playbook`) wherever it appears (including the other occurrence at lines 25-25) so the shared spec is machine-agnostic and points contributors to the repo location for `dns-restart-playbook`.

roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2 (1)
88-90: Remove commented legacy Datadog output after migration is confirmed.

Keeping stale output snippets increases ambiguity during incident debugging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2` around lines 88 - 90, Remove the stale, commented Datadog output block for [[outputs.datadog]] (the lines referencing apikey= "{{ datadog_telegraf_api_key }}" and url = "https://app.datadoghq.com/api/v1/series") from the telegraf.conf.j2 template now that migration is confirmed; locate the commented snippet around the [[outputs.datadog]] marker and delete those commented lines so the template no longer contains legacy Datadog output remnants.

openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md (1)
3-13: Add explicit SSH hardening acceptance criteria.

This section defines connectivity behavior but misses concrete controls for restricted SSH exposure and account validation. Add scenarios covering locked-down SSH ports, validated local accounts, and access restrictions.
Based on learnings: For SSH Lockdown & User Access (HOMETECH-17), lock down SSH ports, validate local accounts, and strictly control user access.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md` around lines 3 - 13, The SSH connectivity requirement lacks explicit hardening acceptance criteria; update the spec section titled "Requirement: SSH connectivity to DNS servers" by adding new scenarios that enforce SSH lockdown and account controls: add a scenario "SSH ports locked down" that requires SSH to run only on approved ports and verifies firewall rules, a scenario "Local account validation" that ensures only validated local accounts exist and inactive/unknown accounts are locked/removed, and a scenario "Restricted SSH access" that mandates key-based access only, restricted user lists, and denies root/password authentication; reference the existing scenario names ("Scenario: SSH connection successful" and "Scenario: SSH connection fails") to keep flow consistent and include explicit acceptance checks and expected abort/failure behavior for each new scenario.

openspec/changes/dns-server-monitoring-tool/design.md (2)
91-91: Use hyphenated compound adjective.

Line 91 should read `5-minute` for clarity and style consistency.
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/design.md` at line 91, The markdown uses an unhyphenated compound adjective "5 minute failure threshold before alerting" under the "Mitigation" bullet; change it to the hyphenated form "5-minute failure threshold before alerting" so the compound adjective is grammatically correct and matches style consistency (edit the line containing the "Mitigation" bullet / the phrase "5 minute failure threshold before alerting").
4-7: Consider matching design server IDs to spec label format.

Specs use machine identifiers like `new_york`; this design uses `new york`. Aligning terms reduces ambiguity when mapping docs to metric labels.
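One way to keep the two forms from drifting is to normalize display names into canonical labels at the point of use. A hedged sketch; the canonical set here (`all`, `new_york`, `miami`) is assumed from the spelling fixes requested elsewhere in this review:

```python
# Assumed canonical identifiers; adjust if the spec settles on different ones.
CANONICAL = {
    "all": "all",
    "new york": "new_york",
    "new_york": "new_york",
    "miyami": "miami",  # known misspelling
    "miami": "miami",
}

def to_label(name: str) -> str:
    key = name.strip().lower()
    # Unknown names fall back to a safe underscore form.
    return CANONICAL.get(key, key.replace(" ", "_"))

print([to_label(s) for s in ["all", "new york", "miyami"]])
# → ['all', 'new_york', 'miami']
```

Running this mapping wherever labels are emitted (exporter, monitor queries, docs tooling) would make the "new york" vs "new_york" mismatch impossible by construction.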
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-server-monitoring-tool/design.md` around lines 4 - 7, Update the server ID list to use the spec label format (underscores instead of spaces): replace "new york" with "new_york" and ensure "miyami" matches the canonical spec token (correct to "miyagi" if the spec uses that) while keeping "all" as-is; update any references in the design text that mention the old forms so they align with the spec label identifiers.

openspec/changes/dns-keepalived-analysis/design.md (1)
107-122: Add a language to the fenced code block.

Line 107 starts a fenced block without a language, which triggers markdown lint warnings.
Suggested fix

````diff
-```
+```text
 Wisconsin (192.168.20.0/24):
 marsAlpha (192.168.20.250) ←→ marsBeta (192.168.20.251)
 VIP: 192.168.20.2
 ...
-```
+```
````
Verify each finding against the current code and only fix it if needed. In `@openspec/changes/dns-keepalived-analysis/design.md` around lines 107 - 122, The fenced code block that begins with the "Wisconsin (192.168.20.0/24):" listing is missing a language tag, causing markdownlint warnings; update the opening fence from ``` to ```text (i.e., add the "text" language identifier to the opening fence for the block containing marsAlpha/marsBeta, terraOmega/terraPhi, and alphaCenA/alphaCenB) and keep the closing fence unchanged so the block is treated as plain text by the linter.

playbooks/dns-restart.yml (1)
25-25: Prefer Ansible privilege escalation over inline `sudo -u`.

Since the play already uses `become: true`, embedding `sudo -u` in commands makes privilege handling harder to audit and maintain. Use task-level `become_user: "{{ pihole_user }}"` instead.

Also applies to: 30-30, 50-50
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@playbooks/dns-restart.yml` at line 25, Replace inline sudo usage in the podman commands with Ansible privilege escalation: remove the "sudo -u {{ pihole_user }}" prefix from the command (e.g., the task using "podman ps --filter name=pihole") and add task-level become_user: "{{ pihole_user }}" (the play already has become: true); apply the same change to the other tasks that use "sudo -u" (the occurrences around the commands at lines referenced: the podman ps and the other podman commands).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (53)
- .gitignore
- AGENTS.md
- callback_plugins/datadog_callback.py
- callback_plugins/datadog_callback.yml
- dns-monitoring/datadog-monitor-dns-down-critical.json
- dns-monitoring/datadog-monitor-dns-down-warning.json
- dns-monitoring/datadog-monitor-dns-external-latency.json
- dns-monitoring/datadog-monitor-dns-latency-warning.json
- dns-monitoring/test_pihole_telegraf.conf
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/.openspec.yaml
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/design.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md
- openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/tasks.md
- openspec/changes/dns-keepalived-analysis/.openspec.yaml
- openspec/changes/dns-keepalived-analysis/design.md
- openspec/changes/dns-keepalived-analysis/proposal.md
- openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md
- openspec/changes/dns-keepalived-analysis/tasks.md
- openspec/changes/dns-server-monitoring-tool/.openspec.yaml
- openspec/changes/dns-server-monitoring-tool/design.md
- openspec/changes/dns-server-monitoring-tool/proposal.md
- openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md
- openspec/changes/dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md
- openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md
- openspec/changes/dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md
- openspec/changes/dns-server-monitoring-tool/tasks.md
- openspec/changes/podman-overlayfs.md
- openspec/config.yaml
- openspec/specs/dns-alerting/spec.md
- openspec/specs/dns-health-monitoring/spec.md
- openspec/specs/dns-restart-playbook/spec.md
- openspec/specs/pihole-healthcheck-fix/spec.md
- playbooks/dns-restart.yml
- playbooks/initial-playbook-stage-5.yml
- pyproject.toml
- retire_forks.sh
- roles/pihole.container
- roles/podmanSetupRaspberryPi
- roles/telegraf-container-service/defaults/main.yml
- roles/telegraf-container-service/tasks/config.yml
- roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2
- roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2
- roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2
- roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2
- tests/vrrp-smoke-test/Makefile
- tests/vrrp-smoke-test/README.md
- tests/vrrp-smoke-test/test-configs-only.sh
- tests/vrrp-smoke-test/test-vrrp-podman.sh
- tests/vrrp-smoke-test/test-vrrp-unshare.sh
```diff
 ---
-api_key: "{{ datadog_logging_key }}"
+#api_key: "{{ datadog_logging_key }}"
+api_key: "c28b08e17123d0da25aaa98126e1562e"
```
Do not commit Datadog API keys in plaintext.
Line 3 exposes a live-looking API key. Rotate it and source via vaulted variable/CI secret injection only.
🔐 Proposed fix
```diff
-  api_key: "c28b08e17123d0da25aaa98126e1562e"
+  api_key: "{{ datadog_logging_key }}"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-api_key: "c28b08e17123d0da25aaa98126e1562e"
+api_key: "{{ datadog_logging_key }}"
```
🧰 Tools
🪛 Gitleaks (8.30.0)
[high] 3-3: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@callback_plugins/datadog_callback.yml` at line 3, The file contains a
plaintext Datadog API key under the api_key entry; remove that literal value
from callback_plugins/datadog_callback.yml, rotate the exposed key in Datadog,
and replace it with a secret reference sourced from a secure store (e.g.,
Ansible Vault, CI secret injection, or an environment variable such as
DATADOG_API_KEY) so the code reads the secret at runtime rather than embedding
it; ensure the deployment/CI configuration is updated to inject the secret and
add the original file to .gitignore or restore from a clean commit to avoid
committing the revoked key.
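The runtime-injection pattern the prompt describes can be sketched as follows; `DATADOG_API_KEY` is an assumed variable name, and the value set here is a stand-in for a secret injected by CI or a vault lookup:

```python
import os

def load_datadog_api_key() -> str:
    # Read the key at runtime instead of committing it to the repo;
    # fail fast with a clear message when the secret is missing.
    key = os.environ.get("DATADOG_API_KEY")
    if not key:
        raise RuntimeError("DATADOG_API_KEY is not set; inject it via CI or a vault")
    return key

os.environ["DATADOG_API_KEY"] = "example-rotated-key"  # stand-in for the injected secret
print(load_datadog_api_key())
```

Failing fast at startup makes a missing secret an obvious deployment error rather than a silent misconfiguration.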
```toml
# Pi-hole v6 authentication
cookie_auth_url = "http://192.168.100.23:1080/api/auth"
cookie_auth_method = "POST"
cookie_auth_body = '{"password":"pihole123abc"}'
```
Remove committed credential and rotate it immediately.
Line 8 hardcodes a real password in source control. Treat it as compromised: rotate it, purge from history if possible, and load secrets from an injected environment/secret store instead of inline config.
🔐 Safer pattern
```diff
-  cookie_auth_body = '{"password":"pihole123abc"}'
+  cookie_auth_body = '{"password":"${PIHOLE_API_PASSWORD}"}'
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-cookie_auth_body = '{"password":"pihole123abc"}'
+cookie_auth_body = '{"password":"${PIHOLE_API_PASSWORD}"}'
```
🧰 Tools
🪛 Gitleaks (8.30.0)
[high] 8-8: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dns-monitoring/test_pihole_telegraf.conf` at line 8, Remove the hardcoded
password assigned to cookie_auth_body, rotate the exposed Pi-hole credential
immediately and purge it from VCS history, and replace the inline secret with a
secret injection (e.g., read from an environment variable or secret store) so
cookie_auth_body is constructed from a runtime secret (e.g.,
TELEGRAF_PIHOLE_PASSWORD) rather than a literal string; ensure deployment/CI
injects that secret and update any README/ops notes to describe rotation and
secret provisioning.
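The safer pattern can be sketched end to end: read the secret from the environment and JSON-encode it rather than splicing it into a string. `TELEGRAF_PIHOLE_PASSWORD` is an assumed variable name, and the value here is a placeholder:

```python
import json
import os

# Stand-in for a secret injected by CI or a secret store.
os.environ["TELEGRAF_PIHOLE_PASSWORD"] = "placeholder-secret"

password = os.environ["TELEGRAF_PIHOLE_PASSWORD"]
# json.dumps handles quoting/escaping, so passwords containing quotes or
# backslashes cannot break the generated config body.
cookie_auth_body = json.dumps({"password": password})

print(cookie_auth_body)
```

A template step like this (or Telegraf's own `${VAR}` environment substitution) keeps the literal credential out of version control entirely.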
```yaml
- name: Restart pihole service
  command: sudo -u {{ pihole_user }} sh -c "podman stop pihole && podman start pihole"
  register: restart_result
  failed_when: restart_result.rc != 0

- name: Restart unbound service
  systemd:
    name: unbound
    state: restarted
  become: true

- name: Verify restart succeeded
  fail:
    msg: "Pihole restart failed with rc={{ restart_result.rc }}"
  when: restart_result.rc not in [0, 3]

- name: Display restart result
  debug:
    msg: "Pihole restarted successfully on {{ inventory_hostname }}"

- name: Collect summary diagnostics on failure
  command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
  register: pihole_diag
  failed_when: false
  when: restart_result is failed
```
🧩 Analysis chain
🌐 Web query:
Ansible task execution behavior after failure without ignore_errors when condition subsequent task
💡 Result:
- Default behavior (no `ignore_errors`): when a task fails on a host, Ansible stops executing any further tasks for that host (the host is effectively "removed" from the rest of the play), while the play may still continue on other hosts. So a subsequent task, even if it has a `when:` condition, will not run for that failed host, because execution never reaches it. [1]
- Implication for `when:` on the next task: `when:` is only evaluated if the task is reached. After an unignored failure, it isn't reached on that host, so the condition doesn't matter. [1]
- If you want "fail, but still decide next steps with `when:`": you must allow execution to continue (for example with `ignore_errors` + `register`, or by using `block` + `rescue`/`always` for structured recovery/cleanup). [1][2]
Sources:
[1] Ansible docs — Error handling in playbooks (default stop-on-failure; ignore_errors; related controls) (docs.ansible.com)
[2] Jeff Geerling — Blocks (block/rescue/always) behavior overview (jeffgeerling.com)
Citations:
- 1: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
- 2: https://www.jeffgeerling.com/blog/new-features-ansible-20-blocks?utm_source=openai
Failure diagnostics are unreachable with current task flow.
At line 32, the task hard-fails on non-zero RC. When a task fails without ignore_errors, Ansible stops executing any further tasks for that host—including line 53—regardless of the when: condition. The diagnostics path never executes, defeating its purpose.
Wrap the restart tasks in a block/rescue to ensure diagnostics always run on failure:
Suggested fix
```diff
-- name: Restart pihole service
-  command: sudo -u {{ pihole_user }} sh -c "podman stop pihole && podman start pihole"
-  register: restart_result
-  failed_when: restart_result.rc != 0
-
-- name: Restart unbound service
-  systemd:
-    name: unbound
-    state: restarted
-  become: true
-
-- name: Verify restart succeeded
-  fail:
-    msg: "Pihole restart failed with rc={{ restart_result.rc }}"
-  when: restart_result.rc not in [0, 3]
-
-- name: Display restart result
-  debug:
-    msg: "Pihole restarted successfully on {{ inventory_hostname }}"
-
-- name: Collect summary diagnostics on failure
-  command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
-  register: pihole_diag
-  failed_when: false
-  when: restart_result is failed
+- name: Restart services and handle failure diagnostics
+  block:
+    - name: Restart pihole service
+      command: sudo -u {{ pihole_user }} podman restart pihole
+      register: restart_result
+
+    - name: Restart unbound service
+      systemd:
+        name: unbound
+        state: restarted
+      become: true
+
+    - name: Verify restart succeeded
+      fail:
+        msg: "Pihole restart failed with rc={{ restart_result.rc }}"
+      when: restart_result.rc not in [0, 3]
+
+    - name: Display restart result
+      debug:
+        msg: "Pihole restarted successfully on {{ inventory_hostname }}"
+  rescue:
+    - name: Collect summary diagnostics on failure
+      command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
+      register: pihole_diag
+      failed_when: false
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- name: Restart services and handle failure diagnostics
  block:
    - name: Restart pihole service
      command: sudo -u {{ pihole_user }} podman restart pihole
      register: restart_result

    - name: Restart unbound service
      systemd:
        name: unbound
        state: restarted
      become: true

    - name: Verify restart succeeded
      fail:
        msg: "Pihole restart failed with rc={{ restart_result.rc }}"
      when: restart_result.rc not in [0, 3]

    - name: Display restart result
      debug:
        msg: "Pihole restarted successfully on {{ inventory_hostname }}"
  rescue:
    - name: Collect summary diagnostics on failure
      command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
      register: pihole_diag
      failed_when: false
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@playbooks/dns-restart.yml` around lines 29 - 53, The current hard-fail
prevents the "Collect summary diagnostics on failure" task from running; wrap
the pihole restart steps in an Ansible block/rescue: put the "Restart pihole
service" (the podman stop/start command that registers restart_result) and the
"Restart unbound service" and "Verify restart succeeded" logic inside a block,
move the diagnostics task "Collect summary diagnostics on failure" into the
rescue section (executing the sudo -u {{ pihole_user }} podman exec pihole
pihole -d -a command and registering pihole_diag), and in the rescue re-run or
call the fail logic to preserve the original failure outcome; remove the
unconditional failed_when that aborts flow and rely on block/rescue to guarantee
diagnostics run when restart_result indicates failure.
```shell
# Script to retire unused forks with no personal commits

echo "Deleting nifi fork..."
gh repo delete NickJLange/nifi --yes

echo "Deleting homebridge forks..."
gh repo delete NickJLange/homebridge-mqttthing --yes
gh repo delete NickJLange/homebridge-wyze-connected-home --yes
gh repo delete NickJLange/homebridge-tasmota --yes
gh repo delete NickJLange/homebridge.container --yes
```
🧩 Analysis chain
🏁 Script executed:

```shell
git ls-files | grep -E '\.sh$' | head -20
```

Repository: NickJLange/mkrasberry
Length of output: 346

🏁 Script executed:

```shell
cat -n retire_forks.sh
```

Length of output: 562
Add safety gates before executing destructive deletes.
Line 2 claims this script only targets "unused forks with no personal commits," but the code provides no mechanism to enforce or verify this precondition. The unconditional gh repo delete ... --yes calls (lines 4–11) will permanently remove repos without confirmation, and missing error handling means failed deletes go silently ignored while "Done." prints anyway (line 13).
Suggested minimal safety guard

```diff
 #!/bin/bash
 # Script to retire unused forks with no personal commits
+read -r -p "This permanently deletes repos. Type DELETE to continue: " confirm
+if [[ "$confirm" != "DELETE" ]]; then
+  echo "Aborted."
+  exit 1
+fi
+
 echo "Deleting nifi fork..."
 gh repo delete NickJLange/nifi --yes
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@retire_forks.sh` around lines 2 - 11, The script retire_forks.sh currently
runs unconditional destructive commands using gh repo delete NickJLange/...
--yes; add safety gates by (1) verifying each repo is a fork and has no personal
commits (compare git log or use gh api to check fork=true and upstream commit
ancestry) before deletion, (2) implement a dry-run mode that prints which repos
would be deleted, (3) prompt for explicit user confirmation per-repo (or an
explicit --force flag) instead of --yes by default, and (4) handle and surface
errors from each gh repo delete call and abort on failure so the final "Done."
is accurate; aim to update the gh repo delete invocations and surrounding logic
(the delete command calls and the script-level control flow) accordingly.
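The prompt's dry-run and fork-verification gates can be sketched as a wrapper around the delete call; the `gh api repos/<repo> --jq .fork` check and the repo names are assumptions for illustration, not verified against this repo:

```shell
#!/bin/bash
# Guarded delete sketch: dry-run by default, per-repo fork check, and error
# propagation so the final "Done." only prints when every delete succeeded.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually delete

is_fork() {
  # Assumed gh behavior: prints "true" for forks.
  gh api "repos/$1" --jq .fork 2>/dev/null | grep -qx true
}

delete_repo() {
  local repo="$1"
  if [ "$DRY_RUN" != "0" ]; then
    echo "Would delete $repo"
    return 0
  fi
  is_fork "$repo" || { echo "Refusing: $repo is not a fork" >&2; return 1; }
  gh repo delete "$repo" --yes
}

delete_repo "NickJLange/nifi"
echo "Done."
```

With `set -euo pipefail`, any failed `gh repo delete` aborts the script before "Done." is reached, which addresses the silent-failure point.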
```diff
@@ -1 +1 @@
-Subproject commit cf84a87c0d561e25387816299aa419cc3327d0b5
+Subproject commit 077f3124fb617fff14a3fb62758d4632f6f0b143
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
# Verify submodule pointer and inspect delta between old/new commits.
OLD="cf84a87c0d561e25387816299aa419cc3327d0b5"
NEW="077f3124fb617fff14a3fb62758d4632f6f0b143"
echo "== Parent repo submodule pointer =="
git ls-tree HEAD roles/podmanSetupRaspberryPi || true
echo
echo "== Ensure submodule is initialized =="
git submodule update --init --recursive roles/podmanSetupRaspberryPi 2>&1 || true
echo
echo "== Submodule commit summary ($OLD..$NEW) =="
git -C roles/podmanSetupRaspberryPi log --oneline --no-merges "${OLD}..${NEW}" 2>&1 || {
  echo "Could not resolve one of the commits in local submodule clone."
}
echo
echo "== Submodule changed files ($OLD..$NEW) =="
git -C roles/podmanSetupRaspberryPi diff --name-status "${OLD}..${NEW}" 2>&1 || {
  echo "Could not compute diff."
}
echo
echo "== Scope check for expected VRRP smoke test paths =="
git -C roles/podmanSetupRaspberryPi diff --name-only "${OLD}..${NEW}" 2>&1 | \
  rg -n "tests/vrrp-smoke-test|keepalived|dns|podman|telegraf|datadog|ansible|README|Makefile" || true
```

Length of output: 1917
🌐 Web query:
NickJLange podmanSetupRaspberryPi GitHub repository commits
💡 Result:
NickJLange/podmanSetupRaspberryPi (branch: master) shows 21 commits on GitHub; the commits visible on the repository’s commit history page are: [1]

- 2023-11-19
  - `2d1efa9` - Merge pull request #9 from NickJLange/november_update
  - `e655aaa` - November Catchup Commit
- 2021-07-18
  - `0ea3203` - Merge pull request #8 from NickJLange/july_updates
  - `02b36b9` - HOMETECH23
- 2020-12-20
  - `549b6ca` - Merge pull request #7 from NickJLange/december_updates
  - `4bacbfe` - December Updates - HOMETECH-18
- 2020-09-13
  - `c41a931` - Merge pull request #5 from NickJLange/sync
  - `cf013fd` - Catch up committ on legacy stuff
- 2020-01-19
  - `7ad2f9b` - Merge pull request #4 from NickJLange/more_fixins
  - `291ed58` - Fixed a bug introduced by linter; updated podman to v1.7.0 and set requirement to buster
  - `683d003` - Merge pull request #3 from NickJLange/incremental_fixins
  - `d77d5c7` - Explicit Env Location
  - `e6d8c4d` - Fixed a boatload of linter warnings.
  - `52ef056` - Devel Packages broken out and docs cleaned up
  - `e6f9f90` - Largely working.
- 2019-09-03
  - `7a34bab` - Merge pull request #1 from NickJLange/NickJLange-README
  - `02b58f6` - Merge pull request #2 from NickJLange/initial_add
  - `0fccfd4` - Adding in license file
  - `c36ee8e` - initial commit of code
  - `d79a00f` - Create README.md
  - `b0958d6` - Initial commit

Latest commit date: November 19, 2023. [1]
Source
- [1] GitHub commits page for `NickJLange/podmanSetupRaspberryPi` (/commits/master/).
Submodule pointer references non-existent commits—blocks repository checkout.
The submodule pointer update targets commits that do not exist in the remote repository. The commit cf84a87c0d561e25387816299aa419cc3327d0b5 is unresolvable (confirmed by remote error "not our ref"), and 077f3124fb617fff14a3fb62758d4632f6f0b143 does not appear in the public commit history of NickJLange/podmanSetupRaspberryPi (which shows 21 total commits, latest from Nov 2023). This will cause submodule initialization to fail for anyone cloning this branch. Verify the commit hashes are correct and reachable in the target submodule repository.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@roles/podmanSetupRaspberryPi` at line 1, The submodule pointer in the branch
references unreachable commits (cf84a87c0d561e25387816299aa419cc3327d0b5 and
077f3124fb617fff14a3fb62758d4632f6f0b143), causing checkout failures; fix by
updating the submodule entry to a valid, reachable commit or the correct branch
tag in the submodule repository (inspect .gitmodules and the gitlink entry for
the submodule and update it using git submodule set-branch / git submodule
update --init --remote or by checking out the correct commit in the submodule
and committing the new gitlink), verify the new commit exists in
NickJLange/podmanSetupRaspberryPi and push the corrected submodule pointer so
clones initialize successfully.
```toml
[[outputs.otlp]]
  ## Datadog OTLP endpoint (US)
  endpoint = "https://otlp.datadoghq.com:443"

  ## Optional: configure headers for authentication
  [outputs.otlp.headers]
    dd-api-key = "{{ datadog_telegraf_api_key }}"
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find Telegraf version/image pinning
fd -e Dockerfile -e 'docker-compose*' -e requirements -e 'Makefile' -e '*.yml' -e '*.yaml' | head -20
```

Length of output: 47

🏁 Script executed:

```shell
# Search for telegraf version references
rg -i 'telegraf.*version|telegraf.*image|FROM.*telegraf' -n
```

Length of output: 47

🏁 Script executed:

```shell
# Look for all OTLP output references
rg '\[\[outputs\.(otlp|opentelemetry)\]\]' -n -A 5
```

Length of output: 1311

🏁 Script executed:

```shell
# Search for dd-api-key and datadog references
rg 'dd-api-key|datadog.*endpoint|otlp\.datadoghq' -n -C 2
```

Length of output: 2289
🌐 Web query:
Telegraf OTLP output plugin official documentation - correct syntax for Datadog
💡 Result:
To send Telegraf metrics via OTLP to Datadog, the “OTLP output plugin” in Telegraf is the OpenTelemetry output plugin ([[outputs.opentelemetry]]), which sends metrics over gRPC and lets you set the target with service_address (default localhost:4317) and optional [outputs.opentelemetry.headers] / [outputs.opentelemetry.attributes]. [1]
Telegraf (telegraf.conf)

```toml
# Send OpenTelemetry metrics over gRPC
[[outputs.opentelemetry]]
  # Datadog Agent OTLP/gRPC receiver (default port 4317)
  service_address = "127.0.0.1:4317"

  # Optional extra resource attributes
  [outputs.opentelemetry.attributes]
    "service.name" = "my-telegraf"

  # Optional extra gRPC metadata headers (usually not needed when sending to the Agent)
  # [outputs.opentelemetry.headers]
  #   key1 = "value1"
```

(Fields shown are from the official Telegraf plugin docs.) [1]

Datadog Agent (datadog.yaml)
Enable the Agent’s OTLP receiver on gRPC 4317 (or HTTP 4318, but Telegraf’s plugin is gRPC-focused):

```yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  logs:
    enabled: false
```

(Example from Datadog’s OTLP ingestion docs; it also documents the default ports 4317 gRPC / 4318 HTTP.) [2]
Sources: Telegraf OpenTelemetry output plugin docs [1]; Datadog “OTLP Ingestion by the Datadog Agent” docs [2].
🌐 Web query:
Telegraf [[outputs.otlp]] headers field name and dd-api-key header support
💡 Result:
- Telegraf doesn’t have `[[outputs.otlp]]` (in current upstream Telegraf). The OTLP/OTel output plugin is `[[outputs.opentelemetry]]`. (docs.influxdata.com)
- In that plugin, the headers/metadata field name is a TOML table: `[outputs.opentelemetry.headers]` (documented as “Additional gRPC request metadata”). (docs.influxdata.com)
- Because it’s just key/value gRPC metadata, you can set (for example): `[outputs.opentelemetry.headers]` with `dd-api-key = "…"`. (docs.influxdata.com)
- Datadog’s OTLP HTTP intake expects `dd-api-key` as an HTTP header (for OTLP/HTTP Protobuf exporters). Telegraf’s `outputs.opentelemetry` is OTLP over gRPC, so this header only helps if the gRPC endpoint you’re sending to actually reads `dd-api-key` from gRPC metadata (Datadog’s documented agentless intake examples are HTTP-header based). (docs.datadoghq.com)
Citations:
- 1: https://docs.influxdata.com/telegraf/v1/output-plugins/?utm_source=openai
- 2: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/
- 3: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/
- 4: https://docs.datadoghq.com/opentelemetry/setup/intake_endpoint/otlp_metrics?utm_source=openai
Replace invalid [[outputs.otlp]] with correct [[outputs.opentelemetry]] plugin and fix field names.
The plugin block [[outputs.otlp]] does not exist in Telegraf. The correct OTLP/OpenTelemetry output plugin is [[outputs.opentelemetry]], which uses service_address (not endpoint). Additionally, Datadog's OTLP HTTP intake expects dd-api-key as an HTTP header, but this plugin sends gRPC metadata—verify this matches Datadog's expected authentication method for your target endpoint.
Corrected configuration

```diff
-[[outputs.otlp]]
+[[outputs.opentelemetry]]
   ## Datadog OTLP endpoint (US)
-  endpoint = "https://otlp.datadoghq.com:443"
+  service_address = "https://otlp.datadoghq.com:443"

   ## Optional: configure headers for authentication
   [outputs.opentelemetry.headers]
     dd-api-key = "{{ datadog_telegraf_api_key }}"
```

Also parameterize the endpoint to support non-US Datadog sites.
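To parameterize for non-US sites, the endpoint could be templated from an inventory variable; `datadog_site` below is a hypothetical variable (defaulting to the US site), not one that exists in the repo today:

```toml
[[outputs.opentelemetry]]
  ## Site-aware endpoint: e.g. set datadog_site to datadoghq.eu for the EU site.
  service_address = "https://otlp.{{ datadog_site | default('datadoghq.com') }}:443"

  [outputs.opentelemetry.headers]
    dd-api-key = "{{ datadog_telegraf_api_key }}"
```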
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2`
around lines 80 - 86, Replace the invalid [[outputs.otlp]] block with the
correct [[outputs.opentelemetry]] plugin, change the endpoint key to
service_address (e.g. service_address = "{{ datadog_otlp_endpoint }}"),
preserve/parameterize the API key variable (datadog_telegraf_api_key) but move
it into the plugin's headers/auth configuration as appropriate, and verify
whether dd-api-key should be sent as an HTTP header vs gRPC metadata for Datadog
(adjust header name/location accordingly); also add a templated
datadog_otlp_endpoint/datadog_site variable so non‑US Datadog sites can be used.
```toml
[[inputs.exec]]
  commands = ["dig +short {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }}#5335 +time=3 +tries=2"]
  name_override = "dns_internal_latency"
  interval = "60s"
  data_format = "value"
  data_type = "integer"
  [inputs.exec.tags]
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and examine the template file
find . -name "dns-monitoring.conf.j2" -type f 2>/dev/null
```

Length of output: 156

🏁 Script executed:

```shell
# Check if the file exists and read the relevant sections
if [ -f "roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2" ]; then
  echo "=== File content (with line numbers) ==="
  cat -n "roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2"
else
  echo "File not found"
fi
```

Length of output: 2341
🌐 Web query:
dig +short output format DNS query latency
💡 Result:
dig +short output format
- `+short` prints only the “useful” result data (record RDATA) and suppresses the usual header/sections/stats. (ucartz.com)
- Typical formatting:
  - A / AAAA: one IP per line
  - CNAME / NS: hostnames (often with a trailing `.`)
  - MX: `preference mailhost` (e.g., `10 mail.example.com.`)
  - TXT: quoted strings
  - Multiple records are printed as multiple lines. (ucartz.com)

DNS query latency (“how long the lookup took”)
- `dig` reports latency as `;; Query time: <N> msec` in the footer of normal output. (support.dnsimple.com)
- Because `+short` hides that footer, use `+stats` (optionally with `+noall`) to get timing without the full answer: `dig example.com +noall +stats | grep "Query time"` (zuzia.app)
- or (full output) `dig example.com | grep "Query time"` (ucartz.com)
Citations:
- 1: https://www.ucartz.com/blog/dig-command-in-linux-and-unix-with-examples/?utm_source=openai
- 2: https://www.ucartz.com/blog/dig-command-in-linux-and-unix-with-examples/?utm_source=openai
- 3: https://support.dnsimple.com/articles/how-dig/?utm_source=openai
- 4: https://zuzia.app/recipes/monitor-dns-query-response-times/?utm_source=openai
- 5: https://www.ucartz.com/blog/dig-command-in-linux-and-unix-with-examples/?utm_source=openai
🌐 Web query:
Telegraf inputs.exec data_type integer parser how it works
💡 Result:
With [[inputs.exec]], Telegraf runs your command, captures stdout, then parses it using whatever data_format you configured. The data_type = "integer" setting only applies when you use the value input data format (data_format = "value"). [1]
What data_type = "integer" actually does (with data_format = "value")
Internally, the value parser:
- Trims surrounding whitespace and any `\x00` bytes.
- If `data_type` is not `"string"`, it splits the entire output on whitespace (spaces, tabs, newlines), keeps only the last token, and ignores the rest. So `echo "foo 10"` → parses `10`; `printf "1\n2\n"` → parses `2`. [2]
- For integer mode, it parses that last token with Go’s `strconv.Atoi` (base-10 integer). If parsing fails (e.g., `10ms`, `1.2`, empty), the parser returns an error and the metric won’t be produced/sent for that gather. [2]

What metric you get
- Measurement name: defaults to the input plugin name (`exec`) unless you set `name_override`. [1]
- Field name: defaults to `value` unless you set `value_field_name`. [1]
- Timestamp: set to “now (UTC)” at parse time. [2]

Minimal working example

```toml
[[inputs.exec]]
  commands = ["sh -c 'echo 123'"]
  data_format = "value"
  data_type = "integer"
  name_override = "my_metric"
  # value_field_name = "value"
```

Produces a single metric like: measurement `my_metric`, field `value=123` (integer). [1][2]
Sources: Telegraf value input format docs (data_type, name_override, value_field_name) [1]; Telegraf value parser implementation (trim, “last token wins”, Atoi, timestamp) [2].
dig +short returns record data (IP addresses), not latency—parsing will fail with data_type = "integer".
The +short flag suppresses the query statistics footer where latency appears (;; Query time: <N> msec). The command will output either an IP address (e.g., 192.168.1.1) or be empty, both of which fail integer parsing. No metrics will be produced.
Use full dig output with +stats and extract the latency value:
Suggested fix
```diff
 [[inputs.exec]]
-  commands = ["dig +short {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }}#5335 +time=3 +tries=2"]
+  commands = ["sh -c 'dig {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }} -p 5335 +time=3 +tries=2 +stats | awk \"/Query time:/ {print \\$4}\"'"]
   name_override = "dns_internal_latency"
```

Also applies to lines 35–39 and 45–50.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2`
around lines 18 - 24, The inputs.exec blocks (name_override
"dns_internal_latency", commands, data_format/data_type) currently use "dig
+short" which returns records not latency; change the command to run dig with
full stats (enable +stats or remove +short) and pipe/parse the output to emit
only the numeric "Query time" value in milliseconds so the inputs.exec returns a
single integer value compatible with data_format="value" and
data_type="integer"; update the commands in the inputs.exec blocks that set
name_override "dns_internal_latency" (and the other similar inputs.exec blocks
referenced in the review) to perform this parse so empty/no-answer cases produce
no metric or an empty output that Telegraf can ignore.
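The awk extraction used in the suggested fix can be sanity-checked offline against a canned stats footer (the sample line below is synthetic, not from a live query):

```shell
# Simulate the footer dig prints without +short, then pull out field 4,
# the numeric latency in milliseconds, exactly as the Telegraf command would.
sample=';; Query time: 23 msec'
latency="$(printf '%s\n' "$sample" | awk '/Query time:/ {print $4}')"
echo "$latency"   # prints 23
```

Telegraf's value parser then receives a bare integer, which `data_type = "integer"` can consume.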
```toml
[[outputs.datadog]]
  apikey = "{{ datadog_telegraf_api_key }}"
  url = "https://otlp.datadoghq.com/api/v1/series"
```
🧩 Analysis chain
🌐 Web query:
Telegraf Datadog OTLP endpoint otlp.datadoghq.com configuration
💡 Result:
Telegraf’s [[outputs.opentelemetry]] plugin exports metrics via OTLP/gRPC (default localhost:4317) and can add gRPC metadata via [outputs.opentelemetry.headers]—it’s not an OTLP/HTTP exporter. (docs.influxdata.com)
Datadog’s otlp.datadoghq.com intake endpoints shown in docs are OTLP/HTTP (for example, metrics go to https://otlp.datadoghq.com/v1/metrics with a dd-api-key header). (docs.datadoghq.com)
So, to use Telegraf with otlp.datadoghq.com, the typical setup is: Telegraf (OTLP/gRPC) → OpenTelemetry Collector (translate) → Datadog OTLP/HTTP intake.
1) Telegraf (telegraf.conf) → send OTLP/gRPC to a local/nearby Collector

```toml
[[outputs.opentelemetry]]
  service_address = "otel-collector:4317"
```

(Telegraf OTLP output is gRPC.) (docs.influxdata.com)

2) OpenTelemetry Collector (config.yaml) → export OTLP/HTTP to Datadog

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:
    metrics_endpoint: https://otlp.datadoghq.com/v1/metrics
    logs_endpoint: https://otlp.datadoghq.com/v1/logs
    headers:
      dd-api-key: ${DD_API_KEY}
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]
```

- Datadog metrics endpoint + `dd-api-key` header: (docs.datadoghq.com)
- Datadog logs use OTLP HTTP with `/v1/logs` and `dd-api-key` header: (docs.datadoghq.com)
Alternative (if you meant “OTLP to the Datadog Agent”, not otlp.datadoghq.com)
Enable the Agent’s OTLP receiver (gRPC :4317 or HTTP :4318) and point Telegraf at the Agent. (docs.datadoghq.com)
Citations:
- 1: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/
- 2: https://docs.datadoghq.com/opentelemetry/setup/intake_endpoint/otlp_metrics
- 3: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/
- 4: https://docs.datadoghq.com/opentelemetry/setup/intake_endpoint/otlp_metrics
- 5: https://docs.datadoghq.com/opentelemetry/setup/intake_endpoint/otlp_logs
- 6: https://docs.datadoghq.com/opentelemetry/setup/otlp_ingest_in_the_agent/?utm_source=openai
🌐 Web query:
Telegraf outputs.datadog plugin OTLP support
💡 Result:
Telegraf’s [[outputs.datadog]] plugin does not support OTLP—it sends Telegraf metrics to Datadog using Datadog’s Metrics API (v1), and the plugin docs explicitly note it “only supports the v1 API.” [1]
If you want OTLP from Telegraf to Datadog, use Telegraf’s OpenTelemetry output instead:
- `[[outputs.opentelemetry]]` sends metrics via OpenTelemetry over gRPC (default `localhost:4317`). [2]
- Point that at a Datadog endpoint that accepts OTLP, typically the Datadog Agent OTLP receiver (listening on `4317` gRPC / `4318` HTTP when enabled) or an OpenTelemetry Collector in front of Datadog. [3]
Sources: Telegraf Datadog output plugin docs [1], Telegraf OpenTelemetry output plugin docs [2], Datadog OTLP receiver docs [3].
🌐 Web query:
Telegraf outputs.opentelemetry plugin Datadog OTLP authentication
💡 Result:
What Datadog expects for OTLP auth
- Datadog’s direct OTLP intake endpoints (agentless) authenticate with an HTTP header like `dd-api-key=<YOUR_DD_API_KEY>` (for example via `OTEL_EXPORTER_OTLP_*_HEADERS`). [1], [2]

What Telegraf [[outputs.opentelemetry]] can send
- Telegraf’s `outputs.opentelemetry` output sends OTLP over gRPC (not OTLP/HTTP). [3]
- It can attach custom gRPC metadata via the plugin’s headers table (this is generic gRPC metadata support; whether the receiver uses it is up to that receiver): [3]

```toml
[outputs.opentelemetry.headers]
  dd-api-key = "YOUR_DD_API_KEY"
```

Practical implication for Datadog
- If you want to send Telegraf metrics to Datadog reliably, the usual setup is:
  - Enable OTLP ingestion on the Datadog Agent (listening on 4317 gRPC and/or 4318 HTTP) [4]
  - Point Telegraf at the Agent:

```toml
[[outputs.opentelemetry]]
  service_address = "datadog-agent:4317"
```

  - No `dd-api-key` is needed between Telegraf and the local Agent; the Agent is what’s configured with `DD_API_KEY` to talk to Datadog. [4]
If you tell me which Datadog target you’re sending to (Datadog Agent, OTel Collector/DDOT, Observability Pipelines Worker, or an “OTLP intake endpoint” URL), I can provide the exact minimal working config for that path.
Sources: [1] Datadog OTLP Logs intake docs, [2] Datadog OTLP Metrics intake docs, [3] Telegraf outputs.opentelemetry docs, [4] Datadog “OTLP Ingestion by the Datadog Agent” docs.
The protocol mismatch concern is valid, but the proposed solution introduces a different protocol mismatch.
[[outputs.datadog]] (Metrics API v1) with an OTLP endpoint is indeed incorrect. However, the proposed [[outputs.opentelemetry]] will not work with otlp.datadoghq.com:443 either: Telegraf's outputs.opentelemetry sends OTLP/gRPC (default localhost:4317), while Datadog's direct OTLP intake (https://otlp.datadoghq.com/v1/metrics) expects OTLP/HTTP.
To send OTLP metrics to Datadog from Telegraf, use one of:
- Telegraf → Datadog Agent (enable OTLP receiver on port 4317) → Datadog, using `[[outputs.opentelemetry]]` with `service_address = "datadog-agent:4317"`
- Telegraf → OpenTelemetry Collector (gRPC receiver) → Datadog OTLP/HTTP intake with `dd-api-key` header → Datadog
- Use `[[outputs.datadog]]` correctly (Datadog Metrics API v1 format only, no OTLP endpoint)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2`
around lines 3 - 5, The current block incorrectly pairs [[outputs.datadog]] with
an OTLP endpoint; change the config so you either keep [[outputs.datadog]] and
point it to the Datadog Metrics API (remove url =
"https://otlp.datadoghq.com/api/v1/series" and use the proper metrics API
endpoint/format with apikey), or replace with [[outputs.opentelemetry]] but
configure it to send OTLP/gRPC to a receiver (e.g., set service_address =
"datadog-agent:4317") or route through an OpenTelemetry Collector (OTLP/gRPC)
that forwards to Datadog OTLP/HTTP with the dd-api-key header; do not leave
[[outputs.datadog]] pointed at otlp.datadoghq.com nor use
[[outputs.opentelemetry]] without a gRPC receiver address.
```toml
[[outputs.otlp]]
  ## Datadog OTLP endpoint (US)
  endpoint = "https://otlp.datadoghq.com:443"

  ## Optional: configure headers for authentication
  [outputs.otlp.headers]
    dd-api-key = "{{ datadog_telegraf_api_key }}"
```
🧩 Analysis chain
🌐 Web query:
Telegraf OTLP output plugin official documentation table name syntax
💡 Result:
Telegraf doesn’t have a separate `outputs.otlp` plugin in the official docs; OTLP export is handled by the OpenTelemetry output plugin (`[[outputs.opentelemetry]]`). (docs.influxdata.com)
In that plugin’s documented InfluxDB line-protocol → OpenTelemetry mapping, the “name” syntax is:
- Default metric name: `"[measurement]_[field key]"`
- Metric labels: Influx line-protocol tags
- Metric value: the field value (cast to float) (docs.influxdata.com)
Citations:
- 1: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/?utm_source=openai
- 2: https://docs.influxdata.com/telegraf/v1/output-plugins/opentelemetry/?utm_source=openai
🏁 Script executed:

```shell
# Search for Telegraf version specification in the codebase
fd -e requirements.txt -e Dockerfile -e "*.yml" -e "*.yaml" | head -20
```

Length of output: 47

🏁 Script executed:

```shell
# Check for version specifications in the repository
rg -i "telegraf" --type-list | head -20
rg -i "telegraf.*version|version.*telegraf" -A 2 -B 2
```

Length of output: 454

🏁 Script executed:

```shell
# Search for Telegraf version in the repository
rg -i "telegraf" --type yaml --type toml --type dockerfile
```

Length of output: 99

🏁 Script executed:

```shell
# Check for any opentelemetry references
rg -i "opentelemetry"
```

Length of output: 3765

🏁 Script executed:

```shell
# Check the telegraf config file itself for other output plugins
cat roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2 | head -100
```

Length of output: 4166
Replace [[outputs.otlp]] with [[outputs.opentelemetry]]—current syntax is invalid.
Per official Telegraf documentation, OTLP export to Datadog requires the [[outputs.opentelemetry]] plugin table name, not [[outputs.otlp]] (which does not exist as a Telegraf output plugin). This configuration will fail at runtime.
Update line 80:

```diff
-[[outputs.otlp]]
+[[outputs.opentelemetry]]
```

Also remove or reorganize the orphaned InfluxDB keys (urls, token, etc.) below—they have no output table header and will be orphaned after correcting the OTLP block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2` around
lines 80 - 87, Replace the invalid Telegraf output table name [[outputs.otlp]]
with the correct [[outputs.opentelemetry]] header so the OTLP/Datadog export
block (the endpoint and [outputs.opentelemetry.headers]/dd-api-key) uses the
supported plugin name; after that, locate the orphaned InfluxDB keys (e.g.,
urls, token, org, bucket) that currently sit without an output table header and
either remove them or move them under the appropriate [[outputs.influxdb_v2]]
(or other intended) output table so they are not left orphaned.
Summary
This PR adds containerized VRRP smoke tests for validating DNS keepalived high availability configurations.
Changes
- `tests/vrrp-smoke-test/test-vrrp-podman.sh` - Containerized failover test with network isolation
- `tests/vrrp-smoke-test/test-vrrp-unshare.sh` - Rootless test using Linux namespaces
- `tests/vrrp-smoke-test/test-configs-only.sh` - Quick syntax validation
- `tests/vrrp-smoke-test/README.md` - Documentation and usage guide
- `tests/vrrp-smoke-test/Makefile` - Convenience commands

Safety Features
- Uses `10.99.x.x/24` subnets (never conflicts with production `192.168.x.x`)

Test Coverage
Testing
✅ Successfully tested on lunarBeacon (192.168.100.112):
Usage
Summary by cubic
Adds isolated VRRP smoke tests to validate Keepalived DNS HA and introduces DNS monitoring/alerting with a restart playbook. Also switches Telegraf to OTLP for Datadog and enforces Podman overlay storage.
New Features
Dependencies
Written for commit 9da0e68. Summary will update on new commits.
Summary by CodeRabbit
New Features
Chores