
Add VRRP Smoke Tests for DNS Keepalived HA #40

Open
NickJLange wants to merge 19 commits into master from feature/vrrp-smoke-tests

Conversation


NickJLange (Owner) commented Mar 2, 2026

Summary

This PR adds containerized VRRP smoke tests for validating DNS keepalived high availability configurations.

Changes

  • tests/vrrp-smoke-test/test-vrrp-podman.sh - Containerized failover test with network isolation
  • tests/vrrp-smoke-test/test-vrrp-unshare.sh - Rootless test using Linux namespaces
  • tests/vrrp-smoke-test/test-configs-only.sh - Quick syntax validation
  • tests/vrrp-smoke-test/README.md - Documentation and usage guide
  • tests/vrrp-smoke-test/Makefile - Convenience commands

Safety Features

  • Isolated Network: Uses randomized 10.99.x.x/24 subnets (never conflicts with production 192.168.x.x)
  • Container Isolation: Podman bridge network with no host network access
  • Minimal Capabilities: Only NET_ADMIN and NET_RAW required
  • Auto-Cleanup: Removes containers, networks, and images on exit
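
A minimal sketch of the isolation pattern above, assuming nothing beyond stock Podman; the container name, image tag, and exact flags are illustrative rather than the script's real invocation:

# Randomized 10.99.x.0/24 subnet, minimal capabilities, cleanup on exit (sketch)
SUBNET="10.99.$(( RANDOM % 254 + 1 )).0/24"
podman network create --subnet "$SUBNET" vrrp-smoke-net
podman run -d --name vrrp-smoke-master \
    --network vrrp-smoke-net \
    --cap-drop ALL --cap-add NET_ADMIN --cap-add NET_RAW \
    localhost/vrrp-test:latest
trap 'podman rm -f vrrp-smoke-master; podman network rm vrrp-smoke-net' EXIT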

Test Coverage

  1. Initial State Validation: Master enters MASTER state, backup enters BACKUP state
  2. Failover Test: Master stopped, backup takes over within 5 seconds
  3. Config Validation: Syntax checking for keepalived configurations
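
The failover check (item 2) reduces to a short polling loop; this sketch reuses the hypothetical container names from the example above and keepalived's usual "Entering MASTER STATE" log line:

# Stop the master, then require the backup to promote within 5 seconds (sketch)
podman stop vrrp-smoke-master
for i in 1 2 3 4 5; do
    if podman logs vrrp-smoke-backup 2>&1 | grep -q "Entering MASTER STATE"; then
        echo "Failover completed within ${i}s"
        break
    fi
    [ "$i" -eq 5 ] && { echo "Failover did not complete within 5s" >&2; exit 1; }
    sleep 1
done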

Testing

✅ Successfully tested on lunarBeacon (192.168.100.112):

  • Network isolation confirmed
  • Failover working correctly
  • No interference with production DNS

Usage

cd tests/vrrp-smoke-test
make quick   # Config syntax check
make test    # Full failover test
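
The individual scripts listed under Changes can also be run directly:

cd tests/vrrp-smoke-test
./test-configs-only.sh     # syntax validation only
./test-vrrp-podman.sh      # containerized failover test
./test-vrrp-unshare.sh     # rootless namespace variant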

Summary by cubic

Adds isolated VRRP smoke tests to validate Keepalived DNS HA and introduces DNS monitoring/alerting with a restart playbook. Also switches Telegraf to OTLP for Datadog and enforces Podman overlay storage.

  • New Features

    • VRRP smoke tests: Podman and unshare scripts with isolated networking, plus Makefile/README; validate config, state, and failover.
    • DNS monitoring and alerting: Telegraf config for Pi-hole metrics and DNS latency; Datadog monitor JSONs; restart playbook for Pi-hole and Unbound with diagnostics.
    • Telemetry: Telegraf output moved to Datadog OTLP intake (headers-based auth).
    • Podman storage: Enforce overlay driver and graphroot; install fuse-overlayfs in stage 5.
    • Specs/docs: OpenSpec artifacts for DNS monitoring tool and Keepalived HA analysis (VIP plan, risks, tasks); minor fixes (Ansible Datadog callback compatibility, .gitignore, submodule updates, retire_forks script).
  • Dependencies

    • Add ansible-core and datadog to pyproject; include uv.lock.

Written for commit 9da0e68. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added DNS server health monitoring with Datadog alerts and PagerDuty integration for critical failures
    • Added automated DNS server restart playbook with diagnostic collection on failures
    • Added DNS high-availability architecture analysis and VRRP implementation framework
    • Added VRRP testing infrastructure for validating failover behavior
  • Chores

    • Updated project dependencies: Ansible and Datadog integrations
    • Updated Telegraf configuration to use OpenTelemetry (OTLP) output format
    • Updated submodule references for container configurations

- Update podmanSetupRaspberryPi submodule to enforce overlay driver in /etc/containers/storage.conf.
- Add fuse-overlayfs to stage 5 package install list.
- Enforce overlay driver configuration in stage 5 playbook.
- Add openspec/changes/podman-overlayfs.md.
- Add Telegraf DNS monitoring config (pihole API + dig latency checks)
- Add Datadog monitor JSON definitions (warning/critical alerts)
- Add DNS restart playbook with diagnostic collection
- Add Podman healthcheck for pihole containers
- Add Unbound access-control for monitoring host

Implements:
- OpenTelemetry-compatible DNS health metrics per server
- Datadog → PagerDuty alerting
- Restart playbook for operational response
- Container healthcheck fix for Podman
Analysis document covering:
- Keepalived/VRRP vs round-robin vs client-side options
- VIP address plan (.2 per subnet)
- Network requirements (Ubiquiti + commodity switches)
- Risks: ARP cache, split-brain, firewall
- Implementation roadmap: network check -> keepalived -> integration -> client migration
- Decision: gradual client migration, no cross-region failover, monitor keepalived state
- test-vrrp-unshare.sh: Full failover test with master/backup simulation
- test-configs-only.sh: Quick syntax validation
- Uses unshare --user for rootless namespace isolation
- Tests VRRP protocol without privileged network access
- Validates config syntax, initial state, and failover behavior
- test-vrrp-podman.sh: Containerized test with isolated bridge network
- Uses randomized 10.99.x.x subnets to avoid production conflicts
- Runs keepalived in containers with NET_ADMIN capabilities
- Automatic cleanup of containers, networks, and images
- Safer alternative to unshare mode for production systems

coderabbitai Bot commented Mar 2, 2026

📝 Walkthrough

This PR introduces comprehensive DNS server monitoring infrastructure, high-availability (HA) configuration analysis, and operational tooling. Changes include Datadog monitoring configurations, Telegraf DNS monitoring integration, an Ansible restart playbook, VRRP smoke tests, OpenSpec planning artifacts, and updated roles with OpenTelemetry metrics collection via OTLP.

Changes

Cohort / File(s) — Summary

• DNS Monitoring Infrastructure (dns-monitoring/datadog-monitor-*.json, dns-monitoring/test_pihole_telegraf.conf, roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2, roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2)
  Introduces Datadog monitors for DNS server health (down/up state, latency) with CRITICAL/WARNING thresholds and PagerDuty escalation. Adds a Telegraf DNS monitoring template for Pi-hole API polling and internal/external DNS latency checks with configurable external domains and OTLP output configuration.

• HA DNS Infrastructure & Analysis (openspec/changes/dns-keepalived-analysis/*, tests/vrrp-smoke-test/*)
  Comprehensive keepalived/VRRP high-availability analysis with design rationale, network topology, failover strategies, and regional VIP planning. Includes the VRRP smoke test suite with Podman and unshare-based integration tests, validation scripts, and test infrastructure (Makefile, README).

• OpenSpec Planning Artifacts (openspec/changes/dns-server-monitoring-tool/*, openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/*, openspec/specs/dns-*/, openspec/config.yaml, openspec/changes/podman-overlayfs.md)
  New spec-driven planning and design documentation for the DNS monitoring tool, HA analysis, and Podman overlay configuration. Includes proposal, design, and detailed specification documents for metrics collection, alerting, the restart playbook, and healthcheck fixes. Establishes the OpenSpec configuration schema.

• Ansible Playbooks & Role Updates (playbooks/dns-restart.yml, playbooks/initial-playbook-stage-5.yml, roles/telegraf-container-service/, roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2)
  Introduces a DNS restart playbook with Pi-hole container restart, service restart, and diagnostic collection on failure. Updates playbooks to install fuse-overlayfs and configure the Podman storage driver. Modifies Telegraf roles to support DNS monitoring with conditional config copying and OTLP output for Datadog integration.

• Testing & Validation Infrastructure (tests/vrrp-smoke-test/test-*.sh, tests/vrrp-smoke-test/Makefile, tests/vrrp-smoke-test/README.md)
  Adds a VRRP integration test suite with three test modes: config validation only, Podman-based containerized testing, and unshare namespace-based testing. Includes failover and recovery validation, network isolation, and diagnostic output. The Makefile provides quick-validation and full-test targets.

• Configuration & Dependencies (.gitignore, pyproject.toml, retire_forks.sh, callback_plugins/datadog_callback.*)
  Adds gitignore entries for AI agent tooling directories. Updates project dependencies to ansible-core 2.20.2+ and datadog 0.52.1+. Introduces a GitHub fork retirement script. Adds a wants_implicit_tasks attribute to the Datadog callback plugin for Ansible 2.16+ compatibility. Updates the Datadog callback config with a hardcoded API key.

• Submodule & Role Reference Updates (roles/pihole.container, roles/podmanSetupRaspberryPi, roles/telegraf-container-service/defaults/main.yml)
  Updates pihole.container and podmanSetupRaspberryPi submodule pointers to newer commits. Adds telegraf role defaults for a DNS monitoring enablement flag and an external domain list for latency testing.

• Documentation Updates (AGENTS.md)
  Replaces the project overview and architecture documentation with Engineering Preferences and Review Guidelines, detailing evaluation frameworks, workflow interactions, and decision points for interactive review processes.

Sequence Diagram(s)

sequenceDiagram
    participant TG as Telegraf
    participant DD as Datadog Agent
    participant DDmon as Datadog Monitors
    participant PD as Pagerduty
    participant Ops as Operations

    TG->>TG: Poll Pi-hole API & measure DNS latency
    TG->>DD: Send metrics via OTLP (dns_up, latency)
    DD->>DDmon: Scrape & aggregate metrics
    DDmon->>DDmon: Evaluate dns_up thresholds<br/>(5min WARNING, 15min CRITICAL)
    alt Server Down
        DDmon->>PD: Send CRITICAL alert
        PD->>Ops: Create incident notification
        Ops->>Ops: Page on-call engineer
    else Server Recovered
        DDmon->>PD: Send recovery notification
        PD->>Ops: Resolve incident
    end
sequenceDiagram
    participant Ops as Operations
    participant AP as Ansible Playbook
    participant SSH as SSH Session
    participant PH as Pi-hole Container
    participant Logs as Diagnostics

    Ops->>AP: Invoke dns-restart playbook
    AP->>SSH: Connect to DNS server
    SSH->>SSH: Escalate to root
    AP->>PH: Stop pihole container
    AP->>PH: Start pihole container
    alt Restart Success
        AP->>AP: Restart unbound service
        AP->>Ops: Display success message
    else Restart Failure
        AP->>PH: Run diagnostic command
        PH->>Logs: Capture diagnostics output
        AP->>Logs: Fetch logs to local machine
        AP->>Ops: Display failure with logs
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This PR spans multiple domains (monitoring infrastructure, HA analysis, automation, testing) with substantial heterogeneous changes: Datadog/OTLP configurations, comprehensive OpenSpec planning documents, Ansible playbooks with conditional logic and error handling, VRRP test scripts with namespace isolation, and role updates. While individual files are relatively straightforward, the cross-cutting nature of changes and density of new infrastructure require detailed review across distinct areas.

Poem

🐰 Whiskers twitching with DNS glee,
Monitoring flows like tea—
Datadog alerts hop and bound,
While keepalived keeps the network round!
HA dreams of failover spins,
This rabbit's telemetry surely wins! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage — ⚠️ Warning. Docstring coverage is 67.86%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

• Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
• Title Check — ✅ Passed. The PR title 'Add VRRP Smoke Tests for DNS Keepalived HA' directly and clearly describes the main change: introducing VRRP smoke tests for keepalived high-availability configuration validation.



cubic-dev-ai Bot (Contributor) left a comment


No issues found across 54 files


coderabbitai Bot left a comment


Actionable comments posted: 9

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2 (1)

92-143: ⚠️ Potential issue | 🔴 Critical

Add [[outputs.influxdb_v2]] table header before the InfluxDB configuration.

The InfluxDB output section (lines 97–143) is missing the required [[outputs.influxdb_v2]] table declaration. Keys like urls, token, organization, influx_uint_support, bucket, namedrop, and namepass are orphaned without proper TOML table context, which will cause the configuration to be misparsed or rejected by Telegraf.

Proposed fix
+[[outputs.influxdb_v2]]
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ##   ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  urls = ["{{ influx_db_url }}"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2` around
lines 92 - 143, The InfluxDB output keys (urls, token, organization,
influx_uint_support, bucket, namedrop, namepass) are missing their TOML table
header; add the table declaration line [[outputs.influxdb_v2]] immediately above
the block that starts with urls = "{{ influx_db_url }}" so these keys are nested
under the outputs.influxdb_v2 table, ensuring the header appears once and that
the existing keys remain unchanged and properly aligned beneath it.
🟠 Major comments (17)
retire_forks.sh-4-13 (1)

4-13: ⚠️ Potential issue | 🟠 Major

Return non-zero when any delete fails; don’t always print success.

Line [13] prints Done. even if a delete command fails. That masks partial failure and makes automation unreliable.

Suggested failure tracking + accurate final status
 #!/bin/bash
 # Script to retire unused forks with no personal commits

+failures=0
+delete_repo() {
+  local repo="$1"
+  if ! gh repo delete "$repo" --yes; then
+    echo "Failed to delete $repo" >&2
+    failures=$((failures + 1))
+  fi
+}
+
 echo "Deleting nifi fork..."
-gh repo delete NickJLange/nifi --yes
+delete_repo NickJLange/nifi

 echo "Deleting homebridge forks..."
-gh repo delete NickJLange/homebridge-mqttthing --yes
-gh repo delete NickJLange/homebridge-wyze-connected-home --yes
-gh repo delete NickJLange/homebridge-tasmota --yes
-gh repo delete NickJLange/homebridge.container --yes
+delete_repo NickJLange/homebridge-mqttthing
+delete_repo NickJLange/homebridge-wyze-connected-home
+delete_repo NickJLange/homebridge-tasmota
+delete_repo NickJLange/homebridge.container

-echo "Done."
+if [[ "$failures" -gt 0 ]]; then
+  echo "Completed with $failures failure(s)." >&2
+  exit 1
+fi
+echo "Done."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@retire_forks.sh` around lines 4 - 13, The script retire_forks.sh currently
prints "Done." regardless of failures; update it to detect failures from each gh
repo delete invocation (the gh repo delete commands) and track a non-zero
failure flag (e.g., a variable like failure=0 set to 1 on any non-zero exit).
After each gh repo delete command check its exit status and set the flag on
error (or use command || failure=1). At the end, if failure is 0 print the
existing "Done." and exit 0, otherwise print a descriptive error summary and
exit with a non-zero status so CI/automation will detect the failure.
playbooks/initial-playbook-stage-5.yml-48-64 (1)

48-64: ⚠️ Potential issue | 🟠 Major

Guard storage backend mutations or skip on already-initialized hosts.

Changing driver/graphroot in storage.conf requires podman system reset to occur first per Podman documentation; if the config is edited before reset, cleanup may fail silently and corrupt existing container/image state. For idempotent playbook re-runs on the same host, add a check: either skip these tasks if Podman storage is already initialized, or explicitly run podman system reset before mutating the config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/initial-playbook-stage-5.yml` around lines 48 - 64, The playbook is
mutating /etc/containers/storage.conf unconditionally which can corrupt existing
Podman state; update the tasks titled "Ensure Podman storage driver is overlay"
and "Ensure Podman graphroot is set" to first guard or reset: add a conditional
check that detects whether Podman storage is already initialized (e.g., test for
existing containers/images or presence of initialized storage under
/var/lib/containers/storage) and skip the ini_file changes if initialized, or
run `podman system reset` (idempotently, with confirmation handled) before
applying the ini_file changes so the config mutation is safe on re-runs. Ensure
the logic is applied to both tasks and that any reset runs only when
safe/explicit to avoid unintended data loss.
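
One way to implement the guard this comment asks for, sketched in shell rather than Ansible; the podman info template path is the documented Store.GraphDriverName field, everything else here is an assumption:

# Skip storage.conf edits when storage is already initialized (sketch)
current_driver=$(podman info --format '{{.Store.GraphDriverName}}' 2>/dev/null)
if [ "$current_driver" = "overlay" ]; then
    echo "overlay already active; storage.conf edits are a no-op"
elif [ -n "$(ls -A /var/lib/containers/storage 2>/dev/null)" ]; then
    echo "storage initialized with driver '$current_driver'; run 'podman system reset' before switching" >&2
    exit 1
fi
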
tests/vrrp-smoke-test/test-vrrp-unshare.sh-243-248 (1)

243-248: ⚠️ Potential issue | 🟠 Major

Recovery assertion is currently a historical log check, not a stability check.

Line 243 greps the whole backup log for Entering MASTER STATE; once failover happened, this stays true even if the node later regresses. This can mask recovery regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/test-vrrp-unshare.sh` around lines 243 - 248, The
current check greps the entire "$TEST_DIR/tmp/backup.log" for "Entering MASTER
STATE", which can be true even if the node later regressed; change the assertion
to verify that the most recent state transition is MASTER instead of any
historical occurrence. Update the check around the grep block to examine only
the tail of the log (e.g., tail -n <lines> or parse the last matching line) and
confirm that the last "Entering ..." entry is "Entering MASTER STATE" and that
no later "Entering BACKUP STATE" or "Leaving MASTER STATE" exists; keep using
the same helpers (log_success/log_error) and operate on
"$TEST_DIR/tmp/backup.log" so the test fails if the node regresses after
failover.
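
A tail-based variant of that assertion, reusing the log path and helper names cited above (a sketch, not the test's actual code):

# Assert the *latest* VRRP transition is MASTER, not any historical one (sketch)
last_transition=$(grep -E "Entering (MASTER|BACKUP) STATE" "$TEST_DIR/tmp/backup.log" | tail -1)
if echo "$last_transition" | grep -q "Entering MASTER STATE"; then
    log_success "Backup is currently MASTER"
else
    log_error "Backup regressed or never promoted (last transition: $last_transition)"
fi
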
tests/vrrp-smoke-test/test-vrrp-podman.sh-149-167 (1)

149-167: ⚠️ Potential issue | 🟠 Major

Container selection/cleanup is nondeterministic and can impact unrelated resources.

Using ancestor=... | head -1/tail -1 (Lines 177–179, 208–210) can pick the wrong container. Cleanup by ancestor and a shared image tag (localhost/vrrp-test:latest) can also stop/remove containers or images outside this specific run.

Suggested scoping fix
+IMAGE_TAG="localhost/vrrp-test:${NETWORK_NAME}"
+MASTER_CONTAINER="vrrp-master-${NETWORK_NAME}"
+BACKUP_CONTAINER="vrrp-backup-${NETWORK_NAME}"

-    podman build -t localhost/vrrp-test:latest -f "$TEST_DIR/tmp/Containerfile" "$TEST_DIR/tmp"
+    podman build -t "$IMAGE_TAG" -f "$TEST_DIR/tmp/Containerfile" "$TEST_DIR/tmp"

-    podman run -d --name "vrrp-master-${RANDOM}" \
+    podman run -d --name "$MASTER_CONTAINER" \
         ...
-        localhost/vrrp-test:latest \
+        "$IMAGE_TAG" \
         -f /etc/keepalived/keepalived.conf -n -l -D

-    podman run -d --name "vrrp-backup-${RANDOM}" \
+    podman run -d --name "$BACKUP_CONTAINER" \
         ...
-        localhost/vrrp-test:latest \
+        "$IMAGE_TAG" \
         -f /etc/keepalived/keepalived.conf -n -l -D

-    MASTER_CONTAINER=$(podman ps --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}" | head -1)
-    BACKUP_CONTAINER=$(podman ps --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}" | tail -1)
+    # Reuse known names set during run_containers()

-    for container in $(podman ps -a --filter "ancestor=localhost/vrrp-test:latest" --format "{{.Names}}"); do
-        podman stop "$container" 2>/dev/null || true
-        podman rm "$container" 2>/dev/null || true
-    done
+    podman stop "$MASTER_CONTAINER" 2>/dev/null || true
+    podman stop "$BACKUP_CONTAINER" 2>/dev/null || true
+    podman rm "$MASTER_CONTAINER" 2>/dev/null || true
+    podman rm "$BACKUP_CONTAINER" 2>/dev/null || true

-    podman rmi localhost/vrrp-test:latest 2>/dev/null || true
+    podman rmi "$IMAGE_TAG" 2>/dev/null || true

Also applies to: 177-179, 208-210, 234-247

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/test-vrrp-podman.sh` around lines 149 - 167, The
cleanup is nondeterministic because the script selects containers/images by
ancestor or shared tag (localhost/vrrp-test:latest) and uses head/tail, which
can target unrelated runs; modify podman invocation to assign and reuse
deterministic identifiers (store the generated container names from --name like
"vrrp-master-${RANDOM}" and "vrrp-backup-${RANDOM}" into variables, or add a
unique --label) and capture the container IDs returned by podman run, then use
those exact IDs/names or labels for all later selection and removal (instead of
ancestor=... | head/tail or relying on the image tag); update any cleanup
commands that reference ancestor or the image (where ancestor=..., head/tail are
used) to use the stored variables/labels or the captured container IDs so only
the intended containers/images for this test are stopped/removed.
tests/vrrp-smoke-test/Makefile-11-13 (1)

11-13: ⚠️ Potential issue | 🟠 Major

make test currently defaults to the less-isolated path.

Line 12 runs test-vrrp-unshare.sh; that makes the default test entrypoint less safe than the Podman mode described as recommended elsewhere in this PR. This can cause accidental host-namespace testing on production-adjacent systems.

Suggested target split
 test:
	@echo "Running full VRRP test suite..."
-	@./test-vrrp-unshare.sh
+	@./test-vrrp-podman.sh
+
+test-unshare:
+	@echo "Running rootless unshare VRRP test..."
+	@./test-vrrp-unshare.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/Makefile` around lines 11 - 13, The current Makefile
test target invokes test-vrrp-unshare.sh (the less-isolated path); change the
Makefile so the primary "test" target runs the safer Podman-based script instead
and add separate targets for the two modes: keep "test-vrrp-unshare.sh" as a
"test-unshare" target and add a "test-podman" target that runs the Podman
entrypoint (e.g., test-vrrp-podman.sh), then make the top-level "test" depend on
or call the "test-podman" target so Podman becomes the default entrypoint while
preserving the unshare target for explicit use.
tests/vrrp-smoke-test/test-configs-only.sh-13-22 (1)

13-22: ⚠️ Potential issue | 🟠 Major

This script does not validate configuration syntax yet.

Lines 13–22 only verify the binary/version, so broken keepalived configs still pass the quick check. That defeats the stated purpose of config validation.

Suggested implementation direction
-# Would validate actual configs if they exist
-# For now, just check keepalived binary works
-if [ -x "$KEEPALIVED_BIN" ]; then
-    echo "  keepalived binary: OK ($KEEPALIVED_BIN)"
-    echo "  Version: $($KEEPALIVED_BIN -v 2>&1 | head -1)"
-else
-    echo "  ERROR: keepalived not found"
-    exit 1
-fi
+if [ ! -x "$KEEPALIVED_BIN" ]; then
+    echo "  ERROR: keepalived not found"
+    exit 1
+fi
+
+echo "  keepalived binary: OK ($KEEPALIVED_BIN)"
+echo "  Version: $($KEEPALIVED_BIN -v 2>&1 | head -1)"
+
+CONFIG_DIR="${CONFIG_DIR:-$PWD/tmp}"
+shopt -s nullglob
+configs=("$CONFIG_DIR"/*.conf)
+if [ ${#configs[@]} -eq 0 ]; then
+    echo "  ERROR: no .conf files found in $CONFIG_DIR"
+    exit 1
+fi
+for conf in "${configs[@]}"; do
+    "$KEEPALIVED_BIN" -t -f "$conf" >/dev/null
+    echo "  OK: $(basename "$conf")"
+done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/test-configs-only.sh` around lines 13 - 22, The script
currently only checks that $KEEPALIVED_BIN exists but doesn't validate config
syntax; update the block that checks KEEPALIVED_BIN to also run the keepalived
config test (invoke "$KEEPALIVED_BIN -t" or the appropriate syntax-check flag)
and capture its output and exit status; if the syntax check fails, print the
test output via echo/process logger and exit non-zero, otherwise print a success
message (reference the existing KEEPALIVED_BIN variable and the if/.../else/done
block to locate where to add the test).
tests/vrrp-smoke-test/test-vrrp-podman.sh-53-57 (1)

53-57: ⚠️ Potential issue | 🟠 Major

Bridge network lacks --internal flag, contradicting stated isolation guarantees.

The file header declares the test "safe to run on production servers" with an "isolated bridge network" (line 4-5), and line 275 reports "Network isolation: Podman bridge network" in the test output. However, the bridge network created at lines 53-57 omits the --internal flag. Without it, containers retain outbound routing to external networks via NAT masquerading, weakening the isolation guarantee claimed in the file's documentation.

To align with the stated safety claim, either add --internal to the network creation or update the documentation to clarify that isolation applies only to inbound access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/test-vrrp-podman.sh` around lines 53 - 57, The Podman
bridge network created by the podman network create command (using variables
SUBNET, GATEWAY, NETWORK_NAME) lacks the --internal flag, so add --internal to
that command to prevent outbound NAT and enforce isolation; update any comments
if you intentionally want outbound access and run tests to ensure other steps
expecting external connectivity are adjusted accordingly.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md-14-24 (1)

14-24: ⚠️ Potential issue | 🟠 Major

Replace httpGet wording with Podman command-based healthcheck semantics.

Lines 15 and 22 currently describe a native httpGet method, which is not a Podman healthcheck type. Podman supports only CMD and CMD-SHELL healthchecks; HTTP validation must be implemented by executing a command (e.g., curl) inside the container.

Proposed spec wording update
-### Requirement: Healthcheck uses appropriate check method
-The healthcheck SHALL use either exec or httpGet method that works reliably with Podman.
+### Requirement: Healthcheck uses a Podman-compatible command method
+The healthcheck SHALL use a command-based method supported by Podman. HTTP validation MAY be implemented by executing curl/wget within that command.

-#### Scenario: HTTP method works
-- **WHEN** healthcheck uses httpGet to ping Pi-hole web interface
-- **THEN** the HTTP request returns 200 OK within the timeout period
+#### Scenario: HTTP probe command works
+- **WHEN** healthcheck executes an HTTP probe command inside the container
+- **THEN** the command returns exit code 0 within the timeout period
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md`
around lines 14 - 24, The spec incorrectly refers to an `httpGet` healthcheck;
update the requirement and scenarios to use Podman-compatible command-based
healthcheck semantics (i.e., `CMD`/`CMD-SHELL`) instead of `httpGet`. Replace
mentions of `httpGet` in the "Requirement: Healthcheck uses appropriate check
method" and the "HTTP method works" scenario with a description that the
healthcheck should run an in-container command (for example `curl` or
equivalent) via `exec`/`CMD`/`CMD-SHELL` and assert it returns success (exit
code 0) or HTTP 200 within the timeout. Ensure `exec` and `CMD`/`CMD-SHELL` are
presented as the mutually exclusive supported methods and reference the
healthcheck command semantics when describing expected outcomes.
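
For reference, a command-based healthcheck along the lines the spec should describe; the probe URL, intervals, and container name are placeholders:

# Podman CMD-SHELL-style healthcheck using curl inside the container (sketch)
podman run -d --name pihole \
    --health-cmd 'curl -f http://localhost/admin/ || exit 1' \
    --health-interval 30s --health-timeout 5s --health-retries 3 \
    docker.io/pihole/pihole:latest
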
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md-39-46 (1)

39-46: ⚠️ Potential issue | 🟠 Major

Add diagnostics data-handling controls (redaction, permissions, retention).

pihole -d output can include sensitive host/network details; the spec should require restricted log permissions and retention/redaction rules.

Suggested spec refinement
 #### Scenario: Diagnostics saved
 - **WHEN** diagnostics are collected
 - **THEN** output is saved to a timestamped file in the playbook logs directory
+- **AND** logs are written with restricted permissions (e.g., owner-read/write only)
+- **AND** sensitive tokens/credentials are redacted
+- **AND** retention period is defined and enforced
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md`
around lines 39 - 46, Update the "Diagnostics collected" and "Diagnostics saved"
scenarios to mandate data-handling controls for artifacts produced by running
`pihole -d` and any other diagnostics: require that the playbook redact or
filter sensitive fields (hostnames, IPs, MACs, tokens) before storage, enforce
restrictive file permissions (e.g., owner-only read/write) for files written to
the playbook logs directory, and specify a retention/rotation policy (e.g., TTL
or max versions) and audit/logging for access; reference the `pihole -d` command
and the "playbook logs directory" in the spec text so implementers know where to
apply redaction, permissions, and retention controls and include a short note
that redaction rules and retention settings must be configurable.
openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md-61-65 (1)

61-65: ⚠️ Potential issue | 🟠 Major

Firewall mitigation is contradictory with the unicast fallback scenario.

This section requires multicast, but the spec also defines unicast mode when multicast is blocked. The mitigation should be mode-aware to avoid incorrect firewall policy.

Suggested doc fix
-- **MITIGATION**: Allow protocol 112 and multicast
+- **MITIGATION**: Allow VRRP protocol 112 between peers; if multicast mode is used, allow 224.0.0.18, otherwise permit configured unicast peer traffic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md`
around lines 61 - 65, The "Firewall blocks VRRP" scenario's MITIGATION is
contradictory: update the MITIGATION text under "Scenario: Firewall blocks VRRP"
to be mode-aware by describing two distinct mitigation actions — for multicast
mode allow protocol 112 and multicast, and for environments using the documented
unicast fallback allow the specific unicast ports/addresses used by the unicast
VRRP implementation (or recommend allowing host-to-host VRRP unicasts) — and
reference the unicast fallback behavior so readers know which mitigation
applies.
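
Concretely, a mode-aware rule pair might look like this (iptables shown; the unicast peer address is a placeholder):

# Multicast mode: VRRP advertisements go to 224.0.0.18 over IP protocol 112
iptables -A INPUT -p 112 -d 224.0.0.18 -j ACCEPT
# Unicast fallback: allow protocol 112 from the configured peer instead
iptables -A INPUT -p 112 -s 10.0.0.3 -j ACCEPT
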
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md-14-24 (1)

14-24: ⚠️ Potential issue | 🟠 Major

Avoid requiring blanket root escalation; specify least-privilege command execution.

Requiring a root shell for the whole flow broadens blast radius. The requirement should prefer command-scoped sudo/become for only required operations.

Suggested spec refinement
-### Requirement: Privilege escalation to root
-The playbook SHALL escalate to root privileges after SSH connection.
+### Requirement: Least-privilege privilege escalation
+The playbook SHALL use least-privilege escalation only for commands that require it.
@@
-- **WHEN** sudo or su is executed
-- **THEN** the shell runs as root user
+- **WHEN** a privileged operation is required
+- **THEN** the playbook uses scoped sudo/become for that command only

Based on learnings: For SSH Lockdown & User Access (HOMETECH-17), lock down SSH ports, validate local accounts, and strictly control user access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md`
around lines 14 - 24, The spec currently requires blanket root escalation
("Privilege escalation to root" / "playbook SHALL escalate to root privileges
after SSH connection"); change this to require least-privilege, command-scoped
elevation (use sudo/become only for specific tasks) and update the two scenarios
("Escalation successful" and "Escalation denied") to reflect command-scoped
behavior and failure modes (e.g., "WHEN a required command is executed with
sudo/become" / "THEN that command runs with elevated privileges" and "WHEN a
required command's escalation is denied" / "THEN the playbook aborts that task
and logs error without opening a persistent root shell"). Ensure the spec text
mentions restricting persistent root shells and references SSH lockdown and user
access validation as controls.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md-47-52 (1)

47-52: ⚠️ Potential issue | 🟠 Major

Current idempotency requirement conflicts with unconditional restart behavior.

A restart operation is inherently disruptive if always executed. Define health-check gating so repeated runs do not bounce healthy DNS services.

Suggested spec refinement
-### Requirement: Playbook idempotency
-The playbook SHALL be idempotent - running it multiple times produces the same result.
+### Requirement: Health-gated restart behavior
+The playbook SHALL restart DNS only when health checks fail or explicit force-restart is requested.
@@
-- **WHEN** the playbook is run multiple times on a healthy server
-- **THEN** the server remains healthy and no errors are reported
+- **WHEN** the playbook is run multiple times on a healthy server
+- **THEN** restart is skipped and the server remains healthy with no service interruption
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md`
around lines 47 - 52, The idempotency requirement conflicts with an
unconditional restart; update the playbook spec to gate restart actions with a
health check so healthy DNS services are not restarted on subsequent runs:
modify the "Requirement: Playbook idempotency" and the "Scenario: Multiple runs"
sections to state that the playbook must perform a health check (e.g.,
"is_dns_healthy" predicate) before issuing a restart, and only execute restart
steps when the health check fails or when a recovery threshold is met; reference
the health-check gating behavior in the playbook's restart step descriptions so
the spec explicitly prevents bouncing healthy services.
openspec/specs/pihole-healthcheck-fix/spec.md-14-23 (1)

14-23: ⚠️ Potential issue | 🟠 Major

Remove or clarify "httpGet" as a Podman healthcheck method.

Podman does not support httpGet as a native healthcheck method. Healthchecks in Podman are always command-based; for HTTP health checks, use exec with command-line tools like curl (e.g., curl -f http://localhost:80/). If this spec intends to cover Kubernetes YAML used with podman kube play, clarify that httpGet probes are converted to exec commands and require curl/wget inside the container.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/specs/pihole-healthcheck-fix/spec.md` around lines 14 - 23, The spec
currently states healthchecks may use "httpGet" or "exec", but Podman only
supports command-based checks; update the "Requirement: Healthcheck uses
appropriate check method" section to remove or clearly qualify "httpGet" by
stating Podman requires exec-based checks (e.g., exec using curl/wget like `curl
-f http://localhost:80/`) and, if the intent is to describe Kubernetes manifests
used with podman kube play, add a note that Kubernetes httpGet probes are
translated into exec commands and therefore require curl/wget inside the
container; reference the terms "httpGet", "exec", and "podman kube play" when
making the clarification.
openspec/specs/dns-health-monitoring/spec.md-4-13 (1)

4-13: ⚠️ Potential issue | 🟠 Major

Resolve server identifier mismatch in the requirement text.

Line 4 uses new york, but Line 12 constrains labels to new_york. Keep one canonical identifier to prevent implementation drift.

Suggested fix
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miyami) using OpenTelemetry-compatible format.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/specs/dns-health-monitoring/spec.md` around lines 4 - 13, The spec
uses inconsistent server identifiers: "new york" in the requirement line and
"new_york" in the `server` label constraint; update the spec to use one
canonical identifier everywhere (choose and replace either "new_york" or "new
york") so the `server` label values referenced alongside metrics like
`dns_query_total`, `dns_up`, `dns_response_time_ms`, and
`dns_upstream_response_time_ms` match exactly; ensure you update both the
requirements paragraph and the Scenario: Metrics include server identity line to
the same identifier and run a quick search for any other occurrences to keep
naming consistent.
openspec/changes/dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md-4-13 (1)

4-13: ⚠️ Potential issue | 🟠 Major

Standardize server identifiers to a single canonical format.

Line 4 uses new york, while Line 12 requires new_york. This ambiguity can cause metric label/query mismatches.

Suggested fix
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miyami) using OpenTelemetry-compatible format.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md`
around lines 4 - 13, The spec mixes server identifier formats ("new york" vs
"new_york"); update the spec so the canonical identifier is "new_york"
everywhere: replace the server name in the server list and in any metric
examples/labels (references to the DNS servers: "all, new york, miyami") and
ensure the `server` label examples/requirements (mentions of all, new_york, or
miyami and metrics `dns_query_total`, `dns_up`, `dns_response_time_ms`,
`dns_upstream_response_time_ms`) use the exact canonical token "new_york" so
labels and queries match.
dns-monitoring/datadog-monitor-dns-latency-warning.json-4-5 (1)

4-5: ⚠️ Potential issue | 🟠 Major

Add server grouping to query and align threshold with critical setting.

Line 4 query uses > 100 while the critical threshold is set to 500 (Line 17), and Line 5 references {{server.name}} without by {server} grouping in the query. Missing grouping means the template variable won't resolve per-server context, and the threshold mismatch creates ambiguous alert behavior.

Suggested fix
-  "query": "max(last_5m):max:dns_internal_latency{*} > 100",
+  "query": "max(last_5m):max:dns_internal_latency{*} by {server} > 500",
   "message": "DNS latency on {{server.name}} is above 100ms for 5 minutes.\n\n@slack-ops",

Note: This changes the alert threshold from 100 to 500 (critical level). Verify this aligns with intended alert behavior.

Also applies to: 16-18

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-latency-warning.json` around lines 4 - 5,
Update the "query" field to include per-server grouping and align the threshold
with the critical setting: change the expression in the query (the "query" JSON
key) from using > 100 to > 500 and add "by {server}" (e.g.
max(last_5m):max:dns_internal_latency{*} by {server} > 500) so the template
variable {{server.name}} in the "message" resolves per-server and the threshold
matches the existing "critical" value.
roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2-4-8 (1)

4-8: ⚠️ Potential issue | 🟠 Major

Add explicit JSON parsing for Pi-hole API HTTP input.

The Pi-hole endpoint returns JSON, but without explicit data_format configuration, Telegraf defaults to InfluxDB line protocol parsing, causing the JSON response to fail parsing and metrics extraction to be incomplete.

Suggested fix
 [[inputs.http]]
   urls = ["http://{{ hostvars[host].ansible_host }}/admin/api.php?summaryRaw"]
   name_override = "pihole"
   interval = "60s"
   timeout = "10s"
+  data_format = "json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2`
around lines 4 - 8, The Telegraf HTTP input for the Pi-hole endpoint (the
[[inputs.http]] block with name_override = "pihole") lacks an explicit
data_format, so Telegraf tries to parse the response as Influx line protocol;
add data_format = "json" to that inputs.http stanza (and if you only need
specific fields, add json_query or json_string_fields as appropriate) so the
Pi-hole JSON payload is parsed correctly and metrics are extracted.
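
For context, the summaryRaw endpoint returns a flat JSON object, which is why the parser choice matters; fields abridged and values illustrative:

curl -s "http://pi.hole/admin/api.php?summaryRaw"
# {"domains_being_blocked":150000,"dns_queries_today":12345,"ads_blocked_today":678,"status":"enabled"}
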
🟡 Minor comments (11)
openspec/changes/dns-server-monitoring-tool/tasks.md-8-8 (1)

8-8: ⚠️ Potential issue | 🟡 Minor

Fix naming/spelling to avoid operational confusion.

Line 8 and Line 25 use miyami (likely miami), and Line 42 uses pihole while the rest of docs usually use Pi-hole. Standardizing these labels helps avoid mismatch across inventory/specs.

Also applies to: 25-25, 42-42

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/tasks.md` at line 8, Update the
server label names to correct spelling and standard casing: change any
occurrence of "miyami" to "miami" and normalize "pihole" to "Pi-hole" (e.g., in
the task description "1.6 Configure server-specific labels (all, new_york,
miyami)" and other occurrences). Ensure all references across the document use
"miami" and "Pi-hole" consistently to avoid inventory/spec mismatches.
openspec/changes/dns-server-monitoring-tool/proposal.md-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Confirm server name spelling consistency (miyami vs miami).

Line 3 and Line 15 use miyami; if this is not the canonical host identifier, correct it to avoid confusion across specs/playbooks.

Also applies to: 15-15

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/proposal.md` at line 3, The
document uses the host name "miyami" in multiple places (e.g., the opening
paragraph and line ~15); verify the canonical server identifier and standardize
all occurrences to the correct spelling (either "miami" or "miyami") across the
proposal text and any referenced specs/playbooks, updating every instance to
match and running a global search/replace to avoid inconsistent hostnames.
openspec/changes/dns-keepalived-analysis/tasks.md-4-4 (1)

4-4: ⚠️ Potential issue | 🟡 Minor

Correct vendor spelling for consistency.

Use Ubiquiti instead of Ubiquity to avoid confusion in follow-up research/tasks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/tasks.md` at line 4, Update the
checklist item text that currently reads "Verify Ubiquity switch allows VRRP
multicast (224.0.0.18)" to use the correct vendor name "Ubiquiti" so it reads
"Verify Ubiquiti switch allows VRRP multicast (224.0.0.18)"; locate and edit the
string in the tasks.md checklist entry (the line beginning with "- [ ] 1.2
Verify Ubiquity switch...") and replace "Ubiquity" with "Ubiquiti".
openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md-71-71 (1)

71-71: ⚠️ Potential issue | 🟡 Minor

Fix vendor spelling (Ubiquiti).

This keeps network task references consistent across docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md` at
line 71, Update the vendor spelling from "Ubiquity" to the correct "Ubiquiti" in
the spec text — specifically change the phrase "verify network equipment
(Ubiquity + commodity switches) allows VRRP" to "verify network equipment
(Ubiquiti + commodity switches) allows VRRP"; also scan the document for any
other occurrences of "Ubiquity" and replace them to keep network task references
consistent.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Fix inconsistent region naming to avoid ops targeting mistakes.

miyami appears inconsistent with the rest of the DNS region naming and can cause confusion in host/group mapping across docs and tooling.

Suggested doc fix
-The home DNS infrastructure (Pi-hole + Unbound on multiple servers: all, new york, miyami) is experiencing flaky behavior.
+The home DNS infrastructure (Pi-hole + Unbound on multiple servers: all, new york, miami) is experiencing flaky behavior.

-- `dns-health-monitoring`: OpenTelemetry-compatible DNS health metrics collection at individual server level (all, new york, miyami).
+- `dns-health-monitoring`: OpenTelemetry-compatible DNS health metrics collection at individual server level (all, new york, miami).

Also applies to: 15-15

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md`
at line 3, The region name "miyami" is a typo and risks misrouting ops; find and
replace the incorrect "miyami" with the correct "miami" wherever it appears
(e.g., the region list "all, new york, miyami" in proposal.md and any host/group
mapping, inventories, or tooling that refer to that region), then verify
host/group names, documentation headings, and any automation (healthchecks,
restart scripts, Ansible/Podman labels) still align with the corrected "miami"
identifier so mappings and tooling continue to work consistently.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md-30-30 (1)

30-30: ⚠️ Potential issue | 🟡 Minor

Repository path in this requirement appears stale.

The monitor definitions in this PR are under dns-monitoring/, so pointing to monitoring/datadog/ can misdirect reviewers/operators.

Suggested doc fix
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md`
at line 30, Update the stale repository path in the requirement text: locate the
sentence starting with "**THEN** they can view the JSON/YAML definition in the
repository under monitoring/datadog/" and change the path segment
"monitoring/datadog/" to the correct "dns-monitoring/" so the sentence reads
that the JSON/YAML definition is under dns-monitoring/.
dns-monitoring/datadog-monitor-dns-down-critical.json-2-2 (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Fix monitor title typo (PageDuty → PagerDuty).

Line 2 has a naming typo that will make searching/filtering monitor definitions harder.

Suggested fix
-  "name": "DNS Server Down - CRITICAL - PageDuty",
+  "name": "DNS Server Down - CRITICAL - PagerDuty",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/datadog-monitor-dns-down-critical.json` at line 2, The monitor
name string "DNS Server Down - CRITICAL - PageDuty" contains a typo; update the
value used in the "name" field (the monitor title) to "DNS Server Down -
CRITICAL - PagerDuty" and search for any other occurrences of "PageDuty" in the
repository to replace with "PagerDuty" so naming/searching stays consistent.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md-4-12 (1)

4-12: ⚠️ Potential issue | 🟡 Minor

Normalize server identifiers and fix typo (miyami → miami).

Line 4 and Line 12 use inconsistent naming (new york vs new_york) and misspell miami. Keep identifiers consistent and correctly spelled across scenarios.

Suggested fix
-The system SHALL collect DNS health metrics from each Pi-hole server (all, new york, miyami) using OpenTelemetry-compatible format.
+The system SHALL collect DNS health metrics from each Pi-hole server (all, new_york, miami) using OpenTelemetry-compatible format.
...
-- **THEN** each metric includes a `server` label with value matching the server identifier (all, new_york, or miyami)
+- **THEN** each metric includes a `server` label with value matching the server identifier (all, new_york, or miami)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md`
around lines 4 - 12, Normalize the server identifiers and correct the typo by
replacing occurrences of "miyami" with "miami" and unifying spacing/format to
use "new_york" (underscore) instead of "new york"; update both the metrics
scenario description that lists servers and the expected `server` label values
so all references use the exact identifiers "all", "new_york", and "miami"
(ensure strings like `miyami`, `new york`, and any inconsistent `new_york`/`new
york` usages are corrected to the canonical forms).
openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md-30-30 (1)

30-30: ⚠️ Potential issue | 🟡 Minor

Repository path in spec does not match this PR layout.

Line 30 points to monitoring/datadog/, but monitors in this PR are under dns-monitoring/. Update the spec path to avoid confusion.

Suggested fix
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md` at
line 30, Update the repository path in the spec to match this PR's layout:
replace the string "monitoring/datadog/" referenced in
specs/dns-alerting/spec.md with "dns-monitoring/" so the sentence reads that the
JSON/YAML definition is in the repository under dns-monitoring/; ensure the file
spec.md and the exact occurrence of "monitoring/datadog/" are updated and any
other mentions in the same spec are adjusted accordingly.
openspec/specs/dns-alerting/spec.md-26-30 (1)

26-30: ⚠️ Potential issue | 🟡 Minor

Align monitor-definition path with actual repository location.

Line 30 points to monitoring/datadog/, but current monitor configs are in dns-monitoring/. This can misdirect maintenance/review.

Suggested fix
-- **THEN** they can view the JSON/YAML definition in the repository under monitoring/datadog/
+- **THEN** they can view the JSON/YAML definition in the repository under dns-monitoring/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/specs/dns-alerting/spec.md` around lines 26 - 30, Update the
scenario text that references the monitor definition location: replace the path
string "monitoring/datadog/" with the actual repository directory
"dns-monitoring/" in the Scenario: Monitor definition in repo section so
reviewers are directed to the correct folder; ensure the sentence reading "they
can view the JSON/YAML definition in the repository under monitoring/datadog/"
is changed to reference "dns-monitoring/" (the current location of monitor
configs).
playbooks/dns-restart.yml-55-59 (1)

55-59: ⚠️ Potential issue | 🟡 Minor

Diagnostics export discards stderr output.

At line 57, only pihole_diag.stdout is persisted. When the diagnostic command runs on failure (line 53), stderr may contain valuable error context that is lost.

📝 Suggested fix
-        content: "{{ pihole_diag.stdout }}"
+        content: |
+          # STDOUT
+          {{ pihole_diag.stdout | default('') }}
+
+          # STDERR
+          {{ pihole_diag.stderr | default('') }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/dns-restart.yml` around lines 55 - 59, The copy task currently only
saves pihole_diag.stdout which discards stderr; update the task that writes the
diagnostics (the copy task referencing pihole_diag.stdout) to include both
stdout and stderr (e.g., concatenate pihole_diag.stdout and pihole_diag.stderr
with a separator and using default('') to avoid undefined errors) and adjust the
when condition to check that pihole_diag is defined and at least one of
pihole_diag.stdout or pihole_diag.stderr is defined so error output is
preserved.
🧹 Nitpick comments (12)
tests/vrrp-smoke-test/test-vrrp-unshare.sh (1)

10-10: Remove or use TEST_IP_PREFIX.

Line 10 is currently unused, which adds noise and can mislead future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/test-vrrp-unshare.sh` at line 10, The variable
TEST_IP_PREFIX is declared but unused; either remove the declaration or use it
where appropriate (e.g., replace hard-coded IP prefixes in test-vrrp-unshare.sh
with TEST_IP_PREFIX) so the script references TEST_IP_PREFIX consistently;
update any occurrences of literal "127.99.0" in functions or commands (such as
IP_ADDR construction or test setup) to build IPs from TEST_IP_PREFIX, or simply
delete the TEST_IP_PREFIX line if not needed.
tests/vrrp-smoke-test/Makefile (2)

1-1: Include check in .PHONY.

Line 1 omits check, so a file named check could shadow the target.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/Makefile` at line 1, The .PHONY declaration is missing
the check target which allows a file named "check" to shadow the Makefile
target; update the .PHONY line to include check (e.g., modify the .PHONY token
that currently lists "all test quick clean" to also list "check") so that the
check target is always treated as phony.

6-8: Wire prerequisite validation into executable targets.

check exists but quick/test do not depend on it, so failures surface later and less clearly.

Also applies to: 11-13, 21-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/vrrp-smoke-test/Makefile` around lines 6 - 8, The Makefile targets
'quick' and 'test' currently run without first invoking the prerequisite
'check', so configuration validation failures are reported later; update the
Makefile so that the 'quick' and 'test' targets depend on 'check' (e.g., change
their target lines to "quick: check" and "test: check") and apply the same
dependency wiring to the other similar targets referenced in the diff (lines
around 11-13 and 21-25) so that 'check' runs automatically before those targets
execute.
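A minimal sketch of both fixes together; the recipe bodies are assumptions based on the script names in this PR, not copied from the actual Makefile:

.PHONY: all test quick clean check

quick: check
	./test-configs-only.sh

test: check
	./test-vrrp-podman.sh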
openspec/changes/dns-server-monitoring-tool/tasks.md (1)

3-3: Prefer repo-relative paths in task docs.

Using ~/dev/src/... makes the checklist machine/user-specific. Consider repository-relative references for portability.

Also applies to: 24-24

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/tasks.md` at line 3, The task
list uses a user-specific absolute path ("~/dev/src/mkrasberry/dns-monitoring/")
which should be replaced with a repository-relative reference; update the
checklist entry (the line starting "- [x] 1.1 Create metrics exporter script in
~/dev/src/mkrasberry/dns-monitoring/") to use a repo-relative path such as
"./dns-monitoring/" or "mkrasberry/dns-monitoring/" so the docs are portable
across machines and mirrors the same change for the similar occurrence at line
24.
AGENTS.md (1)

1-79: Re-add an explicit Spec-Driven Development requirement.

This rewrite is clear, but it should explicitly preserve the mandatory spec-driven workflow for non-trivial work to avoid process drift.

Based on learnings: All agents MUST adhere to the Spec-Driven Development workflow for non-trivial tasks as outlined in the official AI Agent Standards.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 1 - 79, The document removed an explicit requirement
for mandatory Spec-Driven Development; re-add a clear, mandatory clause under
the Engineering Preferences or Workflow sections stating "All agents MUST follow
the Spec-Driven Development workflow for non-trivial tasks as defined in the
official AI Agent Standards," and reference the canonical workflow artifact
("Spec-Driven Development workflow" / "AI Agent Standards") so reviewers and
contributors know it's non-optional; update the "BEFORE YOU START" section to
require choosing BIG CHANGE vs SMALL CHANGE only after confirming adherence to
the Spec-Driven Development requirement and ensure the phrasing uses MUST
(uppercase) to indicate mandatory policy.
openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md (1)

17-17: Use repository-relative paths instead of ~/... in shared specs.

Home-directory paths are machine-specific and make the spec harder to follow for other contributors.

Suggested doc fix
-- `dns-restart-playbook`: Ansible-style playbook (in ~/dev/src/mkrasberry) to SSH to DNS servers, restart the pihole service as the pihole user, and collect diagnostic output if the service is down.
+- `dns-restart-playbook`: Ansible-style playbook (in `playbooks/dns-restart.yml`) to SSH to DNS servers, restart the pihole service as the pihole user, and collect diagnostic output if the service is down.

-- New code in ~/dev/src/mkrasberry (playbook, monitoring config)
+- New code in this repository (playbook, monitoring config)

Also applies to: 25-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md`
at line 17, The spec entry for `dns-restart-playbook` uses a home-directory path
(`~/dev/src/mkrasberry`); replace that with a repository-relative path (e.g.,
`docs/playbooks/dns-restart-playbook` or `tools/playbooks/dns-restart-playbook`)
wherever it appears (including the other occurrence at lines 25-25) so the
shared spec is machine-agnostic and points contributors to the repo location for
`dns-restart-playbook`.
roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2 (1)

88-90: Remove commented legacy Datadog output after migration is confirmed.

Keeping stale output snippets increases ambiguity during incident debugging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2`
around lines 88 - 90, Remove the stale, commented Datadog output block for
[[outputs.datadog]] (the lines referencing apikey= "{{ datadog_telegraf_api_key
}}" and url = "https://app.datadoghq.com/api/v1/series") from the
telegraf.conf.j2 template now that migration is confirmed; locate the commented
snippet around the [[outputs.datadog]] marker and delete those commented lines
so the template no longer contains legacy Datadog output remnants.
openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md (1)

3-13: Add explicit SSH hardening acceptance criteria.

This section defines connectivity behavior but misses concrete controls for restricted SSH exposure and account validation. Add scenarios covering locked-down SSH ports, validated local accounts, and access restrictions.

Based on learnings: For SSH Lockdown & User Access (HOMETECH-17), lock down SSH ports, validate local accounts, and strictly control user access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md`
around lines 3 - 13, The SSH connectivity requirement lacks explicit hardening
acceptance criteria; update the spec section titled "Requirement: SSH
connectivity to DNS servers" by adding new scenarios that enforce SSH lockdown
and account controls: add a scenario "SSH ports locked down" that requires SSH
to run only on approved ports and verifies firewall rules, a scenario "Local
account validation" that ensures only validated local accounts exist and
inactive/unknown accounts are locked/removed, and a scenario "Restricted SSH
access" that mandates key-based access only, restricted user lists, and denies
root/password authentication; reference the existing scenario names ("Scenario:
SSH connection successful" and "Scenario: SSH connection fails") to keep flow
consistent and include explicit acceptance checks and expected abort/failure
behavior for each new scenario.
openspec/changes/dns-server-monitoring-tool/design.md (2)

91-91: Use hyphenated compound adjective.

Line 91 should read 5-minute for clarity and style consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/design.md` at line 91, The
markdown uses an unhyphenated compound adjective "5 minute failure threshold
before alerting" under the "Mitigation" bullet; change it to the hyphenated form
"5-minute failure threshold before alerting" so the compound adjective is
grammatically correct and matches style consistency (edit the line containing
the "Mitigation" bullet / the phrase "5 minute failure threshold before
alerting").

4-7: Consider matching design server IDs to spec label format.

Specs use machine identifiers like new_york; this design uses new york. Aligning terms reduces ambiguity when mapping docs to metric labels.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-server-monitoring-tool/design.md` around lines 4 - 7,
Update the server ID list to use the spec label format (underscores instead of
spaces): replace "new york" with "new_york" and ensure "miyami" matches the
canonical spec token (correct to "miyagi" if the spec uses that) while keeping
"all" as-is; update any references in the design text that mention the old forms
so they align with the spec label identifiers.
openspec/changes/dns-keepalived-analysis/design.md (1)

107-122: Add a language to the fenced code block.

Line 107 starts a fenced block without a language, which triggers markdown lint warnings.

Suggested fix
-```
+```text
 Wisconsin (192.168.20.0/24):
   marsAlpha (192.168.20.250) ←→ marsBeta (192.168.20.251)
   VIP: 192.168.20.2
...
 ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openspec/changes/dns-keepalived-analysis/design.md` around lines 107 - 122,
The fenced code block that begins with the "Wisconsin (192.168.20.0/24):"
listing is missing a language tag, causing markdownlint warnings; update the
opening fence from ``` to ```text (i.e., add the "text" language identifier to
the opening fence for the block containing marsAlpha/marsBeta,
terraOmega/terraPhi, and alphaCenA/alphaCenB) and keep the closing fence
unchanged so the block is treated as plain text by the linter.
playbooks/dns-restart.yml (1)

25-25: Prefer Ansible privilege escalation over inline sudo -u.

Since the play already uses become: true, embedding sudo -u in commands makes privilege handling harder to audit and maintain. Use task-level become_user: "{{ pihole_user }}" instead.

Also applies to: 30-30, 50-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/dns-restart.yml` at line 25, Replace inline sudo usage in the
podman commands with Ansible privilege escalation: remove the "sudo -u {{
pihole_user }}" prefix from the command (e.g., the task using "podman ps
--filter name=pihole") and add task-level become_user: "{{ pihole_user }}" (the
play already has become: true); apply the same change to the other tasks that
use "sudo -u" (the occurrences around the commands at lines referenced: the
podman ps and the other podman commands).
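A minimal sketch of the task-level form (task name illustrative; assumes the play-level become: true noted above):

    - name: Check pihole container
      command: podman ps --filter name=pihole
      become: true
      become_user: "{{ pihole_user }}"
      register: pihole_status
      changed_when: false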

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d20347f and 9da0e68.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (53)
  • .gitignore
  • AGENTS.md
  • callback_plugins/datadog_callback.py
  • callback_plugins/datadog_callback.yml
  • dns-monitoring/datadog-monitor-dns-down-critical.json
  • dns-monitoring/datadog-monitor-dns-down-warning.json
  • dns-monitoring/datadog-monitor-dns-external-latency.json
  • dns-monitoring/datadog-monitor-dns-latency-warning.json
  • dns-monitoring/test_pihole_telegraf.conf
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/.openspec.yaml
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/design.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/proposal.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-alerting/spec.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md
  • openspec/changes/archive/2026-03-01-dns-server-monitoring-tool/tasks.md
  • openspec/changes/dns-keepalived-analysis/.openspec.yaml
  • openspec/changes/dns-keepalived-analysis/design.md
  • openspec/changes/dns-keepalived-analysis/proposal.md
  • openspec/changes/dns-keepalived-analysis/specs/dns-ha-analysis/spec.md
  • openspec/changes/dns-keepalived-analysis/tasks.md
  • openspec/changes/dns-server-monitoring-tool/.openspec.yaml
  • openspec/changes/dns-server-monitoring-tool/design.md
  • openspec/changes/dns-server-monitoring-tool/proposal.md
  • openspec/changes/dns-server-monitoring-tool/specs/dns-alerting/spec.md
  • openspec/changes/dns-server-monitoring-tool/specs/dns-health-monitoring/spec.md
  • openspec/changes/dns-server-monitoring-tool/specs/dns-restart-playbook/spec.md
  • openspec/changes/dns-server-monitoring-tool/specs/pihole-healthcheck-fix/spec.md
  • openspec/changes/dns-server-monitoring-tool/tasks.md
  • openspec/changes/podman-overlayfs.md
  • openspec/config.yaml
  • openspec/specs/dns-alerting/spec.md
  • openspec/specs/dns-health-monitoring/spec.md
  • openspec/specs/dns-restart-playbook/spec.md
  • openspec/specs/pihole-healthcheck-fix/spec.md
  • playbooks/dns-restart.yml
  • playbooks/initial-playbook-stage-5.yml
  • pyproject.toml
  • retire_forks.sh
  • roles/pihole.container
  • roles/podmanSetupRaspberryPi
  • roles/telegraf-container-service/defaults/main.yml
  • roles/telegraf-container-service/tasks/config.yml
  • roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2
  • roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2
  • roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2
  • roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2
  • tests/vrrp-smoke-test/Makefile
  • tests/vrrp-smoke-test/README.md
  • tests/vrrp-smoke-test/test-configs-only.sh
  • tests/vrrp-smoke-test/test-vrrp-podman.sh
  • tests/vrrp-smoke-test/test-vrrp-unshare.sh

Comment thread callback_plugins/datadog_callback.yml
---
api_key: "{{ datadog_logging_key }}"
#api_key: "{{ datadog_logging_key }}"
api_key: "c28b08e17123d0da25aaa98126e1562e"

⚠️ Potential issue | 🔴 Critical

Do not commit Datadog API keys in plaintext.

Line 3 exposes a live-looking API key. Rotate it and source via vaulted variable/CI secret injection only.

🔐 Proposed fix
- api_key: "c28b08e17123d0da25aaa98126e1562e"
+ api_key: "{{ datadog_logging_key }}"
🧰 Tools
🪛 Gitleaks (8.30.0)

[high] 3-3: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@callback_plugins/datadog_callback.yml` at line 3, The file contains a
plaintext Datadog API key under the api_key entry; remove that literal value
from callback_plugins/datadog_callback.yml, rotate the exposed key in Datadog,
and replace it with a secret reference sourced from a secure store (e.g.,
Ansible Vault, CI secret injection, or an environment variable such as
DATADOG_API_KEY) so the code reads the secret at runtime rather than embedding
it; ensure the deployment/CI configuration is updated to inject the secret and
add the original file to .gitignore or restore from a clean commit to avoid
committing the revoked key.
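One hedged way to do the vaulted-variable replacement; the variable name matches the existing template, but the vault file path is an assumption:

ansible-vault encrypt_string 'NEW_ROTATED_KEY' --name 'datadog_logging_key' >> group_vars/all/vault.yml

with the config keeping only:

api_key: "{{ datadog_logging_key }}"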

Comment thread dns-monitoring/test_pihole_telegraf.conf
# Pi-hole v6 authentication
cookie_auth_url = "http://192.168.100.23:1080/api/auth"
cookie_auth_method = "POST"
cookie_auth_body = '{"password":"pihole123abc"}'

⚠️ Potential issue | 🔴 Critical

Remove committed credential and rotate it immediately.

Line 8 hardcodes a real password in source control. Treat it as compromised: rotate it, purge from history if possible, and load secrets from an injected environment/secret store instead of inline config.

🔐 Safer pattern
-  cookie_auth_body = '{"password":"pihole123abc"}'
+  cookie_auth_body = '{"password":"${PIHOLE_API_PASSWORD}"}'
🧰 Tools
🪛 Gitleaks (8.30.0)

[high] 8-8: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dns-monitoring/test_pihole_telegraf.conf` at line 8, Remove the hardcoded
password assigned to cookie_auth_body, rotate the exposed Pi-hole credential
immediately and purge it from VCS history, and replace the inline secret with a
secret injection (e.g., read from an environment variable or secret store) so
cookie_auth_body is constructed from a runtime secret (e.g.,
TELEGRAF_PIHOLE_PASSWORD) rather than a literal string; ensure deployment/CI
injects that secret and update any README/ops notes to describe rotation and
secret provisioning.
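Telegraf substitutes ${VAR} references in config files from the process environment, so the safer pattern above can be exercised locally like this (the password value is a placeholder):

export PIHOLE_API_PASSWORD='rotated-secret'
telegraf --config dns-monitoring/test_pihole_telegraf.conf --test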

Comment thread playbooks/dns-restart.yml
Comment on lines +29 to +53
    - name: Restart pihole service
      command: sudo -u {{ pihole_user }} sh -c "podman stop pihole && podman start pihole"
      register: restart_result
      failed_when: restart_result.rc != 0

    - name: Restart unbound service
      systemd:
        name: unbound
        state: restarted
      become: true

    - name: Verify restart succeeded
      fail:
        msg: "Pihole restart failed with rc={{ restart_result.rc }}"
      when: restart_result.rc not in [0, 3]

    - name: Display restart result
      debug:
        msg: "Pihole restarted successfully on {{ inventory_hostname }}"

    - name: Collect summary diagnostics on failure
      command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
      register: pihole_diag
      failed_when: false
      when: restart_result is failed

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Ansible task execution behavior after failure without ignore_errors when condition subsequent task

💡 Result:

  • Default behavior (no ignore_errors): when a task fails on a host, Ansible stops executing any further tasks for that host (the host is effectively “removed” from the rest of the play), while the play may still continue on other hosts. So a subsequent task—even if it has a when: condition—will not run for that failed host, because execution never reaches it. [1]
  • Implication for when: on the next task: when: is only evaluated if the task is reached. After an unignored failure, it isn’t reached on that host, so the condition doesn’t matter. [1]
  • If you want to fail but still decide next steps with when:, you must allow execution to continue (for example with ignore_errors + register, or by using block + rescue/always for structured recovery/cleanup). [1][2]

Sources:
[1] Ansible docs — Error handling in playbooks (default stop-on-failure; ignore_errors; related controls) (docs.ansible.com)
[2] Jeff Geerling — Blocks (block/rescue/always) behavior overview (jeffgeerling.com)


Failure diagnostics are unreachable with current task flow.

At line 32, the task hard-fails on non-zero RC. When a task fails without ignore_errors, Ansible stops executing any further tasks for that host—including line 53—regardless of the when: condition. The diagnostics path never executes, defeating its purpose.

Wrap the restart tasks in a block/rescue to ensure diagnostics always run on failure:

Suggested fix
-    - name: Restart pihole service
-      command: sudo -u {{ pihole_user }} sh -c "podman stop pihole && podman start pihole"
-      register: restart_result
-      failed_when: restart_result.rc != 0
-
-    - name: Restart unbound service
-      systemd:
-        name: unbound
-        state: restarted
-      become: true
-
-    - name: Verify restart succeeded
-      fail:
-        msg: "Pihole restart failed with rc={{ restart_result.rc }}"
-      when: restart_result.rc not in [0, 3]
-
-    - name: Display restart result
-      debug:
-        msg: "Pihole restarted successfully on {{ inventory_hostname }}"
-
-    - name: Collect summary diagnostics on failure
-      command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
-      register: pihole_diag
-      failed_when: false
-      when: restart_result is failed
+    - name: Restart services and handle failure diagnostics
+      block:
+        - name: Restart pihole service
+          command: sudo -u {{ pihole_user }} podman restart pihole
+          register: restart_result
+
+        - name: Restart unbound service
+          systemd:
+            name: unbound
+            state: restarted
+          become: true
+
+        - name: Verify restart succeeded
+          fail:
+            msg: "Pihole restart failed with rc={{ restart_result.rc }}"
+          when: restart_result.rc not in [0, 3]
+
+        - name: Display restart result
+          debug:
+            msg: "Pihole restarted successfully on {{ inventory_hostname }}"
+      rescue:
+        - name: Collect summary diagnostics on failure
+          command: sudo -u {{ pihole_user }} podman exec pihole pihole -d -a
+          register: pihole_diag
+          failed_when: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@playbooks/dns-restart.yml` around lines 29 - 53, The current hard-fail
prevents the "Collect summary diagnostics on failure" task from running; wrap
the pihole restart steps in an Ansible block/rescue: put the "Restart pihole
service" (the podman stop/start command that registers restart_result) and the
"Restart unbound service" and "Verify restart succeeded" logic inside a block,
move the diagnostics task "Collect summary diagnostics on failure" into the
rescue section (executing the sudo -u {{ pihole_user }} podman exec pihole
pihole -d -a command and registering pihole_diag), and in the rescue re-run or
call the fail logic to preserve the original failure outcome; remove the
unconditional failed_when that aborts flow and rely on block/rescue to guarantee
diagnostics run when restart_result indicates failure.

Comment thread retire_forks.sh
Comment on lines +2 to +11
# Script to retire unused forks with no personal commits

echo "Deleting nifi fork..."
gh repo delete NickJLange/nifi --yes

echo "Deleting homebridge forks..."
gh repo delete NickJLange/homebridge-mqttthing --yes
gh repo delete NickJLange/homebridge-wyze-connected-home --yes
gh repo delete NickJLange/homebridge-tasmota --yes
gh repo delete NickJLange/homebridge.container --yes

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -E '\.sh$' | head -20

Repository: NickJLange/mkrasberry

Length of output: 346


🏁 Script executed:

cat -n retire_forks.sh

Repository: NickJLange/mkrasberry

Length of output: 562


Add safety gates before executing destructive deletes.

Line 2 claims this script only targets "unused forks with no personal commits," but the code provides no mechanism to enforce or verify this precondition. The unconditional gh repo delete ... --yes calls (lines 4–11) will permanently remove repos without confirmation, and missing error handling means failed deletes go silently ignored while "Done." prints anyway (line 13).

Suggested minimal safety guard
 #!/bin/bash
 # Script to retire unused forks with no personal commits

+read -r -p "This permanently deletes repos. Type DELETE to continue: " confirm
+if [[ "$confirm" != "DELETE" ]]; then
+  echo "Aborted."
+  exit 1
+fi
+
 echo "Deleting nifi fork..."
 gh repo delete NickJLange/nifi --yes
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@retire_forks.sh` around lines 2 - 11, The script retire_forks.sh currently
runs unconditional destructive commands using gh repo delete NickJLange/...
--yes; add safety gates by (1) verifying each repo is a fork and has no personal
commits (compare git log or use gh api to check fork=true and upstream commit
ancestry) before deletion, (2) implement a dry-run mode that prints which repos
would be deleted, (3) prompt for explicit user confirmation per-repo (or an
explicit --force flag) instead of --yes by default, and (4) handle and surface
errors from each gh repo delete call and abort on failure so the final "Done."
is accurate; aim to update the gh repo delete invocations and surrounding logic
(the delete command calls and the script-level control flow) accordingly.
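A sketch combining the fork check, dry-run, and error handling the prompt asks for; gh api repos/<owner>/<repo> returns a fork boolean, and the DRY_RUN convention is an assumption:

delete_fork() {
  local repo="$1"
  # Only proceed if GitHub reports the repo as a fork.
  if [[ "$(gh api "repos/${repo}" --jq .fork)" != "true" ]]; then
    echo "Skipping ${repo}: not a fork" >&2
    return 1
  fi
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "[dry-run] would delete ${repo}"
  else
    gh repo delete "${repo}" --yes || { echo "Failed to delete ${repo}" >&2; exit 1; }
  fi
}

delete_fork "NickJLange/nifi"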

Comment thread roles/podmanSetupRaspberryPi
@@ -1 +1 @@
-Subproject commit cf84a87c0d561e25387816299aa419cc3327d0b5
+Subproject commit 077f3124fb617fff14a3fb62758d4632f6f0b143

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify submodule pointer and inspect delta between old/new commits.
OLD="cf84a87c0d561e25387816299aa419cc3327d0b5"
NEW="077f3124fb617fff14a3fb62758d4632f6f0b143"

echo "== Parent repo submodule pointer =="
git ls-tree HEAD roles/podmanSetupRaspberryPi || true

echo
echo "== Ensure submodule is initialized =="
git submodule update --init --recursive roles/podmanSetupRaspberryPi 2>&1 || true

echo
echo "== Submodule commit summary ($OLD..$NEW) =="
git -C roles/podmanSetupRaspberryPi log --oneline --no-merges "${OLD}..${NEW}" 2>&1 || {
  echo "Could not resolve one of the commits in local submodule clone."
}

echo
echo "== Submodule changed files ($OLD..$NEW) =="
git -C roles/podmanSetupRaspberryPi diff --name-status "${OLD}..${NEW}" 2>&1 || {
  echo "Could not compute diff."
}

echo
echo "== Scope check for expected VRRP smoke test paths =="
git -C roles/podmanSetupRaspberryPi diff --name-only "${OLD}..${NEW}" 2>&1 | \
  rg -n "tests/vrrp-smoke-test|keepalived|dns|podman|telegraf|datadog|ansible|README|Makefile" || true

Repository: NickJLange/mkrasberry

Length of output: 1917


🌐 Web query:

NickJLange podmanSetupRaspberryPi GitHub repository commits

💡 Result:

NickJLange/podmanSetupRaspberryPi (branch: master) shows 21 commits on GitHub; the commits visible on the repository’s commit history page are: [1]

  • 2023-11-19
    • 2d1efa9 — Merge pull request #9 from NickJLange/november_update
    • e655aaa — November Catchup Commit
  • 2021-07-18
    • 0ea3203 — Merge pull request #8 from NickJLange/july_updates
    • 02b36b9 — HOMETECH23
  • 2020-12-20
    • 549b6ca — Merge pull request #7 from NickJLange/december_updates
    • 4bacbfe — December Updates - HOMETECH-18
  • 2020-09-13
    • c41a931 — Merge pull request #5 from NickJLange/sync
    • cf013fd — Catch up committ on legacy stuff
  • 2020-01-19
    • 7ad2f9b — Merge pull request #4 from NickJLange/more_fixins
    • 291ed58 — Fixed a bug introduced by linter; updated podman to v1.7.0 and set requirement to buster
    • 683d003 — Merge pull request #3 from NickJLange/incremental_fixins
    • d77d5c7 — Explicit Env Location
    • e6d8c4d — Fixed a boatload of linter warnings.
    • 52ef056 — Devel Packages broken out and docs cleaned up
    • e6f9f90 — Largely working.
  • 2019-09-03
    • 7a34bab — Merge pull request #1 from NickJLange/NickJLange-README
    • 02b58f6 — Merge pull request #2 from NickJLange/initial_add
    • 0fccfd4 — Adding in license file
    • c36ee8e — initial commit of code
    • d79a00f — Create README.md
    • b0958d6 — Initial commit

Latest commit date: November 19, 2023. [1]

Source

  • [1] GitHub commits page for NickJLange/podmanSetupRaspberryPi (/commits/master/).

Submodule pointer references non-existent commits—blocks repository checkout.

The submodule pointer update targets commits that do not exist in the remote repository. The commit cf84a87c0d561e25387816299aa419cc3327d0b5 is unresolvable (confirmed by remote error "not our ref"), and 077f3124fb617fff14a3fb62758d4632f6f0b143 does not appear in the public commit history of NickJLange/podmanSetupRaspberryPi (which shows 21 total commits, latest from Nov 2023). This will cause submodule initialization to fail for anyone cloning this branch. Verify the commit hashes are correct and reachable in the target submodule repository.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@roles/podmanSetupRaspberryPi` at line 1, The submodule pointer in the branch
references unreachable commits (cf84a87c0d561e25387816299aa419cc3327d0b5 and
077f3124fb617fff14a3fb62758d4632f6f0b143), causing checkout failures; fix by
updating the submodule entry to a valid, reachable commit or the correct branch
tag in the submodule repository (inspect .gitmodules and the gitlink entry for
the submodule and update it using git submodule set-branch / git submodule
update --init --remote or by checking out the correct commit in the submodule
and committing the new gitlink), verify the new commit exists in
NickJLange/podmanSetupRaspberryPi and push the corrected submodule pointer so
clones initialize successfully.
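A sketch of the repair, assuming the intended target is the upstream master head (2d1efa9 per the commit history above; substitute the commit actually intended):

git -C roles/podmanSetupRaspberryPi fetch origin
git -C roles/podmanSetupRaspberryPi checkout 2d1efa9
git add roles/podmanSetupRaspberryPi
git commit -m "Repoint podmanSetupRaspberryPi submodule at a reachable commit"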

Comment thread roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2
Comment on lines +80 to +86
[[outputs.otlp]]
  ## Datadog OTLP endpoint (US)
  endpoint = "https://otlp.datadoghq.com:443"

  ## Optional: configure headers for authentication
  [outputs.otlp.headers]
    dd-api-key = "{{ datadog_telegraf_api_key }}"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find Telegraf version/image pinning
fd -e Dockerfile -e 'docker-compose*' -e requirements -e 'Makefile' -e '*.yml' -e '*.yaml' | head -20

Repository: NickJLange/mkrasberry

Length of output: 47


🏁 Script executed:

# Search for telegraf version references
rg -i 'telegraf.*version|telegraf.*image|FROM.*telegraf' -n

Repository: NickJLange/mkrasberry

Length of output: 47


🏁 Script executed:

# Look for all OTLP output references
rg '\[\[outputs\.(otlp|opentelemetry)\]\]' -n -A 5

Repository: NickJLange/mkrasberry

Length of output: 1311


🏁 Script executed:

# Search for dd-api-key and datadog references
rg 'dd-api-key|datadog.*endpoint|otlp\.datadoghq' -n -C 2

Repository: NickJLange/mkrasberry

Length of output: 2289


🌐 Web query:

Telegraf OTLP output plugin official documentation - correct syntax for Datadog

💡 Result:

To send Telegraf metrics via OTLP to Datadog, the “OTLP output plugin” in Telegraf is the OpenTelemetry output plugin ([[outputs.opentelemetry]]), which sends metrics over gRPC and lets you set the target with service_address (default localhost:4317) and optional [outputs.opentelemetry.headers] / [outputs.opentelemetry.attributes]. [1]

Telegraf (telegraf.conf)

# Send OpenTelemetry metrics over gRPC
[[outputs.opentelemetry]]
  # Datadog Agent OTLP/gRPC receiver (default port 4317)
  service_address = "127.0.0.1:4317"

  # Optional extra resource attributes
  [outputs.opentelemetry.attributes]
    "service.name" = "my-telegraf"

  # Optional extra gRPC metadata headers (usually not needed when sending to the Agent)
  # [outputs.opentelemetry.headers]
  #   key1 = "value1"

(Fields shown are from the official Telegraf plugin docs.) [1]

Datadog Agent (datadog.yaml)

Enable the Agent’s OTLP receiver on gRPC 4317 (or HTTP 4318, but Telegraf’s plugin is gRPC-focused):

otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  logs:
    enabled: false

(Example from Datadog’s OTLP ingestion docs; it also documents the default ports 4317 gRPC / 4318 HTTP.) [2]

Sources: Telegraf OpenTelemetry output plugin docs [1]; Datadog “OTLP Ingestion by the Datadog Agent” docs [2].


🌐 Web query:

Telegraf [[outputs.otlp]] headers field name and dd-api-key header support

💡 Result:

  • Telegraf doesn’t have [[outputs.otlp]] (in current upstream Telegraf). The OTLP/OTel output plugin is [[outputs.opentelemetry]]. (docs.influxdata.com)
  • In that plugin, the headers/metadata field name is a TOML table:
    [outputs.opentelemetry.headers] (documented as “Additional gRPC request metadata”). (docs.influxdata.com)
  • Because it’s just key/value gRPC metadata, you can set (for example):
    [outputs.opentelemetry.headers] dd-api-key = "…". (docs.influxdata.com)
  • Datadog’s OTLP HTTP intake expects dd-api-key as an HTTP header (for OTLP/HTTP Protobuf exporters). Telegraf’s outputs.opentelemetry is OTLP over gRPC, so this header only helps if the gRPC endpoint you’re sending to actually reads dd-api-key from gRPC metadata (Datadog’s documented agentless intake examples are HTTP-header based). (docs.datadoghq.com)


Replace invalid [[outputs.otlp]] with correct [[outputs.opentelemetry]] plugin and fix field names.

The plugin block [[outputs.otlp]] does not exist in Telegraf. The correct OTLP/OpenTelemetry output plugin is [[outputs.opentelemetry]], which uses service_address (not endpoint). Additionally, Datadog's OTLP HTTP intake expects dd-api-key as an HTTP header, but this plugin sends gRPC metadata—verify this matches Datadog's expected authentication method for your target endpoint.

Corrected configuration
-[[outputs.otlp]]
+[[outputs.opentelemetry]]
   ## Datadog OTLP endpoint (US)
-  endpoint = "https://otlp.datadoghq.com:443"
+  service_address = "https://otlp.datadoghq.com:443"
   
   ## Optional: configure headers for authentication
-  [outputs.otlp.headers]
+  [outputs.opentelemetry.headers]
     dd-api-key = "{{ datadog_telegraf_api_key }}"

Also parameterize the endpoint to support non-US Datadog sites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@roles/telegraf-container-service/templates/etc/telegraf/telegraf.conf.j2`
around lines 80 - 86, Replace the invalid [[outputs.otlp]] block with the
correct [[outputs.opentelemetry]] plugin, change the endpoint key to
service_address (e.g. service_address = "{{ datadog_otlp_endpoint }}"),
preserve/parameterize the API key variable (datadog_telegraf_api_key) but move
it into the plugin's headers/auth configuration as appropriate, and verify
whether dd-api-key should be sent as an HTTP header vs gRPC metadata for Datadog
(adjust header name/location accordingly); also add a templated
datadog_otlp_endpoint/datadog_site variable so non‑US Datadog sites can be used.

Comment thread roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2
Comment on lines +18 to +24
[[inputs.exec]]
  commands = ["dig +short {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }}#5335 +time=3 +tries=2"]
  name_override = "dns_internal_latency"
  interval = "60s"
  data_format = "value"
  data_type = "integer"
  [inputs.exec.tags]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the template file
find . -name "dns-monitoring.conf.j2" -type f 2>/dev/null

Repository: NickJLange/mkrasberry

Length of output: 156


🏁 Script executed:

# Check if the file exists and read the relevant sections
if [ -f "roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2" ]; then
  echo "=== File content (with line numbers) ==="
  cat -n "roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2"
else
  echo "File not found"
fi

Repository: NickJLange/mkrasberry

Length of output: 2341


🌐 Web query:

dig +short output format DNS query latency

💡 Result:

dig +short output format

  • +short prints only the “useful” result data (record RDATA) and suppresses the usual header/sections/stats. (ucartz.com)
  • Typical formatting:
    • A / AAAA: one IP per line
    • CNAME / NS: hostnames (often with a trailing .)
    • MX: preference mailhost (e.g., 10 mail.example.com.)
    • TXT: quoted strings
    • Multiple records are printed as multiple lines. (ucartz.com)

DNS query latency (“how long the lookup took”)

  • dig reports latency as ;; Query time: <N> msec in the footer of normal output. (support.dnsimple.com)
  • Because +short hides that footer, use +stats (optionally with +noall) to get timing without the full answer:
    • dig example.com +noall +stats | grep "Query time" (zuzia.app)
    • or (full output) dig example.com | grep "Query time" (ucartz.com)


🌐 Web query:

Telegraf inputs.exec data_type integer parser how it works

💡 Result:

With [[inputs.exec]], Telegraf runs your command, captures stdout, then parses it using whatever data_format you configured. The data_type = "integer" setting only applies when you use the value input data format (data_format = "value"). [1]

What data_type = "integer" actually does (with data_format = "value")

Internally, the value parser:

  1. Trims surrounding whitespace and any \x00 bytes.
  2. If data_type is not "string", it splits the entire output on whitespace (spaces, tabs, newlines), keeps only the last token, and ignores the rest.
    So echo "foo 10" → parses 10; printf "1\n2\n" → parses 2. [2]
  3. For integer mode, it parses that last token with Go’s strconv.Atoi (base-10 integer). If parsing fails (e.g., 10ms, 1.2, empty), the parser returns an error and the metric won’t be produced/sent for that gather. [2]

What metric you get

  • Measurement name: defaults to the input plugin name (exec) unless you set name_override. [1]
  • Field name: defaults to value unless you set value_field_name. [1]
  • Timestamp: set to “now (UTC)” at parse time. [2]

Minimal working example

[[inputs.exec]]
  commands = ["sh -c 'echo 123'"]
  data_format = "value"
  data_type = "integer"
  name_override = "my_metric"
  # value_field_name = "value"

Produces a single metric like: measurement my_metric, field value=123 (integer). [1][2]

Sources: Telegraf value input format docs (data_type, name_override, value_field_name) [1]; Telegraf value parser implementation (trim, “last token wins”, Atoi, timestamp) [2].


dig +short returns record data (IP addresses), not latency—parsing will fail with data_type = "integer".

The +short flag suppresses the query statistics footer where latency appears (;; Query time: <N> msec). The command will output either an IP address (e.g., 192.168.1.1) or be empty, both of which fail integer parsing. No metrics will be produced.

Use full dig output with +stats and extract the latency value:

Suggested fix
 [[inputs.exec]]
-  commands = ["dig +short {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }}#5335 +time=3 +tries=2"]
+  commands = ["sh -c 'dig {{ host }}.{{ dns_subdomain_base }} @{{ hostvars[host].ansible_host }} -p 5335 +time=3 +tries=2 +stats | awk \"/Query time:/ {print \\$4}\"'"]
   name_override = "dns_internal_latency"

Also applies to lines 35–39 and 45–50.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/dns-monitoring.conf.j2`
around lines 18 - 24, The inputs.exec blocks (name_override
"dns_internal_latency", commands, data_format/data_type) currently use "dig
+short" which returns records not latency; change the command to run dig with
full stats (enable +stats or remove +short) and pipe/parse the output to emit
only the numeric "Query time" value in milliseconds so the inputs.exec returns a
single integer value compatible with data_format="value" and
data_type="integer"; update the commands in the inputs.exec blocks that set
name_override "dns_internal_latency" (and the other similar inputs.exec blocks
referenced in the review) to perform this parse so empty/no-answer cases produce
no metric or an empty output that Telegraf can ignore.
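The parse can be sanity-checked outside Telegraf first (the query name and resolver here are placeholders):

dig example.com @1.1.1.1 +stats | awk '/Query time:/ {print $4}'
# prints e.g. "23", a bare integer the value parser accepts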

Comment thread roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2
Comment on lines +3 to +5
[[outputs.datadog]]
  apikey = "{{ datadog_telegraf_api_key }}"
  url = "https://otlp.datadoghq.com/api/v1/series"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Telegraf Datadog OTLP endpoint otlp.datadoghq.com configuration

💡 Result:

Telegraf’s [[outputs.opentelemetry]] plugin exports metrics via OTLP/gRPC (default localhost:4317) and can add gRPC metadata via [outputs.opentelemetry.headers]—it’s not an OTLP/HTTP exporter. (docs.influxdata.com)
Datadog’s otlp.datadoghq.com intake endpoints shown in docs are OTLP/HTTP (for example, metrics go to https://otlp.datadoghq.com/v1/metrics with a dd-api-key header). (docs.datadoghq.com)
So, to use Telegraf with otlp.datadoghq.com, the typical setup is: Telegraf (OTLP/gRPC) → OpenTelemetry Collector (translate) → Datadog OTLP/HTTP intake.

1) Telegraf (telegraf.conf) → send OTLP/gRPC to a local/nearby Collector

[[outputs.opentelemetry]]
  service_address = "otel-collector:4317"

(Telegraf OTLP output is gRPC.) (docs.influxdata.com)

2) OpenTelemetry Collector (config.yaml) → export OTLP/HTTP to Datadog

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlphttp:
    metrics_endpoint: https://otlp.datadoghq.com/v1/metrics
    logs_endpoint: https://otlp.datadoghq.com/v1/logs
    headers:
      dd-api-key: ${DD_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      exporters: [otlphttp]

Alternative (if you meant “OTLP to the Datadog Agent”, not otlp.datadoghq.com)

Enable the Agent’s OTLP receiver (gRPC :4317 or HTTP :4318) and point Telegraf at the Agent. (docs.datadoghq.com)


🌐 Web query:

Telegraf outputs.datadog plugin OTLP support

💡 Result:

Telegraf’s [[outputs.datadog]] plugin does not support OTLP—it sends Telegraf metrics to Datadog using Datadog’s Metrics API (v1), and the plugin docs explicitly note it “only supports the v1 API.” [1]

If you want OTLP from Telegraf to Datadog, use Telegraf’s OpenTelemetry output instead:

  • [[outputs.opentelemetry]] sends metrics via OpenTelemetry over gRPC (default localhost:4317). [2]
  • Point that at a Datadog endpoint that accepts OTLP, typically the Datadog Agent OTLP receiver (listening on 4317 gRPC / 4318 HTTP when enabled) or an OpenTelemetry Collector in front of Datadog. [3]

Sources: Telegraf Datadog output plugin docs [1], Telegraf OpenTelemetry output plugin docs [2], Datadog OTLP receiver docs [3].


🌐 Web query:

Telegraf outputs.opentelemetry plugin Datadog OTLP authentication

💡 Result:

What Datadog expects for OTLP auth

  • Datadog’s direct OTLP intake endpoints (agentless) authenticate with an HTTP header like dd-api-key=<YOUR_DD_API_KEY> (for example via OTEL_EXPORTER_OTLP_*_HEADERS). [1], [2]

What Telegraf [[outputs.opentelemetry]] can send

  • Telegraf’s outputs.opentelemetry output sends OTLP over gRPC (not OTLP/HTTP). [3]
  • It can attach custom gRPC metadata via the plugin’s headers table:
    [outputs.opentelemetry.headers]
      dd-api-key = "YOUR_DD_API_KEY"
    (This is generic gRPC metadata support; whether the receiver uses it is up to that receiver.) [3]

Practical implication for Datadog

  • If you want to send Telegraf metrics to Datadog reliably, the usual setup is:
    1. Enable OTLP ingestion on the Datadog Agent (listening on 4317 gRPC and/or 4318 HTTP) [4]
    2. Point Telegraf at the Agent:
      [[outputs.opentelemetry]]
        service_address = "datadog-agent:4317"
    In this path, you typically do not use dd-api-key between Telegraf and the local Agent; the Agent is what’s configured with DD_API_KEY to talk to Datadog. [4]

If you tell me which Datadog target you’re sending to (Datadog Agent, OTel Collector/DDOT, Observability Pipelines Worker, or an “OTLP intake endpoint” URL), I can provide the exact minimal working config for that path.

Sources: [1] Datadog OTLP Logs intake docs, [2] Datadog OTLP Metrics intake docs, [3] Telegraf outputs.opentelemetry docs, [4] Datadog “OTLP Ingestion by the Datadog Agent” docs.


The protocol mismatch concern is valid, but the proposed solution introduces a different protocol mismatch.

[[outputs.datadog]] (Metrics API v1) with an OTLP endpoint is indeed incorrect. However, the proposed [[outputs.opentelemetry]] will not work with otlp.datadoghq.com:443 either: Telegraf's outputs.opentelemetry sends OTLP/gRPC (default localhost:4317), while Datadog's direct OTLP intake (https://otlp.datadoghq.com/v1/metrics) expects OTLP/HTTP.

To send OTLP metrics to Datadog from Telegraf, use one of:

  1. Telegraf → Datadog Agent (enable OTLP receiver on port 4317) → Datadog
    [[outputs.opentelemetry]]
      service_address = "datadog-agent:4317"
  2. Telegraf → OpenTelemetry Collector (gRPC receiver) → Datadog OTLP/HTTP intake with dd-api-key header → Datadog
  3. Use [[outputs.datadog]] correctly (Datadog Metrics API v1 format only, no OTLP endpoint)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@roles/telegraf-container-service/templates/etc/telegraf/telegraf.d/otlp_output.conf.j2`
around lines 3 - 5, The current block incorrectly pairs [[outputs.datadog]] with
an OTLP endpoint; change the config so you either keep [[outputs.datadog]] and
point it to the Datadog Metrics API (remove url =
"https://otlp.datadoghq.com/api/v1/series" and use the proper metrics API
endpoint/format with apikey), or replace with [[outputs.opentelemetry]] but
configure it to send OTLP/gRPC to a receiver (e.g., set service_address =
"datadog-agent:4317") or route through an OpenTelemetry Collector (OTLP/gRPC)
that forwards to Datadog OTLP/HTTP with the dd-api-key header; do not leave
[[outputs.datadog]] pointed at otlp.datadoghq.com nor use
[[outputs.opentelemetry]] without a gRPC receiver address.
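A minimal sketch of option 1, assuming a Datadog Agent with its OTLP gRPC receiver enabled is reachable at datadog-agent:4317 (the hostname is an assumption); no dd-api-key header is needed between Telegraf and a local Agent, since the Agent holds DD_API_KEY:

[[outputs.opentelemetry]]
  service_address = "datadog-agent:4317"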

Comment thread roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2
Comment on lines +80 to +87
[[outputs.otlp]]
  ## Datadog OTLP endpoint (US)
  endpoint = "https://otlp.datadoghq.com:443"

  ## Optional: configure headers for authentication
  [outputs.otlp.headers]
    dd-api-key = "{{ datadog_telegraf_api_key }}"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Telegraf OTLP output plugin official documentation table name syntax

💡 Result:

Telegraf doesn’t have a separate outputs.otlp plugin in the official docs; OTLP export is handled by the OpenTelemetry output plugin ([[outputs.opentelemetry]]). (docs.influxdata.com)

In that plugin’s documented InfluxDB line-protocol → OpenTelemetry mapping, the “name” syntax is:

  • Default metric name: "[measurement]_[field key]"
  • Metric labels: Influx line-protocol tags
  • Metric value: the field value (cast to float) (docs.influxdata.com)


🏁 Script executed:

# Search for Telegraf version specification in the codebase
fd -e requirements.txt -e Dockerfile -e "*.yml" -e "*.yaml" | head -20

Repository: NickJLange/mkrasberry

Length of output: 47


🏁 Script executed:

# Check for version specifications in the repository
rg -i "telegraf" --type-list | head -20
rg -i "telegraf.*version|version.*telegraf" -A 2 -B 2

Repository: NickJLange/mkrasberry

Length of output: 454


🏁 Script executed:

# Search for Telegraf version in the repository
rg -i "telegraf" --type yaml --type toml --type dockerfile

Repository: NickJLange/mkrasberry

Length of output: 99


🏁 Script executed:

# Check for any opentelemetry references
rg -i "opentelemetry"

Repository: NickJLange/mkrasberry

Length of output: 3765


🏁 Script executed:

# Check the telegraf config file itself for other output plugins
cat roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2 | head -100

Repository: NickJLange/mkrasberry

Length of output: 4166


Replace [[outputs.otlp]] with [[outputs.opentelemetry]]—current syntax is invalid.

Per official Telegraf documentation, OTLP export to Datadog requires the [[outputs.opentelemetry]] plugin table name, not [[outputs.otlp]] (which does not exist as a Telegraf output plugin). This configuration will fail at runtime.

Update line 80:

-[[outputs.otlp]]
+[[outputs.opentelemetry]]

Also remove or reorganize the orphaned InfluxDB keys (urls, token, etc.) below; they have no output table header and will be left dangling once the OTLP block is corrected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@roles/teleport-daemon-service/templates/etc/telegraf/telegraf.conf.j2` around
lines 80 - 87, Replace the invalid Telegraf output table name [[outputs.otlp]]
with the correct [[outputs.opentelemetry]] header so the OTLP/Datadog export
block (the endpoint and [outputs.opentelemetry.headers]/dd-api-key) uses the
supported plugin name; after that, locate the orphaned InfluxDB keys (e.g.,
urls, token, org, bucket) that currently sit without an output table header and
either remove them or move them under the appropriate [[outputs.influxdb_v2]]
(or other intended) output table so they are not left orphaned.
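If the stray keys were indeed meant for InfluxDB v2, a hedged sketch of the reorganized block (all values are placeholders to be replaced with the template's existing variables):

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "{{ influxdb_token }}"
  organization = "{{ influxdb_org }}"
  bucket = "telegraf"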
