server: migrate notifications stack to grafana by illegalprime · Pull Request #278 · block/proto-fleet

illegalprime · 2026-05-20T19:01:27Z

Migrate notifications stack to Grafana instead of VM, VMAlert, & Alertmanager.

github-actions · 2026-05-20T19:05:03Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (cf6ff018f14bbcbaa8e67972132b3f01d4247300...8ff083d7e8b99a2addbc3a88dd7fd9b781de5381, exact PR three-dot diff)

Model: gpt-5.5

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Duplicate OTel Metric Module Requirement Leaves Server Module Graph Inconsistent

Category: Other
Location: server/go.mod:87
Description: The PR adds go.opentelemetry.io/otel/metric v1.43.0 // indirect while the same module is already required directly at line 36. The diff also removes the otlpmetricgrpc go.sum entries while go.mod still directly requires that exporter.
Impact: Server go test, lint, tidy, and release builds can fail or dirty go.mod / go.sum, blocking CI and artifact builds.
Recommendation: Run go mod tidy after deleting the OTel metric exporter code, remove unused direct OTel metric/exporter requirements, and keep go.sum consistent with the remaining module graph.

[MEDIUM] Alert Webhook Can Spend Unbounded Time On DB Work Per Request

Category: Reliability
Location: server/internal/handlers/alertmanagerwebhook/handler.go:131
Description: The webhook lists all active orgs using the raw request context, then self-monitoring fan-out performs up to 1000 serial activity inserts, each with its own 10s timeout. A slow DB can therefore hold a single webhook request for a very long time, especially during the exact outage conditions that trigger self-monitoring alerts.
Impact: Grafana retries or repeated alert deliveries can accumulate stuck HTTP handlers and DB work, degrading fleet-api during an incident.
Recommendation: Apply one bounded context for the whole webhook handling path, lazy-load org IDs only when a self-monitoring alert actually needs fan-out, and batch or asynchronously enqueue activity rows instead of doing per-org synchronous inserts.
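
The "lazy-load org IDs" part of this recommendation could look roughly like the sketch below. It is only an illustration: the `alert` type, the `SelfMonitoring` flag, and `fanOutTargets` are hypothetical names, not code from this PR. Only the `ListActiveOrganizationIDs` signature is taken from the diff. `sync.OnceValues` requires Go 1.21 or later.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// alert is a hypothetical, trimmed-down stand-in for the webhook's alert type.
type alert struct {
	Name           string
	SelfMonitoring bool // assumption: something marks alerts that need org fan-out
}

// orgLister mirrors the store method shown in the diff.
type orgLister interface {
	ListActiveOrganizationIDs(ctx context.Context) ([]int64, error)
}

// countingStore is a test double that records how often it is queried.
type countingStore struct{ calls int }

func (s *countingStore) ListActiveOrganizationIDs(context.Context) ([]int64, error) {
	s.calls++
	return []int64{1, 2, 3}, nil
}

// fanOutTargets maps each self-monitoring alert to the org IDs that would
// receive a row. The org list is fetched at most once, and only on demand.
func fanOutTargets(ctx context.Context, store orgLister, alerts []alert) (map[string][]int64, error) {
	loadOrgs := sync.OnceValues(func() ([]int64, error) {
		return store.ListActiveOrganizationIDs(ctx)
	})
	targets := make(map[string][]int64)
	for _, a := range alerts {
		if !a.SelfMonitoring {
			continue // no fan-out, so no org lookup
		}
		ids, err := loadOrgs()
		if err != nil {
			return nil, err
		}
		targets[a.Name] = ids
	}
	return targets, nil
}

func main() {
	store := &countingStore{}
	ctx := context.Background()

	_, _ = fanOutTargets(ctx, store, []alert{{Name: "MinerOffline"}})
	fmt.Println("lookups after ordinary alerts:", store.calls)

	_, _ = fanOutTargets(ctx, store, []alert{
		{Name: "DBDown", SelfMonitoring: true},
		{Name: "DiskFull", SelfMonitoring: true},
	})
	fmt.Println("lookups after two self-monitoring alerts:", store.calls)
}
```

With this shape, a delivery that carries only ordinary alerts never touches the organizations table, and a delivery with several self-monitoring alerts pays for the lookup once.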

Notes

No auth bypass, SQL injection, command injection, cryptostealing/pool hijack, frontend XSS, or protobuf wire-format issues were evident in the scoped diff. I did not run the test suite in this read-only review environment.

Generated by Codex Security Review | Triggered by: @illegalprime | Review workflow run

ankitgoswami · 2026-05-21T23:17:31Z

+}
+
+func (s *SQLOrganizationStore) ListActiveOrganizationIDs(ctx context.Context) ([]int64, error) {
+	orgs, err := s.GetQueries(ctx).ListOrganizations(ctx)


we should probably update ListOrganizations to check DeletedAt instead of this

ankitgoswami · 2026-05-22T00:03:50Z

+
+	// If nothing landed — typically a DB outage or activity-log
+	// rejection affecting every alert in the batch — return 5xx so
+	// Grafana retries the delivery. Acking 204 here would let a


this would lead to duplicates right? maybe we should add a unique partial index for alert.* event_types

If persisted == 0 then nothing hit the DB, so no duplicates?
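
The rule being debated in this thread can be written down as a small decision function. This is a sketch of the behaviour described in the quoted comment, not the handler's actual code: `webhookStatus` and its parameters are hypothetical, the choice of 503 among the 5xx codes is an assumption, and acking partial success with 204 is inferred from the comment rather than confirmed by the diff.

```go
package main

import (
	"fmt"
	"net/http"
)

// webhookStatus picks the response code for one webhook delivery.
// attempted is the number of rows the handler tried to write and
// persisted is the number that actually landed.
func webhookStatus(attempted, persisted int) int {
	if attempted > 0 && persisted == 0 {
		// Nothing was written, so a retry cannot create duplicates.
		return http.StatusServiceUnavailable
	}
	// At least one row landed (or there was nothing to write): ack the
	// delivery so a retry does not re-insert rows that already exist.
	return http.StatusNoContent
}

func main() {
	fmt.Println(webhookStatus(3, 0))
	fmt.Println(webhookStatus(3, 1))
	fmt.Println(webhookStatus(0, 0))
}
```

Under this rule duplicates can only arise if a partially persisted delivery is retried, which is the case a unique partial index would guard against.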

ankitgoswami · 2026-05-22T00:23:18Z

+		description = fmt.Sprintf("alert %s %s", alertName, status)
+	}
+
+	result := models.ResultFailure


The Result field shouldn't be encoding alert state imo. Result means "did the operation this row audits succeed?". It describes the outcome of the recorded action, and here the recorded action is the alert firing, which was successful.

cc: @rongxin-liu

I think we're overloading activity events for something they weren't designed for - maybe we just need an alert_history table instead of this

ankitgoswami · 2026-05-22T00:25:13Z

+		result = models.ResultSuccess
+	}
+
+	scopeType := alert.Labels[labelTemplate]


this also seems like we're repurposing it to store something different

ankitgoswami · 2026-05-22T00:37:15Z

+
+var maxSamplesPerInsert = maxPostgresBindParameters / columnsPerSample
+
 type Config struct {


The flushLoop + pendingRetry + nextBackoff + appendBoundedRetry retry layer seems over-engineered now that we're simply doing a local db insert. Or am I missing something?
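
For context on the `maxSamplesPerInsert` line quoted above: PostgreSQL's wire protocol carries the bind-parameter count as a 16-bit value, so one statement can take at most 65535 parameters, and a multi-row insert has to be split accordingly. The sketch below shows that splitting on its own, without any retry layer. The value of `columnsPerSample` and the `chunk` helper are assumptions for illustration, since the diff excerpt does not show them.

```go
package main

import "fmt"

const (
	// PostgreSQL allows at most 65535 bind parameters per statement.
	maxPostgresBindParameters = 65535
	// Assumption: the real column count is not visible in the excerpt.
	columnsPerSample = 5
)

var maxSamplesPerInsert = maxPostgresBindParameters / columnsPerSample

// chunk splits items into consecutive slices of at most size elements.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for len(items) > size {
		// The full slice expression caps capacity so that appending to a
		// chunk can never overwrite the elements of the next chunk.
		out = append(out, items[:size:size])
		items = items[size:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	samples := make([]int, 30000)
	for i, part := range chunk(samples, maxSamplesPerInsert) {
		fmt.Printf("insert %d: %d samples, %d bind parameters\n",
			i, len(part), len(part)*columnsPerSample)
	}
}
```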

ankitgoswami · 2026-05-22T00:39:11Z

+
+// inMemoryStore is the test double exposed to provider_test.go. It
+// records every sample handed to InsertSamples so tests can assert on
+// metric names and label sets without a real TimescaleDB.


we could drop this and use a real DB in tests. Would lead to better tests and less abstraction. not a blocker though.

ankitgoswami · 2026-05-22T00:42:10Z

+const maxBodyBytes = 1 << 20 // 1 MiB
+
+// maxAlertsPerRequest caps the number of alerts a single webhook delivery may carry.
+const maxAlertsPerRequest = 100
+
+// maxRowsPerRequest caps the total number of activity_log inserts a single webhook delivery may issue.
+const maxRowsPerRequest = 1000


Maybe I'm missing something, but all these + maxSamplesPerInsert are kinda just defending the same surface, no? Let's just pick one or two?

ankitgoswami

Great job figuring this out! I think there are some architectural things to sort out but the stack and integrations look like they would work well for our needs ❤️

flesher

Phase 1 work for multi-site included site-stamped hypertables for errors, miner_state_snapshots, and device_metrics (migration 000047). We should probably include SiteID with notification_metric_sample too.

flesher · 2026-05-22T12:50:47Z

 }

 type CommandLabels struct {
 	OrganizationID string


flesher · 2026-05-22T12:51:25Z


 // DeviceLabels is the canonical label set for per-device gauges.
 type DeviceLabels struct {
 	OrganizationID string


we could add SiteID

flesher · 2026-05-22T12:51:44Z

@@ -29,110 +26,110 @@
 	Result         string


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf54becdb5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T18:54:14Z

+}
+
+func (h *Handler) insertEvent(parent context.Context, event models.Event, alert alertmanagerAlert) int {
+	ctx, cancel := context.WithTimeout(parent, 10*time.Second)


Bound webhook persistence by request-level timeout

Each insert in the fan-out path gets its own fresh 10s timeout, so one webhook request can block for maxRowsPerRequest sequential inserts when the DB is slow or down. With the current cap of 1000 rows, a single self-monitoring delivery can tie up a handler for hours before returning, which amplifies retries and can starve request capacity. Use one deadline for the whole request (or abort after sustained insert failures) so worst-case webhook latency stays bounded.
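
One way to read this suggestion is a single deadline shared by every insert in the request, with the loop stopping as soon as that budget is spent. The sketch below assumes hypothetical names throughout (`insertAll`, `insertFunc`, `slowInsert`, `runDemo`) and simulates a slow database with a timer. It is not the handler from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// insertFunc is a hypothetical stand-in for one activity-log insert.
type insertFunc func(ctx context.Context, row int) error

// insertAll shares one deadline across all rows instead of giving each
// insert a fresh timeout, and stops once the budget is exhausted.
func insertAll(parent context.Context, budget time.Duration, rows int, insert insertFunc) (int, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	persisted := 0
	for i := 0; i < rows; i++ {
		if err := ctx.Err(); err != nil {
			return persisted, err // budget spent: do not start another insert
		}
		if err := insert(ctx, i); err != nil {
			if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
				return persisted, err
			}
			continue // a single rejected row does not abort the batch
		}
		persisted++
	}
	return persisted, nil
}

// slowInsert simulates a stalled database: each call takes delay unless
// the context ends first.
func slowInsert(delay time.Duration) insertFunc {
	return func(ctx context.Context, _ int) error {
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo asks for 1000 inserts of 20ms each against a 50ms budget.
func runDemo() (int, error) {
	return insertAll(context.Background(), 50*time.Millisecond, 1000, slowInsert(20*time.Millisecond))
}

func main() {
	persisted, err := runDemo()
	fmt.Printf("persisted=%d err=%v\n", persisted, err)
}
```

With per-insert timeouts the worst case grows with the row count (1000 rows at 10s each is roughly 2.8 hours); with a shared deadline it is the budget itself, regardless of how many rows the delivery fans out to.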

chatgpt-codex-connector · 2026-05-22T18:54:14Z

+				case sample := <-p.samples:
+					batch = append(batch, sample)
+					if len(batch) >= p.cfg.BatchSize {
+						flush(context.Background(), true)


Honor Shutdown context during final metrics flush

The stop path flushes with context.Background() instead of a cancelable shutdown context, so final writes ignore the caller's deadline. Under TimescaleDB slowness/outage, these forced flushes can continue blocking in 10s chunks after Shutdown(ctx) times out, which risks delayed shutdowns and buffered-sample loss on process exit. Pass a context tied to Shutdown so final draining respects the requested timeout.
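
A sketch of what passing the shutdown context through could look like follows. The `provider` here is a deliberately simplified, hypothetical stand-in: it drains the channel directly inside `Shutdown` rather than signalling a background flush loop, and `insert` replaces the real TimescaleDB write.

```go
package main

import (
	"context"
	"fmt"
)

// provider is a hypothetical, trimmed-down stand-in for the metrics provider.
type provider struct {
	samples   chan int
	batchSize int
	insert    func(ctx context.Context, batch []int) error
}

// Shutdown drains buffered samples and hands the caller's ctx to every
// flush, so the final writes stop when the caller's deadline passes.
func (p *provider) Shutdown(ctx context.Context) (int, error) {
	flushed := 0
	batch := make([]int, 0, p.batchSize)
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		if err := p.insert(ctx, batch); err != nil {
			return err
		}
		flushed += len(batch)
		batch = batch[:0]
		return nil
	}
	for {
		select {
		case <-ctx.Done():
			return flushed, ctx.Err()
		case s := <-p.samples:
			batch = append(batch, s)
			if len(batch) >= p.batchSize {
				if err := flush(); err != nil {
					return flushed, err
				}
			}
		default:
			// Channel is empty: write the remainder and stop.
			err := flush()
			return flushed, err
		}
	}
}

// demoShutdown buffers five samples and shuts down, optionally with a
// context that has already been cancelled.
func demoShutdown(cancelFirst bool) (int, error) {
	p := &provider{
		samples:   make(chan int, 8),
		batchSize: 2,
		// The fake insert succeeds unless its context has ended.
		insert: func(ctx context.Context, _ []int) error { return ctx.Err() },
	}
	for i := 0; i < 5; i++ {
		p.samples <- i
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	return p.Shutdown(ctx)
}

func main() {
	fmt.Println(demoShutdown(false))
	fmt.Println(demoShutdown(true))
}
```

The trade-off is explicit in this form: once the caller's deadline passes, any samples still buffered are dropped instead of holding the process open.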

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ff083d7e8

chatgpt-codex-connector · 2026-05-22T19:58:09Z

+    if ! grep -q "^FLEET_METRICS_WEBHOOK_TOKEN=" "$ENV_FILE" 2>/dev/null; then
+        FLEET_METRICS_WEBHOOK_TOKEN=$(openssl rand -base64 32)
+        echo "FLEET_METRICS_WEBHOOK_TOKEN=$FLEET_METRICS_WEBHOOK_TOKEN" >> "$ENV_FILE"
+        echo "Generated alertmanager webhook token (stored in $ENV_FILE)."
+    fi


Treat empty webhook token as missing during bootstrap

The bootstrap gate only checks for the presence of FLEET_METRICS_WEBHOOK_TOKEN= and does not verify it is non-empty. If .env contains FLEET_METRICS_WEBHOOK_TOKEN= (for example after manual redaction), this branch skips regeneration, yet the notifications compose file requires a non-empty token for Grafana and fleet-api rejects empty tokens for webhook auth. Use the same non-empty/scrub pattern used for DB credentials so reruns self-heal blank values.
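
The check being asked for is "key present and value non-empty" rather than "key present". The script in the diff is shell, so the Go function below only illustrates the logic, not a drop-in fix. `hasNonEmptyValue` is a hypothetical helper, and treating a quoted empty string as blank is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasNonEmptyValue reports whether the env-file content defines key with a
// non-empty value. A bare `KEY=` line counts as missing.
func hasNonEmptyValue(envFile, key string) bool {
	sc := bufio.NewScanner(strings.NewReader(envFile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		value, ok := strings.CutPrefix(line, key+"=")
		// Assumption: KEY="" and KEY='' are treated as blank too.
		if ok && strings.Trim(value, `"' `) != "" {
			return true
		}
	}
	return false
}

func main() {
	const key = "FLEET_METRICS_WEBHOOK_TOKEN"
	fmt.Println(hasNonEmptyValue(key+"=\n", key))
	fmt.Println(hasNonEmptyValue(key+"=c2VjcmV0\n", key))
}
```

A bootstrap step built on this check would regenerate the token whenever it returns false, which covers both a missing line and a redacted one.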

github-actions Bot assigned illegalprime May 20, 2026

github-actions Bot added the documentation, dependencies, automation, and server labels May 20, 2026

illegalprime force-pushed the eden/notifications.grafana branch from 30552f0 to 79be995 Compare May 20, 2026 21:15

github-actions Bot added the github_actions label May 20, 2026

illegalprime force-pushed the eden/notifications.grafana branch 5 times, most recently from 4d429b6 to e56093a Compare May 21, 2026 13:31

github-actions Bot removed the github_actions label May 21, 2026

illegalprime force-pushed the eden/notifications.grafana branch 6 times, most recently from fc77f60 to b051431 Compare May 21, 2026 16:39

rongxin-liu mentioned this pull request May 21, 2026

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E #289

Closed

illegalprime force-pushed the eden/notifications.grafana branch from b051431 to 52ad036 Compare May 21, 2026 17:48

illegalprime marked this pull request as ready for review May 21, 2026 20:26

illegalprime requested a review from a team as a code owner May 21, 2026 20:26