server: migrate notifications stack to grafana #278

Open
illegalprime wants to merge 2 commits into main from eden/notifications.grafana

Contributor

@illegalprime illegalprime commented May 20, 2026

Migrate notifications stack to Grafana instead of VM, VMAlert, & Alertmanager.

@github-actions github-actions Bot added the documentation, dependencies, automation, and server labels May 20, 2026

github-actions Bot commented May 20, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (cf6ff018f14bbcbaa8e67972132b3f01d4247300...8ff083d7e8b99a2addbc3a88dd7fd9b781de5381, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Duplicate OTel Metric Module Requirement Leaves Server Module Graph Inconsistent

  • Category: Other
  • Location: server/go.mod:87
  • Description: The PR adds go.opentelemetry.io/otel/metric v1.43.0 // indirect while the same module is already required directly at line 36. The diff also removes the otlpmetricgrpc go.sum entries while go.mod still directly requires that exporter.
  • Impact: Server go test, lint, tidy, and release builds can fail or dirty go.mod / go.sum, blocking CI and artifact builds.
  • Recommendation: Run go mod tidy after deleting the OTel metric exporter code, remove unused direct OTel metric/exporter requirements, and keep go.sum consistent with the remaining module graph.

[MEDIUM] Alert Webhook Can Spend Unbounded Time On DB Work Per Request

  • Category: Reliability
  • Location: server/internal/handlers/alertmanagerwebhook/handler.go:131
  • Description: The webhook lists all active orgs using the raw request context, then self-monitoring fan-out performs up to 1000 serial activity inserts, each with its own 10s timeout. A slow DB can therefore hold a single webhook request for a very long time, especially during the exact outage conditions that trigger self-monitoring alerts.
  • Impact: Grafana retries or repeated alert deliveries can accumulate stuck HTTP handlers and DB work, degrading fleet-api during an incident.
  • Recommendation: Apply one bounded context for the whole webhook handling path, lazy-load org IDs only when a self-monitoring alert actually needs fan-out, and batch or asynchronously enqueue activity rows instead of doing per-org synchronous inserts.
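
One possible reading of the "one bounded context" and "lazy-load org IDs" recommendation, as a minimal sketch. The names `orgLister`, `lazyOrgIDs`, `alert`, and `planRows` are hypothetical and are not taken from the PR; the real handler types are not visible in this thread. The sketch assumes alerts are processed serially in one goroutine, so the cache needs no locking.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// orgLister is a hypothetical stand-in for the store call that lists active org IDs.
type orgLister func(ctx context.Context) ([]int64, error)

// lazyOrgIDs wraps a lister so the query runs at most once per request,
// and only when some alert actually asks for the IDs. The result (or error)
// is cached for the lifetime of the request. Not safe for concurrent use.
func lazyOrgIDs(list orgLister) orgLister {
	var (
		loaded bool
		ids    []int64
		err    error
	)
	return func(ctx context.Context) ([]int64, error) {
		if !loaded {
			ids, err = list(ctx)
			loaded = true
		}
		return ids, err
	}
}

// alert is a simplified, hypothetical alert shape.
type alert struct {
	Name           string
	SelfMonitoring bool
}

// planRows walks the alerts under a single request-wide deadline and returns
// how many activity rows the delivery would write.
func planRows(parent context.Context, alerts []alert, list orgLister, budget time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	orgIDs := lazyOrgIDs(list)
	rows := 0
	for _, a := range alerts {
		if !a.SelfMonitoring {
			rows++ // org-scoped alert: one row, no org listing needed
			continue
		}
		ids, err := orgIDs(ctx)
		if err != nil {
			return rows, fmt.Errorf("list orgs: %w", err)
		}
		rows += len(ids) // self-monitoring alert: fan out to every active org
	}
	return rows, nil
}

// countListCalls reports how many times the org listing runs for one delivery.
func countListCalls(alerts []alert) int {
	calls := 0
	list := func(ctx context.Context) ([]int64, error) {
		calls++
		return []int64{1, 2, 3}, nil
	}
	if _, err := planRows(context.Background(), alerts, list, 5*time.Second); err != nil {
		return -1
	}
	return calls
}

func main() {
	fmt.Println(countListCalls([]alert{{Name: "HighTemp"}}))
	fmt.Println(countListCalls([]alert{
		{Name: "DBDown", SelfMonitoring: true},
		{Name: "APIDown", SelfMonitoring: true},
	}))
}
```

With this shape, a delivery that carries only org-scoped alerts never issues the org listing, and a delivery with several self-monitoring alerts issues it once.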

Notes

No auth bypass, SQL injection, command injection, cryptostealing/pool hijack, frontend XSS, or protobuf wire-format issues were evident in the scoped diff. I did not run the test suite in this read-only review environment.


Generated by Codex Security Review | Triggered by: @illegalprime | Review workflow run

@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from 30552f0 to 79be995 May 20, 2026 21:15
@github-actions github-actions Bot added the github_actions label May 20, 2026
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch 5 times, most recently from 4d429b6 to e56093a May 21, 2026 13:31
@github-actions github-actions Bot removed the github_actions label May 21, 2026
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch 6 times, most recently from fc77f60 to b051431 May 21, 2026 16:39
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from b051431 to 52ad036 May 21, 2026 17:48
@illegalprime illegalprime marked this pull request as ready for review May 21, 2026 20:26
@illegalprime illegalprime requested a review from a team as a code owner May 21, 2026 20:26
Comment thread deployment-files/run-fleet.sh
Comment thread server/migrations/000052_notification_metric_samples.down.sql Outdated
Comment thread server/internal/handlers/alertmanagerwebhook/handler.go Outdated
Comment thread server/internal/handlers/alertmanagerwebhook/handler.go
}

func (s *SQLOrganizationStore) ListActiveOrganizationIDs(ctx context.Context) ([]int64, error) {
orgs, err := s.GetQueries(ctx).ListOrganizations(ctx)
Contributor

we should probably update ListOrganizations to check DeletedAt instead of this

Comment thread deployment-files/README.md
Comment thread deployment-files/run-fleet.sh
Comment thread server/internal/handlers/alertmanagerwebhook/handler.go Outdated
Comment thread server/cmd/fleetd/main.go Outdated
Comment thread server/internal/handlers/alertmanagerwebhook/handler.go

// If nothing landed — typically a DB outage or activity-log
// rejection affecting every alert in the batch — return 5xx so
// Grafana retries the delivery. Acking 204 here would let a
Contributor

this would lead to duplicates right? maybe we should add a unique partial index for alert.* event_types

Contributor Author

If persisted == 0 then nothing hit the DB, so no duplicates?

Comment thread server/docker-compose.base.yaml Outdated
Comment thread server/docker-compose.notifications.yaml
Comment thread server/docker-compose.base.yaml
description = fmt.Sprintf("alert %s %s", alertName, status)
}

result := models.ResultFailure
Contributor

The Result field shouldn't be encoding alert state imo. Result means "did the operation this row audits succeed?". it describes the outcome of the recorded action and here the recorded action is the alert firing which was successful.

cc: @rongxin-liu

Contributor

I think we're overloading activity events for something they weren't designed for - maybe we just need an alert_history table instead of this

result = models.ResultSuccess
}

scopeType := alert.Labels[labelTemplate]
Contributor

this also seems like we're repurposing it to store something different


var maxSamplesPerInsert = maxPostgresBindParameters / columnsPerSample

type Config struct {
Contributor

The flushLoop + pendingRetry + nextBackoff + appendBoundedRetry retry layer seems over-engineered now that we're simply doing a local db insert. Or am I missing something?


// inMemoryStore is the test double exposed to provider_test.go. It
// records every sample handed to InsertSamples so tests can assert on
// metric names and label sets without a real TimescaleDB.
Contributor

we could drop this and use a real DB in tests. Would lead to better tests and less abstraction. not a blocker though.

Comment on lines +35 to +41
const maxBodyBytes = 1 << 20 // 1 MiB

// maxAlertsPerRequest caps the number of alerts a single webhook delivery may carry.
const maxAlertsPerRequest = 100

// maxRowsPerRequest caps the total number of activity_log inserts a single webhook delivery may issue.
const maxRowsPerRequest = 1000
Contributor

Maybe I'm missing something but all these + maxSamplesPerInsert are kinda just defending the same surface, no? Let's just pick one or two?

Contributor

@ankitgoswami ankitgoswami left a comment

Great job figuring this out! I think there are some architectural things to sort out but the stack and integrations look like they would work well for our needs ❤️

Contributor

@flesher flesher left a comment

Phase 1 work for multi-site included site-stamped hypertables for errors, miner_state_snapshots, and device_metrics (migration 000047). We should probably include SiteID with notification_metric_sample too.

}

type CommandLabels struct {
OrganizationID string
Contributor

ditto


// DeviceLabels is the canonical label set for per-device gauges.
type DeviceLabels struct {
OrganizationID string
Contributor

we could add SiteID

@@ -29,110 +26,110 @@
Result string
Contributor

ditto

@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from 848b1b3 to 2a219b0 May 22, 2026 18:41
@github-actions github-actions Bot added the github_actions label May 22, 2026
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from 2a219b0 to cf54bec May 22, 2026 18:46
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf54becdb5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}

func (h *Handler) insertEvent(parent context.Context, event models.Event, alert alertmanagerAlert) int {
ctx, cancel := context.WithTimeout(parent, 10*time.Second)

[P1] Bound webhook persistence by request-level timeout

Each insert in the fan-out path gets its own fresh 10s timeout, so one webhook request can block for maxRowsPerRequest sequential inserts when the DB is slow or down. With the current cap of 1000 rows, a single self-monitoring delivery can tie up a handler for hours before returning, which amplifies retries and can starve request capacity. Use one deadline for the whole request (or abort after sustained insert failures) so worst-case webhook latency stays bounded.
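
For scale, 1000 serial inserts at 10s each is 10,000s, roughly 2h46m. A minimal sketch of the suggested fix follows, assuming a serial insert loop. `insertFunc`, `insertAll`, and `attemptsWhenDBDown` are hypothetical names used for illustration, not the PR's code. It combines one shared deadline with an abort after a run of consecutive failures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// insertFunc is a hypothetical stand-in for one activity-log insert.
type insertFunc func(ctx context.Context, row int) error

// insertAll writes rows serially under one shared deadline and gives up after
// maxConsecutiveFailures failed inserts in a row. It returns the number of
// rows persisted.
func insertAll(parent context.Context, rows int, budget time.Duration, maxConsecutiveFailures int, insert insertFunc) (int, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	persisted, failures := 0, 0
	for row := 0; row < rows; row++ {
		if err := ctx.Err(); err != nil {
			// Request budget is spent: stop instead of starting another insert.
			return persisted, err
		}
		if err := insert(ctx, row); err != nil {
			failures++
			if failures >= maxConsecutiveFailures {
				return persisted, fmt.Errorf("aborting after %d consecutive failures: %w", failures, err)
			}
			continue
		}
		failures = 0
		persisted++
	}
	return persisted, nil
}

// attemptsWhenDBDown reports how many inserts are attempted when every insert fails.
func attemptsWhenDBDown(rows, maxConsecutiveFailures int) int {
	attempts := 0
	failing := func(ctx context.Context, row int) error {
		attempts++
		return errors.New("db unavailable")
	}
	_, _ = insertAll(context.Background(), rows, time.Second, maxConsecutiveFailures, failing)
	return attempts
}

func main() {
	// Worst case with a fresh 10s timeout per insert and 1000 rows.
	fmt.Println(time.Duration(1000) * 10 * time.Second) // 2h46m40s
	// With the abort rule, a fully failing DB costs only a few attempts.
	fmt.Println(attemptsWhenDBDown(1000, 3)) // 3
}
```

Because every insert derives from the same context, the worst-case request latency is the budget itself rather than rows times the per-insert timeout.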

case sample := <-p.samples:
batch = append(batch, sample)
if len(batch) >= p.cfg.BatchSize {
flush(context.Background(), true)

[P2] Honor Shutdown context during final metrics flush

The stop path flushes with context.Background() instead of a cancelable shutdown context, so final writes ignore the caller's deadline. Under TimescaleDB slowness/outage, these forced flushes can continue blocking in 10s chunks after Shutdown(ctx) times out, which risks delayed shutdowns and buffered-sample loss on process exit. Pass a context tied to Shutdown so final draining respects the requested timeout.
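
A rough sketch of what "pass a context tied to Shutdown" could look like. `sampleStore`, `drain`, `slowStore`, and `drainTimedOut` are hypothetical names, and the provider's real buffering is not shown in this thread. Each write still gets its own cap, but that cap is derived from the shutdown context, so the caller's deadline wins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sampleStore is a hypothetical stand-in for the TimescaleDB writer.
type sampleStore interface {
	InsertSamples(ctx context.Context, batch []float64) error
}

// drain flushes pending samples in batches. Every write context is derived
// from ctx, so draining stops once the caller's shutdown deadline passes.
// batchSize is assumed to be greater than zero.
func drain(ctx context.Context, s sampleStore, pending []float64, batchSize int, perWrite time.Duration) (int, error) {
	flushed := 0
	for len(pending) > 0 {
		if err := ctx.Err(); err != nil {
			return flushed, err
		}
		n := batchSize
		if n > len(pending) {
			n = len(pending)
		}
		writeCtx, cancel := context.WithTimeout(ctx, perWrite)
		err := s.InsertSamples(writeCtx, pending[:n])
		cancel()
		if err != nil {
			return flushed, err
		}
		flushed += n
		pending = pending[n:]
	}
	return flushed, nil
}

// slowStore simulates a database that takes delay to accept each batch.
type slowStore struct{ delay time.Duration }

func (s slowStore) InsertSamples(ctx context.Context, batch []float64) error {
	select {
	case <-time.After(s.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// drainTimedOut reports whether draining 10 samples stopped because the
// shutdown budget expired.
func drainTimedOut(budgetMillis, delayMillis int) bool {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(budgetMillis)*time.Millisecond)
	defer cancel()
	store := slowStore{delay: time.Duration(delayMillis) * time.Millisecond}
	_, err := drain(ctx, store, make([]float64, 10), 5, 10*time.Second)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(drainTimedOut(20, 1000)) // store slower than the shutdown budget
	fmt.Println(drainTimedOut(1000, 1))  // store keeps up
}
```

The caller can then decide what to do with the count of flushed samples, for example logging how many were dropped when the deadline hit.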

@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from cf54bec to 5d68d86 May 22, 2026 19:48
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from 5d68d86 to 8ff083d May 22, 2026 19:49
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ff083d7e8

Comment on lines +587 to +591
if ! grep -q "^FLEET_METRICS_WEBHOOK_TOKEN=" "$ENV_FILE" 2>/dev/null; then
FLEET_METRICS_WEBHOOK_TOKEN=$(openssl rand -base64 32)
echo "FLEET_METRICS_WEBHOOK_TOKEN=$FLEET_METRICS_WEBHOOK_TOKEN" >> "$ENV_FILE"
echo "Generated alertmanager webhook token (stored in $ENV_FILE)."
fi

[P2] Treat empty webhook token as missing during bootstrap

The bootstrap gate only checks for the presence of FLEET_METRICS_WEBHOOK_TOKEN= and does not verify it is non-empty. If .env contains FLEET_METRICS_WEBHOOK_TOKEN= (for example after manual redaction), this branch skips regeneration, yet the notifications compose file requires a non-empty token for Grafana and fleet-api rejects empty tokens for webhook auth. Use the same non-empty/scrub pattern used for DB credentials so reruns self-heal blank values.
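
The bootstrap script itself is shell, so the following is only an illustration of the predicate in Go, the language of the server code. `hasNonEmptyValue` is a hypothetical helper, not part of the PR. It treats a key that is present but blank (or only quotes and whitespace) the same as a missing key.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasNonEmptyValue reports whether env-file contents define key with a
// non-empty value. A line such as `KEY=` or `KEY=""` counts as missing.
func hasNonEmptyValue(contents, key string) bool {
	prefix := key + "="
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		// Strip surrounding quotes and whitespace before checking for emptiness.
		value := strings.Trim(strings.TrimPrefix(line, prefix), "\"' \t")
		if value != "" {
			return true
		}
	}
	return false
}

func main() {
	const key = "FLEET_METRICS_WEBHOOK_TOKEN"
	fmt.Println(hasNonEmptyValue("", key))                // key absent
	fmt.Println(hasNonEmptyValue(key+"=\n", key))         // key present but blank
	fmt.Println(hasNonEmptyValue(key+"=c2VjcmV0\n", key)) // key present with a value
}
```

A bootstrap step built on this check would regenerate the token in the first two cases and leave the file alone in the third.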

Labels: automation, dependencies, documentation, github_actions, server

3 participants