server: migrate notifications stack to grafana#278
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Duplicate OTel Metric Module Requirement Leaves Server Module Graph Inconsistent
[MEDIUM] Alert Webhook Can Spend Unbounded Time On DB Work Per Request
NotesNo auth bypass, SQL injection, command injection, cryptostealing/pool hijack, frontend XSS, or protobuf wire-format issues were evident in the scoped diff. I did not run the test suite in this read-only review environment. Generated by Codex Security Review | |
30552f0 to
79be995
Compare
4d429b6 to
e56093a
Compare
fc77f60 to
b051431
Compare
b051431 to
52ad036
Compare
| } | ||
|
|
||
| func (s *SQLOrganizationStore) ListActiveOrganizationIDs(ctx context.Context) ([]int64, error) { | ||
| orgs, err := s.GetQueries(ctx).ListOrganizations(ctx) |
There was a problem hiding this comment.
we should probably update ListOrganizations to check DeletedAt instead of this
|
|
||
| // If nothing landed — typically a DB outage or activity-log | ||
| // rejection affecting every alert in the batch — return 5xx so | ||
| // Grafana retries the delivery. Acking 204 here would let a |
There was a problem hiding this comment.
this would lead to duplicates right? maybe we should add a unique partial index for alert.* event_types
There was a problem hiding this comment.
If persisted == 0 then nothing hit the DB, so no duplicates?
| description = fmt.Sprintf("alert %s %s", alertName, status) | ||
| } | ||
|
|
||
| result := models.ResultFailure |
There was a problem hiding this comment.
The Result field shouldn't be encoding alert state imo. Result means "did the operation this row audits succeed?". it describes the outcome of the recorded action and here the recorded action is the alert firing which was successful.
cc: @rongxin-liu
There was a problem hiding this comment.
I think we're overloading activity events for something they weren't designed for - maybe we just need a alert_history table instead of this
| result = models.ResultSuccess | ||
| } | ||
|
|
||
| scopeType := alert.Labels[labelTemplate] |
There was a problem hiding this comment.
this also seems like we're repurposing it to store something different
|
|
||
| var maxSamplesPerInsert = maxPostgresBindParameters / columnsPerSample | ||
|
|
||
| type Config struct { |
There was a problem hiding this comment.
The flushLoop + pendingRetry + nextBackoff + appendBoundedRetry retry layer seems over-engineered now that we're simply doing a local db insert. Or am I missing something?
|
|
||
| // inMemoryStore is the test double exposed to provider_test.go. It | ||
| // records every sample handed to InsertSamples so tests can assert on | ||
| // metric names and label sets without a real TimescaleDB. |
There was a problem hiding this comment.
we could drop this and use a real DB in tests. Would lead to better tests and less abstraction. not a blocker though.
| const maxBodyBytes = 1 << 20 // 1 MiB | ||
|
|
||
| // maxAlertsPerRequest caps the number of alerts a single webhook delivery may carry. | ||
| const maxAlertsPerRequest = 100 | ||
|
|
||
| // maxRowsPerRequest caps the total number of activity_log inserts a single webhook delivery may issue. | ||
| const maxRowsPerRequest = 1000 |
There was a problem hiding this comment.
Maybe I'm missing something but all these + maxSamplesPerInsert are kinda just defending the same surface no? Lets just pick one or two?
ankitgoswami
left a comment
There was a problem hiding this comment.
Great job figuring this out! I think there are some architectural things to sort out but the stack and integrations look like they would work well for our needs ❤️
flesher
left a comment
There was a problem hiding this comment.
Phase 1 work for multi site included site-stamped hypertables for: errors, miner_state_snapshots, device_metrics(migration 000047). We should probably include SiteID with notification_metric_sample too.
| } | ||
|
|
||
| type CommandLabels struct { | ||
| OrganizationID string |
|
|
||
| // DeviceLabels is the canonical label set for per-device gauges. | ||
| type DeviceLabels struct { | ||
| OrganizationID string |
| @@ -29,110 +26,110 @@ | |||
| Result string | |||
848b1b3 to
2a219b0
Compare
2a219b0 to
cf54bec
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf54becdb5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
|
|
||
| func (h *Handler) insertEvent(parent context.Context, event models.Event, alert alertmanagerAlert) int { | ||
| ctx, cancel := context.WithTimeout(parent, 10*time.Second) |
There was a problem hiding this comment.
Bound webhook persistence by request-level timeout
Each insert in the fan-out path gets its own fresh 10s timeout, so one webhook request can block for maxRowsPerRequest sequential inserts when the DB is slow or down. With the current cap of 1000 rows, a single self-monitoring delivery can tie up a handler for hours before returning, which amplifies retries and can starve request capacity. Use one deadline for the whole request (or abort after sustained insert failures) so worst-case webhook latency stays bounded.
Useful? React with 👍 / 👎.
| case sample := <-p.samples: | ||
| batch = append(batch, sample) | ||
| if len(batch) >= p.cfg.BatchSize { | ||
| flush(context.Background(), true) |
There was a problem hiding this comment.
Honor Shutdown context during final metrics flush
The stop path flushes with context.Background() instead of a cancelable shutdown context, so final writes ignore the caller's deadline. Under TimescaleDB slowness/outage, these forced flushes can continue blocking in 10s chunks after Shutdown(ctx) times out, which risks delayed shutdowns and buffered-sample loss on process exit. Pass a context tied to Shutdown so final draining respects the requested timeout.
Useful? React with 👍 / 👎.
cf54bec to
5d68d86
Compare
5d68d86 to
8ff083d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ff083d7e8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if ! grep -q "^FLEET_METRICS_WEBHOOK_TOKEN=" "$ENV_FILE" 2>/dev/null; then | ||
| FLEET_METRICS_WEBHOOK_TOKEN=$(openssl rand -base64 32) | ||
| echo "FLEET_METRICS_WEBHOOK_TOKEN=$FLEET_METRICS_WEBHOOK_TOKEN" >> "$ENV_FILE" | ||
| echo "Generated alertmanager webhook token (stored in $ENV_FILE)." | ||
| fi |
There was a problem hiding this comment.
Treat empty webhook token as missing during bootstrap
The bootstrap gate only checks for the presence of FLEET_METRICS_WEBHOOK_TOKEN= and does not verify it is non-empty. If .env contains FLEET_METRICS_WEBHOOK_TOKEN= (for example after manual redaction), this branch skips regeneration, yet the notifications compose file requires a non-empty token for Grafana and fleet-api rejects empty tokens for webhook auth. Use the same non-empty/scrub pattern used for DB credentials so reruns self-heal blank values.
Useful? React with 👍 / 👎.
Migrate notifications stack to Grafana instead of VM, VMAlert, & Alertmanager.