diff --git a/docs/DB_MODEL.md b/docs/DB_MODEL.md index 79c9bc8..009319e 100644 --- a/docs/DB_MODEL.md +++ b/docs/DB_MODEL.md @@ -1,4 +1,4 @@ -# StarForge — Canonical Database Model (DB_MODEL.md) +# StarForge — Canonical Database Model > This document is the canonical reference for the database schema used by StarForge. It expands the summary in the PRD > and contains: diff --git a/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md b/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md new file mode 100644 index 0000000..c76ed7c --- /dev/null +++ b/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md @@ -0,0 +1,283 @@ +# Runbook — Apply Database Migrations + +--- + +## Purpose + +This runbook describes the safe, repeatable procedure to apply schema migrations to production (and staging) +environments for the StarForge project. It consolidates the guidance in [docs/MIGRATIONS.md](../MIGRATIONS.md) and +provides a concise, +actionable checklist, commands, verification queries and troubleshooting steps for operators and engineers. + +--- + +## Audience + +- `SRE` / `Ops engineers` executing production changes +- `Backend engineers` owning migration PRs +- `Release approvers` and on-call engineers + +--- + +## When to use + +- To apply migration PRs to staging or production. +- To run emergency schema changes after appropriate approvals. +- To validate migrations that were applied by `CI` (post-apply verification). + +--- + +## Principles (short) + +- Always be safe: take a backup before applying production migrations. +- Use the protected `CI` workflow (`db-bootstrap`) for production when possible — it enforces approvals and records + artifacts. +- Prefer additive, phased migrations: add → backfill → enforce. +- Monitor system health during and after migration; be prepared to rollback or restore. + +--- + +## Pre-flight checklist (must pass before apply) + +1. Backup & snapshot + - Create and record a DB backup/snapshot ID. Verify the backup was successful. + - Document the backup ID in the migration ticket/PR. + +2. Approvals + - Confirm the GitHub `db-bootstrap` workflow will run under a protected environment (requires approval). + - Ensure required approvers (`Engineering` + `SRE`) are available during the window. + +3. `CI` Preflight & Tests + - Ensure `CI` `migrate:preflight` job passed for the migration PR (applied against ephemeral DB). + - Confirm unit/integration tests and migration tests passed in `CI`. + +4. Runbook & Impact + - Confirm the migration PR includes impact estimates (row counts, index build time). + - Confirm any required maintenance window or low-traffic window is scheduled if the change is heavy. + +5. Communication + - Notify stakeholders (product, support) of the planned maintenance window and expected `ETA`. + - Post the planned change with contact / pager information. + +6. Operational readiness + - Ensure `SRE` on-call is available. + - Confirm ability to pause ingestion and workers (feature flag or scaledown commands). + - Confirm you can run the rollback / restore plan and have the runbook open. + +--- + +## How to apply (recommended: GitHub Actions / db-bootstrap) + +Use the repository's protected `db-bootstrap` GitHub Actions workflow. This is the preferred and auditable path. + +1. Open the PR with migration files and ensure it includes `MIGRATION: ` in commit message. +2. From the repository Actions tab, locate `db-bootstrap` (or run `workflow_dispatch`). +3. Provide required inputs (if any) and kick off the workflow. +4. 
Approver(s): Approve the environment prompt to let the workflow run against production. + - The workflow will run preflight checks, run migrations, and surface logs. + +> Notes: +> - The workflow requires the secret `DATABASE_URL` in `GitHub environment` and the `apply` confirmation input to + actually apply. +> - The workflow logs and artifacts will be retained in `GitHub Actions` and should be saved for audit. + +--- + +## How to apply (alternative: manual via CLI) + +Only use if GitHub Actions is not available. Prefer scripted, idempotent commands. + +1. Pause ingestion & workers + - Disable ingestion if possible (`API` feature flag) or scale down workers to 0. + - Example (`Kubernetes`): scale down `etl worker` deployment: + - `kubectl scale deployment etl-worker --replicas=0 -n starforge` + - For non-`K8s`: stop worker processes or toggle feature flag. + +2. Ensure you have a recent backup. + +3. Run preflight locally (optional but recommended) + - `./scripts/migrate-preflight.sh` + - Validate `pgcrypto` availability, connectivity, and run a smoke `ETL`. + +4. Run migrations + - From repo root: + - `pnpm install --frozen-lockfile` + - `pnpm run migrate:up -- --config database/migration-config.js --env production` + - If migrations require `CONCURRENTLY` for indexes, follow documented instructions — they may run outside a + transaction. + +5. Collect logs and proceed to verification. + +--- + +## Post-apply verification (smoke & health checks) + +Immediately after migrations finish, run the following checks before re-enabling full traffic: + +1. Schema migration table + - Verify applied migrations: + - `SELECT * FROM node_pgmigrations_schema_version ORDER BY installed_on DESC;` + - (or check `schema_migrations` / the table created by `node-pg-migrate` in your setup) + +2. Basic DB health + - Active connections: + - `SELECT count(*) FROM pg_stat_activity;` + - Long running transactions: + - + `SELECT pid, now() - xact_start AS duration, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY duration DESC LIMIT 10;` + +3. Verify key tables/indexes exist (examples) + - Check table existence: + - `SELECT to_regclass('public.hero_snapshots');` + - Check index existence: + - `SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'hero_snapshots';` + +4. Run a smoke `ETL` + - Insert a small test snapshot (or use sample fixture) into `hero_snapshots` and enqueue a processing job. + - Verify worker processes it (if workers are still paused, re-enable a single worker instance temporarily). + - Confirm `user_profile_summary` and `user_troops` upserts succeed. + - Helpful SQL: + - + `SELECT id, processed_at, error_count, last_error FROM hero_snapshots WHERE created_at > now() - interval '10 minutes' ORDER BY created_at DESC LIMIT 20;` + +5. Check `etl_errors` and logs + - Recent errors: + - `SELECT * FROM etl_errors ORDER BY created_at DESC LIMIT 50;` + - Check application logs and `Prometheus` metrics (`starforge_etl_snapshots_processed_total`, + `starforge_etl_snapshots_failed_total`). + +6. Monitor metrics & dashboards + - Watch `ETL` processing rate and failure rate for at least `30–60 minutes`. + - Verify DB CPU, IO, and connection metrics are within expected ranges. + +--- + +## Rollback & emergency restore (decision flow) + +If migration causes critical failures or data corruption, follow this decision flow: + +1. 
If migration has a safe `down` script, and you are confident its execution will restore safe state: + - Run the down migration for the offending change: + - `pnpm run migrate:down -- --count 1 -- --config database/migration-config.js` + - Note: downs may not be safe for destructive changes. Ensure you understand side effects. + +2. If `down` is unsafe, perform a DB restore from the backup created before the migration: + - Stop ingestion and workers immediately. + - Restore the DB from the backup snapshot ID recorded earlier (follow provider-specific restore steps). + - Notify stakeholders and follow restore verification steps (same as post-apply verification). + - After recovery, coordinate re-deploy of any fixes and controlled re-apply if required. + +3. Communicate clearly: + - Open an incident ticket and notify on-call and stakeholders. + - Document timeline and actions taken in the migration PR or incident system. + +--- + +## Common failure modes & mitigations + +- CREATE EXTENSION / pgcrypto permission denied + - Symptom: migration fails on CREATE EXTENSION. + - Mitigation: check provider support. If unsupported, use application-side `UUID` generation or request provider + privileges. Revert/skip extension creation if planned. + +- Long-running CONCURRENTLY index builds causing resource exhaustion + - Symptom: elevated IO/CPU, slowed queries. + - Mitigation: monitor index build, spread index creation to off-peak hours, throttle background jobs, or create + indexes on replicas if available. + +- Migration wrapped in transaction (CONCURRENTLY disallowed) + - Symptom: failure when attempting CREATE INDEX CONCURRENTLY inside transaction. + - Mitigation: separate that step into a non-transactional migration (use `pgm.sql` outside a transaction) as + documented in [docs/MIGRATIONS.md](../MIGRATIONS.md). + +- DB connection exhaustion + - Symptom: connection errors, failed migrations. + - Mitigation: reduce migration concurrency, pause workers, increase DB capacity temporarily, use pgbouncer. + +- Partial backfill failures + - Symptom: backfill job errors after schema change. + - Mitigation: investigate error, re-run backfill with smaller batches, record failures in `backfill_jobs` for + resume. 
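
The first two failure modes above can be checked cheaply before and during the apply. A minimal sketch, assuming a PostgreSQL 12+ instance with the standard catalog views (adjust to your environment):

```sql
-- Is pgcrypto available on this instance, and is it already installed?
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pgcrypto';

-- Progress of any in-flight CREATE INDEX [CONCURRENTLY] builds (PostgreSQL 12+)
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;
```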
+ +--- + +## Recommended verification queries (copy/paste) + +```SQL +-- Check last applied migrations (adjust to your migration table name) +SELECT * +FROM schema_migrations +ORDER BY installed_on DESC LIMIT 20; + +-- Check hero_snapshots presence +SELECT to_regclass('public.hero_snapshots') as hero_snapshots_exists; + +-- Check GIN index presence for hero_snapshots.raw +SELECT indexname, indexdef +FROM pg_indexes +WHERE tablename = 'hero_snapshots' + AND indexdef ILIKE '%gin%'; + +-- Show recent etl_errors +SELECT id, snapshot_id, error_type, message, created_at +FROM etl_errors +ORDER BY created_at DESC LIMIT 50; + +-- Recent processed snapshots +SELECT id, user_id, processed_at, error_count +FROM hero_snapshots +WHERE processed_at IS NOT NULL +ORDER BY processed_at DESC LIMIT 20; +``` + +--- + +## Operational checklist (concise) + +- [ ] Backup created and backup ID recorded +- [ ] Approvals obtained +- [ ] `CI` preflight passed +- [ ] Maintenance window & communications sent +- [ ] Workers paused / ingestion throttled +- [ ] Run migrations via `CI` (preferred) or manual `CLI` +- [ ] Run post-apply verification queries +- [ ] Run smoke `ETL` and confirm no errors +- [ ] Monitor metrics for `30–60 minutes` +- [ ] Re-enable workers and normal traffic +- [ ] Record migration outcome and attach logs/artifacts to PR + +--- + +## Audit & artifacts + +- Keep GitHub Actions logs and artifacts (workflow run ID) attached to the migration PR. +- Record backup snapshot ID and any run IDs (backfill job IDs) in the PR and change log. +- Save verification query outputs into the PR comments or incident log for traceability. + +--- + +## Contacts & escalation + +- Primary `SRE`: (fill with on-call contact or team alias) +- Backend owner(s): from migration PR +- Pager / Slack channel: `#starforge-ops` +- If severe outage: page `SRE` and Engineering leads immediately. + +--- + +## References + +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration conventions and patterns +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — `ETL` contract and smoke test guidance +- [docs/DB_MODEL.md](../DB_MODEL.md) — canonical schema +- `scripts/migrate-preflight.sh` — preflight helper script +- `GitHub Actions` workflow: `.github/workflows/db-bootstrap.yml` + +--- + +## Notes + +- This runbook is intended to be concise and actionable. For complex or high-risk migrations (large table rewrites, + partitioning, destructive changes), prepare a migration plan that includes a rehearsal run in staging, detailed + backfill scripts, and extended monitoring windows. +- Keep the runbook updated with contact details and any provider-specific restore steps. diff --git a/docs/OP_RUNBOOKS/BACKFILL.md b/docs/OP_RUNBOOKS/BACKFILL.md new file mode 100644 index 0000000..332b7a7 --- /dev/null +++ b/docs/OP_RUNBOOKS/BACKFILL.md @@ -0,0 +1,406 @@ +# Runbook — Backfill Historical Snapshots + +## Purpose + +This runbook explains how to safely run a backfill of historical `hero_snapshots` into the normalized schema +(`user_troops`, `user_pets`, `user_artifacts`, `user_profile_summary`, etc.). It covers planning, preflight checks, safe +execution (staging → canary → production), verification, throttling recommendations and troubleshooting. 
+ +--- + +## Audience + +- `SRE` / `DevOps` running backfill jobs +- `Backend` / `Data engineers` implementing backfill jobs +- `QA engineers` validating results +- On-call engineers responding to backfill incidents + +--- + +## When to use + +- Populate normalized tables from archived or historical `hero_snapshots`. +- Rebuild derived tables after schema or `ETL` mapping changes. +- Resume or resume a previously interrupted backfill. + +--- + +## High-level strategy + +1. Use the same `ETL` worker codepath for backfill as for real-time processing to guarantee parity. +2. Run backfill in small, resumable batches; track progress in a `backfill_jobs` / `queue_jobs` table. +3. Run backfill on an isolated worker pool (separate from real-time workers) and throttle to protect the primary DB. +4. Validate results via sample parity checks and automated queries; keep raw snapshots immutable for replay. + +--- + +## Prerequisites & assumptions + +- You have a recent DB backup (required before production backfills). +- `hero_snapshots` contains raw JSONB payloads to process. +- `ETL` worker code is tested and supports idempotent reprocessing. +- A `backfill_jobs` / `queue_jobs` mechanism exists to schedule and track jobs ( + see [docs/MIGRATIONS.md](../MIGRATIONS.md)). +- Observability: `Prometheus` metrics and logs are configured for `ETL`, DB, and workers. + +--- + +## Preflight checklist (must pass) + +- [ ] Backup taken and backup ID recorded. +- [ ] Approvals obtained (`Engineering + SRE`; product if user-impacting). +- [ ] Run dry-run in staging with representative sample payloads. +- [ ] Verify worker code handles large snapshots with streaming parsing. +- [ ] Confirm catalog seed coverage (`troop_catalog`, `pet_catalog`) or fallback behavior (placeholders). +- [ ] Confirm capacity limits: DB max connections, `CPU`, `IO`, and estimated snapshot throughput. +- [ ] Confirm monitoring dashboards and alerts are active (`ETL` failure rate, queue depth, DB connections). +- [ ] Communication: inform support and stakeholders of planned windows and contact points. + +--- + +## Planning & capacity estimation + +### Estimate runtime: + +- Measure average processing time per snapshot (`t_avg`) from staging dry-run (seconds). +- Total snapshots (N). +- Desired wall time (`T_target`) and concurrency (C) estimate: + - `T_estimate = (N * t_avg) / C` + - Pick C so DB connection usage and IO remain under safe thresholds. + +### Example: + +- `t_avg = 12s`, `N = 10,000 snapshots`, `C = 10` → `T_estimate ≈ (10_000 * 12) / 10 = 12,000s ≈ 3.3 hours`. + +### Batch sizing + +- Use per-user or per-snapshot batches; recommended batch sizes: `50–500 snapshots` per DB transaction depending on + payload size and DB load. +- For very large snapshots (`2–3MB`) prefer smaller batches (`10–50`). + +--- + +## Execution modes + +- Dry-run (staging): run backfill on a representative sample and validate outputs. +- Canary (production small slice): run backfill for a small time range or subset of users (e.g., last 7 days or + whitelisted namecodes). +- Gradual production: increase coverage and concurrency in controlled steps with monitoring gates. +- One-shot (not recommended): only for tiny datasets or pre-approved maintenance windows. + +--- + +## How to schedule a backfill (examples) + +> Note: adjust commands to your orchestration (Kubernetes, systemd, or runbooks). 
+ +A) Enqueue via `queue_jobs` (DB-backed job table) + +- Insert a backfill-range job row: + +```sql +INSERT INTO queue_jobs (id, type, payload, priority, attempts, max_attempts, status, run_after, created_at, updated_at) +VALUES (gen_random_uuid(), + 'backfill_range', + jsonb_build_object( + 'from_created_at', '2024-01-01T00:00:00Z', + 'to_created_at', '2024-06-01T00:00:00Z', + 'batch_size', 250, + 'owner', 'data-team' + )::jsonb, + 0, + 0, + 5, + 'pending', + now(), + now(), + now()); +``` + +- Workers configured to process `backfill_range` should pick jobs and create internal `backfill_jobs` checkpoints. + +B) Use admin API (if implemented) + +```http +POST /api/v1/admin/backfill +Authorization: Bearer +Content-Type: application/json + +{ + "from": "2024-01-01T00:00:00Z", + "to": "2024-06-01T00:00:00Z", + "batch_size": 250, + "concurrency": 8, + "owner": "data-team" +} +``` + +--- + +## Dry-run (staging) procedure + +1. Select sample snapshots (small, medium, large, malformed cases). Example SQL to pick samples: + +```sql +-- 10 random snapshots across size buckets +WITH sizes AS (SELECT id, size_bytes, ntile(3) OVER (ORDER BY size_bytes) AS bucket + FROM hero_snapshots + WHERE created_at < now()) +SELECT id +FROM sizes +WHERE bucket = 1 +ORDER BY random() LIMIT 3 +UNION ALL +SELECT id +FROM sizes +WHERE bucket = 2 +ORDER BY random() LIMIT 3 +UNION ALL +SELECT id +FROM sizes +WHERE bucket = 3 +ORDER BY random() LIMIT 4; +``` + +2. Start a staging worker cluster with the same code and configuration you will use in production (but reduced + concurrency). +3. Enqueue these sample snapshot ids as backfill jobs (or trigger reprocess) and observe: + - Memory usage, processing time, and DB operations. + - ETL emits `snapshot_processed` events and no unhandled exceptions. + +--- + +## Canary procedure (production small slice) + +1. Pick a narrow range (e.g., 1 day or 500 users) or whitelisted namecodes. +2. Run the backfill job for the slice with conservative concurrency (C = 1–4). +3. Monitor for `30–60 minutes`: + - `ETL` failure rate (should be near zero). + - DB `CPU/IO` and connection count (no spike above safe thresholds). + - Application latencies and errors. + +--- + +## Production rollout (gradual) + +1. Start with a small concurrency and slice. +2. If gate checks pass after observation window, scale up: + - Increase number of concurrent workers or batch size incrementally. + - Expand the date range or number of users processed. +3. Continue until full coverage achieved. + +--- + +## Progress tracking & resume + +- Maintain a `backfill_jobs` table with fields: + - job_id, owner, from_ts, to_ts, batch_size, concurrency, status (pending|running|paused|failed|done), + processed_count, error_count, last_checkpoint, started_at, finished_at. +- Use `processed_count` and `last_checkpoint` to resume from last successful snapshot on failure. + +Sample SQL to inspect progress: + +```sql +SELECT job_id, status, processed_count, error_count, started_at, finished_at +FROM backfill_jobs +ORDER BY started_at DESC LIMIT 50; +``` + +--- + +## Validation & verification (sample queries) + +Run automated checks during and after backfill to validate correctness. + +1. Processed counts: + +```sql +SELECT COUNT(*) +FROM hero_snapshots +WHERE processed_at IS NOT NULL + AND processed_at >= now() - interval '1 day'; +``` + +2. 
ETL errors (investigate top error types):

```sql
SELECT error_type, count(*)
FROM etl_errors
WHERE created_at >= now() - interval '1 hour'
GROUP BY error_type
ORDER BY count DESC;
```

3. Spot-check data parity for an example snapshot:

- Extract a small piece of truth from the raw JSON and compare it to the normalized table.

```sql
-- Raw: retrieve troop entries for a snapshot (example JSON path may vary)
SELECT raw -> 'ProfileData' -> 'Troops' AS troops
FROM hero_snapshots
WHERE id = '<snapshot_id>';

-- Normalized: compare total troop rows for the user
SELECT count(*)
FROM user_troops
WHERE user_id = (SELECT user_id FROM hero_snapshots WHERE id = '<snapshot_id>');
```

4. Uniqueness checks:

```sql
-- Ensure no duplicate user_troops per (user_id, troop_id)
SELECT user_id, troop_id, COUNT(*)
FROM user_troops
GROUP BY user_id, troop_id
HAVING COUNT(*) > 1 LIMIT 50;
```

5. Summary generation validation:

```sql
-- Ensure profile_summary exists for sampled users
SELECT u.id, ups.user_id IS NOT NULL AS has_summary
FROM users u
         LEFT JOIN user_profile_summary ups ON u.id = ups.user_id
WHERE u.id IN (<sampled_user_ids>);
```

---

## Monitoring & alerts (what to watch)

- `ETL` metrics:
    - `starforge_etl_snapshots_processed_total` (throughput)
    - `starforge_etl_snapshots_failed_total` (failures)
    - `starforge_etl_processing_duration_seconds` (latency)
- Queue metrics: queue depth and job age
- DB metrics: active connections, long-running transactions, `CPU`, `IOPS`
- Worker metrics: memory usage, restarts
- Alerts:
    - Failure rate > `1%` sustained → page on-call.
    - Queue depth growing unexpectedly → investigate consumer capacity.
    - DB connections > `80%` of max → pause backfill and scale DB or reduce concurrency.

---

## Throttling & safety knobs

- Reduce concurrency (worker count) if DB connections climb.
- Reduce batch size if individual transactions are slow or if FK violations appear.
- Pause the backfill job(s) by updating `backfill_jobs.status = 'paused'` or removing pending queue entries.
- Temporarily scale down real-time workers to free DB capacity (do this carefully to avoid service disruption).

---

## Common failure modes & remediation

- Worker `OOM` / restarts
    - Action: reduce batch size, process large snapshots separately, increase worker memory for the dedicated backfill
      pool.
- FK violations (missing catalog rows)
    - Action: either create safe placeholder rows in catalogs (`troop_catalog`, `pet_catalog`) or capture failing
      snapshot ids for manual review; do not delete other processed data.
- DB connection exhaustion
    - Action: pause backfill, scale DB or pgbouncer, reduce per-worker pool size, resume with lower concurrency.
- High `etl_errors` rate (`PARSE_ERROR`)
    - Action: capture sample failing snapshot raw JSON to `docs/examples/quarantine` or S3 for developer analysis; pause
      automated reprocessing until a fix is deployed.
- Long-running index builds or blocking operations
    - Action: monitor `pg_stat_activity` and cancel offending queries if safe; consider scheduling heavy DDL for
      maintenance windows.

---

## Rollback & recovery

Backfill itself is idempotent; use these steps if unacceptable data changes occur:

1. Pause backfill (stop workers).
2. If corruption is localized and fixable by reprocessing (e.g., a mapping bug):
    - Fix `ETL` code.
    - Re-enqueue affected snapshot ids or run targeted reprocessing.
3. If destructive corruption occurred (rare), restore the DB from the pre-backfill backup:
    - Follow the [DB_RESTORE.md](./DB_RESTORE.md) runbook.
+ - Re-run controlled backfill with corrected process. +4. Document actions and notify stakeholders. + +--- + +## Post-backfill tasks + +- Run full validation suite (sampling and aggregates) and record reports/artifacts. +- Mark `backfill_jobs` as `done` and record `finished_at`. +- Incrementally remove any temporary placeholder catalog rows if created and update catalog with authoritative data. +- Update dashboards to reflect production read usage from the normalized tables. +- Archive logs and push final reports into PR or release artifacts. + +--- + +## Artifacts & audit + +Store the following artifacts for auditing and troubleshooting: + +- Backup snapshot ID used before backfill. +- Backfill job records (`backfill_jobs` rows). +- `ETL` logs for the backfill period (worker logs). +- A report with sample verification queries and their results. +- Links to any `S3` objects for quarantined snapshots. + +--- + +## Contacts & escalation + +- Primary `SRE` / on-call: (replace with team alias) `#starforge-ops` / pager +- Backend owner(s): check migration/backfill PR +- Data owner: `data-team` (as recorded in job payload) +- Security contact: (security officer alias) + +--- + +## Appendix: helpful SQL snippets + +```SQL +-- Pause future job processing by marking pending backfill jobs paused +UPDATE backfill_jobs +SET status = 'paused' +WHERE status = 'pending'; + +-- Resume a paused backfill job +UPDATE backfill_jobs +SET status = 'running' +WHERE job_id = '' + AND status = 'paused'; + +-- Find snapshots not processed (candidate for backfill) +SELECT id, created_at, size_bytes +FROM hero_snapshots +WHERE processed_at IS NULL +ORDER BY created_at ASC LIMIT 1000; + +-- Get top ETL error types in last 24h +SELECT error_type, COUNT(*) +FROM etl_errors +WHERE created_at >= now() - interval '24 hours' +GROUP BY error_type +ORDER BY COUNT DESC; + +-- Check long-running transactions +SELECT pid, usename, now() - xact_start AS duration, query +FROM pg_stat_activity +WHERE xact_start IS NOT NULL + AND now() - xact_start > interval '1 minute' +ORDER BY duration DESC LIMIT 20; +``` + +--- + +## Related documents + +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration conventions and safe patterns +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — ETL worker contract and upsert patterns +- [docs/DB_MODEL.md](../DB_MODEL.md) — canonical schema and table definitions +- [docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md](./APPLY_MIGRATIONS.md) — apply migrations runbook +- `scripts/migrate-preflight.sh` — preflight helper + +--- diff --git a/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md b/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md new file mode 100644 index 0000000..278af13 --- /dev/null +++ b/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md @@ -0,0 +1,352 @@ +# Runbook — Postgres Connection Exhaustion + +--- + +## Purpose + +Immediate, actionable steps to diagnose and mitigate a `Postgres` (or managed `Postgres`) connection exhaustion incident +for StarForge. This runbook is for on-call `SREs` and `backend engineers` to restore service quickly and safely, then +remediate root causes. + +--- + +## Scope + +- Symptoms covered: new connections failing, application errors like "too many connections", high `pg_stat_activity` + connection counts, `pgbouncer`/`connection-pool` saturation, or managed provider connection limits reached. +- Assumes metrics & logging per [docs/OBSERVABILITY.md](../OBSERVABILITY.md) are available and that you have access to + DB admin credentials + and the ability to scale/pause workers. 
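
As a quick confirmation of the symptom, a single query shows how close the instance is to its configured limit; the fuller diagnostics below then break usage down by user and application. A minimal sketch (read-only; run from any session that can still connect):

```sql
-- Connection headroom at a glance
SELECT (SELECT count(*) FROM pg_stat_activity)             AS current_connections,
       current_setting('max_connections')::int             AS max_connections,
       round(100.0 * (SELECT count(*) FROM pg_stat_activity)
             / current_setting('max_connections')::int, 1) AS pct_used;
```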
+ +--- + +## Emergency summary (short) + +1. Acknowledge the alert and notify stakeholders (#starforge-ops). +2. Quickly reduce incoming load (pause ingestion and scale down workers). +3. Identify & cancel long-running queries; prefer cancel over terminate. +4. If safe, increase connection resources (`pgbouncer` scaling or DB scaling) while preparing a long-term fix. +5. Run postmortem and implement permanent controls (connection pooler, limits, timeouts). + +--- + +## Contacts / escalation + +- Primary on-call `SRE`: #starforge-ops (`Pager`) +- Backend owner(s): from recent PR/migration +- Engineering lead / DB admin: `` +- Security contact: `security@org` (if suspicious activity) + +--- + +## Triage checklist (first 5 minutes) + +- [ ] Acknowledge `PagerDuty` / alert. +- [ ] Set a dedicated incident channel (e.g. `#incident-dbconn-)`. +- [ ] Find current symptom: application errors, `502/503`, DB-side rejects, or elevated queue length. +- [ ] If possible, temporarily block new writes / `API` ingestion (feature flag) and lower worker concurrency. + +--- + +## Important safety note + +Always avoid destructive actions you don't understand. Prefer conservative actions (pause, cancel, scale) and keep a +clear log of commands run. If unsure about a connection/user, consult the backend owner before terminating connections. + +--- + +## Quick diagnostics (commands) + +Run these from a bastion / `CI` / workstation with `psql` access (replace `DATABASE_URL` or connection params). + +1) Check DB max connections and current counts + +```sql +-- Show configured max connections +SHOW +max_connections; + +-- Current active connection count +SELECT count(*) AS total_connections +FROM pg_stat_activity; + +-- Breakdown by state +SELECT state, count(*) +FROM pg_stat_activity +GROUP BY state; + +-- Connections by application_name / user / client_addr +SELECT application_name, usename, client_addr, count(*) AS c +FROM pg_stat_activity +GROUP BY application_name, usename, client_addr +ORDER BY c DESC LIMIT 50; +``` + +2) Find long-running queries / transactions + +```sql +-- Longest running queries +SELECT pid, usename, application_name, client_addr, now() - query_start AS duration, state, query +FROM pg_stat_activity +WHERE state <> 'idle' + AND query_start IS NOT NULL +ORDER BY duration DESC LIMIT 50; + +-- Long transactions that may hold locks +SELECT pid, usename, now() - xact_start AS tx_duration, query +FROM pg_stat_activity +WHERE xact_start IS NOT NULL +ORDER BY tx_duration DESC LIMIT 50; +``` + +3) Check locks and waiting queries + +```sql +-- Waiting queries +SELECT pid, wait_event_type, wait_event, state, query_start, query +FROM pg_stat_activity +WHERE wait_event IS NOT NULL +ORDER BY query_start; + +-- Inspect pg_locks for blocking relationships +SELECT blocked_locks.pid AS blocked_pid, + blocked_activity.usename AS blocked_user, + blocking_locks.pid AS blocking_pid, + blocking_activity.usename AS blocking_user, + blocking_activity.query AS blocking_query +FROM pg_locks blocked_locks + JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid + JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype + AND blocking_locks.database IS NOT DISTINCT +FROM blocked_locks.database + AND blocking_locks.relation IS NOT DISTINCT +FROM blocked_locks.relation + AND blocking_locks.page IS NOT DISTINCT +FROM blocked_locks.page + AND blocking_locks.tuple IS NOT DISTINCT +FROM blocked_locks.tuple + AND blocking_locks.virtualxid IS NOT DISTINCT +FROM 
blocked_locks.virtualxid + AND blocking_locks.transactionid IS NOT DISTINCT +FROM blocked_locks.transactionid + JOIN pg_stat_activity blocking_activity +ON blocking_activity.pid = blocking_locks.pid +WHERE NOT blocked_locks.GRANTED; +``` + +--- + +## Immediate mitigation steps (safe, prioritized) + +Follow in order until connections recover. + +1) Pause ingestion / reduce incoming traffic (very high impact but safe) + - Flip feature flag that accepts new snapshots (`API`) to OFF, or configure `API` to return `503` for ingestion + endpoints. + - Announce to support: "ingestion paused, attempting mitigation". + +2) Scale down background workers (free DB connections) + - If using `Kubernetes`: + ```bash + # reduce worker replicas quickly (example) + kubectl -n starforge scale deployment etl-worker --replicas=0 + kubectl -n starforge scale deployment api --replicas= # careful if API needs DB + ``` + - If non-`K8s`: stop or pause worker processes/containers, or set worker concurrency env var to `0/1` then restart. + +3) If using `pgbouncer` or a pooler, check / scale pooler + - For `pgbouncer`: check stats and connection usage; if `pgbouncer` itself is saturated, increase instances or pool + size. + - If using managed pooling (`Supabase`, `RDS proxy`), consider scaling that layer. + +4) Cancel long-running queries (non-destructive) + - Prefer `pg_cancel_backend(pid)` first to politely cancel the query: + ```sql + SELECT pg_cancel_backend(); + ``` + - Only use `pg_terminate_backend(pid)` if cancel fails or the backend is stuck in an unresponsive state: + ```sql + SELECT pg_terminate_backend(); + ``` + - Before terminating, confirm the `pid` is not a replication or monitoring connection (`usename`/system user). + +5) Reduce application DB pool sizes (short-term) + - Update worker/app environment to smaller `PG_POOL_MAX` and restart a small number of instances. + - Example: if workers use `PG_POOL=20`, reduce to `2–4` and restart. + +6) Re-enable ingestion carefully once headroom exists + - Gradually bring workers back online and observe connections; do not immediately restore full concurrency. + +--- + +## If you must increase capacity + +- For managed DB: request a temporary vertical scale (larger instance) or increase connection limit if provider supports + it. +- Add or scale `pgbouncer`/`proxy` to front the DB. +- These are medium-impact and require change control: document actions and monitor costs. + +--- + +## Investigations & deeper diagnostics (post-stabilize) + +Once service is restored to a stable state: + +1) Correlate with metrics + - Check `Prometheus` for connection spikes: + - `pg_connections`, `pg_active_queries`, `starforge_etl_jobs_in_progress`, `starforge_queue_jobs_waiting_total` + - Look at timeline to find the spike origin (deploy, backfill, large import, `DDoS`). + +2) Identify offending clients + - From the `pg_stat_activity` breakdown, find `application_name`, user or host contributing the most connections. + - Common culprits: misconfigured worker pool size, runaway backfill, `CI` job, monitoring tool misconfigured, bulk + `ETL`. + +3) Inspect recent deployments & migrations + - Was a migration or deploy pushed around the time of the spike? See `GitHub Actions` run and PR link. + - A broken change may cause connection leaks or extremely slow queries. + +4) Examine slow queries & missing indexes + - Use `pg_stat_statements` (if available) to find expensive queries and top query-by-total-time. 
+ - Consider adding or rebuilding indexes or rewriting queries to reduce execution time. + +5) Check `pgbouncer` / pooler settings + - Pooling mode (session / transaction / statement), `max_client_conn`, `default_pool_size` and `reserve_pool_size`. + - Use transaction pooling if the app is compatible (no session-level temp tables). + +--- + +## Permanent remediation (next actions) + +- Implement or tune a connection pooler (`pgbouncer`) in front of `Postgres`; use transaction pooling if application + permits. +- Enforce sensible per-worker connection pool sizes and cap total connections via orchestration. +- Add and enforce statement and transaction timeouts: + - `SET statement_timeout = '30s';` in application session or via DB role. +- Harden `ETL`: + - Make workers use smaller per-process pools (`PG_POOL_MAX=2–4`) and limit concurrency. + - Add resource-aware parsing (streaming) to avoid long transactions. +- Add soft quotas and backpressure: + - Throttle ingestion at `API` level when queue depth grows. + - Implement admission control for backfill jobs. +- Alerting & dashboards: + - Add alert: `pg_connections > 0.8 * max_connections` for `2m`. + - Add alert on `pg_stat_activity` long-running tx count > threshold. +- `CI` / deploy guard: + - Include migration and backfill preflight tests and throttling defaults with any code that touches DB connection + behavior. + +--- + +## Useful SQL snippets for remediation & postmortem + +```SQL +-- Show max connections and current usage +SELECT name, setting +FROM pg_settings +WHERE name IN ('max_connections', 'superuser_reserved_connections'); + +-- Top users by connection +SELECT usename, count(*) +FROM pg_stat_activity +GROUP BY usename +ORDER BY count DESC; + +-- Top application names +SELECT application_name, count(*) +FROM pg_stat_activity +GROUP BY application_name +ORDER BY count DESC; + +-- Identify connections from a given host +SELECT pid, usename, application_name, client_addr, state, backend_start, query +FROM pg_stat_activity +WHERE client_addr = '1.2.3.4'; + +-- Cancel a query +SELECT pg_cancel_backend(); + +-- Terminate a backend (last resort) +SELECT pg_terminate_backend(); +``` + +--- + +## Playbook for a common scenario (worker storm) + +1. Observation: queue depth spikes and DB connections hit max. +2. Actions: + - Pause enqueueing from API (if addable) OR reduce `API` acceptance. + - Scale down worker replicas to zero (or to 1) to immediately drain new DB connections. + - In `Kubernetes`: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` + - Watch `pg_stat_activity` drop; cancel any long-running queries if necessary. +3. Recovery: + - Fix root cause (e.g., buggy worker job loop). + - Increase worker rollout gradually: + - set `replicas=1`, observe. + - if safe, scale to normal count. + +--- + +## Post-incident: RCA & follow-up + +1. Document timeline: when the spike started, mitigation steps, who approved actions. +2. Root cause analysis: + - Identify the exact code path / job / deploy that caused the spike. + - Capture query texts and stack traces if available. +3. Remediation tasks (track as tickets): + - Add/adjust connection pooling and default pool sizes. + - Add `statement_timeout` and `idle_in_transaction_session_timeout`. + - Harden workers to backoff on DB errors and avoid retry storms. + - Improve monitoring and add guardrails (circuit-breakers, ingestion throttles). +4. Validate fix in staging and schedule a controlled rollout. 
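
For the timeout-related follow-up items above, role-level settings are usually the least invasive way to enforce them. A minimal sketch, assuming a dedicated application role (the `starforge_app` name is hypothetical; substitute your own role and values):

```sql
-- Enforce conservative timeouts for new sessions of the application role
ALTER ROLE starforge_app SET statement_timeout = '30s';
ALTER ROLE starforge_app SET idle_in_transaction_session_timeout = '60s';

-- Confirm the per-role settings are recorded
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'starforge_app';
```

Existing sessions keep their old settings; the new values apply to connections opened after the change, so a rolling restart of workers and API pods picks them up.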
+ +--- + +## Appendix: Helpful commands & references + +- Inspect active connections: + ```sql + SELECT pid, usename, application_name, client_addr, state, now()-query_start AS duration, query + FROM pg_stat_activity ORDER BY duration DESC LIMIT 50; + ``` + +- Cancel & terminate: + ```sql + SELECT pg_cancel_backend(12345); -- polite + SELECT pg_terminate_backend(12345); -- kills session + ``` + +- Check max connections: + ```sql + SHOW max_connections; + SHOW superuser_reserved_connections; + ``` + +- Check `pgbouncer` (if present): + ```sql + # Connect to pgbouncer and run: + SHOW POOLS; + SHOW CLIENTS; + SHOW SERVERS; + SHOW STATS; + ``` + +- `Kubernetes` example to reduce DB client pods: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` + +--- + +## Related docs + +- [docs/OBSERVABILITY.md](../OBSERVABILITY.md) — metrics and alert guidance +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration preflight (ensure migrations don't create connection spikes) +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — worker concurrency and connection budgeting +- [docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md](./APPLY_MIGRATIONS.md) — safe migration runbook + +--- diff --git a/docs/OP_RUNBOOKS/DB_RESTORE.md b/docs/OP_RUNBOOKS/DB_RESTORE.md new file mode 100644 index 0000000..ca90a7d --- /dev/null +++ b/docs/OP_RUNBOOKS/DB_RESTORE.md @@ -0,0 +1,298 @@ +# Runbook — Database Restore / Recovery + +--- + +## Purpose + +Step-by-step guidance to restore the StarForge `PostgreSQL` database from backups or perform a point-in-time restore +(`PITR`). This runbook is the authoritative operational procedure to recover from catastrophic failures caused by +destructive migrations, accidental deletes, or provider incidents. + +--- + +## Scope + +- Full database restore from backup snapshot / dump. +- Point-In-Time Recovery (`PITR`) to a specific timestamp (if `WAL` / `PITR` enabled). +- Promoting a replica as primary (when applicable). +- Validation and smoke checks after restore. +- Communication, escalation and post-restore actions. + +--- + +## Audience + +- `SRE` / `DevOps engineers` executing restores +- `Backend engineers` assisting with verification and data integrity checks +- `Incident commander` during recovery + +--- + +## Prerequisites & assumptions + +- You have access to cloud provider console or backups location (`S3`/`GCS`) and DB admin credentials. +- A known-good backup snapshot ID or a backup file is available. +- For `PITR` you must have `WAL` archive enabled and accessible for the target window. +- You have documented restore permissions and an incident channel for coordination. +- Pre-restore backup (freeze point) was taken as part of migration/backfill safety steps (recommended practice). + +--- + +## Emergency summary (short) + +1. Pause writes and ingestion (stop workers & `API` writes). +2. Identify reliable backup (snapshot id or latest consistent backup). +3. Restore to a new DB instance (do NOT overwrite primary unless planned). +4. Run smoke tests and validation queries. +5. Promote restored instance to primary or remap application connections. +6. Resume services in controlled manner and monitor. + +--- + +## Pre-restore checklist (must complete) + +- [ ] Notify stakeholders & open an incident channel (`#incident-db-restore`). +- [ ] Record the current state (timestamps, error messages, affected services). +- [ ] Pause ingestion and stop workers: scale worker replica count to `0` or stop processes. 
+- [ ] Capture diagnostics: `pg_stat_activity`, `error logs`, `recent migrations`, `last successful backup id`, and current `WAL` position if available. +- [ ] Confirm backup availability (snapshot id or path) and estimated time-to-restore. +- [ ] Confirm access to secrets and credentials required for restore (cloud console, DB admin). +- [ ] Confirm rollback/contingency plan and that approvers are available. + +--- + +## Types of restores + +1. Provider snapshot restore (managed DB snapshot) + - Fastest option; provider restores an instance from snapshot. + - Use when you have a full snapshot taken previously (recommended before migrations). + +2. Restore from logical dump (`pg_dump` / `pg_restore`) + - Required when snapshots not available or when restoring specific schema/data. + - Slower for large DBs; use for targeted restores. + +3. Point-In-Time Recovery (`PITR`) + - Restore base backup and apply `WAL` logs up to target timestamp. + - Use when you need to recover to a specific moment (e.g., before accidental DELETE). + +4. Promote read-replica + - If a healthy read-replica exists and is up-to-date, promote it to primary. + - Quick and safe if replica is suitable. + +--- + +## Restore workflow (recommended safest flow) + +### A. Preparation (operator) + +1. Pause writers: + - Pause ingestion `API` or flip feature flag. + - Scale down `ETL` worker replicas: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` +2. Ensure no scheduled jobs are running that will write to DB. + +3. Select restore target: + - Option 1: New instance for restored DB (recommended). + - Option 2: Restore into staging cluster first for verification (strongly recommended if time allows). + +### B. Execute restore (provider snapshot example) + +1. In provider console (`Supabase` / `RDS` / `Cloud SQL`): + - Locate snapshot id (or automated backup). + - Click "Restore" and choose a new instance name (e.g., `starforge-restore-2025-12-03`). + - Choose same region and compatible instance size. If under load, consider a larger instance temporarily. + +2. Wait for restore to complete. Time depends on DB size and provider speed. + +3. Obtain the new `DATABASE_URL` for the restored instance and restrict access (whitelist operator IPs). + +### C. Execute restore (logical dump example) + +1. Upload dump file to the restore host or use direct streaming. + +2. Create a blank DB on target host: + ```bash + psql -h -U -c "CREATE DATABASE starforge_restore;" + ``` + +3. Restore schema & data: + ```bash + pg_restore --verbose --no-owner --role= -h -U -d starforge_restore /path/to/backup.dump + ``` + +4. Monitor restore progress and `pg_restore` output for errors. + +### D. Execute PITR (when WAL available) + +`PITR` is advanced; follow provider docs or below high-level steps. + +1. Restore base backup to a new instance (base backup). +2. Configure `recovery.conf` or restore settings to point to `WAL` archive. +3. Set recovery target time: + ```bash + recovery_target_time = '2025-12-03 14:23:00+00' + ``` +4. Start instance and wait until recovery completes and database is consistent. +5. Verify recovered state matches expected time. + +--- + +### E. Promote & cutover + +1. Test the restored DB with a read-only smoke test. + - Run sanity queries: count of essential tables, sample user lookup, basic `API` health with read-only endpoint. + +2. Plan cutover strategy: + - `DNS`/connection string swap: + - Update application `DATABASE_URL` to point to restored DB (best done via secrets manager or config). 
+ - Alternatively, change read/write roles and promote the restored host to primary (provider option). + +3. Minimize downtime: + - Use a rolling approach: bring up a small application replica pointing to restored DB, validate, then gradually switch traffic. + +--- + +## Validation & verification (post-restore) + +Run these checks immediately and for the monitoring window that follows. + +1. Basic connectivity + ```bash + psql -c "SELECT 1;" + ``` + +2. Schema sanity + ```sql + SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'; + SELECT to_regclass('public.hero_snapshots'); + ``` + +3. Critical row counts / sample checks + - Users: + ```sql + SELECT count(*) FROM users; + SELECT id FROM users ORDER BY created_at DESC LIMIT 5; + ``` + - Hero snapshots: + ```sql + SELECT count(*) FROM hero_snapshots; + SELECT id, created_at FROM hero_snapshots ORDER BY created_at DESC LIMIT 5; + ``` + +4. Application smoke test (read & write) + - With a staging `API` instance, call: + - `GET /api/health` + - `GET /api/v1/profile/summary/` + - (If read-write allowed) insert a small test snapshot and ensure it processes. + +5. `ETL` smoke test + - Start a single worker against restored DB and process a sample snapshot: + - Confirm `hero_snapshots.processed_at` is set and `user_profile_summary` updated. + - Ensure worker logs show no fatal errors. + +6. Consistency checks + - Referential integrity: + ```sql + SELECT count(*) FROM user_troops WHERE user_id IS NULL; + ``` + - Uniqueness constraints: + ```sql + SELECT user_id, troop_id, count(*) FROM user_troops GROUP BY user_id, troop_id HAVING count(*) > 1 LIMIT 10; + ``` + +7. Application-scale verification + - Monitor metrics: `API` latency, DB connections, `ETL` error rate for at least `30–60 minutes` before full cutover. + +--- + +## Rollback & contingency during restore + +- Do not overwrite the original primary instance immediately. Always restore to a new instance first. +- If restored DB is invalid, abort the cutover and iterate (try a different snapshot or time). +- If primary must be reinstated, you can revert application connection strings to the original `DATABASE_URL`. + +--- + +## Post-restore steps and follow-up + +- Document the restore: snapshot id, who executed, timestamps, and verification outputs. Attach to incident ticket. +- Run a full application regression test in staging, then in production for critical flows. +- Re-enable workers and ingestion progressively: + - Start a small number of workers (`concurrency=1`), monitor, then scale. +- Rebuild or re-apply any migrations that are required and were not present in the restored point (coordinate migration history). +- Run a backfill for any missing or late data if necessary (use [docs/OP_RUNBOOKS/BACKFILL.md](./BACKFILL.md)). + +--- + +## Security & access considerations + +- Rotate credentials if restore was caused by a security incident (see [docs/OP_RUNBOOKS/SECRET_COMPROMISE.md](./SECRET_COMPROMISE.md) runbook). +- Audit who accessed backups and restoration artifacts. +- Limit access to restored instance until verified. + +--- + +## Troubleshooting common issues + +- Restore taking too long: + - Option: restore into a larger instance type to speed `IO`. + - For logical restores, use parallel restore with `pg_restore -j `. + +- `WAL` unavailable for `PITR`: + - If `WAL` segment missing, `PITR` cannot reach target time. Consider restoring to the last available `WAL` and then applying compensating actions. 
- Consult the provider's support for `WAL` retrieval if they archive it.

- Errors during `pg_restore` (permission, role problems):
    - Re-run with `--no-owner --role=<role>` and ensure the required roles exist.
    - Create missing roles temporarily or adjust dump options.

- Post-restore replication / replica issues:
    - If promoting a replica, ensure replication slots are cleaned and standby configs updated.
    - If using pgbouncer, update its server config to point to the new primary.

---

## Communication template (incident updates)

- Initial alert (short):
    - "Incident: DB outage — starting restore. Using snapshot `<snapshot-id>`. `ETA` to service: `~<N> minutes`. Channel: `#incident-db-restore`."

- Progress update:
    - "Restore progress: snapshot restore completed / `pg_restore` at `40%` / `PITR` applying `WALs` — `ETA ~20m`. Next: smoke tests."

- Resolution:
    - "Restore complete. Restored instance: `<instance-name>`. Smoke tests passed. Cutover started at