diff --git a/docs/DB_MODEL.md b/docs/DB_MODEL.md index 79c9bc8..009319e 100644 --- a/docs/DB_MODEL.md +++ b/docs/DB_MODEL.md @@ -1,4 +1,4 @@ -# StarForge — Canonical Database Model (DB_MODEL.md) +# StarForge — Canonical Database Model > This document is the canonical reference for the database schema used by StarForge. It expands the summary in the PRD > and contains: diff --git a/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md b/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md new file mode 100644 index 0000000..c76ed7c --- /dev/null +++ b/docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md @@ -0,0 +1,283 @@ +# Runbook — Apply Database Migrations + +--- + +## Purpose + +This runbook describes the safe, repeatable procedure to apply schema migrations to production (and staging) +environments for the StarForge project. It consolidates the guidance in [docs/MIGRATIONS.md](../MIGRATIONS.md) and +provides a concise, +actionable checklist, commands, verification queries and troubleshooting steps for operators and engineers. + +--- + +## Audience + +- `SRE` / `Ops engineers` executing production changes +- `Backend engineers` owning migration PRs +- `Release approvers` and on-call engineers + +--- + +## When to use + +- To apply migration PRs to staging or production. +- To run emergency schema changes after appropriate approvals. +- To validate migrations that were applied by `CI` (post-apply verification). + +--- + +## Principles (short) + +- Always be safe: take a backup before applying production migrations. +- Use the protected `CI` workflow (`db-bootstrap`) for production when possible — it enforces approvals and records + artifacts. +- Prefer additive, phased migrations: add → backfill → enforce. +- Monitor system health during and after migration; be prepared to rollback or restore. + +--- + +## Pre-flight checklist (must pass before apply) + +1. Backup & snapshot + - Create and record a DB backup/snapshot ID. Verify the backup was successful. + - Document the backup ID in the migration ticket/PR. + +2. Approvals + - Confirm the GitHub `db-bootstrap` workflow will run under a protected environment (requires approval). + - Ensure required approvers (`Engineering` + `SRE`) are available during the window. + +3. `CI` Preflight & Tests + - Ensure `CI` `migrate:preflight` job passed for the migration PR (applied against ephemeral DB). + - Confirm unit/integration tests and migration tests passed in `CI`. + +4. Runbook & Impact + - Confirm the migration PR includes impact estimates (row counts, index build time). + - Confirm any required maintenance window or low-traffic window is scheduled if the change is heavy. + +5. Communication + - Notify stakeholders (product, support) of the planned maintenance window and expected `ETA`. + - Post the planned change with contact / pager information. + +6. Operational readiness + - Ensure `SRE` on-call is available. + - Confirm ability to pause ingestion and workers (feature flag or scaledown commands). + - Confirm you can run the rollback / restore plan and have the runbook open. + +--- + +## How to apply (recommended: GitHub Actions / db-bootstrap) + +Use the repository's protected `db-bootstrap` GitHub Actions workflow. This is the preferred and auditable path. + +1. Open the PR with migration files and ensure it includes `MIGRATION: ` in commit message. +2. From the repository Actions tab, locate `db-bootstrap` (or run `workflow_dispatch`). +3. Provide required inputs (if any) and kick off the workflow. +4. 
Approver(s): Approve the environment prompt to let the workflow run against production. + - The workflow will run preflight checks, run migrations, and surface logs. + +> Notes: +> - The workflow requires the secret `DATABASE_URL` in `GitHub environment` and the `apply` confirmation input to + actually apply. +> - The workflow logs and artifacts will be retained in `GitHub Actions` and should be saved for audit. + +--- + +## How to apply (alternative: manual via CLI) + +Only use if GitHub Actions is not available. Prefer scripted, idempotent commands. + +1. Pause ingestion & workers + - Disable ingestion if possible (`API` feature flag) or scale down workers to 0. + - Example (`Kubernetes`): scale down `etl worker` deployment: + - `kubectl scale deployment etl-worker --replicas=0 -n starforge` + - For non-`K8s`: stop worker processes or toggle feature flag. + +2. Ensure you have a recent backup. + +3. Run preflight locally (optional but recommended) + - `./scripts/migrate-preflight.sh` + - Validate `pgcrypto` availability, connectivity, and run a smoke `ETL`. + +4. Run migrations + - From repo root: + - `pnpm install --frozen-lockfile` + - `pnpm run migrate:up -- --config database/migration-config.js --env production` + - If migrations require `CONCURRENTLY` for indexes, follow documented instructions — they may run outside a + transaction. + +5. Collect logs and proceed to verification. + +--- + +## Post-apply verification (smoke & health checks) + +Immediately after migrations finish, run the following checks before re-enabling full traffic: + +1. Schema migration table + - Verify applied migrations: + - `SELECT * FROM node_pgmigrations_schema_version ORDER BY installed_on DESC;` + - (or check `schema_migrations` / the table created by `node-pg-migrate` in your setup) + +2. Basic DB health + - Active connections: + - `SELECT count(*) FROM pg_stat_activity;` + - Long running transactions: + - + `SELECT pid, now() - xact_start AS duration, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY duration DESC LIMIT 10;` + +3. Verify key tables/indexes exist (examples) + - Check table existence: + - `SELECT to_regclass('public.hero_snapshots');` + - Check index existence: + - `SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'hero_snapshots';` + +4. Run a smoke `ETL` + - Insert a small test snapshot (or use sample fixture) into `hero_snapshots` and enqueue a processing job. + - Verify worker processes it (if workers are still paused, re-enable a single worker instance temporarily). + - Confirm `user_profile_summary` and `user_troops` upserts succeed. + - Helpful SQL: + - + `SELECT id, processed_at, error_count, last_error FROM hero_snapshots WHERE created_at > now() - interval '10 minutes' ORDER BY created_at DESC LIMIT 20;` + +5. Check `etl_errors` and logs + - Recent errors: + - `SELECT * FROM etl_errors ORDER BY created_at DESC LIMIT 50;` + - Check application logs and `Prometheus` metrics (`starforge_etl_snapshots_processed_total`, + `starforge_etl_snapshots_failed_total`). + +6. Monitor metrics & dashboards + - Watch `ETL` processing rate and failure rate for at least `30–60 minutes`. + - Verify DB CPU, IO, and connection metrics are within expected ranges. + +--- + +## Rollback & emergency restore (decision flow) + +If migration causes critical failures or data corruption, follow this decision flow: + +1. 
If migration has a safe `down` script, and you are confident its execution will restore safe state: + - Run the down migration for the offending change: + - `pnpm run migrate:down -- --count 1 -- --config database/migration-config.js` + - Note: downs may not be safe for destructive changes. Ensure you understand side effects. + +2. If `down` is unsafe, perform a DB restore from the backup created before the migration: + - Stop ingestion and workers immediately. + - Restore the DB from the backup snapshot ID recorded earlier (follow provider-specific restore steps). + - Notify stakeholders and follow restore verification steps (same as post-apply verification). + - After recovery, coordinate re-deploy of any fixes and controlled re-apply if required. + +3. Communicate clearly: + - Open an incident ticket and notify on-call and stakeholders. + - Document timeline and actions taken in the migration PR or incident system. + +--- + +## Common failure modes & mitigations + +- CREATE EXTENSION / pgcrypto permission denied + - Symptom: migration fails on CREATE EXTENSION. + - Mitigation: check provider support. If unsupported, use application-side `UUID` generation or request provider + privileges. Revert/skip extension creation if planned. + +- Long-running CONCURRENTLY index builds causing resource exhaustion + - Symptom: elevated IO/CPU, slowed queries. + - Mitigation: monitor index build, spread index creation to off-peak hours, throttle background jobs, or create + indexes on replicas if available. + +- Migration wrapped in transaction (CONCURRENTLY disallowed) + - Symptom: failure when attempting CREATE INDEX CONCURRENTLY inside transaction. + - Mitigation: separate that step into a non-transactional migration (use `pgm.sql` outside a transaction) as + documented in [docs/MIGRATIONS.md](../MIGRATIONS.md). + +- DB connection exhaustion + - Symptom: connection errors, failed migrations. + - Mitigation: reduce migration concurrency, pause workers, increase DB capacity temporarily, use pgbouncer. + +- Partial backfill failures + - Symptom: backfill job errors after schema change. + - Mitigation: investigate error, re-run backfill with smaller batches, record failures in `backfill_jobs` for + resume. 
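
The first two failure modes above can be checked cheaply before and during the apply. A minimal sketch, assuming a PostgreSQL 12+ instance with the standard catalog views (adjust to your environment):

```sql
-- Is pgcrypto available on this instance, and is it already installed?
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pgcrypto';

-- Progress of any in-flight CREATE INDEX [CONCURRENTLY] builds (PostgreSQL 12+)
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;
```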
+ +--- + +## Recommended verification queries (copy/paste) + +```SQL +-- Check last applied migrations (adjust to your migration table name) +SELECT * +FROM schema_migrations +ORDER BY installed_on DESC LIMIT 20; + +-- Check hero_snapshots presence +SELECT to_regclass('public.hero_snapshots') as hero_snapshots_exists; + +-- Check GIN index presence for hero_snapshots.raw +SELECT indexname, indexdef +FROM pg_indexes +WHERE tablename = 'hero_snapshots' + AND indexdef ILIKE '%gin%'; + +-- Show recent etl_errors +SELECT id, snapshot_id, error_type, message, created_at +FROM etl_errors +ORDER BY created_at DESC LIMIT 50; + +-- Recent processed snapshots +SELECT id, user_id, processed_at, error_count +FROM hero_snapshots +WHERE processed_at IS NOT NULL +ORDER BY processed_at DESC LIMIT 20; +``` + +--- + +## Operational checklist (concise) + +- [ ] Backup created and backup ID recorded +- [ ] Approvals obtained +- [ ] `CI` preflight passed +- [ ] Maintenance window & communications sent +- [ ] Workers paused / ingestion throttled +- [ ] Run migrations via `CI` (preferred) or manual `CLI` +- [ ] Run post-apply verification queries +- [ ] Run smoke `ETL` and confirm no errors +- [ ] Monitor metrics for `30–60 minutes` +- [ ] Re-enable workers and normal traffic +- [ ] Record migration outcome and attach logs/artifacts to PR + +--- + +## Audit & artifacts + +- Keep GitHub Actions logs and artifacts (workflow run ID) attached to the migration PR. +- Record backup snapshot ID and any run IDs (backfill job IDs) in the PR and change log. +- Save verification query outputs into the PR comments or incident log for traceability. + +--- + +## Contacts & escalation + +- Primary `SRE`: (fill with on-call contact or team alias) +- Backend owner(s): from migration PR +- Pager / Slack channel: `#starforge-ops` +- If severe outage: page `SRE` and Engineering leads immediately. + +--- + +## References + +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration conventions and patterns +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — `ETL` contract and smoke test guidance +- [docs/DB_MODEL.md](../DB_MODEL.md) — canonical schema +- `scripts/migrate-preflight.sh` — preflight helper script +- `GitHub Actions` workflow: `.github/workflows/db-bootstrap.yml` + +--- + +## Notes + +- This runbook is intended to be concise and actionable. For complex or high-risk migrations (large table rewrites, + partitioning, destructive changes), prepare a migration plan that includes a rehearsal run in staging, detailed + backfill scripts, and extended monitoring windows. +- Keep the runbook updated with contact details and any provider-specific restore steps. diff --git a/docs/OP_RUNBOOKS/BACKFILL.md b/docs/OP_RUNBOOKS/BACKFILL.md new file mode 100644 index 0000000..332b7a7 --- /dev/null +++ b/docs/OP_RUNBOOKS/BACKFILL.md @@ -0,0 +1,406 @@ +# Runbook — Backfill Historical Snapshots + +## Purpose + +This runbook explains how to safely run a backfill of historical `hero_snapshots` into the normalized schema +(`user_troops`, `user_pets`, `user_artifacts`, `user_profile_summary`, etc.). It covers planning, preflight checks, safe +execution (staging → canary → production), verification, throttling recommendations and troubleshooting. 
+ +--- + +## Audience + +- `SRE` / `DevOps` running backfill jobs +- `Backend` / `Data engineers` implementing backfill jobs +- `QA engineers` validating results +- On-call engineers responding to backfill incidents + +--- + +## When to use + +- Populate normalized tables from archived or historical `hero_snapshots`. +- Rebuild derived tables after schema or `ETL` mapping changes. +- Resume or resume a previously interrupted backfill. + +--- + +## High-level strategy + +1. Use the same `ETL` worker codepath for backfill as for real-time processing to guarantee parity. +2. Run backfill in small, resumable batches; track progress in a `backfill_jobs` / `queue_jobs` table. +3. Run backfill on an isolated worker pool (separate from real-time workers) and throttle to protect the primary DB. +4. Validate results via sample parity checks and automated queries; keep raw snapshots immutable for replay. + +--- + +## Prerequisites & assumptions + +- You have a recent DB backup (required before production backfills). +- `hero_snapshots` contains raw JSONB payloads to process. +- `ETL` worker code is tested and supports idempotent reprocessing. +- A `backfill_jobs` / `queue_jobs` mechanism exists to schedule and track jobs ( + see [docs/MIGRATIONS.md](../MIGRATIONS.md)). +- Observability: `Prometheus` metrics and logs are configured for `ETL`, DB, and workers. + +--- + +## Preflight checklist (must pass) + +- [ ] Backup taken and backup ID recorded. +- [ ] Approvals obtained (`Engineering + SRE`; product if user-impacting). +- [ ] Run dry-run in staging with representative sample payloads. +- [ ] Verify worker code handles large snapshots with streaming parsing. +- [ ] Confirm catalog seed coverage (`troop_catalog`, `pet_catalog`) or fallback behavior (placeholders). +- [ ] Confirm capacity limits: DB max connections, `CPU`, `IO`, and estimated snapshot throughput. +- [ ] Confirm monitoring dashboards and alerts are active (`ETL` failure rate, queue depth, DB connections). +- [ ] Communication: inform support and stakeholders of planned windows and contact points. + +--- + +## Planning & capacity estimation + +### Estimate runtime: + +- Measure average processing time per snapshot (`t_avg`) from staging dry-run (seconds). +- Total snapshots (N). +- Desired wall time (`T_target`) and concurrency (C) estimate: + - `T_estimate = (N * t_avg) / C` + - Pick C so DB connection usage and IO remain under safe thresholds. + +### Example: + +- `t_avg = 12s`, `N = 10,000 snapshots`, `C = 10` → `T_estimate ≈ (10_000 * 12) / 10 = 12,000s ≈ 3.3 hours`. + +### Batch sizing + +- Use per-user or per-snapshot batches; recommended batch sizes: `50–500 snapshots` per DB transaction depending on + payload size and DB load. +- For very large snapshots (`2–3MB`) prefer smaller batches (`10–50`). + +--- + +## Execution modes + +- Dry-run (staging): run backfill on a representative sample and validate outputs. +- Canary (production small slice): run backfill for a small time range or subset of users (e.g., last 7 days or + whitelisted namecodes). +- Gradual production: increase coverage and concurrency in controlled steps with monitoring gates. +- One-shot (not recommended): only for tiny datasets or pre-approved maintenance windows. + +--- + +## How to schedule a backfill (examples) + +> Note: adjust commands to your orchestration (Kubernetes, systemd, or runbooks). 
+ +A) Enqueue via `queue_jobs` (DB-backed job table) + +- Insert a backfill-range job row: + +```sql +INSERT INTO queue_jobs (id, type, payload, priority, attempts, max_attempts, status, run_after, created_at, updated_at) +VALUES (gen_random_uuid(), + 'backfill_range', + jsonb_build_object( + 'from_created_at', '2024-01-01T00:00:00Z', + 'to_created_at', '2024-06-01T00:00:00Z', + 'batch_size', 250, + 'owner', 'data-team' + )::jsonb, + 0, + 0, + 5, + 'pending', + now(), + now(), + now()); +``` + +- Workers configured to process `backfill_range` should pick jobs and create internal `backfill_jobs` checkpoints. + +B) Use admin API (if implemented) + +```http +POST /api/v1/admin/backfill +Authorization: Bearer +Content-Type: application/json + +{ + "from": "2024-01-01T00:00:00Z", + "to": "2024-06-01T00:00:00Z", + "batch_size": 250, + "concurrency": 8, + "owner": "data-team" +} +``` + +--- + +## Dry-run (staging) procedure + +1. Select sample snapshots (small, medium, large, malformed cases). Example SQL to pick samples: + +```sql +-- 10 random snapshots across size buckets +WITH sizes AS (SELECT id, size_bytes, ntile(3) OVER (ORDER BY size_bytes) AS bucket + FROM hero_snapshots + WHERE created_at < now()) +SELECT id +FROM sizes +WHERE bucket = 1 +ORDER BY random() LIMIT 3 +UNION ALL +SELECT id +FROM sizes +WHERE bucket = 2 +ORDER BY random() LIMIT 3 +UNION ALL +SELECT id +FROM sizes +WHERE bucket = 3 +ORDER BY random() LIMIT 4; +``` + +2. Start a staging worker cluster with the same code and configuration you will use in production (but reduced + concurrency). +3. Enqueue these sample snapshot ids as backfill jobs (or trigger reprocess) and observe: + - Memory usage, processing time, and DB operations. + - ETL emits `snapshot_processed` events and no unhandled exceptions. + +--- + +## Canary procedure (production small slice) + +1. Pick a narrow range (e.g., 1 day or 500 users) or whitelisted namecodes. +2. Run the backfill job for the slice with conservative concurrency (C = 1–4). +3. Monitor for `30–60 minutes`: + - `ETL` failure rate (should be near zero). + - DB `CPU/IO` and connection count (no spike above safe thresholds). + - Application latencies and errors. + +--- + +## Production rollout (gradual) + +1. Start with a small concurrency and slice. +2. If gate checks pass after observation window, scale up: + - Increase number of concurrent workers or batch size incrementally. + - Expand the date range or number of users processed. +3. Continue until full coverage achieved. + +--- + +## Progress tracking & resume + +- Maintain a `backfill_jobs` table with fields: + - job_id, owner, from_ts, to_ts, batch_size, concurrency, status (pending|running|paused|failed|done), + processed_count, error_count, last_checkpoint, started_at, finished_at. +- Use `processed_count` and `last_checkpoint` to resume from last successful snapshot on failure. + +Sample SQL to inspect progress: + +```sql +SELECT job_id, status, processed_count, error_count, started_at, finished_at +FROM backfill_jobs +ORDER BY started_at DESC LIMIT 50; +``` + +--- + +## Validation & verification (sample queries) + +Run automated checks during and after backfill to validate correctness. + +1. Processed counts: + +```sql +SELECT COUNT(*) +FROM hero_snapshots +WHERE processed_at IS NOT NULL + AND processed_at >= now() - interval '1 day'; +``` + +2. 
ETL errors (investigate top error types):

```sql
SELECT error_type, count(*)
FROM etl_errors
WHERE created_at >= now() - interval '1 hour'
GROUP BY error_type
ORDER BY count DESC;
```

3. Spot-check data parity for an example snapshot:

- Extract a small piece of truth from the raw JSON and compare it to the normalized table.

```sql
-- Raw: retrieve troop entries for a snapshot (example JSON path may vary)
SELECT raw -> 'ProfileData' -> 'Troops' AS troops
FROM hero_snapshots
WHERE id = '<snapshot_id>';

-- Normalized: compare total troop rows for the user
SELECT count(*)
FROM user_troops
WHERE user_id = (SELECT user_id FROM hero_snapshots WHERE id = '<snapshot_id>');
```

4. Uniqueness checks:

```sql
-- Ensure no duplicate user_troops per (user_id, troop_id)
SELECT user_id, troop_id, COUNT(*)
FROM user_troops
GROUP BY user_id, troop_id
HAVING COUNT(*) > 1 LIMIT 50;
```

5. Summary generation validation:

```sql
-- Ensure profile_summary exists for sampled users
SELECT u.id, ups.user_id IS NOT NULL AS has_summary
FROM users u
         LEFT JOIN user_profile_summary ups ON u.id = ups.user_id
WHERE u.id IN (<sampled_user_ids>);
```

---

## Monitoring & alerts (what to watch)

- `ETL` metrics:
    - `starforge_etl_snapshots_processed_total` (throughput)
    - `starforge_etl_snapshots_failed_total` (failures)
    - `starforge_etl_processing_duration_seconds` (latency)
- Queue metrics: queue depth and job age
- DB metrics: active connections, long-running transactions, `CPU`, `IOPS`
- Worker metrics: memory usage, restarts
- Alerts:
    - Failure rate > `1%` sustained → page on-call.
    - Queue depth growing unexpectedly → investigate consumer capacity.
    - DB connections > `80%` of max → pause backfill and scale DB or reduce concurrency.

---

## Throttling & safety knobs

- Reduce concurrency (worker count) if DB connections climb.
- Reduce batch size if individual transactions are slow or if FK violations appear.
- Pause the backfill job(s) by updating `backfill_jobs.status = 'paused'` or removing pending queue entries.
- Temporarily scale down real-time workers to free DB capacity (do this carefully to avoid service disruption).

---

## Common failure modes & remediation

- Worker `OOM` / restarts
    - Action: reduce batch size, process large snapshots separately, increase worker memory for the dedicated backfill
      pool.
- FK violations (missing catalog rows)
    - Action: either create safe placeholder rows in catalogs (`troop_catalog`, `pet_catalog`) or capture failing
      snapshot ids for manual review; do not delete other processed data.
- DB connection exhaustion
    - Action: pause backfill, scale DB or pgbouncer, reduce per-worker pool size, resume with lower concurrency.
- High `etl_errors` rate (`PARSE_ERROR`)
    - Action: capture sample failing snapshot raw JSON to `docs/examples/quarantine` or S3 for developer analysis; pause
      automated reprocessing until a fix is deployed.
- Long-running index builds or blocking operations
    - Action: monitor `pg_stat_activity` and cancel offending queries if safe; consider scheduling heavy DDL for
      maintenance windows.

---

## Rollback & recovery

Backfill itself is idempotent; use these steps if unacceptable data changes occur:

1. Pause backfill (stop workers).
2. If corruption is localized and fixable by reprocessing (e.g., a mapping bug):
    - Fix `ETL` code.
    - Re-enqueue affected snapshot ids or run targeted reprocessing.
3. If destructive corruption occurred (rare), restore the DB from the pre-backfill backup:
    - Follow the [DB_RESTORE.md](./DB_RESTORE.md) runbook.
+ - Re-run controlled backfill with corrected process. +4. Document actions and notify stakeholders. + +--- + +## Post-backfill tasks + +- Run full validation suite (sampling and aggregates) and record reports/artifacts. +- Mark `backfill_jobs` as `done` and record `finished_at`. +- Incrementally remove any temporary placeholder catalog rows if created and update catalog with authoritative data. +- Update dashboards to reflect production read usage from the normalized tables. +- Archive logs and push final reports into PR or release artifacts. + +--- + +## Artifacts & audit + +Store the following artifacts for auditing and troubleshooting: + +- Backup snapshot ID used before backfill. +- Backfill job records (`backfill_jobs` rows). +- `ETL` logs for the backfill period (worker logs). +- A report with sample verification queries and their results. +- Links to any `S3` objects for quarantined snapshots. + +--- + +## Contacts & escalation + +- Primary `SRE` / on-call: (replace with team alias) `#starforge-ops` / pager +- Backend owner(s): check migration/backfill PR +- Data owner: `data-team` (as recorded in job payload) +- Security contact: (security officer alias) + +--- + +## Appendix: helpful SQL snippets + +```SQL +-- Pause future job processing by marking pending backfill jobs paused +UPDATE backfill_jobs +SET status = 'paused' +WHERE status = 'pending'; + +-- Resume a paused backfill job +UPDATE backfill_jobs +SET status = 'running' +WHERE job_id = '' + AND status = 'paused'; + +-- Find snapshots not processed (candidate for backfill) +SELECT id, created_at, size_bytes +FROM hero_snapshots +WHERE processed_at IS NULL +ORDER BY created_at ASC LIMIT 1000; + +-- Get top ETL error types in last 24h +SELECT error_type, COUNT(*) +FROM etl_errors +WHERE created_at >= now() - interval '24 hours' +GROUP BY error_type +ORDER BY COUNT DESC; + +-- Check long-running transactions +SELECT pid, usename, now() - xact_start AS duration, query +FROM pg_stat_activity +WHERE xact_start IS NOT NULL + AND now() - xact_start > interval '1 minute' +ORDER BY duration DESC LIMIT 20; +``` + +--- + +## Related documents + +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration conventions and safe patterns +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — ETL worker contract and upsert patterns +- [docs/DB_MODEL.md](../DB_MODEL.md) — canonical schema and table definitions +- [docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md](./APPLY_MIGRATIONS.md) — apply migrations runbook +- `scripts/migrate-preflight.sh` — preflight helper + +--- diff --git a/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md b/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md new file mode 100644 index 0000000..278af13 --- /dev/null +++ b/docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md @@ -0,0 +1,352 @@ +# Runbook — Postgres Connection Exhaustion + +--- + +## Purpose + +Immediate, actionable steps to diagnose and mitigate a `Postgres` (or managed `Postgres`) connection exhaustion incident +for StarForge. This runbook is for on-call `SREs` and `backend engineers` to restore service quickly and safely, then +remediate root causes. + +--- + +## Scope + +- Symptoms covered: new connections failing, application errors like "too many connections", high `pg_stat_activity` + connection counts, `pgbouncer`/`connection-pool` saturation, or managed provider connection limits reached. +- Assumes metrics & logging per [docs/OBSERVABILITY.md](../OBSERVABILITY.md) are available and that you have access to + DB admin credentials + and the ability to scale/pause workers. 
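
As a quick confirmation of the symptom, a single query shows how close the instance is to its configured limit; the fuller diagnostics below then break usage down by user and application. A minimal sketch (read-only; run from any session that can still connect):

```sql
-- Connection headroom at a glance
SELECT (SELECT count(*) FROM pg_stat_activity)             AS current_connections,
       current_setting('max_connections')::int             AS max_connections,
       round(100.0 * (SELECT count(*) FROM pg_stat_activity)
             / current_setting('max_connections')::int, 1) AS pct_used;
```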
+ +--- + +## Emergency summary (short) + +1. Acknowledge the alert and notify stakeholders (#starforge-ops). +2. Quickly reduce incoming load (pause ingestion and scale down workers). +3. Identify & cancel long-running queries; prefer cancel over terminate. +4. If safe, increase connection resources (`pgbouncer` scaling or DB scaling) while preparing a long-term fix. +5. Run postmortem and implement permanent controls (connection pooler, limits, timeouts). + +--- + +## Contacts / escalation + +- Primary on-call `SRE`: #starforge-ops (`Pager`) +- Backend owner(s): from recent PR/migration +- Engineering lead / DB admin: `` +- Security contact: `security@org` (if suspicious activity) + +--- + +## Triage checklist (first 5 minutes) + +- [ ] Acknowledge `PagerDuty` / alert. +- [ ] Set a dedicated incident channel (e.g. `#incident-dbconn-)`. +- [ ] Find current symptom: application errors, `502/503`, DB-side rejects, or elevated queue length. +- [ ] If possible, temporarily block new writes / `API` ingestion (feature flag) and lower worker concurrency. + +--- + +## Important safety note + +Always avoid destructive actions you don't understand. Prefer conservative actions (pause, cancel, scale) and keep a +clear log of commands run. If unsure about a connection/user, consult the backend owner before terminating connections. + +--- + +## Quick diagnostics (commands) + +Run these from a bastion / `CI` / workstation with `psql` access (replace `DATABASE_URL` or connection params). + +1) Check DB max connections and current counts + +```sql +-- Show configured max connections +SHOW +max_connections; + +-- Current active connection count +SELECT count(*) AS total_connections +FROM pg_stat_activity; + +-- Breakdown by state +SELECT state, count(*) +FROM pg_stat_activity +GROUP BY state; + +-- Connections by application_name / user / client_addr +SELECT application_name, usename, client_addr, count(*) AS c +FROM pg_stat_activity +GROUP BY application_name, usename, client_addr +ORDER BY c DESC LIMIT 50; +``` + +2) Find long-running queries / transactions + +```sql +-- Longest running queries +SELECT pid, usename, application_name, client_addr, now() - query_start AS duration, state, query +FROM pg_stat_activity +WHERE state <> 'idle' + AND query_start IS NOT NULL +ORDER BY duration DESC LIMIT 50; + +-- Long transactions that may hold locks +SELECT pid, usename, now() - xact_start AS tx_duration, query +FROM pg_stat_activity +WHERE xact_start IS NOT NULL +ORDER BY tx_duration DESC LIMIT 50; +``` + +3) Check locks and waiting queries + +```sql +-- Waiting queries +SELECT pid, wait_event_type, wait_event, state, query_start, query +FROM pg_stat_activity +WHERE wait_event IS NOT NULL +ORDER BY query_start; + +-- Inspect pg_locks for blocking relationships +SELECT blocked_locks.pid AS blocked_pid, + blocked_activity.usename AS blocked_user, + blocking_locks.pid AS blocking_pid, + blocking_activity.usename AS blocking_user, + blocking_activity.query AS blocking_query +FROM pg_locks blocked_locks + JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid + JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype + AND blocking_locks.database IS NOT DISTINCT +FROM blocked_locks.database + AND blocking_locks.relation IS NOT DISTINCT +FROM blocked_locks.relation + AND blocking_locks.page IS NOT DISTINCT +FROM blocked_locks.page + AND blocking_locks.tuple IS NOT DISTINCT +FROM blocked_locks.tuple + AND blocking_locks.virtualxid IS NOT DISTINCT +FROM 
blocked_locks.virtualxid + AND blocking_locks.transactionid IS NOT DISTINCT +FROM blocked_locks.transactionid + JOIN pg_stat_activity blocking_activity +ON blocking_activity.pid = blocking_locks.pid +WHERE NOT blocked_locks.GRANTED; +``` + +--- + +## Immediate mitigation steps (safe, prioritized) + +Follow in order until connections recover. + +1) Pause ingestion / reduce incoming traffic (very high impact but safe) + - Flip feature flag that accepts new snapshots (`API`) to OFF, or configure `API` to return `503` for ingestion + endpoints. + - Announce to support: "ingestion paused, attempting mitigation". + +2) Scale down background workers (free DB connections) + - If using `Kubernetes`: + ```bash + # reduce worker replicas quickly (example) + kubectl -n starforge scale deployment etl-worker --replicas=0 + kubectl -n starforge scale deployment api --replicas= # careful if API needs DB + ``` + - If non-`K8s`: stop or pause worker processes/containers, or set worker concurrency env var to `0/1` then restart. + +3) If using `pgbouncer` or a pooler, check / scale pooler + - For `pgbouncer`: check stats and connection usage; if `pgbouncer` itself is saturated, increase instances or pool + size. + - If using managed pooling (`Supabase`, `RDS proxy`), consider scaling that layer. + +4) Cancel long-running queries (non-destructive) + - Prefer `pg_cancel_backend(pid)` first to politely cancel the query: + ```sql + SELECT pg_cancel_backend(); + ``` + - Only use `pg_terminate_backend(pid)` if cancel fails or the backend is stuck in an unresponsive state: + ```sql + SELECT pg_terminate_backend(); + ``` + - Before terminating, confirm the `pid` is not a replication or monitoring connection (`usename`/system user). + +5) Reduce application DB pool sizes (short-term) + - Update worker/app environment to smaller `PG_POOL_MAX` and restart a small number of instances. + - Example: if workers use `PG_POOL=20`, reduce to `2–4` and restart. + +6) Re-enable ingestion carefully once headroom exists + - Gradually bring workers back online and observe connections; do not immediately restore full concurrency. + +--- + +## If you must increase capacity + +- For managed DB: request a temporary vertical scale (larger instance) or increase connection limit if provider supports + it. +- Add or scale `pgbouncer`/`proxy` to front the DB. +- These are medium-impact and require change control: document actions and monitor costs. + +--- + +## Investigations & deeper diagnostics (post-stabilize) + +Once service is restored to a stable state: + +1) Correlate with metrics + - Check `Prometheus` for connection spikes: + - `pg_connections`, `pg_active_queries`, `starforge_etl_jobs_in_progress`, `starforge_queue_jobs_waiting_total` + - Look at timeline to find the spike origin (deploy, backfill, large import, `DDoS`). + +2) Identify offending clients + - From the `pg_stat_activity` breakdown, find `application_name`, user or host contributing the most connections. + - Common culprits: misconfigured worker pool size, runaway backfill, `CI` job, monitoring tool misconfigured, bulk + `ETL`. + +3) Inspect recent deployments & migrations + - Was a migration or deploy pushed around the time of the spike? See `GitHub Actions` run and PR link. + - A broken change may cause connection leaks or extremely slow queries. + +4) Examine slow queries & missing indexes + - Use `pg_stat_statements` (if available) to find expensive queries and top query-by-total-time. 
+ - Consider adding or rebuilding indexes or rewriting queries to reduce execution time. + +5) Check `pgbouncer` / pooler settings + - Pooling mode (session / transaction / statement), `max_client_conn`, `default_pool_size` and `reserve_pool_size`. + - Use transaction pooling if the app is compatible (no session-level temp tables). + +--- + +## Permanent remediation (next actions) + +- Implement or tune a connection pooler (`pgbouncer`) in front of `Postgres`; use transaction pooling if application + permits. +- Enforce sensible per-worker connection pool sizes and cap total connections via orchestration. +- Add and enforce statement and transaction timeouts: + - `SET statement_timeout = '30s';` in application session or via DB role. +- Harden `ETL`: + - Make workers use smaller per-process pools (`PG_POOL_MAX=2–4`) and limit concurrency. + - Add resource-aware parsing (streaming) to avoid long transactions. +- Add soft quotas and backpressure: + - Throttle ingestion at `API` level when queue depth grows. + - Implement admission control for backfill jobs. +- Alerting & dashboards: + - Add alert: `pg_connections > 0.8 * max_connections` for `2m`. + - Add alert on `pg_stat_activity` long-running tx count > threshold. +- `CI` / deploy guard: + - Include migration and backfill preflight tests and throttling defaults with any code that touches DB connection + behavior. + +--- + +## Useful SQL snippets for remediation & postmortem + +```SQL +-- Show max connections and current usage +SELECT name, setting +FROM pg_settings +WHERE name IN ('max_connections', 'superuser_reserved_connections'); + +-- Top users by connection +SELECT usename, count(*) +FROM pg_stat_activity +GROUP BY usename +ORDER BY count DESC; + +-- Top application names +SELECT application_name, count(*) +FROM pg_stat_activity +GROUP BY application_name +ORDER BY count DESC; + +-- Identify connections from a given host +SELECT pid, usename, application_name, client_addr, state, backend_start, query +FROM pg_stat_activity +WHERE client_addr = '1.2.3.4'; + +-- Cancel a query +SELECT pg_cancel_backend(); + +-- Terminate a backend (last resort) +SELECT pg_terminate_backend(); +``` + +--- + +## Playbook for a common scenario (worker storm) + +1. Observation: queue depth spikes and DB connections hit max. +2. Actions: + - Pause enqueueing from API (if addable) OR reduce `API` acceptance. + - Scale down worker replicas to zero (or to 1) to immediately drain new DB connections. + - In `Kubernetes`: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` + - Watch `pg_stat_activity` drop; cancel any long-running queries if necessary. +3. Recovery: + - Fix root cause (e.g., buggy worker job loop). + - Increase worker rollout gradually: + - set `replicas=1`, observe. + - if safe, scale to normal count. + +--- + +## Post-incident: RCA & follow-up + +1. Document timeline: when the spike started, mitigation steps, who approved actions. +2. Root cause analysis: + - Identify the exact code path / job / deploy that caused the spike. + - Capture query texts and stack traces if available. +3. Remediation tasks (track as tickets): + - Add/adjust connection pooling and default pool sizes. + - Add `statement_timeout` and `idle_in_transaction_session_timeout`. + - Harden workers to backoff on DB errors and avoid retry storms. + - Improve monitoring and add guardrails (circuit-breakers, ingestion throttles). +4. Validate fix in staging and schedule a controlled rollout. 
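
For the timeout-related follow-up items above, role-level settings are usually the least invasive way to enforce them. A minimal sketch, assuming a dedicated application role (the `starforge_app` name is hypothetical; substitute your own role and values):

```sql
-- Enforce conservative timeouts for new sessions of the application role
ALTER ROLE starforge_app SET statement_timeout = '30s';
ALTER ROLE starforge_app SET idle_in_transaction_session_timeout = '60s';

-- Confirm the per-role settings are recorded
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'starforge_app';
```

Existing sessions keep their old settings; the new values apply to connections opened after the change, so a rolling restart of workers and API pods picks them up.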
+ +--- + +## Appendix: Helpful commands & references + +- Inspect active connections: + ```sql + SELECT pid, usename, application_name, client_addr, state, now()-query_start AS duration, query + FROM pg_stat_activity ORDER BY duration DESC LIMIT 50; + ``` + +- Cancel & terminate: + ```sql + SELECT pg_cancel_backend(12345); -- polite + SELECT pg_terminate_backend(12345); -- kills session + ``` + +- Check max connections: + ```sql + SHOW max_connections; + SHOW superuser_reserved_connections; + ``` + +- Check `pgbouncer` (if present): + ```sql + # Connect to pgbouncer and run: + SHOW POOLS; + SHOW CLIENTS; + SHOW SERVERS; + SHOW STATS; + ``` + +- `Kubernetes` example to reduce DB client pods: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` + +--- + +## Related docs + +- [docs/OBSERVABILITY.md](../OBSERVABILITY.md) — metrics and alert guidance +- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration preflight (ensure migrations don't create connection spikes) +- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — worker concurrency and connection budgeting +- [docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md](./APPLY_MIGRATIONS.md) — safe migration runbook + +--- diff --git a/docs/OP_RUNBOOKS/DB_RESTORE.md b/docs/OP_RUNBOOKS/DB_RESTORE.md new file mode 100644 index 0000000..ca90a7d --- /dev/null +++ b/docs/OP_RUNBOOKS/DB_RESTORE.md @@ -0,0 +1,298 @@ +# Runbook — Database Restore / Recovery + +--- + +## Purpose + +Step-by-step guidance to restore the StarForge `PostgreSQL` database from backups or perform a point-in-time restore +(`PITR`). This runbook is the authoritative operational procedure to recover from catastrophic failures caused by +destructive migrations, accidental deletes, or provider incidents. + +--- + +## Scope + +- Full database restore from backup snapshot / dump. +- Point-In-Time Recovery (`PITR`) to a specific timestamp (if `WAL` / `PITR` enabled). +- Promoting a replica as primary (when applicable). +- Validation and smoke checks after restore. +- Communication, escalation and post-restore actions. + +--- + +## Audience + +- `SRE` / `DevOps engineers` executing restores +- `Backend engineers` assisting with verification and data integrity checks +- `Incident commander` during recovery + +--- + +## Prerequisites & assumptions + +- You have access to cloud provider console or backups location (`S3`/`GCS`) and DB admin credentials. +- A known-good backup snapshot ID or a backup file is available. +- For `PITR` you must have `WAL` archive enabled and accessible for the target window. +- You have documented restore permissions and an incident channel for coordination. +- Pre-restore backup (freeze point) was taken as part of migration/backfill safety steps (recommended practice). + +--- + +## Emergency summary (short) + +1. Pause writes and ingestion (stop workers & `API` writes). +2. Identify reliable backup (snapshot id or latest consistent backup). +3. Restore to a new DB instance (do NOT overwrite primary unless planned). +4. Run smoke tests and validation queries. +5. Promote restored instance to primary or remap application connections. +6. Resume services in controlled manner and monitor. + +--- + +## Pre-restore checklist (must complete) + +- [ ] Notify stakeholders & open an incident channel (`#incident-db-restore`). +- [ ] Record the current state (timestamps, error messages, affected services). +- [ ] Pause ingestion and stop workers: scale worker replica count to `0` or stop processes. 
+- [ ] Capture diagnostics: `pg_stat_activity`, `error logs`, `recent migrations`, `last successful backup id`, and current `WAL` position if available. +- [ ] Confirm backup availability (snapshot id or path) and estimated time-to-restore. +- [ ] Confirm access to secrets and credentials required for restore (cloud console, DB admin). +- [ ] Confirm rollback/contingency plan and that approvers are available. + +--- + +## Types of restores + +1. Provider snapshot restore (managed DB snapshot) + - Fastest option; provider restores an instance from snapshot. + - Use when you have a full snapshot taken previously (recommended before migrations). + +2. Restore from logical dump (`pg_dump` / `pg_restore`) + - Required when snapshots not available or when restoring specific schema/data. + - Slower for large DBs; use for targeted restores. + +3. Point-In-Time Recovery (`PITR`) + - Restore base backup and apply `WAL` logs up to target timestamp. + - Use when you need to recover to a specific moment (e.g., before accidental DELETE). + +4. Promote read-replica + - If a healthy read-replica exists and is up-to-date, promote it to primary. + - Quick and safe if replica is suitable. + +--- + +## Restore workflow (recommended safest flow) + +### A. Preparation (operator) + +1. Pause writers: + - Pause ingestion `API` or flip feature flag. + - Scale down `ETL` worker replicas: + ```bash + kubectl -n starforge scale deployment etl-worker --replicas=0 + ``` +2. Ensure no scheduled jobs are running that will write to DB. + +3. Select restore target: + - Option 1: New instance for restored DB (recommended). + - Option 2: Restore into staging cluster first for verification (strongly recommended if time allows). + +### B. Execute restore (provider snapshot example) + +1. In provider console (`Supabase` / `RDS` / `Cloud SQL`): + - Locate snapshot id (or automated backup). + - Click "Restore" and choose a new instance name (e.g., `starforge-restore-2025-12-03`). + - Choose same region and compatible instance size. If under load, consider a larger instance temporarily. + +2. Wait for restore to complete. Time depends on DB size and provider speed. + +3. Obtain the new `DATABASE_URL` for the restored instance and restrict access (whitelist operator IPs). + +### C. Execute restore (logical dump example) + +1. Upload dump file to the restore host or use direct streaming. + +2. Create a blank DB on target host: + ```bash + psql -h -U -c "CREATE DATABASE starforge_restore;" + ``` + +3. Restore schema & data: + ```bash + pg_restore --verbose --no-owner --role= -h -U -d starforge_restore /path/to/backup.dump + ``` + +4. Monitor restore progress and `pg_restore` output for errors. + +### D. Execute PITR (when WAL available) + +`PITR` is advanced; follow provider docs or below high-level steps. + +1. Restore base backup to a new instance (base backup). +2. Configure `recovery.conf` or restore settings to point to `WAL` archive. +3. Set recovery target time: + ```bash + recovery_target_time = '2025-12-03 14:23:00+00' + ``` +4. Start instance and wait until recovery completes and database is consistent. +5. Verify recovered state matches expected time. + +--- + +### E. Promote & cutover + +1. Test the restored DB with a read-only smoke test. + - Run sanity queries: count of essential tables, sample user lookup, basic `API` health with read-only endpoint. + +2. Plan cutover strategy: + - `DNS`/connection string swap: + - Update application `DATABASE_URL` to point to restored DB (best done via secrets manager or config). 
+ - Alternatively, change read/write roles and promote the restored host to primary (provider option). + +3. Minimize downtime: + - Use a rolling approach: bring up a small application replica pointing to restored DB, validate, then gradually switch traffic. + +--- + +## Validation & verification (post-restore) + +Run these checks immediately and for the monitoring window that follows. + +1. Basic connectivity + ```bash + psql -c "SELECT 1;" + ``` + +2. Schema sanity + ```sql + SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'; + SELECT to_regclass('public.hero_snapshots'); + ``` + +3. Critical row counts / sample checks + - Users: + ```sql + SELECT count(*) FROM users; + SELECT id FROM users ORDER BY created_at DESC LIMIT 5; + ``` + - Hero snapshots: + ```sql + SELECT count(*) FROM hero_snapshots; + SELECT id, created_at FROM hero_snapshots ORDER BY created_at DESC LIMIT 5; + ``` + +4. Application smoke test (read & write) + - With a staging `API` instance, call: + - `GET /api/health` + - `GET /api/v1/profile/summary/` + - (If read-write allowed) insert a small test snapshot and ensure it processes. + +5. `ETL` smoke test + - Start a single worker against restored DB and process a sample snapshot: + - Confirm `hero_snapshots.processed_at` is set and `user_profile_summary` updated. + - Ensure worker logs show no fatal errors. + +6. Consistency checks + - Referential integrity: + ```sql + SELECT count(*) FROM user_troops WHERE user_id IS NULL; + ``` + - Uniqueness constraints: + ```sql + SELECT user_id, troop_id, count(*) FROM user_troops GROUP BY user_id, troop_id HAVING count(*) > 1 LIMIT 10; + ``` + +7. Application-scale verification + - Monitor metrics: `API` latency, DB connections, `ETL` error rate for at least `30–60 minutes` before full cutover. + +--- + +## Rollback & contingency during restore + +- Do not overwrite the original primary instance immediately. Always restore to a new instance first. +- If restored DB is invalid, abort the cutover and iterate (try a different snapshot or time). +- If primary must be reinstated, you can revert application connection strings to the original `DATABASE_URL`. + +--- + +## Post-restore steps and follow-up + +- Document the restore: snapshot id, who executed, timestamps, and verification outputs. Attach to incident ticket. +- Run a full application regression test in staging, then in production for critical flows. +- Re-enable workers and ingestion progressively: + - Start a small number of workers (`concurrency=1`), monitor, then scale. +- Rebuild or re-apply any migrations that are required and were not present in the restored point (coordinate migration history). +- Run a backfill for any missing or late data if necessary (use [docs/OP_RUNBOOKS/BACKFILL.md](./BACKFILL.md)). + +--- + +## Security & access considerations + +- Rotate credentials if restore was caused by a security incident (see [docs/OP_RUNBOOKS/SECRET_COMPROMISE.md](./SECRET_COMPROMISE.md) runbook). +- Audit who accessed backups and restoration artifacts. +- Limit access to restored instance until verified. + +--- + +## Troubleshooting common issues + +- Restore taking too long: + - Option: restore into a larger instance type to speed `IO`. + - For logical restores, use parallel restore with `pg_restore -j `. + +- `WAL` unavailable for `PITR`: + - If `WAL` segment missing, `PITR` cannot reach target time. Consider restoring to the last available `WAL` and then applying compensating actions. 
- Consult the provider's support for `WAL` retrieval if they archive it.

- Errors during `pg_restore` (permission, role problems):
    - Re-run with `--no-owner --role=<role>` and ensure the required roles exist.
    - Create missing roles temporarily or adjust dump options.

- Post-restore replication / replica issues:
    - If promoting a replica, ensure replication slots are cleaned and standby configs updated.
    - If using pgbouncer, update its server config to point to the new primary.

---

## Communication template (incident updates)

- Initial alert (short):
    - "Incident: DB outage — starting restore. Using snapshot `<snapshot-id>`. `ETA` to service: `~<N> minutes`. Channel: `#incident-db-restore`."

- Progress update:
    - "Restore progress: snapshot restore completed / `pg_restore` at `40%` / `PITR` applying `WALs` — `ETA ~20m`. Next: smoke tests."

- Resolution:
    - "Restore complete. Restored instance: `<instance-name>`. Smoke tests passed. Cutover started at