Skip to content
Merged

Docs #30

Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
6ddb524
Add runbook for applying database migrations
CorentynDevPro Dec 3, 2025
32a50b8
Merge pull request #20 from CorentynDevPro/apply-migrations
CorentynDevPro Dec 3, 2025
6ed2ce2
Add backfill runbook for historical snapshots
CorentynDevPro Dec 3, 2025
18c813f
Merge pull request #21 from CorentynDevPro/backfill
CorentynDevPro Dec 3, 2025
a1eaae8
Add Postgres connection exhaustion runbook
CorentynDevPro Dec 3, 2025
cbdd84d
Merge pull request #22 from CorentynDevPro/db-connection-exhaust
CorentynDevPro Dec 3, 2025
e2ee976
Add database restore runbook documentation
CorentynDevPro Dec 4, 2025
e8b39a5
Merge pull request #23 from CorentynDevPro/db-restore
CorentynDevPro Dec 4, 2025
f170723
Add ETL failure spike runbook
CorentynDevPro Dec 4, 2025
ff5da07
Add ETL failure spike incident runbook
CorentynDevPro Dec 4, 2025
18c05b2
Merge pull request #24 from CorentynDevPro/etl-failure-spike
CorentynDevPro Dec 4, 2025
082015c
Add migration rollback runbook
CorentynDevPro Dec 4, 2025
d264db7
Add migration rollback runbook documentation
CorentynDevPro Dec 4, 2025
48917b2
Merge pull request #25 from CorentynDevPro/migration-rollback
CorentynDevPro Dec 4, 2025
bc09c3a
Add secret compromise runbook documentation
CorentynDevPro Dec 4, 2025
b1c41ed
Add secret compromise response runbook
CorentynDevPro Dec 4, 2025
09118a0
Merge pull request #26 from CorentynDevPro/secret-compromise
CorentynDevPro Dec 4, 2025
d1a0e78
Add worker OOM runbook documentation
CorentynDevPro Dec 4, 2025
b652383
Add runbook for handling worker OOM incidents
CorentynDevPro Dec 4, 2025
9f955ea
Merge pull request #27 from CorentynDevPro/worker-oom
CorentynDevPro Dec 4, 2025
1e54c1f
Update WORKER_OOM runbook title formatting
CorentynDevPro Dec 4, 2025
7e62587
Merge pull request #28 from CorentynDevPro/worker-oom
CorentynDevPro Dec 4, 2025
475f314
Add QUEUE_BACKLOG runbook documentation
CorentynDevPro Dec 4, 2025
b58cb84
Remove file extensions from documentation titles
CorentynDevPro Dec 4, 2025
6dbb8cd
Add queue backlog operational runbook
CorentynDevPro Dec 4, 2025
0732f41
Merge pull request #29 from CorentynDevPro/queue-backlog
CorentynDevPro Dec 4, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/DB_MODEL.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# StarForge — Canonical Database Model (DB_MODEL.md)
# StarForge — Canonical Database Model

> This document is the canonical reference for the database schema used by StarForge. It expands the summary in the PRD
> and contains:
Expand Down
283 changes: 283 additions & 0 deletions docs/OP_RUNBOOKS/APPLY_MIGRATIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,283 @@
# Runbook — Apply Database Migrations

---

## Purpose

This runbook describes the safe, repeatable procedure to apply schema migrations to production (and staging)
environments for the StarForge project. It consolidates the guidance in [docs/MIGRATIONS.md](../MIGRATIONS.md) and
provides a concise,
actionable checklist, commands, verification queries and troubleshooting steps for operators and engineers.

---

## Audience

- `SRE` / `Ops engineers` executing production changes
- `Backend engineers` owning migration PRs
- `Release approvers` and on-call engineers

---

## When to use

- To apply migration PRs to staging or production.
- To run emergency schema changes after appropriate approvals.
- To validate migrations that were applied by `CI` (post-apply verification).

---

## Principles (short)

- Always be safe: take a backup before applying production migrations.
- Use the protected `CI` workflow (`db-bootstrap`) for production when possible — it enforces approvals and records
artifacts.
- Prefer additive, phased migrations: add → backfill → enforce.
- Monitor system health during and after migration; be prepared to rollback or restore.

---

## Pre-flight checklist (must pass before apply)

1. Backup & snapshot
- Create and record a DB backup/snapshot ID. Verify the backup was successful.
- Document the backup ID in the migration ticket/PR.

2. Approvals
- Confirm the GitHub `db-bootstrap` workflow will run under a protected environment (requires approval).
- Ensure required approvers (`Engineering` + `SRE`) are available during the window.

3. `CI` Preflight & Tests
- Ensure `CI` `migrate:preflight` job passed for the migration PR (applied against ephemeral DB).
- Confirm unit/integration tests and migration tests passed in `CI`.

4. Runbook & Impact
- Confirm the migration PR includes impact estimates (row counts, index build time).
- Confirm any required maintenance window or low-traffic window is scheduled if the change is heavy.

5. Communication
- Notify stakeholders (product, support) of the planned maintenance window and expected `ETA`.
- Post the planned change with contact / pager information.

6. Operational readiness
- Ensure `SRE` on-call is available.
- Confirm ability to pause ingestion and workers (feature flag or scaledown commands).
- Confirm you can run the rollback / restore plan and have the runbook open.

---

## How to apply (recommended: GitHub Actions / db-bootstrap)

Use the repository's protected `db-bootstrap` GitHub Actions workflow. This is the preferred and auditable path.

1. Open the PR with migration files and ensure it includes `MIGRATION: <filename>` in commit message.
2. From the repository Actions tab, locate `db-bootstrap` (or run `workflow_dispatch`).
3. Provide required inputs (if any) and kick off the workflow.
4. Approver(s): Approve the environment prompt to let the workflow run against production.
- The workflow will run preflight checks, run migrations, and surface logs.

> Notes:
> - The workflow requires the secret `DATABASE_URL` in `GitHub environment` and the `apply` confirmation input to
actually apply.
> - The workflow logs and artifacts will be retained in `GitHub Actions` and should be saved for audit.

---

## How to apply (alternative: manual via CLI)

Only use if GitHub Actions is not available. Prefer scripted, idempotent commands.

1. Pause ingestion & workers
- Disable ingestion if possible (`API` feature flag) or scale down workers to 0.
- Example (`Kubernetes`): scale down `etl worker` deployment:
- `kubectl scale deployment etl-worker --replicas=0 -n starforge`
- For non-`K8s`: stop worker processes or toggle feature flag.

2. Ensure you have a recent backup.

3. Run preflight locally (optional but recommended)
- `./scripts/migrate-preflight.sh`
- Validate `pgcrypto` availability, connectivity, and run a smoke `ETL`.

4. Run migrations
- From repo root:
- `pnpm install --frozen-lockfile`
- `pnpm run migrate:up -- --config database/migration-config.js --env production`
- If migrations require `CONCURRENTLY` for indexes, follow documented instructions — they may run outside a
transaction.

5. Collect logs and proceed to verification.

---

## Post-apply verification (smoke & health checks)

Immediately after migrations finish, run the following checks before re-enabling full traffic:

1. Schema migration table
- Verify applied migrations:
- `SELECT * FROM node_pgmigrations_schema_version ORDER BY installed_on DESC;`
- (or check `schema_migrations` / the table created by `node-pg-migrate` in your setup)

2. Basic DB health
- Active connections:
- `SELECT count(*) FROM pg_stat_activity;`
- Long running transactions:
-
`SELECT pid, now() - xact_start AS duration, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY duration DESC LIMIT 10;`

3. Verify key tables/indexes exist (examples)
- Check table existence:
- `SELECT to_regclass('public.hero_snapshots');`
- Check index existence:
- `SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'hero_snapshots';`

4. Run a smoke `ETL`
- Insert a small test snapshot (or use sample fixture) into `hero_snapshots` and enqueue a processing job.
- Verify worker processes it (if workers are still paused, re-enable a single worker instance temporarily).
- Confirm `user_profile_summary` and `user_troops` upserts succeed.
- Helpful SQL:
-
`SELECT id, processed_at, error_count, last_error FROM hero_snapshots WHERE created_at > now() - interval '10 minutes' ORDER BY created_at DESC LIMIT 20;`

5. Check `etl_errors` and logs
- Recent errors:
- `SELECT * FROM etl_errors ORDER BY created_at DESC LIMIT 50;`
- Check application logs and `Prometheus` metrics (`starforge_etl_snapshots_processed_total`,
`starforge_etl_snapshots_failed_total`).

6. Monitor metrics & dashboards
- Watch `ETL` processing rate and failure rate for at least `30–60 minutes`.
- Verify DB CPU, IO, and connection metrics are within expected ranges.

---

## Rollback & emergency restore (decision flow)

If migration causes critical failures or data corruption, follow this decision flow:

1. If migration has a safe `down` script, and you are confident its execution will restore safe state:
- Run the down migration for the offending change:
- `pnpm run migrate:down -- --count 1 -- --config database/migration-config.js`
- Note: downs may not be safe for destructive changes. Ensure you understand side effects.

2. If `down` is unsafe, perform a DB restore from the backup created before the migration:
- Stop ingestion and workers immediately.
- Restore the DB from the backup snapshot ID recorded earlier (follow provider-specific restore steps).
- Notify stakeholders and follow restore verification steps (same as post-apply verification).
- After recovery, coordinate re-deploy of any fixes and controlled re-apply if required.

3. Communicate clearly:
- Open an incident ticket and notify on-call and stakeholders.
- Document timeline and actions taken in the migration PR or incident system.

---

## Common failure modes & mitigations

- CREATE EXTENSION / pgcrypto permission denied
- Symptom: migration fails on CREATE EXTENSION.
- Mitigation: check provider support. If unsupported, use application-side `UUID` generation or request provider
privileges. Revert/skip extension creation if planned.

- Long-running CONCURRENTLY index builds causing resource exhaustion
- Symptom: elevated IO/CPU, slowed queries.
- Mitigation: monitor index build, spread index creation to off-peak hours, throttle background jobs, or create
indexes on replicas if available.

- Migration wrapped in transaction (CONCURRENTLY disallowed)
- Symptom: failure when attempting CREATE INDEX CONCURRENTLY inside transaction.
- Mitigation: separate that step into a non-transactional migration (use `pgm.sql` outside a transaction) as
documented in [docs/MIGRATIONS.md](../MIGRATIONS.md).

- DB connection exhaustion
- Symptom: connection errors, failed migrations.
- Mitigation: reduce migration concurrency, pause workers, increase DB capacity temporarily, use pgbouncer.

- Partial backfill failures
- Symptom: backfill job errors after schema change.
- Mitigation: investigate error, re-run backfill with smaller batches, record failures in `backfill_jobs` for
resume.

---

## Recommended verification queries (copy/paste)

```SQL
-- Check last applied migrations (adjust to your migration table name)
SELECT *
FROM schema_migrations
ORDER BY installed_on DESC LIMIT 20;

-- Check hero_snapshots presence
SELECT to_regclass('public.hero_snapshots') as hero_snapshots_exists;

-- Check GIN index presence for hero_snapshots.raw
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'hero_snapshots'
AND indexdef ILIKE '%gin%';

-- Show recent etl_errors
SELECT id, snapshot_id, error_type, message, created_at
FROM etl_errors
ORDER BY created_at DESC LIMIT 50;

-- Recent processed snapshots
SELECT id, user_id, processed_at, error_count
FROM hero_snapshots
WHERE processed_at IS NOT NULL
ORDER BY processed_at DESC LIMIT 20;
```

---

## Operational checklist (concise)

- [ ] Backup created and backup ID recorded
- [ ] Approvals obtained
- [ ] `CI` preflight passed
- [ ] Maintenance window & communications sent
- [ ] Workers paused / ingestion throttled
- [ ] Run migrations via `CI` (preferred) or manual `CLI`
- [ ] Run post-apply verification queries
- [ ] Run smoke `ETL` and confirm no errors
- [ ] Monitor metrics for `30–60 minutes`
- [ ] Re-enable workers and normal traffic
- [ ] Record migration outcome and attach logs/artifacts to PR

---

## Audit & artifacts

- Keep GitHub Actions logs and artifacts (workflow run ID) attached to the migration PR.
- Record backup snapshot ID and any run IDs (backfill job IDs) in the PR and change log.
- Save verification query outputs into the PR comments or incident log for traceability.

---

## Contacts & escalation

- Primary `SRE`: (fill with on-call contact or team alias)
- Backend owner(s): from migration PR
- Pager / Slack channel: `#starforge-ops`
- If severe outage: page `SRE` and Engineering leads immediately.

---

## References

- [docs/MIGRATIONS.md](../MIGRATIONS.md) — migration conventions and patterns
- [docs/ETL_AND_WORKER.md](../ETL_AND_WORKER.md) — `ETL` contract and smoke test guidance
- [docs/DB_MODEL.md](../DB_MODEL.md) — canonical schema
- `scripts/migrate-preflight.sh` — preflight helper script
- `GitHub Actions` workflow: `.github/workflows/db-bootstrap.yml`

---

## Notes

- This runbook is intended to be concise and actionable. For complex or high-risk migrations (large table rewrites,
partitioning, destructive changes), prepare a migration plan that includes a rehearsal run in staging, detailed
backfill scripts, and extended monitoring windows.
- Keep the runbook updated with contact details and any provider-specific restore steps.
Loading
Loading