diff --git a/docs/DB_MIGRATIONS.md b/docs/DB_MIGRATIONS.md
deleted file mode 100644
index ef581d3..0000000
--- a/docs/DB_MIGRATIONS.md
+++ /dev/null
@@ -1,262 +0,0 @@
-# Database Migrations
-
-This document explains how to manage database migrations for the StarForge project.
-
-## Overview
-
-StarForge uses [node-pg-migrate](https://salsita.github.io/node-pg-migrate/) for database schema migrations. This provides:
-
-- **Version control** for database schema changes
-- **Up and down migrations** for easy rollback
-- **Idempotent operations** that are safe to run multiple times
-- **Transaction support** for atomic changes
-
-## Key Design Decisions
-
-### pgcrypto vs uuid-ossp
-
-We use the `pgcrypto` extension with `gen_random_uuid()` instead of `uuid-ossp` with `uuid_generate_v4()`. This choice was made for better compatibility with managed PostgreSQL providers like Supabase, which may have restrictions on certain extensions.
-
-### JavaScript Migrations
-
-Migrations are written in JavaScript (not SQL) to take advantage of:
-
-- Built-in up/down migration support
-- Better transaction control
-- Programmatic schema generation
-- Type safety and validation
-
-### Idempotent Seeds
-
-Seed data uses `INSERT ... ON CONFLICT DO NOTHING` to ensure it can be safely run multiple times without duplicating data.
-
-## Local Development
-
-### Prerequisites
-
-- PostgreSQL client tools (`psql`) installed
-- Node.js and pnpm installed
-- `DATABASE_URL` environment variable configured
-
-### Setup
-
-1. **Configure your database connection:**
-
-```bash
-export DATABASE_URL="postgresql://user:password@localhost:5432/starforge"
-export PGSSLMODE=require # Optional, defaults to 'require'
-```
-
-2. **Run the bootstrap script:**
-
-```bash
-./scripts/bootstrap-db.sh
-```
-
-This will:
-
-- Install dependencies
-- Run all pending migrations
-- Apply seed data
-- Verify the setup
-
-### Manual Migration Commands
-
-**Run all pending migrations:**
-
-```bash
-pnpm run migrate:up
-```
-
-**Rollback the last migration:**
-
-```bash
-pnpm run migrate:down
-```
-
-**Create a new migration:**
-
-```bash
-pnpm run migrate create my-migration-name
-```
-
-**Check migration status:**
-
-```bash
-pnpm run migrate status --config database/migration-config.js
-```
-
-## CI/CD and Production
-
-### GitHub Actions Workflow
-
-We have a manual workflow (`db-bootstrap.yml`) for running migrations in production:
-
-1. Go to **Actions** → **Database Bootstrap** in GitHub
-2. Click **Run workflow**
-3. Select the target environment (production/staging)
-4. Click **Run workflow** button
-5. **Approve the deployment** if using the production environment
-
-### Environment Protection
-
-The workflow requires manual approval for production deployments through GitHub Environments. This prevents accidental schema changes.
-
-### Setting Up Environments
-
-1. In your GitHub repository, go to **Settings** → **Environments**
-2. Create a `production` environment
-3. Enable **Required reviewers**
-4. Add team members who should approve production migrations
-
-### Required Secrets
-
-Add these secrets in GitHub repository settings:
-
-- `DATABASE_URL`: PostgreSQL connection string for your database
-
-Example:
-
-```
-postgresql://username:password@host.supabase.co:5432/postgres
-```
-
-## Creating New Migrations
-
-### Best Practices
-
-1. **Keep migrations small and focused** - One logical change per migration
-2. **Always include down migrations** - For easy rollback if needed
-3. **Use transactions** - node-pg-migrate wraps migrations in transactions by default
-4. **Test locally first** - Always test migrations on a local database before production
-5. **Make migrations idempotent** - Use `IF NOT EXISTS` and similar clauses
-
-### Example Migration
-
-```javascript
-// database/migrations/1234567890_add_user_preferences.js
-
-exports.up = (pgm) => {
- pgm.createTable(
- 'user_preferences',
- {
- id: {
- type: 'uuid',
- primaryKey: true,
- default: pgm.func('gen_random_uuid()'),
- },
- user_id: {
- type: 'uuid',
- notNull: true,
- references: 'users(id)',
- onDelete: 'CASCADE',
- },
- theme: {
- type: 'varchar(50)',
- default: "'light'",
- },
- notifications_enabled: {
- type: 'boolean',
- default: true,
- },
- },
- {
- ifNotExists: true,
- },
- );
-
- pgm.createIndex('user_preferences', 'user_id', { ifNotExists: true });
-};
-
-exports.down = (pgm) => {
- pgm.dropTable('user_preferences', { ifExists: true });
-};
-```
-
-## Troubleshooting
-
-### Extension Permissions
-
-If you encounter errors about creating extensions:
-
-1. **Enable the extension manually** in your database:
-
- ```sql
- CREATE EXTENSION IF NOT EXISTS pgcrypto;
- ```
-
-2. **Check extension availability:**
-
- ```sql
- SELECT * FROM pg_available_extensions WHERE name = 'pgcrypto';
- ```
-
-3. **Contact your database administrator** if you don't have permissions
-
-### Connection Issues
-
-If migrations fail to connect:
-
-1. Verify your `DATABASE_URL` is correct
-2. Check that `PGSSLMODE` is set appropriately (`require` for most cloud providers)
-3. Verify your IP is whitelisted if using managed PostgreSQL
-4. Test the connection manually:
- ```bash
- psql "$DATABASE_URL" -c "SELECT 1"
- ```
-
-### Migration Conflicts
-
-If migrations get out of sync:
-
-1. **Check current migration status:**
-
- ```bash
- pnpm run migrate status --config database/migration-config.js
- ```
-
-2. **Inspect the pgmigrations table:**
-
- ```sql
- SELECT * FROM pgmigrations ORDER BY run_on DESC;
- ```
-
-3. **Never modify completed migrations** - Create a new migration to fix issues
-
-## Security Best Practices
-
-### Secrets Management
-
-- **Never commit secrets** to version control
-- Use GitHub Secrets for CI/CD
-- Rotate credentials regularly
-- Use different credentials for each environment
-
-### Access Control
-
-- Limit who can run migrations in production
-- Use GitHub Environment protection rules
-- Enable audit logging for database changes
-- Review migration PRs carefully
-
-### Database Backups
-
-- Always backup before running migrations in production
-- Test restoration procedures regularly
-- Keep backups for regulatory compliance periods
-- Consider point-in-time recovery for critical databases
-
-## Additional Resources
-
-- [node-pg-migrate Documentation](https://salsita.github.io/node-pg-migrate/)
-- [PostgreSQL Documentation](https://www.postgresql.org/docs/)
-- [Supabase Database Documentation](https://supabase.com/docs/guides/database)
-
-## Support
-
-If you encounter issues:
-
-1. Check the troubleshooting section above
-2. Review recent migration files for common patterns
-3. Consult the team's database administrator
-4. Open an issue in the repository with detailed error messages
diff --git a/docs/DB_MODEL.md b/docs/DB_MODEL.md
index e69de29..79c9bc8 100644
--- a/docs/DB_MODEL.md
+++ b/docs/DB_MODEL.md
@@ -0,0 +1,1824 @@
+# StarForge — Canonical Database Model (DB_MODEL.md)
+
+> This document is the canonical reference for the database schema used by StarForge. It expands the summary in the PRD
+> and contains:
+> - high-level overview and design principles,
+> - canonical table definitions (key tables used by the ETL and APIs),
+> - recommended indexes, constraints and partitioning guidance,
+> - upsert patterns / example DDL and SQL,
+> - migration & rollout checklist specific to schema changes,
+> - DBML representation (paste into https://dbdiagram.io to visualize),
+> - sample queries useful for debugging and validation.
+
+---
+
+# Audience
+
+- Backend engineers implementing migrations and `ETL workers`
+- `SRE / DevOps` running bootstrap and production migrations
+- Data engineers and analysts consuming normalized tables
+- `QA` writing `integration/e2e tests`
+
+---
+
+# Status
+
+- Draft (aligns with PRD v0.1 — see [docs/PRD.md](./PRD.md)). Update this file when migrations are committed.
+
+---
+
+# Table of contents
+
+1. [Design goals](#1-design-goals--guiding-principles)
+2. [Naming & conventions](#2-naming--conventions)
+3. [Core tables (DDL-like descriptions)](#3-core-tables-canonical-descriptions--recommended-columns)
+4. [Indexes & query patterns](#4-indexes--query-patterns)
+5. [Partitioning & archival guidance](#5-partitioning-retention--archival-guidance)
+6. [Upsert examples and idempotency patterns](#6-upsert-examples-and-idempotency-patterns)
+7. [Migration strategy & preflight checks](#7-migration-strategy--preflight-checks)
+8. [Data retention & archival model](#8-data-retention--archival-model-summary)
+9. [Validation queries & health checks](#9-validation-queries--health-checks)
+10. [DBML schema (paste into dbdiagram.io)](#10-dbml-schema-paste-into-dbdiagramio)
+11. [Change log / references](#11-change-log--references)
+
+---
+
+## 1. Design goals / guiding principles
+
+- Store `raw upstream snapshots` (JSON) for audit & replay while normalizing commonly queried fields for performance.
+- Keep `ETL` idempotent: repeated processing of the same snapshot must not create duplicates or corrupt state.
+- Make `migrations` safe and incremental: add nullable columns first, backfill asynchronously, then make NOT NULL.
+- Minimize long-running exclusive locks; prefer CONCURRENT index creation where supported.
+- Keep `PII` minimal and provide explicit redaction guidance for any sensitive fields prior to archival or external
+ export.
+
+---
+
+## 2. Naming & conventions
+
+- All timestamps: timestamptz (UTC). Column names: `created_at`, `updated_at`, `processed_at`, `server_time` (where
+ upstream provides epoch).
+- Primary key convention: `id` as UUID where appropriate (generated with `gen_random_uuid()` via `pgcrypto`), or integer
+ for catalog tables.
+- Foreign keys: explicit FK constraints where referential integrity is required. Use `ON DELETE SET NULL` for references
+ that must not cascade deletes unexpectedly.
+- JSON storage: use `jsonb` for columns storing raw snapshots or arbitrary extra fields. Use `jsonb` named `extra`,
+ `extras`, or `raw` for unmapped data.
+- Index names: `idx_<table>_<columns>`; unique constraints prefixed `ux_`.
+
+---
+
+## 3. Core tables (canonical descriptions & recommended columns)
+
+> Note: these are canonical column lists and notes — the actual `SQL DDL` must be created via versioned migrations
+> (`node-pg-migrate`).
+
+### 3.1 users (canonical user mapping)
+
+- _Purpose:_ canonical user row used by application and joins.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `namecode` VARCHAR(64) UNIQUE NULLABLE
+ - `discord_user_id` VARCHAR(64) NULLABLE
+ - `username` VARCHAR(255) NULLABLE
+ - `email` VARCHAR(255) NULLABLE
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`namecode`)
+ - INDEX on `discord_user_id` for quick lookup
+    - Consider partial indexes for active users (e.g. `WHERE updated_at > now() - interval '90 days'`)
+
+- _Notes:_
+ - Keep `PII` minimal; consider moving sensitive `PII` (email) to a protected table with stricter access controls.
+ - Use `gen_random_uuid()` (`pgcrypto`) where provider allows; otherwise generate `UUIDs` app-side.
+
+---
+
+### 3.2 hero_snapshots (raw ingestion store)
+
+- _Purpose:_ persist raw `get_hero_profile` payloads for audit, dedupe and replay. Single source-of-truth raw payload
+ retention.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `user_id` UUID REFERENCES users(id) ON DELETE SET NULL
+ - `namecode` VARCHAR(64) NULLABLE
+ - `source` VARCHAR(64) NOT NULL -- e.g. "fetch_by_namecode", "login", "cli_upload"
+ - `raw` JSONB NOT NULL
+ - `size_bytes` INTEGER NOT NULL
+ - `content_hash` VARCHAR(128) NOT NULL -- hex SHA256
+ - `server_time` BIGINT NULLABLE -- if upstream provides server timestamp (epoch millis)
+ - `processing` BOOLEAN DEFAULT FALSE
+ - `processing_started_at` TIMESTAMPTZ NULLABLE
+ - `processed_at` TIMESTAMPTZ NULLABLE
+ - `error_count` INTEGER DEFAULT 0
+ - `last_error` JSONB NULLABLE
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - `GIN`: CREATE INDEX `idx_hero_snapshots_raw_gin` ON `hero_snapshots` USING GIN (raw `jsonb_path_ops`);
+ - `B-tree`: CREATE INDEX `idx_hero_snapshots_user_created_at` ON `hero_snapshots` (`user_id`, `created_at` DESC);
+    - Optional: UNIQUE(`content_hash`, `source`) (or a partial unique index scoped to a dedupe window) — if
+      implemented, prefer application-level dedupe so duplicates outside the window remain allowed.
+ - Expression index: CREATE INDEX `idx_hero_snapshots_namecode_expr` ON `hero_snapshots` ((raw ->> 'NameCode'));
+
+- _Notes:_
+    - Compute `content_hash := encode(digest(raw::text, 'sha256'), 'hex')` in DB or app on insert; store
+      `size_bytes := octet_length(raw::text)`.
+ - Keep raw `JSONB` for replay; plan retention/archival (see section retention).
+ - Use small inserts with Idempotency-Key header in ingestion `API` to avoid duplicate processing.
+
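+A minimal dedupe lookup against `content_hash`, as suggested above, might look like the following sketch; the 24-hour
+window is an assumption, not a committed policy:
+
+```sql
+-- Skip ingestion if an identical payload from the same source arrived recently.
+SELECT id
+FROM hero_snapshots
+WHERE content_hash = encode(digest($1::text, 'sha256'), 'hex')
+  AND source = $2
+  AND created_at > now() - interval '24 hours'
+LIMIT 1;
+```
+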
+---
+
+### 3.3 user_troops (normalized inventory)
+
+- _Purpose:_ normalized per-user troop counts for fast queries and analytics.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE
+ - `troop_id` INTEGER NOT NULL -- FK to `troop_catalog` when available
+ - `amount` INTEGER DEFAULT 0 NOT NULL
+ - `level` INTEGER DEFAULT 1
+ - `rarity` INTEGER DEFAULT 0
+ - `fusion_cards` INTEGER DEFAULT 0
+ - `traits_owned` INTEGER DEFAULT 0
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `last_seen` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`user_id`, `troop_id`)
+ - INDEX `idx_user_troops_troop` ON `user_troops` (`troop_id`)
+ - INDEX `idx_user_troops_user` ON `user_troops` (`user_id`)
+
+- _Notes:_
+ - Use ON CONFLICT (`user_id`, `troop_id`) DO UPDATE ... in upserts to be idempotent.
+ - Keep extra `JSONB` for unmapped fields so that `ETL` remains tolerant to upstream changes.
+
+---
+
+### 3.4 user_pets
+
+- _Purpose:_ normalized per-user pet inventory.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE
+ - `pet_id` INT NOT NULL
+ - `amount` INT DEFAULT 0 NOT NULL
+ - `level` INT DEFAULT 1
+ - `xp` BIGINT DEFAULT 0
+ - `orb_fusion_cards` INT DEFAULT 0
+ - `orbs_used` JSONB DEFAULT '[]'::jsonb
+ - `ascension_level` INT DEFAULT 0
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`user_id`, `pet_id`)
+ - INDEX `idx_user_pets_pet` ON `user_pets` (`pet_id`)
+ - INDEX `idx_user_pets_user` ON `user_pets` (`user_id`)
+
+- _Notes:_
+ - Keep `orbs_used` as `JSONB` array of event objects if upstream provides that detail.
+ - Use atomic per-user pet upserts to avoid partial state.
+
+---
+
+### 3.5 user_artifacts
+
+- _Purpose:_ per-user artifacts (progress & levels).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE
+ - `artifact_id` INT NOT NULL
+ - `level` INT DEFAULT 0
+ - `xp` BIGINT DEFAULT 0
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`user_id`, `artifact_id`)
+ - INDEX `idx_user_artifacts_artifact` ON `user_artifacts` (`artifact_id`)
+
+- _Notes:_
+ - Upsert with ON CONFLICT(`user_id`, `artifact_id`) DO UPDATE to be idempotent.
+
+---
+
+### 3.6 user_teams
+
+- _Purpose:_ store user-created teams (equipped teams / saved team configurations).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE
+ - `name` VARCHAR(255) NULLABLE
+ - `banner` INT NULLABLE
+ - `team_level` INT DEFAULT 0
+ - `class` VARCHAR(128) NULLABLE
+ - `troops` JSONB DEFAULT '[]'::jsonb -- ordered list of troop ids or objects
+ - `override_data` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - INDEX `idx_user_teams_user` ON `user_teams` (`user_id`)
+ - If teams are referenced frequently by namecode, consider index on (`user_id`, `name`) unique per user.
+
+- _Notes:_
+ - Consider a separate table for Team Saves / public sharing (see `team_saves`).
+ - For fast game lookups, keep troop ids as integer arrays if simpler (INT[]), but `JSONB` allows flexible metadata.
+
+---
+
+### 3.7 guilds
+
+- _Purpose:_ guild metadata and per-guild settings/feature flags.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `discord_guild_id` VARCHAR(64) UNIQUE NULLABLE
+ - `name` VARCHAR(255) NOT NULL
+ - `description` TEXT NULLABLE
+ - `settings` JSONB DEFAULT '{}'::jsonb
+ - `feature_flags` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`discord_guild_id`) -- if provided
+ - INDEX `idx_guilds_name` ON `guilds` (`name`)
+
+- _Notes:_
+    - The `feature_flags` column is an optional per-guild override; global flags live in the `feature_flags` table.
+ - Keep settings small and validated at application level.
+
+---
+
+### 3.8 guild_members
+
+- _Purpose:_ mapping between guilds and users.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `guild_id` UUID NOT NULL REFERENCES guilds(id) ON DELETE CASCADE
+ - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE
+ - `discord_user_id` VARCHAR(64) NULLABLE
+ - `joined_at` TIMESTAMPTZ NULLABLE
+ - `is_owner` BOOLEAN DEFAULT false
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`guild_id`, `user_id`)
+ - INDEX `idx_guild_members_guild` ON `guild_members` (`guild_id`)
+
+- _Notes:_
+ - Keep membership events in `audit_logs` if join/leave is important historically.
+
+---
+
+### 3.9 feature_flags
+
+- _Purpose:_ global feature toggles.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `name` VARCHAR(128) UNIQUE NOT NULL
+ - `enabled` BOOLEAN DEFAULT false
+ - `rollout_percentage` INT DEFAULT 0
+ - `data` JSONB DEFAULT '{}'::jsonb -- additional configuration
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`name`)
+ - INDEX `idx_feature_flags_enabled` ON `feature_flags` (enabled)
+
+- _Notes:_
+ - Evaluate caching flags in Redis with short `TTL` for performance.
+ - Use deterministic hashing to apply `rollout_percentage` by user id or namecode.
+
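+A sketch of the deterministic-hashing note above; the md5-bucket scheme and the `$flag_name` / `$user_id` placeholders
+are illustrative, not a committed design:
+
+```sql
+-- True when the user falls inside the rollout bucket for an enabled flag.
+-- Hashing the (flag name, user id) pair buckets users independently per flag.
+SELECT (('x' || substr(md5($flag_name || ':' || $user_id::text), 1, 6))::bit(24)::int % 100)
+           < ff.rollout_percentage AS enabled_for_user
+FROM feature_flags ff
+WHERE ff.name = $flag_name
+  AND ff.enabled = true;
+```
+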
+---
+
+### 3.10 user_profile_summary (denormalized read table)
+
+- _Purpose:_ fast-read, denormalized summary consumed by bot/UI (one row per user).
+
+- _Columns:_
+ - `user_id` UUID PRIMARY KEY REFERENCES users(id)
+ - `namecode` VARCHAR(64)
+ - `username` VARCHAR(255)
+ - `level` INTEGER
+ - `top_troops` JSONB DEFAULT '[]'::jsonb -- e.g. [ {troop_id, amount, level} ]
+ - `equipped_pet` JSONB NULLABLE
+ - `pvp_tier` INTEGER NULLABLE
+ - `guild_id` UUID NULLABLE
+ - `last_seen` TIMESTAMPTZ NULLABLE
+ - `cached_at` TIMESTAMPTZ DEFAULT now() -- when summary was generated
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - PRIMARY KEY(`user_id`)
+ - INDEX `idx_profile_summary_namecode` ON `user_profile_summary` (`namecode`)
+ - Consider partial index for active players: WHERE `cached_at > now() - interval '7 days'`
+
+- _Notes:_
+ - `ETL` writes/updates this table; reads should be served from it to meet latency `SLAs`.
+ - Keep `extra` for unmapped fields so reprocessing can add new summary fields without losing data.
+
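+A possible shape for the ETL write path described above (the column subset is illustrative):
+
+```sql
+-- Refresh the denormalized summary row for one user after a snapshot is processed.
+INSERT INTO user_profile_summary (user_id, namecode, username, level, top_troops, cached_at, updated_at)
+VALUES ($1, $2, $3, $4, $5::jsonb, now(), now())
+ON CONFLICT (user_id) DO UPDATE
+    SET namecode   = EXCLUDED.namecode,
+        username   = EXCLUDED.username,
+        level      = EXCLUDED.level,
+        top_troops = EXCLUDED.top_troops,
+        cached_at  = now(),
+        updated_at = now();
+```
+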
+---
+
+### 3.11 etl_errors
+
+- _Purpose:_ record per-snapshot or per-entity `ETL` errors for triage.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `snapshot_id` UUID REFERENCES hero_snapshots(id) ON DELETE SET NULL
+ - `error_type` VARCHAR(128) NOT NULL -- e.g. PARSE_ERROR, FK_VIOLATION
+ - `message` TEXT NOT NULL -- short sanitized message
+ - `details` JSONB NULLABLE -- structured diagnostic (truncated snippets, stack)
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - INDEX `idx_etl_errors_snapshot_id_created_at` ON `etl_errors` (`snapshot_id`, `created_at` DESC)
+
+- _Notes:_
+ - Do not store full raw snapshot in details; reference `snapshot_id` for investigation.
+ - Create alerts if error rates exceed threshold.
+
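+A query an alerting job could evaluate against the threshold mentioned above (the 1-hour window is an assumption):
+
+```sql
+-- ETL errors per error_type over the last hour; alert when any count crosses the agreed threshold.
+SELECT error_type, count(*) AS errors_last_hour
+FROM etl_errors
+WHERE created_at > now() - interval '1 hour'
+GROUP BY error_type
+ORDER BY errors_last_hour DESC;
+```
+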
+---
+
+### 3.12 catalogs (troop_catalog, pet_catalog, artifact_catalog)
+
+- _Purpose:_ canonical static metadata seeded from official data sources or community-maintained sources.
+
+- _Example: `troop_catalog` columns:_
+ - `id` INT PRIMARY KEY
+ - `kid` VARCHAR(64) NULLABLE
+ - `name` VARCHAR(255) NOT NULL
+ - `description` TEXT NULLABLE
+ - `rarity` INT
+ - `meta` JSONB DEFAULT '{}'::jsonb -- game-specific attributes (costs, traits)
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE on external identifiers (`kid`)
+ - INDEX on `name` for search
+
+- _Notes:_
+ - Seed catalogs via idempotent seed scripts; `ETL` should reference catalog ids when available.
+ - If `ETL` finds unmapped ids, create placeholder rows with `meta -> { "placeholder": true }` to avoid FK failures.
+
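+A sketch of the placeholder pattern mentioned above, assuming `troop_catalog` uses the columns listed in this section:
+
+```sql
+-- Insert a placeholder row for an unmapped troop id so FK references do not fail; re-running is a no-op.
+INSERT INTO troop_catalog (id, name, meta, created_at, updated_at)
+VALUES ($1, 'UNKNOWN (placeholder)', '{"placeholder": true}'::jsonb, now(), now())
+ON CONFLICT (id) DO NOTHING;
+```
+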
+---
+
+### 3.13 queue_jobs (job persistence / fallback)
+
+- _Purpose:_ persist background jobs (optional DB-backed queue) as a durable fallback or for job inspection.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `type` VARCHAR(128) NOT NULL -- 'process_snapshot', 'archive_snapshot', ...
+ - `payload` JSONB NOT NULL -- minimal payload (snapshot_id, options)
+ - `priority` INT DEFAULT 0
+ - `attempts` INT DEFAULT 0
+ - `max_attempts` INT DEFAULT 5
+ - `status` VARCHAR(32) DEFAULT 'pending' -- pending|running|failed|done
+ - `run_after` TIMESTAMPTZ DEFAULT now()
+ - `last_error` TEXT NULLABLE
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - INDEX `idx_queue_jobs_status_run_after` ON `queue_jobs` (`status`, `run_after`)
+ - Use a small table for visibility; production queue should remain `Redis/BullMQ` for throughput.
+
+- _Notes:_
+ - Use as a fallback for job durability and administrative requeueing.
+
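+If workers ever poll this fallback table directly, a claim query along these lines keeps pickup safe under concurrency
+(a sketch; `Redis/BullMQ` remains the primary queue):
+
+```sql
+-- Claim one runnable job; SKIP LOCKED lets concurrent workers pick different rows.
+UPDATE queue_jobs
+SET status     = 'running',
+    attempts   = attempts + 1,
+    updated_at = now()
+WHERE id = (
+    SELECT id
+    FROM queue_jobs
+    WHERE status = 'pending'
+      AND run_after <= now()
+    ORDER BY priority DESC, run_after
+    LIMIT 1
+    FOR UPDATE SKIP LOCKED
+)
+RETURNING id, type, payload;
+```
+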
+---
+
+### 3.14 audit_logs
+
+- _Purpose:_ capture important admin/user actions for audit and compliance.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `action` VARCHAR(255) NOT NULL -- e.g. 'migration_run', 'snapshot_reprocessed'
+ - `user_id` UUID NULLABLE
+ - `guild_id` UUID NULLABLE
+ - `metadata` JSONB DEFAULT '{}'::jsonb -- action details, sanitized
+ - `ip_address` VARCHAR(45) NULLABLE
+ - `user_agent` VARCHAR(512) NULLABLE
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_audit_logs_action_created_at` ON `audit_logs` (`action`, `created_at` DESC)
+
+- _Notes:_
+ - Do not log secrets or raw snapshots here. Keep metadata minimal and traceable.
+
+---
+
+### 3.15 raw_profiles (alternate simplified snapshot store)
+
+- _Purpose:_ an optional simplified table for ingesting raw profiles before enriched processing (lighter than
+ `hero_snapshots`)
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_external_id` VARCHAR(255) NULLABLE -- upstream player id if available
+ - `source` VARCHAR(64) NOT NULL
+ - `captured_at` TIMESTAMPTZ DEFAULT now()
+ - `raw` JSONB NOT NULL
+
+- _Indexes:_
+ - INDEX `idx_raw_profiles_hero_external_id` ON `raw_profiles` (`hero_external_id`)
+
+- _Notes:_
+ - This table is optional if `hero_snapshots` is sufficient. Use if you want a short ingest path decoupled from
+ `ETL`.
+
+---
+
+### 3.16 heroes (canonical hero mapping — expanded)
+
+- _Purpose:_ normalized representation of a player/hero aggregated across snapshots.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `external_id` VARCHAR(255) UNIQUE NULLABLE -- stable upstream id (if any)
+ - `name` VARCHAR(255) NULLABLE
+ - `namecode` VARCHAR(64) UNIQUE NULLABLE
+ - `name_lower` VARCHAR(255) NULLABLE -- for case-insensitive search
+ - `username` VARCHAR(255) NULLABLE
+ - `level` INT DEFAULT 0
+ - `race` INT NULLABLE
+ - `gender` INT NULLABLE
+ - `class` VARCHAR(128) NULLABLE
+ - `portrait_id` INT NULLABLE
+ - `title_id` INT NULLABLE
+ - `flair_id` INT NULLABLE
+ - `honor_rank` INT NULLABLE
+ - `equipped_weapon_id` INT NULLABLE
+ - `equipped_pet_id` INT NULLABLE
+ - `guild_id` UUID NULLABLE REFERENCES guilds(id)
+ - `guild_external_id` VARCHAR(64) NULLABLE
+ - `guild_name` VARCHAR(255) NULLABLE
+ - `guild_rank` INT NULLABLE
+ - `server_time` TIMESTAMPTZ NULLABLE
+ - `last_login` TIMESTAMPTZ NULLABLE
+ - `last_played` TIMESTAMPTZ NULLABLE
+ - `summary` JSONB NULLABLE -- snapshot-derived denormalized blob for debugging
+ - `extras` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`external_id`) if upstream provides stable id
+ - INDEX `idx_heroes_namecode` ON `heroes` (`namecode`)
+ - INDEX `idx_heroes_guild_id` ON `heroes` (`guild_id`)
+
+- _Notes:_
+ - `heroes` can be the canonical mapping that `ETL` updates; `user_profile_summary` is the read-optimized table
+ consumed by UI/bot.
+ - Keep `summary` for debugging / ad-hoc fallbacks.
+
+---
+
+### 3.17 hero_runes
+
+- _Purpose:_ store runes or rune-sets for a hero.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `runes` JSONB NOT NULL
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_hero_runes_hero` ON `hero_runes` (`hero_id`)
+
+- _Notes:_
+ - Could be multiple rows per hero if the upstream uses multiple rune slots/versioning.
+
+---
+
+### 3.18 hero_troops
+
+- _Purpose:_ store normalized troop inventory per hero (detailed).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `troop_id` INT NOT NULL
+ - `amount` INT DEFAULT 0
+ - `level` INT DEFAULT 1
+ - `current_rarity` INT DEFAULT 0
+ - `fusion_cards` INT DEFAULT 0
+ - `orb_fusion_cards` INT DEFAULT 0
+ - `traits_owned` INT DEFAULT 0
+ - `invasions` INT DEFAULT 0
+ - `shiny_level_progress` INT DEFAULT 0
+ - `orbs_used` JSONB DEFAULT '[]'::jsonb
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`hero_id`, `troop_id`)
+ - INDEX `idx_hero_troops_troop` ON `hero_troops` (`troop_id`)
+ - INDEX `idx_hero_troops_hero` ON `hero_troops` (`hero_id`)
+
+- _Notes:_
+ - Use batch upserts with ON CONFLICT to update amounts/levels atomically.
+
+---
+
+### 3.19 hero_pets
+
+- _Purpose:_ store normalized pets per hero.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `pet_id` INT NOT NULL
+ - `amount` INT DEFAULT 0
+ - `level` INT DEFAULT 1
+ - `xp` BIGINT DEFAULT 0
+ - `orb_fusion_cards` INT DEFAULT 0
+ - `orbs_used` JSONB DEFAULT '[]'::jsonb
+ - `ascension_level` INT DEFAULT 0
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`hero_id`, `pet_id`)
+ - INDEX `idx_hero_pets_pet` ON `hero_pets` (`pet_id`)
+
+---
+
+### 3.20 hero_artifacts
+
+- _Purpose:_ normalized artifacts per hero.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `artifact_id` INT NOT NULL
+ - `xp` BIGINT DEFAULT 0
+ - `level` INT DEFAULT 0
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`hero_id`, `artifact_id`)
+
+---
+
+### 3.21 teams
+
+- _Purpose:_ normalized teams (active configurations) for heroes.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `name` VARCHAR(255) NULLABLE
+ - `banner` INT NULLABLE
+ - `team_level` INT DEFAULT 0
+ - `class` VARCHAR(128) NULLABLE
+ - `override_data` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_teams_hero` ON `teams` (`hero_id`)
+
+---
+
+### 3.22 team_troops
+
+- _Purpose:_ mapping table for troops inside a team (ordered positions).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `team_id` UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE
+ - `position` INT NOT NULL -- position index in team
+ - `troop_id` INT NOT NULL
+
+- _Indexes:_
+ - INDEX `idx_team_troops_team` ON `team_troops` (`team_id`, `position`)
+
+- _Notes:_
+ - Keep position contiguous and deterministic so presentation layers can render quickly.
+
+---
+
+### 3.23 team_saves
+
+- _Purpose:_ user-shared/persisted team configurations (public or private).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `name` VARCHAR(255)
+ - `description` TEXT
+ - `data` JSONB NOT NULL -- full team data for replay
+ - `is_public` BOOLEAN DEFAULT false
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_team_saves_hero` ON `team_saves` (`hero_id`)
+ - INDEX `idx_team_saves_public` ON `team_saves` (`is_public`)
+
+---
+
+### 3.24 team_comments
+
+- _Purpose:_ comments on saved teams (community interactions).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `team_save_id` UUID NOT NULL REFERENCES team_saves(id) ON DELETE CASCADE
+ - `author_user_id` UUID NOT NULL REFERENCES app_users(id) -- note: app_users used for bot operators / forum users
+ - `guild_id` UUID NULLABLE REFERENCES guilds(id)
+ - `comment` TEXT NOT NULL
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_team_comments_team_save` ON `team_comments` (`team_save_id`)
+
+---
+
+### 3.25 hero_kingdom_progress
+
+- _Purpose:_ track hero progress in per-kingdom systems.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `kingdom_id` INT NOT NULL
+ - `status` INT DEFAULT 0
+ - `income` INT DEFAULT 0
+ - `challenge_tier` INT DEFAULT 0
+ - `invasions` INT DEFAULT 0
+ - `power_rank` INT DEFAULT 0
+ - `tasks` JSONB DEFAULT '[]'::jsonb
+ - `explore` JSONB DEFAULT '{}'::jsonb
+ - `trials_team` JSONB DEFAULT '{}'::jsonb
+ - `extra` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`hero_id`, `kingdom_id`)
+ - INDEX `idx_hero_kingdom_progress_hero` ON `hero_kingdom_progress` (`hero_id`)
+
+---
+
+### 3.26 hero_pvp_regions
+
+- _Purpose:_ store per-hero PvP region stats and most-used teams.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE
+ - `region_id` INT NOT NULL
+ - `team` JSONB DEFAULT '{}'::jsonb
+ - `stats` JSONB DEFAULT '{}'::jsonb
+ - `most_used_troop` JSONB DEFAULT '{}'::jsonb
+ - `extras` JSONB DEFAULT '{}'::jsonb
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints & Indexes:_
+ - UNIQUE(`hero_id`, `region_id`)
+ - INDEX `idx_hero_pvp_regions_hero` ON `hero_pvp_regions` (`hero_id`)
+
+---
+
+### 3.27 hero_pvp_stats
+
+- _Purpose:_ aggregate PvP metrics per hero.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE UNIQUE
+ - `invades_won` INT DEFAULT 0
+ - `invades_lost` INT DEFAULT 0
+ - `defends_won` INT DEFAULT 0
+ - `defends_lost` INT DEFAULT 0
+ - `most_invaded_kingdom` JSONB NULLABLE
+ - `most_used_troop` JSONB NULLABLE
+ - `raw` JSONB NULLABLE -- keep raw payload if needed for debug
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Notes:_
+ - This table can be updated incrementally as `ETL` processes snapshots.
+
+---
+
+### 3.28 hero_progress_weapons
+
+- _Purpose:_ weapon progression state per hero.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id)
+ - `weapon_data` JSONB NOT NULL
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_hero_progress_weapons_hero` ON `hero_progress_weapons` (`hero_id`)
+
+---
+
+### 3.29 hero_class_data
+
+- _Purpose:_ per-hero class-related data (per-class specialization or progress).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id)
+ - `class_name` VARCHAR(128) NOT NULL
+ - `data` JSONB NOT NULL
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Constraints:_
+ - UNIQUE(`hero_id`, `class_name`)
+
+---
+
+### 3.30 hero_meta_json
+
+- _Purpose:_ key-value JSON storage for misc hero metadata.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `hero_id` UUID NOT NULL REFERENCES heroes(id)
+ - `key` VARCHAR(255) NOT NULL
+ - `value` JSONB
+ - `created_at` TIMESTAMPTZ DEFAULT now()
+ - `updated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Indexes:_
+ - INDEX `idx_hero_meta_hero_key` ON `hero_meta_json` (`hero_id`, `key`)
+
+---
+
+### 3.31 sheets_sync_logs
+
+- _Purpose:_ logs for synchronizing data to external sheets (optional feature).
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `guild_id` UUID NOT NULL REFERENCES guilds(id)
+ - `sheet_id` VARCHAR(255)
+ - `range` VARCHAR(128)
+ - `rows_sent` INT DEFAULT 0
+ - `status` VARCHAR(64) -- success|failed|pending
+ - `error` JSONB NULLABLE
+ - `started_at` TIMESTAMPTZ NULLABLE
+ - `finished_at` TIMESTAMPTZ NULLABLE
+
+- _Indexes:_
+ - INDEX `idx_sheets_sync_logs_guild` ON `sheets_sync_logs` (`guild_id`)
+
+---
+
+### 3.32 cache_invalidation
+
+- _Purpose:_ single-row records used to coordinate cache invalidation across services.
+
+- _Columns:_
+ - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid()
+ - `key` VARCHAR(255) NOT NULL -- cache key / namespace
+ - `invalidated_at` TIMESTAMPTZ DEFAULT now()
+
+- _Notes:_
+ - Simple table for cross-process invalidation if pub/sub not available.
+
+---
+
+### 3.33 troops_master_light (materialized view)
+
+- _Purpose:_ lightweight read-only snapshot of troops catalogs for analytics and quick lookups (materialized view).
+
+- _Columns (materialized):_
+ - `id` INT PRIMARY KEY
+ - `kid` VARCHAR
+ - `name` VARCHAR
+ - `kingdom_id` INT
+ - `kingdom_name` VARCHAR
+ - `rarity` VARCHAR
+ - `rarity_id` INT
+ - `max_attack` INT
+ - `max_armor` INT
+ - `max_life` INT
+ - `max_magic` INT
+ - `spell_id` INT
+ - `shiny_spell_id` INT
+ - `colors` INT
+ - `arcanes` VARCHAR
+ - `image_url` VARCHAR
+ - `type` VARCHAR
+ - `troop_role1` VARCHAR
+ - `description` TEXT
+
+- _Notes:_
+ - Refresh schedule depends on how often catalog changes; use CONCURRENTLY where supported.
+
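+The refresh itself is a single statement; CONCURRENTLY requires a unique index on the materialized view (here `id`):
+
+```sql
+-- Non-blocking refresh; readers keep seeing the previous contents until the swap completes.
+REFRESH MATERIALIZED VIEW CONCURRENTLY troops_master_light;
+```
+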
+---
+
+## Foreign key / relationship summary (excerpt)
+
+- `role_permissions.role_id` -> `roles.id`
+- `role_permissions.permission_id` -> `permissions.id`
+- `guild_members.guild_id` -> `guilds.id`
+- `guild_members.user_id` -> `app_users.id`
+- `guild_roles.guild_id` -> `guilds.id`
+- `guild_roles.role_id` -> `roles.id`
+- `guild_member_roles.guild_member_id` -> `guild_members.id`
+- `guild_member_roles.guild_role_id` -> `guild_roles.id`
+- `guild_feature_flags.guild_id` -> `guilds.id`
+- `guild_feature_flags.feature_flag_id` -> `feature_flags.id`
+- `troops.kingdom_id` -> `kingdoms.id`
+- `classes.kingdom_id` -> `kingdoms.id`
+- `troops.spell_id` -> `spells.id`
+- `game_events.troop_id` -> `troops.id`
+- `raw_profiles.hero_external_id` -> `heroes.external_id`
+- `heroes.guild_id` -> `guilds.id`
+- `hero_*` tables (`hero_id`) -> `heroes.id`
+- `teams.hero_id` -> `heroes.id`
+- `team_troops.team_id` -> `teams.id`
+- `team_comments.team_save_id` -> `team_saves.id`
+- `team_comments.author_user_id` -> `app_users.id`
+- `sheets_sync_logs.guild_id` -> `guilds.id`
+
+(See DBML snippet for the full list and diagram rendering.)
+
+---
+
+## Example DDL snippets (patterns / best practices)
+
+- `hero_snapshots` insert with hash (`Postgres` example using `pgcrypto`):
+
+```sql
+INSERT INTO hero_snapshots (id, user_id, namecode, source, raw, size_bytes, content_hash, created_at)
+VALUES (gen_random_uuid(),
+ NULL,
+        ($1)::varchar,
+        ($2)::varchar,
+ $3::jsonb,
+ octet_length($3::text),
+ encode(digest($3::text, 'sha256'), 'hex'),
+ now());
+```
+
+- Atomic claim to process snapshot:
+
+```sql
+UPDATE hero_snapshots
+SET processing = true,
+ processing_started_at = now()
+WHERE id = $1
+ AND (processing = false OR processing IS NULL) RETURNING id;
+```
+
+- `GIN` index creation (CONCURRENTLY for no lock where supported):
+
+```sql
+CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_hero_snapshots_raw_gin
+ ON hero_snapshots USING GIN (raw jsonb_path_ops);
+```
+
+- `user_troops` upsert pattern:
+
+```sql
+INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, updated_at)
+VALUES ($1, $2, $3, $4, $5, $6, $7, $8::jsonb, now())
+ON CONFLICT (user_id, troop_id) DO UPDATE
+    SET amount       = EXCLUDED.amount,
+        level        = EXCLUDED.level,
+        rarity       = EXCLUDED.rarity,
+        fusion_cards = EXCLUDED.fusion_cards,
+        traits_owned = EXCLUDED.traits_owned,
+        extra        = COALESCE(user_troops.extra, '{}'::jsonb) || EXCLUDED.extra,
+        updated_at   = now();
+```
+
+---
+
+## Operational recommendations & next steps
+
+1. **Migrations**
+ - Implement these schemas through small, tested `node-pg-migrate` migrations.
+    - Separate extension creation into its own migration (`CREATE EXTENSION IF NOT EXISTS pgcrypto`) with clear
+      fallback instructions.
+
+2. **Indexes**
+ - Create `GIN` indexes for `JSONB` fields used for ad-hoc queries, but measure write overhead.
+ - Use CONCURRENTLY for large indexes in production to avoid long locks.
+
+3. **Partitioning & retention**
+ - Partition `hero_snapshots` by month (range) when dataset grows, to speed deletion/archive jobs.
+ - Implement archival job: compress `raw JSON` -> upload to `S3` -> insert into `snapshot_archives` -> delete
+ partition rows (or mark archived).
+
+4. **Seeds & catalogs**
+ - Provide idempotent seeds for `troop_catalog` / `pet_catalog` / `spells` and keep them under database/seeds/.
+ - `ETL` should create placeholders for missing catalog items and mark them for analyst review.
+
+5. **Backfill & ETL**
+ - Use the same `ETL` code for new snapshots and backfills to guarantee mapping parity.
+ - Record backfill job progress in `queue_jobs` or a `backfill_audit` table for resume capability.
+
+6. **Tests & validation**
+ - Add integration tests that run migrations, insert sample snapshots (docs/examples), run `ETL` and verify
+ normalized tables and `user_profile_summary`.
+
+---
+
+## 4. Indexes & query patterns
+
+### 4.1 Recommended indexes (examples)
+
+- **hero_snapshots**:
+ - **GIN**: `CREATE INDEX idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops);`
+ - **B-tree**: `CREATE INDEX idx_hero_snapshots_user_created_at ON hero_snapshots (user_id, created_at DESC);`
+ - **Optional uniqueness**:
+ `CREATE UNIQUE INDEX ux_hero_snapshots_contenthash_source ON hero_snapshots (content_hash, source)` (use with
+ caution; dedupe logic may be more flexible in app)
+- **users**:
+ - `CREATE UNIQUE INDEX ux_users_namecode ON users (namecode);`
+ - `CREATE INDEX idx_users_discord_user_id ON users (discord_user_id);`
+- **user_troops**:
+ - `CREATE UNIQUE INDEX ux_user_troops_user_troop ON user_troops (user_id, troop_id);`
+ - `CREATE INDEX idx_user_troops_troop ON user_troops (troop_id);`
+- **user_profile_summary**:
+ - `CREATE INDEX idx_profile_summary_namecode ON user_profile_summary (namecode);`
+- **etl_errors**:
+ - `CREATE INDEX idx_etl_errors_snapshot_id_created_at ON etl_errors (snapshot_id, created_at DESC);`
+
+### 4.2 Query patterns (examples)
+
+- Latest processed snapshot for user:
+
+```sql
+SELECT *
+FROM hero_snapshots
+WHERE user_id = $1
+ AND processed_at IS NOT NULL
+ORDER BY created_at DESC LIMIT 1;
+```
+
+- Profile summary by namecode:
+
+```sql
+SELECT *
+FROM user_profile_summary
+WHERE namecode = $1;
+```
+
+- Who owns troop 6024:
+
+```sql
+SELECT u.id, u.namecode, ut.amount, ut.level
+FROM user_troops ut
+ JOIN users u ON ut.user_id = u.id
+WHERE ut.troop_id = 6024
+ AND ut.amount > 0
+ORDER BY ut.amount DESC LIMIT 100;
+```
+
+---
+
+## 5. Partitioning, retention & archival guidance
+
+### 5.1 Partition hero_snapshots when volume grows
+
+- _Strategy:_ time-based partitions (monthly) on created_at:
+ - Create parent table `hero_snapshots` and child partitions `hero_snapshots_2025_11` etc.
+- _Benefits:_ faster archival/deletion, reduced vacuum overhead, improved index size on recent partitions.
+
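+A declarative-partitioning sketch of the strategy above; the partition name follows the convention in the bullet and
+the column list is abbreviated:
+
+```sql
+-- Parent table partitioned by month on created_at (remaining columns elided for brevity).
+CREATE TABLE hero_snapshots (
+    id         uuid        NOT NULL DEFAULT gen_random_uuid(),
+    raw        jsonb       NOT NULL,
+    created_at timestamptz NOT NULL DEFAULT now(),
+    PRIMARY KEY (id, created_at)  -- the partition key must be part of any unique constraint
+) PARTITION BY RANGE (created_at);
+
+-- One child partition per month; create the next month's partition ahead of time.
+CREATE TABLE hero_snapshots_2025_11 PARTITION OF hero_snapshots
+    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');
+```
+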
+### 5.2 Retention policy & archival
+
+- Default retention example: keep 90 days in DB, archive older snapshots to `S3` (compressed) with audit metadata.
+- Steps for archival:
+ 1. SELECT snapshots older than retention threshold and not flagged permanent.
+ 2. Compress raw `JSON` (gzip/zstd), upload to `S3` with server-side encryption.
+ 3. Insert archival metadata (`s3_path`, `checksum`, `archived_at`) into `snapshot_archives` table and then DELETE or
+ mark as archived.
+- Keep at least N recent snapshots per user (configurable) if required.
+
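+A sketch of steps 1 and 3 above, assuming a `snapshot_archives` table with (`snapshot_id`, `s3_path`, `checksum`,
+`archived_at`) columns; its exact shape is not fixed by this document:
+
+```sql
+-- Step 1: candidate snapshots older than the retention window (90 days is the default used in section 8).
+SELECT id
+FROM hero_snapshots
+WHERE created_at < now() - interval '90 days'
+  AND processed_at IS NOT NULL;
+
+-- Step 3: record archival metadata after the compressed payload has been uploaded to S3.
+INSERT INTO snapshot_archives (snapshot_id, s3_path, checksum, archived_at)
+VALUES ($1, $2, $3, now());
+```
+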
+---
+
+## 6. Upsert examples and idempotency patterns
+
+### 6.1 user_troops upsert (example)
+
+- Use single multi-row upsert for batch inserts/updates:
+
+```sql
+-- example uses tmp table or values list
+INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, updated_at)
+VALUES ($1, $2, $3, $4, $5, $6, $7, $8, now()), ...
+ON CONFLICT (user_id, troop_id) DO UPDATE
+    SET amount       = EXCLUDED.amount,
+        level        = EXCLUDED.level,
+        rarity       = EXCLUDED.rarity,
+        fusion_cards = EXCLUDED.fusion_cards,
+        traits_owned = EXCLUDED.traits_owned,
+        extra        = COALESCE(user_troops.extra, '{}'::jsonb) || EXCLUDED.extra,
+        updated_at   = now();
+```
+
+- Keep transactions per entity group (users, troops, pets) to avoid holding long transactions.
+
+### 6.2 users upsert example
+
+```sql
+INSERT INTO users (id, namecode, username, discord_user_id, email, created_at, updated_at)
+VALUES (gen_random_uuid(), $namecode, $username, $discord_id, $email, now(), now())
+ON CONFLICT (namecode) DO UPDATE
+    SET username        = EXCLUDED.username,
+        discord_user_id = COALESCE(users.discord_user_id, EXCLUDED.discord_user_id),
+        email           = COALESCE(users.email, EXCLUDED.email),
+        updated_at      = now();
+```
+
+- If generating `UUIDs` client-side, ensure deterministic mapping or idempotency key in insertion logic.
+
+### 6.3 Mark snapshot processed
+
+- After successful processing:
+
+```sql
+UPDATE hero_snapshots
+SET processed_at = now(),
+ processing = false,
+ last_error = NULL
+WHERE id = $snapshot_id;
+```
+
+- Use atomic claim to set `processing=true`:
+
+```sql
+UPDATE hero_snapshots
+SET processing = true,
+ processing_started_at = now()
+WHERE id = $snapshot_id
+ AND (processing = false OR processing IS NULL) RETURNING id;
+```
+
+---
+
+## 7. Migration strategy & preflight checks
+
+### 7.1 Tooling
+
+- Use `node-pg-migrate` (`JS`) for migrations; store migrations under `database/migrations/` and use timestamped
+ filenames.
+- Keep migrations small and reversible where possible.
+
+### 7.2 Preflight checks (CI job)
+
+- Verify required extensions & permissions (e.g., `CREATE EXTENSION IF NOT EXISTS pgcrypto;`).
+- Run migrations in a disposable ephemeral DB (testcontainers) and ensure `up` completes.
+- Run a smoke `ETL`: insert a sample snapshot, run the worker code path, assert summary created.
+
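+A quick probe for the extension check above (works on stock PostgreSQL and most managed providers):
+
+```sql
+-- Confirm pgcrypto is available before migrations attempt to enable it.
+SELECT name, default_version, installed_version
+FROM pg_available_extensions
+WHERE name = 'pgcrypto';
+```
+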
+### 7.3 Safe migration pattern
+
+- 3-step for destructive change:
+ 1. Add nullable column / new table / index (no locks if possible: `CREATE INDEX CONCURRENTLY`).
+ 2. Backfill and verify via a background job.
+ 3. Make column NOT NULL and remove old column in a later migration.
+
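+In SQL terms the three steps look roughly like this (the column and table names are purely illustrative):
+
+```sql
+-- Migration 1: add the new column as nullable; a fast, metadata-only change.
+ALTER TABLE users ADD COLUMN display_name varchar(255);
+
+-- Step 2 (background job, batched): backfill outside the migration transaction.
+UPDATE users SET display_name = username WHERE display_name IS NULL;
+
+-- Migration 2 (later, once the backfill is verified): enforce the constraint.
+ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
+```
+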
+### 7.4 Production bootstrap
+
+- Production migrations and bootstrap should run via a manual `GitHub Action` (`db-bootstrap.yml`) protected by GitHub
+ Environments and approvers.
+- Always take a DB snapshot/backup before applying production migrations.
+
+### 7.5 Extension fallback
+
+- If `pgcrypto` cannot be enabled by provider, fallback to generating `UUIDs` in application code or use
+ `uuid_generate_v4()` only if available.
+
+---
+
+## 8. Data retention & archival model (summary)
+
+- Default retention: 90 days for `hero_snapshots` in DB (configurable).
+- Keep last N snapshots per user (e.g., last 30) in addition to time window.
+- Archive older snapshots to `S3` (gzip/zstd) with checksum and audit metadata in `snapshot_archives` table.
+- Provide admin endpoints and runbooks for reprocessing archived snapshots (download from `S3`, reinsert, reprocess).
+
+---
+
+## 9. Validation queries & health checks
+
+- Basic health:
+
+```sql
+SELECT 1;
+```
+
+- Sanity check after migrations:
+
+```sql
+SELECT count(*)
+FROM information_schema.tables
+WHERE table_schema = 'public'
+ AND table_name IN ('users', 'hero_snapshots', 'user_troops');
+```
+
+- Sample `ETL` verification (for a given namecode):
+
+```sql
+-- latest snapshot exists and processed
+SELECT hs.id, hs.processed_at, ups.user_id, ups.namecode
+FROM hero_snapshots hs
+ LEFT JOIN users ups ON hs.user_id = ups.id
+WHERE hs.namecode = $1
+ORDER BY hs.created_at DESC LIMIT 1;
+```
+
+- Idempotency test:
+ - Insert snapshot A, process, record counts.
+ - Re-insert same snapshot (or reprocess) and assert no duplicate rows and counts unchanged.
+
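+A count comparison such a test could assert on (a sketch; the chosen tables are examples):
+
+```sql
+-- Row counts per normalized table for one user; run before and after reprocessing and diff the results.
+SELECT (SELECT count(*) FROM user_troops WHERE user_id = $1)    AS troop_rows,
+       (SELECT count(*) FROM user_pets WHERE user_id = $1)      AS pet_rows,
+       (SELECT count(*) FROM user_artifacts WHERE user_id = $1) AS artifact_rows;
+```
+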
+---
+
+## 10. DBML schema (paste into dbdiagram.io)
+
+The `DBML` representation below is provided for visualization. Paste into https://dbdiagram.io to generate diagrams. The
+code here is the cleaned `DBML` derived from the PRD; it mirrors the canonical tables and relationships described above.
+
+```dbml
+// DBML representation of the StarForge schema (cleaned).
+// Paste this into https://dbdiagram.io to visualize the schema.
+// jsonb columns are represented as "json" for visualization purposes.
+
+Table app_users {
+ id uuid [pk]
+ email varchar [unique]
+ username varchar
+ display_name varchar
+ avatar_url varchar
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table roles {
+ id uuid [pk]
+ name varchar [unique, not null]
+ description text
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table permissions {
+ id uuid [pk]
+ code varchar [unique, not null]
+ description text
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table role_permissions {
+ role_id uuid
+ permission_id uuid
+ // composite PK exists in SQL
+}
+
+Table guilds {
+ id uuid [pk]
+ external_id varchar [unique]
+ name varchar
+ description text
+ settings json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table guild_members {
+ id uuid [pk]
+ guild_id uuid
+ user_id uuid
+ game_player_id varchar
+ joined_at timestamp
+ is_owner boolean
+ created_at timestamp
+ updated_at timestamp
+ // unique (guild_id, user_id)
+}
+
+Table guild_roles {
+ id uuid [pk]
+ guild_id uuid
+ role_id uuid
+ name varchar
+ permission_overrides json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table guild_member_roles {
+ id uuid [pk]
+ guild_member_id uuid
+ guild_role_id uuid
+ created_at timestamp
+}
+
+Table audit_logs {
+ id uuid [pk]
+ action varchar [not null]
+ user_id uuid
+ guild_id uuid
+ metadata json
+ ip_address varchar
+ user_agent varchar
+ created_at timestamp
+}
+
+Table feature_flags {
+ id uuid [pk]
+ name varchar [unique, not null]
+ enabled boolean
+ rollout_percentage int
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table guild_feature_flags {
+ id uuid [pk]
+ guild_id uuid
+ feature_flag_id uuid
+ enabled boolean
+}
+
+Table troops {
+ id int [pk]
+ kid varchar
+ name varchar
+ description text
+ colors int
+ arcanes varchar
+ image_url varchar
+ kingdom_id int
+ kingdom_name varchar
+ max_attack int
+ max_armor int
+ max_life int
+ max_magic int
+ rarity varchar
+ rarity_id int
+ spell_id int
+ shiny_spell_id int
+ spell_name varchar
+ shiny_spell_name varchar
+ spell_cost int
+ release_date bigint
+ switch_date bigint
+ troop_role1 varchar
+ type varchar
+ type_code1 varchar
+ type_code2 varchar
+ traits json
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table kingdoms {
+ id int [pk]
+ kid varchar
+ name varchar
+ map_index varchar
+ byline varchar
+ description text
+ banner_name varchar
+ banner_image_url varchar
+ bg_image_url varchar
+ banner_mana varchar
+ banner_mana_bits int
+ release_date bigint
+ switch_date bigint
+ tribute_glory int
+ tribute_gold int
+ tribute_souls int
+ explore_traitstone_id int
+ explore_traitstone_colors int
+ explore_traitstone_color_names varchar
+ level_mana_color int
+ level_stat varchar
+ quest text
+ map_position varchar
+ type varchar
+ troops json
+ bonuses json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table classes {
+ id int [pk]
+ class_code varchar
+ name varchar
+ kingdom_id int
+ kingdom_name varchar
+ image_url varchar
+ page_url varchar
+ rarity varchar
+ max_armor int
+ max_attack int
+ max_life int
+ max_magic int
+ spell_id int
+ weapon_id int
+ weapon_name varchar
+ talent_codes json
+ talent_list json
+ traits json
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table talents {
+ code varchar [pk]
+ name varchar
+ talent1 varchar
+ talent1_desc text
+ talent2 varchar
+ talent2_desc text
+ talent3 varchar
+ talent3_desc text
+ talent4 varchar
+ talent4_desc text
+ talent5 varchar
+ talent5_desc text
+ talent6 varchar
+ talent6_desc text
+ talent7 varchar
+ talent7_desc text
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table medals {
+ id int [pk]
+ name varchar
+ description text
+ data varchar
+ effect varchar
+ level int
+ rarity varchar
+ is_event_medal boolean
+ image_url varchar
+ evolves_into int
+ group_id int
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table pets {
+ id int [pk]
+ name varchar
+ kingdom_id int
+ kingdom_name varchar
+ mana_color varchar
+ mana_color_num int
+ image_url varchar
+ effect varchar
+ effect_data varchar
+ effect_title varchar
+ event varchar
+ release_date bigint
+ switch_date bigint
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table spells {
+ id int [pk]
+ name varchar
+ description text
+ cost int
+ image_url varchar
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table aspects {
+ code varchar [pk]
+ name varchar
+ description text
+ image_url varchar
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table traitstones {
+ id int [pk]
+ name varchar
+ colors int
+ image_url varchar
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table weapons {
+ id int [pk]
+ name varchar
+ image_url varchar
+ kingdom_id int
+ kingdom_name varchar
+ spell_id int
+ spell_name varchar
+ spell_cost int
+ affixes varchar
+ colors int
+ mastery_requirement int
+ rarity varchar
+ rarity_id int
+ obtain_by varchar
+ weapon_role1 varchar
+ weapon_upgrade varchar
+ release_date bigint
+ switch_date bigint
+ extras json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table game_events {
+ id int [pk]
+ event_id int
+ troop_id int
+ kingdom_id int
+ start_date bigint
+ end_date bigint
+ metadata json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table raw_profiles {
+ id uuid [pk]
+ hero_external_id varchar
+ source varchar
+ captured_at timestamp
+ raw json [not null]
+}
+
+Table heroes {
+ id uuid [pk]
+ external_id varchar [unique]
+ name varchar
+ namecode varchar
+ name_lower varchar
+ username varchar
+ level int
+ level_new int
+ race int
+ race_alt int
+ gender int
+ class varchar
+ portrait_id int
+ title_id int
+ flair_id int
+ honor_rank int
+ equipped_weapon_id int
+ equipped_pet_id int
+ guild_id uuid
+ guild_external_id varchar
+ guild_name varchar
+ guild_rank int
+ server_time timestamp
+ last_login timestamp
+ last_played timestamp
+ created_at timestamp
+ updated_at timestamp
+ summary json
+ extras json
+}
+
+Table hero_runes {
+ id uuid [pk]
+ hero_id uuid
+ runes json [not null]
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table hero_troops {
+ id uuid [pk]
+ hero_id uuid
+ troop_id int
+ amount int
+ level int
+ current_rarity int
+ fusion_cards int
+ orb_fusion_cards int
+ traits_owned int
+ invasions int
+ shiny_level_progress int
+ orbs_used json
+ extra json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, troop_id)
+}
+
+Table hero_pets {
+ id uuid [pk]
+ hero_id uuid
+ pet_id int
+ amount int
+ level int
+ xp bigint
+ orb_fusion_cards int
+ orbs_used json
+ ascension_level int
+ extra json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, pet_id)
+}
+
+Table hero_artifacts {
+ id uuid [pk]
+ hero_id uuid
+ artifact_id int
+ xp bigint
+ level int
+ extra json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, artifact_id)
+}
+
+Table teams {
+ id uuid [pk]
+ hero_id uuid
+ name varchar
+ banner int
+ team_level int
+ class varchar
+ override_data json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table team_troops {
+ id uuid [pk]
+ team_id uuid
+ position int
+ troop_id int
+}
+
+Table team_saves {
+ id uuid [pk]
+ hero_id uuid
+ name varchar
+ description text
+ data json [not null]
+ is_public boolean
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table team_comments {
+ id uuid [pk]
+ team_save_id uuid
+ author_user_id uuid
+ guild_id uuid
+ comment text
+ created_at timestamp
+}
+
+Table hero_kingdom_progress {
+ id uuid [pk]
+ hero_id uuid
+ kingdom_id int
+ status int
+ income int
+ challenge_tier int
+ invasions int
+ power_rank int
+ tasks json
+ explore json
+ trials_team json
+ extra json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, kingdom_id)
+}
+
+Table hero_pvp_regions {
+ id uuid [pk]
+ hero_id uuid
+ region_id int
+ team json
+ stats json
+ most_used_troop json
+ extras json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, region_id)
+}
+
+Table hero_pvp_stats {
+ id uuid [pk]
+ hero_id uuid
+ invades_won int
+ invades_lost int
+ defends_won int
+ defends_lost int
+ most_invaded_kingdom json
+ most_used_troop json
+ raw json
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id)
+}
+
+Table hero_progress_weapons {
+ id uuid [pk]
+ hero_id uuid
+ weapon_data json [not null]
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table hero_class_data {
+ id uuid [pk]
+ hero_id uuid
+ class_name varchar
+ data json [not null]
+ created_at timestamp
+ updated_at timestamp
+ // unique (hero_id, class_name)
+}
+
+Table hero_meta_json {
+ id uuid [pk]
+ hero_id uuid
+ key varchar
+ value json
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table sheets_sync_logs {
+ id uuid [pk]
+ guild_id uuid
+ sheet_id varchar
+ range varchar
+ rows_sent int
+ status varchar
+ error json
+ started_at timestamp
+ finished_at timestamp
+}
+
+Table queue_jobs {
+ id uuid [pk]
+ type varchar
+ payload json
+ priority int
+ attempts int
+ max_attempts int
+ status varchar
+ run_after timestamp
+ last_error text
+ created_at timestamp
+ updated_at timestamp
+}
+
+Table cache_invalidation {
+ id uuid [pk]
+ key varchar
+ invalidated_at timestamp
+}
+
+Table troops_master_light {
+ id int [pk]
+ kid varchar
+ name varchar
+ kingdom_id int
+ kingdom_name varchar
+ rarity varchar
+ rarity_id int
+ max_attack int
+ max_armor int
+ max_life int
+ max_magic int
+ spell_id int
+ shiny_spell_id int
+ colors int
+ arcanes varchar
+ image_url varchar
+ type varchar
+ troop_role1 varchar
+ description text
+}
+
+// =========================
+// Relationships (Refs) - deduplicated
+// =========================
+
+Ref: role_permissions.role_id > roles.id
+Ref: role_permissions.permission_id > permissions.id
+
+Ref: guild_members.guild_id > guilds.id
+Ref: guild_members.user_id > app_users.id
+
+Ref: guild_roles.guild_id > guilds.id
+Ref: guild_roles.role_id > roles.id
+
+Ref: guild_member_roles.guild_member_id > guild_members.id
+Ref: guild_member_roles.guild_role_id > guild_roles.id
+
+Ref: guild_feature_flags.guild_id > guilds.id
+Ref: guild_feature_flags.feature_flag_id > feature_flags.id
+
+Ref: troops.kingdom_id > kingdoms.id
+Ref: classes.kingdom_id > kingdoms.id
+Ref: weapons.kingdom_id > kingdoms.id
+
+Ref: troops.spell_id > spells.id
+Ref: troops.shiny_spell_id > spells.id
+
+Ref: game_events.troop_id > troops.id
+Ref: game_events.kingdom_id > kingdoms.id
+
+Ref: raw_profiles.hero_external_id > heroes.external_id
+
+Ref: heroes.guild_id > guilds.id
+
+Ref: hero_runes.hero_id > heroes.id
+Ref: hero_troops.hero_id > heroes.id
+Ref: hero_pets.hero_id > heroes.id
+Ref: hero_artifacts.hero_id > heroes.id
+
+Ref: teams.hero_id > heroes.id
+Ref: team_troops.team_id > teams.id
+
+Ref: team_saves.hero_id > heroes.id
+Ref: team_comments.team_save_id > team_saves.id
+Ref: team_comments.author_user_id > app_users.id
+Ref: team_comments.guild_id > guilds.id
+
+Ref: hero_kingdom_progress.hero_id > heroes.id
+Ref: hero_pvp_regions.hero_id > heroes.id
+Ref: hero_pvp_stats.hero_id > heroes.id
+Ref: hero_progress_weapons.hero_id > heroes.id
+Ref: hero_class_data.hero_id > heroes.id
+Ref: hero_meta_json.hero_id > heroes.id
+
+Ref: sheets_sync_logs.guild_id > guilds.id
+```
+
+---
+
+## 11. Change log & references
+
+- Change log for the DB model should reference migration IDs and PR numbers. Create `docs/CHANGELOG.md` entries when
+ migrations are merged.
+- See also:
+ - PRD: docs/PRD.md (product-level requirements and mappings)
+ - ETL & Worker design: [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md)
+ - Migration conventions: [docs/MIGRATIONS.md](./MIGRATIONS.md)
+ - Observability: [docs/OBSERVABILITY.md](./OBSERVABILITY.md)
+
+---
diff --git a/docs/ETL_AND_WORKER.md b/docs/ETL_AND_WORKER.md
new file mode 100644
index 0000000..ebca99c
--- /dev/null
+++ b/docs/ETL_AND_WORKER.md
@@ -0,0 +1,821 @@
+# ETL & Worker — StarForge
+
+> Version: 0.1
+> Date: 2025-12-02
+
+---
+
+## Table of contents
+
+1. [Executive summary](#1-executive-summary)
+2. [Scope & audience](#2-scope--audience)
+3. [High-level architecture](#3-high-level-architecture)
+4. [Input contract (hero_snapshots row & job payload)](#4-input-contract-hero_snapshots-row--job-payload)
+5. [Queue design & job payload schema](#5-queue-design--job-payload-schema)
+6. [Worker claim semantics & concurrency model](#6-worker-claim-semantics--concurrency-model)
+7. [Processing pipeline — end-to-end](#7-processing-pipeline--end-to-end)
+8. [Upsert patterns & SQL examples](#8-upsert-patterns--sql-examples)
+9. [Error handling, retry policy & poison messages](#9-error-handling-retry-policy--poison-messages)
+10. [Monitoring & metrics (what to emit)](#10-monitoring--metrics-what-to-emit)
+11. [Logging, tracing & structured events](#11-logging-tracing--structured-events)
+12. [Backfill, bulk processing & rate-control](#12-backfill-bulk-processing--rate-control)
+13. [Admin APIs & operational endpoints](#13-admin-apis--operational-endpoints)
+14. [Testing strategy (unit / integration / e2e / perf)](#14-testing-strategy-unit--integration--e2e--perf)
+15. [Security & privacy controls](#15-security--privacy-controls)
+16. [Common incidents & runbook actions](#16-common-incidents--runbook-actions)
+17. [Configuration / example env](#17-configuration--example-env)
+18. [Appendix](#18-appendix)
+ - Example job payloads
+ - Example TypeScript worker skeleton (BullMQ + pg)
+ - Helpful SQL queries
+ - References
+
+---
+
+## 1. Executive summary
+
+This document defines the `ETL` (Extract-Transform-Load) worker design for StarForge. It specifies:
+
+- the contract between ingestion (`hero_snapshots`) and worker jobs,
+- queue and retry semantics,
+- idempotent upsert patterns to normalize `raw player JSON` into relational tables,
+- monitoring and logging conventions,
+- operational runbooks for common incidents,
+- testing & migration considerations.
+
+> Goal: provide a deterministic, observable, and safe pipeline that can process large (2–3MB) `JSON snapshots`, scale
+> horizontally, and be re-run (idempotent) for repair/backfill.
+
+---
+
+## 2. Scope & audience
+
+**Primary audience:**
+
+- Backend engineers implementing worker & queue code
+- DevOps/SRE running workers and scaling infrastructure
+- QA engineers writing integration & performance tests
+- Data engineers validating normalized outputs
+
+**Scope:**
+
+- Processing snapshots stored in `hero_snapshots` (JSONB).
+- Upserting normalized tables (`users`, `user_troops`, `user_pets`, `user_artifacts`, `user_teams`, `guilds`,
+ `user_profile_summary`, `etl_errors`).
+- Supporting reprocessing, retries, and safe backfills.
+- Instrumentation requirements for observability.
+
+**Out of scope (for this doc):**
+
+- UI or bot behavior (only how they observe `ETL` status).
+- Detailed catalog seeding (see [docs/DB_MODEL.md](./DB_MODEL.md)).
+- `Non-ETL` background jobs (e.g., exports) beyond brief notes.
+
+---
+
+## 3. High-level architecture
+
+### Textual diagram (components):
+
+- Client (CLI / Bot / API) -> API Service (ingest endpoint)
+ - writes `hero_snapshots` (raw JSONB)
+ - enqueues job to `Redis/BullMQ` with `snapshot_id`
+ -> Redis / Queue
+ -> `ETL Worker Pool` (n instances)
+ - claim snapshot (atomic DB update)
+ - stream parse `JSONB`
+ - per-entity upserts to `Postgres`
+ - mark snapshot processed / write `etl_errors`
+ - emit metrics to `Prometheus`, errors to `Sentry`
+ -> `Postgres` (normalized tables + `hero_snapshots`)
+ -> `S3` (archive older snapshots) [optional]
+
+### Key constraints & goals:
+
+- _Idempotency:_ reprocessing same snapshot must not duplicate rows.
+- _Memory-safety:_ do not load full large JSON into heap; stream arrays.
+- _Observability:_ job metrics, `ETL` duration, error categories.
+- _Safety:_ per-entity transactions to avoid monolithic long transactions that block DB.
+
+---
+
+## 4. Input contract (hero_snapshots row & job payload)
+
+### 4.1 hero_snapshots canonical row (DB contract)
+
+- _Table:_ `hero_snapshots`
+- _Important columns (read-only contract for workers):_
+ - `id` UUID (snapshot id)
+ - `user_id` UUID (optional)
+ - `namecode` VARCHAR(64) (optional)
+ - `source` VARCHAR (fetch_by_namecode | login | cli_upload | ui_fetch)
+ - `raw` JSONB (full get_hero_profile payload)
+ - `size_bytes` INTEGER
+ - `content_hash` VARCHAR (SHA256 hex)
+ - `processing` BOOLEAN
+ - `processing_started_at` TIMESTAMPTZ
+ - `processed_at` TIMESTAMPTZ
+ - `error_count` INTEGER
+ - `last_error` JSONB
+ - `created_at` TIMESTAMPTZ
+
+Workers MUST treat `raw` as authoritative payload for that job.
+
+### 4.2 Job payload (queue message)
+
+- Minimal job payload should be small and stable:
+
+```json
+{
+ "snapshot_id": "uuid",
+  "correlation_id": "uuid",
+ "enqueue_ts": "2025-12-02T12:00:00Z"
+}
+```
+
+- `correlation_id` is an optional request trace id used to correlate the job with the originating API request.
+- Avoid embedding the full JSON in the job; workers read `raw` from DB by `snapshot_id`.
+- Job `TTL`: the worker should drop jobs older than a configurable threshold (e.g., 30d) unless reprocessing older
+  snapshots is explicitly allowed.
+
+### 4.3 Input validation rules
+
+- Worker must validate:
+ - `raw` is JSON and contains expected top-level keys (e.g., `NameCode`, `ProfileData`) before attempting mapping.
+ - `size_bytes` matches `octet_length(raw::text)` (optional sanity check).
+- If `raw` is malformed JSON: record `ETL` error, increment `error_count` and stop; do not set `processed_at`.
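+
+A minimal validation sketch, assuming the top-level keys mentioned above (`NameCode`, `ProfileData`); adapt the key list
+to the real payload shape:
+
+```ts
+// validate-snapshot.ts: shape check before mapping (sketch)
+export interface SnapshotValidation {
+  ok: boolean;
+  reason?: string;
+}
+
+export function validateRawSnapshot( raw: unknown ): SnapshotValidation {
+  // raw must be a JSON object carrying the expected top-level keys
+  if ( typeof raw !== 'object' || raw === null || Array.isArray( raw ) ) {
+    return { ok: false, reason: 'raw payload is not a JSON object' };
+  }
+  const obj = raw as Record<string, unknown>;
+  for ( const key of [ 'NameCode', 'ProfileData' ] ) {
+    if ( !( key in obj ) ) {
+      return { ok: false, reason: `missing top-level key: ${ key }` };
+    }
+  }
+  return { ok: true };
+}
+```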
+
+---
+
+## 5. Queue design & job payload schema
+
+### 5.1 Choice: Redis + BullMQ (recommended)
+
+- Advantages:
+ - job retries
+ - delayed jobs
+ - job priorities
+ - concurrency control
+ - job inspection
+
+- Use a short job payload referencing `snapshot_id` and correlation metadata only.
+
+### 5.2 Queue naming & priorities
+
+- queue: `etl:snapshots` (default)
+- priority levels (note: BullMQ expects a positive integer `priority` where lower values are processed first):
+  - **high** (interactive, e.g., single-player fetch): `priority = 1`
+  - **normal** (background ingestion): `priority = 5`
+  - **low** (bulk backfill): `priority = 10`
+- Use separate queues for reprocess/backfill jobs (e.g., `etl:backfill`) to throttle separately.
+
+### 5.3 Retry & backoff strategy (BullMQ config)
+
+- Attempts: `maxAttempts = configurable` (default 5)
+- Backoff: exponential with jitter: `backoff = min(2^attempt * baseMs, maxMs)` plus a random jitter term
+  - `baseMs = 1000 ms`, `maxMs = 5 minutes`
+- On transient DB errors: let queue retry.
+- On permanent errors (parse error, invalid shape): mark as failed and do not retry automatically (see section 9).
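+
+A sketch of enqueueing with these options via `BullMQ` (env variable names follow section 17; the 5-minute cap and the
+jitter term would need a custom backoff strategy, as noted in the comment):
+
+```ts
+// enqueue-snapshot.ts: enqueue a snapshot job with retry/backoff options (sketch)
+import { Queue } from 'bullmq';
+
+const connection = { host: process.env.REDIS_HOST, port: Number( process.env.REDIS_PORT || 6379 ) };
+const etlQueue = new Queue( 'etl:snapshots', { connection } );
+
+export async function enqueueSnapshot( snapshotId: string, correlationId: string ): Promise<void> {
+  await etlQueue.add(
+    'process_snapshot',
+    { snapshot_id: snapshotId, correlation_id: correlationId, enqueue_ts: new Date().toISOString() },
+    {
+      attempts: Number( process.env.ETL_MAX_RETRIES || 5 ),
+      // BullMQ's built-in exponential backoff doubles the delay per attempt; the max-delay cap and
+      // jitter described above would require a custom backoff strategy.
+      backoff: { type: 'exponential', delay: Number( process.env.ETL_BACKOFF_BASE_MS || 1000 ) },
+      removeOnComplete: true,
+    }
+  );
+}
+```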
+
+### 5.4 Poison message handling
+
+- If a job hits maxAttempts: move to "failed" and record `ETL` error row with `permanent: true`. Admins can re-enqueue
+ after manual fix.
+
+### 5.5 Job visibility & admin tools
+
+- Persist job metadata to `queue_jobs` table optionally for auditing (`job_id`, `snapshot_id`, `type`, `status`).
+- Build admin UI to list recent jobs, failures, and reprocess actions.
+
+---
+
+## 6. Worker claim semantics & concurrency model
+
+### 6.1 Atomic claim (DB-first)
+
+- Use DB to claim a snapshot to avoid race conditions across worker instances:
+
+```sql
+WITH claimed AS (
+  UPDATE hero_snapshots
+  SET processing = true,
+      processing_started_at = now()
+  WHERE id = $1
+    AND (processing = false OR processing IS NULL)
+  RETURNING id
+)
+SELECT id FROM claimed;
+```
+
+- If no rows returned: someone else claimed it, skip job.
+
+### 6.2 Reasoning
+
+- Use DB atomic update so job coordination remains consistent even if the queue loses a lease.
+- Avoid advisory locks unless necessary; if you need cross-DB locking for multi-step workflows, use
+ `pg_advisory_lock(key)` carefully and always release locks.
+
+### 6.3 Worker concurrency & DB connection budgeting
+
+- Configure worker concurrency so that `workers * per_worker_pool_size + app_connections` stays below the DB max
+  connections, leaving headroom for `pgbouncer`.
+- Suggested defaults:
+ - per-worker concurrency: 4 (tune by instance size)
+ - pg pool max per worker: 2–4
+- Use pooled DB client (pg.Pool) and reuse connections.
+
+### 6.4 Claim TTL & stuck job detection
+
+- If a worker crashes after claiming, a snapshot may remain with `processing=true`. Use a watchdog that detects
+  snapshots with `processing=true` and `processing_started_at` older than `claim_ttl` (e.g., 30 minutes) and either:
+  - marks `processing=false` and increments `error_count`, or
+  - reassigns via admin requeue after human review (see the sketch below).
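+
+A watchdog sketch for the first option (reset the claim and record the error); it could run periodically from a cron or
+scheduled job:
+
+```ts
+// stuck-snapshot-watchdog.ts: release claims older than CLAIM_TTL_MINUTES (sketch)
+import { Pool } from 'pg';
+
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+const CLAIM_TTL_MINUTES = Number( process.env.CLAIM_TTL_MINUTES || 30 );
+
+export async function releaseStuckSnapshots(): Promise<number> {
+  // Mark long-claimed snapshots as unclaimed and bump error_count so operators can review them.
+  const res = await pool.query(
+    `UPDATE hero_snapshots
+     SET processing = false,
+         error_count = COALESCE(error_count, 0) + 1,
+         last_error = to_jsonb('claim expired by watchdog'::text)
+     WHERE processing = true
+       AND processing_started_at < now() - ($1::int * interval '1 minute')
+     RETURNING id`,
+    [ CLAIM_TTL_MINUTES ] );
+  return res.rowCount ?? 0;
+}
+```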
+
+---
+
+## 7. Processing pipeline — end-to-end
+
+This section describes the sequential steps the worker executes for each snapshot. Each step has failure handling notes.
+
+### 7.1 High-level steps
+
+1. Dequeue job (`snapshot_id`).
+2. Attempt atomic DB claim (see 6.1).
+3. Read `hero_snapshots.raw JSONB`.
+4. Validate schema & compute a lightweight fingerprint if needed.
+5. Parse using streaming parser for large arrays (troops, pets, artifacts).
+6. Per-entity transformations & upserts in recommended order:
+ - `users` / `heroes`
+ - `guilds` & `guild_members`
+ - `user_troops` (batch upserts)
+ - `user_pets` (batch upserts)
+ - `user_artifacts` (batch upserts)
+ - `user_teams` & `team_troops`
+ - `user_profile_summary` (calculate denormalized summary)
+7. Mark snapshot processed (`processed_at=now()`, `processing=false`) and write metrics.
+8. If errors occurred:
+ - Log error, write `etl_errors` row(s), update `hero_snapshots.last_error` and `error_count`.
+ - Respect retry semantics (for transient errors allow queue retry).
+9. Emit telemetry (events, metrics).
+
+### 7.2 Streaming & memory-safe parsing
+
+- Use `stream-json` or equivalent to iterate over large arrays in `raw` without converting to a full `JS object`.
+- Strategy:
+ - If `raw` is already in DB as `JSONB`, get it as string via `raw::text` and stream-parse.
+ - Alternatively, parse top-level shallow object into memory (small) and stream large fields (troops arrays).
+- When mapping arrays to upserts, batch them (e.g., 100–500 troops per batch) to avoid many small DB roundtrips.
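+
+A streaming sketch using `stream-json` with batching (the top-level `Troops` key is an assumption about the payload
+shape, and `flushBatch` is a hypothetical batch-upsert callback):
+
+```ts
+// stream-troops.ts: stream one large array out of the raw JSON text and upsert in batches (sketch)
+import { Readable } from 'stream';
+import { chain } from 'stream-chain';
+import { parser } from 'stream-json';
+import { pick } from 'stream-json/filters/Pick';
+import { streamArray } from 'stream-json/streamers/StreamArray';
+
+const BATCH_SIZE = Number( process.env.ETL_BATCH_SIZE || 250 );
+
+export async function streamTroops(
+  rawText: string,
+  flushBatch: ( batch: unknown[] ) => Promise<void> // hypothetical batch-upsert callback
+): Promise<void> {
+  // Only the 'Troops' array is materialized item-by-item; the rest of the payload never
+  // becomes a full JS object, keeping heap usage bounded.
+  const pipeline = chain( [
+    Readable.from( [ rawText ] ),
+    parser(),
+    pick( { filter: 'Troops' } ),
+    streamArray(),
+  ] );
+
+  let batch: unknown[] = [];
+  for await ( const { value } of pipeline ) {
+    batch.push( value );
+    if ( batch.length >= BATCH_SIZE ) {
+      await flushBatch( batch );
+      batch = [];
+    }
+  }
+  if ( batch.length > 0 ) {
+    await flushBatch( batch );
+  }
+}
+```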
+
+### 7.3 Per-entity transaction boundaries
+
+- Use small per-entity transactions instead of a single one across all entities.
+- Example:
+ - Begin transaction for user upsert -> commit.
+ - Begin transaction for batch of `user_troops` (multi-row upsert) -> commit.
+- Rationale: limits lock time; allows partial progress and easier reprocessing of failed entities.
+
+### 7.4 Ordering rationale
+
+- Upsert users first so other entities can reference `user_id`.
+- Upsert `guilds`/`guild_members` before guild-specific lookups.
+- Upsert `troops`/`pets`/`artifacts` in batches; summary is derived last.
+
+### 7.5 Summary generation
+
+- Build `user_profile_summary` with most-used fields needed by API (top troops, equipped pet, pvp tier, guild).
+- Upsert summary with `ON CONFLICT (user_id) DO UPDATE SET ...` and update `cached_at`.
+
+---
+
+## 8. Upsert patterns & SQL examples
+
+### 8.1 Safe multi-row upsert (user_troops)
+
+- Use multi-row insert + ON CONFLICT DO UPDATE for batches:
+
+```sql
+INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards,
+                         traits_owned, extra, last_seen, updated_at)
+VALUES ($1, $2, $3, $4, $5, $6, $7, $8::jsonb, now(), now()), ...
+ON CONFLICT (user_id, troop_id) DO UPDATE
+SET amount       = EXCLUDED.amount,
+    level        = EXCLUDED.level,
+    rarity       = EXCLUDED.rarity,
+    fusion_cards = EXCLUDED.fusion_cards,
+    traits_owned = EXCLUDED.traits_owned,
+    extra        = COALESCE(user_troops.extra, '{}'::jsonb) || EXCLUDED.extra,
+    last_seen    = EXCLUDED.last_seen,
+    updated_at   = now();
+```
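+
+A sketch of how a worker might expand this template for one batch (the column list mirrors the statement above; the
+row field types are assumptions):
+
+```ts
+// upsert-user-troops.ts: build and run the multi-row upsert for one batch (sketch)
+import { PoolClient } from 'pg';
+
+interface TroopRow {
+  user_id: string;
+  troop_id: number;
+  amount: number;
+  level: number;
+  rarity: string;
+  fusion_cards: number;
+  traits_owned: number;
+  extra: Record<string, unknown>;
+}
+
+export async function upsertUserTroops( client: PoolClient, rows: TroopRow[] ): Promise<void> {
+  if ( rows.length === 0 ) return;
+  const params: unknown[] = [];
+  // Build one ($1, $2, ...) tuple per row; 8 bound parameters per row.
+  const tuples = rows.map( ( r, i ) => {
+    const b = i * 8;
+    params.push( r.user_id, r.troop_id, r.amount, r.level, r.rarity, r.fusion_cards, r.traits_owned, JSON.stringify( r.extra ) );
+    return `($${ b + 1 }, $${ b + 2 }, $${ b + 3 }, $${ b + 4 }, $${ b + 5 }, $${ b + 6 }, $${ b + 7 }, $${ b + 8 }::jsonb, now(), now())`;
+  } );
+  await client.query(
+    `INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, last_seen, updated_at)
+     VALUES ${ tuples.join( ', ' ) }
+     ON CONFLICT (user_id, troop_id) DO UPDATE
+     SET amount       = EXCLUDED.amount,
+         level        = EXCLUDED.level,
+         rarity       = EXCLUDED.rarity,
+         fusion_cards = EXCLUDED.fusion_cards,
+         traits_owned = EXCLUDED.traits_owned,
+         extra        = COALESCE(user_troops.extra, '{}'::jsonb) || EXCLUDED.extra,
+         last_seen    = EXCLUDED.last_seen,
+         updated_at   = now()`,
+    params );
+}
+```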
+
+### 8.2 Upsert users (idempotent)
+
+```sql
+INSERT INTO users (id, namecode, username, discord_user_id, email, created_at, updated_at)
+VALUES (COALESCE($id, gen_random_uuid()), $namecode, $username, $discord_user_id, $email, now(), now())
+ON CONFLICT (namecode) DO UPDATE
+SET username        = COALESCE(EXCLUDED.username, users.username),
+    discord_user_id = COALESCE(EXCLUDED.discord_user_id, users.discord_user_id),
+    email           = CASE WHEN users.email IS NULL THEN EXCLUDED.email ELSE users.email END,
+    updated_at      = now()
+RETURNING id;
+```
+
+- Prefer returning `id` to link `user_id` where needed.
+
+### 8.3 Upsert summary
+
+```sql
+INSERT INTO user_profile_summary (user_id, namecode, username, level, top_troops, equipped_pet,
+                                  pvp_tier, guild_id, last_seen, cached_at, extra)
+VALUES ($user_id, $namecode, $username, $level, $top_troops::jsonb, $equipped_pet::jsonb,
+        $pvp_tier, $guild_id, $last_seen, now(), $extra::jsonb)
+ON CONFLICT (user_id) DO UPDATE
+SET namecode     = EXCLUDED.namecode,
+    username     = EXCLUDED.username,
+    level        = EXCLUDED.level,
+    top_troops   = EXCLUDED.top_troops,
+    equipped_pet = EXCLUDED.equipped_pet,
+    pvp_tier     = EXCLUDED.pvp_tier,
+    guild_id     = EXCLUDED.guild_id,
+    last_seen    = EXCLUDED.last_seen,
+    cached_at    = now(),
+    extra        = user_profile_summary.extra || EXCLUDED.extra;
+```
+
+### 8.4 Use partial updates for `extra` JSONB fields
+
+- Merge semantics: `existing.extra || new.extra` appends/replaces keys; use `jsonb_set` when needing deep replace.
+
+---
+
+## 9. Error handling, retry policy & poison messages
+
+### 9.1 Error classification
+
+- Transient (retryable):
+ - DB connection issues, deadlocks, temporary external API failures, network timeouts.
+ - Action: allow automatic retry (queue backoff).
+- Permanent (non-retryable):
+  - Malformed JSON, missing required fields, or FK violations caused by missing catalog rows when the policy is to fail.
+  - Action: write an `etl_errors` row with `permanent: true`, set `hero_snapshots.processing=false`, leave
+    `processed_at=NULL` (or optionally set `processed_at` to record the attempt), increment `error_count` and alert
+    maintainers.
+
+### 9.2 Recording errors
+
+- Always record structured errors in `etl_errors`:
+ - `snapshot_id`, `error_type`, `message` (sanitized), `details` (JSON with limited size), `created_at`.
+- Update `hero_snapshots`: increment `error_count`, set `last_error` to summary (short), and (optionally) add
+ `error_history` append.
+
+### 9.3 Retry policy
+
+- Default max attempts: 5 (configurable).
+- Backoff: exponential with jitter (1s, 2s, 4s, 8s… capped at 5 min).
+- For bulk backfills, allow higher `maxAttempts` but with lower concurrency.
+
+### 9.4 Poison queue handling
+
+- After `maxAttempts` reached:
+ - Mark job failed in queue
+ - Create a monitoring alert and create an admin work item (via issue/alerting channel)
+ - Provide reprocess endpoint for manual re-enqueue after fix
+
+### 9.5 Partial failures & compensating actions
+
+- If one entity fails (e.g., `user_troops` batch fails due to FK), worker should:
+ - Rollback the entity-level transaction, write an `etl_errors` row specifying which entity failed and why.
+ - Continue processing other entities (if safe and configured).
+ - Consider marking snapshot with `partial_failure = true` in `hero_snapshots` or `etl_errors` with entity details.
+
+---
+
+## 10. Monitoring & metrics (what to emit)
+
+### 10.1 Metric naming conventions
+
+- Prefix: `starforge_etl_` for worker metrics.
+- Labels: `worker_instance`, `job_type`, `env`, `snapshot_source`, `error_type` where applicable.
+
+### 10.2 Required metrics (Prometheus)
+
+- Counters:
+ - `starforge_etl_snapshots_received_total{source="", job_type=""}`
+ - `starforge_etl_snapshots_processed_total{result="success|failed|partial"}`
+ - `starforge_etl_snapshots_failed_total{error_type=""}`
+- Gauges:
+ - `starforge_etl_processing_jobs_in_progress`
+  - `starforge_etl_queue_depth{queue=""}` (preferably scraped from a `BullMQ` or `Redis` exporter)
+- Histograms / Summaries:
+ - `starforge_etl_processing_duration_seconds` (observe by `job_type`)
+ - `starforge_etl_batch_upsert_size` (distribution of batch sizes)
+- DB related:
+ - `starforge_etl_db_tx_duration_seconds`
+ - `starforge_etl_db_connection_count` (from app pool)
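+
+A sketch of registering the core metrics with `prom-client` (names from the list above; `env`/`service` can be attached
+as registry default labels):
+
+```ts
+// metrics.ts: register core ETL metrics with prom-client (sketch)
+import { Counter, Gauge, Histogram, Registry, collectDefaultMetrics } from 'prom-client';
+
+export const registry = new Registry();
+registry.setDefaultLabels( { env: process.env.NODE_ENV || 'local', service: 'etl-worker' } );
+collectDefaultMetrics( { register: registry } );
+
+export const snapshotsProcessed = new Counter( {
+  name: 'starforge_etl_snapshots_processed_total',
+  help: 'Snapshots processed, labelled by result',
+  labelNames: [ 'result' ],
+  registers: [ registry ],
+} );
+
+export const jobsInProgress = new Gauge( {
+  name: 'starforge_etl_processing_jobs_in_progress',
+  help: 'Snapshots currently being processed',
+  registers: [ registry ],
+} );
+
+export const processingDuration = new Histogram( {
+  name: 'starforge_etl_processing_duration_seconds',
+  help: 'Snapshot processing latency',
+  labelNames: [ 'job_type' ],
+  buckets: [ 0.5, 1, 2, 5, 10, 30, 60, 120 ],
+  registers: [ registry ],
+} );
+
+// Usage: const stop = processingDuration.startTimer( { job_type: 'process_snapshot' } ); ...; stop();
+```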
+
+### 10.3 Alerting rules (suggested)
+
+- **P0:** `ETL` failure spike:
+  - Condition: `rate(starforge_etl_snapshots_failed_total[5m]) / rate(starforge_etl_snapshots_processed_total[5m]) > 0.01` (i.e., > 1% failures) -> page on-call
+- **P0:** Queue depth high:
+ - `queue_depth > threshold for 10m` -> page
+- **P1:** Avg processing time regression:
+ - `p95 processing duration > 2x baseline for 10m` -> notify
+- **P1:** `Worker OOM or restarts > N` -> notify
+
+### 10.4 Events & traces
+
+- Emit `snapshot_processed` event with `snapshot_id`, `user_id`, counts of upserts and duration.
+- Trace flow: API ingest -> enqueue -> worker -> DB upserts. Propagate `X-Request-Id` or `trace_id`.
+
+---
+
+## 11. Logging, tracing & structured events
+
+### 11.1 Logging guidelines (structured JSON)
+
+- Always log JSON objects with keys:
+ - `timestamp`, `service`: "etl-worker", `env`, `level`, `message`, `snapshot_id`, `job_id`, `user_id` (if
+ available), `correlation_id`, `trace_id`, `module`, `duration_ms`, `details` (sanitized).
+- Examples:
+
+```json
+{
+ "timestamp": "2025-12-02T12:00:00Z",
+ "service": "etl-worker",
+ "env": "production",
+ "level": "info",
+ "message": "claimed snapshot",
+ "snapshot_id": "uuid",
+ "job_id": "bull-job-uuid",
+ "worker_instance": "worker-1"
+}
+```
+
+### 11.2 Sensitive data redaction
+
+- NEVER log raw tokens, passwords or full player credentials.
+- Mask / redact fields matching patterns (token, password, session) prior to logging.
+- For debugging, store pointers (`snapshot_id`, `s3_path`) to full raw payload rather than including it in logs.
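+
+A minimal scrubbing sketch that walks a parsed object and masks keys matching the patterns above:
+
+```ts
+// scrub.ts: recursively mask sensitive keys before logging (sketch)
+const SENSITIVE_KEY_PATTERN = /token|password|session|secret/i;
+
+export function scrub( value: unknown ): unknown {
+  if ( Array.isArray( value ) ) {
+    return value.map( scrub );
+  }
+  if ( value !== null && typeof value === 'object' ) {
+    const out: Record<string, unknown> = {};
+    for ( const [ key, v ] of Object.entries( value as Record<string, unknown> ) ) {
+      out[ key ] = SENSITIVE_KEY_PATTERN.test( key ) ? '[REDACTED]' : scrub( v );
+    }
+    return out;
+  }
+  return value;
+}
+```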
+
+### 11.3 Tracing
+
+- Use OpenTelemetry (recommended) to propagate trace ids across API -> queue -> worker -> DB.
+- Create spans for:
+ - `claim_snapshot` (DB update)
+ - `parse_snapshot` (streaming)
+  - `upsert_<entity>` (e.g., `upsert_user_troops`)
+ - `commit_snapshot` (final update)
+
+### 11.4 Error telemetry
+
+- Send errors to Sentry with minimal context: `snapshot_id`, `job_id`, `short_message`, `error_type`, `sanitized_stack`.
+- Avoid including raw snapshot or `PII` in error events.
+
+---
+
+## 12. Backfill, bulk processing & rate-control
+
+### 12.1 Backfill principles
+
+- Use the same worker code as real-time processing to avoid mapping differences.
+- Run backfills on an isolated worker pool with controlled concurrency and DB throttle.
+- Recommended initial concurrency: start small (e.g., a few workers with concurrency 2 each) and increase gradually
+  while monitoring DB metrics.
+
+### 12.2 Batch sizing & parallelism
+
+- For large arrays (troops), process in batches of 100–500 rows (tunable).
+- Use upsert multi-row inserts to reduce round-trips.
+
+### 12.3 Rate limiting and safe windows
+
+- Backfills should honor a DB load budget:
+  - a maximum write `IOPS` budget or a maximum DB CPU utilization target.
+- Implement a backfill controller that:
+ - queries DB metrics
+ - adjusts worker concurrency or sleeps between batches accordingly
+
+### 12.4 Resume & checkpointing
+
+- Record per-snapshot backfill progress in a `backfill_jobs` or `queue_jobs` table with:
+ - `job_id`, `start_ts`, `last_processed_snapshot_id`, `processed_count`, `error_count`
+- On interruption, resume from last recorded snapshot id or job cursor.
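+
+A checkpoint sketch against such a table (any `backfill_jobs` columns beyond those listed above are assumptions):
+
+```ts
+// backfill-checkpoint.ts: record and resume backfill progress (sketch)
+import { Pool } from 'pg';
+
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+
+export async function saveCheckpoint( jobId: string, lastSnapshotId: string, processed: number ): Promise<void> {
+  await pool.query(
+    `UPDATE backfill_jobs
+     SET last_processed_snapshot_id = $2,
+         processed_count = $3
+     WHERE job_id = $1`,
+    [ jobId, lastSnapshotId, processed ] );
+}
+
+export async function loadCheckpoint( jobId: string ): Promise<string | null> {
+  const res = await pool.query(
+    `SELECT last_processed_snapshot_id FROM backfill_jobs WHERE job_id = $1`,
+    [ jobId ] );
+  return res.rows[0]?.last_processed_snapshot_id ?? null;
+}
+```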
+
+---
+
+## 13. Admin APIs & operational endpoints
+
+### 13.1 Reprocess snapshot
+
+- Endpoint: `POST /api/v1/admin/snapshots/:id/reprocess`
+- Requirements: `admin auth` & `RBAC`
+- Behavior:
+ - Validate snapshot exists and not currently processing.
+ - Reset `hero_snapshots.error_count = 0; last_error = NULL; processing = false`
+ - Enqueue job (high priority if needed)
+ - Return `202` Accepted with `job_id` and `ETA` estimate.
+- Audit: create an `audit_logs` entry with requester id & reason.
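+
+A hedged `Express` sketch of this endpoint (`requireAdmin` stands in for the project's RBAC middleware; audit logging
+and the `ETA` estimate are omitted for brevity):
+
+```ts
+// admin-reprocess.ts: re-enqueue a snapshot for processing (Express sketch)
+import express from 'express';
+import { Pool } from 'pg';
+import { Queue } from 'bullmq';
+
+const app = express();
+app.use( express.json() );
+
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+const etlQueue = new Queue( 'etl:snapshots', {
+  connection: { host: process.env.REDIS_HOST, port: Number( process.env.REDIS_PORT || 6379 ) },
+} );
+
+// Placeholder for the project's real RBAC middleware.
+const requireAdmin = ( _req: express.Request, _res: express.Response, next: express.NextFunction ) => next();
+
+app.post( '/api/v1/admin/snapshots/:id/reprocess', requireAdmin, async ( req, res ) => {
+  const { id } = req.params;
+  // Reset error state only when the snapshot exists and is not currently processing.
+  const reset = await pool.query(
+    `UPDATE hero_snapshots
+     SET error_count = 0, last_error = NULL, processing = false
+     WHERE id = $1 AND (processing = false OR processing IS NULL)
+     RETURNING id`,
+    [ id ] );
+  if ( reset.rowCount === 0 ) {
+    return res.status( 409 ).json( { error: 'snapshot missing or currently processing' } );
+  }
+  const job = await etlQueue.add( 'process_snapshot', { snapshot_id: id }, { priority: 1 } );
+  return res.status( 202 ).json( { job_id: job.id } );
+} );
+```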
+
+### 13.2 Snapshot status
+
+- Endpoint: `GET /api/v1/admin/snapshots/:id/status`
+- Returns: processing `flag`, `processed_at`, `error_count`, `last_error`
+
+### 13.3 Health & metrics endpoints
+
+- `/health` (readiness & liveness)
+- `/metrics` (Prometheus) protected by IP or scrape token in production
+
+### 13.4 Job management
+
+- Admin UI to list failed jobs and re-enqueue or inspect `etl_errors`.
+
+---
+
+## 14. Testing strategy (unit / integration / e2e / perf)
+
+### 14.1 Unit tests
+
+- Functions: parse helpers, mapping logic (e.g., mapTroopToRow), hash calculation, idempotency helpers.
+- Mock DB & queue; keep fast tests.
+
+### 14.2 Integration tests
+
+- Use testcontainers or ephemeral `Postgres` + `Redis`.
+- Tests:
+ - migration up -> insert example snapshot -> run worker -> assert normalized tables
+ - idempotency: process same snapshot twice -> assert no duplicate rows
+ - malformed JSON snapshot -> worker writes `etl_errors` and does not crash
+
+### 14.3 End-to-end tests
+
+- Staging-like environment with real services.
+- Scenario: API ingest -> verify snapshot inserted -> wait for worker processing -> query `user_profile_summary` via
+ API -> assert fields.
+
+### 14.4 Performance tests
+
+- Tools: `k6`, `Artillery`
+- Scenarios:
+ - Burst ingestion of N snapshots per minute
+ - Worker processing of large snapshots (2–3MB) with constrained memory to validate streaming parser
+- Goals:
+ - measure p95 processing times, memory usage, DB connection utilization
+ - tune batch size & concurrency
+
+### 14.5 Test fixtures
+
+- Keep synthetic example payloads in `docs/examples/`:
+ - small, medium, large (`~2–3MB`), malformed sample
+- Use these fixtures in `CI` integration tests.
+
+### 14.6 CI gating
+
+- PRs must pass unit + integration tests.
+- Migration PRs must run a migration preflight job (apply migrations to ephemeral DB and run a smoke `ETL`).
+
+---
+
+## 15. Security & privacy controls
+
+### 15.1 Data minimization
+
+- Do not persist user passwords. If upstream includes tokens, `ETL` must redact them before storing in normalized tables
+ or logs unless storing is absolutely required and approved.
+
+### 15.2 Logging policy
+
+- Redact known sensitive keys in snapshots and logs (e.g., token, password, session).
+- Implement automatic scrubbing function that walks `JSON` and masks keys matching configured patterns.
+
+### 15.3 Access control
+
+- Admin endpoints require `RBAC` and auth tokens with least privilege.
+- Raw snapshot access is restricted to operators and SREs — prefer return of pointers (`snapshot_id`) rather than full
+ raw JSON.
+
+### 15.4 Secrets & rotation
+
+- Use secrets manager or `GitHub Secrets` for connections (`DATABASE_URL`, `REDIS_URL`).
+- Rotate worker credentials periodically and automate via provider `OIDC` where possible.
+
+### 15.5 Audit
+
+- Write `audit_logs` for admin actions (reprocess, manual job enqueue, retention runs).
+
+---
+
+## 16. Common incidents & runbook actions
+
+### 16.1 Worker OOM (out of memory)
+
+- Symptoms: worker process restarted, `OOM` logs, repeated restarts.
+- Immediate actions:
+ 1. Scale down worker concurrency or stop worker pool.
+ 2. Check last logs for snapshot size that caused `OOM` (`snapshot_id`).
+ 3. If large snapshot: move snapshot to quarantine (mark `hero_snapshots.processing=false`, add `etl_errors`),
+ re-enqueue under dedicated high-memory worker or reprocess after parser fixes.
+ 4. Increase instance memory temporarily and redeploy.
+- Preventive:
+ - Use streaming parser and batch sizes, set per-worker memory limits and monitor.
+
+### 16.2 DB connection exhaustion
+
+- Symptoms: DB refuses connections, new connections fail.
+- Immediate actions:
+ 1. Pause / scale down worker concurrency.
+ 2. Check pg pool stats (`pg_stat_activity`) and `pgbouncer` (if used).
+ 3. Reduce per-worker pool size and restart workers.
+- Preventive:
+ - Use `pgbouncer`, cap total workers, keep per-worker pool small.
+
+### 16.3 Queue storm / sudden backlog
+
+- Symptoms: queue depth spikes, processing lag increases.
+- Immediate actions:
+ 1. Evaluate if sudden ingestion spike expected (event).
+ 2. Scale worker pool carefully (respect DB connections).
+ 3. If risk of overload, throttle ingestion at `API` (return `202` with `ETA`).
+- Preventive:
+ - Rate limit ingestion clients and use backpressure strategy.
+
+### 16.4 Frequent ETL parse errors (new upstream schema)
+
+- Symptoms: many snapshots failing with parse errors after upstream change.
+- Immediate actions:
+ 1. Pause auto-retries.
+ 2. Sample failing raw payloads and store examples for developer analysis.
+ 3. Create a temporary `ETL` mapping patch that ignores unknown fields and records extras.
+ 4. Plan `migration/ETL` update and backfill as necessary.
+
+### 16.5 Retention/archival job failed (S3 unavailable)
+
+- Immediate actions:
+ 1. Retry with exponential backoff.
+ 2. If persistent, pause archival deletion and alert `SRE`.
+ 3. Ensure DB not purging snapshots until archive confirmed.
+
+---
+
+## 17. Configuration / example env
+
+Example env vars (worker)
+
+```
+DATABASE_URL=postgres://user:pass@host:5432/starforge?sslmode=require
+REDIS_URL=redis://:password@redis-host:6379
+NODE_ENV=production
+WORKER_CONCURRENCY=4
+PG_POOL_MAX=4
+ETL_BATCH_SIZE=250
+ETL_MAX_RETRIES=5
+ETL_BACKOFF_BASE_MS=1000
+ETL_BACKOFF_MAX_MS=300000
+CLAIM_TTL_MINUTES=30
+METRICS_PORT=9091
+SENTRY_DSN=...
+```
+
+---
+
+## 18. Appendix
+
+### 18.1 Example job payloads
+
+- Single snapshot job (minimal):
+
+```json
+{
+ "snapshot_id": "a2f1c1b2-...-e4f9",
+ "correlation_id": "req-1234",
+ "enqueue_ts": "2025-12-02T12:00:00Z"
+}
+```
+
+- Backfill job:
+
+```json
+{
+ "job_type": "backfill_range",
+ "from_created_at": "2025-01-01T00:00:00Z",
+ "to_created_at": "2025-06-01T00:00:00Z",
+ "batch_size": 500,
+ "owner": "data-team"
+}
+```
+
+### 18.2 Example TypeScript worker skeleton (`BullMQ` + `pg` + `stream-json`)
+
+> NOTE: this is a skeleton to illustrate patterns, not production-ready code. Add proper error handling, metrics, config
+> management before use.
+
+```ts
+// sketch-worker.ts
+import { Queue, Worker, Job } from 'bullmq';
+import { Pool } from 'pg';
+import { parser } from 'stream-json';
+import { streamArray } from 'stream-json/streamers/StreamArray';
+import { Readable } from 'stream';
+import Pino from 'pino';
+
+const logger = Pino();
+
+const redisConnection = { host: process.env.REDIS_HOST, port: Number( process.env.REDIS_PORT ) };
+const queue = new Queue( 'etl:snapshots', { connection: redisConnection } );
+
+const pgPool = new Pool( { connectionString: process.env.DATABASE_URL, max: Number( process.env.PG_POOL_MAX || 4 ) } );
+
+const worker = new Worker( 'etl:snapshots', async ( job: Job ) => {
+ const { snapshot_id } = job.data;
+ const client = await pgPool.connect();
+ try {
+ // 1. atomic claim
+ const res = await client.query(
+ `UPDATE hero_snapshots SET processing = true, processing_started_at = now()
+ WHERE id = $1 AND (processing = false OR processing IS NULL)
+ RETURNING id, raw::text as raw_text`, [ snapshot_id ] );
+ if ( res.rowCount === 0 ) {
+ logger.info( { snapshot_id }, 'snapshot already claimed, skipping' );
+ return;
+ }
+ const rawText = res.rows[0].raw_text;
+    // 2. streaming parse: for large arrays (e.g. troops), pipe rawText through stream-json
+    //    (parser + streamArray) and handle items incrementally instead of materializing everything.
+    // const troopStream = Readable.from( [ rawText ] ).pipe( parser() ).pipe( streamArray() );
+    const payload = JSON.parse( rawText ); // full-parse fallback; only safe for small snapshots
+ // 3. process entities (use helper functions to batch upserts)
+ await processUserAndUpserts( client, payload, snapshot_id );
+ // 4. mark processed
+ await client.query(
+ `UPDATE hero_snapshots SET processed_at = now(), processing = false, last_error = NULL WHERE id = $1`,
+ [ snapshot_id ] );
+ logger.info( { snapshot_id }, 'processed snapshot' );
+  } catch ( err: any ) {
+ logger.error( { err, snapshot_id }, 'processing failed' );
+ // write etl_errors and update hero_snapshots
+ await client.query(
+ `INSERT INTO etl_errors (id, snapshot_id, error_type, message, details, created_at)
+ VALUES (gen_random_uuid(), $1, $2, $3, $4::jsonb, now())`,
+ [ snapshot_id, 'PROCESSING_ERROR', String( ( err && err.message ) || 'unknown' ), JSON.stringify( { stack: err && err.stack } ) ] );
+ await client.query( `UPDATE hero_snapshots SET error_count = COALESCE(error_count,0)+1, last_error = to_jsonb($2::text) WHERE id = $1`,
+ [ snapshot_id, `Processing error: ${ err && err.message }` ] );
+ throw err; // let BullMQ retry according to its policy
+ } finally {
+ client.release();
+ }
+}, { connection: redisConnection, concurrency: Number( process.env.WORKER_CONCURRENCY || 4 ) } );
+
+async function processUserAndUpserts( client: any, payload: any, snapshot_id: string ) {
+ // Implement mapping and batch upserts, following SQL patterns in this doc.
+}
+```
+
+### 18.3 Helpful SQL queries
+
+- Find latest processed snapshot for a user:
+
+```sql
+SELECT *
+FROM hero_snapshots
+WHERE user_id = $1
+ AND processed_at IS NOT NULL
+ORDER BY created_at DESC LIMIT 1;
+```
+
+- List failed snapshots (recent):
+
+```sql
+SELECT id, created_at, error_count, last_error
+FROM hero_snapshots
+WHERE error_count > 0
+ORDER BY created_at DESC LIMIT 100;
+```
+
+- Count queue depth (`BullMQ` stores in `Redis`, but if DB-backed queue):
+
+```sql
+SELECT count(*)
+FROM queue_jobs
+WHERE status = 'pending';
+```
+
+---
+
+References
+
+- [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema)
+- [docs/MIGRATIONS.md](./MIGRATIONS.md) (migration conventions)
+- [docs/OBSERVABILITY.md](./OBSERVABILITY.md) (metrics & alerts)
+- BullMQ docs: https://docs.bullmq.io/
+- stream-json: https://github.com/uhop/stream-json
+- OpenTelemetry for Node: https://opentelemetry.io/
+
+---
diff --git a/docs/MIGRATIONS.md b/docs/MIGRATIONS.md
new file mode 100644
index 0000000..e561b1c
--- /dev/null
+++ b/docs/MIGRATIONS.md
@@ -0,0 +1,549 @@
+# MIGRATIONS.md — Database Migrations Guide
+
+> Version: 0.1
+> Date: 2025-12-03
+
+---
+
+## Purpose
+
+- Provide a complete, practical guide for creating, testing, reviewing and applying database schema migrations for
+ StarForge.
+- Aligns with [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema) and [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) (
+ `ETL` contracts).
+- Target audience: `Backend engineers`, `SRE/DevOps`, `reviewers` and `release approvers`.
+
+---
+
+## Scope
+
+- JavaScript migrations using `node-pg-migrate` (recommended).
+- Preflight checks, CI integration, local developer workflows, production bootstrap workflow, rollback guidance,
+ indexes, extensions, and operational runbooks.
+- Includes sample migration templates and scripts you can adapt.
+
+---
+
+## Table of contents
+
+1. [Principles & Goals](#1-principles--goals)
+2. [Tooling & repo layout](#2-tooling--repo-layout)
+3. [Naming & file conventions](#3-naming--file-conventions)
+4. [Migration anatomy (up / down / idempotence)](#4-migration-anatomy-up--down--idempotence)
+5. [Safe migration patterns (additive → backfill → enforce)](#5-safe-migration-patterns)
+6. [Extensions & provider considerations (pgcrypto etc.)](#6-extensions--provider-considerations)
+7. [Index creation & long-running operations (CONCURRENTLY)](#7-index-creation--long-running-operations)
+8. [Partitioning strategy & archival-aware changes](#8-partitioning-strategy--archival-aware-changes)
+9. [Preflight checks & CI jobs](#9-preflight-checks--ci-jobs)
+10. [Local developer workflow (create / run / revert)](#10-local-developer-workflow)
+11. [Production bootstrap workflow (manual, protected)](#11-production-bootstrap-workflow-manual-protected)
+12. [Rollback & emergency restore strategy](#12-rollback--emergency-restore-strategy)
+13. [Migration PR checklist & review criteria](#13-migration-pr-checklist--review-criteria)
+14. [Seeding & catalog management](#14-seeding--catalog-management)
+15. [Testing migrations (unit/integration/e2e)](#15-testing-migrations-unit--integration--e2e)
+16. [Backfill & data-migration jobs coordination](#16-backfill--data-migration-jobs-coordination)
+17. [Auditing & migration traceability](#17-auditing--migration-traceability)
+18. [Examples: node-pg-migrate templates & preflight script](#18-examples-node-pg-migrate-templates--preflight-script)
+19. [Runbooks & operational notes](#19-runbooks--operational-notes)
+20. [References](#20-references)
+
+---
+
+## 1. Principles & Goals
+
+- Safety first: migrations must not cause unexpected downtime or data loss.
+- Small, focused changes: prefer many small migrations over large monolithic ones.
+- Reversible where practical: provide `down` when safe; if not reversible, document strong rollback plan.
+- Observable: migrations must produce logs and artifacts to audit what happened and when.
+- Testable: migrations run in CI against ephemeral DBs; run preflight on staging before production.
+- Manual approval for production: production schema changes are applied through a protected, manual workflow.
+
+---
+
+## 2. Tooling & repo layout
+
+### Recommended tooling
+
+- `node-pg-migrate` — JS migrations that integrate well with `Node` toolchain.
+- `pg` (`node-postgres`) for direct DB access in scripts / preflight.
+- `psql` for quick manual checks.
+- testcontainers (or `docker-compose`) for integration tests in CI.
+
+### Suggested repo layout
+
+```
+database/
+ migrations/ # node-pg-migrate migration files
+ seeds/ # idempotent seed files for catalogs
+scripts/
+ migrate-preflight.sh
+ bootstrap-db.sh
+docs/
+ MIGRATIONS.md
+ DB_MODEL.md
+ ETL_AND_WORKER.md
+```
+
+### GitHub Actions
+
+- CI job for migration preflight and tests.
+- Manual `db-bootstrap.yml` workflow for production migrations (requires environment approval).
+
+---
+
+## 3. Naming & file conventions
+
+### Filename pattern (required)
+
+```
+YYYYMMDDHHMMSS_description.[js|sql]
+```
+
+- Use `UTC` timestamp prefix for ordering.
+- Example: `20251203T153000_add_hero_snapshots_table.js` (or numeric timestamp 20251203153000).
+
+### Migration metadata
+
+- Each migration should include header comments describing:
+ - Purpose
+ - Expected impact (approx row counts, index size)
+ - Estimated index build time if large
+
+### Commit messages
+
+- Link migration file to PR and include `MIGRATION: ` in PR and commit message.
+
+---
+
+## 4. Migration anatomy (up / down / idempotence)
+
+Use `node-pg-migrate` export format with `exports.up = (pgm) => { ... }` and `exports.down = (pgm) => { ... }`.
+
+### Guidelines
+
+- Up should apply the change; down should revert when safe.
+- Avoid irreversible actions in a single migration. If irreversible, document backup id and rollback plan in migration
+ header.
+- Migrations should be idempotent where possible and robust to partial re-runs in test environments.
+- Keep migration files small (`~1–200 lines`) and focused.
+
+### Example migration (skeleton)
+
+```js
+/* 20251203_create_hero_snapshots.js
+ Purpose: create hero_snapshots table for raw ingestion
+ Estimated rows: small initially
+ */
+exports.shorthands = undefined;
+
+exports.up = ( pgm ) => {
+ pgm.createTable( 'hero_snapshots', {
+ id: { type: 'uuid', primaryKey: true, default: pgm.func( 'gen_random_uuid()' ) },
+ user_id: { type: 'uuid', references: 'users(id)', onDelete: 'SET NULL' },
+ source: { type: 'varchar(64)', notNull: true },
+ raw: { type: 'jsonb', notNull: true },
+ size_bytes: { type: 'integer', notNull: true },
+ content_hash: { type: 'varchar(128)', notNull: true },
+ processing: { type: 'boolean', default: false },
+ processed_at: { type: 'timestamptz' },
+ created_at: { type: 'timestamptz', default: pgm.func( 'now()' ) }
+ } );
+  pgm.sql( 'CREATE INDEX IF NOT EXISTS idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops)' );
+};
+
+exports.down = ( pgm ) => {
+  pgm.sql( 'DROP INDEX IF EXISTS idx_hero_snapshots_raw_gin' );
+ pgm.dropTable( 'hero_snapshots' );
+};
+```
+
+---
+
+## 5. Safe migration patterns
+
+### Three-step safe approach for potentially disruptive changes (recommended)
+
+1. Additive step: Add new nullable column / indexes / tables. No data movement, no exclusive locks.
+ - Example: `ALTER TABLE user_profile_summary ADD COLUMN summary_v2 JSONB NULL;`
+2. Backfill & verification: Deploy code that writes both old and new columns, backfill data asynchronously via
+   `ETL`/backfill jobs, and validate parity in staging.
+3. Enforce & cleanup: Make the column NOT NULL and drop the old column in a later migration after backfill verification.
+
+### Index changes
+
+- Use `CREATE INDEX CONCURRENTLY` for large indexes to avoid table locks. When using `node-pg-migrate`, run raw SQL for
+ CONCURRENTLY:
+
+```js
+pgm.sql( 'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops)' );
+```
+
+- Note: `CONCURRENTLY` cannot run inside a transaction — adjust migration runner accordingly (`node-pg-migrate` supports
+ it with `pgm.sql` but you must not rely on transactional behavior).
+
+### Avoid long exclusive operations
+
+- Avoid `ALTER TABLE ... TYPE` or `ADD COLUMN ... DEFAULT ...` operations that rewrite the table (on Postgres 11+ a
+  constant default no longer forces a rewrite, but volatile defaults still do). Prefer adding a nullable column,
+  backfilling, then setting the default + NOT NULL.
+
+---
+
+## 6. Extensions & provider considerations
+
+### Preferred extensions
+
+- `pgcrypto` — preferred for `gen_random_uuid()`.
+- `uuid-ossp` — fallback for some providers.
+
+### CREATE EXTENSION strategy
+
+- Put extension creation into its own migration at the top:
+ - `20251203_create_pgcrypto_extension.js`
+- Migration content:
+
+```js
+exports.up = ( pgm ) => {
+ pgm.sql( "CREATE EXTENSION IF NOT EXISTS pgcrypto;" );
+};
+exports.down = ( pgm ) => {
+ // do not drop extension on down in production-safe path; leave no-op or document
+};
+```
+
+### Provider limitations
+
+- Managed providers (`Supabase`, `RDS`) may refuse `CREATE EXTENSION`. Preflight must check extension availability;
+ fallback:
+ - Generate `UUIDs` application-side or use `gen_random_uuid()` only if available.
+- Document required provider privileges in migration header and in [docs/MIGRATIONS.md](./MIGRATIONS.md).
+
+---
+
+## 7. Index creation & long-running operations
+
+- Use `CREATE INDEX CONCURRENTLY` for large tables.
+- For `node-pg-migrate`, run concurrent index creation via `pgm.sql(...)` outside a transaction. Note: the runner wraps
+  migrations in a transaction by default; call `pgm.noTransaction()` (or the equivalent runner config) in that
+  migration so `CONCURRENTLY` can run.
+- For Postgres large index estimate:
+ - Estimate rows: `SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname='hero_snapshots';`
+ - Estimate size: `SELECT pg_relation_size('hero_snapshots')` or use `pgstattuple` if available.
+
+### Index creation checklist
+
+- Schedule during off-peak window.
+- Kick off `CREATE INDEX CONCURRENTLY`, monitor progress.
+- If using CONCURRENTLY, be aware it still consumes resources (`IO/CPU`).
+
+---
+
+## 8. Partitioning strategy & archival-aware changes
+
+### When to partition
+
+- High-volume append-only tables (`hero_snapshots`) over millions of rows.
+- Partition by time (monthly) or by hash(`user_id`) if cardinality increases.
+
+### Partition migration pattern
+
+1. Create parent table with `PARTITION BY RANGE (created_at)` or hash.
+2. Create partitions (monthly).
+3. Add trigger or default partitioning strategy for new inserts.
+4. Backfill older data into partitions in batches.
+5. Adjust queries and indexes accordingly.
+
+### Archival considerations
+
+- If archival: create `snapshot_archives` table or mark partitions as archived before deletion.
+- Migration to add `archived` metadata should be additive and reversible.
+
+---
+
+## 9. Preflight checks & CI jobs
+
+**Purpose:** detect permission gaps, estimate costs and validate that migrations apply cleanly.
+
+### Preflight script responsibilities (`scripts/migrate-preflight.sh`)
+
+- Verify connectivity (`psql` connection).
+- Check extension permissions:
+ - `SELECT installed_version FROM pg_available_extensions WHERE name='pgcrypto';`
+- Apply migrations to ephemeral test DB
+ - Run `node-pg-migrate` against ephemeral DB and verify `schema_migrations` updated.
+- Run smoke `ETL`: insert a small sample snapshot and run worker logic or `ETL` smoke script.
+- Validate indexing and expected tables existence.
+
+### CI integration
+
+- Add CI job that:
+ - Installs deps (`pnpm install --frozen-lockfile`)
+ - Runs lint, unit tests
+ - Runs `migrate:preflight` (applies migrations to a fresh ephemeral DB) and a smoke test
+- Block merge if preflight fails.
+
+### Example preflight checks (shell snippet)
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Validate DB connectivity
+psql "$DATABASE_URL" -c "SELECT 1"
+
+# Check pgcrypto availability and privileges
+psql "$DATABASE_URL" -t -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'pgcrypto';" | grep pgcrypto || echo "pgcrypto not installed"
+
+# Run migrations against ephemeral DB (test)
+DATABASE_URL="$TEST_DATABASE_URL" pnpm run migrate:up
+# Run smoke queries
+psql "$TEST_DATABASE_URL" -c "SELECT count(*) FROM hero_snapshots;"
+```
+
+---
+
+## 10. Local developer workflow
+
+### Create new migration
+
+- Use standard filename pattern and put file under `database/migrations/`.
+- Populate `exports.up` and `exports.down`.
+
+### Run migrations locally
+
+- Install deps: `pnpm install`
+- Configure `.env.local` or `.env` with `DATABASE_URL` (local Postgres).
+- Run:
+
+```bash
+pnpm run migrate:up -- --config database/migration-config.js
+```
+
+(or use your npm scripts wrapper)
+
+### Rollback last migration locally
+
+- `pnpm run migrate:down -- --count 1` (use carefully)
+
+### Testing locally
+
+- Run `./scripts/bootstrap-db.sh` to run migrations and seeds in a fresh local DB (script should be idempotent)
+- Run `ETL` smoke: ingest sample file (docs/examples) and start worker to process snapshots.
+
+---
+
+## 11. Production bootstrap workflow (manual, protected)
+
+### Principles
+
+- Migrations for production MUST be applied via a manual, protected `CI` workflow (`db-bootstrap.yml`) requiring
+ `GitHub Environment` approval.
+- Always take a DB snapshot/backup *immediately before* running production migrations.
+
+### Example GitHub Actions workflow (outline)
+
+```yaml
+name: db-bootstrap
+on:
+  workflow_dispatch:
+    inputs:
+      apply:
+        description: 'Set to "true" to apply migrations after the preflight check'
+        required: true
+        default: 'false'
+jobs:
+ bootstrap:
+ runs-on: ubuntu-latest
+ environment: production # requires approval
+ steps:
+ - uses: actions/checkout@v3
+ - name: Setup Node
+ uses: actions/setup-node@v3
+ - name: Install deps
+ run: pnpm install --frozen-lockfile
+ - name: Preflight check
+ run: ./scripts/migrate-preflight.sh
+ env:
+ DATABASE_URL: ${{ secrets.STAGING_DATABASE_URL }}
+ - name: Apply migrations (PROD)
+ if: ${{ github.event.inputs.apply == 'true' }}
+ run: pnpm run migrate:up -- --config database/migration-config.js --env production
+ env:
+ DATABASE_URL: ${{ secrets.DATABASE_URL }}
+```
+
+### Pre-bootstrap checklist (must be done before approval)
+
+- Recent backup id recorded and verified (snapshot successful and tested if possible).
+- Migration preflight on staging passed and smoke `ETL` validated.
+- `SRE` on-call available and aware of the maintenance window.
+- PR includes migration file(s), changelog and rollback plan.
+- Expected index creation & time estimates documented in PR.
+
+### Post-bootstrap verification
+
+- Run sanity queries to confirm expected tables & indexes exist.
+- Run small sample ingest + end-to-end test.
+- Monitor metrics & alerts for `30–120 minutes`.
+
+---
+
+## 12. Rollback & emergency restore strategy
+
+### When to rollback
+
+- Severe data corruption introduced by a migration.
+- Production outage caused by migration (locks, incompatibility).
+- If rollback migration is not feasible, prefer restore-from-backup.
+
+### Rollback options
+
+- Reversible migrations: run `pnpm run migrate:down` for the specific migration if down exists and verified.
+- Restore DB from backup:
+ 1. Stop ingestion & workers to avoid further writes.
+ 2. Restore database from snapshot taken before migration.
+ 3. Re-apply safe migrations or fix code to align with restored schema.
+ 4. Re-run necessary `ETL`/backfill to bring normalized tables up-to-date.
+- Always communicate with stakeholders and log actions in `audit_logs`.
+
+### Emergency steps (short)
+
+1. Pause workers and `API` writes (set read-only mode or flip feature flag).
+2. Assess whether `down` migration is safe; if yes run down; if not, prepare restore.
+3. If restoring: follow provider procedures for `PITR` or snapshot restore.
+4. After restore, run smoke tests, then re-enable ingestion carefully.
+
+---
+
+## 13. Migration PR checklist & review criteria
+
+### Every migration PR must include:
+
+- Migration file(s) in `database/migrations/` with timestamped filenames.
+- Description of change, rationale and expected impact (row counts, index costs).
+- Preflight results (`CI` artifacts) from staging run.
+- Backup id that will be used prior to production run (documented).
+- Runbook / rollback steps documented or link to runbook.
+- Tests: integration tests that exercise the migration (`CI` passing).
+- Approvers: at least two reviewers, one must be `SRE/DevOps` for production-impacting migrations.
+
+### Reviewers should verify:
+
+- Migration is small and focused.
+- No destructive changes without phased approach.
+- Indexes using CONCURRENTLY are documented and scheduled.
+- Required extensions are present or fallback documented.
+- A clear rollback plan exists.
+
+---
+
+## 14. Seeding & catalog management
+
+### Seeding
+
+- Keep idempotent seed scripts under `database/seeds/`.
+- Seeds should use `INSERT ... ON CONFLICT DO NOTHING` or `ON CONFLICT DO UPDATE` as appropriate.
+
+### Catalog synchronization
+
+- Catalogs (`troop_catalog`, `pet_catalog`, `spells`) should be seeded and versioned.
+- When catalog changes require schema changes, coordinate migration and catalog updates in same PR where possible, with
+ backfill plan.
+
+### Backfill seeds
+
+- If migration adds new columns requiring backfill, include backfill script as a separate job (not in migration up) and
+ track in `backfill_jobs` table.
+
+---
+
+## 15. Testing migrations (unit / integration / e2e)
+
+- Unit: test migration helper functions if any.
+- Integration: apply migrations to ephemeral DB in `CI`; run smoke queries and run `ETL` against sample payloads.
+- E2E: in staging, apply migrations, run backfills and full `ETL` to validate end-to-end system behavior.
+- Include migration tests in PR gating.
+
+---
+
+## 16. Backfill & data-migration jobs coordination
+
+- Backfill tasks are often long-running and should be performed by controlled background jobs, not by migration code
+ itself.
+- Backfill job steps:
+ 1. Create new column/table (migration)
+ 2. Launch backfill job (worker/backfill service) that populates new column in batches.
+ 3. Validate backfill results with sampling queries.
+ 4. Release final enforcement migration that sets NOT NULL or drops old columns.
+- Track progress in `backfill_jobs` table with checkpoints for resume.
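+
+A sketch of step 2 as a batched job (the `summary_v2` column and its source expression follow the additive-column
+example in the safe-migration section and are purely illustrative):
+
+```ts
+// backfill-summary-v2.ts: populate a new column in batches, outside the migration itself (sketch)
+import { Pool } from 'pg';
+
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+const BATCH_SIZE = 1000;
+const PAUSE_MS = 200; // throttle between batches to respect the DB load budget
+
+export async function backfillSummaryV2(): Promise<void> {
+  for ( ; ; ) {
+    const res = await pool.query(
+      `UPDATE user_profile_summary
+       SET summary_v2 = extra
+       WHERE user_id IN (
+         SELECT user_id FROM user_profile_summary
+         WHERE summary_v2 IS NULL
+         LIMIT $1
+       )
+       RETURNING user_id`,
+      [ BATCH_SIZE ] );
+    if ( !res.rowCount ) break;
+    await new Promise( ( resolve ) => setTimeout( resolve, PAUSE_MS ) );
+  }
+}
+```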
+
+---
+
+## 17. Auditing & migration traceability
+
+- Record migration runs with:
+ - migration id, timestamp, runner (`CI`/user), environment, backup id, logs.
+- Store artifacts from production run (output logs) in `CI` artifacts storage for audit.
+- Keep `schema_migrations` table consistent and never manually alter it except in documented emergency procedures.
+
+---
+
+## 18. Examples: node-pg-migrate templates & preflight script
+
+### 18.1 Migration JS template (`node-pg-migrate`)
+
+```js
+// database/migrations/20251203T153000_add_example_table.js
+exports.up = ( pgm ) => {
+ pgm.createTable( 'example_table', {
+ id: { type: 'uuid', primaryKey: true, default: pgm.func( 'gen_random_uuid()' ) },
+ name: { type: 'varchar(255)', notNull: true },
+ data: { type: 'jsonb', default: pgm.func( "('{}'::jsonb)" ) },
+ created_at: { type: 'timestamptz', default: pgm.func( 'now()' ) }
+ } );
+ pgm.createIndex( 'example_table', 'name' );
+};
+
+exports.down = ( pgm ) => {
+ pgm.dropIndex( 'example_table', 'name' );
+ pgm.dropTable( 'example_table' );
+};
+```
+
+### 18.2 Preflight script (`scripts/migrate-preflight.sh`)
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+: "${DATABASE_URL:?Need DATABASE_URL}"
+echo "Testing DB connectivity..."
+psql "$DATABASE_URL" -c "SELECT now();"
+
+echo "Checking pgcrypto extension..."
+psql "$DATABASE_URL" -t -c "SELECT extname FROM pg_extension WHERE extname = 'pgcrypto';" | grep -q pgcrypto && echo "pgcrypto present" || echo "pgcrypto missing (provider may not support extension)"
+
+echo "Applying migrations to test database..."
+pnpm run migrate:up -- --config database/migration-config.js
+echo "Running smoke ETL test..."
+# optionally run a smoke script that inserts sample snapshot and triggers local worker
+```
+
+---
+
+## 19. Runbooks & operational notes
+
+- Link runbooks in `docs/OP_RUNBOOKS/`:
+ - [APPLY_MIGRATIONS.md](./OP_RUNBOOKS/APPLY_MIGRATIONS.md) — exact steps and command lines for running the
+ `GitHub Action` and verifying post-run checks.
+ - [MIGRATION_ROLLBACK.md](./OP_RUNBOOKS/MIGRATION_ROLLBACK.md) — step-by-step rollback and restore runbook.
+ - [BACKFILL.md](./OP_RUNBOOKS/BACKFILL.md) — schedule, throttle, and operator steps to run a backfill safely.
+
+### Operational tip
+
+- For critical migrations requiring owner approval, tag PR with `migration-impact: high` and schedule a 2nd approver.
+
+---
+
+## 20. References
+
+- [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema)
+- [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) (ETL contract)
+- `node-pg-migrate`: https://github.com/salsita/node-pg-migrate
+- `PostgreSQL` docs: https://www.postgresql.org/docs/
+- `Prometheus`: https://prometheus.io/docs/
+- `OpenTelemetry`: https://opentelemetry.io/
+
+---
diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md
new file mode 100644
index 0000000..0a8497d
--- /dev/null
+++ b/docs/OBSERVABILITY.md
@@ -0,0 +1,712 @@
+# Observability — StarForge
+
+> Version: 0.1
+> Date: 2025-12-03
+
+---
+
+This document specifies the observability strategy, metrics, logging, tracing, dashboards and alerting for StarForge.
+It is intended for engineers and `SREs` implementing and operating the `ETL/Worker`, `API`, `Bot` and `admin services`
+described in the [docs/PRD.md](./PRD.md), [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md)
+and [docs/DB_MODEL.md](./DB_MODEL.md).
+
+## Goals
+
+- Provide actionable, low-noise monitoring for production and staging.
+- Enable fast incident detection (`ETL` failure spikes, queue backlog, DB exhaustion).
+- Provide structured logs and traces to speed root-cause analysis.
+- Ensure observability respects privacy (`PII` redaction) and cost constraints.
+
+---
+
+## Audience
+
+- `Backend engineers` (instrumentation)
+- `SRE / DevOps` (dashboards, alerting, retention)
+- `QA / Data engineers` (validate metrics, dashboards)
+- `On-call responders` (runbooks & alerts)
+
+---
+
+## Contents
+
+1. [Observability architecture overview](#1-observability-architecture-overview)
+2. [Metrics (what to collect, names, labels)](#2-metrics--names-types-labels-and-meaning)
+3. [Prometheus configuration & scraping guidance](#3-prometheus-configuration--scraping-guidance)
+4. [Grafana dashboards (suggested panels & queries)](#4-grafana-dashboards-templates--panels)
+5. [Alerting (rules, thresholds, playbooks)](#5-alerting--rules-thresholds--playbooks)
+6. [Logging (structure, fields, samples, retention)](#6-logging--structure-fields-scrubbing--retention)
+7. [Tracing (OpenTelemetry, spans, sampling, propagation)](#7-tracing--opentelemetry-spans-propagation-and-sampling)
+8. [Sentry / Error tracking (configuration & scrubbing)](#8-sentry--error-tracking)
+9. [Correlation & metadata (request_id, trace_id, snapshot_id)](#9-correlation--metadata--request_id-trace_id-snapshot_id)
+10. [Security & PII handling in telemetry](#10-security--pii-handling-in-telemetry)
+11. [Retention & storage policies](#11-retention--storage-policies)
+12. [Testing observability in CI](#12-testing-observability-in-ci)
+13. [Operational runbooks & playbook snippets](#13-operational-runbooks--playbooks-summary--links)
+14. [Implementation checklist / example snippets](#14-implementation-checklist--example-snippets)
+15. [References & next steps](#15-references--next-steps)
+
+---
+
+## 1. Observability architecture overview
+
+### Components
+
+- Instrumented services: `api-service`, `etl-worker`, `discord-bot`, `admin-ui`
+- Metrics: `Prometheus` (scrape or push via exporters)
+- Dashboards: `Grafana` (dashboards per service & cross-service)
+- Traces: `OpenTelemetry` collector -> tracing backend (`Tempo`/`Jaeger`/`Tempo+Loki`)
+- Error tracking: `Sentry` (exceptions, structured events) with scrubbing
+- Logs: Structured JSON logs shipped to central log store (`Loki`/`ELK`/`Datadog`/`Logflare`)
+- Alerts: `Prometheus Alertmanager` -> pager (`PagerDuty`) and `Slack` channels
+- Long-term storage: Archive raw metrics/aggregates to cost-effective storage if required
+
+---
+
+### Design notes
+
+- Favor pull-based metrics (`Prometheus` scrape) for services with stable endpoints; use pushgateway for ephemeral jobs
+ when necessary.
+- Ensure all instrumented services expose `/metrics` and `/health`.
+- Propagate correlation ids (`X-Request-Id`) and trace context across requests and jobs to link logs, metrics and
+ traces.
+
+---
+
+## 2. Metrics — names, types, labels, and meaning
+
+### Naming conventions
+
+- Prefix: `starforge_` for all app metrics.
+- Suffix conventions:
+ - `_total` for counters
+ - `_seconds` / `_duration_seconds` for durations
+ - `_gauge` for gauges
+- Use `snake_case`.
+- Avoid cardinality explosion in labels (limit label values and cardinality).
+
+---
+
+### Label guidance
+
+- Common labels: `env`, `service`, `instance`, `job_type`, `snapshot_source`, `queue_name`, `error_type`
+- Use `user_id` or `namecode` sparingly (avoid high-cardinality); prefer hashed or bucketed labels for analytics.
+
+---
+
+### Core metric categories and recommended metrics
+
+#### A. ETL / Worker metrics (prefix: `starforge_etl_`)
+
+- Counters
+ - `starforge_etl_snapshots_received_total{env,service,job_type, snapshot_source}` — incremented when `API` enqueues
+ a snapshot job.
+ - `starforge_etl_snapshots_processed_total{env,service,result}` — result ∈ {success,failed,partial}.
+ - `starforge_etl_snapshots_failed_total{env,service,error_type}` — categorize errors (`PARSE_ERROR`, `DB_ERROR`,
+ `FK_VIOLATION`, `OOM`).
+ - `starforge_etl_entity_upserts_total{env,service,entity}` — counts per entity (`users`, `user_troops`,
+ `user_pets`, ...).
+- Histograms / Summaries
+ - `starforge_etl_processing_duration_seconds_bucket{le,env,service}` — processing latency histogram per snapshot.
+ - `starforge_etl_upsert_batch_size` — batch sizes distribution used for multi-row upserts.
+- Gauges
+ - `starforge_etl_jobs_in_progress{env,service}` — currently processing jobs.
+ - `starforge_etl_last_processed_timestamp{env,service}` — unix timestamp of last processed snapshot.
+
+#### B. API metrics (prefix: `starforge_api_`)
+
+- Counters
+ - `starforge_api_requests_total{env,service,endpoint,method,status}`
+- Histograms
+ - `starforge_api_request_duration_seconds{env,service,endpoint}`
+- Gauges
+ - `starforge_api_inflight_requests{env,service}`
+
+#### C. Queue & Redis metrics
+
+- If using `BullMQ` / `Redis`:
+ - `starforge_queue_jobs_waiting_total{env,queue_name}` — count of waiting jobs (pullable from `Redis`)
+ - `starforge_queue_jobs_active_total{env,queue_name}`
+ - `starforge_queue_jobs_failed_total{env,queue_name}`
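+
+If `BullMQ` is used, queue depth can be exported with a `prom-client` gauge whose `collect()` callback reads counts at
+ scrape time. An illustrative sketch (queue name and Redis connection details are assumptions):
+
+```js
+const { Queue } = require( 'bullmq' );
+const client = require( 'prom-client' );
+
+const snapshotQueue = new Queue( 'process_snapshot', { connection: { host: 'redis', port: 6379 } } );
+
+// collect() runs on every /metrics scrape, so queue depth is read from Redis lazily.
+const jobsWaiting = new client.Gauge( {
+  name: 'starforge_queue_jobs_waiting_total',
+  help: 'Jobs waiting in the queue',
+  labelNames: [ 'env', 'queue_name' ],
+  async collect() {
+    const counts = await snapshotQueue.getJobCounts();
+    this.set( { env: process.env.NODE_ENV, queue_name: 'process_snapshot' }, counts.waiting ?? 0 );
+  },
+} );
+```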
+
+#### D. Database metrics (collect from exporter)
+
+- `pg_up` / `pg_connections` / `pg_active_queries` / `pg_longest_tx_seconds` / `pg_lock_count` (from `pg_exporter`)
+- `starforge_db_tx_duration_seconds` (application-level)
+
+#### E. Infrastructure metrics
+
+- CPU, memory, disk, network per instance (`node exporter`)
+- Pod-level metrics if `Kubernetes` (`kube-state-metrics`)
+
+#### F. Business metrics (higher-level)
+
+- `starforge_players_active_30d` — unique active players in 30 days (calculated/ingested)
+- `starforge_bot_commands_total{command}`
+
+### Suggested metric dimensions (labels)
+
+- `env` (production/staging/local)
+- `service` (api, etl-worker, bot)
+- `instance` (host or pod name)
+- `job_type` (process_snapshot, backfill_range)
+- `snapshot_source` (fetch_by_namecode, login, cli_upload)
+- `entity` (user_troops, user_pets, ...)
+
+### Metric examples (Prometheus exposition style)
+
+```text
+# HELP starforge_etl_snapshots_processed_total Number of snapshots processed
+# TYPE starforge_etl_snapshots_processed_total counter
+starforge_etl_snapshots_processed_total{env="production",service="etl-worker",result="success"} 12345
+```
+
+---
+
+## 3. Prometheus configuration & scraping guidance
+
+### Scrape targets
+
+- Each service exposes `/metrics` (`Prometheus` format) on a dedicated metrics port (e.g., 9091).
+- For `Kubernetes`: use `ServiceMonitor` (`Prometheus` Operator) per service.
+
+---
+
+### Example `scrape_config` (prometheus.yml)
+
+```yaml
+scrape_configs:
+ - job_name: 'starforge-api'
+ metrics_path: /metrics
+ static_configs:
+ - targets: [ 'api-1.example.com:9091','api-2.example.com:9091' ]
+ relabel_configs:
+ - source_labels: [ '__address__' ]
+ target_label: 'instance'
+
+ - job_name: 'starforge-etl'
+ metrics_path: /metrics
+ kubernetes_sd_configs:
+ - role: pod
+ relabel_configs:
+ - source_labels: [ __meta_kubernetes_pod_label_app ]
+ regex: 'etl-worker'
+ action: keep
+```
+
+---
+
+### Scrape intervals
+
+- Default: `scrape_interval: 15s` for application metrics.
+- For high-volume metrics or high-cardinality series, consider `30s` to reduce load.
+
+---
+
+### Relabeling & security
+
+- Add `relabel_configs` to attach `env` and `service` labels from pod metadata.
+- Secure `Prometheus` scrapes with `mTLS` or allowlist internal networks.
+
+---
+
+### Pushgateway
+
+- Use only for short-lived ephemeral jobs if necessary; prefer direct scraping and persistent worker processes.
+
+---
+
+## 4. Grafana dashboards (templates & panels)
+
+### Build dashboards per audience
+
+- `ETL` / `Worker` dashboard (`SRE`, `backend`)
+- `API` dashboard (latency, errors)
+- DB dashboard (connections, slow queries)
+- Bot dashboard (commands, success rate)
+- Overall system health dashboard (top-level)
+
+---
+
+### Recommended ETL dashboard panels
+
+- Snapshot throughput: `rate(starforge_etl_snapshots_processed_total[1m])` (timeseries)
+- Processing latency: `histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le))`
+- Failure rate:
+ `sum(rate(starforge_etl_snapshots_failed_total[5m])) / sum(rate(starforge_etl_snapshots_processed_total[5m]))`
+- Queue depth: `starforge_queue_jobs_waiting_total`
+- Jobs in progress: `starforge_etl_jobs_in_progress`
+- Top error types pie: `sum by (error_type) (rate(starforge_etl_snapshots_failed_total[5m]))`
+- DB connections and longest transactions (from `pg_exporter`)
+- Worker restarts / `OOMs` count (kubernetes pod restarts)
+
+---
+
+### Sample `Grafana` panel queries (`PromQL`)
+
+- `p95` processing time:
+
+```promql
+histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le))
+```
+
+- `ETL` failure rate (5m):
+
+```promql
+(sum(rate(starforge_etl_snapshots_failed_total[5m])) by (service)) /
+(sum(rate(starforge_etl_snapshots_processed_total[5m])) by (service))
+```
+
+- Queue depth per queue:
+
+```promql
+starforge_queue_jobs_waiting_total{env="production"}
+```
+
+### Dashboard tips
+
+- Add annotations for deployments & migrations to correlate spikes.
+- Show per-instance panels for troubleshooting, plus cluster-level aggregates.
+- Provide drilldown links to logs (`Loki`/`ELK`) and traces (`Tempo`/`Jaeger`) with `snapshot_id` or `trace_id`.
+
+---
+
+## 5. Alerting — rules, thresholds & playbooks
+
+### Design principles
+
+- Keep alerts actionable and low-noise.
+- Avoid alerting on single transient blips—use multi-window conditions or require sustained violations.
+- Each alert should include: description, severity, likely causes, quick actions, runbook link.
+
+### Suggested alert severities mapping
+
+- `P0` (page): system outage / critical data loss / DB down / severe `ETL` failure rate
+- `P1` (notify): service degradation or resource saturation
+- `P2` (ticket): non-urgent but actionable issues
+
+### Alert examples (`Prometheus Alertmanager` rules format snippets)
+
+#### A. `ETL` failure rate high (`P0`)
+
+```yaml
+- alert: StarforgeETLFailureRateHigh
+ expr: |
+ (
+ sum(rate(starforge_etl_snapshots_failed_total[5m]))
+ /
+ sum(rate(starforge_etl_snapshots_processed_total[5m]))
+ ) > 0.01
+ for: 5m
+ labels:
+ severity: page
+ annotations:
+ summary: "ETL failure rate > 1% (5m)"
+ description: "ETL failure rate is above 1% for the last 5 minutes. Check etl_errors table and recent logs. Runbook: docs/OP_RUNBOOKS/INCIDENT_RESPONSE.md"
+```
+
+#### B. Queue depth high (`P0`)
+
+```yaml
+- alert: StarforgeQueueDepthHigh
+ expr: starforge_queue_jobs_waiting_total > 500
+ for: 10m
+ labels:
+ severity: page
+ annotations:
+ summary: "ETL queue backlog > 500"
+ description: "ETL queue backlog is high; scale workers or investigate job failure storms. Check Redis and worker heartbeats."
+```
+
+#### C. DB connection exhaustion (`P0`)
+
+```yaml
+- alert: PostgresConnectionsHigh
+ expr: pg_stat_activity_count > (pg_settings_max_connections * 0.9)
+ for: 2m
+ labels:
+ severity: page
+ annotations:
+ summary: "DB connections > 90%"
+ description: "Database connections exceed 90% of max. Reduce worker concurrency or scale DB. See runbook DB_RESTORE.md / SCALING_UP.md."
+```
+
+#### D. Processing latency regression (`P1`)
+
+```yaml
+- alert: ETLProcessingLatencyHigh
+ expr: histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le)) > 30
+ for: 10m
+ labels:
+ severity: notify
+ annotations:
+ summary: "ETL p95 processing time > 30s"
+```
+
+#### E. Worker restarts (`P1`)
+
+```yaml
+- alert: WorkerPodRestartsHigh
+ expr: increase(kube_pod_container_status_restarts_total{job="kube-state-metrics",pod=~"etl-worker.*"}[10m]) > 3
+ for: 0m
+ labels:
+ severity: page
+ annotations:
+ summary: "ETL worker restarting frequently"
+```
+
+### Alert content: always include
+
+- How to check: queries to run, which dashboards to view.
+- Quick actions: pause backfill, scale workers, check DB connections, examine `etl_errors` for sample snapshot ids.
+- Contact / escalations: on-call rotation or team.
+
+### Playbook snippets (on alert)
+
+- `ETL` failure rate high:
+ 1. Check `Grafana ETL` dashboard and `etl_errors`.
+ 2. Identify top error types and `snapshot_ids`.
+ 3. If parse error is widespread, pause ingestion, create task for update, and escalate to engineering lead.
+ 4. If DB errors, reduce worker concurrency immediately.
+
+---
+
+## 6. Logging — structure, fields, scrubbing & retention
+
+### 6.1 Logging format & mandatory fields
+
+- Use structured JSON logs. Each log MUST include:
+ - `timestamp` (ISO 8601)
+ - `service` (api | etl-worker | bot | admin-ui)
+ - `env` (production | staging)
+ - `level` (DEBUG | INFO | WARN | ERROR)
+ - `message` (short human readable)
+ - `request_id` or `job_id`
+ - `trace_id` (optional, for trace linking)
+ - `snapshot_id` (when relevant)
+ - `worker_instance` or `pod`
+ - `module` (component)
+ - `details` (structured object for contextual fields)
+
+#### Example log (info)
+
+```json
+{
+ "timestamp": "2025-12-03T10:12:34Z",
+ "service": "etl-worker",
+ "env": "production",
+ "level": "info",
+ "message": "Snapshot processed",
+ "snapshot_id": "a2f1c1b2-...-e4f9",
+ "job_id": "bull-uuid-1234",
+ "worker_instance": "worker-3",
+ "duration_ms": 5432,
+ "entities_upserted": {
+ "users": 1,
+ "user_troops": 120
+ },
+ "trace_id": "..."
+}
+```
+
+#### Example log (error)
+
+```json
+{
+ "timestamp": "2025-12-03T10:13:01Z",
+ "service": "etl-worker",
+ "env": "production",
+ "level": "error",
+ "message": "Failed to upsert user_troops batch",
+ "snapshot_id": "...",
+ "job_id": "...",
+ "error_type": "DB_ERROR",
+ "error_message": "duplicate key value violates unique constraint \"ux_user_troops_user_troop\"",
+ "details": {
+ "batch_size": 200
+ }
+}
+```
+
+### 6.2 Redaction & `PII` scrubbing
+
+- Implement a scrubbing pipeline *before* logs leave the app:
+ - Mask values for keys: `password`, `token`, `session`, `auth`, `api_key`, `access_token`.
+ - Replace with `"[REDACTED]"` or hashed placeholder.
+- For raw snapshots, log only `snapshot_id` and `s3_path` (if archived), not the raw JSON.
+- Sentry events must be scrubbed similarly.
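+
+One possible implementation of the mandatory fields plus scrubbing, assuming `pino` (base fields and redaction paths
+ are illustrative):
+
+```js
+const pino = require( 'pino' );
+
+// Structured JSON logger: mandatory fields from 6.1 as base bindings,
+// key-based redaction applied before anything leaves the process.
+const logger = pino( {
+  base: {
+    service: 'etl-worker',
+    env: process.env.NODE_ENV,
+    worker_instance: process.env.HOSTNAME,
+  },
+  redact: {
+    paths: [ 'password', 'token', 'session', 'auth', 'api_key', 'access_token',
+             'details.password', 'details.token', 'details.api_key', 'details.access_token' ],
+    censor: '[REDACTED]',
+  },
+  timestamp: pino.stdTimeFunctions.isoTime,
+} );
+
+// Per-job child logger so correlation ids appear on every line.
+function createJobLogger( job, traceId ) {
+  return logger.child( { job_id: job.id, snapshot_id: job.data.snapshot_id, trace_id: traceId } );
+}
+```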
+
+### 6.3 Log levels & sampling
+
+- Use DEBUG in staging; in production use INFO for success events and WARN/ERROR for issues.
+- Sample high-volume DEBUG logs (e.g., per-troop operations) or rate-limit them server-side (`1% sample`).
+- For error logs related to a unique snapshot, keep full detailed logs for initial investigation, and then sample/purge
+ as needed.
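+
+A tiny illustrative helper for sampling high-volume DEBUG logs (signature follows a `pino`-style
+ `logger.debug(fields, message)` call):
+
+```js
+// Emit roughly `rate` of high-volume debug logs; WARN/ERROR are never sampled.
+function debugSampled( logger, fields, message, rate = 0.01 ) {
+  if ( Math.random() < rate ) {
+    logger.debug( { ...fields, sampled: true, sample_rate: rate }, message );
+  }
+}
+
+// Usage inside a hot loop (per-troop operations):
+// debugSampled( jobLogger, { troop_id: troop.id }, 'Upserted troop row' );
+```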
+
+### 6.4 Log shipping & retention
+
+- Ship logs to centralized log store with indexable fields: `service`, `env`, `snapshot_id`, `job_id`, `trace_id`.
+- Retention policy (recommended baseline):
+ - Hot (searchable) logs: 30 days
+ - Warm (aggregated) logs: 90 days
+ - Cold archive (compressed): 365+ days (for compliance needs)
+- For `PII`-containing logs (rare), keep only as long as legally required and ensure encryption at rest.
+
+---
+
+## 7. Tracing — OpenTelemetry, spans, propagation, and sampling
+
+### 7.1 Goals
+
+- Provide end-to-end traces across `API` ingest -> queue -> worker -> DB upserts.
+- Correlate traces with logs and metrics for fast `RCA`.
+
+### 7.2 Instrumentation strategy
+
+- Use `OpenTelemetry SDK` for `Node.js` (`api-service`, `etl-worker`, `bot`).
+- Instrument:
+ - HTTP servers/clients (incoming `API` requests and outgoing requests)
+ - Queue interactions (enqueue / dequeue / job processing) — use semantic conventions to create spans `queue.enqueue`
+ and `queue.process`
+ - DB calls (`pg`) — create spans for queries/batches
+ - Heavy processing spans (`parse_snapshot`, `upsert_user_troops_batch`)
+
+### 7.3 Trace context propagation
+
+- Propagate `traceparent` headers for HTTP requests.
+- For queue jobs include `trace_id` and `parent_span` in job metadata to continue trace in worker.
+- Use `X-Request-Id` as fallback correlation id.
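+
+An illustrative sketch of propagation through the queue using the `@opentelemetry/api` `inject`/`extract` helpers
+ (assumes a `BullMQ`-style `queue.add(name, data)` producer; names are placeholders):
+
+```js
+const { context, propagation, trace } = require( '@opentelemetry/api' );
+
+// Producer (API): serialize the active trace context into the job payload.
+async function enqueueSnapshot( queue, snapshotId ) {
+  const otel = {};
+  propagation.inject( context.active(), otel ); // writes traceparent/tracestate keys
+  await queue.add( 'process_snapshot', { snapshot_id: snapshotId, otel } );
+}
+
+// Consumer (worker): restore the context so worker spans join the original trace.
+async function processJob( job ) {
+  const parentCtx = propagation.extract( context.active(), job.data.otel || {} );
+  const tracer = trace.getTracer( 'starforge-etl' );
+  return context.with( parentCtx, () =>
+    tracer.startActiveSpan( 'queue.worker.process', async ( span ) => {
+      try {
+        span.setAttribute( 'snapshot_id', job.data.snapshot_id );
+        // claim, parse, upsert ...
+      } finally {
+        span.end();
+      }
+    } )
+  );
+}
+```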
+
+### 7.4 Spans & recommended hierarchy (example)
+
+- `http.server` (API: `POST /internal/snapshots`)
+ - `queue.enqueue` (enqueue job)
+ - job stored in `Redis` (span)
+- `queue.worker.process` (worker dequeues)
+ - `claim_snapshot` (DB update)
+ - `parse_snapshot` (stream parsing)
+ - `map_entity:users`
+ - `db.upsert.users`
+ - `map_entity:user_troops` (batched)
+ - `db.upsert.user_troops.batch`
+ - `commit_snapshot` (mark processed)
+
+### 7.5 Sampling & retention
+
+- Sample rate: initial default `10%` for production traces (adjust based on cost and utility).
+- Always sample traces for errors (Sentry integration) or for admin-triggered runs (e.g., reprocess).
+- Retention:
+ - Full traces: 7 days
+ - Sampled traces: 30 days
+
+### 7.6 Collector & backend
+
+- Use an `OpenTelemetry Collector` to receive spans and export to backend (`Tempo`/`Jaeger`/`Lightstep`).
+- Collector responsibilities: batching, sampling, drop rules, and forwarding to `APM`.
+
+---
+
+## 8. Sentry / Error tracking
+
+### 8.1 What to capture
+
+- Unhandled exceptions, errors that result in failed jobs, parsing exceptions, DB errors flagged by worker
+- Attach minimal context: `snapshot_id`, `job_id`, `worker_instance`, `env`, `user_id` (if non-`PII`), `trace_id`
+
+### 8.2 Scrubbing & PII protection
+
+- Configure `Sentry` before-send hooks to remove or mask `PII` from event payloads.
+- Do not send full raw snapshot content to `Sentry`; include only `snapshot_id` and `s3_path`.
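+
+A possible `@sentry/node` setup with a `beforeSend` hook (the key list mirrors 6.2/10.1; the recursive `scrub` helper
+ is illustrative):
+
+```js
+const Sentry = require( '@sentry/node' );
+
+const SENSITIVE_KEYS = [ 'password', 'token', 'access_token', 'session', 'auth', 'api_key' ];
+
+// Recursively mask sensitive keys anywhere in the event payload.
+function scrub( value ) {
+  if ( Array.isArray( value ) ) return value.map( scrub );
+  if ( value && typeof value === 'object' ) {
+    return Object.fromEntries( Object.entries( value ).map( ( [ k, v ] ) =>
+      SENSITIVE_KEYS.includes( k.toLowerCase() ) ? [ k, '[REDACTED]' ] : [ k, scrub( v ) ]
+    ) );
+  }
+  return value;
+}
+
+Sentry.init( {
+  dsn: process.env.SENTRY_DSN,
+  environment: process.env.NODE_ENV,
+  beforeSend( event ) {
+    if ( event.extra ) event.extra = scrub( event.extra );
+    if ( event.request ) event.request = scrub( event.request );
+    return event;
+  },
+} );
+```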
+
+### 8.3 Rate limiting & sampling
+
+- Configure `Sentry` sampling to avoid being overwhelmed by noise (sample non-critical errors).
+- For critical jobs or newly introduced code, temporarily increase sampling to get better coverage.
+
+### 8.4 Alerts & integration
+
+- Use `Sentry` alert rules to create issues in ticketing systems for high-severity errors.
+- Integrate `Sentry` with Slack for immediate visibility on new critical errors.
+
+---
+
+## 9. Correlation & metadata — request_id, trace_id, snapshot_id
+
+### 9.1 Correlation ids
+
+- `X-Request-Id`: set by client or `API` gateway; if absent, generate `UUID` at `API` ingress and return to client in
+ response.
+- Include `request_id` in logs and metrics where possible.
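+
+A minimal `Express` middleware sketch for the ingress behavior described above (uses Node's `crypto.randomUUID`):
+
+```js
+const express = require( 'express' );
+const { randomUUID } = require( 'crypto' );
+
+const app = express();
+
+// Accept an inbound X-Request-Id or mint one, echo it on the response,
+// and keep it on the request object for loggers and downstream calls.
+app.use( ( req, res, next ) => {
+  const requestId = req.get( 'X-Request-Id' ) || randomUUID();
+  req.requestId = requestId;
+  res.set( 'X-Request-Id', requestId );
+  next();
+} );
+```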
+
+### 9.2 Trace ids
+
+- Standard `OpenTelemetry` `trace_id` propagated through HTTP and queue.
+- Ensure worker reads `trace_id` from job payload and continues trace.
+
+### 9.3 Snapshot id
+
+- Include `snapshot_id` label in relevant metrics, logs and traces to pivot troubleshooting to the raw payload.
+- Avoid using `snapshot_id` as a `Prometheus` label (its cardinality is effectively unbounded); use it in logs and as a
+ trace attribute instead.
+
+---
+
+## 10. Security & PII handling in telemetry
+
+### 10.1 Policy highlights
+
+- Telemetry must not leak secrets or raw credentials.
+- Keep list of sensitive keys (`password`, `token`, `access_token`, `session`, `auth`) and scrub them in telemetry
+ pipelines.
+- Use hashed identifiers if you must include user identifiers in metrics (e.g., hash(namecode) -> low-card label
+ buckets).
+
+### 10.2 Scrubbing implementation points
+
+- Application-level: sanitize before logging/attaching to traces/`Sentry`.
+- Collector-level: run additional redaction rules in `OpenTelemetry Collector`.
+- Logging pipeline-level: final redaction step in log shipper (`Loki`/`Logstash`).
+
+### 10.3 DSR & telemetry
+
+- When a `Data Subject Request` (`DSR`) requires deletion of personal data:
+ - Search logs and provide redaction or proof of removal as per policy (logs may be immutable; keep a record of
+ attempted purge and apply legal process).
+ - For archived traces / logs in cold storage, follow legal guidance.
+
+---
+
+## 11. Retention & storage policies
+
+### Recommended baseline (configurable by org policy)
+
+- Metrics (`Prometheus`):
+ - Raw samples: 30 days
+ - Aggregated rollups: 365 days
+- Traces:
+ - Full fidelity: 7 days
+ - Sampled: 30 days
+- Logs:
+ - Hot searchable: 30 days
+ - Warm/archived compressed: 365 days
+- `Sentry` / errors: 90 days (or provider plan)
+
+### Cost controls
+
+- Use metric relabeling to drop or reduce cardinality of high-cardinality labels in `Prometheus`.
+- Use trace sampling and error-only retention to control trace storage.
+- Use log lifecycle rules to move older logs to cheaper storage tiers.
+
+---
+
+## 12. Testing observability in CI
+
+### 12.1 Unit & integration asserts
+
+- Unit tests should assert that metrics are emitted for key code paths (use fake `Prometheus` registry).
+- Integration tests should:
+ - Spin up services, call endpoints that produce metrics, and query `/metrics` to assert presence.
+ - Insert a sample snapshot and assert worker emits `starforge_etl_snapshots_processed_total`.
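+
+A jest-style sketch of the unit-level assertion with a throwaway `prom-client` registry (the direct increment stands in
+ for the real code path under test):
+
+```js
+const client = require( 'prom-client' );
+
+test( 'processing a snapshot increments the processed counter', async () => {
+  // Throwaway registry so the assertion does not depend on global state.
+  const register = new client.Registry();
+  const processed = new client.Counter( {
+    name: 'starforge_etl_snapshots_processed_total',
+    help: 'Snapshots processed',
+    labelNames: [ 'env', 'service', 'result' ],
+    registers: [ register ],
+  } );
+
+  processed.inc( { env: 'test', service: 'etl-worker', result: 'success' } );
+
+  const exposition = await register.metrics();
+  expect( exposition ).toContain( 'starforge_etl_snapshots_processed_total' );
+  expect( exposition ).toContain( 'result="success"' );
+} );
+```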
+
+### 12.2 Performance and load tests
+
+- Simulate realistic load in staging and validate dashboards (`ETL` throughput, `p95` processing, DB connections).
+- Validate alert thresholds in a controlled manner (e.g., create synthetic failures to trigger alerts in staging).
+
+### 12.3 Health checks
+
+- Include `curl /health` checks in `CI` smoke tests for service liveness and DB connectivity.
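+
+A sketch of the `/health` handler such a smoke test would hit (assumes `Express` and `pg`; port and response shape are
+ illustrative):
+
+```js
+const express = require( 'express' );
+const { Pool } = require( 'pg' );
+
+const app = express();
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+
+// Liveness plus DB connectivity check used by CI smoke tests and readiness probes.
+app.get( '/health', async ( req, res ) => {
+  try {
+    await pool.query( 'SELECT 1' );
+    res.json( { status: 'ok' } );
+  } catch ( err ) {
+    res.status( 503 ).json( { status: 'degraded', error: 'db_unreachable' } );
+  }
+} );
+
+app.listen( 3000 );
+```
+
+Running `curl --fail http://localhost:3000/health` in the CI smoke step then yields a single pass/fail signal.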
+
+---
+
+## 13. Operational runbooks & playbooks (summary + links)
+
+### Key runbooks
+
+Each alert should point to a runbook with step-by-step instructions:
+
+- `ETL` failure spike — [docs/OP_RUNBOOKS/ETL_FAILURE_SPIKE.md](./OP_RUNBOOKS/ETL_FAILURE_SPIKE.md)
+- Queue backlog — [docs/OP_RUNBOOKS/QUEUE_BACKLOG.md](./OP_RUNBOOKS/QUEUE_BACKLOG.md)
+- DB connection exhaustion — [docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md](./OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md)
+- `Worker OOM` / Crash — [docs/OP_RUNBOOKS/WORKER_OOM.md](./OP_RUNBOOKS/WORKER_OOM.md)
+- Migration rollback — [docs/OP_RUNBOOKS/MIGRATION_ROLLBACK.md](./OP_RUNBOOKS/MIGRATION_ROLLBACK.md)
+- Secrets compromise — [docs/OP_RUNBOOKS/SECRET_COMPROMISE.md](./OP_RUNBOOKS/SECRET_COMPROMISE.md)
+
+### Runbook essentials to include in each alert
+
+- Quick triage steps
+- Targeted queries and dashboards
+- Suggested mitigation (scale down workers, pause backfills, flip feature flag)
+- Escalation path and contacts
+
+### Example quick-playbook for ETL failure spike (short)
+
+1. Check Grafana ETL dashboard and `etl_errors` table for top `error_type` (PARSE vs DB).
+2. If DB errors: scale down worker concurrency and check DB metrics (connections, locks).
+3. If parse errors: sample failing `snapshot_ids` (select last 10 from `etl_errors`) and save to `docs/examples` for dev
+ debugging.
+4. Pause ingestion if necessary and notify product & engineering.
+5. Re-enable ingestion and monitor.
+
+---
+
+## 14. Implementation checklist / example snippets
+
+### Checklist for instrumenting a service
+
+- [ ] Expose `/metrics` in `Prometheus` format (e.g., `prom-client` for Node services)
+- [ ] Expose `/health` and readiness probes
+- [ ] Instrument core paths: `API` ingest, queue enqueue, job start/end
+- [ ] Emit metrics with appropriate labels (`env`, `service`, `job_type`)
+- [ ] Add structured logging with required fields (`request_id`, `trace_id`)
+- [ ] Integrate `OpenTelemetry` tracing and propagate context into job payload
+- [ ] Sanitize and scrub `PII` before sending logs or events to `Sentry`
+- [ ] Add unit tests asserting metrics emitted
+- [ ] Add end-to-end tests validating `/metrics` counters after processing a sample snapshot
+
+### Prometheus client (Node) snippet example
+
+```js
+const express = require( 'express' );
+const client = require( 'prom-client' );
+
+const app = express();
+const register = new client.Registry();
+
+// Collect default Node.js process metrics (event loop lag, heap, GC, ...).
+client.collectDefaultMetrics( { register } );
+
+const snapshotsProcessed = new client.Counter( {
+  name: 'starforge_etl_snapshots_processed_total',
+  help: 'Snapshots processed',
+  labelNames: [ 'env', 'service', 'result' ],
+} );
+register.registerMetric( snapshotsProcessed );
+
+// Increment from the worker when a snapshot finishes.
+snapshotsProcessed.inc( { env: 'production', service: 'etl-worker', result: 'success' } );
+
+// Expose metrics in Prometheus exposition format.
+app.get( '/metrics', async ( req, res ) => {
+  res.set( 'Content-Type', register.contentType );
+  res.end( await register.metrics() );
+} );
+
+app.listen( 9091 );
+```
+
+### OpenTelemetry (Node) basic initialization (example)
+
+```js
+const { NodeTracerProvider } = require( '@opentelemetry/sdk-trace-node' );
+const { SimpleSpanProcessor } = require( '@opentelemetry/sdk-trace-base' );
+const { registerInstrumentations } = require( '@opentelemetry/instrumentation' );
+const { JaegerExporter } = require( '@opentelemetry/exporter-jaeger' );
+const { HttpInstrumentation } = require( '@opentelemetry/instrumentation-http' );
+const { PgInstrumentation } = require( '@opentelemetry/instrumentation-pg' );
+
+const provider = new NodeTracerProvider();
+const exporter = new JaegerExporter( { serviceName: 'starforge-etl' } );
+provider.addSpanProcessor( new SimpleSpanProcessor( exporter ) );
+provider.register();
+
+// Auto-instrument HTTP and pg; queue and custom spans are created in application code.
+registerInstrumentations( {
+  tracerProvider: provider,
+  instrumentations: [ new HttpInstrumentation(), new PgInstrumentation() ],
+} );
+```
+
+---
+
+## 15. References & next steps
+
+- [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) — `ETL` worker contract (instrumentation targets referenced throughout
+ this doc)
+- [docs/DB_MODEL.md](./DB_MODEL.md) — canonical schema (for DB-local metrics)
+- [docs/MIGRATIONS.md](./MIGRATIONS.md) — migration conventions (alert to add annotations for migrations)
+- Prometheus docs: https://prometheus.io/docs/
+- OpenTelemetry: https://opentelemetry.io/
+- Grafana best practices: https://grafana.com/docs/grafana/latest/
+
+---