diff --git a/docs/DB_MIGRATIONS.md b/docs/DB_MIGRATIONS.md deleted file mode 100644 index ef581d3..0000000 --- a/docs/DB_MIGRATIONS.md +++ /dev/null @@ -1,262 +0,0 @@ -# Database Migrations - -This document explains how to manage database migrations for the StarForge project. - -## Overview - -StarForge uses [node-pg-migrate](https://salsita.github.io/node-pg-migrate/) for database schema migrations. This provides: - -- **Version control** for database schema changes -- **Up and down migrations** for easy rollback -- **Idempotent operations** that are safe to run multiple times -- **Transaction support** for atomic changes - -## Key Design Decisions - -### pgcrypto vs uuid-ossp - -We use the `pgcrypto` extension with `gen_random_uuid()` instead of `uuid-ossp` with `uuid_generate_v4()`. This choice was made for better compatibility with managed PostgreSQL providers like Supabase, which may have restrictions on certain extensions. - -### JavaScript Migrations - -Migrations are written in JavaScript (not SQL) to take advantage of: - -- Built-in up/down migration support -- Better transaction control -- Programmatic schema generation -- Type safety and validation - -### Idempotent Seeds - -Seed data uses `INSERT ... ON CONFLICT DO NOTHING` to ensure it can be safely run multiple times without duplicating data. - -## Local Development - -### Prerequisites - -- PostgreSQL client tools (`psql`) installed -- Node.js and pnpm installed -- `DATABASE_URL` environment variable configured - -### Setup - -1. **Configure your database connection:** - -```bash -export DATABASE_URL="postgresql://user:password@localhost:5432/starforge" -export PGSSLMODE=require # Optional, defaults to 'require' -``` - -2. **Run the bootstrap script:** - -```bash -./scripts/bootstrap-db.sh -``` - -This will: - -- Install dependencies -- Run all pending migrations -- Apply seed data -- Verify the setup - -### Manual Migration Commands - -**Run all pending migrations:** - -```bash -pnpm run migrate:up -``` - -**Rollback the last migration:** - -```bash -pnpm run migrate:down -``` - -**Create a new migration:** - -```bash -pnpm run migrate create my-migration-name -``` - -**Check migration status:** - -```bash -pnpm run migrate status --config database/migration-config.js -``` - -## CI/CD and Production - -### GitHub Actions Workflow - -We have a manual workflow (`db-bootstrap.yml`) for running migrations in production: - -1. Go to **Actions** → **Database Bootstrap** in GitHub -2. Click **Run workflow** -3. Select the target environment (production/staging) -4. Click **Run workflow** button -5. **Approve the deployment** if using the production environment - -### Environment Protection - -The workflow requires manual approval for production deployments through GitHub Environments. This prevents accidental schema changes. - -### Setting Up Environments - -1. In your GitHub repository, go to **Settings** → **Environments** -2. Create a `production` environment -3. Enable **Required reviewers** -4. Add team members who should approve production migrations - -### Required Secrets - -Add these secrets in GitHub repository settings: - -- `DATABASE_URL`: PostgreSQL connection string for your database - -Example: - -``` -postgresql://username:password@host.supabase.co:5432/postgres -``` - -## Creating New Migrations - -### Best Practices - -1. **Keep migrations small and focused** - One logical change per migration -2. **Always include down migrations** - For easy rollback if needed -3. 
**Use transactions** - node-pg-migrate wraps migrations in transactions by default -4. **Test locally first** - Always test migrations on a local database before production -5. **Make migrations idempotent** - Use `IF NOT EXISTS` and similar clauses - -### Example Migration - -```javascript -// database/migrations/1234567890_add_user_preferences.js - -exports.up = (pgm) => { - pgm.createTable( - 'user_preferences', - { - id: { - type: 'uuid', - primaryKey: true, - default: pgm.func('gen_random_uuid()'), - }, - user_id: { - type: 'uuid', - notNull: true, - references: 'users(id)', - onDelete: 'CASCADE', - }, - theme: { - type: 'varchar(50)', - default: "'light'", - }, - notifications_enabled: { - type: 'boolean', - default: true, - }, - }, - { - ifNotExists: true, - }, - ); - - pgm.createIndex('user_preferences', 'user_id', { ifNotExists: true }); -}; - -exports.down = (pgm) => { - pgm.dropTable('user_preferences', { ifExists: true }); -}; -``` - -## Troubleshooting - -### Extension Permissions - -If you encounter errors about creating extensions: - -1. **Enable the extension manually** in your database: - - ```sql - CREATE EXTENSION IF NOT EXISTS pgcrypto; - ``` - -2. **Check extension availability:** - - ```sql - SELECT * FROM pg_available_extensions WHERE name = 'pgcrypto'; - ``` - -3. **Contact your database administrator** if you don't have permissions - -### Connection Issues - -If migrations fail to connect: - -1. Verify your `DATABASE_URL` is correct -2. Check that `PGSSLMODE` is set appropriately (`require` for most cloud providers) -3. Verify your IP is whitelisted if using managed PostgreSQL -4. Test the connection manually: - ```bash - psql "$DATABASE_URL" -c "SELECT 1" - ``` - -### Migration Conflicts - -If migrations get out of sync: - -1. **Check current migration status:** - - ```bash - pnpm run migrate status --config database/migration-config.js - ``` - -2. **Inspect the pgmigrations table:** - - ```sql - SELECT * FROM pgmigrations ORDER BY run_on DESC; - ``` - -3. **Never modify completed migrations** - Create a new migration to fix issues - -## Security Best Practices - -### Secrets Management - -- **Never commit secrets** to version control -- Use GitHub Secrets for CI/CD -- Rotate credentials regularly -- Use different credentials for each environment - -### Access Control - -- Limit who can run migrations in production -- Use GitHub Environment protection rules -- Enable audit logging for database changes -- Review migration PRs carefully - -### Database Backups - -- Always backup before running migrations in production -- Test restoration procedures regularly -- Keep backups for regulatory compliance periods -- Consider point-in-time recovery for critical databases - -## Additional Resources - -- [node-pg-migrate Documentation](https://salsita.github.io/node-pg-migrate/) -- [PostgreSQL Documentation](https://www.postgresql.org/docs/) -- [Supabase Database Documentation](https://supabase.com/docs/guides/database) - -## Support - -If you encounter issues: - -1. Check the troubleshooting section above -2. Review recent migration files for common patterns -3. Consult the team's database administrator -4. Open an issue in the repository with detailed error messages diff --git a/docs/DB_MODEL.md b/docs/DB_MODEL.md index e69de29..79c9bc8 100644 --- a/docs/DB_MODEL.md +++ b/docs/DB_MODEL.md @@ -0,0 +1,1824 @@ +# StarForge — Canonical Database Model (DB_MODEL.md) + +> This document is the canonical reference for the database schema used by StarForge. 
It expands the summary in the PRD +> and contains: +> - high-level overview and design principles, +> - canonical table definitions (key tables used by the ETL and APIs), +> - recommended indexes, constraints and partitioning guidance, +> - upsert patterns / example DDL and SQL, +> - migration & rollout checklist specific to schema changes, +> - DBML representation (paste into https://dbdiagram.io to visualize), +> - sample queries useful for debugging and validation. + +--- + +# Audience + +- Backend engineers implementing migrations and `ETL workers` +- `SRE / DevOps` running bootstrap and production migrations +- Data engineers and analysts consuming normalized tables +- `QA` writing `integration/e2e tests` + +--- + +# Status + +- Draft (aligns with PRD v0.1 — see [docs/PRD.md](./PRD.md)). Update this file when migrations are committed. + +--- + +# Table of contents + +1. [Design goals](#1-design-goals--guiding-principles) +2. [Naming & conventions](#2-naming--conventions) +3. [Core tables (DDL-like descriptions)](#3-core-tables-canonical-descriptions--recommended-columns) +4. [Indexes & query patterns](#4-indexes--query-patterns) +5. [Partitioning & archival guidance](#5-partitioning-retention--archival-guidance) +6. [Upsert examples and idempotency patterns](#6-upsert-examples-and-idempotency-patterns) +7. [Migration strategy & preflight checks](#7-migration-strategy--preflight-checks) +8. [Data retention & archival model](#8-data-retention--archival-model-summary) +9. [Validation queries & health checks](#9-validation-queries--health-checks) +10. [DBML schema (paste into dbdiagram.io)](#10-dbml-schema-paste-into-dbdiagramio) +11. [Change log / references](#11-change-log--references) + +--- + +## 1. Design goals / guiding principles + +- Store `raw upstream snapshots` (JSON) for audit & replay while normalizing commonly queried fields for performance. +- Keep `ETL` idempotent: repeated processing of the same snapshot must not create duplicates or corrupt state. +- Make `migrations` safe and incremental: add nullable columns first, backfill asynchronously, then make NOT NULL. +- Minimize long-running exclusive locks; prefer CONCURRENT index creation where supported. +- Keep `PII` minimal and provide explicit redaction guidance for any sensitive fields prior to archival or external + export. + +--- + +## 2. Naming & conventions + +- All timestamps: timestamptz (UTC). Column names: `created_at`, `updated_at`, `processed_at`, `server_time` (where + upstream provides epoch). +- Primary key convention: `id` as UUID where appropriate (generated with `gen_random_uuid()` via `pgcrypto`), or integer + for catalog tables. +- Foreign keys: explicit FK constraints where referential integrity is required. Use `ON DELETE SET NULL` for references + that must not cascade deletes unexpectedly. +- JSON storage: use `jsonb` for columns storing raw snapshots or arbitrary extra fields. Use `jsonb` named `extra`, + `extras`, or `raw` for unmapped data. +- Index names: `idx__`; unique constraints prefixed `ux_`. + +--- + +## 3. Core tables (canonical descriptions & recommended columns) + +> Note: these are canonical column lists and notes — the actual `SQL DDL` must be created via versioned migrations +> (`node-pg-migrate`). + +### 3.1 users (canonical user mapping) + +- _Purpose:_ canonical user row used by application and joins. 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `namecode` VARCHAR(64) UNIQUE NULLABLE + - `discord_user_id` VARCHAR(64) NULLABLE + - `username` VARCHAR(255) NULLABLE + - `email` VARCHAR(255) NULLABLE + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`namecode`) + - INDEX on `discord_user_id` for quick lookup + - Consider partial indexes for active users (e.g. WHERE `updated_at > now()` - interval '90 days') + +- _Notes:_ + - Keep `PII` minimal; consider moving sensitive `PII` (email) to a protected table with stricter access controls. + - Use `gen_random_uuid()` (`pgcrypto`) where provider allows; otherwise generate `UUIDs` app-side. + +--- + +### 3.2 hero_snapshots (raw ingestion store) + +- _Purpose:_ persist raw `get_hero_profile` payloads for audit, dedupe and replay. Single source-of-truth raw payload + retention. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `user_id` UUID REFERENCES users(id) ON DELETE SET NULL + - `namecode` VARCHAR(64) NULLABLE + - `source` VARCHAR(64) NOT NULL -- e.g. "fetch_by_namecode", "login", "cli_upload" + - `raw` JSONB NOT NULL + - `size_bytes` INTEGER NOT NULL + - `content_hash` VARCHAR(128) NOT NULL -- hex SHA256 + - `server_time` BIGINT NULLABLE -- if upstream provides server timestamp (epoch millis) + - `processing` BOOLEAN DEFAULT FALSE + - `processing_started_at` TIMESTAMPTZ NULLABLE + - `processed_at` TIMESTAMPTZ NULLABLE + - `error_count` INTEGER DEFAULT 0 + - `last_error` JSONB NULLABLE + - `created_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - `GIN`: CREATE INDEX `idx_hero_snapshots_raw_gin` ON `hero_snapshots` USING GIN (raw `jsonb_path_ops`); + - `B-tree`: CREATE INDEX `idx_hero_snapshots_user_created_at` ON `hero_snapshots` (`user_id`, `created_at` DESC); + - Optional: UNIQUE(`content_hash`, source) (or partial unique for dedupe window) — if implemented, prefer + application-level dedupe to allow duplicates outside window. + - Expression index: CREATE INDEX `idx_hero_snapshots_namecode_expr` ON `hero_snapshots` ((raw ->> 'NameCode')); + +- _Notes:_ + - Compute `content_hash` := encode(digest(raw::text, 'sha256'), 'hex') in DB or app on insert; store `size_bytes` := + `octet_length`(raw::text). + - Keep raw `JSONB` for replay; plan retention/archival (see section retention). + - Use small inserts with Idempotency-Key header in ingestion `API` to avoid duplicate processing. + +--- + +### 3.3 user_troops (normalized inventory) + +- _Purpose:_ normalized per-user troop counts for fast queries and analytics. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - `troop_id` INTEGER NOT NULL -- FK to `troop_catalog` when available + - `amount` INTEGER DEFAULT 0 NOT NULL + - `level` INTEGER DEFAULT 1 + - `rarity` INTEGER DEFAULT 0 + - `fusion_cards` INTEGER DEFAULT 0 + - `traits_owned` INTEGER DEFAULT 0 + - `extra` JSONB DEFAULT '{}'::jsonb + - `last_seen` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`user_id`, `troop_id`) + - INDEX `idx_user_troops_troop` ON `user_troops` (`troop_id`) + - INDEX `idx_user_troops_user` ON `user_troops` (`user_id`) + +- _Notes:_ + - Use ON CONFLICT (`user_id`, `troop_id`) DO UPDATE ... in upserts to be idempotent. + - Keep extra `JSONB` for unmapped fields so that `ETL` remains tolerant to upstream changes. 
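+- _DDL sketch (illustrative):_ the raw SQL shape implied by the column list above, assuming `pgcrypto` is available
+  for `gen_random_uuid()`. In practice this must land as a versioned `node-pg-migrate` migration rather than hand-run
+  DDL.
+
+```sql
+CREATE TABLE IF NOT EXISTS user_troops
+(
+    id           UUID PRIMARY KEY     DEFAULT gen_random_uuid(),
+    user_id      UUID        NOT NULL REFERENCES users (id) ON DELETE CASCADE,
+    troop_id     INTEGER     NOT NULL,
+    amount       INTEGER     NOT NULL DEFAULT 0,
+    level        INTEGER              DEFAULT 1,
+    rarity       INTEGER              DEFAULT 0,
+    fusion_cards INTEGER              DEFAULT 0,
+    traits_owned INTEGER              DEFAULT 0,
+    extra        JSONB                DEFAULT '{}'::jsonb,
+    last_seen    TIMESTAMPTZ          DEFAULT now(),
+    updated_at   TIMESTAMPTZ          DEFAULT now(),
+    CONSTRAINT ux_user_troops_user_troop UNIQUE (user_id, troop_id)
+);
+
+CREATE INDEX IF NOT EXISTS idx_user_troops_troop ON user_troops (troop_id);
+CREATE INDEX IF NOT EXISTS idx_user_troops_user ON user_troops (user_id);
+```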
+ +--- + +### 3.4 user_pets + +- _Purpose:_ normalized per-user pet inventory. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - `pet_id` INT NOT NULL + - `amount` INT DEFAULT 0 NOT NULL + - `level` INT DEFAULT 1 + - `xp` BIGINT DEFAULT 0 + - `orb_fusion_cards` INT DEFAULT 0 + - `orbs_used` JSONB DEFAULT '[]'::jsonb + - `ascension_level` INT DEFAULT 0 + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`user_id`, `pet_id`) + - INDEX `idx_user_pets_pet` ON `user_pets` (`pet_id`) + - INDEX `idx_user_pets_user` ON `user_pets` (`user_id`) + +- _Notes:_ + - Keep `orbs_used` as `JSONB` array of event objects if upstream provides that detail. + - Use atomic per-user pet upserts to avoid partial state. + +--- + +### 3.5 user_artifacts + +- _Purpose:_ per-user artifacts (progress & levels). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - `artifact_id` INT NOT NULL + - `level` INT DEFAULT 0 + - `xp` BIGINT DEFAULT 0 + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`user_id`, `artifact_id`) + - INDEX `idx_user_artifacts_artifact` ON `user_artifacts` (`artifact_id`) + +- _Notes:_ + - Upsert with ON CONFLICT(`user_id`, `artifact_id`) DO UPDATE to be idempotent. + +--- + +### 3.6 user_teams + +- _Purpose:_ store user-created teams (equipped teams / saved team configurations). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - `name` VARCHAR(255) NULLABLE + - `banner` INT NULLABLE + - `team_level` INT DEFAULT 0 + - `class` VARCHAR(128) NULLABLE + - `troops` JSONB DEFAULT '[]'::jsonb -- ordered list of troop ids or objects + - `override_data` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - INDEX `idx_user_teams_user` ON `user_teams` (`user_id`) + - If teams are referenced frequently by namecode, consider index on (`user_id`, `name`) unique per user. + +- _Notes:_ + - Consider a separate table for Team Saves / public sharing (see `team_saves`). + - For fast game lookups, keep troop ids as integer arrays if simpler (INT[]), but `JSONB` allows flexible metadata. + +--- + +### 3.7 guilds + +- _Purpose:_ guild metadata and per-guild settings/feature flags. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `discord_guild_id` VARCHAR(64) UNIQUE NULLABLE + - `name` VARCHAR(255) NOT NULL + - `description` TEXT NULLABLE + - `settings` JSONB DEFAULT '{}'::jsonb + - `feature_flags` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`discord_guild_id`) -- if provided + - INDEX `idx_guilds_name` ON `guilds` (`name`) + +- _Notes:_ + - `Feature_flags` is optional per-guild override; global flags live in `feature_flags` table. + - Keep settings small and validated at application level. + +--- + +### 3.8 guild_members + +- _Purpose:_ mapping between guilds and users. 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `guild_id` UUID NOT NULL REFERENCES guilds(id) ON DELETE CASCADE + - `user_id` UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - `discord_user_id` VARCHAR(64) NULLABLE + - `joined_at` TIMESTAMPTZ NULLABLE + - `is_owner` BOOLEAN DEFAULT false + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`guild_id`, `user_id`) + - INDEX `idx_guild_members_guild` ON `guild_members` (`guild_id`) + +- _Notes:_ + - Keep membership events in `audit_logs` if join/leave is important historically. + +--- + +### 3.9 feature_flags + +- _Purpose:_ global feature toggles. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `name` VARCHAR(128) UNIQUE NOT NULL + - `enabled` BOOLEAN DEFAULT false + - `rollout_percentage` INT DEFAULT 0 + - `data` JSONB DEFAULT '{}'::jsonb -- additional configuration + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`name`) + - INDEX `idx_feature_flags_enabled` ON `feature_flags` (enabled) + +- _Notes:_ + - Evaluate caching flags in Redis with short `TTL` for performance. + - Use deterministic hashing to apply `rollout_percentage` by user id or namecode. + +--- + +### 3.10 user_profile_summary (denormalized read table) + +- _Purpose:_ fast-read, denormalized summary consumed by bot/UI (one row per user). + +- _Columns:_ + - `user_id` UUID PRIMARY KEY REFERENCES users(id) + - `namecode` VARCHAR(64) + - `username` VARCHAR(255) + - `level` INTEGER + - `top_troops` JSONB DEFAULT '[]'::jsonb -- e.g. [ {troop_id, amount, level} ] + - `equipped_pet` JSONB NULLABLE + - `pvp_tier` INTEGER NULLABLE + - `guild_id` UUID NULLABLE + - `last_seen` TIMESTAMPTZ NULLABLE + - `cached_at` TIMESTAMPTZ DEFAULT now() -- when summary was generated + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - PRIMARY KEY(`user_id`) + - INDEX `idx_profile_summary_namecode` ON `user_profile_summary` (`namecode`) + - Consider partial index for active players: WHERE `cached_at > now() - interval '7 days'` + +- _Notes:_ + - `ETL` writes/updates this table; reads should be served from it to meet latency `SLAs`. + - Keep `extra` for unmapped fields so reprocessing can add new summary fields without losing data. + +--- + +### 3.11 etl_errors + +- _Purpose:_ record per-snapshot or per-entity `ETL` errors for triage. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `snapshot_id` UUID REFERENCES hero_snapshots(id) ON DELETE SET NULL + - `error_type` VARCHAR(128) NOT NULL -- e.g. PARSE_ERROR, FK_VIOLATION + - `message` TEXT NOT NULL -- short sanitized message + - `details` JSONB NULLABLE -- structured diagnostic (truncated snippets, stack) + - `created_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - INDEX `idx_etl_errors_snapshot_id_created_at` ON `etl_errors` (`snapshot_id`, `created_at` DESC) + +- _Notes:_ + - Do not store full raw snapshot in details; reference `snapshot_id` for investigation. + - Create alerts if error rates exceed threshold. + +--- + +### 3.12 catalogs (troop_catalog, pet_catalog, artifact_catalog) + +- _Purpose:_ canonical static metadata seeded from official data sources or community-maintained sources. 
+ +- _Example: `troop_catalog` columns:_ + - `id` INT PRIMARY KEY + - `kid` VARCHAR(64) NULLABLE + - `name` VARCHAR(255) NOT NULL + - `description` TEXT NULLABLE + - `rarity` INT + - `meta` JSONB DEFAULT '{}'::jsonb -- game-specific attributes (costs, traits) + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE on external identifiers (`kid`) + - INDEX on `name` for search + +- _Notes:_ + - Seed catalogs via idempotent seed scripts; `ETL` should reference catalog ids when available. + - If `ETL` finds unmapped ids, create placeholder rows with `meta -> { "placeholder": true }` to avoid FK failures. + +--- + +### 3.13 queue_jobs (job persistence / fallback) + +- _Purpose:_ persist background jobs (optional DB-backed queue) as a durable fallback or for job inspection. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `type` VARCHAR(128) NOT NULL -- 'process_snapshot', 'archive_snapshot', ... + - `payload` JSONB NOT NULL -- minimal payload (snapshot_id, options) + - `priority` INT DEFAULT 0 + - `attempts` INT DEFAULT 0 + - `max_attempts` INT DEFAULT 5 + - `status` VARCHAR(32) DEFAULT 'pending' -- pending|running|failed|done + - `run_after` TIMESTAMPTZ DEFAULT now() + - `last_error` TEXT NULLABLE + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - INDEX `idx_queue_jobs_status_run_after` ON `queue_jobs` (`status`, `run_after`) + - Use a small table for visibility; production queue should remain `Redis/BullMQ` for throughput. + +- _Notes:_ + - Use as a fallback for job durability and administrative requeueing. + +--- + +### 3.14 audit_logs + +- _Purpose:_ capture important admin/user actions for audit and compliance. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `action` VARCHAR(255) NOT NULL -- e.g. 'migration_run', 'snapshot_reprocessed' + - `user_id` UUID NULLABLE + - `guild_id` UUID NULLABLE + - `metadata` JSONB DEFAULT '{}'::jsonb -- action details, sanitized + - `ip_address` VARCHAR(45) NULLABLE + - `user_agent` VARCHAR(512) NULLABLE + - `created_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_audit_logs_action_created_at` ON `audit_logs` (`action`, `created_at` DESC) + +- _Notes:_ + - Do not log secrets or raw snapshots here. Keep metadata minimal and traceable. + +--- + +### 3.15 raw_profiles (alternate simplified snapshot store) + +- _Purpose:_ an optional simplified table for ingesting raw profiles before enriched processing (lighter than + `hero_snapshots`) + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_external_id` VARCHAR(255) NULLABLE -- upstream player id if available + - `source` VARCHAR(64) NOT NULL + - `captured_at` TIMESTAMPTZ DEFAULT now() + - `raw` JSONB NOT NULL + +- _Indexes:_ + - INDEX `idx_raw_profiles_hero_external_id` ON `raw_profiles` (`hero_external_id`) + +- _Notes:_ + - This table is optional if `hero_snapshots` is sufficient. Use if you want a short ingest path decoupled from + `ETL`. + +--- + +### 3.16 heroes (canonical hero mapping — expanded) + +- _Purpose:_ normalized representation of a player/hero aggregated across snapshots. 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `external_id` VARCHAR(255) UNIQUE NULLABLE -- stable upstream id (if any) + - `name` VARCHAR(255) NULLABLE + - `namecode` VARCHAR(64) UNIQUE NULLABLE + - `name_lower` VARCHAR(255) NULLABLE -- for case-insensitive search + - `username` VARCHAR(255) NULLABLE + - `level` INT DEFAULT 0 + - `race` INT NULLABLE + - `gender` INT NULLABLE + - `class` VARCHAR(128) NULLABLE + - `portrait_id` INT NULLABLE + - `title_id` INT NULLABLE + - `flair_id` INT NULLABLE + - `honor_rank` INT NULLABLE + - `equipped_weapon_id` INT NULLABLE + - `equipped_pet_id` INT NULLABLE + - `guild_id` UUID NULLABLE REFERENCES guilds(id) + - `guild_external_id` VARCHAR(64) NULLABLE + - `guild_name` VARCHAR(255) NULLABLE + - `guild_rank` INT NULLABLE + - `server_time` TIMESTAMPTZ NULLABLE + - `last_login` TIMESTAMPTZ NULLABLE + - `last_played` TIMESTAMPTZ NULLABLE + - `summary` JSONB NULLABLE -- snapshot-derived denormalized blob for debugging + - `extras` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`external_id`) if upstream provides stable id + - INDEX `idx_heroes_namecode` ON `heroes` (`namecode`) + - INDEX `idx_heroes_guild_id` ON `heroes` (`guild_id`) + +- _Notes:_ + - `heroes` can be the canonical mapping that `ETL` updates; `user_profile_summary` is the read-optimized table + consumed by UI/bot. + - Keep `summary` for debugging / ad-hoc fallbacks. + +--- + +### 3.17 hero_runes + +- _Purpose:_ store runes or rune-sets for a hero. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `runes` JSONB NOT NULL + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_hero_runes_hero` ON `hero_runes` (`hero_id`) + +- _Notes:_ + - Could be multiple rows per hero if the upstream uses multiple rune slots/versioning. + +--- + +### 3.18 hero_troops + +- _Purpose:_ store normalized troop inventory per hero (detailed). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `troop_id` INT NOT NULL + - `amount` INT DEFAULT 0 + - `level` INT DEFAULT 1 + - `current_rarity` INT DEFAULT 0 + - `fusion_cards` INT DEFAULT 0 + - `orb_fusion_cards` INT DEFAULT 0 + - `traits_owned` INT DEFAULT 0 + - `invasions` INT DEFAULT 0 + - `shiny_level_progress` INT DEFAULT 0 + - `orbs_used` JSONB DEFAULT '[]'::jsonb + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`hero_id`, `troop_id`) + - INDEX `idx_hero_troops_troop` ON `hero_troops` (`troop_id`) + - INDEX `idx_hero_troops_hero` ON `hero_troops` (`hero_id`) + +- _Notes:_ + - Use batch upserts with ON CONFLICT to update amounts/levels atomically. + +--- + +### 3.19 hero_pets + +- _Purpose:_ store normalized pets per hero. 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `pet_id` INT NOT NULL + - `amount` INT DEFAULT 0 + - `level` INT DEFAULT 1 + - `xp` BIGINT DEFAULT 0 + - `orb_fusion_cards` INT DEFAULT 0 + - `orbs_used` JSONB DEFAULT '[]'::jsonb + - `ascension_level` INT DEFAULT 0 + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`hero_id`, `pet_id`) + - INDEX `idx_hero_pets_pet` ON `hero_pets` (`pet_id`) + +--- + +### 3.20 hero_artifacts + +- _Purpose:_ normalized artifacts per hero. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `artifact_id` INT NOT NULL + - `xp` BIGINT DEFAULT 0 + - `level` INT DEFAULT 0 + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`hero_id`, `artifact_id`) + +--- + +### 3.21 teams + +- _Purpose:_ normalized teams (active configurations) for heroes. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `name` VARCHAR(255) NULLABLE + - `banner` INT NULLABLE + - `team_level` INT DEFAULT 0 + - `class` VARCHAR(128) NULLABLE + - `override_data` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_teams_hero` ON `teams` (`hero_id`) + +--- + +### 3.22 team_troops + +- _Purpose:_ mapping table for troops inside a team (ordered positions). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `team_id` UUID NOT NULL REFERENCES teams(id) ON DELETE CASCADE + - `position` INT NOT NULL -- position index in team + - `troop_id` INT NOT NULL + +- _Indexes:_ + - INDEX `idx_team_troops_team` ON `team_troops` (`team_id`, `position`) + +- _Notes:_ + - Keep position contiguous and deterministic so presentation layers can render quickly. + +--- + +### 3.23 team_saves + +- _Purpose:_ user-shared/persisted team configurations (public or private). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `name` VARCHAR(255) + - `description` TEXT + - `data` JSONB NOT NULL -- full team data for replay + - `is_public` BOOLEAN DEFAULT false + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_team_saves_hero` ON `team_saves` (`hero_id`) + - INDEX `idx_team_saves_public` ON `team_saves` (`is_public`) + +--- + +### 3.24 team_comments + +- _Purpose:_ comments on saved teams (community interactions). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `team_save_id` UUID NOT NULL REFERENCES team_saves(id) ON DELETE CASCADE + - `author_user_id` UUID NOT NULL REFERENCES app_users(id) -- note: app_users used for bot operators / forum users + - `guild_id` UUID NULLABLE REFERENCES guilds(id) + - `comment` TEXT NOT NULL + - `created_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_team_comments_team_save` ON `team_comments` (`team_save_id`) + +--- + +### 3.25 hero_kingdom_progress + +- _Purpose:_ track hero progress in per-kingdom systems. 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `kingdom_id` INT NOT NULL + - `status` INT DEFAULT 0 + - `income` INT DEFAULT 0 + - `challenge_tier` INT DEFAULT 0 + - `invasions` INT DEFAULT 0 + - `power_rank` INT DEFAULT 0 + - `tasks` JSONB DEFAULT '[]'::jsonb + - `explore` JSONB DEFAULT '{}'::jsonb + - `trials_team` JSONB DEFAULT '{}'::jsonb + - `extra` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`hero_id`, `kingdom_id`) + - INDEX `idx_hero_kingdom_progress_hero` ON `hero_kingdom_progress` (`hero_id`) + +--- + +### 3.26 hero_pvp_regions + +- _Purpose:_ store per-hero PvP region stats and most-used teams. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE + - `region_id` INT NOT NULL + - `team` JSONB DEFAULT '{}'::jsonb + - `stats` JSONB DEFAULT '{}'::jsonb + - `most_used_troop` JSONB DEFAULT '{}'::jsonb + - `extras` JSONB DEFAULT '{}'::jsonb + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints & Indexes:_ + - UNIQUE(`hero_id`, `region_id`) + - INDEX `idx_hero_pvp_regions_hero` ON `hero_pvp_regions` (`hero_id`) + +--- + +### 3.27 hero_pvp_stats + +- _Purpose:_ aggregate PvP metrics per hero. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) ON DELETE CASCADE UNIQUE + - `invades_won` INT DEFAULT 0 + - `invades_lost` INT DEFAULT 0 + - `defends_won` INT DEFAULT 0 + - `defends_lost` INT DEFAULT 0 + - `most_invaded_kingdom` JSONB NULLABLE + - `most_used_troop` JSONB NULLABLE + - `raw` JSONB NULLABLE -- keep raw payload if needed for debug + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Notes:_ + - This table can be updated incrementally as `ETL` processes snapshots. + +--- + +### 3.28 hero_progress_weapons + +- _Purpose:_ weapon progression state per hero. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) + - `weapon_data` JSONB NOT NULL + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_hero_progress_weapons_hero` ON `hero_progress_weapons` (`hero_id`) + +--- + +### 3.29 hero_class_data + +- _Purpose:_ per-hero class-related data (per-class specialization or progress). + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) + - `class_name` VARCHAR(128) NOT NULL + - `data` JSONB NOT NULL + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Constraints:_ + - UNIQUE(`hero_id`, `class_name`) + +--- + +### 3.30 hero_meta_json + +- _Purpose:_ key-value JSON storage for misc hero metadata. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `hero_id` UUID NOT NULL REFERENCES heroes(id) + - `key` VARCHAR(255) NOT NULL + - `value` JSONB + - `created_at` TIMESTAMPTZ DEFAULT now() + - `updated_at` TIMESTAMPTZ DEFAULT now() + +- _Indexes:_ + - INDEX `idx_hero_meta_hero_key` ON `hero_meta_json` (`hero_id`, `key`) + +--- + +### 3.31 sheets_sync_logs + +- _Purpose:_ logs for synchronizing data to external sheets (optional feature). 
+ +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `guild_id` UUID NOT NULL REFERENCES guilds(id) + - `sheet_id` VARCHAR(255) + - `range` VARCHAR(128) + - `rows_sent` INT DEFAULT 0 + - `status` VARCHAR(64) -- success|failed|pending + - `error` JSONB NULLABLE + - `started_at` TIMESTAMPTZ NULLABLE + - `finished_at` TIMESTAMPTZ NULLABLE + +- _Indexes:_ + - INDEX `idx_sheets_sync_logs_guild` ON `sheets_sync_logs` (`guild_id`) + +--- + +### 3.32 cache_invalidation + +- _Purpose:_ single-row records used to coordinate cache invalidation across services. + +- _Columns:_ + - `id` UUID PRIMARY KEY DEFAULT gen_random_uuid() + - `key` VARCHAR(255) NOT NULL -- cache key / namespace + - `invalidated_at` TIMESTAMPTZ DEFAULT now() + +- _Notes:_ + - Simple table for cross-process invalidation if pub/sub not available. + +--- + +### 3.33 troops_master_light (materialized view) + +- _Purpose:_ lightweight read-only snapshot of troops catalogs for analytics and quick lookups (materialized view). + +- _Columns (materialized):_ + - `id` INT PRIMARY KEY + - `kid` VARCHAR + - `name` VARCHAR + - `kingdom_id` INT + - `kingdom_name` VARCHAR + - `rarity` VARCHAR + - `rarity_id` INT + - `max_attack` INT + - `max_armor` INT + - `max_life` INT + - `max_magic` INT + - `spell_id` INT + - `shiny_spell_id` INT + - `colors` INT + - `arcanes` VARCHAR + - `image_url` VARCHAR + - `type` VARCHAR + - `troop_role1` VARCHAR + - `description` TEXT + +- _Notes:_ + - Refresh schedule depends on how often catalog changes; use CONCURRENTLY where supported. + +--- + +## Foreign key / relationship summary (excerpt) + +- `role_permissions.role_id` -> `roles.id` +- `role_permissions.permission_id` -> `permissions.id` +- `guild_members.guild_id` -> `guilds.id` +- `guild_members.user_id` -> `app_users.id` +- `guild_roles.guild_id` -> `guilds.id` +- `guild_roles.role_id` -> `roles.id` +- `guild_member_roles.guild_member_id` -> `guild_members.id` +- `guild_member_roles.guild_role_id` -> `guild_roles.id` +- `guild_feature_flags.guild_id` -> `guilds.id` +- `guild_feature_flags.feature_flag_id` -> `feature_flags.id` +- `troops.kingdom_id` -> `kingdoms.id` +- `classes.kingdom_id` -> `kingdoms.id` +- `troops.spell_id` -> `spells.id` +- `game_events.troop_id` -> `troops.id` +- `raw_profiles.hero_external_id` -> `heroes.external_id` +- `heroes.guild_id` -> `guilds.id` +- `hero_* tables` (*.hero_id) -> `heroes.id` +- `teams.hero_id` -> `heroes.id` +- `team_troops.team_id` -> `teams.id` +- `team_comments.team_save_id` -> `team_saves.id` +- `team_comments.author_user_id` -> `app_users.id` +- `sheets_sync_logs.guild_id` -> `guilds.id` + +(See DBML snippet for the full list and diagram rendering.) 
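+
+Before enforcing (or re-adding) one of these constraints on an existing dataset, a pre-flight query of the following
+shape can surface orphaned rows; `hero_troops.hero_id -> heroes.id` is used here only as an example, and the same
+pattern applies to the other `hero_*` tables.
+
+```sql
+-- Rows in hero_troops whose hero_id no longer resolves to a heroes row (expect zero rows before adding the FK).
+SELECT ht.id, ht.hero_id
+FROM hero_troops ht
+         LEFT JOIN heroes h ON h.id = ht.hero_id
+WHERE h.id IS NULL
+ORDER BY ht.hero_id
+LIMIT 100;
+```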
+ +--- + +## Examples DDL snippets (pattern / best-practice) + +- `hero_snapshots` insert with hash (`Postgres` example using `pgcrypto`): + +```sql +INSERT INTO hero_snapshots (id, user_id, namecode, source, raw, size_bytes, content_hash, created_at) +VALUES (gen_random_uuid(), + NULL, + ($1):: varchar, + ($2):: varchar, + $3::jsonb, + octet_length($3::text), + encode(digest($3::text, 'sha256'), 'hex'), + now()); +``` + +- Atomic claim to process snapshot: + +```sql +UPDATE hero_snapshots +SET processing = true, + processing_started_at = now() +WHERE id = $1 + AND (processing = false OR processing IS NULL) RETURNING id; +``` + +- `GIN` index creation (CONCURRENTLY for no lock where supported): + +```sql +CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_hero_snapshots_raw_gin + ON hero_snapshots USING GIN (raw jsonb_path_ops); +``` + +- `user_troops` upsert pattern: + +```sql +INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, updated_at) +VALUES ($1, $2, $3, $4, $5, $6, $7, $8::jsonb, now()) ON CONFLICT (user_id, troop_id) DO +UPDATE + SET amount = EXCLUDED.amount, + level = EXCLUDED.level, + rarity = EXCLUDED.rarity, + fusion_cards = EXCLUDED.fusion_cards, + traits_owned = EXCLUDED.traits_owned, + extra = COALESCE (user_troops.extra, '{}'::jsonb) || EXCLUDED.extra, + updated_at = now(); +``` + +--- + +## Operational recommendations & next steps + +1. **Migrations** + - Implement these schemas through small, tested `node-pg-migrate` migrations. + - Separate extension creation into its own `migration (CREATE EXTENSION IF NOT EXISTS pgcrypto`) with clear fallback + instructions. + +2. **Indexes** + - Create `GIN` indexes for `JSONB` fields used for ad-hoc queries, but measure write overhead. + - Use CONCURRENTLY for large indexes in production to avoid long locks. + +3. **Partitioning & retention** + - Partition `hero_snapshots` by month (range) when dataset grows, to speed deletion/archive jobs. + - Implement archival job: compress `raw JSON` -> upload to `S3` -> insert into `snapshot_archives` -> delete + partition rows (or mark archived). + +4. **Seeds & catalogs** + - Provide idempotent seeds for `troop_catalog` / `pet_catalog` / `spells` and keep them under database/seeds/. + - `ETL` should create placeholders for missing catalog items and mark them for analyst review. + +5. **Backfill & ETL** + - Use the same `ETL` code for new snapshots and backfills to guarantee mapping parity. + - Record backfill job progress in `queue_jobs` or a `backfill_audit` table for resume capability. + +6. **Tests & validation** + - Add integration tests that run migrations, insert sample snapshots (docs/examples), run `ETL` and verify + normalized tables and `user_profile_summary`. + +--- + +## 4. 
Indexes & query patterns + +### 4.1 Recommended indexes (examples) + +- **hero_snapshots**: + - **GIN**: `CREATE INDEX idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops);` + - **B-tree**: `CREATE INDEX idx_hero_snapshots_user_created_at ON hero_snapshots (user_id, created_at DESC);` + - **Optional uniqueness**: + `CREATE UNIQUE INDEX ux_hero_snapshots_contenthash_source ON hero_snapshots (content_hash, source)` (use with + caution; dedupe logic may be more flexible in app) +- **users**: + - `CREATE UNIQUE INDEX ux_users_namecode ON users (namecode);` + - `CREATE INDEX idx_users_discord_user_id ON users (discord_user_id);` +- **user_troops**: + - `CREATE UNIQUE INDEX ux_user_troops_user_troop ON user_troops (user_id, troop_id);` + - `CREATE INDEX idx_user_troops_troop ON user_troops (troop_id);` +- **user_profile_summary**: + - `CREATE INDEX idx_profile_summary_namecode ON user_profile_summary (namecode);` +- **etl_errors**: + - `CREATE INDEX idx_etl_errors_snapshot_id_created_at ON etl_errors (snapshot_id, created_at DESC);` + +### 4.2 Query patterns (examples) + +- Latest processed snapshot for user: + +```sql +SELECT * +FROM hero_snapshots +WHERE user_id = $1 + AND processed_at IS NOT NULL +ORDER BY created_at DESC LIMIT 1; +``` + +- Profile summary by namecode: + +```sql +SELECT * +FROM user_profile_summary +WHERE namecode = $1; +``` + +- Who owns troop 6024: + +```sql +SELECT u.id, u.namecode, ut.amount, ut.level +FROM user_troops ut + JOIN users u ON ut.user_id = u.id +WHERE ut.troop_id = 6024 + AND ut.amount > 0 +ORDER BY ut.amount DESC LIMIT 100; +``` + +--- + +## 5. Partitioning, retention & archival guidance + +### 5.1 Partition hero_snapshots when volume grows + +- _Strategy:_ time-based partitions (monthly) on created_at: + - Create parent table `hero_snapshots` and child partitions `hero_snapshots_2025_11` etc. +- _Benefits:_ faster archival/deletion, reduced vacuum overhead, improved index size on recent partitions. + +### 5.2 Retention policy & archival + +- Default retention example: keep 90 days in DB, archive older snapshots to `S3` (compressed) with audit metadata. +- Steps for archival: + 1. SELECT snapshots older than retention threshold and not flagged permanent. + 2. Compress raw `JSON` (gzip/zstd), upload to `S3` with server-side encryption. + 3. Insert archival metadata (`s3_path`, `checksum`, `archived_at`) into `snapshot_archives` table and then DELETE or + mark as archived. +- Keep at least N recent snapshots per user (configurable) if required. + +--- + +## 6. Upsert examples and idempotency patterns + +### 6.1 user_troops upsert (example) + +- Use single multi-row upsert for batch inserts/updates: + +```sql +-- example uses tmp table or values list +INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, updated_at) +VALUES ($1, $2, $3, $4, $5, $6, $7, $8, now()), ... + ON CONFLICT (user_id, troop_id) DO +UPDATE + SET amount = EXCLUDED.amount, + level = EXCLUDED.level, + rarity = EXCLUDED.rarity, + fusion_cards = EXCLUDED.fusion_cards, + traits_owned = EXCLUDED.traits_owned, + extra = COALESCE (user_troops.extra, '{}'::jsonb) || EXCLUDED.extra, + updated_at = now(); +``` + +- Keep transactions per entity group (users, troops, pets) to avoid holding long transactions. 
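+
+- A minimal sketch of that per-entity-group scoping (placeholders `$1` through `$7` stand for values the worker binds;
+  each group runs in its own short transaction and is committed before the next group starts):
+
+```sql
+BEGIN;
+-- troops group only
+INSERT INTO user_troops (user_id, troop_id, amount, updated_at)
+VALUES ($1, $2, $3, now()),
+       ($1, $4, $5, now())
+ON CONFLICT (user_id, troop_id) DO UPDATE
+    SET amount     = EXCLUDED.amount,
+        updated_at = now();
+COMMIT;
+
+BEGIN;
+-- pets group in a separate, equally short transaction
+INSERT INTO user_pets (user_id, pet_id, amount, updated_at)
+VALUES ($1, $6, $7, now())
+ON CONFLICT (user_id, pet_id) DO UPDATE
+    SET amount     = EXCLUDED.amount,
+        updated_at = now();
+COMMIT;
+```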
+ +### 6.2 users upsert example + +```sql +INSERT INTO users (id, namecode, username, discord_user_id, email, created_at, updated_at) +VALUES (gen_random_uuid(), $namecode, $username, $discord_id, $email, now(), now()) ON CONFLICT (namecode) DO +UPDATE + SET username = EXCLUDED.username, + discord_user_id = COALESCE (users.discord_user_id, EXCLUDED.discord_user_id), + email = COALESCE (users.email, EXCLUDED.email), + updated_at = now(); +``` + +- If generating `UUIDs` client-side, ensure deterministic mapping or idempotency key in insertion logic. + +### 6.3 Mark snapshot processed + +- After successful processing: + +```sql +UPDATE hero_snapshots +SET processed_at = now(), + processing = false, + last_error = NULL +WHERE id = $snapshot_id; +``` + +- Use atomic claim to set `processing=true`: + +```sql +UPDATE hero_snapshots +SET processing = true, + processing_started_at = now() +WHERE id = $snapshot_id + AND (processing = false OR processing IS NULL) RETURNING id; +``` + +--- + +## 7. Migration strategy & preflight checks + +### 7.1 Tooling + +- Use `node-pg-migrate` (`JS`) for migrations; store migrations under `database/migrations/` and use timestamped + filenames. +- Keep migrations small and reversible where possible. + +### 7.2 Preflight checks (CI job) + +- Verify required extensions & permissions (e.g., `CREATE EXTENSION IF NOT EXISTS pgcrypto;`). +- Run migrations in a disposable ephemeral DB (testcontainers) and ensure `up` completes. +- Run a smoke `ETL`: insert a sample snapshot, run the worker code path, assert summary created. + +### 7.3 Safe migration pattern + +- 3-step for destructive change: + 1. Add nullable column / new table / index (no locks if possible: `CREATE INDEX CONCURRENTLY`). + 2. Backfill and verify via a background job. + 3. Make column NOT NULL and remove old column in a later migration. + +### 7.4 Production bootstrap + +- Production migrations and bootstrap should run via a manual `GitHub Action` (`db-bootstrap.yml`) protected by GitHub + Environments and approvers. +- Always take a DB snapshot/backup before applying production migrations. + +### 7.5 Extension fallback + +- If `pgcrypto` cannot be enabled by provider, fallback to generating `UUIDs` in application code or use + `uuid_generate_v4()` only if available. + +--- + +## 8. Data retention & archival model (summary) + +- Default retention: 90 days for `hero_snapshots` in DB (configurable). +- Keep last N snapshots per user (e.g., last 30) in addition to time window. +- Archive older snapshots to `S3` (gzip/zstd) with checksum and audit metadata in `snapshot_archives` table. +- Provide admin endpoints and runbooks for reprocessing archived snapshots (download from `S3`, reinsert, reprocess). + +--- + +## 9. Validation queries & health checks + +- Basic health: + +```sql +SELECT 1; +``` + +- Sanity check after migrations: + +```sql +SELECT count(*) +FROM information_schema.tables +WHERE table_schema = 'public' + AND table_name IN ('users', 'hero_snapshots', 'user_troops'); +``` + +- Sample `ETL` verification (for a given namecode): + +```sql +-- latest snapshot exists and processed +SELECT hs.id, hs.processed_at, ups.user_id, ups.namecode +FROM hero_snapshots hs + LEFT JOIN users ups ON hs.user_id = ups.id +WHERE hs.namecode = $1 +ORDER BY hs.created_at DESC LIMIT 1; +``` + +- Idempotency test: + - Insert snapshot A, process, record counts. + - Re-insert same snapshot (or reprocess) and assert no duplicate rows and counts unchanged. + +--- + +## 10. 
DBML schema (paste into dbdiagram.io) + +The `DBML` representation below is provided for visualization. Paste into https://dbdiagram.io to generate diagrams. The +code here is the cleaned `DBML` derived from the PRD; it mirrors the canonical tables and relationships described above. + +```sql +// DBML representation of the StarForge schema (cleaned). +// Paste this into https://dbdiagram.io to visualize the schema. +// jsonb columns are represented as "json" for visualization purposes. + +Table app_users { + id uuid [pk] + email varchar [unique] + username varchar + display_name varchar + avatar_url varchar + created_at timestamp + updated_at timestamp +} + +Table roles { + id uuid [pk] + name varchar [unique, not null] + description text + created_at timestamp + updated_at timestamp +} + +Table permissions { + id uuid [pk] + code varchar [unique, not null] + description text + created_at timestamp + updated_at timestamp +} + +Table role_permissions { + role_id uuid + permission_id uuid + // composite PK exists in SQL +} + +Table guilds { + id uuid [pk] + external_id varchar [unique] + name varchar + description text + settings json + created_at timestamp + updated_at timestamp +} + +Table guild_members { + id uuid [pk] + guild_id uuid + user_id uuid + game_player_id varchar + joined_at timestamp + is_owner boolean + created_at timestamp + updated_at timestamp + // unique (guild_id, user_id) +} + +Table guild_roles { + id uuid [pk] + guild_id uuid + role_id uuid + name varchar + permission_overrides json + created_at timestamp + updated_at timestamp +} + +Table guild_member_roles { + id uuid [pk] + guild_member_id uuid + guild_role_id uuid + created_at timestamp +} + +Table audit_logs { + id uuid [pk] + action varchar [not null] + user_id uuid + guild_id uuid + metadata json + ip_address varchar + user_agent varchar + created_at timestamp +} + +Table feature_flags { + id uuid [pk] + name varchar [unique, not null] + enabled boolean + rollout_percentage int + created_at timestamp + updated_at timestamp +} + +Table guild_feature_flags { + id uuid [pk] + guild_id uuid + feature_flag_id uuid + enabled boolean +} + +Table troops { + id int [pk] + kid varchar + name varchar + description text + colors int + arcanes varchar + image_url varchar + kingdom_id int + kingdom_name varchar + max_attack int + max_armor int + max_life int + max_magic int + rarity varchar + rarity_id int + spell_id int + shiny_spell_id int + spell_name varchar + shiny_spell_name varchar + spell_cost int + release_date bigint + switch_date bigint + troop_role1 varchar + type varchar + type_code1 varchar + type_code2 varchar + traits json + extras json + created_at timestamp + updated_at timestamp +} + +Table kingdoms { + id int [pk] + kid varchar + name varchar + map_index varchar + byline varchar + description text + banner_name varchar + banner_image_url varchar + bg_image_url varchar + banner_mana varchar + banner_mana_bits int + release_date bigint + switch_date bigint + tribute_glory int + tribute_gold int + tribute_souls int + explore_traitstone_id int + explore_traitstone_colors int + explore_traitstone_color_names varchar + level_mana_color int + level_stat varchar + quest text + map_position varchar + type varchar + troops json + bonuses json + created_at timestamp + updated_at timestamp +} + +Table classes { + id int [pk] + class_code varchar + name varchar + kingdom_id int + kingdom_name varchar + image_url varchar + page_url varchar + rarity varchar + max_armor int + max_attack int + max_life int + max_magic int 
+ spell_id int + weapon_id int + weapon_name varchar + talent_codes json + talent_list json + traits json + extras json + created_at timestamp + updated_at timestamp +} + +Table talents { + code varchar [pk] + name varchar + talent1 varchar + talent1_desc text + talent2 varchar + talent2_desc text + talent3 varchar + talent3_desc text + talent4 varchar + talent4_desc text + talent5 varchar + talent5_desc text + talent6 varchar + talent6_desc text + talent7 varchar + talent7_desc text + extras json + created_at timestamp + updated_at timestamp +} + +Table medals { + id int [pk] + name varchar + description text + data varchar + effect varchar + level int + rarity varchar + is_event_medal boolean + image_url varchar + evolves_into int + group_id int + extras json + created_at timestamp + updated_at timestamp +} + +Table pets { + id int [pk] + name varchar + kingdom_id int + kingdom_name varchar + mana_color varchar + mana_color_num int + image_url varchar + effect varchar + effect_data varchar + effect_title varchar + event varchar + release_date bigint + switch_date bigint + extras json + created_at timestamp + updated_at timestamp +} + +Table spells { + id int [pk] + name varchar + description text + cost int + image_url varchar + extras json + created_at timestamp + updated_at timestamp +} + +Table aspects { + code varchar [pk] + name varchar + description text + image_url varchar + extras json + created_at timestamp + updated_at timestamp +} + +Table traitstones { + id int [pk] + name varchar + colors int + image_url varchar + extras json + created_at timestamp + updated_at timestamp +} + +Table weapons { + id int [pk] + name varchar + image_url varchar + kingdom_id int + kingdom_name varchar + spell_id int + spell_name varchar + spell_cost int + affixes varchar + colors int + mastery_requirement int + rarity varchar + rarity_id int + obtain_by varchar + weapon_role1 varchar + weapon_upgrade varchar + release_date bigint + switch_date bigint + extras json + created_at timestamp + updated_at timestamp +} + +Table game_events { + id int [pk] + event_id int + troop_id int + kingdom_id int + start_date bigint + end_date bigint + metadata json + created_at timestamp + updated_at timestamp +} + +Table raw_profiles { + id uuid [pk] + hero_external_id varchar + source varchar + captured_at timestamp + raw json [not null] +} + +Table heroes { + id uuid [pk] + external_id varchar [unique] + name varchar + namecode varchar + name_lower varchar + username varchar + level int + level_new int + race int + race_alt int + gender int + class varchar + portrait_id int + title_id int + flair_id int + honor_rank int + equipped_weapon_id int + equipped_pet_id int + guild_id uuid + guild_external_id varchar + guild_name varchar + guild_rank int + server_time timestamp + last_login timestamp + last_played timestamp + created_at timestamp + updated_at timestamp + summary json + extras json +} + +Table hero_runes { + id uuid [pk] + hero_id uuid + runes json [not null] + created_at timestamp + updated_at timestamp +} + +Table hero_troops { + id uuid [pk] + hero_id uuid + troop_id int + amount int + level int + current_rarity int + fusion_cards int + orb_fusion_cards int + traits_owned int + invasions int + shiny_level_progress int + orbs_used json + extra json + created_at timestamp + updated_at timestamp + // unique (hero_id, troop_id) +} + +Table hero_pets { + id uuid [pk] + hero_id uuid + pet_id int + amount int + level int + xp bigint + orb_fusion_cards int + orbs_used json + ascension_level int + extra json 
+ created_at timestamp + updated_at timestamp + // unique (hero_id, pet_id) +} + +Table hero_artifacts { + id uuid [pk] + hero_id uuid + artifact_id int + xp bigint + level int + extra json + created_at timestamp + updated_at timestamp + // unique (hero_id, artifact_id) +} + +Table teams { + id uuid [pk] + hero_id uuid + name varchar + banner int + team_level int + class varchar + override_data json + created_at timestamp + updated_at timestamp +} + +Table team_troops { + id uuid [pk] + team_id uuid + position int + troop_id int +} + +Table team_saves { + id uuid [pk] + hero_id uuid + name varchar + description text + data json [not null] + is_public boolean + created_at timestamp + updated_at timestamp +} + +Table team_comments { + id uuid [pk] + team_save_id uuid + author_user_id uuid + guild_id uuid + comment text + created_at timestamp +} + +Table hero_kingdom_progress { + id uuid [pk] + hero_id uuid + kingdom_id int + status int + income int + challenge_tier int + invasions int + power_rank int + tasks json + explore json + trials_team json + extra json + created_at timestamp + updated_at timestamp + // unique (hero_id, kingdom_id) +} + +Table hero_pvp_regions { + id uuid [pk] + hero_id uuid + region_id int + team json + stats json + most_used_troop json + extras json + created_at timestamp + updated_at timestamp + // unique (hero_id, region_id) +} + +Table hero_pvp_stats { + id uuid [pk] + hero_id uuid + invades_won int + invades_lost int + defends_won int + defends_lost int + most_invaded_kingdom json + most_used_troop json + raw json + created_at timestamp + updated_at timestamp + // unique (hero_id) +} + +Table hero_progress_weapons { + id uuid [pk] + hero_id uuid + weapon_data json [not null] + created_at timestamp + updated_at timestamp +} + +Table hero_class_data { + id uuid [pk] + hero_id uuid + class_name varchar + data json [not null] + created_at timestamp + updated_at timestamp + // unique (hero_id, class_name) +} + +Table hero_meta_json { + id uuid [pk] + hero_id uuid + key varchar + value json + created_at timestamp + updated_at timestamp +} + +Table sheets_sync_logs { + id uuid [pk] + guild_id uuid + sheet_id varchar + range varchar + rows_sent int + status varchar + error json + started_at timestamp + finished_at timestamp +} + +Table queue_jobs { + id uuid [pk] + type varchar + payload json + priority int + attempts int + max_attempts int + status varchar + run_after timestamp + last_error text + created_at timestamp + updated_at timestamp +} + +Table cache_invalidation { + id uuid [pk] + key varchar + invalidated_at timestamp +} + +Table troops_master_light { + id int [pk] + kid varchar + name varchar + kingdom_id int + kingdom_name varchar + rarity varchar + rarity_id int + max_attack int + max_armor int + max_life int + max_magic int + spell_id int + shiny_spell_id int + colors int + arcanes varchar + image_url varchar + type varchar + troop_role1 varchar + description text +} + +// ========================= +// Relationships (Refs) +— deduplicated +// ========================= + +Ref: role_permissions.role_id > roles.id +Ref: role_permissions.permission_id > permissions.id + +Ref: guild_members.guild_id > guilds.id +Ref: guild_members.user_id > app_users.id + +Ref: guild_roles.guild_id > guilds.id +Ref: guild_roles.role_id > roles.id + +Ref: guild_member_roles.guild_member_id > guild_members.id +Ref: guild_member_roles.guild_role_id > guild_roles.id + +Ref: guild_feature_flags.guild_id > guilds.id +Ref: guild_feature_flags.feature_flag_id > feature_flags.id + 
+Ref: troops.kingdom_id > kingdoms.id +Ref: classes.kingdom_id > kingdoms.id +Ref: weapons.kingdom_id > kingdoms.id + +Ref: troops.spell_id > spells.id +Ref: troops.shiny_spell_id > spells.id + +Ref: game_events.troop_id > troops.id +Ref: game_events.kingdom_id > kingdoms.id + +Ref: raw_profiles.hero_external_id > heroes.external_id + +Ref: heroes.guild_id > guilds.id + +Ref: hero_runes.hero_id > heroes.id +Ref: hero_troops.hero_id > heroes.id +Ref: hero_pets.hero_id > heroes.id +Ref: hero_artifacts.hero_id > heroes.id + +Ref: teams.hero_id > heroes.id +Ref: team_troops.team_id > teams.id + +Ref: team_saves.hero_id > heroes.id +Ref: team_comments.team_save_id > team_saves.id +Ref: team_comments.author_user_id > app_users.id +Ref: team_comments.guild_id > guilds.id + +Ref: hero_kingdom_progress.hero_id > heroes.id +Ref: hero_pvp_regions.hero_id > heroes.id +Ref: hero_pvp_stats.hero_id > heroes.id +Ref: hero_progress_weapons.hero_id > heroes.id +Ref: hero_class_data.hero_id > heroes.id +Ref: hero_meta_json.hero_id > heroes.id + +Ref: sheets_sync_logs.guild_id > guilds.id +``` + +--- + +## 11. Change log & references + +- Change log for the DB model should reference migration IDs and PR numbers. Create `docs/CHANGELOG.md` entries when + migrations are merged. +- See also: + - PRD: docs/PRD.md (product-level requirements and mappings) + - ETL & Worker design: [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) + - Migration conventions: [docs/MIGRATIONS.md](./MIGRATIONS.md) + - Observability: [docs/OBSERVABILITY.md](./OBSERVABILITY.md) + +--- diff --git a/docs/ETL_AND_WORKER.md b/docs/ETL_AND_WORKER.md new file mode 100644 index 0000000..ebca99c --- /dev/null +++ b/docs/ETL_AND_WORKER.md @@ -0,0 +1,821 @@ +# ETL & Worker — StarForge + +> Version: 0.1 +> Date: 2025-12-02 + +--- + +## Table of contents + +1. [Executive summary](#1-executive-summary) +2. [Scope & audience](#2-scope--audience) +3. [High-level architecture](#3-high-level-architecture) +4. [Input contract (hero_snapshots row & job payload)](#4-input-contract-hero_snapshots-row--job-payload) +5. [Queue design & job payload schema](#5-queue-design--job-payload-schema) +6. [Worker claim semantics & concurrency model](#6-worker-claim-semantics--concurrency-model) +7. [Processing pipeline — end-to-end](#7-processing-pipeline--end-to-end) +8. [Upsert patterns & SQL examples](#8-upsert-patterns--sql-examples) +9. [Error handling, retry policy & poison messages](#9-error-handling-retry-policy--poison-messages) +10. [Monitoring & metrics (what to emit)](#10-monitoring--metrics-what-to-emit) +11. [Logging, tracing & structured events](#11-logging-tracing--structured-events) +12. [Backfill, bulk processing & rate-control](#12-backfill-bulk-processing--rate-control) +13. [Admin APIs & operational endpoints](#13-admin-apis--operational-endpoints) +14. [Testing strategy (unit / integration / e2e / perf)](#14-testing-strategy-unit--integration--e2e--perf) +15. [Security & privacy controls](#15-security--privacy-controls) +16. [Common incidents & runbook actions](#16-common-incidents--runbook-actions) +17. [Configuration / example env](#17-configuration--example-env) +18. [Appendix](#18-appendix) + - Example job payloads + - Example TypeScript worker skeleton (BullMQ + pg) + - Helpful SQL queries + - References + +--- + +## 1. Executive summary + +This document defines the `ETL` (Extract-Transform-Load) worker design for StarForge. 
It specifies: + +- the contract between ingestion (`hero_snapshots`) and worker jobs, +- queue and retry semantics, +- idempotent upsert patterns to normalize `raw player JSON` into relational tables, +- monitoring and logging conventions, +- operational runbooks for common incidents, +- testing & migration considerations. + +> Goal: provide a deterministic, observable, and safe pipeline that can process large (2–3MB) `JSON snapshots`, scale +> horizontally, and be re-run (idempotent) for repair/backfill. + +--- + +## 2. Scope & audience + +**Primary audience:** + +- Backend engineers implementing worker & queue code +- DevOps/SRE running workers and scaling infrastructure +- QA engineers writing integration & performance tests +- Data engineers validating normalized outputs + +**Scope:** + +- Processing snapshots stored in `hero_snapshots` (JSONB). +- Upserting normalized tables (`users`, `user_troops`, `user_pets`, `user_artifacts`, `user_teams`, `guilds`, + `user_profile_summary`, `etl_errors`). +- Supporting reprocessing, retries, and safe backfills. +- Instrumentation requirements for observability. + +**Out of scope (for this doc):** + +- UI or bot behavior (only how they observe `ETL` status). +- Detailed catalog seeding (see [docs/DB_MODEL.md](./DB_MODEL.md)). +- `Non-ETL` background jobs (e.g., exports) beyond brief notes. + +--- + +## 3. High-level architecture + +### Textual diagram (components): + +- Client (CLI / Bot / API) -> API Service (ingest endpoint) + - writes `hero_snapshots` (raw JSONB) + - enqueues job to `Redis/BullMQ` with `snapshot_id` + -> Redis / Queue + -> `ETL Worker Pool` (n instances) + - claim snapshot (atomic DB update) + - stream parse `JSONB` + - per-entity upserts to `Postgres` + - mark snapshot processed / write `etl_errors` + - emit metrics to `Prometheus`, errors to `Sentry` + -> `Postgres` (normalized tables + `hero_snapshots`) + -> `S3` (archive older snapshots) [optional] + +### Key constraints & goals: + +- _Idempotency:_ reprocessing same snapshot must not duplicate rows. +- _Memory-safety:_ do not load full large JSON into heap; stream arrays. +- _Observability:_ job metrics, `ETL` duration, error categories. +- _Safety:_ per-entity transactions to avoid monolithic long transactions that block DB. + +--- + +## 4. Input contract (hero_snapshots row & job payload) + +### 4.1 hero_snapshots canonical row (DB contract) + +- _Table:_ `hero_snapshots` +- _Important columns (read-only contract for workers):_ + - `id` UUID (snapshot id) + - `user_id` UUID (optional) + - `namecode` VARCHAR(64) (optional) + - `source` VARCHAR (fetch_by_namecode | login | cli_upload | ui_fetch) + - `raw` JSONB (full get_hero_profile payload) + - `size_bytes` INTEGER + - `content_hash` VARCHAR (SHA256 hex) + - `processing` BOOLEAN + - `processing_started_at` TIMESTAMPTZ + - `processed_at` TIMESTAMPTZ + - `error_count` INTEGER + - `last_error` JSONB + - `created_at` TIMESTAMPTZ + +Workers MUST treat `raw` as authoritative payload for that job. + +### 4.2 Job payload (queue message) + +- Minimal job payload should be small and stable: + +```json +{ + "snapshot_id": "uuid", + "correlation_id": "uuid", + // request trace id, optional + "enqueue_ts": "2025-12-02T12:00:00Z" +} +``` + +- Avoid embedding the full JSON in the job; workers read `raw` from DB by `snapshot_id`. +- Job `TTL`: worker should drop jobs older than configurable threshold (e.g., 30d) unless reprocessing older snapshots + is explicitly allowed. 
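+
+As a sketch of this contract from the producer side, the snippet below shows how an ingest endpoint might enqueue the
+minimal payload with `BullMQ` after the `hero_snapshots` insert returns its id. The `enqueueSnapshotJob` helper is
+hypothetical; the queue name, attempts and backoff mirror the defaults described in sections 5.2–5.3.
+
+```ts
+// enqueue-sketch.ts: illustrative only, not production code.
+import { Queue } from 'bullmq';
+import { randomUUID } from 'crypto';
+
+const connection = { host: process.env.REDIS_HOST, port: Number( process.env.REDIS_PORT || 6379 ) };
+const etlQueue = new Queue( 'etl:snapshots', { connection } );
+
+// Hypothetical helper, called right after INSERT INTO hero_snapshots ... RETURNING id.
+export async function enqueueSnapshotJob( snapshotId: string, correlationId?: string ) {
+  return etlQueue.add(
+    'process_snapshot',
+    {
+      snapshot_id: snapshotId,                        // worker reads `raw` from DB by this id
+      correlation_id: correlationId ?? randomUUID(),  // propagate the request trace id when available
+      enqueue_ts: new Date().toISOString(),
+    },
+    {
+      attempts: 5,                                    // ETL_MAX_RETRIES default (section 5.3)
+      backoff: { type: 'exponential', delay: 1000 },  // baseMs = 1000ms
+      priority: 0,                                    // "normal" background ingestion (section 5.2)
+      removeOnComplete: true,                         // job audit trail lives in queue_jobs, not Redis
+    },
+  );
+}
+```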
+ +### 4.3 Input validation rules + +- Worker must validate: + - `raw` is JSON and contains expected top-level keys (e.g., `NameCode`, `ProfileData`) before attempting mapping. + - `size_bytes` matches `octet_length(raw::text)` (optional sanity check). +- If `raw` is malformed JSON: record `ETL` error, increment `error_count` and stop; do not set `processed_at`. + +--- + +## 5. Queue design & job payload schema + +### 5.1 Choice: Redis + BullMQ (recommended) + +- Advantages: + - job retries + - delayed jobs + - job priorities + - concurrency control + - job inspection + +- Use a short job payload referencing `snapshot_id` and correlation metadata only. + +### 5.2 Queue naming & priorities + +- queue: `etl:snapshots` (default) +- priority levels: + - **high** (interactive, e.g., single-player fetch): `priority = 10` + - **normal** (background ingestion): `priority = 0` + - **low** (bulk backfill): `priority = -10` +- Use separate queues for reprocess/backfill jobs (e.g., `etl:backfill`) to throttle separately. + +### 5.3 Retry & backoff strategy (BullMQ config) + +- Attempts: `maxAttempts = configurable` (default 5) +- Backoff: exponential with `jitter`: `backoff = Math.min(2^attempt * baseMs, maxMs)` with jitter + - `baseMs = 1000ms, maxMs = 5 minutes` +- On transient DB errors: let queue retry. +- On permanent errors (parse error, invalid shape): mark as failed and do not retry automatically (see section 9). + +### 5.4 Poison message handling + +- If a job hits maxAttempts: move to "failed" and record `ETL` error row with `permanent: true`. Admins can re-enqueue + after manual fix. + +### 5.5 Job visibility & admin tools + +- Persist job metadata to `queue_jobs` table optionally for auditing (`job_id`, `snapshot_id`, `type`, `status`). +- Build admin UI to list recent jobs, failures, and reprocess actions. + +--- + +## 6. Worker claim semantics & concurrency model + +### 6.1 Atomic claim (DB-first) + +- Use DB to claim a snapshot to avoid race conditions across worker instances: + +```sql +WITH claimed AS ( +UPDATE hero_snapshots +SET processing = true, + processing_started_at = now() +WHERE id = $1 + AND (processing = false OR processing IS NULL) RETURNING id +) +SELECT id +FROM claimed; +``` + +- If no rows returned: someone else claimed it, skip job. + +### 6.2 Reasoning + +- Use DB atomic update so job coordination remains consistent even if the queue loses a lease. +- Avoid advisory locks unless necessary; if you need cross-DB locking for multi-step workflows, use + `pg_advisory_lock(key)` carefully and always release locks. + +### 6.3 Worker concurrency & DB connection budgeting + +- Configure worker concurrency so total DB connections (`workers * per-worker connections + app connections) < DB max` + connections minus headroom for `pgbouncer`. +- Suggested defaults: + - per-worker concurrency: 4 (tune by instance size) + - pg pool max per worker: 2–4 +- Use pooled DB client (pg.Pool) and reuse connections. + +### 6.4 Claim TTL & stuck job detection + +- If a worker crashes after claiming, a snapshot may remain with `processing=true`. Use a watchdog that detects + snapshots + with `processing=true` and `processing_started_at` older than `claim_ttl` (e.g., 30 minutes) and: + - either mark `processing=false` and increment `error_count` OR + - reassign via admin requeue after human review. + +--- + +## 7. Processing pipeline — end-to-end + +This section describes the sequential steps the worker executes for each snapshot. Each step has failure handling notes. 
+ +### 7.1 High-level steps + +1. Dequeue job (`snapshot_id`). +2. Attempt atomic DB claim (see 6.1). +3. Read `hero_snapshots.raw JSONB`. +4. Validate schema & compute a lightweight fingerprint if needed. +5. Parse using streaming parser for large arrays (troops, pets, artifacts). +6. Per-entity transformations & upserts in recommended order: + - `users` / `heroes` + - `guilds` & `guild_members` + - `user_troops` (batch upserts) + - `user_pets` (batch upserts) + - `user_artifacts` (batch upserts) + - `user_teams` & `team_troops` + - `user_profile_summary` (calculate denormalized summary) +7. Mark snapshot processed (`processed_at=now()`, `processing=false`) and write metrics. +8. If errors occurred: + - Log error, write `etl_errors` row(s), update `hero_snapshots.last_error` and `error_count`. + - Respect retry semantics (for transient errors allow queue retry). +9. Emit telemetry (events, metrics). + +### 7.2 Streaming & memory-safe parsing + +- Use `stream-json` or equivalent to iterate over large arrays in `raw` without converting to a full `JS object`. +- Strategy: + - If `raw` is already in DB as `JSONB`, get it as string via `raw::text` and stream-parse. + - Alternatively, parse top-level shallow object into memory (small) and stream large fields (troops arrays). +- When mapping arrays to upserts, batch them (e.g., 100–500 troops per batch) to avoid many small DB roundtrips. + +### 7.3 Per-entity transaction boundaries + +- Use small per-entity transactions instead of a single one across all entities. +- Example: + - Begin transaction for user upsert -> commit. + - Begin transaction for batch of `user_troops` (multi-row upsert) -> commit. +- Rationale: limits lock time; allows partial progress and easier reprocessing of failed entities. + +### 7.4 Ordering rationale + +- Upsert users first so other entities can reference `user_id`. +- Upsert `guilds`/`guild_members` before guild-specific lookups. +- Upsert `troops`/`pets`/`artifacts` in batches; summary is derived last. + +### 7.5 Summary generation + +- Build `user_profile_summary` with most-used fields needed by API (top troops, equipped pet, pvp tier, guild). +- Upsert summary with `ON CONFLICT (user_id) DO UPDATE SET ...` and update `cached_at`. + +--- + +## 8. Upsert patterns & SQL examples + +### 8.1 Safe multi-row upsert (user_troops) + +- Use multi-row insert + ON CONFLICT DO UPDATE for batches: + +```sql +INSERT INTO user_troops (user_id, troop_id, amount, level, rarity, fusion_cards, traits_owned, extra, last_seen, + updated_at) +VALUES ($1, $2, $3, $4, $5, $6, $7, $8::jsonb, now(), now()), ... 
+ ON CONFLICT (user_id, troop_id) DO +UPDATE + SET amount = EXCLUDED.amount, + level = EXCLUDED.level, + rarity = EXCLUDED.rarity, + fusion_cards = EXCLUDED.fusion_cards, + traits_owned = EXCLUDED.traits_owned, + extra = COALESCE (user_troops.extra, '{}'::jsonb) || EXCLUDED.extra, + last_seen = EXCLUDED.last_seen, + updated_at = now(); +``` + +### 8.2 Upsert users (idempotent) + +```sql +INSERT INTO users (id, namecode, username, discord_user_id, email, created_at, updated_at) +VALUES (COALESCE($id, gen_random_uuid()), $namecode, $username, $discord_user_id, $email, now(), + now()) ON CONFLICT (namecode) DO +UPDATE + SET username = COALESCE (EXCLUDED.username, users.username), + discord_user_id = COALESCE (EXCLUDED.discord_user_id, users.discord_user_id), + email = CASE WHEN users.email IS NULL THEN EXCLUDED.email ELSE users.email +END +, + updated_at = now() +RETURNING id; +``` + +- Prefer returning `id` to link `user_id` where needed. + +### 8.3 Upsert summary + +```sql +INSERT INTO user_profile_summary (user_id, namecode, username, level, top_troops, equipped_pet, pvp_tier, guild_id, + last_seen, cached_at, extra) +VALUES ($user_id, $namecode, $username, $level, $top_troops::jsonb, $equipped_pet::jsonb, $pvp_tier, $guild_id, + $last_seen, now(), $extra::jsonb) ON CONFLICT (user_id) DO +UPDATE + SET namecode = EXCLUDED.namecode, + username = EXCLUDED.username, + level = EXCLUDED.level, + top_troops = EXCLUDED.top_troops, + equipped_pet = EXCLUDED.equipped_pet, + pvp_tier = EXCLUDED.pvp_tier, + guild_id = EXCLUDED.guild_id, + last_seen = EXCLUDED.last_seen, + cached_at = now(), + extra = user_profile_summary.extra || EXCLUDED.extra; +``` + +### 8.4 Use partial updates for `extra` JSONB fields + +- Merge semantics: `existing.extra || new.extra` appends/replaces keys; use `jsonb_set` when needing deep replace. + +--- + +## 9. Error handling, retry policy & poison messages + +### 9.1 Error classification + +- Transient (retryable): + - DB connection issues, deadlocks, temporary external API failures, network timeouts. + - Action: allow automatic retry (queue backoff). +- Permanent (non-retryable): + - Malformed JSON, missing required fields, FK constraint due to missing catalog and policy is to fail. + - Action: write an `etl_errors` row with `permanent: true`, set `hero_snapshots.processing=false`, + `processed_at=NULL` (or optionally `processed_at` to indicate attempted), increment `error_count` and alert + maintainers. + +### 9.2 Recording errors + +- Always record structured errors in `etl_errors`: + - `snapshot_id`, `error_type`, `message` (sanitized), `details` (JSON with limited size), `created_at`. +- Update `hero_snapshots`: increment `error_count`, set `last_error` to summary (short), and (optionally) add + `error_history` append. + +### 9.3 Retry policy + +- Default max attempts: 5 (configurable). +- Backoff: exponential with jitter (1s, 2s, 4s, 8s… capped at 5 min). +- For bulk backfills, allow higher `maxAttempts` but with lower concurrency. + +### 9.4 Poison queue handling + +- After `maxAttempts` reached: + - Mark job failed in queue + - Create a monitoring alert and create an admin work item (via issue/alerting channel) + - Provide reprocess endpoint for manual re-enqueue after fix + +### 9.5 Partial failures & compensating actions + +- If one entity fails (e.g., `user_troops` batch fails due to FK), worker should: + - Rollback the entity-level transaction, write an `etl_errors` row specifying which entity failed and why. 
+ - Continue processing other entities (if safe and configured). + - Consider marking snapshot with `partial_failure = true` in `hero_snapshots` or `etl_errors` with entity details. + +--- + +## 10. Monitoring & metrics (what to emit) + +### 10.1 Metric naming conventions + +- Prefix: `starforge_etl_` for worker metrics. +- Labels: `worker_instance`, `job_type`, `env`, `snapshot_source`, `error_type` where applicable. + +### 10.2 Required metrics (Prometheus) + +- Counters: + - `starforge_etl_snapshots_received_total{source="", job_type=""}` + - `starforge_etl_snapshots_processed_total{result="success|failed|partial"}` + - `starforge_etl_snapshots_failed_total{error_type=""}` +- Gauges: + - `starforge_etl_processing_jobs_in_progress` + - `starforge_etl_queue_depth{queue=""}` (prefer scraped from `BullMQ` or `Redis` exporter) +- Histograms / Summaries: + - `starforge_etl_processing_duration_seconds` (observe by `job_type`) + - `starforge_etl_batch_upsert_size` (distribution of batch sizes) +- DB related: + - `starforge_etl_db_tx_duration_seconds` + - `starforge_etl_db_connection_count` (from app pool) + +### 10.3 Alerting rules (suggested) + +- **P0:** `ETL` failure spike: + - Condition: `rate(starforge_etl_snapshots_failed_total[5m]) / rate(starforge_etl_snapshots_processed_total[5m]) > + 0.01 (i.e., >1%)` -> page on-call +- **P0:** Queue depth high: + - `queue_depth > threshold for 10m` -> page +- **P1:** Avg processing time regression: + - `p95 processing duration > 2x baseline for 10m` -> notify +- **P1:** `Worker OOM or restarts > N` -> notify + +### 10.4 Events & traces + +- Emit `snapshot_processed` event with `snapshot_id`, `user_id`, counts of upserts and duration. +- Trace flow: API ingest -> enqueue -> worker -> DB upserts. Propagate `X-Request-Id` or `trace_id`. + +--- + +## 11. Logging, tracing & structured events + +### 11.1 Logging guidelines (structured JSON) + +- Always log JSON objects with keys: + - `timestamp`, `service`: "etl-worker", `env`, `level`, `message`, `snapshot_id`, `job_id`, `user_id` (if + available), `correlation_id`, `trace_id`, `module`, `duration_ms`, `details` (sanitized). +- Examples: + +```json +{ + "timestamp": "2025-12-02T12:00:00Z", + "service": "etl-worker", + "env": "production", + "level": "info", + "message": "claimed snapshot", + "snapshot_id": "uuid", + "job_id": "bull-job-uuid", + "worker_instance": "worker-1" +} +``` + +### 11.2 Sensitive data redaction + +- NEVER log raw tokens, passwords or full player credentials. +- Mask / redact fields matching patterns (token, password, session) prior to logging. +- For debugging, store pointers (`snapshot_id`, `s3_path`) to full raw payload rather than including it in logs. + +### 11.3 Tracing + +- Use OpenTelemetry (recommended) to propagate trace ids across API -> queue -> worker -> DB. +- Create spans for: + - `claim_snapshot` (DB update) + - `parse_snapshot` (streaming) + - `upsert_` (e.g., `upsert_user_troops`) + - `commit_snapshot` (final update) + +### 11.4 Error telemetry + +- Send errors to Sentry with minimal context: `snapshot_id`, `job_id`, `short_message`, `error_type`, `sanitized_stack`. +- Avoid including raw snapshot or `PII` in error events. + +--- + +## 12. Backfill, bulk processing & rate-control + +### 12.1 Backfill principles + +- Use the same worker code as real-time processing to avoid mapping differences. +- Run backfills on an isolated worker pool with controlled concurrency and DB throttle. 
+- Recommended initial concurrency: small (e.g., few workers x 2 concurrency) and increase gradually while monitoring DB + metrics. + +### 12.2 Batch sizing & parallelism + +- For large arrays (troops), process in batches of 100–500 rows (tunable). +- Use upsert multi-row inserts to reduce round-trips. + +### 12.3 Rate limiting and safe windows + +- Backfills should honor a DB load budget: + - Max write `IOPS` per second or max DB CPU utilization. +- Implement a backfill controller that: + - queries DB metrics + - adjusts worker concurrency or sleeps between batches accordingly + +### 12.4 Resume & checkpointing + +- Record per-snapshot backfill progress in a `backfill_jobs` or `queue_jobs` table with: + - `job_id`, `start_ts`, `last_processed_snapshot_id`, `processed_count`, `error_count` +- On interruption, resume from last recorded snapshot id or job cursor. + +--- + +## 13. Admin APIs & operational endpoints + +### 13.1 Reprocess snapshot + +- Endpoint: `POST /api/v1/admin/snapshots/:id/reprocess` +- Requirements: `admin auth` & `RBAC` +- Behavior: + - Validate snapshot exists and not currently processing. + - Reset `hero_snapshots.error_count = 0; last_error = NULL; processing = false` + - Enqueue job (high priority if needed) + - Return `202` Accepted with `job_id` and `ETA` estimate. +- Audit: create an `audit_logs` entry with requester id & reason. + +### 13.2 Snapshot status + +- Endpoint: `GET /api/v1/admin/snapshots/:id/status` +- Returns: processing `flag`, `processed_at`, `error_count`, `last_error` + +### 13.3 Health & metrics endpoints + +- `/health` (readiness & liveness) +- `/metrics` (Prometheus) protected by IP or scrape token in production + +### 13.4 Job management + +- Admin UI to list failed jobs and re-enqueue or inspect `etl_errors`. + +--- + +## 14. Testing strategy (unit / integration / e2e / perf) + +### 14.1 Unit tests + +- Functions: parse helpers, mapping logic (e.g., mapTroopToRow), hash calculation, idempotency helpers. +- Mock DB & queue; keep fast tests. + +### 14.2 Integration tests + +- Use testcontainers or ephemeral `Postgres` + `Redis`. +- Tests: + - migration up -> insert example snapshot -> run worker -> assert normalized tables + - idempotency: process same snapshot twice -> assert no duplicate rows + - malformed JSON snapshot -> worker writes `etl_errors` and does not crash + +### 14.3 End-to-end tests + +- Staging-like environment with real services. +- Scenario: API ingest -> verify snapshot inserted -> wait for worker processing -> query `user_profile_summary` via + API -> assert fields. + +### 14.4 Performance tests + +- Tools: `k6`, `Artillery` +- Scenarios: + - Burst ingestion of N snapshots per minute + - Worker processing of large snapshots (2–3MB) with constrained memory to validate streaming parser +- Goals: + - measure p95 processing times, memory usage, DB connection utilization + - tune batch size & concurrency + +### 14.5 Test fixtures + +- Keep synthetic example payloads in `docs/examples/`: + - small, medium, large (`~2–3MB`), malformed sample +- Use these fixtures in `CI` integration tests. + +### 14.6 CI gating + +- PRs must pass unit + integration tests. +- Migration PRs must run a migration preflight job (apply migrations to ephemeral DB and run a smoke `ETL`). + +--- + +## 15. Security & privacy controls + +### 15.1 Data minimization + +- Do not persist user passwords. If upstream includes tokens, `ETL` must redact them before storing in normalized tables + or logs unless storing is absolutely required and approved. 
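+
+A minimal sketch of the kind of scrubbing helper referred to here and in 15.2 below; the key patterns and the
+`redactSensitive` name are illustrative, not an existing StarForge module.
+
+```ts
+// redact-sketch.ts: walks a parsed JSON value and masks values whose keys look sensitive.
+// The pattern list is an assumption; keep the real list in shared, reviewed configuration.
+const SENSITIVE_KEY_PATTERN = /token|password|secret|session|authorization/i;
+
+export function redactSensitive( value: unknown ): unknown {
+  if ( Array.isArray( value ) ) {
+    return value.map( redactSensitive );
+  }
+  if ( value !== null && typeof value === 'object' ) {
+    const out: Record<string, unknown> = {};
+    for ( const [ key, child ] of Object.entries( value as Record<string, unknown> ) ) {
+      // Mask the whole subtree when the key matches; otherwise recurse into the child value.
+      out[ key ] = SENSITIVE_KEY_PATTERN.test( key ) ? '[REDACTED]' : redactSensitive( child );
+    }
+    return out;
+  }
+  return value; // primitives pass through unchanged
+}
+
+// Usage: apply before persisting extras/details JSON or before logging payload fragments, e.g.
+// logger.info({ snapshot_id, details: redactSensitive(details) }, 'parsed snapshot');
+```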
+ +### 15.2 Logging policy + +- Redact known sensitive keys in snapshots and logs (e.g., token, password, session). +- Implement automatic scrubbing function that walks `JSON` and masks keys matching configured patterns. + +### 15.3 Access control + +- Admin endpoints require `RBAC` and auth tokens with least privilege. +- Raw snapshot access is restricted to operators and SREs — prefer return of pointers (`snapshot_id`) rather than full + raw JSON. + +### 15.4 Secrets & rotation + +- Use secrets manager or `GitHub Secrets` for connections (`DATABASE_URL`, `REDIS_URL`). +- Rotate worker credentials periodically and automate via provider `OIDC` where possible. + +### 15.5 Audit + +- Write `audit_logs` for admin actions (reprocess, manual job enqueue, retention runs). + +--- + +## 16. Common incidents & runbook actions + +### 16.1 Worker OOM (out of memory) + +- Symptoms: worker process restarted, `OOM` logs, repeated restarts. +- Immediate actions: + 1. Scale down worker concurrency or stop worker pool. + 2. Check last logs for snapshot size that caused `OOM` (`snapshot_id`). + 3. If large snapshot: move snapshot to quarantine (mark `hero_snapshots.processing=false`, add `etl_errors`), + re-enqueue under dedicated high-memory worker or reprocess after parser fixes. + 4. Increase instance memory temporarily and redeploy. +- Preventive: + - Use streaming parser and batch sizes, set per-worker memory limits and monitor. + +### 16.2 DB connection exhaustion + +- Symptoms: DB refuses connections, new connections fail. +- Immediate actions: + 1. Pause / scale down worker concurrency. + 2. Check pg pool stats (`pg_stat_activity`) and `pgbouncer` (if used). + 3. Reduce per-worker pool size and restart workers. +- Preventive: + - Use `pgbouncer`, cap total workers, keep per-worker pool small. + +### 16.3 Queue storm / sudden backlog + +- Symptoms: queue depth spikes, processing lag increases. +- Immediate actions: + 1. Evaluate if sudden ingestion spike expected (event). + 2. Scale worker pool carefully (respect DB connections). + 3. If risk of overload, throttle ingestion at `API` (return `202` with `ETA`). +- Preventive: + - Rate limit ingestion clients and use backpressure strategy. + +### 16.4 Frequent ETL parse errors (new upstream schema) + +- Symptoms: many snapshots failing with parse errors after upstream change. +- Immediate actions: + 1. Pause auto-retries. + 2. Sample failing raw payloads and store examples for developer analysis. + 3. Create a temporary `ETL` mapping patch that ignores unknown fields and records extras. + 4. Plan `migration/ETL` update and backfill as necessary. + +### 16.5 Retention/archival job failed (S3 unavailable) + +- Immediate actions: + 1. Retry with exponential backoff. + 2. If persistent, pause archival deletion and alert `SRE`. + 3. Ensure DB not purging snapshots until archive confirmed. + +--- + +## 17. Configuration / example env + +Example env vars (worker) + +``` +DATABASE_URL=postgres://user:pass@host:5432/starforge?sslmode=require +REDIS_URL=redis://:password@redis-host:6379 +NODE_ENV=production +WORKER_CONCURRENCY=4 +PG_POOL_MAX=4 +ETL_BATCH_SIZE=250 +ETL_MAX_RETRIES=5 +ETL_BACKOFF_BASE_MS=1000 +ETL_BACKOFF_MAX_MS=300000 +CLAIM_TTL_MINUTES=30 +METRICS_PORT=9091 +SENTRY_DSN=... +``` + +--- + +## 18. 
Appendix + +### 18.1 Example job payloads + +- Single snapshot job (minimal): + +```json +{ + "snapshot_id": "a2f1c1b2-...-e4f9", + "correlation_id": "req-1234", + "enqueue_ts": "2025-12-02T12:00:00Z" +} +``` + +- Backfill job: + +```json +{ + "job_type": "backfill_range", + "from_created_at": "2025-01-01T00:00:00Z", + "to_created_at": "2025-06-01T00:00:00Z", + "batch_size": 500, + "owner": "data-team" +} +``` + +### 18.2 Example TypeScript worker skeleton (`BullMQ` + `pg` + `stream-json`) + +> NOTE: this is a skeleton to illustrate patterns, not production-ready code. Add proper error handling, metrics, config +> management before use. + +```ts +// sketch-worker.ts +import { Queue, Worker, Job } from 'bullmq'; +import { Pool } from 'pg'; +import { parser } from 'stream-json'; +import { streamArray } from 'stream-json/streamers/StreamArray'; +import { Readable } from 'stream'; +import Pino from 'pino'; + +const logger = Pino(); + +const redisConnection = { host: process.env.REDIS_HOST, port: Number( process.env.REDIS_PORT ) }; +const queue = new Queue( 'etl:snapshots', { connection: redisConnection } ); + +const pgPool = new Pool( { connectionString: process.env.DATABASE_URL, max: Number( process.env.PG_POOL_MAX || 4 ) } ); + +const worker = new Worker( 'etl:snapshots', async ( job: Job ) => { + const { snapshot_id } = job.data; + const client = await pgPool.connect(); + try { + // 1. atomic claim + const res = await client.query( + `UPDATE hero_snapshots SET processing = true, processing_started_at = now() + WHERE id = $1 AND (processing = false OR processing IS NULL) + RETURNING id, raw::text as raw_text`, [ snapshot_id ] ); + if ( res.rowCount === 0 ) { + logger.info( { snapshot_id }, 'snapshot already claimed, skipping' ); + return; + } + const rawText = res.rows[0].raw_text; + // 2. streaming parse example: if troops is large array + const jsonStream = Readable.from( [ rawText ] ).pipe( parser() ); + // You would then stream into proper handlers; for brevity we show full parse fallback: + const payload = JSON.parse( rawText ); // fallback only when safe + // 3. process entities (use helper functions to batch upserts) + await processUserAndUpserts( client, payload, snapshot_id ); + // 4. mark processed + await client.query( + `UPDATE hero_snapshots SET processed_at = now(), processing = false, last_error = NULL WHERE id = $1`, + [ snapshot_id ] ); + logger.info( { snapshot_id }, 'processed snapshot' ); + } catch ( err ) { + logger.error( { err, snapshot_id }, 'processing failed' ); + // write etl_errors and update hero_snapshots + await client.query( + `INSERT INTO etl_errors (id, snapshot_id, error_type, message, details, created_at) + VALUES (gen_random_uuid(), $1, $2, $3, $4::jsonb, now())`, + [ snapshot_id, 'PROCESSING_ERROR', String( ( err && err.message ) || 'unknown' ), JSON.stringify( { stack: err && err.stack } ) ] ); + await client.query( `UPDATE hero_snapshots SET error_count = COALESCE(error_count,0)+1, last_error = to_jsonb($2::text) WHERE id = $1`, + [ snapshot_id, `Processing error: ${ err && err.message }` ] ); + throw err; // let BullMQ retry according to its policy + } finally { + client.release(); + } +}, { connection: redisConnection, concurrency: Number( process.env.WORKER_CONCURRENCY || 4 ) } ); + +async function processUserAndUpserts( client: any, payload: any, snapshot_id: string ) { + // Implement mapping and batch upserts, following SQL patterns in this doc. 
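+  // Hedged sketch of the intended shape (not a full implementation): the payload field names used
+  // below (NameCode, Name, Troops) are assumptions about the upstream profile JSON; adjust them to the
+  // real get_hero_profile contract and stream large arrays instead of indexing them directly.
+  const userRes = await client.query(
+    `INSERT INTO users (id, namecode, username, created_at, updated_at)
+     VALUES (gen_random_uuid(), $1, $2, now(), now())
+     ON CONFLICT (namecode) DO UPDATE SET username = COALESCE(EXCLUDED.username, users.username), updated_at = now()
+     RETURNING id`, [ payload.NameCode, payload.Name ] );
+  const userId = userRes.rows[0].id;
+  const troops: any[] = Array.isArray( payload.Troops ) ? payload.Troops : [];
+  const batchSize = Number( process.env.ETL_BATCH_SIZE || 250 );
+  for ( let i = 0; i < troops.length; i += batchSize ) {
+    const batch = troops.slice( i, i + batchSize );
+    // Apply the section 8.1 multi-row upsert for this batch (hypothetical helper, not shown):
+    // await client.query( buildUserTroopsUpsert( userId, batch ) );
+    logger.debug( { snapshot_id, userId, batchRows: batch.length }, 'user_troops batch mapped (upsert omitted in sketch)' );
+  }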
+} +``` + +### 18.3 Helpful SQL queries + +- Find latest processed snapshot for a user: + +```sql +SELECT * +FROM hero_snapshots +WHERE user_id = $1 + AND processed_at IS NOT NULL +ORDER BY created_at DESC LIMIT 1; +``` + +- List failed snapshots (recent): + +```sql +SELECT id, created_at, error_count, last_error +FROM hero_snapshots +WHERE error_count > 0 +ORDER BY created_at DESC LIMIT 100; +``` + +- Count queue depth (`BullMQ` stores in `Redis`, but if DB-backed queue): + +```sql +SELECT count(*) +FROM queue_jobs +WHERE status = 'pending'; +``` + +--- + +References + +- [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema) +- [docs/MIGRATIONS.md](./MIGRATIONS.md) (migration conventions) +- [docs/OBSERVABILITY.md](./OBSERVABILITY.md) (metrics & alerts) +- BullMQ docs: https://docs.bullmq.io/ +- stream-json: https://github.com/uhop/stream-json +- OpenTelemetry for Node: https://opentelemetry.io/ + +--- diff --git a/docs/MIGRATIONS.md b/docs/MIGRATIONS.md new file mode 100644 index 0000000..e561b1c --- /dev/null +++ b/docs/MIGRATIONS.md @@ -0,0 +1,549 @@ +# MIGRATIONS.md — Database Migrations Guide + +> Version: 0.1 +> Date: 2025-12-03 + +--- + +## Purpose + +- Provide a complete, practical guide for creating, testing, reviewing and applying database schema migrations for + StarForge. +- Aligns with [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema) and [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) ( + `ETL` contracts). +- Target audience: `Backend engineers`, `SRE/DevOps`, `reviewers` and `release approvers`. + +--- + +## Scope + +- JavaScript migrations using `node-pg-migrate` (recommended). +- Preflight checks, CI integration, local developer workflows, production bootstrap workflow, rollback guidance, + indexes, extensions, and operational runbooks. +- Includes sample migration templates and scripts you can adapt. + +--- + +## Table of contents + +1. [Principles & Goals](#1-principles--goals) +2. [Tooling & repo layout](#2-tooling--repo-layout) +3. [Naming & file conventions](#3-naming--file-conventions) +4. [Migration anatomy (up / down / idempotence)](#4-migration-anatomy-up--down--idempotence) +5. [Safe migration patterns (additive → backfill → enforce)](#5-safe-migration-patterns) +6. [Extensions & provider considerations (pgcrypto etc.)](#6-extensions--provider-considerations) +7. [Index creation & long-running operations (CONCURRENTLY)](#7-index-creation--long-running-operations) +8. [Partitioning strategy & archival-aware changes](#8-partitioning-strategy--archival-aware-changes) +9. [Preflight checks & CI jobs](#9-preflight-checks--ci-jobs) +10. [Local developer workflow (create / run / revert)](#10-local-developer-workflow) +11. [Production bootstrap workflow (manual, protected)](#11-production-bootstrap-workflow-manual-protected) +12. [Rollback & emergency restore strategy](#12-rollback--emergency-restore-strategy) +13. [Migration PR checklist & review criteria](#13-migration-pr-checklist--review-criteria) +14. [Seeding & catalog management](#14-seeding--catalog-management) +15. [Testing migrations (unit/integration/e2e)](#15-testing-migrations-unit--integration--e2e) +16. [Backfill & data-migration jobs coordination](#16-backfill--data-migration-jobs-coordination) +17. [Auditing & migration traceability](#17-auditing--migration-traceability) +18. [Examples: node-pg-migrate templates & preflight script](#18-examples-node-pg-migrate-templates--preflight-script) +19. [Runbooks & operational notes](#19-runbooks--operational-notes) +20. 
[References](#20-references) + +--- + +## 1. Principles & Goals + +- Safety first: migrations must not cause unexpected downtime or data loss. +- Small, focused changes: prefer many small migrations over large monolithic ones. +- Reversible where practical: provide `down` when safe; if not reversible, document strong rollback plan. +- Observable: migrations must produce logs and artifacts to audit what happened and when. +- Testable: migrations run in CI against ephemeral DBs; run preflight on staging before production. +- Manual approval for production: production schema changes are applied through a protected, manual workflow. + +--- + +## 2. Tooling & repo layout + +### Recommended tooling + +- `node-pg-migrate` — JS migrations that integrate well with `Node` toolchain. +- `pg` (`node-postgres`) for direct DB access in scripts / preflight. +- `psql` for quick manual checks. +- testcontainers (or `docker-compose`) for integration tests in CI. + +### Suggested repo layout + +``` +database/ + migrations/ # node-pg-migrate migration files + seeds/ # idempotent seed files for catalogs +scripts/ + migrate-preflight.sh + bootstrap-db.sh +docs/ + MIGRATIONS.md + DB_MODEL.md + ETL_AND_WORKER.md +``` + +### GitHub Actions + +- CI job for migration preflight and tests. +- Manual `db-bootstrap.yml` workflow for production migrations (requires environment approval). + +--- + +## 3. Naming & file conventions + +### Filename pattern (required) + +``` +YYYYMMDDHHMMSS_description.[js|sql] +``` + +- Use `UTC` timestamp prefix for ordering. +- Example: `20251203T153000_add_hero_snapshots_table.js` (or numeric timestamp 20251203153000). + +### Migration metadata + +- Each migration should include header comments describing: + - Purpose + - Expected impact (approx row counts, index size) + - Estimated index build time if large + +### Commit messages + +- Link migration file to PR and include `MIGRATION: ` in PR and commit message. + +--- + +## 4. Migration anatomy (up / down / idempotence) + +Use `node-pg-migrate` export format with `exports.up = (pgm) => { ... }` and `exports.down = (pgm) => { ... }`. + +### Guidelines + +- Up should apply the change; down should revert when safe. +- Avoid irreversible actions in a single migration. If irreversible, document backup id and rollback plan in migration + header. +- Migrations should be idempotent where possible and robust to partial re-runs in test environments. +- Keep migration files small (`~1–200 lines`) and focused. + +### Example migration (skeleton) + +```js +/* 20251203_create_hero_snapshots.js + Purpose: create hero_snapshots table for raw ingestion + Estimated rows: small initially + */ +exports.shorthands = undefined; + +exports.up = ( pgm ) => { + pgm.createTable( 'hero_snapshots', { + id: { type: 'uuid', primaryKey: true, default: pgm.func( 'gen_random_uuid()' ) }, + user_id: { type: 'uuid', references: 'users(id)', onDelete: 'SET NULL' }, + source: { type: 'varchar(64)', notNull: true }, + raw: { type: 'jsonb', notNull: true }, + size_bytes: { type: 'integer', notNull: true }, + content_hash: { type: 'varchar(128)', notNull: true }, + processing: { type: 'boolean', default: false }, + processed_at: { type: 'timestamptz' }, + created_at: { type: 'timestamptz', default: pgm.func( 'now()' ) } + } ); + pgm.createIndex( 'hero_snapshots', 'raw', { using: 'gin', method: 'jsonb_path_ops' } ); +}; + +exports.down = ( pgm ) => { + pgm.dropIndex( 'hero_snapshots', 'raw' ); + pgm.dropTable( 'hero_snapshots' ); +}; +``` + +--- + +## 5. 
Safe migration patterns + +### Three-step safe approach for potentially disruptive changes (recommended) + +1. Additive step: Add new nullable column / indexes / tables. No data movement, no exclusive locks. + - Example: `ALTER TABLE user_profile_summary ADD COLUMN summary_v2 JSONB NULL;` +2. Backfill & verification: Deploy code that writes both old and new columns, backfill data asynchronously via `ETL` + /backfill jobs, validate parity in staging. +3. Enforce & cleanup: Make column NOT NULL and drop old column in a later migration after backfill verification. + +### Index changes + +- Use `CREATE INDEX CONCURRENTLY` for large indexes to avoid table locks. When using `node-pg-migrate`, run raw SQL for + CONCURRENTLY: + +```js +pgm.sql( 'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops)' ); +``` + +- Note: `CONCURRENTLY` cannot run inside a transaction — adjust migration runner accordingly (`node-pg-migrate` supports + it with `pgm.sql` but you must not rely on transactional behavior). + +### Avoid long exclusive operations + +- Avoid `ALTER TABLE ... TYPE` or `ADD COLUMN ... DEFAULT ` that rewrite the table. Prefer add nullable column, + backfill, then set default + NOT NULL. + +--- + +## 6. Extensions & provider considerations + +### Preferred extensions + +- `pgcrypto` — preferred for `gen_random_uuid()`. +- `uuid-ossp` — fallback for some providers. + +### CREATE EXTENSION strategy + +- Put extension creation into its own migration at the top: + - `20251203_create_pgcrypto_extension.js` +- Migration content: + +```js +exports.up = ( pgm ) => { + pgm.sql( "CREATE EXTENSION IF NOT EXISTS pgcrypto;" ); +}; +exports.down = ( pgm ) => { + // do not drop extension on down in production-safe path; leave no-op or document +}; +``` + +### Provider limitations + +- Managed providers (`Supabase`, `RDS`) may refuse `CREATE EXTENSION`. Preflight must check extension availability; + fallback: + - Generate `UUIDs` application-side or use `gen_random_uuid()` only if available. +- Document required provider privileges in migration header and in [docs/MIGRATIONS.md](./MIGRATIONS.md). + +--- + +## 7. Index creation & long-running operations + +- Use `CREATE INDEX CONCURRENTLY` for large tables. +- For `node-pg-migrate`, run concurrent index creation via `pgm.sql(...)` outside a transaction. Note: some runners wrap + migrations in transactions by default — use `pgm.sql` and configure properly. +- For Postgres large index estimate: + - Estimate rows: `SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname='hero_snapshots';` + - Estimate size: `SELECT pg_relation_size('hero_snapshots')` or use `pgstattuple` if available. + +### Index creation checklist + +- Schedule during off-peak window. +- Kick off `CREATE INDEX CONCURRENTLY`, monitor progress. +- If using CONCURRENTLY, be aware it still consumes resources (`IO/CPU`). + +--- + +## 8. Partitioning strategy & archival-aware changes + +### When to partition + +- High-volume append-only tables (`hero_snapshots`) over millions of rows. +- Partition by time (monthly) or by hash(`user_id`) if cardinality increases. + +### Partition migration pattern + +1. Create parent table with `PARTITION BY RANGE (created_at)` or hash. +2. Create partitions (monthly). +3. Add trigger or default partitioning strategy for new inserts. +4. Backfill older data into partitions in batches. +5. Adjust queries and indexes accordingly. 
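+
+A sketch of what steps 1 and 2 could look like as a migration. The parent/partition names and the trimmed column list
+are illustrative, and the minimal `MigrationBuilder` type is only a stand-in for node-pg-migrate's builder; a real
+cut-over of `hero_snapshots` would still follow the additive → backfill → enforce sequence from section 5.
+
+```ts
+// 20251203T160000_create_hero_snapshots_partitioned.ts (sketch only, not a drop-in migration file)
+type MigrationBuilder = { sql: ( query: string ) => void }; // stand-in for node-pg-migrate's builder
+
+export const up = ( pgm: MigrationBuilder ): void => {
+  // Step 1: partitioned parent table (columns trimmed for brevity; the PK must include the partition key).
+  pgm.sql( `
+    CREATE TABLE IF NOT EXISTS hero_snapshots_partitioned (
+      id uuid NOT NULL DEFAULT gen_random_uuid(),
+      raw jsonb NOT NULL,
+      created_at timestamptz NOT NULL DEFAULT now(),
+      PRIMARY KEY (id, created_at)
+    ) PARTITION BY RANGE (created_at);
+  ` );
+  // Step 2: one monthly partition; further months come from later migrations or a scheduled job.
+  pgm.sql( `
+    CREATE TABLE IF NOT EXISTS hero_snapshots_2025_12
+      PARTITION OF hero_snapshots_partitioned
+      FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
+  ` );
+};
+
+export const down = ( pgm: MigrationBuilder ): void => {
+  pgm.sql( 'DROP TABLE IF EXISTS hero_snapshots_partitioned CASCADE;' );
+};
+```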
+ +### Archival considerations + +- If archival: create `snapshot_archives` table or mark partitions as archived before deletion. +- Migration to add `archived` metadata should be additive and reversible. + +--- + +## 9. Preflight checks & CI jobs + +**Purpose:** detect permission gaps, estimate costs and validate that migrations apply cleanly. + +### Preflight script responsibilities (`scripts/migrate-preflight.sh`) + +- Verify connectivity (`psql` connection). +- Check extension permissions: + - `SELECT installed_version FROM pg_available_extensions WHERE name='pgcrypto';` +- Apply migrations to ephemeral test DB + - Run `node-pg-migrate` against ephemeral DB and verify `schema_migrations` updated. +- Run smoke `ETL`: insert a small sample snapshot and run worker logic or `ETL` smoke script. +- Validate indexing and expected tables existence. + +### CI integration + +- Add CI job that: + - Installs deps (`pnpm install --frozen-lockfile`) + - Runs lint, unit tests + - Runs `migrate:preflight` (applies migrations to a fresh ephemeral DB) and a smoke test +- Block merge if preflight fails. + +### Example preflight checks (shell snippet) + +```bash +#!/usr/bin/env bash +set -euo pipefail + +# Validate DB connectivity +psql "$DATABASE_URL" -c "SELECT 1" + +# Check pgcrypto availability and privileges +psql "$DATABASE_URL" -t -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'pgcrypto';" | grep pgcrypto || echo "pgcrypto not installed" + +# Run migrations against ephemeral DB (test) +DATABASE_URL="$TEST_DATABASE_URL" pnpm run migrate:up +# Run smoke queries +psql "$TEST_DATABASE_URL" -c "SELECT count(*) FROM hero_snapshots;" +``` + +--- + +## 10. Local developer workflow + +### Create new migration + +- Use standard filename pattern and put file under `database/migrations/`. +- Populate `exports.up` and `exports.down`. + +### Run migrations locally + +- Install deps: `pnpm install` +- Configure `.env.local` or `.env` with `DATABASE_URL` (local Postgres). +- Run: + +```bash +pnpm run migrate:up -- --config database/migration-config.js +``` + +(or use your npm scripts wrapper) + +### Rollback last migration locally + +- `pnpm run migrate:down -- --count 1` (use carefully) + +### Testing locally + +- Run `./scripts/bootstrap-db.sh` to run migrations and seeds in a fresh local DB (script should be idempotent) +- Run `ETL` smoke: ingest sample file (docs/examples) and start worker to process snapshots. + +--- + +## 11. Production bootstrap workflow (manual, protected) + +### Principles + +- Migrations for production MUST be applied via a manual, protected `CI` workflow (`db-bootstrap.yml`) requiring + `GitHub Environment` approval. +- Always take a DB snapshot/backup *immediately before* running production migrations. 
+ +### Example GitHub Actions workflow (outline) + +```yaml +name: db-bootstrap +on: + workflow_dispatch: +jobs: + bootstrap: + runs-on: ubuntu-latest + environment: production # requires approval + steps: + - uses: actions/checkout@v3 + - name: Setup Node + uses: actions/setup-node@v3 + - name: Install deps + run: pnpm install --frozen-lockfile + - name: Preflight check + run: ./scripts/migrate-preflight.sh + env: + DATABASE_URL: ${{ secrets.STAGING_DATABASE_URL }} + - name: Apply migrations (PROD) + if: ${{ github.event.inputs.apply == 'true' }} + run: pnpm run migrate:up -- --config database/migration-config.js --env production + env: + DATABASE_URL: ${{ secrets.DATABASE_URL }} +``` + +### Pre-bootstrap checklist (must be done before approval) + +- Recent backup id recorded and verified (snapshot successful and tested if possible). +- Migration preflight on staging passed and smoke `ETL` validated. +- `SRE` on-call available and aware of the maintenance window. +- PR includes migration file(s), changelog and rollback plan. +- Expected index creation & time estimates documented in PR. + +### Post-bootstrap verification + +- Run sanity queries to confirm expected tables & indexes exist. +- Run small sample ingest + end-to-end test. +- Monitor metrics & alerts for `30–120 minutes`. + +--- + +## 12. Rollback & emergency restore strategy + +### When to rollback + +- Severe data corruption introduced by a migration. +- Production outage caused by migration (locks, incompatibility). +- If rollback migration is not feasible, prefer restore-from-backup. + +### Rollback options + +- Reversible migrations: run `pnpm run migrate:down` for the specific migration if down exists and verified. +- Restore DB from backup: + 1. Stop ingestion & workers to avoid further writes. + 2. Restore database from snapshot taken before migration. + 3. Re-apply safe migrations or fix code to align with restored schema. + 4. Re-run necessary `ETL`/backfill to bring normalized tables up-to-date. +- Always communicate with stakeholders and log actions in `audit_logs`. + +### Emergency steps (short) + +1. Pause workers and `API` writes (set read-only mode or flip feature flag). +2. Assess whether `down` migration is safe; if yes run down; if not, prepare restore. +3. If restoring: follow provider procedures for `PITR` or snapshot restore. +4. After restore, run smoke tests, then re-enable ingestion carefully. + +--- + +## 13. Migration PR checklist & review criteria + +### Every migration PR must include: + +- Migration file(s) in `database/migrations/` with timestamped filenames. +- Description of change, rationale and expected impact (row counts, index costs). +- Preflight results (`CI` artifacts) from staging run. +- Backup id that will be used prior to production run (documented). +- Runbook / rollback steps documented or link to runbook. +- Tests: integration tests that exercise the migration (`CI` passing). +- Approvers: at least two reviewers, one must be `SRE/DevOps` for production-impacting migrations. + +### Reviewers should verify: + +- Migration is small and focused. +- No destructive changes without phased approach. +- Indexes using CONCURRENTLY are documented and scheduled. +- Required extensions are present or fallback documented. +- A clear rollback plan exists. + +--- + +## 14. Seeding & catalog management + +### Seeding + +- Keep idempotent seed scripts under `database/seeds/`. +- Seeds should use `INSERT ... ON CONFLICT DO NOTHING` or `ON CONFLICT DO UPDATE` as appropriate. 
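+
+A minimal sketch of such an idempotent seed as a standalone script; the file name and the placeholder `spells` rows
+are illustrative, not real catalog data.
+
+```ts
+// database/seeds/seed-spells.ts (sketch): safe to re-run because of ON CONFLICT DO NOTHING.
+import { Pool } from 'pg';
+
+const rows = [
+  { id: 1, name: 'Example Spell A', cost: 12 }, // placeholder rows only
+  { id: 2, name: 'Example Spell B', cost: 18 },
+];
+
+async function main(): Promise<void> {
+  const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+  try {
+    for ( const row of rows ) {
+      await pool.query(
+        `INSERT INTO spells (id, name, cost, created_at, updated_at)
+         VALUES ($1, $2, $3, now(), now())
+         ON CONFLICT (id) DO NOTHING`,
+        [ row.id, row.name, row.cost ],
+      );
+    }
+  } finally {
+    await pool.end();
+  }
+}
+
+main().catch( ( err ) => {
+  console.error( err );
+  process.exit( 1 );
+} );
+```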
+ +### Catalog synchronization + +- Catalogs (`troop_catalog`, `pet_catalog`, `spells`) should be seeded and versioned. +- When catalog changes require schema changes, coordinate migration and catalog updates in same PR where possible, with + backfill plan. + +### Backfill seeds + +- If migration adds new columns requiring backfill, include backfill script as a separate job (not in migration up) and + track in `backfill_jobs` table. + +--- + +## 15. Testing migrations (unit / integration / e2e) + +- Unit: test migration helper functions if any. +- Integration: apply migrations to ephemeral DB in `CI`; run smoke queries and run `ETL` against sample payloads. +- E2E: in staging, apply migrations, run backfills and full `ETL` to validate end-to-end system behavior. +- Include migration tests in PR gating. + +--- + +## 16. Backfill & data-migration jobs coordination + +- Backfill tasks are often long-running and should be performed by controlled background jobs, not by migration code + itself. +- Backfill job steps: + 1. Create new column/table (migration) + 2. Launch backfill job (worker/backfill service) that populates new column in batches. + 3. Validate backfill results with sampling queries. + 4. Release final enforcement migration that sets NOT NULL or drops old columns. +- Track progress in `backfill_jobs` table with checkpoints for resume. + +--- + +## 17. Auditing & migration traceability + +- Record migration runs with: + - migration id, timestamp, runner (`CI`/user), environment, backup id, logs. +- Store artifacts from production run (output logs) in `CI` artifacts storage for audit. +- Keep `schema_migrations` table consistent and never manually alter it except in documented emergency procedures. + +--- + +## 18. Examples: node-pg-migrate templates & preflight script + +### 18.1 Migration JS template (`node-pg-migrate`) + +```js +// database/migrations/20251203T153000_add_example_table.js +exports.up = ( pgm ) => { + pgm.createTable( 'example_table', { + id: { type: 'uuid', primaryKey: true, default: pgm.func( 'gen_random_uuid()' ) }, + name: { type: 'varchar(255)', notNull: true }, + data: { type: 'jsonb', default: pgm.func( "('{}'::jsonb)" ) }, + created_at: { type: 'timestamptz', default: pgm.func( 'now()' ) } + } ); + pgm.createIndex( 'example_table', 'name' ); +}; + +exports.down = ( pgm ) => { + pgm.dropIndex( 'example_table', 'name' ); + pgm.dropTable( 'example_table' ); +}; +``` + +### 18.2 Preflight script (`scripts/migrate-preflight.sh`) + +```bash +#!/usr/bin/env bash +set -euo pipefail +: "${DATABASE_URL:?Need DATABASE_URL}" +echo "Testing DB connectivity..." +psql "$DATABASE_URL" -c "SELECT now();" + +echo "Checking pgcrypto extension..." +psql "$DATABASE_URL" -t -c "SELECT extname FROM pg_extension WHERE extname = 'pgcrypto';" | grep -q pgcrypto && echo "pgcrypto present" || echo "pgcrypto missing (provider may not support extension)" + +echo "Applying migrations to test database..." +pnpm run migrate:up -- --config database/migration-config.js +echo "Running smoke ETL test..." +# optionally run a smoke script that inserts sample snapshot and triggers local worker +``` + +--- + +## 19. Runbooks & operational notes + +- Link runbooks in `docs/OP_RUNBOOKS/`: + - [APPLY_MIGRATIONS.md](./OP_RUNBOOKS/APPLY_MIGRATIONS.md) — exact steps and command lines for running the + `GitHub Action` and verifying post-run checks. + - [MIGRATION_ROLLBACK.md](./OP_RUNBOOKS/MIGRATION_ROLLBACK.md) — step-by-step rollback and restore runbook. 
+ - [BACKFILL.md](./OP_RUNBOOKS/BACKFILL.md) — schedule, throttle, and operator steps to run a backfill safely. + +### Operational tip + +- For critical migrations requiring owner approval, tag PR with `migration-impact: high` and schedule a 2nd approver. + +--- + +## 20. References + +- [docs/DB_MODEL.md](./DB_MODEL.md) (canonical schema) +- [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) (ETL contract) +- `node-pg-migrate`: https://github.com/salsify/node-pg-migrate +- `PostgreSQL` docs: https://www.postgresql.org/docs/ +- `Prometheus`: https://prometheus.io/docs/ +- `OpenTelemetry`: https://opentelemetry.io/ + +--- diff --git a/docs/OBSERVABILITY.md b/docs/OBSERVABILITY.md new file mode 100644 index 0000000..0a8497d --- /dev/null +++ b/docs/OBSERVABILITY.md @@ -0,0 +1,712 @@ +# Observability — StarForge + +> Version: 0.1 +> Date: 2025-12-03 + +--- + +This document specifies the observability strategy, metrics, logging, tracing, dashboards and alerting for StarForge. +It is intended for engineers and `SREs` implementing and operating the `ETL/Worker`, `API`, `Bot` and `admin services` +described in the [docs/PRD.md](./PRD.md), [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) +and [docs/DB_MODEL.md](./DB_MODEL.md). + +## Goals + +- Provide actionable, low-noise monitoring for production and staging. +- Enable fast incident detection (`ETL` failure spikes, queue backlog, DB exhaustion). +- Provide structured logs and traces to speed root-cause analysis. +- Ensure observability respects privacy (`PII` redaction) and cost constraints. + +--- + +## Audience + +- `Backend engineers` (instrumentation) +- `SRE / DevOps` (dashboards, alerting, retention) +- `QA / Data engineers` (validate metrics, dashboards) +- `On-call responders` (runbooks & alerts) + +--- + +## Contents + +1. [Observability architecture overview](#1-observability-architecture-overview) +2. [Metrics (what to collect, names, labels)](#2-metrics--names-types-labels-and-meaning) +3. [Prometheus configuration & scraping guidance](#3-prometheus-configuration--scraping-guidance) +4. [Grafana dashboards (suggested panels & queries)](#4-grafana-dashboards-templates--panels) +5. [Alerting (rules, thresholds, playbooks)](#5-alerting--rules-thresholds--playbooks) +6. [Logging (structure, fields, samples, retention)](#6-logging--structure-fields-scrubbing--retention) +7. [Tracing (OpenTelemetry, spans, sampling, propagation)](#7-tracing--opentelemetry-spans-propagation-and-sampling) +8. [Sentry / Error tracking (configuration & scrubbing)](#8-sentry--error-tracking) +9. [Correlation & metadata (request_id, trace_id, snapshot_id)](#9-correlation--metadata--request_id-trace_id-snapshot_id) +10. [Security & PII handling in telemetry](#10-security--pii-handling-in-telemetry) +11. [Retention & storage policies](#11-retention--storage-policies) +12. [Testing observability in CI](#12-testing-observability-in-ci) +13. [Operational runbooks & playbook snippets](#13-operational-runbooks--playbooks-summary--links) +14. [Implementation checklist / example snippets](#14-implementation-checklist--example-snippets) +15. [References & next steps](#15-references--next-steps) + +--- + +## 1. 
Observability architecture overview + +### Components + +- Instrumented services: `api-service`, `etl-worker`, `discord-bot`, `admin-ui` +- Metrics: `Prometheus` (scrape or push via exporters) +- Dashboards: `Grafana` (dashboards per service & cross-service) +- Traces: `OpenTelemetry` collector -> tracing backend (`Tempo`/`Jaeger`/`Tempo+Loki`) +- Error tracking: `Sentry` (exceptions, structured events) with scrubbing +- Logs: Structured JSON logs shipped to central log store (`Loki`/`ELK`/`Datadog`/`Logflare`) +- Alerts: `Prometheus Alertmanager` -> pager (`PagerDuty`) and `Slack` channels +- Long-term storage: Archive raw metrics/aggregates to cost-effective storage if required + +--- + +### Design notes + +- Favor pull-based metrics (`Prometheus` scrape) for services with stable endpoints; use pushgateway for ephemeral jobs + when necessary. +- Ensure all instrumented services expose `/metrics` and `/health`. +- Propagate correlation ids (`X-Request-Id`) and trace context across requests and jobs to link logs, metrics and + traces. + +--- + +## 2. Metrics — names, types, labels, and meaning + +### Naming conventions + +- Prefix: `starforge_` for all app metrics. +- Suffix conventions: + - `_total` for counters + - `_seconds` / `_duration_seconds` for durations + - `_gauge` for gauges +- Use `snake_case`. +- Avoid cardinality explosion in labels (limit label values and cardinality). + +--- + +### Label guidance + +- Common labels: `env`, `service`, `instance`, `job_type`, `snapshot_source`, `queue_name`, `error_type` +- Use `user_id` or `namecode` sparingly (avoid high-cardinality); prefer hashed or bucketed labels for analytics. + +--- + +### Core metric categories and recommended metrics + +#### A. ETL / Worker metrics (prefix: `starforge_etl_`) + +- Counters + - `starforge_etl_snapshots_received_total{env,service,job_type, snapshot_source}` — incremented when `API` enqueues + a snapshot job. + - `starforge_etl_snapshots_processed_total{env,service,result}` — result ∈ {success,failed,partial}. + - `starforge_etl_snapshots_failed_total{env,service,error_type}` — categorize errors (`PARSE_ERROR`, `DB_ERROR`, + `FK_VIOLATION`, `OOM`). + - `starforge_etl_entity_upserts_total{env,service,entity}` — counts per entity (`users`, `user_troops`, + `user_pets`, ...). +- Histograms / Summaries + - `starforge_etl_processing_duration_seconds_bucket{le,env,service}` — processing latency histogram per snapshot. + - `starforge_etl_upsert_batch_size` — batch sizes distribution used for multi-row upserts. +- Gauges + - `starforge_etl_jobs_in_progress{env,service}` — currently processing jobs. + - `starforge_etl_last_processed_timestamp{env,service}` — unix timestamp of last processed snapshot. + +#### B. API metrics (prefix: `starforge_api_`) + +- Counters + - `starforge_api_requests_total{env,service,endpoint,method,status}` +- Histograms + - `starforge_api_request_duration_seconds{env,service,endpoint}` +- Gauges + - `starforge_api_inflight_requests{env,service}` + +#### C. Queue & Redis metrics + +- If using `BullMQ` / `Redis`: + - `starforge_queue_jobs_waiting_total{env,queue_name}` — count of waiting jobs (pullable from `Redis`) + - `starforge_queue_jobs_active_total{env,queue_name}` + - `starforge_queue_jobs_failed_total{env,queue_name}` + +#### D. Database metrics (collect from exporter) + +- `pg_up` / `pg_connections` / `pg_active_queries` / `pg_longest_tx_seconds` / `pg_lock_count` (from `pg_exporter`) +- `starforge_db_tx_duration_seconds` (application-level) + +#### E. 
+
+#### B. API metrics (prefix: `starforge_api_`)
+
+- Counters
+  - `starforge_api_requests_total{env,service,endpoint,method,status}`
+- Histograms
+  - `starforge_api_request_duration_seconds{env,service,endpoint}`
+- Gauges
+  - `starforge_api_inflight_requests{env,service}`
+
+#### C. Queue & Redis metrics
+
+- If using `BullMQ` / `Redis`:
+  - `starforge_queue_jobs_waiting_total{env,queue_name}` — count of waiting jobs (readable from `Redis`)
+  - `starforge_queue_jobs_active_total{env,queue_name}`
+  - `starforge_queue_jobs_failed_total{env,queue_name}`
+
+#### D. Database metrics (collect from exporter)
+
+- `pg_up` / `pg_connections` / `pg_active_queries` / `pg_longest_tx_seconds` / `pg_lock_count` (from `pg_exporter`)
+- `starforge_db_tx_duration_seconds` (application-level)
+
+#### E. Infrastructure metrics
+
+- CPU, memory, disk, network per instance (`node_exporter`)
+- Pod-level metrics if on `Kubernetes` (`kube-state-metrics`)
+
+#### F. Business metrics (higher-level)
+
+- `starforge_players_active_30d` — unique active players in 30 days (calculated/ingested)
+- `starforge_bot_commands_total{command}`
+
+### Suggested metric dimensions (labels)
+
+- `env` (production/staging/local)
+- `service` (api, etl-worker, bot)
+- `instance` (host or pod name)
+- `job_type` (process_snapshot, backfill_range)
+- `snapshot_source` (fetch_by_namecode, login, cli_upload)
+- `entity` (user_troops, user_pets, ...)
+
+### Metric examples (Prometheus exposition style)
+
+```text
+# HELP starforge_etl_snapshots_processed_total Number of snapshots processed
+# TYPE starforge_etl_snapshots_processed_total counter
+starforge_etl_snapshots_processed_total{env="production",service="etl-worker",result="success"} 12345
+```
+
+---
+
+## 3. Prometheus configuration & scraping guidance
+
+### Scrape targets
+
+- Each service exposes `/metrics` (`Prometheus` format) on a dedicated metrics port (e.g., 9091).
+- For `Kubernetes`: use a `ServiceMonitor` (`Prometheus` Operator) per service.
+
+---
+
+### Example `scrape_config` (prometheus.yml)
+
+```yaml
+scrape_configs:
+  - job_name: 'starforge-api'
+    metrics_path: /metrics
+    static_configs:
+      - targets: [ 'api-1.example.com:9091', 'api-2.example.com:9091' ]
+    relabel_configs:
+      - source_labels: [ '__address__' ]
+        target_label: 'instance'
+
+  - job_name: 'starforge-etl'
+    metrics_path: /metrics
+    kubernetes_sd_configs:
+      - role: pod
+    relabel_configs:
+      - source_labels: [ __meta_kubernetes_pod_label_app ]
+        regex: 'etl-worker'
+        action: keep
+```
+
+---
+
+### Scrape intervals
+
+- Default: `scrape_interval: 15s` for application metrics.
+- For high-volume metrics or high-cardinality series, consider `30s` to reduce load.
+
+---
+
+### Relabeling & security
+
+- Add `relabel_configs` to attach `env` and `service` labels from pod metadata (see the sketch at the end of this
+  section).
+- Secure `Prometheus` scrapes with `mTLS` or an allowlist of internal networks.
+
+---
+
+### Pushgateway
+
+- Use only for short-lived ephemeral jobs if necessary; prefer direct scraping and persistent worker processes.
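+
+As an illustration of the relabeling note above, the sketch below extends the `starforge-etl` job with
+`relabel_configs` that copy `env` and `service` from pod metadata. The pod label names (`environment`, `app`) are
+assumptions and depend on how the pods are actually labelled.
+
+```yaml
+    relabel_configs:
+      # Assumes pods carry `environment` and `app` labels; adjust to the actual labelling scheme.
+      - source_labels: [ __meta_kubernetes_pod_label_environment ]
+        target_label: env
+      - source_labels: [ __meta_kubernetes_pod_label_app ]
+        target_label: service
+      - source_labels: [ __meta_kubernetes_pod_name ]
+        target_label: instance
+```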
+
+---
+
+## 4. Grafana dashboards (templates & panels)
+
+### Build dashboards per audience:
+
+- `ETL` / `Worker` dashboard (`SRE`, `backend`)
+- `API` dashboard (latency, errors)
+- DB dashboard (connections, slow queries)
+- Bot dashboard (commands, success rate)
+- Overall system health dashboard (top-level)
+
+---
+
+### Recommended ETL dashboard panels
+
+- Snapshot throughput: `rate(starforge_etl_snapshots_processed_total[1m])` (timeseries)
+- Processing latency: `histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le))`
+- Failure rate:
+  `sum(rate(starforge_etl_snapshots_failed_total[5m])) / sum(rate(starforge_etl_snapshots_processed_total[5m]))`
+- Queue depth: `starforge_queue_jobs_waiting_total`
+- Jobs in progress: `starforge_etl_jobs_in_progress`
+- Top error types pie: `sum by (error_type) (rate(starforge_etl_snapshots_failed_total[5m]))`
+- DB connections and longest transactions (from `pg_exporter`)
+- Worker restarts / `OOM` count (Kubernetes pod restarts)
+
+---
+
+### Sample `Grafana` panel queries (`PromQL`)
+
+- `p95` processing time:
+
+```promql
+histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le))
+```
+
+- `ETL` failure rate (5m):
+
+```promql
+(sum(rate(starforge_etl_snapshots_failed_total[5m])) by (service)) /
+(sum(rate(starforge_etl_snapshots_processed_total[5m])) by (service))
+```
+
+- Queue depth per queue:
+
+```promql
+starforge_queue_jobs_waiting_total{env="production"}
+```
+
+### Dashboard tips
+
+- Add annotations for deployments & migrations to correlate spikes.
+- Show per-instance panels for troubleshooting, plus cluster-level aggregates.
+- Provide drilldown links to logs (`Loki`/`ELK`) and traces (`Tempo`/`Jaeger`) with `snapshot_id` or `trace_id`.
+
+---
+
+## 5. Alerting — rules, thresholds & playbooks
+
+### Design principles
+
+- Keep alerts actionable and low-noise.
+- Avoid alerting on single transient blips—use multi-window conditions or require sustained violations.
+- Each alert should include: description, severity, likely causes, quick actions, runbook link.
+
+### Suggested alert severities mapping
+
+- `P0` (page): system outage / critical data loss / DB down / severe `ETL` failure rate
+- `P1` (notify): service degradation or resource saturation
+- `P2` (ticket): non-urgent but actionable issues
+
+### Alert examples (`Prometheus` alerting rule snippets)
+
+#### A. `ETL` failure rate high (`P0`)
+
+```yaml
+- alert: StarforgeETLFailureRateHigh
+  expr: |
+    (
+      sum(rate(starforge_etl_snapshots_failed_total[5m]))
+      /
+      sum(rate(starforge_etl_snapshots_processed_total[5m]))
+    ) > 0.01
+  for: 5m
+  labels:
+    severity: page
+  annotations:
+    summary: "ETL failure rate > 1% (5m)"
+    description: "ETL failure rate is above 1% for the last 5 minutes. Check the etl_errors table and recent logs. Runbook: docs/OP_RUNBOOKS/INCIDENT_RESPONSE.md"
+```
+
+#### B. Queue depth high (`P0`)
+
+```yaml
+- alert: StarforgeQueueDepthHigh
+  expr: starforge_queue_jobs_waiting_total > 500
+  for: 10m
+  labels:
+    severity: page
+  annotations:
+    summary: "ETL queue backlog > 500"
+    description: "ETL queue backlog is high; scale workers or investigate job failure storms. Check Redis and worker heartbeats."
+```
+
+#### C. DB connection exhaustion (`P0`)
+
+```yaml
+- alert: PostgresConnectionsHigh
+  expr: pg_stat_activity_count > (pg_settings_max_connections * 0.9)
+  for: 2m
+  labels:
+    severity: page
+  annotations:
+    summary: "DB connections > 90%"
+    description: "Database connections exceed 90% of max. Reduce worker concurrency or scale the DB. See runbook DB_RESTORE.md / SCALING_UP.md."
+```
+
+#### D. Processing latency regression (`P1`)
+
+```yaml
+- alert: ETLProcessingLatencyHigh
+  expr: histogram_quantile(0.95, sum(rate(starforge_etl_processing_duration_seconds_bucket[5m])) by (le)) > 30
+  for: 10m
+  labels:
+    severity: notify
+  annotations:
+    summary: "ETL p95 processing time > 30s"
+```
+
+#### E. Worker restarts (`P1`)
+
+```yaml
+- alert: WorkerPodRestartsHigh
+  expr: increase(kube_pod_container_status_restarts_total{job="kube-state-metrics",pod=~"etl-worker.*"}[10m]) > 3
+  for: 0m
+  labels:
+    severity: notify
+  annotations:
+    summary: "ETL worker restarting frequently"
+```
+
+### Alert content: always include
+
+- How to check: queries to run, which dashboards to view.
+- Quick actions: pause backfill, scale workers, check DB connections, examine `etl_errors` for sample snapshot ids.
+- Contact / escalations: on-call rotation or team.
+
+### Playbook snippets (on alert)
+
+- `ETL` failure rate high:
+  1. Check the `Grafana` `ETL` dashboard and `etl_errors`.
+  2. Identify top error types and `snapshot_ids`.
+  3. If parse errors are widespread, pause ingestion, create a task to update the parser, and escalate to the engineering lead.
+  4. If DB errors, reduce worker concurrency immediately.
+
+---
+
+## 6. Logging — structure, fields, scrubbing & retention
+
+### 6.1 Logging format & mandatory fields
+
+- Use structured JSON logs. Each log MUST include:
+  - `timestamp` (ISO 8601)
+  - `service` (api | etl-worker | bot | admin-ui)
+  - `env` (production | staging)
+  - `level` (DEBUG | INFO | WARN | ERROR)
+  - `message` (short, human-readable)
+  - `request_id` or `job_id`
+  - `trace_id` (optional, for trace linking)
+  - `snapshot_id` (when relevant)
+  - `worker_instance` or `pod`
+  - `module` (component)
+  - `details` (structured object for contextual fields)
+
+#### Example log (info)
+
+```json
+{
+  "timestamp": "2025-12-03T10:12:34Z",
+  "service": "etl-worker",
+  "env": "production",
+  "level": "info",
+  "message": "Snapshot processed",
+  "snapshot_id": "a2f1c1b2-...-e4f9",
+  "job_id": "bull-uuid-1234",
+  "worker_instance": "worker-3",
+  "duration_ms": 5432,
+  "entities_upserted": {
+    "users": 1,
+    "user_troops": 120
+  },
+  "trace_id": "..."
+}
+```
+
+#### Example log (error)
+
+```json
+{
+  "timestamp": "2025-12-03T10:13:01Z",
+  "service": "etl-worker",
+  "env": "production",
+  "level": "error",
+  "message": "Failed to upsert user_troops batch",
+  "snapshot_id": "...",
+  "job_id": "...",
+  "error_type": "DB_ERROR",
+  "error_message": "duplicate key value violates unique constraint \"ux_user_troops_user_troop\"",
+  "details": {
+    "batch_size": 200
+  }
+}
+```
+
+### 6.2 Redaction & `PII` scrubbing
+
+- Implement a scrubbing pipeline *before* logs leave the app (a sketch follows this list):
+  - Mask values for keys: `password`, `token`, `session`, `auth`, `api_key`, `access_token`.
+  - Replace them with `"[REDACTED]"` or a hashed placeholder.
+- For raw snapshots, log only `snapshot_id` and `s3_path` (if archived), not the raw JSON.
+- Sentry events must be scrubbed similarly.
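+
+A minimal sketch of the application-level scrubbing step described above, masking the sensitive keys from the list
+before a record is handed to the logger. The key list and the `scrubLogRecord` name are illustrative.
+
+```js
+const SENSITIVE_KEYS = [ 'password', 'token', 'session', 'auth', 'api_key', 'access_token' ];
+
+// Recursively mask sensitive fields before the record is serialized / shipped.
+function scrubLogRecord( value ) {
+  if ( Array.isArray( value ) ) return value.map( scrubLogRecord );
+  if ( value && typeof value === 'object' ) {
+    const out = {};
+    for ( const [ key, val ] of Object.entries( value ) ) {
+      out[ key ] = SENSITIVE_KEYS.some( ( s ) => key.toLowerCase().includes( s ) )
+        ? '[REDACTED]'
+        : scrubLogRecord( val );
+    }
+    return out;
+  }
+  return value;
+}
+
+// Usage: logger.info( scrubLogRecord( { message: 'login', details: { access_token: 'abc' } } ) );
+```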
+
+### 6.3 Log levels & sampling
+
+- Use DEBUG in staging; in production use INFO for success events and WARN/ERROR for issues.
+- Sample high-volume DEBUG logs (e.g., per-troop operations) or rate-limit them server-side (e.g., a 1% sample).
+- For error logs related to a single snapshot, keep full detailed logs for the initial investigation, then sample or
+  purge as needed.
+
+### 6.4 Log shipping & retention
+
+- Ship logs to a centralized log store with indexable fields: `service`, `env`, `snapshot_id`, `job_id`, `trace_id`.
+- Retention policy (recommended baseline):
+  - Hot (searchable) logs: 30 days
+  - Warm (aggregated) logs: 90 days
+  - Cold archive (compressed): 365+ days (for compliance needs)
+- For `PII`-containing logs (rare), keep them only as long as legally required and ensure encryption at rest.
+
+---
+
+## 7. Tracing — OpenTelemetry, spans, propagation, and sampling
+
+### 7.1 Goals
+
+- Provide end-to-end traces across `API` ingest -> queue -> worker -> DB upserts.
+- Correlate traces with logs and metrics for fast `RCA`.
+
+### 7.2 Instrumentation strategy
+
+- Use the `OpenTelemetry SDK` for `Node.js` (`api-service`, `etl-worker`, `bot`).
+- Instrument:
+  - HTTP servers/clients (incoming `API` requests and outgoing requests)
+  - Queue interactions (enqueue / dequeue / job processing) — use semantic conventions to create spans `queue.enqueue`
+    and `queue.process`
+  - DB calls (`pg`) — create spans for queries/batches
+  - Heavy processing spans (`parse_snapshot`, `upsert_user_troops_batch`)
+
+### 7.3 Trace context propagation
+
+- Propagate `traceparent` headers for HTTP requests.
+- For queue jobs, include `trace_id` and `parent_span` in job metadata to continue the trace in the worker.
+- Use `X-Request-Id` as a fallback correlation id.
+
+### 7.4 Spans & recommended hierarchy (example)
+
+- `http.server` (API: `POST /internal/snapshots`)
+  - `queue.enqueue` (enqueue job)
+    - job stored in `Redis` (span)
+- `queue.worker.process` (worker dequeues)
+  - `claim_snapshot` (DB update)
+  - `parse_snapshot` (stream parsing)
+  - `map_entity:users`
+    - `db.upsert.users`
+  - `map_entity:user_troops` (batched)
+    - `db.upsert.user_troops.batch`
+  - `commit_snapshot` (mark processed)
+
+### 7.5 Sampling & retention
+
+- Sample rate: initial default `10%` for production traces (adjust based on cost and utility).
+- Always sample traces for errors (Sentry integration) or for admin-triggered runs (e.g., reprocess).
+- Retention:
+  - Full traces: 7 days
+  - Sampled traces: 30 days
+
+### 7.6 Collector & backend
+
+- Use an `OpenTelemetry Collector` to receive spans and export them to the backend (`Tempo`/`Jaeger`/`Lightstep`).
+- Collector responsibilities: batching, sampling, drop rules, and forwarding to the `APM`.
+
+---
+
+## 8. Sentry / Error tracking
+
+### 8.1 What to capture
+
+- Unhandled exceptions, errors that result in failed jobs, parsing exceptions, and DB errors flagged by the worker
+- Attach minimal context: `snapshot_id`, `job_id`, `worker_instance`, `env`, `user_id` (if non-`PII`), `trace_id`
+
+### 8.2 Scrubbing & PII protection
+
+- Configure `Sentry` before-send hooks to remove or mask `PII` from event payloads (see the sketch at the end of this
+  section).
+- Do not send the full raw snapshot content to `Sentry` — instead include `snapshot_id` and `s3_path`.
+
+### 8.3 Rate limiting & sampling
+
+- Configure `Sentry` sampling to avoid being overwhelmed by noise (sample non-critical errors).
+- For critical jobs or newly introduced code, temporarily increase sampling to get better coverage.
+
+### 8.4 Alerts & integration
+
+- Use `Sentry` alert rules to create issues in ticketing systems for high-severity errors.
+- Integrate `Sentry` with Slack for immediate visibility on new critical errors.
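+
+Pulling 8.1-8.3 together, a sketch of a `Sentry` setup with a `beforeSend` scrubber and conservative sampling. The
+sample rates and the key list are assumptions, not agreed project settings.
+
+```js
+const Sentry = require( '@sentry/node' );
+
+const SENSITIVE_KEYS = [ 'password', 'token', 'session', 'auth', 'api_key', 'access_token' ];
+
+Sentry.init( {
+  dsn: process.env.SENTRY_DSN,
+  environment: process.env.NODE_ENV,
+  sampleRate: 0.5,        // assumed: sample non-critical errors to control volume
+  tracesSampleRate: 0.1,  // assumed: aligned with the 10% trace sampling in section 7.5
+  beforeSend( event ) {
+    // Drop request bodies and mask sensitive keys in extra context before the event leaves the app.
+    if ( event.request ) delete event.request.data;
+    for ( const key of Object.keys( event.extra || {} ) ) {
+      if ( SENSITIVE_KEYS.some( ( s ) => key.toLowerCase().includes( s ) ) ) {
+        event.extra[ key ] = '[REDACTED]';
+      }
+    }
+    return event;
+  },
+} );
+
+// Attach only low-risk context when capturing job failures, e.g.:
+// Sentry.captureException( err, { tags: { snapshot_id, job_id, worker_instance } } );
+```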
+
+---
+
+## 9. Correlation & metadata — request_id, trace_id, snapshot_id
+
+### 9.1 Correlation ids
+
+- `X-Request-Id`: set by the client or `API` gateway; if absent, generate a `UUID` at `API` ingress and return it to the
+  client in the response.
+- Include `request_id` in logs and metrics where possible.
+
+### 9.2 Trace ids
+
+- Standard `OpenTelemetry` `trace_id` propagated through HTTP and the queue.
+- Ensure the worker reads `trace_id` from the job payload and continues the trace.
+
+### 9.3 Snapshot id
+
+- Include `snapshot_id` in relevant logs and traces to pivot troubleshooting to the raw payload.
+- Avoid using `snapshot_id` as a `Prometheus` label (its cardinality is effectively unbounded); use it in logs and as a
+  trace attribute instead.
+
+---
+
+## 10. Security & PII handling in telemetry
+
+### 10.1 Policy highlights
+
+- Telemetry must not leak secrets or raw credentials.
+- Keep a list of sensitive keys (`password`, `token`, `access_token`, `session`, `auth`) and scrub them in telemetry
+  pipelines.
+- Use hashed identifiers if you must include user identifiers in metrics (e.g., hash(namecode) -> low-cardinality label
+  buckets).
+
+### 10.2 Scrubbing implementation points
+
+- Application-level: sanitize before logging/attaching to traces/`Sentry`.
+- Collector-level: run additional redaction rules in the `OpenTelemetry Collector`.
+- Logging pipeline-level: final redaction step in the log shipper (`Loki`/`Logstash`).
+
+### 10.3 DSR & telemetry
+
+- When a `Data Subject Request` (`DSR`) requires deletion of personal data:
+  - Search logs and provide redaction or proof of removal as per policy (logs may be immutable; keep a record of the
+    attempted purge and follow the legal process).
+  - For archived traces / logs in cold storage, follow legal guidance.
+
+---
+
+## 11. Retention & storage policies
+
+### Recommended baseline (configurable by org policy)
+
+- Metrics (`Prometheus`):
+  - Raw samples: 30 days
+  - Aggregated rollups: 365 days
+- Traces:
+  - Full fidelity: 7 days
+  - Sampled: 30 days
+- Logs:
+  - Hot searchable: 30 days
+  - Warm/archived compressed: 365 days
+- `Sentry` / errors: 90 days (or per provider plan)
+
+### Cost controls
+
+- Use metric relabeling to drop or reduce high-cardinality labels in `Prometheus`.
+- Use trace sampling and error-only retention to control trace storage.
+- Use log lifecycle rules to move older logs to cheaper storage tiers.
+
+---
+
+## 12. Testing observability in CI
+
+### 12.1 Unit & integration asserts
+
+- Unit tests should assert that metrics are emitted for key code paths (use a fake `Prometheus` registry; see the sketch
+  at the end of this section).
+- Integration tests should:
+  - Spin up services, call endpoints that produce metrics, and query `/metrics` to assert presence.
+  - Insert a sample snapshot and assert the worker emits `starforge_etl_snapshots_processed_total`.
+
+### 12.2 Performance and load tests
+
+- Simulate realistic load in staging and validate dashboards (`ETL` throughput, `p95` processing, DB connections).
+- Validate alert thresholds in a controlled manner (e.g., create synthetic failures to trigger alerts in staging).
+
+### 12.3 Health checks
+
+- Include `curl /health` checks in `CI` smoke tests for service liveness and DB connectivity.
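+
+A sketch of the kind of unit assertion described in 12.1, using a throwaway `prom-client` registry to verify that the
+processed counter is incremented. The test framework (`Jest`) and the inline increment standing in for the worker's
+handler are assumptions.
+
+```js
+const client = require( 'prom-client' );
+
+test( 'processing a snapshot increments the processed counter', async () => {
+  const register = new client.Registry();
+  const processed = new client.Counter( {
+    name: 'starforge_etl_snapshots_processed_total',
+    help: 'Snapshots processed',
+    labelNames: [ 'env', 'service', 'result' ],
+    registers: [ register ],
+  } );
+
+  // In a real test this increment would happen inside the worker code path under test.
+  processed.inc( { env: 'test', service: 'etl-worker', result: 'success' } );
+
+  const metric = await register.getSingleMetric( 'starforge_etl_snapshots_processed_total' ).get();
+  expect( metric.values[ 0 ].value ).toBe( 1 );
+} );
+```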
+
+---
+
+## 13. Operational runbooks & playbooks (summary + links)
+
+Each alert should point to a runbook with step-by-step instructions. Key runbooks:
+
+- `ETL` failure spike — [docs/OP_RUNBOOKS/ETL_FAILURE_SPIKE.md](./OP_RUNBOOKS/ETL_FAILURE_SPIKE.md)
+- Queue backlog — [docs/OP_RUNBOOKS/QUEUE_BACKLOG.md](./OP_RUNBOOKS/QUEUE_BACKLOG.md)
+- DB connection exhaustion — [docs/OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md](./OP_RUNBOOKS/DB_CONNECTION_EXHAUST.md)
+- `Worker OOM` / Crash — [docs/OP_RUNBOOKS/WORKER_OOM.md](./OP_RUNBOOKS/WORKER_OOM.md)
+- Migration rollback — [docs/OP_RUNBOOKS/MIGRATION_ROLLBACK.md](./OP_RUNBOOKS/MIGRATION_ROLLBACK.md)
+- Secrets compromise — [docs/OP_RUNBOOKS/SECRET_COMPROMISE.md](./OP_RUNBOOKS/SECRET_COMPROMISE.md)
+
+### Include runbook essentials in each alert:
+
+- Quick triage steps
+- Targeted queries and dashboards
+- Suggested mitigation (scale down workers, pause backfills, flip feature flag)
+- Escalation path and contacts
+
+### Example quick-playbook for ETL failure spike (short)
+
+1. Check the Grafana ETL dashboard and the `etl_errors` table for the top `error_type` (PARSE vs DB).
+2. If DB errors: scale down worker concurrency and check DB metrics (connections, locks).
+3. If parse errors: sample failing `snapshot_ids` (select the last 10 from `etl_errors`) and save them to `docs/examples`
+   for dev debugging.
+4. Pause ingestion if necessary and notify product & engineering.
+5. Re-enable ingestion and monitor.
+
+---
+
+## 14. Implementation checklist / example snippets
+
+### Checklist for instrumenting a service
+
+- [ ] Expose `/metrics` in `Prometheus` format (e.g., with `prom-client`)
+- [ ] Expose `/health` and readiness probes (see the sketch at the end of this section)
+- [ ] Instrument core paths: `API` ingest, queue enqueue, job start/end
+- [ ] Emit metrics with appropriate labels (`env`, `service`, `job_type`)
+- [ ] Add structured logging with required fields (`request_id`, `trace_id`)
+- [ ] Integrate `OpenTelemetry` tracing and propagate context into the job payload
+- [ ] Sanitize and scrub `PII` before sending logs or events to `Sentry`
+- [ ] Add unit tests asserting that metrics are emitted
+- [ ] Add end-to-end tests validating `/metrics` counters after processing a sample snapshot
+
+### Prometheus client (Node) snippet example
+
+```js
+const express = require( 'express' );
+const client = require( 'prom-client' );
+
+const app = express();
+const register = new client.Registry();
+
+const snapshotsProcessed = new client.Counter( {
+  name: 'starforge_etl_snapshots_processed_total',
+  help: 'Snapshots processed',
+  labelNames: [ 'env', 'service', 'result' ],
+} );
+register.registerMetric( snapshotsProcessed );
+
+// expose metrics
+app.get( '/metrics', async ( req, res ) => {
+  res.set( 'Content-Type', register.contentType );
+  res.end( await register.metrics() );
+} );
+```
+
+### OpenTelemetry (Node) basic initialization (example)
+
+```js
+const { NodeTracerProvider } = require( '@opentelemetry/sdk-trace-node' );
+const { SimpleSpanProcessor } = require( '@opentelemetry/sdk-trace-base' );
+const { registerInstrumentations } = require( '@opentelemetry/instrumentation' );
+const { JaegerExporter } = require( '@opentelemetry/exporter-jaeger' );
+
+const provider = new NodeTracerProvider();
+const exporter = new JaegerExporter( { serviceName: 'starforge-etl' } );
+provider.addSpanProcessor( new SimpleSpanProcessor( exporter ) );
+provider.register();
+
+// instrument HTTP, pg, etc.
+registerInstrumentations( { tracerProvider: provider, instrumentations: [ /* HTTP, pg, ... */ ] } );
+```
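+
+For the `/health` checklist item, a sketch of liveness and readiness handlers that also verify DB connectivity. The
+`express` app and `pg` pool here are assumptions standing in for the service's real wiring.
+
+```js
+const express = require( 'express' );
+const { Pool } = require( 'pg' );
+
+const app = express();
+const pool = new Pool( { connectionString: process.env.DATABASE_URL } );
+
+// Liveness: the process is up.
+app.get( '/health', ( req, res ) => res.status( 200 ).json( { status: 'ok' } ) );
+
+// Readiness: the database answers a trivial query.
+app.get( '/ready', async ( req, res ) => {
+  try {
+    await pool.query( 'SELECT 1' );
+    res.status( 200 ).json( { status: 'ready' } );
+  } catch ( err ) {
+    res.status( 503 ).json( { status: 'unavailable', error: 'db_unreachable' } );
+  }
+} );
+```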
+
+---
+
+## 15. References & next steps
+
+- [docs/ETL_AND_WORKER.md](./ETL_AND_WORKER.md) — `ETL` worker contract (instrumentation targets referenced throughout
+  this doc)
+- [docs/DB_MODEL.md](./DB_MODEL.md) — canonical schema (for DB-level metrics)
+- [docs/MIGRATIONS.md](./MIGRATIONS.md) — migration conventions (remember to add dashboard annotations for migrations)
+- Prometheus docs: https://prometheus.io/docs/
+- OpenTelemetry: https://opentelemetry.io/
+- Grafana documentation: https://grafana.com/docs/grafana/latest/
+
+---