diff --git a/docs/PRD.md b/docs/PRD.md
new file mode 100644
index 0000000..551f5ce
--- /dev/null
+++ b/docs/PRD.md
@@ -0,0 +1,6636 @@
# Product Requirement Document (PRD) — StarForge

---

## 🔒 Confidentiality Notice

![SENSITIVE](https://img.shields.io/badge/SENSITIVE-DO%20NOT%20SHARE-red?style=for-the-badge)
> ⚠️ This document may contain sensitive information. Do not share outside the approved group.
> Moreover, this is a living document and will be updated as the project evolves.

---

## 🏷 Versioning & Metadata

![Version](https://img.shields.io/badge/version-0.1-blue?style=for-the-badge)
![Date](https://img.shields.io/badge/date-2025--11--26-lightgrey?style=for-the-badge)
![Status](https://img.shields.io/badge/status-Draft-yellow?style=for-the-badge)

### Document Version

- **Version:** `0.1`
- **Date:** `2025-11-26`
- **Status:** **Draft**

### Authors & Contacts

| Role | Person / Contact |
|-------------------|------------------|
| Product Owner | ![PO](https://github.com/CorentynDevPro.png?size=20) **Star Tiflette** — GitHub: [@CorentynDevPro](https://github.com/CorentynDevPro) |
| Technical Lead | _TBD_ |
| Engineering Owner | _TBD_ |
| Design Lead | _TBD_ |
| QA Lead | _TBD_ |
| DevOps / SRE | _TBD_ |
| Stakeholders | **Project Maintainers, small test community** |

> Note: "TBD" stands for "To Be Determined" — role/person not yet assigned.

---

### 📜 Revision History

<details>
+Click to expand revision history + +| Version | Date | Author | Changes Made | +|---------|-----------:|------------------------------------------------------------------------|--------------------------------------------------------| +| 0.1 | 2025-11-26 | Product Owner ([`@CorentynDevPro`](https://github.com/CorentynDevPro)) | Initial draft: metadata, executive summary, objectives | + +
+ +--- + +## **1. Executive Summary** + +### **1.1 Purpose of this document** + +This PRD describes scope, objectives, functional and non‑functional requirements, and success criteria for StarForge +focusing initially on the player-profile ingestion & database foundation: `migrations`, `safe DB bootstrap`, +`snapshot ingestion`, and `ETL` to normalized tables. + +### **1.2 Background & context** + +- The game backend provides rich `JSON snapshots` (e.g. `get_hero_profile`). When fetched + with the `--login` flow these payloads can reach ~2–3 MB per player. +- The repository already contains source for a Discord bot, backend, CI flows, an historical `schema.sql`, and several + extractor script (`get_hero_profile.sh` / `get_hero_proile.bat` / `get_hero_profile.ps1`). +- We moved sensitive values into **GitHub Actions Secrets** (`DATABASE_URL`, `DISCORD_TOKEN`, etc.) and `sketched CI` + + a `manual DB bootstrap` workflow. +- We need to replace the ad‑hoc schema approach with a robust migration system (`node-pg-migrate`), implement a + safe/manual bootstrap, and design a `data model` + `ETL` to handle large profile `JSON` efficiently. + +### **1.3 Problem statement** + +- Player profile payloads are large and nested; storing them only as raw `JSON` makes querying and analytics slow and + awkward. +- Current DB bootstrap and schema management is manual and risky for production. +- No formal ETL process to reliably transform snapshots into normalized, indexed tables for common queries. +- Need a versioned migration workflow to safely evolve schema. + +### **1.4 Proposed solution (high-level)** + +- Adopt `node-pg-migrate` for versioned JS migrations and create an initial migration that: + - Adds normalized tables (`users`, `hero_snapshots`, `user_troops`, `user_pets`, `user_artifacts`, `user_teams`, + `guilds`, `guild_members`, `feature_flags`, `profile_summary`, etc.) + - Creates indexes and enables `pgcrypto` for `gen_random_uuid()` +- Store raw snapshots in `hero_snapshots` (`JSONB`) for audit and reprocessing. +- Provide an idempotent bootstrap script (`scripts/bootstrap-db.sh`) and a manual GitHub Actions workflow ( + `db-bootstrap.yml`) running in a protected environment requiring approval. +- Implement a background ETL worker (queue + worker) that consumes `hero_snapshots` and upserts normalized tables. + Worker marks snapshots as processed and supports retries. +- Define retention / archive policy for snapshots (e.g. keep N last snapshots per user or 90 days; archive older + snapshots to `S3`). + +### **1.5 Key benefits & value proposition** + +- Fast, indexed queries for common needs (who owns troop X, leaderboard queries, quick profile read). +- Reproducible and safer schema evolution via migrations and manual production approval. +- Ability to replay ETL from stored snapshots when mapping changes or bugs require reprocessing. +- Improved observability and operational safety. + +### **1.6 Decisions already made / constraints** + +- Migration tool: `node-pg-migrate` (JS). +- UUID generation: prefer `pgcrypto` + `gen_random_uuid()` for cloud compatibility (Supabase). +- DB bootstrap workflow: manual (`workflow_dispatch`) and run in a GitHub Environment (e.g., production) to require + approval. +- Snapshots stored as `JSONB` in Postgres (TOAST compression) with a `GIN` index for search. +- Secrets remain in **GitHub Actions Secrets** (`DATABASE_URL`, `PGSSLMODE`, `DISCORD_TOKEN`, `GOOGLE_SA_JSON`, etc.). + +## **2. 
Objectives & Success Criteria** + +### **2.1 Business objectives** + +- Enable backend and bot features based on player profiles (team recommendations, inventory tracking, analytics) with + acceptable latencies. +- Reduce onboarding time for new devs by documenting and automating DB bootstrap and migrations. +- Prevent accidental production schema changes by requiring manual approval for migrations applied to prod. + +### **2.2 Product objectives (OKRs)** + +> Note: OKRs means Objectives and Key Results. Objectives are high-level goals, and Key Results are measurable outcomes +> that indicate progress toward those goals. + +- `O1`: Deliver the migration pipeline and a manual DB bootstrap workflow within 2 sprints. +- `O2`: Implement ETL worker that normalizes at least 90% of the useful fields from get_hero_profile (troops, pets, + teams, guild info) in the next sprint. +- `O3`: Achieve sub-200ms median response times for core read queries (after normalization and indexing). + +### 2.3 Key performance indicators (KPIs) / success metrics + +- _Migration reproducibility:_ `100%` success rate for manual runs in staging. +- _ETL throughput:_ baseline `X snapshots/hour` (to be measured), with `target >Y` after optimizations. +- _Storage growth:_ `MB/day` for snapshots; alert when rate exceeds threshold. +- _Query latency:_ `p95` for primary queries `<200ms`. +- _ETL error rate:_ `<1%` (with automated retries and alerting). + +### 2.4 Non-goals / Out of scope + +- **Full reimplementation** of the game data catalog (troop stats) — catalogs will be added **incrementally**. +- **Automatic application of migrations** on merge to main (deliberately out of scope to keep production safe). +- **Full BI reporting** and **historical analytics** in initial phase (phase 2). + +--- + +## **3. Users & Personas** + +This section details the primary and secondary personas for the Player Profile & DB Foundation project, their _goals_ +and _pain points_, and the _top-level user journeys_ (main flows and edge/error flows). Use these personas and journeys +to derive user stories, acceptance criteria and implementation priorities. + +--- + +### **3.1 Primary personas** + +**Persona: Alex — End Player (Gamer)** + +- **Role:** Regular player of the game who expects the bot / web UI to show up-to-date profile information (inventory, + teams, PvP stats). +- **Motivations:** + - Quickly view own profile, teams and troop counts inside Discord or web UI. + - Get recommendations based on current troops and items. + - Keep track of progress (Guild contributions, PvP rank). +- **Pain points:** + - Long loading times when the system queries raw JSON or does on-the-fly parsing. + - Inaccurate or stale information if snapshot ingestion is delayed or fails. + - Concern about privacy if login flows are used insecurely. +- **Success for this feature:** + - Profile and quick summary are available in <200ms p95 after ETL has completed (or served from summary cache). + - The player can request a fresh snapshot and get predictable results. + - No loss or leak of credentials when using login-based fetches. + +**Persona: Gwen — Guild Leader / Moderator** + +- **Role:** Community leader using summaries/analytics to manage guild activity and rewards. +- **Motivations:** + - Quickly see guild members’ contribution and top performers. + - Identify members missing required troops or assets. +- **Pain points:** + - Hard-to-run queries that require parsing raw JSON every time. 
+ - Difficulty getting a consistent, searchable view of all guild members. +- **Success for this feature:** + - Ability to query and generate guild-level reports from normalized data (e.g. total troop counts, top + contributors). + - Dashboard or commands return results with low latency and consistent data freshness. + +**Persona: Dev (Backend Engineer) — Jordan** + +- **Role:** Maintains backend services and ETL worker, writes migrations, debugs ingestion problems. +- **Motivations:** + - Clear, versioned migrations and safe bootstrap for local/staging/prod. + - Idempotent ETL tasks and clear logs / retries when things go wrong. + - Fast developer onboarding (scripts, .env.example, sample data). +- **Pain points:** + - Fragile manual schema application (schema.sql) that is hard to evolve. + - Inconsistent or undocumented ETL transformations causing regressions. +- **Success for this feature:** + - Migrations applied reproducibly in staging and safely in production using manual approval. + - Comprehensive tests and example flows (sample JSON -> ETL -> normalized tables). + - Worker is idempotent and can reprocess snapshots safely. + +**Persona: Data Analyst — Morgan** + +- **Role:** Runs analytics, builds reports and dashboards on player behavior and inventory distributions. +- **Motivations:** + - Query normalized tables instead of messy JSON blobs. + - Get fresh data for near-real-time analytics. +- **Pain points:** + - Needing to write complex JSON path queries over JSONB instead of simple SQL aggregations. + - Unclear or inconsistent schema mapping from snapshots to normalized tables. +- **Success for this feature:** + - Clean schema with indexes and documented fields for analytics. + - ETL produces consistent, dated snapshots and retains historic progress for time-series analysis. + +**Persona: Bot Operator / Community Tools Admin — Riley** + +- **Role:** Operates the Discord bot, runs slash command deployments and maintenance. +- **Motivations:** + - Bot commands return profile data quickly and reliably. + - Simple process to deploy updated slash commands and react to data model changes. +- **Pain points:** + - Backend downtime or schema mismatches causing bot failures. + - Lack of a safe workflow to update DB schema used by bot. +- **Success for this feature:** + - Bot continues to function across releases thanks to stable APIs and profile_summary table. + - Admins have runbooks to re-run ETL or roll back schema changes. + +--- + +### 3.2 Secondary personas / stakeholders + +- **Product Owner (PO)** + - _Responsibilities:_ prioritize features, define acceptance criteria, sign off releases. + - _Interest:_ business value, time-to-market, cost. + +- **DevOps / SRE** + - _Responsibilities:_ CI/CD, infrastructure, environment protection, backups. + - _Interest:_ safe migrations, secrets management, monitoring & alerts. + +- **QA Engineer** + - _Responsibilities:_ test plans, test data, acceptance testing for migrations and ETL. + - _Interest:_ reproducible test environments, sample payloads, rollback tests. + +- **Security Officer / Privacy Officer** + - _Responsibilities:_ ensure credentials & PII handled correctly, audits and compliance (GDPR). + - _Interest:_ secrets lifecycle, encryption, data retention policy. + +- **Legal / Compliance** + - _Responsibilities:_ privacy policies, data residency constraints. + - _Interest:_ retention rules and user consent flows. 
+ +- **Integration Providers (Supabase, Discord)** + - _Responsibilities:_ external services support and limits (extensions, rate limits). + - _Interest:_ compatibility, permissioning (e.g. CREATE EXTENSION restrictions). + +- **Community Manager** + - _Responsibilities:_ communicates changes to users, organizes beta testers. + - _Interest:_ rollout plan, user-facing docs. + +--- + +### 3.3 User journeys (top-level) + +Below are the **primary user journeys** grouped by actor. Each journey contains `preconditions`, `main flow`, +`success criteria`, +`typical metrics to monitor`, and common `edge/error flows` & recovery steps. + +**Journey A — Player requests profile snapshot by NameCode (interactive fetch)** + +- _Actor(s):_ Player (Alex), System (API script / backend) +- _Preconditions:_ + - Player has a `NameCode` / `invite code`. + - The fetch script or frontend has access to a stable `API endpoint` (pcmob.parse.gemsofwar.com or internal proxy). + - No secrets required from the player for NameCode fetch. +- _Trigger:_ + - Player issues a request via the `CLI` script, `web UI` or a `bot command` to fetch profile by NameCode. +- _Main flow:_ + 1. Client calls `API function get_hero_profile` with NameCode. + 2. Response (`JSON`) saved to `hero_snapshots` as a new row (`raw JSONB`, source="fetch_by_namecode", size_bytes + recorded). + 3. Push `snapshot id` to `ETL queue`. + 4. `ETL worker` picks up job, parses `JSON`, upserts `users`, `user_troops`, `user_pets`, `user_teams`, `guilds`, + etc. + 5. `ETL` updates or creates `user_profile_summary` with denormalized quick-read fields. + 6. Bot or UI fetches `profile_summary` (or `hero_snapshots.latest` if needed) and returns to player. +- _Success criteria:_ + - Snapshot saved and queued within X seconds of `API response`. + - `ETL worker` processes snapshot and updates summary within `configurable SLA` (e.g., < 30s for interactive flows; + asynchronous acceptable for larger loads). + - Player sees consistent, accurate data in the bot/UI. +- _Metrics:_ + - Snapshot `ingestion time`, `queue latency`, `ETL processing time`, `API latency`, `summary query latency`. +- _Edge / error flows:_ + - `API` returns partial or malformed `JSON` → snapshot saved but `ETL` fails; mark `snapshot.processing=false`, + processed_at=NULL, create error record with `logs`; notify DevOps/Dev. + - Snapshot size large but within expected range → worker uses `stream/parsing memory-safety`; if memory spike → + worker OOM: automatic retry with smaller memory footprint, escalate. + - `Rate-limited` by upstream: schedule retry and notify user about delay. + +**Journey B — Player login flow (--login) that produces larger payloads** + +- _Actor(s):_ Player (Alex), System (login endpoint), Security Officer (policy) +- _Preconditions:_ + - Player provides credentials interactively (local script) — must never be committed in scripts. + - Credentials used only locally or via a `secure ephemeral agent`; we do not store user passwords in our DB. +- _Trigger:_ + - Player runs `get_hero_profile.sh --login` and authenticates with game backend. +- _Main flow:_ + 1. Script posts `login_user` payload; upstream returns a large result `JSON` (2–3MB). + 2. Script stores login response (`login_user_*.json`) and may extract NameCode automatically using `jq`. + 3. If NameCode found, the script runs `get_hero_profile` with the NameCode to obtain final profile. + 4. Save snapshot to `hero_snapshots` and follow `ETL` as in Journey A. 
+- _Success criteria:_ + - Login and fetch complete without exposing credentials. + - `ETL` processes large payloads without timeouts / resource exhaustion. +- _Special concerns:_ + - Privacy & credentials: never store passwords; if tokens are produced by upstream (session tokens), only store them + if required and encrypted — prefer not to store service tokens in hero_snapshots. + - Big payload handling: ETL must be resource-aware and possibly chunk processing (avoid loading whole payload into + memory when unnecessary). +- _Edge / error flows:_ + - Upstream blocks extension creation during subsequent DB operations (see `migrations`) — worker must log the issue + and the bootstrap must offer fallback. + - If `jq` not present locally, script must inform user how to extract NameCode manually — include instructions in + README. + +**Journey C — Developer local setup & bootstrap DB** + +- _Actor(s):_ Dev (Jordan) +- _Preconditions:_ + - Developer cloned repository, has `Node/PNPM` installed, has a local Postgres instance or connection string. + - .env.example filled with `DATABASE_URL` and `PGSSLMODE` if necessary. +- _Trigger:_ + - Developer runs `./scripts/bootstrap-db.sh` or `pnpm run db:bootstrap` to initialize schema and seeds. +- _Main flow:_ + 1. Script checks env vars and tools (`pnpm`, `psql`). + 2. Runs `pnpm install --frozen-lockfile`. + 3. Runs migrations via `node-pg-migrate up --config database/migration-config.js`. + 4. Runs seed `SQL` files idempotently. + 5. Optionally inserts sample snapshot(s) from `examples/get_hero_profile_*.json` to test `ETL`. + 6. Start worker locally (e.g., `pnpm run worker`) to process snapshots. +- _Success criteria:_ + - Migrations run without errors on local DB. + - Seeds applied idempotently. + - Example snapshot processed, normalized tables populated. +- _Edge / error flows:_ + - Missing dependencies (`pnpm`) → script fails with clear guidance to install. + - CREATE EXTENSION `uuid-ossp` / `pgcrypto` permission denies → script explains fallback and documents manual DBA + steps (ask provider to enable extension or use `gen_random_uuid()`). + - Migration partially applied and fails halfway → migrations are transactional; if not, document manual rollback + steps and ensure tests cover this. + +**Journey D — Admin applies DB bootstrap in production (manual GitHub Action)** + +- _Actor(s):_ DevOps / PO / Authorized Engineer +- _Preconditions:_ + - Repository secrets configured (`secrets.DATABASE_URL`, `secrets.PGSSLMODE`). + - Environment protection set up (GitHub Environments) and access to approve runs. +- _Trigger:_ + - Authorized user triggers GitHub Action workflow (`workflow_dispatch`) to bootstrap DB. +- _Main flow:_ + 1. Workflow runs checkout, setup `node` & `pnpm`, installs deps. + 2. It sets `DATABASE_URL` and `PGSSLMODE` from secrets into environment. + 3. Runs `pnpm run migrate:up` (`node-pg-migrate`) to apply migrations. + 4. Runs seeds and validates by listing tables or running sanity queries. + 5. Logs and artifacts stored; if environment requires approval, job waits until approved. +- _Success criteria:_ + - Migrations applied and verified via post-run sanity checks. + - No destructive operations performed without explicit manual approval. +- _Edge / error flows:_ + - Migration fails due to missing extension permission → abort and log cause; provide remediation steps (request + extension enabling from provider or run alternate migration). 
+ - Secrets are misconfigured → workflow fails early with clear error; do not leak secrets in logs. + - If workflow times out: workflow status = failed, notify stakeholders and provide snapshot of DB state. + +**Journey E — ETL Worker processes snapshots** + +- _Actor(s):_ `ETL` Worker (background service), `Queue system` (Redis/BullMQ) +- _Preconditions:_ + - Snapshot saved in `hero_snapshots` with `processing=false`. + - Queue and worker services are running and have access to `DATABASE_URL`/`PGSSLMODE`. +- _Main flow:_ + 1. On insertion of hero_snapshots row, enqueue snapshot id. + 2. Worker atomically marks `row.processing=true` (optimistic lock) to claim job. + 3. Worker parses raw JSON in stream-safe manner and upserts: + - `users` (namecode, username, summary) + - `user_troops` (upsert per troop_id) + - `user_pets`, `user_artifacts` + - `guilds` & `guild_members` + - `user_profile_summary` (denormalized) + 4. Worker writes audit/log entries for changes and sets `processed_at` timestamp and `processing=false`. + 5. Worker emits telemetry: `processed_count`, `duration`, `errors`. +- _Success criteria:_ + - Worker marks snapshot processed and normalized tables reflect data consistently. + - Worker is idempotent: reprocessing same snapshot does not duplicate or corrupt data. +- _Edge / error flows:_ + - Snapshot `JSON` malformed → worker logs error, writes to error queue and marks snapshot with error metadata ( + `error_message`, `error_count`). Trigger alert if error rate exceeds threshold. + - Partial failure during upsert (e.g., FK violation due to missing catalog row) → worker should roll back the + transaction for that entity and optionally continue with others; record failure for manual review. + - Upstream likely to produce new fields → worker should ignore unknown fields by default and write the raw snapshot + so reprocessing is possible. + +**Journey F — Bot command / UI reads profile summary** + +- _Actor(s):_ Bot Operator (Riley), Player (Alex) +- _Preconditions:_ + - `user_profile_summary` row exists for the user (`ETL` completed). + - `API route` or DB read permission in place for the bot server. +- _Main flow:_ + 1. User triggers bot command /profile or web UI loads profile. + 2. Backend queries `user_profile_summary`; on cache miss, fallback to latest `hero_snapshots` processed row and + render minimal view. + 3. Return formatted data (teams, equipped pet, top troops). +- _Success criteria:_ + - Bot command responds within target latency (`p95 < 200ms`). + - Data is consistent with latest processed snapshot. +- _Edge / error flows:_ + - No `profile_summary` exists → fall back to latest processed `hero_snapshots` or respond with friendly message ( + e.g., "Profile not processed yet; try again in a minute"). + - DB query times out → bot returns an error and logs to monitoring. + +**Journey G — Data analyst / reporting flow** + +- _Actor(s):_ Data Analyst (Morgan) +- _Preconditions:_ + - Normalized data present in tables; historical `user_progress` snapshots exist. + - Access controls and read-only DB users available for analytics queries. +- _Main flow:_ + 1. Analyst runs queries/aggregations on normalized tables (e.g., troop distribution, top players). + 2. Queries use indexes and materialized views if provided. + 3. For heavier `BI` runs, analyst may extract data to a warehouse. +- _Success criteria:_ + - Queries complete in reasonable time (depends on dataset size); heavy analytics offloaded to dedicated worker or + snapshot export. 
+- _Edge / error flows:_ + - Analyst needs a field not yet normalized → request to dev team to extend `ETL` or use `JSONB` queries as a + stopgap. + - Very large scans → recommend creating materialized views or exporting to data warehouse. + +--- + +## Edge flows and error handling (cross-cutting) + +Below are generalized edge/error conditions that affect multiple journeys, with recommended recovery/mitigation steps. + +1. Malformed or truncated JSON + - Behavior: ETL fails to parse; worker records error and marks snapshot with error metadata. + - Mitigation: + - Keep raw snapshot for debugging. + - Worker writes detailed error logs and a searchable error table. + - Provide a rerun endpoint / worker command to reattempt reprocessing after fixes. + +2. Upstream rate limiting or timeouts + - Behavior: fetch scripts fail intermittently. + - Mitigation: + - Implement exponential backoff and retry policies in the fetch client. + - Expose an ETA to user when request is delayed. + - Respect upstream rate limits, log upstream status. + +3. Large payload memory pressure (2–3MB or more) + - Behavior: Worker OOM or degraded latency. + - Mitigation: + - Stream JSON parsing where possible; avoid loading whole payload as a single in-memory object. + - Break ETL into smaller per-entity transactions; publish partial progress. + - Monitor memory usage and provide autoscaling for worker pool. + +4. CREATE EXTENSION permission denied (uuid-ossp / pgcrypto) + - Behavior: migration fails on extension creation. + - Mitigation: + - Use pgcrypto/gen_random_uuid() by default (less often blocked on Supabase). + - Document fallback steps in bootstrap script and MIGRATIONS.md. + - For providers that forbid extension creation, document required provider-side ops or use alternative ID + generation. + +5. Missing or misconfigured secrets (DATABASE_URL, PGSSLMODE) + - Behavior: bootstrap workflow fails early; logs should not leak secrets. + - Mitigation: + - Validate secrets in workflow pre-check step and fail fast with actionable message. + - Use GitHub-enforced environments and secrets policies; rotate keys periodically. + +6. Concurrent migrations / long-running migration locks + - Behavior: schema upgrades might block workers or queries. + - Mitigation: + - Make migrations idempotent and short; avoid long-running table rewrites where possible. + - Use maintenance windows for high-risk changes and communicate to stakeholders. + - Provide migration rollback plan and DB backups. + +7. Duplicate snapshot submissions + - Behavior: same snapshot inserted multiple times (identical raw content). + - Mitigation: + - Compute and store content hash (e.g., SHA256) on hero_snapshots and use uniqueness constraints to avoid + duplicates; still support duplicates if needed but mark as duplicates. + - ETL idempotency: worker uses snapshot id to ensure a single processed outcome. + +8. Data privacy & credential leakage + - Behavior: login flows may produce tokens or personal info. + - Mitigation: + - Never persist user passwords. If upstream returns session tokens, treat them as secrets and only store if + strictly required and encrypted. + - Mask PII in logs; redact tokens and other sensitive fields. + - Document retention and deletion policy (Data Retention doc). + +9. Partial ETL due to schema mismatch + - Behavior: new fields are added upstream that worker doesn’t understand; upsert can fail due to FK references. + - Mitigation: + - Worker should ignore unknown fields by default and capture them under an `extra` JSONB column. 
+ - Maintain comprehensive tests with sample payloads (cover old and new payload shapes). + - Provide a reprocessing flow after migration to populate new columns. + +--- + +## Acceptance criteria & "done" checklist for journeys + +These criteria should be used by QA/Product to mark features done for the Player Profile & DB Foundation scope: + +- hero_snapshots creation: + - [ ] Raw snapshots inserted reliably for both NameCode fetch and login fetch. + - [ ] size_bytes recorded and a content hash stored. +- ETL worker: + - [ ] Worker processes snapshots and sets processed_at on completion. + - [ ] Upserts are idempotent: reprocessing a snapshot produces no duplicate rows. + - [ ] Worker writes meaningful logs on success/failure and emits metrics. +- Normalized schema: + - [ ] users, user_troops, user_pets, user_artifacts, user_teams, guilds, guild_members, user_profile_summary exist + with indexes. + - [ ] A small sample JSON (provided) can be processed end-to-end in local setup. +- Developer experience: + - [ ] scripts/bootstrap-db.sh runs locally, applies migrations and seeds idempotently. + - [ ] .env.example documents required variables and example values (non-sensitive). +- Operational safety: + - [ ] db-bootstrap GitHub Action is manual, reads secrets only from GitHub Secrets and runs in protected environment + requiring approval. + - [ ] Backups and rollback runbooks exist and tested. +- Security & privacy: + - [ ] No credentials are stored in plaintext; logging redacts tokens/PII. + - [ ] Data retention policy documented. + +--- + +## 4. User Stories & Use Cases + +This section translates the product goals, personas and user journeys into concrete epics, prioritized user stories with +acceptance criteria (Given/When/Then) and detailed use cases. Use this section to populate backlog tickets and to drive +implementation, QA and acceptance. + +--- + +### 4.1 Epics + +Each epic groups related functionality into a deliverable area. Use them as high‑level backlog buckets. + +- EPIC-DB-FOUNDATION + - Goal: Establish a reproducible, versioned database foundation (migrations, bootstrap, seeds) and developer + onboarding. + - Includes: node-pg-migrate integration, bootstrap scripts, CI workflow for manual/protected bootstrap, seed data + and sample payloads. + +- EPIC-SNAPSHOT-INGESTION + - Goal: Reliably store raw player profile snapshots (JSONB) with metadata and deduplication, and provide + retention/archival policy. + - Includes: hero_snapshots table, content hashing, source attribution, size and server_time capture, duplicate + detection. + +- EPIC-ETL-WORKER + - Goal: Background worker that normalizes snapshots into indexed relational tables, is idempotent, resilient and + observable. + - Includes: queue design, claim/processing semantics, per-entity upserts (users, user_troops, user_pets, + user_artifacts, user_teams, guilds), error handling and reprocess API. + +- EPIC-API-BACKEND & BOT + - Goal: Provide low-latency read APIs and bot commands that use denormalized summary tables with graceful fallback + to raw snapshots. + - Includes: /profile/summary endpoint, bot slash command, admin endpoints (reprocess, health). + +- EPIC-ANALYTICS & EXPORTS + - Goal: Enable analysts to query normalized data, create materialized views for heavy aggregations and export data + for BI. + - Includes: materialized views, export jobs, schema documentation. 
+ +- EPIC-DEVEX & DOCS + - Goal: Developer experience and onboarding documentation to run migrations, local bootstrap, worker and test ETL + with sample payloads. + - Includes: docs/DB_MIGRATIONS.md, docs/ETL_AND_WORKER.md, scripts/bootstrap-db.sh, ingest-sample.sh. + +- EPIC-SECURITY & PRIVACY + - Goal: Ensure credentials and PII are never leaked or stored insecurely, document retention and GDPR + considerations, redact secrets in logs. + - Includes: logging rules, retention policy implementation, secrets handling guidelines. + +- EPIC-OBSERVABILITY & OPERATIONS + - Goal: Provide metrics, alerts and runbooks for ETL and DB bootstrap operations, enable safe on-call operations. + - Includes: Prometheus metrics, health endpoints, runbooks/incident procedures. + +--- + +### 4.2 User stories (with acceptance criteria) + +Stories are grouped by epic and prioritized (P0 = must-have, P1 = important, P2 = nice-to-have). Each story includes a +short description and acceptance criteria formatted as Given/When/Then. + +EPIC-DB-FOUNDATION + +- STORY-DB-001 — Add versioned migrations using node-pg-migrate (P0) + - Description: Add node-pg-migrate configuration and initial migration(s) creating normalized schema and + hero_snapshots table, using pgcrypto/gen_random_uuid(). + - Acceptance: + - Given a fresh Postgres instance and a valid DATABASE_URL, when the dev runs `pnpm migrate:up`, then migrations + complete without error and expected tables (users, hero_snapshots, user_troops, user_pets, user_artifacts, + user_teams, guilds, guild_members, user_profile_summary, feature_flags) exist. + +- STORY-DB-002 — Add bootstrap script & protected GitHub Action (P0) + - Description: Provide scripts/bootstrap-db.sh and a manual GitHub Action workflow (workflow_dispatch) that runs + migrations and seeds using repository secrets and environment protection. + - Acceptance: + - Given secrets configured and approver permissions, when a maintainer triggers the workflow, then it completes + successfully and runs a sanity check query (e.g., lists expected tables) and stores logs as artifacts. + +- STORY-DB-003 — Provide idempotent seed and schema validation (P1) + - Description: Ensure seed scripts are idempotent and include schema validation checks to confirm critical indexes + and extensions. + - Acceptance: + - Given seeds run multiple times, when executed again, then database state remains consistent and idempotent + without duplicate rows. + +EPIC-SNAPSHOT-INGESTION + +- STORY-SNAP-001 — Persist raw snapshot with metadata (P0) + - Description: On every fetch/login response, persist the raw JSON into hero_snapshots JSONB along with size_bytes, + content_hash (SHA256), source, server_time (if present). + - Acceptance: + - Given a valid response JSON, when backend inserts snapshot, then hero_snapshots contains a row with raw JSON, + size_bytes > 0, content_hash set and created_at timestamp present. + +- STORY-SNAP-002 — Duplicate detection / short-window dedupe (P1) + - Description: If identical snapshot payload (same content_hash) is submitted within a configurable short window ( + e.g., 60s), mark as duplicate instead of inserting a full second row. + - Acceptance: + - Given identical payloads submitted twice within the dedupe window, when the second insertion occurs, then a + duplicate record link or duplicate_count is recorded and no duplicate raw row is inserted. 
- STORY-SNAP-003 — Snapshot ingest API and CLI integration (P0)
  - Description: Provide a backend ingestion endpoint and client scripts that call it; the CLI (get_hero_profile.sh)
    should save its output locally and optionally POST it to the ingestion endpoint.
  - Acceptance:
    - Given a successful get_hero_profile result, when the CLI runs in non-login mode, then the file is saved locally
      and optionally posted to the ingestion endpoint when configured.

EPIC-ETL-WORKER

- STORY-ETL-001 — Background worker: claim/process/update (P0)
  - Description: Implement a queue + worker which atomically claims a snapshot (processing flag), parses it, upserts
    normalized tables and sets processed_at. The worker must be idempotent.
  - Acceptance:
    - Given a hero_snapshots row with processing=false, when the worker processes it, then processed_at is set,
      processing=false and normalized tables reflect the parsed data; re-running the worker on the same snapshot does
      not create duplicates.

- STORY-ETL-002 — Stream-aware parsing for large payloads (P1)
  - Description: ETL must handle large payload arrays (troops) without fully loading the JSON into memory; use
    streaming or chunked processing where applicable.
  - Acceptance:
    - Given a ~3MB snapshot processed on a low-memory instance, when the worker runs, then processing completes
      without OOM and within a reasonable time.

- STORY-ETL-003 — Preserve unmapped fields to `extra` JSONB (P1)
  - Description: Unknown/new fields in the upstream payload are stored under `extra` JSONB fields on the relevant
    normalized rows to allow later reprocessing/analysis.
  - Acceptance:
    - Given a snapshot containing fields not mapped in the schema, when the worker upserts, then those fields are
      saved under `extra` on the appropriate entity row and do not cause failures.

- STORY-ETL-004 — Partial-upsert strategy & compensating actions (P1)
  - Description: ETL should perform per-entity transactions so a failure on one entity does not roll back unrelated
    entities; capture failed entity errors for manual review.
  - Acceptance:
    - Given a snapshot where the user_troops upsert fails due to an unexpected FK, when the worker processes the
      snapshot, then the user and other entities are still upserted and an error record is created for the failing
      entity.

- STORY-ETL-005 — Reprocess API for admins (P1)
  - Description: Admin endpoint to enqueue a snapshot for reprocessing; the endpoint requires authentication and logs
    the action.
  - Acceptance:
    - Given a snapshot id exists, when an admin posts a reprocess request, then the snapshot is enqueued and a job id
      or 202 response is returned.

EPIC-API-BACKEND & BOT

- STORY-API-001 — Fast profile summary endpoint (P0)
  - Description: Implement GET /api/profile/summary/:namecode returning denormalized fields from user_profile_summary,
    falling back to the latest processed hero_snapshot row. (A minimal handler sketch follows STORY-API-002 below.)
  - Acceptance:
    - Given a processed profile, when GET /profile/summary/:namecode is called, then the service returns the summary
      and p95 response time is <200ms in staging.

- STORY-API-002 — Bot slash command `/profile <namecode>` (P0)
  - Description: Bot command which calls the summary API and formats a short embed for Discord (level, top troops,
    equipped pet).
  - Acceptance:
    - Given a summary exists, when a player executes the slash command, then the bot replies with a formatted embed
      within the bot command timeout window.
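To make STORY-API-001 concrete, here is a minimal handler sketch, including the pending-profile fallback that
STORY-API-003 formalizes. It assumes Express and the `pg` client (neither is mandated by this PRD); the route, table
names and the 30s ETA come from the API contract in §5.2, everything else is illustrative.

```js
// Sketch for GET /api/profile/summary/:namecode (STORY-API-001 / STORY-API-003).
const express = require("express");
const { Pool } = require("pg");

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

app.get("/api/profile/summary/:namecode", async (req, res) => {
  const { namecode } = req.params;
  try {
    // Fast path: denormalized summary maintained by the ETL worker.
    const summary = await pool.query(
      `SELECT s.* FROM user_profile_summary s
         JOIN users u ON u.id = s.user_id
        WHERE u.namecode = $1`,
      [namecode]
    );
    if (summary.rowCount > 0) return res.json(summary.rows[0]);

    // Fallback: latest processed raw snapshot; real code would build an
    // ad-hoc summary from its raw JSONB and attach freshness metadata.
    const snap = await pool.query(
      `SELECT id, processed_at FROM hero_snapshots
        WHERE namecode = $1 AND processed_at IS NOT NULL
        ORDER BY processed_at DESC LIMIT 1`,
      [namecode]
    );
    if (snap.rowCount > 0) {
      return res.json({ source: "snapshot", snapshot_id: snap.rows[0].id });
    }

    // Nothing processed yet (a full implementation would distinguish the
    // 404 "no profile at all" case from a genuinely pending snapshot).
    return res
      .status(202)
      .json({ message: "Profile processing in progress", estimated_ready_in: "30s" });
  } catch (err) {
    console.error("profile summary failed", err); // redact details per EPIC-SECURITY
    return res.status(500).json({ error_id: "profile-summary-failure" });
  }
});

module.exports = app;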
+ +- STORY-API-003 — Friendly fallback when profile pending (P1) + - Description: If a summary is not available, API returns 202 with an ETA message or returns best-effort data from + latest processed snapshot and indicates freshness. + - Acceptance: + - Given no summary yet, when API invoked, then response communicates status (202 or best-effort) and includes + next estimated processing ETA. + +EPIC-ANALYTICS & EXPORTS + +- STORY-AN-001 — Materialized view for troop ownership (P2) + - Description: Build a materialized view summarizing troop ownership counts and last_updated times to speed + analytics queries. + - Acceptance: + - Given data in user_troops, when view is refreshed, then queries for top owners return in acceptable query time + and reflect recent data after refresh. + +- STORY-AN-002 — Export job to S3 (P2) + - Description: Implement job to export normalized tables (CSV/Parquet) to S3 for BI ingestion. + - Acceptance: + - Given an export request, when the job runs, then export files appear in S3 and contain expected columns, with + an audit entry. + +EPIC-DEVEX & DOCS + +- STORY-DEV-001 — Provide sample JSONs and ingest script (P0) + - Description: Include representative get_hero_profile JSON samples and a script to insert them into a local DB and + trigger local worker. + - Acceptance: + - Given local DB and worker, when developer runs ingest sample script, then normalized tables populate and + manual verification queries succeed. + +- STORY-DEV-002 — Developer onboarding doc (P0) + - Description: docs/DB_MIGRATIONS.md and docs/ETL_AND_WORKER.md with step-by-step local setup, env vars, and common + troubleshooting. + - Acceptance: + - A new developer following docs can bootstrap local DB and process a sample snapshot without outside help. + +EPIC-SECURITY & PRIVACY + +- STORY-SEC-001 — Redact tokens and never persist passwords (P0) + - Description: Ensure scripts and worker never persist raw credentials; redact tokens in logs and redact PII + according to policy. + - Acceptance: + - Given a snapshot containing tokens/PII, when storing or logging, then passwords are never saved, tokens are + redacted and logs do not contain raw secret strings. + +- STORY-SEC-002 — Implement snapshot retention & archival job (P1) + - Description: Background job to archive or delete snapshots older than retention threshold (e.g., 90 days) and + document policy. + - Acceptance: + - Given snapshots older than retention, when retention job runs, then snapshots are archived to S3 (or deleted) + and audit entries exist. + +EPIC-OBSERVABILITY & OPERATIONS + +- STORY-OPS-001 — ETL metrics & health endpoint (P1) + - Description: Worker exposes metrics (processed_count, failure_count, duration_histogram) and a /health endpoint + for orchestration. + - Acceptance: + - Metrics are scraped by Prometheus and show non-zero values after processing; alerts trigger when + failure_rate > configured threshold. + +- STORY-OPS-002 — Runbook for DB bootstrap and ETL incident (P1) + - Description: Create runbooks with steps for manual intervention: how to re-run migrations safely, how to + re-enqueue snapshots, how to roll back. + - Acceptance: + - On-call can follow runbook to safely re-run ETL or bootstrap with minimal assistance. + +--- + +### 4.3 Detailed use cases + +Use cases describe step-by-step flows, actors, preconditions, main flows, alternative flows and postconditions. These +are more detailed than user stories and map to acceptance tests. 
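As one illustration of that mapping, the sketch below turns STORY-ETL-001's idempotency criterion into an integration
test, assuming a Jest-style runner against a disposable test database. `insertSampleSnapshot`, `runWorkerOnce` and the
fixture path are hypothetical test helpers, not part of this PRD.

```js
// Idempotency check for STORY-ETL-001, run against a throwaway test DB.
const { Pool } = require("pg");
const { insertSampleSnapshot, runWorkerOnce } = require("./helpers"); // hypothetical

const pool = new Pool({ connectionString: process.env.TEST_DATABASE_URL });

test("reprocessing a snapshot creates no duplicate rows", async () => {
  // Given: a raw snapshot loaded from an examples/ fixture.
  const snapshotId = await insertSampleSnapshot(pool, "examples/get_hero_profile_sample.json");

  // When: the worker processes the snapshot, then it is reset (as a reprocess
  // request would do, see UC-103) and processed a second time.
  await runWorkerOnce(pool, snapshotId);
  const first = await pool.query("SELECT count(*) FROM user_troops");
  await pool.query("UPDATE hero_snapshots SET processed_at = NULL WHERE id = $1", [snapshotId]);
  await runWorkerOnce(pool, snapshotId);
  const second = await pool.query("SELECT count(*) FROM user_troops");

  // Then: row counts are unchanged and processed_at is set again.
  expect(second.rows[0].count).toBe(first.rows[0].count);
  const snap = await pool.query("SELECT processed_at FROM hero_snapshots WHERE id = $1", [snapshotId]);
  expect(snap.rows[0].processed_at).not.toBeNull();
});
```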
+ +#### UC-100 — Fetch profile by NameCode + +- ID: UC-100 +- Actors: Player (Alex), Ingestion API (or local CLI that posts), Backend, DB (hero_snapshots), ETL Worker, Bot/UI +- Preconditions: + - Player provides a valid NameCode. + - System has network access to upstream get_hero_profile endpoint or the player runs CLI to fetch and POST the + result. +- Main flow: + 1. Player uses CLI or UI to request profile by NameCode. + 2. System (client) calls upstream get_hero_profile and receives JSON body and response headers. + 3. System computes content_hash (SHA256) and size_bytes. + 4. If a snapshot with same content_hash exists within dedupe window, mark as duplicate and link; else insert into + hero_snapshots with metadata: source="fetch_by_namecode", created_at, server_time. + 5. System enqueues snapshot id into processing queue. + 6. Worker claims snapshot (atomic update processing=true). + 7. Worker parses snapshot: + - Upsert users table (user row keyed by NameCode/Id). + - Upsert user_troops (amount, level, rarity, fusion_cards, traits_owned, extra). + - Upsert user_pets, user_artifacts. + - Upsert user_teams (array of troop ids), guilds and guild_members if present. + - Update user_profile_summary with denormalized fields (level, top 5 troops, equipped pet, PvP tier, guild + name). + 8. Worker sets processed_at and clears processing flag; emits metrics. + 9. Bot/UI queries /profile/summary and returns the summary to the player. +- Alternative flows: + - Upstream failure (network or rate limit): client retries with backoff; no snapshot inserted until successful or + cached error logged. + - Snapshot malformed: worker records error metadata, does not set processed_at; admin alerted if repeated failures. + - Duplicate snapshot detected: system increments duplicate counter and optionally returns existing snapshot id to + client. +- Postconditions: + - hero_snapshots contains an inserted or linked snapshot record. + - Normalized tables reflect the snapshot if ETL succeeded. + - Metrics recorded for ingestion latency and ETL time. + +#### UC-101 — Login-based profile fetch (large payload) + +- ID: UC-101 +- Actors: Player (Alex), CLI script, Upstream login endpoint, Ingestion API/Backend, DB, ETL worker +- Preconditions: + - Player runs get_hero_profile.sh --login locally and supplies credentials in interactive prompt. + - CLI prompts must not persist passwords anywhere on disk or in repo. +- Main flow: + 1. CLI posts login_user payload to upstream and receives a login response containing session info and possibly + NameCode. + 2. CLI saves the login response locally (file) for debugging; it should not POST credentials to ingestion service. + 3. If NameCode present or if CLI requests final profile, CLI triggers get_hero_profile call to retrieve large + profile JSON (~2–3MB). + 4. Follow UC-100 main flow to insert snapshot and process. +- Alternative flows: + - jq missing locally: CLI instructs the user to install jq or to extract NameCode manually and aborts. + - Upstream returns token: CLI redacts token when saving to file and does not persist tokens in ingestion system + unless explicitly authorized and encrypted. +- Postconditions: + - Snapshot stored and queued; no passwords persisted. + - Large snapshot processed by ETL in streaming/chunked manner. 
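A minimal sketch of UC-100 steps 3–5 (hash, dedupe, persist, enqueue), using the 60-second dedupe window that Feature A
in §5.2 lists as the default. The function name, the `pg` client usage and the injected `enqueueSnapshot` callback are
illustrative, not prescribed.

```js
// UC-100 steps 3-5: compute content hash, dedupe, persist, enqueue.
const crypto = require("crypto");

async function ingestSnapshot(pool, { source, namecode, payload }, enqueueSnapshot) {
  const raw = JSON.stringify(payload);
  const contentHash = crypto.createHash("sha256").update(raw).digest("hex");

  // Short-window dedupe on content_hash (window configurable; 60s default).
  // A real implementation would also bump duplicate_count / write an audit event.
  const dup = await pool.query(
    `SELECT id FROM hero_snapshots
      WHERE content_hash = $1 AND created_at > now() - interval '60 seconds'
      LIMIT 1`,
    [contentHash]
  );
  if (dup.rowCount > 0) {
    return { snapshotId: dup.rows[0].id, status: "duplicate" };
  }

  // Persist the raw JSONB plus metadata.
  const inserted = await pool.query(
    `INSERT INTO hero_snapshots (namecode, source, raw, size_bytes, content_hash)
     VALUES ($1, $2, $3::jsonb, $4, $5)
     RETURNING id`,
    [namecode, source, raw, Buffer.byteLength(raw), contentHash]
  );

  // Hand the snapshot id to the ETL queue.
  await enqueueSnapshot(inserted.rows[0].id);
  return { snapshotId: inserted.rows[0].id, status: "queued" };
}

module.exports = { ingestSnapshot };
```

UC-101 reuses this same path: the login flow changes only how the payload is obtained, not how it is persisted and
queued.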
#### UC-102 — Developer local bootstrap and sample processing

- ID: UC-102
- Actors: Developer, Local Postgres, Scripts (bootstrap-db.sh, ingest-sample.sh), ETL worker
- Preconditions:
  - Developer has Node, pnpm, local Postgres or remote dev DB, and .env configured with DATABASE_URL.
- Main flow:
  1. Developer runs the bootstrap script.
  2. Script validates env vars, installs dependencies, runs migrations and idempotent seeds.
  3. Developer runs ingest-sample.sh, which inserts an example JSON into hero_snapshots.
  4. Developer starts the local worker; the worker processes the snapshot and populates normalized tables.
  5. Developer validates by running SQL queries against user_profile_summary and user_troops.
- Alternative flows:
  - Missing pg extension permission: script prints instructions and suggests a manual step or alternative UUID
    generation.
  - Seed fails: script aborts with an error and rolls back partial seeds.
- Postconditions:
  - Local DB schema and seeds applied; sample snapshot processed.

#### UC-103 — Admin reprocess snapshot

- ID: UC-103
- Actors: Admin, Admin API, Queue, Worker
- Preconditions:
  - Admin has auth to call admin endpoints.
  - Snapshot id exists in hero_snapshots.
- Main flow:
  1. Admin calls POST /admin/snapshots/:id/reprocess.
  2. API validates admin rights and snapshot existence.
  3. API clears snapshot error flags, resets processed_at if needed, and enqueues the snapshot id.
  4. Worker picks up the job and processes it idempotently.
  5. System returns job status and logs.
- Alternative flows:
  - Snapshot in progress: API returns 409 and informs the admin.
  - Snapshot not found: API returns 404.
- Postconditions:
  - Snapshot reprocessed and normalized tables updated.

#### UC-104 — Retention & archival of old snapshots

- ID: UC-104
- Actors: Cron retention job, DB, S3 (archive store)
- Preconditions:
  - Retention policy configured (days or keep last N snapshots per user).
- Main flow:
  1. Retention job selects hero_snapshots older than the retention threshold and not flagged as permanent.
  2. For each snapshot: compress and upload raw JSON to S3, store archival metadata (s3_path, archived_at) and then
     delete or mark archived in the DB.
  3. Job logs success and raises an alert on failures.
- Alternative flows:
  - S3 temporarily unavailable: job retries with exponential backoff and logs failures; if persistent, escalate.
- Postconditions:
  - Old snapshots are archived and DB storage reduced, with audit entries retained.

#### UC-105 — Materialized view refresh & analytics export

- ID: UC-105
- Actors: Analyst, Export Job, Materialized Views
- Preconditions:
  - Normalized data exists in user_troops and related tables.
- Main flow:
  1. Analyst triggers a materialized view refresh, or a scheduled job refreshes it.
  2. Analyst runs a query against the materialized view for aggregated insights (e.g., troop ownership counts).
  3. For large exports, the analyst requests an export job which writes data to S3.
- Alternative flows:
  - View refresh collides with heavy DB load: refresh is scheduled during off-peak or uses the CONCURRENTLY option
    where supported.
- Postconditions:
  - Materialized view is up-to-date and exports are available on S3.

#### UC-106 — Migration permission failure handling

- ID: UC-106
- Actors: Maintainer, GitHub Actions, Database provider
- Preconditions:
  - Migrations contain extension creation (CREATE EXTENSION IF NOT EXISTS pgcrypto).
- Main flow:
  1. Workflow runs and attempts to create the extension.
  2. 
DB provider denies permission; migration fails. + 3. Workflow captures error, aborts and notifies approver with remediation steps (enable extension or run alternate + migration). +- Alternative flows: + - Maintainer has privilege to enable extension: run remedial step then re-run migration. +- Postconditions: + - Workflow fails with clear remediation instructions and DB left in consistent state. + +--- + +## Mapping to Acceptance Tests & Tickets + +For each user story above create a ticket that contains: + +- Story description and priority. +- Acceptance criteria (Given/When/Then) copied verbatim. +- Test plan (unit tests, integration tests, local end‑to‑end). +- Example payload(s) from examples/ for test fixtures. +- Any migration steps or environment requirements. + +--- + +## Backlog & Iteration recommendations + +1. Sprint 1 (foundation): + - STORY-DB-001, STORY-DB-002, STORY-SNAP-001, STORY-DEV-001, STORY-DEV-002 (migrate, bootstrap, snapshot persist, + sample payloads, docs). + +2. Sprint 2 (ETL core + API): + - STORY-ETL-001, STORY-API-001, STORY-API-002, STORY-ETL-003 (idempotent ETL, summary endpoint, bot command, + preserve unmapped fields). + +3. Sprint 3 (resilience & ops): + - STORY-ETL-002, STORY-SNAP-002, STORY-OPS-001, STORY-SEC-002 (streaming ETL, dedupe, metrics, retention). + +4. Sprint 4 (analytics & polish): + - EPIC-ANALYTICS stories and remaining P2 items. + +--- + +## 5. Functional Requirements + +This section describes the functional scope for the Player Profile & DB Foundation project. It lists features at a high +level, provides detailed functional specifications for each major feature, defines data requirements (entities, +retention and archival rules), outlines integration requirements with external systems, and lists third‑party services +and dependencies (quotas, rate limits and SLAs). + +--- + +### 5.1 Feature list (high level) + +The following feature list groups work into logically cohesive capabilities that will be delivered across sprints. + +- Feature: Database Foundation & Migrations + - Versioned migrations (node-pg-migrate), bootstrap script, protected GitHub Actions workflow for manual production + bootstrap. +- Feature: Snapshot Ingestion Endpoint & CLI Integration + - Persist raw get_hero_profile JSON snapshots with metadata (size, SHA256 hash, source) to hero_snapshots. + - Deduplication within configurable window. +- Feature: Background ETL Worker + - Queue + worker that claims snapshots, parses them, and upserts normalized tables (users, user_troops, user_pets, + user_artifacts, user_teams, guilds, guild_members, user_profile_summary). + - Idempotent processing, per-entity transactions, streaming/chunked parsing for large payloads. +- Feature: Profile Summary API & Bot Commands + - Low‑latency read endpoints using profile_summary; slash command integration for Discord bot. + - Friendly fallbacks when summary is pending. +- Feature: Admin & Operational Endpoints + - Reprocess snapshot API, health and metrics endpoints, retention/archival job control. +- Feature: Analytics & Exports + - Materialized views and export jobs to S3 (CSV/Parquet) for BI pipelines. +- Feature: Security & Compliance Controls + - Redaction of tokens in logs, never persist passwords, snapshot retention policies and GDPR-related deletion flows. +- Feature: Observability & Runbooks + - Metrics (Prometheus), logs (structured), alerts and on-call runbooks for ETL and DB bootstrap operations. 
+- Feature: Developer Experience + - Sample payloads, local bootstrap scripts, documentation (DB_MIGRATIONS.md, ETL_AND_WORKER.md) and automated tests + for ETL idempotency. + +--- + +### 5.2 Detailed functional specification (per feature) + +Below are detailed specifications for the highest-priority features. Each feature includes overview, inputs/outputs, +UI/UX behavior (when applicable), API contract examples, business rules & validations, and error handling. + +Feature A — Snapshot Ingestion (API + CLI integration) + +- Overview + - Receive and persist raw player profile snapshots returned by get_hero_profile (NameCode fetch or login flow). + Capture metadata to detect duplicates and feed ETL pipeline. +- Inputs / outputs + - Inputs: + - JSON body (raw get_hero_profile payload) + - HTTP headers (optional: upstream server_time) + - Query params / metadata: source (string), client_name (string), content_hash optional + - Outputs: + - DB insert into hero_snapshots: id (UUID), user_id (nullable), namecode (optional), source, raw JSONB, + size_bytes, content_hash (SHA256), server_time, created_at + - HTTP response with snapshot id and status +- UI/UX behavior + - CLI: get_hero_profile.sh saves raw JSON locally and can POST to ingestion endpoint if configured; CLI prints + snapshot id and next steps (e.g., "snapshot queued for processing — check /profile/summary in ~30s"). + - Web/UI: Button or action to “Fetch profile by NameCode” returns immediate acknowledgement (202) and GUID. +- API contract (example) + - Endpoint: POST /api/internal/snapshots + - Method: POST + - Auth: Bearer token (service), or limited API key for CLI; endpoint restricted to internal clients + - Request JSON: + { + "source": "fetch_by_namecode" | "login", + "namecode": "COCORIDER_JQGB", + "payload": { ... full get_hero_profile JSON ... } + } + - Response: + - 201 Created + { + "snapshot_id": "uuid", + "status": "queued", + "created_at": "2025-11-28T12:34:56Z" + } + - 409 Duplicate (optional) + { + "snapshot_id": "existing-uuid", + "status": "duplicate" + } +- Business rules & validations + - Validate that payload is JSON and non-empty. + - Compute SHA256(content) as content_hash. If a snapshot with same content_hash exists within dedupe_window ( + configurable, default 60s), return duplicate response (do not insert duplicate raw row) but record an attempt ( + duplicate_count) or an audit event. + - size_bytes recorded as byte length of payload. + - If namecode present in payload, attempt to map to existing users record (user_id) if a match exists. + - Enqueue snapshot id to ETL queue after successful insert. +- Error handling & messages + - 400 Bad Request if payload missing or invalid JSON. + - 401 Unauthorized if auth fails. + - 413 Payload Too Large if size exceeds configured maximum (reject or return 413 and instruct to use CLI with + chunking). + - 500 Internal Server Error on DB/queue problems; response includes safe error id for support tracing (no sensitive + data). + +Feature B — Background ETL Worker (core normalization) + +- Overview + - Asynchronous worker that consumes snapshot ids, parses raw JSON, and upserts normalized relational records. + Designed to be idempotent, stream-friendly and observable. 
+- Inputs / outputs + - Inputs: + - snapshot id (UUID) + - hero_snapshots.raw JSONB + - Outputs: + - Upserts to normalized tables (users, user_troops, user_pets, user_artifacts, user_teams, guilds, + guild_members, user_profile_summary, user_progress) + - Snapshot processed metadata: processed_at, processing flag cleared, error metadata if failure + - Metrics emitted (duration_ms, items_processed, failure_count) +- UI/UX behavior + - Not user-facing directly. Admin UI may show snapshot processing status and allow reprocess action. +- Processing steps (core) + 1. Claim snapshot: atomically set processing=true where processing=false to avoid duplicate claims (SQL WHERE + processing=false RETURNING id). + 2. Parse JSON safely, using streaming / chunked processing for large arrays (troops, inventories). + 3. Begin per-entity upsert transaction: + - Upsert users by unique keys (namecode, discord_user_id). Use ON CONFLICT for idempotency. + - Upsert user_troops: for each troop record create or update unique (user_id, troop_id). + - Upsert pets, artifacts, teams similarly. + - Upsert guilds and guild_members if present. + - Create/update user_profile_summary (denormalized for quick reads). + - Save any unmapped fields into `extra` JSONB on each row or into a special unmapped_fields table for audit. + 4. Commit and set processed_at timestamp; if partially failed, capture entity-level errors and write to etl_errors + table. +- API contract (admin) + - Endpoint: POST /api/admin/snapshots/:id/reprocess + - Method: POST + - Auth: Admin-level JWT or API key + - Response: + - 202 Accepted { "job_id": "uuid", "status": "enqueued" } + - 404 Not Found if snapshot id unknown + - 409 Conflict if snapshot currently processing +- Business rules & validations + - Worker must be idempotent: repeated processing of same snapshot id must not create duplicates nor corrupt state. + - For large snapshots, break work into smaller DB transactions per-entity to reduce lock contention; do not hold one + monolithic transaction for whole snapshot. + - Unknown fields must not cause failure; store them under `extra` and emit a telemetry event for later mapping. + - If upsert fails due to referential integrity (missing catalog row), optionally create a placeholder catalog row or + write the failure to etl_errors for manual resolution (configurable behavior). + - Processing attempts limited by retry policy (exponential backoff). After N failed attempts, mark snapshot as + failed and alert. +- Error handling & messages + - If parsing error: set snapshot.processing=false, processed_at=NULL, write detailed error into etl_errors (include + snapshot_id, exception, stack, truncated raw snippet for debugging), and notify via alerting channel. + - If DB deadlock or transient error: retry automatically per policy. + - If permanent error (schema mismatch / unknown severe condition): mark snapshot failed, do not retry, and create a + manual work item for engineers. + +Feature C — Profile Summary API & Bot + +- Overview + - Expose low-latency read APIs and a Discord bot command that returns denormalized profile summaries built by the + ETL. +- Inputs / outputs + - Inputs: + - namecode or user_id param + - Outputs: + - JSON summary with fields: namecode, username, level, top_troops [ {troop_id, amount, level} ], equipped_pet, + pvp_tier, guild {id, name}, last_seen, summary_generated_at +- UI/UX behavior + - Bot: Slash command `/profile ` returns a compact embed: player name, level, top 3 troops, main pet, + guild, last seen. 
Feature C — Profile Summary API & Bot

- Overview
  - Expose low-latency read APIs and a Discord bot command that return denormalized profile summaries built by the
    ETL. A read-path sketch follows this feature.
- Inputs / outputs
  - Inputs:
    - namecode or user_id param
  - Outputs:
    - JSON summary with fields: namecode, username, level, top_troops [ {troop_id, amount, level} ], equipped_pet,
      pvp_tier, guild {id, name}, last_seen, summary_generated_at
- UI/UX behavior
  - Bot: Slash command `/profile <namecode>` returns a compact embed: player name, level, top 3 troops, main pet,
    guild, last seen. If the summary is pending, the bot replies with a friendly ETA.
  - Web UI: the profile page loads the summary quickly and displays a "View raw snapshot" link for advanced users
    (requires permission).
- API contract (example)
  - Endpoint: GET /api/v1/profile/summary/:namecode
  - Method: GET
  - Auth: public read or authenticated as needed (rate limited)
  - Response:
    - 200 OK { "namecode": "...", "level": 52, "top_troops": [...], "equipped_pet": {...}, "guild": {...},
      "last_seen": "...", "cached_at": "..." }
    - 202 Accepted { "message": "Profile processing in progress", "estimated_ready_in": "30s" }
    - 404 Not Found { "message": "No profile found" }
- Business rules & validations
  - If a profile_summary exists, return it immediately.
  - If the profile_summary is missing but a processed hero_snapshot exists, build an ad-hoc summary from the latest
    processed snapshot and return it with freshness metadata.
  - Apply per-client rate limits to avoid abuse; enforce caching headers (Cache-Control) as appropriate.
- Error handling & messages
  - 404 if neither a summary nor a processed snapshot exists.
  - 429 Too Many Requests when a client exceeds rate limits.
  - 500 for backend issues, with an error id for support.
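A read-path sketch of Feature C's rules, assuming Express and pg; the ad-hoc projection field (ProfileData.Level) is
an illustrative path, since the exact payload mapping lives in ETL_AND_WORKER.md.

```js
// Profile summary read sketch: fast path, ad-hoc fallback, pending, not found.
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/api/v1/profile/summary/:namecode', async (req, res) => {
  const { namecode } = req.params;

  // Fast path: denormalized summary maintained by the ETL.
  const summary = await pool.query(
    'SELECT * FROM user_profile_summary WHERE namecode = $1',
    [namecode]
  );
  if (summary.rowCount > 0) {
    res.set('Cache-Control', 'public, max-age=30');
    return res.json(summary.rows[0]);
  }

  // Fallback: no summary yet, but a processed snapshot exists, so project ad hoc.
  const snap = await pool.query(
    `SELECT raw, processed_at FROM hero_snapshots
      WHERE namecode = $1 AND processed_at IS NOT NULL
      ORDER BY created_at DESC LIMIT 1`,
    [namecode]
  );
  if (snap.rowCount > 0) {
    const { raw, processed_at } = snap.rows[0];
    return res.json({
      namecode,
      level: raw?.ProfileData?.Level ?? null, // hypothetical field path
      cached_at: processed_at,
      freshness: 'ad-hoc',
    });
  }

  // Snapshot ingested but not processed yet: 202 with an ETA; otherwise 404.
  const pending = await pool.query(
    'SELECT 1 FROM hero_snapshots WHERE namecode = $1 LIMIT 1',
    [namecode]
  );
  if (pending.rowCount > 0) {
    return res.status(202).json({ message: 'Profile processing in progress', estimated_ready_in: '30s' });
  }
  return res.status(404).json({ message: 'No profile found' });
});
```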
Feature D — Retention & Archival Job

- Overview
  - Retention job that prunes or archives hero_snapshots older than the configured retention; supports archiving to
    S3/compatible storage with an audit trail.
- Inputs / outputs
  - Inputs:
    - Retention configuration (days to retain or N snapshots per user)
  - Outputs:
    - Archived files on S3 (optionally compressed), DB archival metadata rows, and snapshots deleted from or marked
      archived in the DB.
- Business rules & validations
  - Default retention: 90 days (configurable per environment).
  - Optionally: keep the last N snapshots per user (e.g., keep the last 30 per user).
  - Each archive must include minimal audit metadata: snapshot_id, original_size_bytes, archived_at, s3_path,
    checksum.
  - The retention job must be idempotent and resumable.
- Error handling & messages
  - When an S3 upload fails: retry with exponential backoff and escalate if thresholds are exceeded.
  - If a snapshot cannot be archived, do not delete the DB row; log it and create a ticket.

Feature E — Developer Experience & Migrations

- Overview
  - Tools and docs for dev onboarding, local bootstrap, sample ingestion and migration execution.
- Inputs / outputs
  - Inputs: developer environment, .env with DATABASE_URL, sample JSON files
  - Outputs: running local DB with the schema applied, seeded data and example snapshots processed
- Business rules & validations
  - The bootstrap script must be idempotent and provide clear error messages for missing permissions (e.g., CREATE
    EXTENSION).
  - Migrations must be reversible or documented with rollback steps.
- Error handling & messages
  - If a migration fails, scripts must print human-friendly remediation (missing extension, permission denied) and
    not leak secrets.

---

### 5.3 Data requirements

This subsection documents the data entities required, the retention and archival rules, and points to the canonical
data schema docs.

- Data entities required (primary)
  - users
    - id (UUID), namecode, username, discord_user_id, created_at, updated_at
  - hero_snapshots
    - id (UUID), user_id (nullable), namecode, source, raw JSONB, size_bytes, content_hash (SHA256), server_time,
      processing (bool), processed_at, created_at, error metadata
  - user_troops
    - id (UUID), user_id, troop_id (int), amount, level, rarity, extra JSONB, last_seen
  - user_pets
    - id (UUID), user_id, pet_id, amount, level, xp, extra JSONB
  - user_artifacts
    - id (UUID), user_id, artifact_id, level, xp, extra JSONB
  - user_teams
    - id (UUID), user_id, name, banner, troops (int array), updated_at
  - guilds
    - id (UUID), discord_guild_id, name, settings JSONB, feature_flags
  - guild_members
    - id (UUID), guild_id, user_id, discord_user_id, joined_at
  - user_profile_summary
    - user_id (PK), denormalized fields for fast reads (level, top_troops array, equipped_pet, pvp_tier, last_seen,
      cached_at)
  - etl_errors / etl_audit
    - id, snapshot_id, error_type, message, details, created_at
  - catalog tables (optional)
    - troop_catalog, pet_catalog, artifact_catalog (static metadata, seeded)
- Retention and archival rules
  - Default snapshot retention: 90 days (configurable).
  - Optionally keep the last N snapshots per user (e.g., 30). The policy is expressed as: keep the newest N OR keep
    snapshots younger than D days, whichever keeps more recent data.
  - Archival: snapshots older than the retention are compressed and uploaded to S3 (or another object store) with a
    checksum, and either deleted from the DB or marked archived (policy driven).
  - Audit: archival actions must write audit rows (archived_by_job, archived_at, s3_path, checksum).
  - PII retention: any PII detected must be handled according to DATA_PRIVACY.md — if a user requests deletion, both
    DB rows and archived files must be purged according to the legal process.
- Data schema references
  - Canonical DB model and ERD: docs/DB_MODEL.md (link). All normalized tables, indexes and constraints are defined
    in DB_MODEL.md, and migrations are generated from that model.
  - Migrations and seed files live under database/migrations/ and database/seeds/.

---

### 5.4 Integration requirements

List of external systems to integrate with, required interfaces and authentication mechanisms.

- Discord (Bot integration)
  - Purpose: present profiles to players via slash commands; optionally link NameCode to Discord accounts.
  - Integration:
    - OAuth2 / Bot token stored in GitHub Secrets (DISCORD_TOKEN).
    - Use Discord Gateway intents as required (presence if needed).
    - Rate limit handling: respect Discord API limits; implement backoff and retries.
  - Permissions:
    - The bot must request only the required scopes and have a clear privacy policy for data usage.

- Supabase / Postgres (Primary DB)
  - Purpose: host hero_snapshots and normalized tables.
  - Integration:
    - Use DATABASE_URL from GitHub Secrets or environment variables.
    - Migrations run via node-pg-migrate; use the pgcrypto extension (or a documented alternative).
    - Use separate DB roles for app writes and read-only analytics.
  - Constraints:
    - CREATE EXTENSION permissions may be restricted on managed providers; bootstrap scripts must handle permission
      errors gracefully.

- Redis / Queue (BullMQ or equivalent)
  - Purpose: queue snapshot processing jobs and manage worker orchestration.
  - Integration:
    - Connection via REDIS_URL (secret). Jobs are enqueued upon snapshot insert.
    - Worker concurrency configured via environment variable.
+ - Notes: + - If using hosted Redis (e.g., Upstash), account for connection limits and latency. + +- S3 / Object Storage (AWS S3 / S3-compatible) + - Purpose: archive snapshots and store exports for analytics. + - Integration: + - Use service account credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) stored as secrets. + - Upload archived snapshots to dedicated bucket with lifecycle rules. + - Security: + - Enforce server-side encryption (SSE) and proper IAM policies. + +- Google Cloud (optional) + - Purpose: CI secrets, service accounts for GCP-based resources if used. + - Integration: + - GOOGLE_SA_JSON stored as GitHub secret for actions that need it. + +- GitHub Actions (CI/CD) + - Purpose: run migrations (manual), run tests, build & publish container images to GHCR. + - Integration: + - Use repository secrets for DB connections; protect workflows that run on production via GitHub + Environments/approvals. + +- Monitoring & Logging (Prometheus, Grafana, Sentry or alternatives) + - Purpose: collect metrics and errors, manage alerts. + - Integration: + - Worker exposes /metrics; scrape by Prometheus or push to managed provider. + - Sentry DSN for error tracking; configure sampling to avoid PII leakage. + +- Upstream game API (get_hero_profile endpoints) + - Purpose: source of raw profile snapshots. + - Integration: + - Scripts or proxy clients call this endpoint. Respect upstream rate limits and terms of service. + - Any credentials used by players for login flows must be handled locally (never committed) and not stored by + ingestion service. + - Rate limiting: + - Track upstream rate limits and surface to users when exceeded. + +- CI Container Registry (GHCR) + - Purpose: publish worker images or other containers. + - Integration: + - Use GHCR_TOKEN or GitHub Actions authentication; include retention and cleanup policy for images. + +--- + +### 5.5 Third‑party services & dependencies + +An inventory of external services and dependencies with operational constraints, quotas, and suggested SLA/targets. + +- Postgres (Supabase/managed Postgres) + - Typical SLA: provider dependent (e.g., 99.95%). + - Limits: connection limits, extensions permissions may be restricted. + - Recommendations: use a dedicated DB role for ETL operations; monitor connection count and long-running queries. + +- Redis / Managed queue service (e.g., Upstash, RedisLabs) + - SLA: provider dependent. + - Limits: concurrent connections, max memory, max message throughput. + - Recommendations: size instance for peak ETL throughput, set eviction policies. + +- S3 / Object Storage (AWS S3, DigitalOcean Spaces, MinIO) + - SLA: typically high durability (11 9s), provider dependent for availability. + - Limits: request rates per prefix — follow provider guidelines for parallel uploads. + - Recommendations: configure lifecycle, enable server-side encryption, versioning optional. + +- GitHub Actions & GHCR + - Quotas: actions minutes and storage quotas per plan; GitHub rate limits for API calls. + - Recommendations: Use protected environments for production operations; rotate tokens periodically. + +- Monitoring & Error Tracking (Prometheus/Grafana, Sentry) + - Quotas: retention and event quotas (Sentry); scrape frequency for Prometheus. + - Recommendations: configure alerting thresholds (ETL failure spikes, high queue latency), control sampling to avoid + sending PII. + +- Upstream game API + - Constraints: unknown rate limits; must be treated as a throttled resource. 
+ - Recommendations: implement client-side backoff, expose user-facing messages when upstream limits are hit. + +- Node.js / pnpm / npm ecosystem + - Constraints: dependency vulnerabilities and transitive license issues. + - Recommendations: dependabot or similar for dependency updates; audits in CI. + +- Libraries & DB extensions + - pgcrypto (preferred), jsonb tooling, node-pg-migrate + - Notes: Ensure provider supports chosen extensions or provide fallback code paths. + +Service-level expectations (internal targets) + +- ETL worker availability: target 99.9% in production during business hours. +- Snapshot ingestion latency: < 1s for API ack when snapshot accepted (processing asynchronous). +- ETL processing SLA for interactive flows: configurable default (e.g., 30s), longer allowed for backfills. +- Alerting: trigger when ETL failure rate > 1% over 5 minutes or queue latency > threshold. + +--- + +References + +- Link canonical schema: docs/DB_MODEL.md +- Link ETL design and worker contract: docs/ETL_AND_WORKER.md +- Link migration conventions: docs/MIGRATIONS.md + +## 6. Non-Functional Requirements (NFR) + +This section lists measurable non-functional requirements and operational constraints for the Player Profile & DB +Foundation project. Where concrete numbers are proposed, they are recommendations to start with and should be validated +against real traffic and baseline measurement. + +--- + +### 6.1 Performance + +- Latency targets + - Snapshot ingestion (API ack): p95 < 1s, p99 < 3s (acknowledgement that snapshot was received and queued). + - Profile summary read endpoint (normal path served from user_profile_summary): p95 < 200ms, p99 < 500ms under + typical staging/production read load. + - Profile summary fallback (build ad-hoc from latest processed snapshot): p95 < 500ms, p99 < 1s. + - ETL processing for interactive requests (small/average snapshots): median < 10s, p95 < 30s. Large payloads (login + flow, 2–3MB) may be treated as asynchronous with SLA target p95 < 5 minutes for background processing in initial + release. + - Admin operations (bootstrap/migrations): no strict latency SLA but must complete within workflow timeouts (GitHub + Actions default) and provide progress logs. + +- Throughput / concurrency targets + - Target initial throughput: 1,000 snapshot ingest requests/day (configurable). + - Target ETL capacity: process 100 snapshots/hour with a single worker instance. Design for horizontal scaling to + handle spikes (workers ×N). + - Concurrent read queries: support 500 concurrent summary reads (scale via read replicas / caching). + - These numbers are starting points; measure real traffic and increase capacity targets accordingly. + +- Load profile and expected traffic + - Typical load: bursts when community events occur (fetch scripts executed by many users); anticipate spikes ( + 10–100× baseline) during coordinated runs. + - Peak-case planning: system should be able to scale to handle spike multiplier for short periods (auto-scale worker + pool and API replicas). + - Backfill scenarios: bulk backfills will be scheduled during off-peak windows and run with controlled concurrency + to avoid impacting production reads. + +--- + +### 6.2 Scalability + +- Horizontal / vertical scaling expectations + - Stateless components (API, worker processes) must scale horizontally behind a load balancer or process supervisor. + - Postgres: scale vertically for CPU/memory; scale horizontally for reads with read replicas. 
Use partitioning and
    connection pooling to scale writes and large snapshot storage.
  - Redis/Queue: scale vertically to increase throughput; consider sharding if needed.
  - Object storage (S3): scales automatically for archival/export operations.

- Bottleneck considerations
  - Postgres connections and long-running transactions are the primary write bottlenecks — avoid monolithic
    transactions for entire snapshots.
  - Network bandwidth when transferring large snapshots or performing S3 uploads.
  - Memory consumption during ETL parsing for large payloads: use streaming/chunked parsing.
  - Rate limits of the upstream API and Discord (throttle & queue at the client side).

- Recommendations
  - Use connection pooling (PgBouncer) and limit DB connections per worker.
  - Partition hero_snapshots by time (monthly) or by hash of user_id for very large datasets.
  - Employ read replicas for heavy analytical queries and for bot read traffic if necessary.
  - Implement autoscaling policies for workers based on queue depth and processing latency.

---

### 6.3 Reliability / Availability

- Target uptime / SLA
  - Internal target: 99.9% availability for the public read API and worker infrastructure during business hours (the
    SLA can be refined with stakeholders).
  - Snapshot ingests and ETL are best-effort asynchronous services; availability target 99.5% for the ingestion API.

- RTO / RPO objectives
  - RTO (Recovery Time Objective): 1 hour for critical failures affecting primary reads; 24 hours for full recovery
    after a catastrophic failure.
  - RPO (Recovery Point Objective): database backups taken at least daily with WAL archiving; target RPO = 1 hour
    (WAL-enabled continuous archiving) for production-critical data where supported by the provider.

- Redundancy strategy
  - DB: managed provider with automated backups and optional read replicas; cross-region replicas if required for
    higher availability.
  - API & workers: run at least two instances across availability zones; use managed orchestration for automatic
    restart.
  - Queue: run with managed Redis or an HA configuration; ensure persistence where required (or use jobs persisted in
    the DB as a fallback).
  - Storage: use a durable object store with versioning and lifecycle policies (S3 or S3-compatible).

- Backup & restore
  - Regular automated backups (daily snapshots + continuous WAL where supported).
  - Periodic restore drills documented in BACKUP_RESTORE.md; at minimum a yearly full restore test and quarterly
    partial restore validation.
  - Retain backups per a policy balancing compliance and cost (e.g., 90 days online, archive longer-term).

---

### 6.4 Security

- Authentication & authorization model
  - Public read endpoints: allow anonymous reads or light authentication depending on product choice; enforce per-IP
    and per-token rate limiting.
  - Internal ingestion/admin endpoints: require service-to-service authentication (short-lived signed tokens or mTLS)
    or bearer tokens stored in GitHub Secrets; admin endpoints require RBAC (role-based access control) and must be
    limited to designated maintainer accounts.
  - Use least privilege: separate credentials for migrations, worker writes, and analytics reads.

- Data encryption (at rest / in transit)
  - In transit: TLS 1.2+ for all external and internal communications (API, DB connections with SSL).
  - At rest: rely on provider encryption (Postgres managed service encryption, S3 SSE). For highly sensitive fields,
    consider application-level encryption for specific columns.
  - Secrets in GitHub Actions: use GitHub Secrets and protected Environments; avoid printing secrets to logs.

- Secret management
  - Store secrets in GitHub Secrets for CI and in a secrets manager for runtime (e.g., AWS Secrets Manager, GCP Secret
    Manager, or a provider equivalent).
  - Enforce a rotation policy (e.g., rotate DB credentials and service tokens every 90 days or on compromise).
  - Audit access to secrets and require multi-person approval for high-privilege environment changes.

- Threat model highlights
  - Threats:
    - Credential leakage (accidental commit, logs).
    - Data exfiltration (malicious actor or misconfigured S3 permissions).
    - Injection (SQL injection via poorly-validated fields).
    - Supply chain (malicious NPM packages).
    - DDoS / abusive traffic (rate-limiting bypass).
  - Mitigations:
    - Pre-commit scanning and CI checks to prevent secrets in code.
    - IAM least privilege and S3 bucket policies; object-level encryption.
    - Parameterized queries and ORM / query-builder usage; strict validation of incoming JSON.
    - Dependency scanning (Dependabot), pinned dependencies and reproducible builds.
    - WAF or rate limiting, API quotas, and abuse monitoring.

- OWASP considerations
  - Address the OWASP Top 10 (A1–A10) as applicable:
    - A1 Injection: use parameterized queries; validate inputs.
    - A2 Broken Authentication: enforce secure tokens and session handling.
    - A3 Sensitive Data Exposure: redact sensitive fields in logs and encrypt at rest.
    - A5 Security Misconfiguration: restrict permissions on DB and storage; avoid unnecessary extensions.
    - A9 Components with Known Vulnerabilities: dependency scanning and patching.
  - Include security tests in CI (SAST/DAST) and periodic dependency audits.

---

### 6.5 Privacy & Compliance

- PII handling
  - Define a clear data classification: which fields in snapshots are PII (emails, real names, device identifiers,
    tokens) and handle them accordingly.
  - Minimize PII storage: only store fields required for functionality; store raw snapshots only when necessary and
    redact sensitive fields before archival if required.
  - Logs: never log user credentials or raw tokens; redact PII in application logs.

- GDPR / CCPA / other regulatory constraints
  - Implement data subject request (DSR) workflows: right to access, right to erasure ("right to be forgotten"),
    portability.
  - Maintain an audit trail for deletion and retention actions.
  - Ensure a Data Processing Agreement (DPA) with cloud providers when handling EU user data.
  - Document the lawful basis for processing user data in DATA_PRIVACY.md.

- Data residency requirements
  - Support configuration of data residency per environment (e.g., EU-only storage). If required, deploy the DB and
    S3 buckets in specific regions and configure backups accordingly.
  - Ensure cross-region backups/processing comply with legal constraints.

- Consent & user-facing notices
  - If this is a public-facing service, include a privacy policy and explicit consent flows for login-based ingestion
    (explain what is captured and retained).
  - Provide user-facing controls for deleting stored profiles (or requesting archival/deletion) and document expected
    SLAs for deletion.

---

### 6.6 Maintainability & Operability

- Observability requirements (metrics, logs, traces)
  - Metrics:
    - ETL: processed_count, success_count, failure_count, average_latency, p95_latency, queue_depth.
    - API: request rates, error rates, latency percentiles.
    - Infrastructure: DB connection usage, replication lag, worker memory/CPU.
  - Logging:
    - Structured logs (JSON) with standardized fields (timestamp, service, level, job_id/snapshot_id,
      correlation_id).
    - Redact sensitive fields; use log sampling for high-volume flows.
  - Tracing:
    - Distributed tracing for the request → ingestion → worker pipeline (trace IDs propagated in headers).
  - Dashboards & alerts:
    - Dashboard for ETL health, queue length, failure rate, and ingestion latency.
    - Alerts when failure_rate > threshold, queue depth is high, or the ETL latency SLA is breached. A metrics
      sketch follows this list.
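A sketch of how the worker could emit these metrics with prom-client; metric and label names are illustrative
mappings of the list above, not settled conventions.

```js
// Observability sketch with prom-client: counters, a latency histogram, and a /metrics handler.
const client = require('prom-client');

client.collectDefaultMetrics(); // process CPU/memory/event-loop metrics for free

const processedCount = new client.Counter({
  name: 'etl_processed_count',
  help: 'Snapshots processed by the ETL worker',
  labelNames: ['outcome'], // success | failure
});

const etlLatency = new client.Histogram({
  name: 'etl_processing_duration_seconds',
  help: 'End-to-end processing time per snapshot',
  buckets: [0.5, 1, 5, 10, 30, 60, 300],
});

// Wrap the worker's processing function to record outcome and latency.
async function instrumented(processFn, snapshotId) {
  const stop = etlLatency.startTimer();
  try {
    await processFn(snapshotId);
    processedCount.inc({ outcome: 'success' });
  } catch (err) {
    processedCount.inc({ outcome: 'failure' });
    throw err;
  } finally {
    stop();
  }
}

// Expose /metrics for Prometheus scraping (e.g., mounted on an Express app).
async function metricsHandler(req, res) {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
}
```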
- Error tracking & alerting expectations
  - Use Sentry or equivalent for uncaught exceptions and application errors (with PII redaction).
  - Alerts:
    - P0: ETL failure rate > 1% sustained for 5 minutes → page on-call.
    - P0: Queue depth > threshold (configurable) for > 5 minutes → notify.
    - P1: Migration workflow failure in production → notify maintainers.
  - On-call rotation and SLAs for acknowledgements should be defined in OP_RUNBOOKS/ONCALL.md.

- Operational runbooks required
  - Runbooks to include:
    - How to re-enqueue snapshots and reprocess (admin endpoint and manual DB steps).
    - How to run migrations and roll back safely (with a pre-check list).
    - How to restore the DB from backup and perform a sanity check.
    - Incident response for an ETL storm/failure, and for secrets compromise.
    - How to perform retention/archival jobs manually.

- Testing & CI
  - CI must run unit tests, integration tests against a test Postgres instance, linting, and lightweight security
    scans.
  - Include an end-to-end test that exercises ingest → ETL → profile_summary for sample payloads.

---

### 6.7 Accessibility

- A11y requirements & compliance level (WCAG)
  - Public-facing UI (if present) should aim for WCAG 2.1 AA compliance as a target.
  - Bot messages: ensure text conveys the necessary information without relying on color alone; provide alt text for
    images or icons in embeds where relevant.
  - Documentation: make sure developer docs are navigable and readable (headings, code blocks, keyboard accessibility
    for any web UI).

---

### 6.8 Internationalization / Localization

- Languages supported
  - Initial supported language: English (en).
  - Target second language: French (fr) for project maintainers/community (optional for user-facing UI).
  - Use i18n frameworks for any user-facing strings; avoid hard-coded text in bot replies and the web UI.

- Formatting & timezones
  - Store all timestamps in UTC in the DB (ISO 8601). Convert to the local timezone only at the presentation layer.
  - Number/date formatting follows locale conventions at the presentation layer (client or UI).

- Character encoding
  - Use UTF-8 for all stored text and interfaces.

---

### 6.9 Constraints and limitations

- Platform or infrastructure constraints
  - Managed Postgres providers may restrict CREATE EXTENSION or certain superuser operations (a documented fallback
    is required).
  - GitHub Actions limitations: runner execution timeouts and secret exposure risk — the production bootstrap must be
    manual and protected.
  - Provider quotas for Redis, S3, GHCR and Actions minutes should be tracked to avoid hitting limits.

- Regulatory or legal constraints
  - If processing EU resident data, comply with GDPR (DPA with the provider, data residency if required).
  - If storing payment or highly sensitive data, use certified provider services and restricted handling (PCI DSS out
    of scope unless payments are added).
+ +- Cost constraints + - Archiving snapshots and running high-frequency ETL can grow storage and compute costs quickly. Use retention + policy, lifecycle rules and careful autoscaling. + - Monitor monthly spend and provide cost alerts. + +- Operational constraints + - Migration changes that require downtime must be communicated and scheduled; prefer online migrations where + possible. + - Avoid large, blocking schema changes in a single migration — prefer migration patterns that add columns with null + defaults, backfill asynchronously, then make columns NOT NULL in a later migration. + +--- + +References and cross-links + +- docs/DB_MODEL.md (data model) +- docs/ETL_AND_WORKER.md (worker design, retries, idempotency) +- docs/DATA_PRIVACY.md (privacy & GDPR procedures) +- docs/OBSERVABILITY.md and docs/OP_RUNBOOKS/* (monitoring and runbooks) + +--- + +## 7. Data Model & Schema (overview) + +This section provides a high-level overview of the canonical data model we will use to support snapshot ingestion, ETL +normalization and fast reads. It contains an embedded ERD-like diagram (textual), detailed key entities and attributes ( +types, constraints and relationships), indexing and common query patterns, and the migration/versioning strategy +reference. + +For the full, canonical schema (DDL, constraints, index definitions and ER diagrams) see: docs/DB_MODEL.md + +--- + +### 7.1 High-level ERD (link or embedded) + +Below is a compact textual ERD showing main tables and relationships. It is intended as an overview; the full ERD image +and complete table definitions live in docs/DB_MODEL.md. + +Users (1) ⟷ (N) HeroSnapshots +Users (1) ⟷ (N) UserTroops +Users (1) ⟷ (N) UserPets +Users (1) ⟷ (N) UserArtifacts +Users (1) ⟷ (N) UserTeams +Guilds (1) ⟷ (N) GuildMembers +Users (1) ⟷ (N) GuildMembers +HeroSnapshots (1) ⟷ (N) ETLErrors / ETLAudit + +Textual relationships: + +- users.id (UUID, PK) + - hero_snapshots.user_id → users.id (nullable) — raw snapshot may be inserted before user mapping exists + - user_troops.user_id → users.id (NOT NULL) + - user_teams.user_id → users.id (NOT NULL) +- guild_members.guild_id → guilds.id +- guild_members.user_id → users.id + +(For a graphical ERD, see docs/DB_MODEL.md which contains an SVG/PNG ERD and table-by-table DDL.) + +--- + +### 7.2 Key entities and attributes + +Below are the primary tables required for the initial product vertical slice, with recommended column names, types, +constraints and brief notes on purpose. + +Note: these are the canonical attributes used by the ETL and APIs. Implementation DDL lives under database/migrations/ +and docs/DB_MODEL.md. + +1) users + +- Purpose: canonical user account mapping (NameCode, Discord id, human name) +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - namecode VARCHAR(64) UNIQUE NULLABLE — NameCode / Invite code (ex: COCORIDER_JQGB) + - discord_user_id VARCHAR(64) NULLABLE + - username VARCHAR(255) NULLABLE + - email VARCHAR(255) NULLABLE + - created_at TIMESTAMPTZ DEFAULT now() + - updated_at TIMESTAMPTZ DEFAULT now() +- Notes: + - Keep PII minimal; optionally separate PII into a protected table if compliance requires. 
+ - Indexes: UNIQUE on namecode; index on discord_user_id + +2) hero_snapshots + +- Purpose: store raw JSONB payloads from get_hero_profile / login flow for audit & replay +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - user_id UUID REFERENCES users(id) ON DELETE SET NULL + - namecode VARCHAR(64) NULLABLE + - source VARCHAR(64) NOT NULL (e.g., "fetch_by_namecode", "login", "cli_upload") + - raw JSONB NOT NULL + - size_bytes INTEGER NOT NULL + - content_hash VARCHAR(128) NOT NULL -- SHA256 hex + - server_time BIGINT NULLABLE (if provided by upstream) + - processing BOOLEAN DEFAULT FALSE + - processed_at TIMESTAMPTZ NULLABLE + - created_at TIMESTAMPTZ DEFAULT now() + - error_count INTEGER DEFAULT 0 + - last_error JSONB NULLABLE +- Constraints & indexes: + - UNIQUE(content_hash, source) OPTIONAL depending on dedupe policy + - INDEX on (user_id, created_at DESC) + - GIN index on raw using jsonb_path_ops or default jsonb_ops for search: + - CREATE INDEX ON hero_snapshots USING GIN (raw jsonb_path_ops); + - Expression index on ( (raw ->> 'PlayerId') ) if upstream exposes stable top-level ID + +3) user_troops + +- Purpose: normalized inventory of troops per user (fast lookup by troop_id) +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - troop_id INTEGER NOT NULL -- references troop_catalog.id when available + - amount INTEGER DEFAULT 0 NOT NULL + - level INTEGER DEFAULT 1 + - rarity INTEGER DEFAULT 0 + - fusion_cards INTEGER DEFAULT 0 + - traits_owned INTEGER DEFAULT 0 + - extra JSONB DEFAULT '{}'::jsonb -- store unknown fields + - last_seen TIMESTAMPTZ DEFAULT now() + - updated_at TIMESTAMPTZ DEFAULT now() +- Constraints & indexes: + - UNIQUE(user_id, troop_id) + - INDEX on (troop_id, amount) for analytics + - INDEX on (user_id, troop_id) for fast upsert/deletes + +4) guilds + +- Purpose: guild metadata and feature flags +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - discord_guild_id VARCHAR(64) UNIQUE NULLABLE + - name VARCHAR(255) + - settings JSONB DEFAULT '{}'::jsonb + - feature_flags JSONB DEFAULT '{}'::jsonb + - created_at TIMESTAMPTZ DEFAULT now() + - updated_at TIMESTAMPTZ DEFAULT now() + +5) guild_members + +- Purpose: mapping between guilds and users +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - guild_id UUID NOT NULL REFERENCES guilds(id) ON DELETE CASCADE + - user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE + - discord_user_id VARCHAR(64) NULLABLE + - joined_at TIMESTAMPTZ NULLABLE +- Constraints & indexes: + - UNIQUE(guild_id, user_id) + - INDEX on guild_id for guild-wide queries + +6) feature_flags + +- Purpose: store product feature toggles and rollout metadata (global flags) +- Columns: + - id UUID PRIMARY KEY DEFAULT gen_random_uuid() + - name VARCHAR(128) UNIQUE NOT NULL + - enabled BOOLEAN DEFAULT false + - rollout_percentage INTEGER DEFAULT 0 + - data JSONB DEFAULT '{}'::jsonb + - created_at TIMESTAMPTZ DEFAULT now() + - updated_at TIMESTAMPTZ DEFAULT now() + +7) user_profile_summary + +- Purpose: denormalized, read-optimized summary for fast profile reads (the primary read for the bot/UI) +- Columns: + - user_id UUID PRIMARY KEY REFERENCES users(id) + - namecode VARCHAR(64) + - username VARCHAR(255) + - level INTEGER + - top_troops JSONB DEFAULT '[]' -- e.g. [ {troop_id, amount, level}, ... 
]
  - equipped_troop_ids INTEGER[] DEFAULT '{}'
  - equipped_pet JSONB NULLABLE
  - pvp_tier INTEGER NULLABLE
  - guild_id UUID NULLABLE
  - last_seen TIMESTAMPTZ NULLABLE
  - cached_at TIMESTAMPTZ DEFAULT now() -- when summary was generated
  - extra JSONB DEFAULT '{}'::jsonb
- Indexes:
  - PRIMARY KEY user_id
  - INDEX on namecode for quick lookup (if public reads use namecode)

8) etl_errors (or etl_audit)

- Purpose: capture per-snapshot or per-entity errors for debugging and reprocess prioritization
- Columns:
  - id UUID PRIMARY KEY DEFAULT gen_random_uuid()
  - snapshot_id UUID REFERENCES hero_snapshots(id)
  - error_type VARCHAR(128)
  - message TEXT
  - details JSONB
  - created_at TIMESTAMPTZ DEFAULT now()

9) optional catalogs

- troop_catalog (id INT PRIMARY KEY, name TEXT, rarity INT, meta JSONB)
- pet_catalog, artifact_catalog

---

### 7.3 Indexing and query patterns

This section describes recommended indexes and the most common query patterns the system must serve quickly. Index
choices balance the write cost of ETL upserts against read performance.

Recommended indexes (examples)

- hero_snapshots
  - GIN index: CREATE INDEX idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops);
  - Recent snapshots per user: CREATE INDEX idx_hero_snapshots_user_created_at ON hero_snapshots (user_id,
    created_at DESC);
  - Optional uniqueness: CREATE UNIQUE INDEX IF NOT EXISTS ux_hero_snapshots_contenthash_source ON hero_snapshots
    (content_hash, source) WHERE (content_hash IS NOT NULL);

- users
  - CREATE UNIQUE INDEX idx_users_namecode ON users(namecode);
  - CREATE INDEX idx_users_discord_user_id ON users(discord_user_id);

- user_troops
  - CREATE UNIQUE INDEX ux_user_troops_user_troop ON user_troops (user_id, troop_id);
  - CREATE INDEX idx_user_troops_troop ON user_troops (troop_id);
  - CREATE INDEX idx_user_troops_user ON user_troops (user_id);

- user_profile_summary
  - CREATE INDEX idx_profile_summary_namecode ON user_profile_summary (namecode);

- guild_members
  - CREATE INDEX idx_guild_members_guild ON guild_members (guild_id);

- etl_errors
  - INDEX on snapshot_id and created_at for fast debugging.

Query patterns and example SQL

- Get the latest processed snapshot for a user:
  - SELECT * FROM hero_snapshots WHERE user_id = $1 AND processed_at IS NOT NULL ORDER BY created_at DESC LIMIT 1;

- Get a profile summary by namecode:
  - SELECT * FROM user_profile_summary WHERE namecode = $1;

- Who owns troop 6024 (list users with > 0 amount):
  - SELECT u.id, u.namecode, ut.amount, ut.level FROM user_troops ut JOIN users u ON ut.user_id = u.id WHERE
    ut.troop_id = 6024 AND ut.amount > 0 ORDER BY ut.amount DESC LIMIT 100;

- Top troops for a user:
  - SELECT jsonb_array_elements(user_profile_summary.top_troops) FROM user_profile_summary WHERE user_id = $1;

- Leaderboard by troop total:
  - SELECT u.namecode, ut.amount FROM user_troops ut JOIN users u ON ut.user_id = u.id WHERE ut.troop_id = $1
    ORDER BY ut.amount DESC LIMIT 100;

- Search raw snapshots for a condition (example using a JSON path):
  - SELECT id FROM hero_snapshots WHERE raw @? '$.ProfileData.Troops[*] ? (@.TroopId == 6024 && @.Amount > 0)';

Indexing considerations & patterns

- GIN indexes on jsonb speed up ad-hoc queries but carry a write cost. Use them judiciously on hero_snapshots; most
  production queries should use the normalized tables.
- Expression indexes: create indexes on frequently accessed extracted JSON keys (e.g., ((raw ->> 'namecode'))) if you
  must query raw snapshots frequently.
- Partial indexes: for large tables, use partial indexes for hot subsets (e.g., processed snapshots only).
- Materialized views: create materialized views for heavy aggregations (troop ownership totals) and refresh them on a
  schedule or via incremental jobs.
- Partitioning: hero_snapshots can be partitioned by created_at (e.g., monthly) or by hash(user_id) if snapshot
  volume grows into tens of millions of rows. Partitioning reduces vacuum and index bloat and improves archival
  deletion performance.

ETL upsert patterns (best practices)

- Use ON CONFLICT (user_id, troop_id) DO UPDATE ... for user_troops upserts.
- Combine small batches into single multi-row upserts where possible to reduce round-trips (see the sketch below).
- Limit transaction scope: commit per entity group (users, troops, pets) to reduce lock hold time.
- Use advisory locks only where required to avoid concurrency issues with parallel worker instances.
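A sketch of the multi-row upsert pattern, assuming pg and the user_troops shape above; the chunk size is an
illustrative starting point, not a tuned value.

```js
// Batched upsert sketch for user_troops: one multi-row INSERT ... ON CONFLICT per chunk.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function upsertTroops(userId, troops, chunkSize = 200) {
  for (let i = 0; i < troops.length; i += chunkSize) {
    const chunk = troops.slice(i, i + chunkSize);
    const values = [];
    const params = [userId]; // $1 is shared by every row
    for (const t of chunk) {
      const base = params.length; // next placeholders are $base+1 .. $base+3
      params.push(t.troopId, t.amount, t.level);
      values.push(`($1, $${base + 1}, $${base + 2}, $${base + 3})`);
    }
    await pool.query(
      `INSERT INTO user_troops (user_id, troop_id, amount, level)
       VALUES ${values.join(', ')}
       ON CONFLICT (user_id, troop_id)
       DO UPDATE SET amount = EXCLUDED.amount, level = EXCLUDED.level, updated_at = now()`,
      params
    );
  }
}
```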
Read optimizations

- Serve interactive reads from user_profile_summary (denormalized).
- Cache summaries in Redis or an in-process LRU cache if read volume demands it.
- Use read replicas for heavy analytics or dashboard queries to avoid impacting the primary.

---

### 7.4 Migration strategy & versioning (link to MIGRATIONS.md)

Migration tooling and conventions

- Tool: node-pg-migrate (JavaScript migrations)
  - Store migrations under database/migrations/ with timestamped filenames, e.g.
    20251128153000_create_hero_snapshots.js
  - Migrations should be small, reversible where practical, and tested locally and in staging before production.

Naming & semantics

- Use a consistent filename pattern: YYYYMMDDHHMMSS_description.[js|sql]
- Each migration should contain:
  - up: statements to apply the change
  - down: statements to revert the change when safe
- Avoid destructive operations in one step (e.g., dropping columns) — prefer a 3-step safe migration (a sketch
  follows this list):
  1. Add the new nullable column
  2. Backfill/update data (asynchronously via a worker)
  3. Make the column NOT NULL and drop the old column in a later migration
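To illustrate these conventions, a minimal node-pg-migrate sketch for the initial hero_snapshots table (columns
follow section 7.2). It assumes a users table already exists from an earlier migration and that the provider permits
CREATE EXTENSION; otherwise split the extension into its own guarded migration, as the best practices below recommend.

```js
// database/migrations/20251128153000_create_hero_snapshots.js (filename illustrative)
exports.up = (pgm) => {
  pgm.createExtension('pgcrypto', { ifNotExists: true }); // needed for gen_random_uuid()

  pgm.createTable('hero_snapshots', {
    id: { type: 'uuid', primaryKey: true, default: pgm.func('gen_random_uuid()') },
    user_id: { type: 'uuid', references: 'users', onDelete: 'SET NULL' },
    namecode: { type: 'varchar(64)' },
    source: { type: 'varchar(64)', notNull: true },
    raw: { type: 'jsonb', notNull: true },
    size_bytes: { type: 'integer', notNull: true },
    content_hash: { type: 'varchar(128)', notNull: true },
    server_time: { type: 'bigint' },
    processing: { type: 'boolean', default: false },
    processed_at: { type: 'timestamptz' },
    created_at: { type: 'timestamptz', default: pgm.func('now()') },
    error_count: { type: 'integer', default: 0 },
    last_error: { type: 'jsonb' },
  });

  pgm.createIndex('hero_snapshots', ['user_id', 'created_at']);
  // Raw SQL keeps the jsonb_path_ops operator class explicit for the GIN index.
  pgm.sql('CREATE INDEX idx_hero_snapshots_raw_gin ON hero_snapshots USING GIN (raw jsonb_path_ops);');
};

exports.down = (pgm) => {
  pgm.dropTable('hero_snapshots');
  // The extension is intentionally left in place: other objects may depend on it.
};
```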
Migration best practices

- Use transactional migrations where possible. For long-running operations that cannot be executed inside a
  transaction (e.g., CREATE INDEX CONCURRENTLY), document them in the migration and include pre/post sanity checks.
- Keep extension creation separate in its own migration with clear notes and guard behavior (CREATE EXTENSION IF NOT
  EXISTS pgcrypto). Document the provider-permissions fallback in MIGRATIONS.md.
- Test migrations:
  - Run migrations up/down in CI on a fresh DB instance (to ensure they apply cleanly).
  - Smoke tests: after migrations, run a suite that validates expected tables and indexes and a minimal ETL run
    against sample payloads.
- Staging → Production promotion:
  - Migrations are applied first in staging; after the smoke tests pass, apply them to production via the manual
    GitHub Actions workflow (db-bootstrap.yml) requiring environment approvers.
- Versioning & release policies:
  - Keep schema changes tied to feature branches; migrate with feature gates or feature flags if schema changes are
    not backwards compatible.
  - Maintain a migration changelog and map migration ids to PRs for traceability.

Rollback & emergency patches

- Plan for rollbacks:
  - For reversible migrations, use the down script.
  - For destructive changes, keep backups and a pre-approved rollback plan in MIGRATIONS.md.
- Backups:
  - Before applying production migrations, take a DB snapshot or backup and validate that a restore is possible.
- Emergency hotfix flow:
  - If a migration causes issues, the runbook in docs/OP_RUNBOOKS/MIGRATION_ROLLBACK.md provides the step-by-step
    mitigation: stop workers, pause ingest, restore from backup or apply a compensating migration, then resume.

Automation & CI integration

- CI gates: require a successful migration run in a CI job against a sandbox DB before merging migration-related PRs.
- Migration pre-checks: add preflight scripts that:
  - verify required DB extensions and privileges
  - estimate the cost of index creation and warn when creating large indexes
  - check for long-running queries and locks

---

References

- Complete migration conventions and examples: docs/MIGRATIONS.md
- Full DDL, column types, constraints and ERD diagrams: docs/DB_MODEL.md
- ETL idempotency patterns and per-entity upsert examples: docs/ETL_AND_WORKER.md

---

## 8. API Specification (contract)

This section defines the API surface, design principles, authentication & authorization rules, concrete endpoints
(request/response schemas and error codes), the rate limiting policy and versioning/deprecation rules. The goal is a
clear, predictable, secure and stable contract for clients (CLI, bot, UI, internal services) and operators.

---

### 8.1 API design principles (REST / GraphQL / versioning)

- Style
  - Primary design: RESTful JSON APIs. Keep resource-oriented endpoints and use standard HTTP verbs (GET, POST,
    PUT/PATCH, DELETE) and status codes.
  - Consider GraphQL in a future phase for complex client-driven queries (analytics / dashboard), but do not
    introduce it in v1, to keep the surface simple.
- Content format
  - All endpoints produce and consume JSON (application/json). Use UTF-8 encoding and ISO-8601 timestamps (UTC).
- Versioning
  - Use versioned path segments: /api/v1/...
  - Versioning via the Accept or a custom header is optional, but path-based versioning is mandatory for the initial
    release.
- Idempotency and safe semantics
  - POST endpoints that create jobs/resources should support an Idempotency-Key header for safe retries
    (Idempotency-Key: <key>).
  - GET endpoints must be safe and cacheable (when appropriate). Use Cache-Control headers.
- Pagination & filtering
  - For list endpoints, use limit/offset or cursor-based pagination depending on expected volume. Default limit=50,
    max limit=1000.
  - Support filtering by common fields (e.g., namecode, user_id, troop_id) and sorting.
- Error model
  - Use a consistent error response structure (see section 8.3).
- Documentation & schema
  - Publish an OpenAPI 3.0+ specification at /openapi.yaml and human-readable docs at /docs/api.
  - Include examples for common flows (snapshot ingest, fetch summary, reprocess).
- Backwards compatibility
  - Additive changes only within a major version (v1): adding new optional response fields is allowed; removing or
    renaming fields requires a new major version (v2).
- Security first
  - Enforce TLS for all traffic, require authentication for internal and admin endpoints, and apply rate limiting &
    quotas.
- Observability
  - Include correlation IDs (X-Request-Id / X-Correlation-Id) for tracing requests end-to-end (ingest → queue →
    worker). A client sketch follows this list.
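A small client-side sketch of these principles, assuming Node 18+ (built-in fetch) and the error envelope defined in
section 8.3; the base URL and token handling are illustrative.

```js
// Client sketch: versioned path, Idempotency-Key for safe retries, correlation id,
// and handling of the shared error envelope.
const { randomUUID } = require('node:crypto');

async function ingestSnapshot(baseUrl, token, body) {
  const idempotencyKey = randomUUID(); // reuse the same key when retrying this request
  const res = await fetch(`${baseUrl}/api/v1/internal/snapshots`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${token}`,
      'Idempotency-Key': idempotencyKey,
      'X-Request-Id': randomUUID(),
      'User-Agent': 'StarForgeCLI/0.1',
    },
    body: JSON.stringify(body),
  });

  if (res.status === 409) return res.json(); // duplicate: body carries the existing snapshot_id

  if (!res.ok) {
    // Consistent error envelope (section 8.3): surface code + request_id for support.
    const { error } = await res.json();
    throw new Error(`${error.code}: ${error.message} (request_id=${error.request_id})`);
  }
  return res.json(); // { snapshot_id, status, created_at }
}
```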
---

### 8.2 Authentication & Authorization

- Auth types supported
  - Service-to-service bearer tokens (signed JWTs) for internal services and GitHub Actions. Validate the signature
    via JWKS or a shared secret.
  - API keys (opaque tokens) for CLI clients when needed (ingest uploads). API keys are scoped and revocable.
  - Admin tokens (OIDC or short-lived JWT) for admin endpoints; RBAC enforced server-side.
  - OAuth2 for optional user-based flows (Discord linking) — only where user consent is explicitly required.
- Least privilege & scopes
  - Tokens must be scoped. Example scopes:
    - ingest:snapshots — allow POST /api/v1/internal/snapshots
    - read:profiles — allow GET /api/v1/profile/summary/*
    - admin:snapshots — allow reprocess and admin actions
    - migrate:apply — allow running migrations (CI workflow only)
  - Do not use a single global token with full access in production; prefer environment-scoped tokens and short-lived
    tokens for human-triggered workflows.
- Admin RBAC
  - Admin endpoints require authentication AND a role check. Roles: admin, maintainer, operator, analyst. Only
    admin/maintainer roles may trigger production migrations or archive operations. A middleware sketch follows this
    subsection.
- Key lifecycle & rotation
  - Enforce rotation for long-lived keys (e.g., API keys rotated every 90 days).
  - Provide an API to list and revoke API keys for service accounts.
- Credential storage
  - Secrets are stored in a secrets manager (provider-specific) at runtime; GitHub Actions use repository secrets and
    protected environments.
- Auditing
  - Log admin actions (who triggered, when, on which snapshot/migration). Audit logs must not contain raw secrets or
    full snapshots (store only references/hashes).
- Client guidance
  - Require clients to set a User-Agent with app name and version (User-Agent: StarForgeCLI/0.1) to aid support and
    rate limiting.
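A minimal sketch of scope enforcement, assuming JWTs verified with the jsonwebtoken package and a space-separated
`scopes` claim; the claim shape and key handling are assumptions, not decided values.

```js
// Scope-check middleware sketch; JWKS lookup and key rotation are omitted.
const jwt = require('jsonwebtoken');

function requireScope(requiredScope) {
  return (req, res, next) => {
    const header = req.headers.authorization ?? '';
    const token = header.startsWith('Bearer ') ? header.slice(7) : null;
    if (!token) {
      return res.status(401).json({ error: { code: 'UNAUTHORIZED', message: 'Missing bearer token' } });
    }
    try {
      const claims = jwt.verify(token, process.env.JWT_PUBLIC_KEY); // checks signature + expiry
      const scopes = (claims.scopes ?? '').split(' ');
      if (!scopes.includes(requiredScope)) {
        return res.status(403).json({ error: { code: 'FORBIDDEN', message: `Missing scope ${requiredScope}` } });
      }
      req.auth = claims; // expose claims to downstream handlers and audit logging
      return next();
    } catch {
      return res.status(401).json({ error: { code: 'UNAUTHORIZED', message: 'Invalid or expired token' } });
    }
  };
}

// Usage: app.post('/api/v1/internal/snapshots', requireScope('ingest:snapshots'), handler);
```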
---

### 8.3 Endpoints (list)

All endpoints are under /api/v1. Below are the primary endpoints for the initial vertical slice. Each entry shows the
method, request/response examples and error cases.

Common error response format (JSON)

- HTTP status >= 400:
  {
  "error": {
  "code": "ERROR_CODE",
  "message": "Human-readable message",
  "details": { /* optional additional data */ },
  "request_id": "uuid"
  }
  }
- Example error codes:
  - INVALID_PAYLOAD, UNAUTHORIZED, FORBIDDEN, NOT_FOUND, CONFLICT, RATE_LIMIT_EXCEEDED, SERVER_ERROR,
    PAYLOAD_TOO_LARGE, DUPLICATE_SNAPSHOT

1) POST /api/v1/internal/snapshots

- Purpose: Accept a raw get_hero_profile JSON payload and create a hero_snapshots row (internal ingestion).
- Auth: Bearer token with scope ingest:snapshots or a valid API key.
- Idempotency: Support the Idempotency-Key header (recommended).
- Request (application/json)
  {
  "source": "fetch_by_namecode" | "login" | "cli_upload",
  "namecode": "COCORIDER_JQGB", // optional but recommended
  "payload": { /* full get_hero_profile JSON */ }
  }
  - Headers:
    - Authorization: Bearer <token>
    - Idempotency-Key: <key> (optional)
    - X-Request-Id: <uuid> (optional)
- Responses:
  - 201 Created
    {
    "snapshot_id": "uuid",
    "status": "queued",
    "created_at": "2025-11-28T12:34:56Z"
    }
  - 202 Accepted (when accepted and queued for deferred processing)
    {
    "snapshot_id": "uuid",
    "status": "queued",
    "estimated_processing_seconds": 10
    }
  - 409 Conflict (duplicate within dedupe window)
    {
    "snapshot_id": "existing-uuid",
    "status": "duplicate",
    "message": "Identical snapshot detected within dedupe window"
    }
  - 400 Bad Request (invalid JSON / missing payload)
  - 413 Payload Too Large (size limit exceeded) with guidance
  - 401 Unauthorized / 403 Forbidden
  - 429 Rate limit exceeded
  - 500 Server error (includes request_id)
- Notes:
  - The server calculates content_hash (SHA256) and size_bytes. If the namecode can be extracted server-side, the
    server attempts to map it to users.id.

2) GET /api/v1/profile/summary/:namecode

- Purpose: Return the denormalized profile_summary for a given namecode; fallback behavior documented.
- Auth: Public read allowed (no auth), or read:profiles scope in a private deployment.
- Request
  - Path param: namecode (string)
  - Query params (optional):
    - source=cache|latest — prefer the cached summary, or compute ad hoc from the latest processed snapshot
- Success Response (200 OK)
  {
  "namecode": "COCORIDER_JQGB",
  "user_id": "uuid",
  "username": "Coco",
  "level": 52,
  "top_troops": [
  { "troop_id": 6024, "amount": 15, "level": 3 },
  { "troop_id": 6010, "amount": 8, "level": 2 }
  ],
  "equipped_pet": { "pet_id": 101, "level": 2 },
  "pvp_tier": 4,
  "guild": { "id": "uuid", "name": "GuildName" },
  "last_seen": "2025-11-28T12:00:00Z",
  "cached_at": "2025-11-28T12:01:00Z"
  }
- Alternate responses:
  - 202 Accepted
    {
    "message": "Profile processing in progress",
    "estimated_ready_in_seconds": 30
    }
  - 404 Not Found { "message": "No profile found" }
  - 429 Rate limit exceeded
  - 500 Server error

3) GET /api/v1/profile/raw/:namecode

- Purpose: Return the latest processed hero_snapshot.raw for a namecode (restricted).
- Auth: read:profiles scope or admin access for raw snapshot access.
- Response:
  - 200 OK { "snapshot_id": "uuid", "payload": { ... full raw JSON ... }, "created_at": "..." }
  - 403 Forbidden if the client lacks permission
  - 404 Not Found
- Notes:
  - A raw snapshot may contain sensitive fields (tokens); only expose it to trusted clients, and redaction may occur
    based on policy.

4) POST /api/v1/admin/snapshots/:id/reprocess

- Purpose: Enqueue an existing snapshot for reprocessing by the ETL worker.
- Auth: Admin scope admin:snapshots; requires RBAC.
- Request: no body required.
- Responses:
  - 202 Accepted { "job_id": "uuid", "status": "enqueued" }
  - 404 Not Found
  - 409 Conflict (snapshot currently processing)
  - 401/403 Unauthorized / Forbidden
- Audit:
  - Log the requester ID, snapshot_id, timestamp and reason (optional).

5) GET /api/v1/admin/snapshots/:id/status

- Purpose: Return the processing status and errors for a given snapshot id.
- Auth: admin:snapshots
- Response:
  {
  "snapshot_id": "uuid",
  "processing": false,
  "processed_at": "2025-11-28T12:10:00Z",
  "error_count": 0,
  "last_error": null
  }

6) GET /api/v1/health

- Purpose: Readiness/liveness check for orchestration; minimal response for health probes.
- Auth: none (or token if the cluster requires it)
- Response:
  - 200 OK { "status": "ok", "timestamp": "..." }
  - 503 Service Unavailable when dependent systems are unhealthy
- Implementation:
  - The health check should validate DB connectivity (light query), queue connectivity (PING), and optionally that S3
    is reachable.

7) GET /metrics

- Purpose: Prometheus metrics endpoint for the service and worker.
- Auth: IP-restricted or bearer token (Prometheus scrape jobs).
- Response: text/plain; version=0.0.4 with metrics lines.

8) POST /api/v1/admin/migrations/apply (optional / restricted)

- Purpose: Trigger a migration run in a CI context (rare; normally performed via GitHub Actions). Very restricted.
- Auth: migrate:apply, and require environment approval / multi-person auth.
- Response:
  - 202 Accepted { "job_id": "uuid", "status": "started" }
  - 403 Forbidden if not allowed
- Notes:
  - Prefer the GitHub Actions workflow for production migrations. This endpoint is optional and must be heavily
    guarded.

9) GET /api/v1/admin/exports?entity=user_troops&from=...&to=...

- Purpose: Trigger or query export jobs that write CSV/Parquet to S3 for analytics.
- Auth: admin or analyst scope
- Response: job listing with status and s3_path once complete.

---

### 8.4 Rate limiting and throttling policy

- Objectives
  - Protect upstream systems (our API, the DB and third-party APIs), provide fair usage and avoid abuse.
- Policy overview
  - Public read endpoints (GET /profile/summary):
    - Default per-IP: 60 requests/minute
    - Burst: 120 requests allowed in a short window, then throttled
    - Authenticated clients (with an API key) may get higher quotas (e.g., 600 req/min) subject to plan.
  - Ingestion endpoints (POST /internal/snapshots):
    - Per-API-key limit: 30 requests/min by default for CLI clients (configurable); service clients are allowed
      higher quotas.
    - Enforce per-client concurrency limits to avoid fan-out storms to ETL workers.
  - Admin endpoints:
    - Strict low-rate limits and additional checks (e.g., require two-person approval for migration triggers).
  - Bot endpoints:
    - Commands are rate limited per guild/user according to Discord best practices; enforce additional server-side
      rate limits to avoid abuse.
- Enforcement & headers
  - Use token-based rate limiting where possible (rate limits keyed by API key or bearer token).
  - Return standard rate-limit headers:
    - X-RateLimit-Limit: <limit>
    - X-RateLimit-Remaining: <remaining>
    - X-RateLimit-Reset: <unix-timestamp>
  - On limit exceeded, return 429 Too Many Requests with:
    {
    "error": { "code": "RATE_LIMIT_EXCEEDED", "message": "Rate limit exceeded", "retry_after": 30 }
    }
- Throttling & backpressure
  - On queue saturation (queue depth high), return 202 Accepted with a "queued" response and an estimated ETA rather
    than accepting more work that will overload workers.
  - Provide graceful degradation: if the write path is saturated, serve read-only cached summaries with an
    explanation to clients.
- Abuse detection
  - Monitor for abnormal patterns and apply temporary IP blacklisting, challenge flows or manual review.
- Client recommendations
  - Clients should respect the Retry-After header and implement exponential backoff with jitter for retries (see the
    sketch below).
  - Use pagination to limit result sizes.
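A minimal retry helper matching this guidance, assuming Node 18+ fetch; the attempt count, base delay and cap are
illustrative defaults.

```js
// Retry helper sketch: exponential backoff with jitter, honouring Retry-After on 429.
async function fetchWithBackoff(url, options = {}, maxAttempts = 5) {
  let res;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    res = await fetch(url, options);
    if (res.status !== 429 && res.status < 500) return res; // success or a non-retryable error
    if (attempt === maxAttempts) break; // give up, surface the last response

    // Prefer the server's Retry-After (seconds); otherwise exponential backoff + jitter.
    const retryAfter = Number(res.headers.get('retry-after'));
    const backoffMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(30_000, 2 ** attempt * 250) + Math.random() * 250;

    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  return res;
}
```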
---

### 8.5 API versioning & deprecation policy

- Versioning strategy
  - Major version in the path: /api/v1/... When breaking changes are required, introduce /api/v2/ and deprecate
    /api/v1 per the policy below.
  - Semantic versioning for SDKs and API clients; the server side follows path-major versioning.
- Deprecation policy
  - Non-breaking (additive) changes: no deprecation required; clients should handle extra optional fields.
  - Breaking changes:
    - Announce deprecation at least 90 calendar days prior to removal (for public endpoints).
    - Provide a migration guide and example mapping, and a compatibility layer where feasible.
    - Emit Deprecation headers on responses for deprecated endpoints:
      - Deprecation: true
      - Sunset: <http-date>
      - Link: <url>; rel="sunset"
    - Examples:
      - Response header: Deprecation: true
      - Response header: Sunset: Thu, 27 Feb 2025 00:00:00 GMT
      - Response header: Link: <url-of-migration-guide>; rel="sunset"
- Version negotiation
  - Support clients that include Accept-header versioning temporarily, but path-based versioning is authoritative.
- Grace period & support
  - Maintain backward-compatibility shims where reasonable.
  - During the deprecation window, offer a compatibility testing sandbox and provide sample code/libraries to ease
    migration.
- Breaking-change approval & communication
  - Any breaking change must be approved by the Product and Engineering leads.
  - Announce via release notes, mailing list, GitHub release and docs.
- Emergency patches
  - For security-critical changes requiring an immediate breaking change, use the emergency channel, communicate the
    impact, and provide a temporary mitigation path.
- OpenAPI & docs per version
  - Publish /openapi-v1.yaml and /openapi-v2.yaml when multiple versions exist. Keep docs for older versions
    available until sunset.

---

References & artifacts

- OpenAPI definition: /openapi.yaml (generate from code or maintain manually).
- API docs: /docs/api (user-friendly).
- Admin & migration runbooks: docs/OP_RUNBOOKS/*.
- Client SDKs and examples: /clients (optional — add a TypeScript/Node example for snapshot ingestion and profile
  read).

## 9. UI / UX

This section defines design goals, constraints, wireframe links, detailed interaction behavior per screen,
accessibility requirements and responsive behavior for the Profile & DB Foundation features. The intent is to provide
clear guidance to designers and frontend engineers so UX decisions are consistent with operational, security and
performance needs.

---

### 9.1 Design goals & constraints

Design goals

- Clarity & speed: present the most important player information (summary) quickly and clearly. The primary read path
  must be lightweight and highly cacheable.
- Progressive disclosure: show a compact summary by default and allow drilling into details (inventory, teams, raw
  snapshot) on demand.
- Predictability: indicate freshness of data and processing status (queued / processing / processed) with consistent
  affordances.
- Operational safety: provide clear admin UI states for ETL jobs, reprocess actions and migration triggers; require
  confirmations for destructive operations.
- Privacy-aware: clearly label any view that exposes raw snapshots and require elevated permissions to see sensitive
  fields.
- Developer-friendly: include sample data, a debug mode and links to the underlying snapshot id for troubleshooting.

Design constraints

- Performance: the primary summary must load from a denormalized cache (user_profile_summary) and be served under
  latency targets (p95 < 200ms). Avoid heavy client-side parsing of raw JSON.
+- Security: raw snapshot view restricted to authenticated and authorized users; UI must not display secrets (tokens) + even for privileged users — redact or mask them. +- Consistency: align visual language with existing branding and the Discord embed style for bot responses. +- Minimal scope: initially provide a compact set of screens (Summary, Raw Snapshot viewer, Admin/ETL dashboard). Expand + only after validating usage patterns. + +Design tokens & components (suggested) + +- Color tokens: primary, secondary, success, warning, danger, neutral; ensure WCAG contrast. +- Typography tokens: scale for headings, body, monospace for raw JSON. +- Components: Card, Badge (status), Table, Collapsible panel, Modal (confirm), SearchBar, Pagination, + AsyncActionButton (shows spinner / progress), CodeBlock (syntax-highlighted JSON), EmptyState, Toasts/Notifications. + +--- + +### 9.2 Wireframes / mockups (links to design files) + +Design files location (placeholders to update with real URLs) + +- Figma (recommended): https://www.figma.com/file/XXXX/StarForge-Designs (replace with actual team Figma URL) +- Repository design folder: /design (store exported PNG/SVG wireframes and final assets) +- Example image files: + - docs/design/wireframes/profile_summary_desktop.png + - docs/design/wireframes/profile_summary_mobile.png + - docs/design/wireframes/admin_etl_dashboard.png + - docs/design/wireframes/raw_snapshot_viewer.png + +What to include in the design repository + +- Low-fidelity wireframes for desktop and mobile for each screen listed below. +- High-fidelity mockups / component-library tokens. +- Interaction prototypes for critical flows: fetch snapshot, ETL job lifecycle, reprocess snapshot, migration workflow + approval. +- Exported assets for Discord embeds (icons, small thumbnails). + +Notes for designers + +- Provide a compact “bot embed” mock that mirrors the Discord message card: title, small icon, level & top troops as + inline fields, CTA link to full profile. +- Annotate mockups with accessibility notes (contrast ratios, keyboard order, ARIA roles). +- Include states: loading, empty, error, stale data (cached_at older than threshold). + +--- + +### 9.3 Interaction details per screen + +This subsection documents screen-by-screen interactions (user inputs, expected responses, validation, and error +handling). Screens prioritized for first release are marked P0. + +Screen: Profile Summary (P0) + +- Purpose: Show denormalized, quick-read information about a player. +- Primary elements: + - Header: Player name, NameCode, level, small avatar / thumbnail + - Status badge: processed / processing / pending + - Key stats: level, PvP tier, guild, last_seen + - Top Troops: list of top 3–5 troops with icons, amount, and small stat badges + - Equipped pet: icon + level + - Actions: Refresh (trigger a new fetch), View Raw Snapshot, Report Issue + - Footer: cached_at timestamp and "last updated" tooltip +- Inputs & interactions: + - Click Refresh: POST to /api/v1/internal/snapshots with source=ui_fetch (requires auth) or trigger client-side + instruction to run CLI; show toast "Snapshot requested" and estimated ETA. + - View Raw Snapshot: opens Raw Snapshot viewer in modal or new page (auth gated). + - If status == processing: show progress spinner and disable actions that would trigger further duplicate fetches; + provide ETA if available. +- Validation & error handling: + - If API returns 202 (pending): show non-blocking banner "Processing, expected in ~30s". 
+ - If profile not found: show call-to-action "Fetch profile by NameCode" with input field and button, validate + namecode format client-side. + - On network/API error: show inline error toast with request id and "Retry" option. + +Screen: Raw Snapshot Viewer (P1) + +- Purpose: Allow privileged users to view the stored JSONB and metadata for debugging and audit. +- Primary elements: + - Metadata header: snapshot_id, created_at, size_bytes, content_hash, processed_at, error_count + - JSON viewer: pretty-printed JSON with collapsible nodes, line numbers, search within JSON + - Controls: Download JSON, Copy link, Redact/Mask sensitive fields toggle (always mask tokens by default), Reprocess + button (admin) + - Audit panel: ETL errors for this snapshot and processing history +- Inputs & interactions: + - Search box filters JSON keys and values, highlights matches. + - Download triggers GET /api/v1/profile/raw/:namecode or snapshot export (auth). + - Reprocess: shows modal confirm (require typing "REPROCESS" or similar), then POST admin reprocess endpoint; show + enqueue toast and job id. +- Validation & security: + - Redact tokens and known sensitive keys automatically on client (and server side). + - Require admin auth for reprocess and raw download. + - Confirm destructive actions and record audit entries. + +Screen: Admin / ETL Dashboard (P0) + +- Purpose: Monitor queue, worker health, recent failures, top error reasons and process snapshots manually. +- Primary elements: + - Summary tiles: queue depth, processing rate (per minute), success rate, failure rate. + - Recent snapshots list: snapshot_id, namecode, size, status, created_at, processed_at, error_count, quick actions ( + view, reprocess) + - Failure trends chart (time series) + - Worker pool status: instance list, memory/CPU, last heartbeat, current job id + - Retention controls: run retention job, configure retention thresholds +- Inputs & interactions: + - Click snapshot row to open Raw Snapshot Viewer. + - Bulk reprocess: select snapshots and enqueue reprocess (confirm modal). + - Pause queue / Pause worker: confirmed action with reason required. + - Export errors: download CSV of latest etl_errors. +- Validation & error handling: + - Action confirmations required for bulk or destructive operations. + - All admin actions generate audit logs visible on the panel. + - Show warn banner if queue depth > threshold or worker heartbeats missing. + +Screen: Migrations & Bootstrap UI (P1) + +- Purpose: Provide a controlled UI for reviewing migration plans and triggering the protected GitHub Action for + production bootstrap (optional — primarily run via Actions). +- Primary elements: + - List of pending migrations with descriptions, author and migration id + - Preflight checks panel that shows required extensions, estimated index size and pre-check results + - Trigger button: "Run Bootstrap (Requires Approval)" — links to GitHub Actions run or triggers action via API (if + implemented) + - Audit trail of past bootstrap runs with logs +- Inputs & interactions: + - Preflight must pass before allowing run; if fails, show remediation steps. + - On trigger, require approver identity (GitHub environment approval or multi-person confirmation). +- Safety: + - UI should not expose DATABASE_URL or secrets. Only provide links to logs and artifacts. + +Screen: Exports & Analytics (P2) + +- Purpose: Allow analysts to request exports of materialized/relevant views to S3. 
+- Interactions: + - Select entity (user_troops), date-range picker, choose format (CSV/Parquet), submit export job. + - Show job status and S3 link when complete. + - Validate ranges to avoid huge exports; if too large, suggest incremental export. +- Permissions: + - Analyst/admin scope only. + +Screen: Onboarding / Developer Docs (P0) + +- Purpose: Offer one-click links and quick instructions to set up local dev environment. +- Contents: + - Quick start steps, sample payloads, button to insert sample snapshot (local only), link to migration docs. + - Troubleshooting tips for extension permission errors and common failures. + +Cross-cutting interaction patterns + +- Confirmations: destructive or high-impact actions require typed confirmation and display expected consequences. +- Idempotency: UI must set and present Idempotency-Key when making snapshot ingest requests so retries are safe. +- Feedback: every async action returns immediate UI feedback (toast + job id) and updates dashboard when job completes + via websocket or polling. +- Visibility: show timestamps (UTC) and relative times (e.g., "5 minutes ago") with tooltip showing exact ISO timestamp. + +--- + +### 9.4 Accessibility considerations (keyboard nav, screen readers) + +Accessibility goals + +- Follow WCAG 2.1 AA where reasonable for public-facing screens; aim for compliance on documentation and admin consoles. +- Ensure keyboard-only users and screen reader users can perform core tasks (view summary, trigger fetch, view errors). + +Specific requirements & implementation notes + +- Semantic HTML: use proper headings (h1..h6), lists, tables, forms and landmarks (role="main", role="navigation", + role="complementary"). +- ARIA attributes: provide ARIA labels for interactive controls (modals, confirm dialogs, buttons without textual + labels). +- Focus management: + - On modal open: move focus to the first interactive element; on close return focus to invoking control. + - Provide visible focus indicators for all focusable elements. +- Keyboard navigation: + - All controls accessible via Tab / Shift+Tab. + - Provide keyboard shortcuts for frequent admin actions (e.g., reprocess selected snapshots) but make them + discoverable and optional. +- Screen reader support: + - Provide descriptive alt text for images and icons. + - Announce dynamic updates (toasts, job status changes) via aria-live regions. + - For the JSON viewer provide a "Toggle collapsed/expanded" control and text-mode view optimized for screen + readers (collapsible structure with headings). +- Contrast & typography: + - Maintain contrast ratio >= 4.5:1 for body text and 3:1 for large text per WCAG guidance. + - Avoid relying on color alone to communicate status; use icons and text labels. +- Motion & reduced-motion: + - Respect prefers-reduced-motion; provide reduced animations if user prefers. +- Forms & error messages: + - Associate labels with inputs; provide inline error messages with aria-describedby linking to the error text. +- Testing & validation: + - Include automated accessibility checks in CI (axe-core, pa11y). + - Conduct manual testing with a screen reader (NVDA/VoiceOver) on key screens. + +--- + +### 9.5 Responsive behavior (mobile / tablet / desktop) + +Responsive design principles + +- Progressive enhancement: keep the core experience (summary read, refresh request) usable on low-bandwidth / low-CPU + devices. 
+- Breakpoints (suggested):
+  - Small (mobile): up to 600px
+  - Medium (tablet): 600px–1024px
+  - Large (desktop): 1024px+
+- Layout adjustments
+  - Profile Summary:
+    - Mobile: single-column card; header with avatar and name, followed by a vertical list of stats; action buttons
+      stacked.
+    - Tablet: two-column layout — header + key stats left, top troops and actions right.
+    - Desktop: multi-column with expanded top troops, quick actions, and last_seen + cached_at in the header.
+  - Raw Snapshot Viewer:
+    - Mobile: show truncated JSON with an "Open Full" link to download or open in a separate view; provide search but
+      limit initial expansion to avoid an extremely long DOM.
+    - Desktop: full collapsible JSON tree with a side-by-side metadata panel.
+  - Admin Dashboard:
+    - Mobile: present only the most critical tiles (queue depth, recent failures) and allow navigation to the full
+      desktop UI for advanced operations.
+    - Desktop: full dashboard with charts, tables and controls.
+- Touch targets & spacing
+  - Ensure tap targets are >= 44x44 px for mobile.
+  - Use adequate spacing to avoid accidental taps.
+- Performance considerations for mobile
+  - Lazy-load heavy components (charts, JSON tree) and prefer server-side rendered summaries for the initial paint.
+  - Use compressed images and optimized icons (SVG).
+- Offline & poor connectivity
+  - Provide graceful messages when offline or when the API is unreachable, and allow queuing of non-sensitive actions
+    locally if relevant (e.g., queue a requested snapshot fetch and retry it when connectivity returns).
+- Discord bot experience
+  - Discord embeds are single-card content: keep messages short and provide a link to the web UI for details. Design
+    embed messages to display well on mobile Discord clients.
+- Testing
+  - Validate on a matrix of devices (Android/iOS) and browsers (Chrome, Firefox, Safari) and ensure performance meets
+    p95 targets.
+
+---
+
+Appendix: UI copy & microcopy guidance
+
+- Use concise, action-oriented labels: "Fetch profile", "Reprocess snapshot", "Download JSON".
+- Status language:
+  - "Processed" — snapshot fully processed and summary available.
+  - "Processing" — ETL in progress; provide an ETA where possible.
+  - "Queued" — snapshot enqueued for ETL.
+  - "Failed" — show a short reason and link to error details.
+- Confirm dialogs should explicitly state consequences, e.g., "Reprocessing will re-run ETL for this snapshot and may
+  overwrite current normalized records. Type REPROCESS to confirm."
+
+---
+
+## 10. Integrations & External Systems
+
+This section lists each external system we integrate with, the expected contract, security considerations, operational
+needs and best practices. Use it as an integration checklist for implementation, CI, runbooks and security review.
+
+---
+
+### 10.1 Discord bot integration (scopes, intents, webhooks)
+
+Overview
+
+- Purpose: provide player-facing commands (e.g., /profile), notifications, and optional guild admin features via a
+  Discord bot.
+- Clients: the bot runs as a service (Node.js/TypeScript) connecting to the Discord Gateway and calling our backend
+  APIs.
+
+Bot token & secrets
+
+- Store the bot token in secrets (GitHub Secret: DISCORD_TOKEN; runtime: secrets manager). Never commit tokens.
+- Use short-lived OAuth flows for any user-level consent (if linking accounts); avoid storing user tokens long-term.
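+
+To make the token-handling rules above concrete, here is a minimal, hedged startup sketch (assuming discord.js v14;
+the logging and structure are illustrative, not the project's actual bot code):
+
+```ts
+import { Client, GatewayIntentBits } from 'discord.js';
+
+// The token comes only from the environment (populated from a secrets manager);
+// it is never committed and never logged.
+const token = process.env.DISCORD_TOKEN;
+if (!token) {
+  console.error('DISCORD_TOKEN is not set; refusing to start.');
+  process.exit(1); // fail fast instead of running a half-configured bot
+}
+
+// Request only the non-privileged intents actually needed (see the intent list below).
+const client = new Client({ intents: [GatewayIntentBits.Guilds] });
+
+client.once('ready', () => {
+  console.log(`Bot ready as ${client.user?.tag}`); // log the bot identity, never the token
+});
+
+void client.login(token);
+```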
+
+Required Bot Scopes
+
+- Bot OAuth scopes:
+  - bot — add the bot to guilds
+  - applications.commands — register slash commands
+  - identify (if linking to users)
+  - email (only if explicitly required and consented)
+- Recommended optional scopes:
+  - guilds.members.read — only if the bot needs member discovery and is compliant with Discord policies (review
+    privacy implications)
+
+Recommended Gateway Intents
+
+- Privileged intents require enabling in the developer portal and may require justification:
+  - GUILD_MEMBERS (privileged) — only if you need member join events or mapping users to guilds. Avoid unless strictly
+    necessary.
+  - GUILD_PRESENCES (privileged) — usually unnecessary; avoid for privacy and rate reasons.
+- Non-privileged:
+  - GUILDS — required
+  - GUILD_MESSAGES — if the bot reacts to messages
+  - DIRECT_MESSAGES — only if the bot supports DMs
+
+Slash Commands & Webhook Patterns
+
+- Use slash commands for profile lookup: `/profile <namecode>`
+  - The command handler calls GET /api/v1/profile/summary/:namecode
+  - If the profile is not ready, reply ephemerally: "Profile processing — try again in ~30s"
+- Use interaction responses and followups properly (within the interaction timeout) and include links to the full
+  profile in the web UI.
+- Use webhooks for asynchronous notifications only if necessary (e.g., bulk ETL completion notifications to a given
+  channel) and restrict webhook URLs to server-side config.
+
+Rate limits & retry logic
+
+- Respect Discord's rate limits. Use the Discord library's built-in rate limiting.
+- Add retry with exponential backoff for 429 responses and log incidents (Sentry).
+- Avoid heavy operations inside interaction handlers; delegate to background jobs.
+
+Security & privacy
+
+- Mask or never display tokens, private identifiers or PII in bot messages.
+- If linking a Discord user to a NameCode or internal user_id, require an explicit user command/consent and store the
+  mapping in the DB with a clear audit log.
+- Use separate ephemeral tokens for webhooks if you must expose them.
+
+Operational considerations
+
+- Auto-sharding or multiple bot instances: use recommended sharding or a gateway manager when scaling.
+- Monitor bot health: register heartbeats and expose metrics (command_count, error_count, avg_latency).
+- Provide a "maintenance" mode command to disable features during migrations.
+
+Testing
+
+- Use a sandbox Discord application and test guilds.
+- Provide sample tokens in a secure test secrets store for CI e2e tests (rotate regularly).
+
+---
+
+### 10.2 Supabase / Postgres (connection, roles, backups)
+
+Overview
+
+- Primary data store: managed Postgres (Supabase or equivalent).
+- Use Postgres for hero_snapshots (JSONB), normalized tables and small catalogs.
+
+Connection & configuration
+
+- Use DATABASE_URL from runtime secrets (format: postgres://user:pass@host:port/dbname?sslmode=require).
+- Enforce SSL (PGSSLMODE=require) in production.
+- Use connection pooling (pgbouncer) or the provider's connection pool to avoid hitting connection limits from many
+  workers/clients.
+- Set statement timeouts and connection timeouts (application-side) to prevent long-running queries from blocking.
+
+Roles & least privilege
+
+- Define separate DB roles/users:
+  - migrations_role: used by migration jobs; limited to DDL operations in non-production, or specifically granted on
+    production with approvals.
+  - app_write_role: used by the API & workers for DML and ETL upserts.
+  - app_read_role: read-only; used by analytics and dashboards.
+ - admin_role: only for emergency & DBA tasks (not used in app runtime). +- Use separate credentials for CI (migrations) with limited scope and require environment approvals to run production + migrations. + +Extensions & provider constraints + +- Preferred extension: pgcrypto (gen_random_uuid()). If provider disallows extension creation, document fallback ( + generate UUIDs client-side or use alternative). +- Avoid extensions that require superuser privileges unless provider allows them. + +Backups & retention + +- Use provider-managed automated backups (daily snapshots) and enable continuous WAL archiving when available. +- Backup retention policy: at least 90 days for quick restore; archive older snapshots per organization policy. +- Periodically test restores (quarterly or as defined in BACKUP_RESTORE.md) and record the exercises. + +Monitoring & maintenance + +- Monitor: + - connection count + - long-running queries + - replication lag (if using replicas) + - index bloat and vacuum stats +- Set alerts for high connection counts or replication lag. +- Use partitioning for hero_snapshots if volume grows (monthly partitions). + +Migrations & schema changes + +- Use node-pg-migrate for migrations; run migrations first in staging. +- Use migration preflight checks (check CREATE EXTENSION permissions, estimate index build time). +- Apply production migrations only via manual GitHub Actions workflow with environment protection. + +Security + +- Restrict public DB access via network rules (VPC, allowlist). +- Rotate DB credentials regularly and after incidents. +- Use encryption at rest and in transit. + +Secrets & connectivity for CI + +- Store DATABASE_URL in GitHub Secrets for GitHub Actions. +- Avoid echoing secrets in logs. Use run steps that mask secrets and use environment protection. + +Operational procedures + +- Scale read replicas for analytics and high read throughput. +- For heavy backfills, use isolated worker instances and throttled concurrency to limit write pressure. + +--- + +### 10.3 Google APIs (service account usage) + +Overview + +- Use Google service accounts for any CI or infrastructure tasks requiring Google Cloud (optional). + - Common uses: uploading artifacts to GCS, running Cloud tasks, secret access for GCP-hosted resources. + +Service account & key handling + +- Use a dedicated service account per purpose (CI, backup, monitoring). +- Prefer Workload Identity or OIDC (GitHub Actions -> GCP) over long-lived JSON keys when possible. +- If JSON keys are required, store them encrypted as GitHub Secrets (GOOGLE_SA_JSON) and restrict access to protected + workflows/environments. + +Scopes & permissions + +- Principle of the least privilege: + - Give service accounts only the minimal IAM roles required (e.g., storage.objectAdmin for uploads, but prefer + granular roles). + - Avoid broad roles like Owner. + +Example usage patterns + +- GitHub Actions authenticates to GCP to upload artifacts to GCS or run export jobs. +- Scheduled backups may push export files to a GCS bucket (or S3) using the service account. + +Security & rotation + +- Rotate service account keys periodically if used. +- Audit IAM bindings and service account usage logs. + +Alternative & recommended patterns + +- Prefer cloud provider-native authentication methods: + - For GCP: use Workload Identity Federation from GitHub Actions to eliminate JSON keys. + - For AWS: use OIDC or short-lived STS tokens. 
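+
+As a hedged illustration of the keyless pattern recommended above (Application Default Credentials, which Workload
+Identity Federation populates in GitHub Actions instead of a long-lived GOOGLE_SA_JSON key), assuming the
+google-auth-library package; the scope is illustrative:
+
+```ts
+import { GoogleAuth } from 'google-auth-library';
+
+// No JSON key file: ADC resolves credentials from the environment
+// (Workload Identity Federation, the metadata server, or a local gcloud login in dev).
+export async function getGcpAccessToken(): Promise<string> {
+  const auth = new GoogleAuth({
+    scopes: ['https://www.googleapis.com/auth/devstorage.read_write'],
+  });
+  const client = await auth.getClient();
+  const { token } = await client.getAccessToken();
+  if (!token) throw new Error('Failed to obtain a GCP access token');
+  return token;
+}
+```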
+
+---
+
+### 10.4 CI/CD & Container Registry (GH Actions, GHCR)
+
+Overview
+
+- CI/CD via GitHub Actions.
+- Container images and artifacts published to the GitHub Container Registry (GHCR) or other registries as required.
+
+Workflows & environments
+
+- Key workflows:
+  - ci.yml — run tests, lint, build artifacts
+  - build-and-publish.yml — build containers and publish to GHCR with tags (`pr-<number>`, `sha-<commit>`, `latest` on
+    main)
+  - db-bootstrap.yml — manual workflow_dispatch for running migrations/bootstraps (protected environment, approver
+    required)
+  - deploy.yml — deploy to staging/prod (manual approvals for prod)
+- Use GitHub Environments to protect production secrets and require approvals for production bootstrap/deploy jobs.
+
+Secrets & artifacts
+
+- Store secrets in GitHub repository or organization secrets, restricted to environments.
+- Examples: DATABASE_URL, DISCORD_TOKEN, REDIS_URL, GHCR_PAT (or use the built-in GITHUB_TOKEN with package
+  permissions).
+- Mask secrets in logs and avoid printing them.
+- Configure artifact retention in Actions settings (short retention for logs unless required).
+
+Image tagging & retention
+
+- Tag images:
+  - owner/repo:`pr-<number>`, `sha-<commit>`, `v<semver>`, `latest` (only for main)
+- Use immutable tags for released versions (vX.Y.Z).
+- Configure a GHCR retention and cleanup policy for old images (avoid unbounded storage).
+
+Security in CI
+
+- Use Dependabot for dependency updates and run dependency scanning in CI.
+- Use SAST and license checks in CI (optional).
+- Use OIDC federated credentials where supported to avoid storing long-lived cloud keys.
+
+Deployment & rollout
+
+- Canary / blue-green deployments are recommended for the stateless services (API, workers).
+- Use feature flags in code to control the rollout of new behaviors.
+
+Access control & auditing
+
+- Limit who can approve workflows for protected environments.
+- Audit GitHub Actions runs and who triggered them.
+
+Testing & promotion
+
+- Require all migration PRs to run a migration sanity job in CI against a disposable DB container.
+- Promote changes to staging only after CI passes; production deployment requires manual approval.
+
+---
+
+### 10.5 Monitoring & Logging providers (e.g., Prometheus, Grafana, Sentry)
+
+Overview
+
+- Observability stack to monitor the ETL worker, API services, queue health and DB metrics, and to alert on anomalies.
+
+Metrics & monitoring
+
+- Metrics exported by services:
+  - ETL worker: processed_count, success_count, failure_count, processing_time_histogram, queue_depth_gauge
+  - API: http_requests_total, http_request_duration_seconds, error_rate
+  - DB: connections, active queries, replication lag, slow queries
+  - Infrastructure: CPU, memory, disk, network
+- Prometheus:
+  - Scrape instrumented metrics endpoints (/metrics).
+  - Retention: per org policy; use long-term storage for analytics if needed.
+- Grafana:
+  - Dashboards:
+    - ETL overview (throughput, latency, queue depth)
+    - Snapshot ingestion & failures
+    - DB health & query performance
+    - Worker instance metrics and resource usage
+  - Alerts: create alert rules for the thresholds described in the NFRs (ETL failure spikes, queue depth, high latency)
+
+Error tracking & logs
+
+- Sentry:
+  - Capture exceptions and errors in the worker and API.
+  - Sanitize events to remove PII and tokens.
+  - Use environment tags (staging/prod) and sampling to control volume.
+- Structured logging:
+  - JSON logs with fields: timestamp, service, level, snapshot_id/job_id, request_id, message, details.
+  - Centralized aggregation (ELK stack, Logflare, Datadog logs).
+  - Index common fields (snapshot_id, user_id, error_type) to make searching easier.
+
+Tracing
+
+- Distributed tracing (optional but helpful): instrument with OpenTelemetry and forward to a tracing backend (Jaeger,
+  Tempo).
+- Include correlation ids in logs and traces (X-Request-Id or traceparent).
+
+Alerting & incident management
+
+- Channels: alerts via Slack, PagerDuty, or email for critical issues.
+- Define alert severity and the on-call rotation in runbooks.
+- Example alert thresholds:
+  - ETL failure rate > 1% sustained for 5 min → P0 alert
+  - Queue depth > threshold for 10 minutes → P1 alert
+  - DB connection count > 80% of limit → P1 alert
+
+Retention & privacy
+
+- Log retention policy: store logs for N days (configurable); anonymize or redact sensitive fields before long-term
+  storage.
+- Tune Sentry retention and sampling to avoid storing sensitive data.
+
+Testing & validation
+
+- Include observability tests in CI: ensure the metrics endpoint is reachable and basic counters increment when running
+  test flows.
+
+---
+
+### 10.6 Other third-party services
+
+This subsection covers additional services we are likely to use or may consider integrating. Each entry includes
+purpose, key constraints and operational guidance.
+
+Redis / Queue (BullMQ, RQ, Sidekiq, etc.)
+
+- Purpose: durable job queue for ETL tasks and background jobs.
+- Choices: Redis-backed (BullMQ) or a hosted queue (AWS SQS, Google Pub/Sub) for durability.
+- Notes:
+  - Redis must be sized for concurrency and not used as primary persistence.
+  - Ensure persistence if required and monitor memory usage.
+  - Consider durable queue options (SQS) if Redis availability is a concern.
+
+Object Storage (S3 / Spaces / MinIO)
+
+- Purpose: archive old snapshots, store exports (CSV/Parquet), store artifacts (an upload sketch appears further down
+  this subsection).
+- Security:
+  - Use dedicated buckets, enforce IAM policies and encryption (SSE).
+  - Use lifecycle rules to move archived data to colder tiers and remove older backups per retention.
+- Performance:
+  - When uploading many files in parallel, follow provider guidelines for prefixes and parallel requests.
+
+Email / Notifications (SES, SendGrid)
+
+- Purpose: notify admins about ETL failures, migration results, or user-facing notifications.
+- Notes:
+  - Use verified domains, monitor quotas and bounce rates.
+  - Keep email templates for incident notifications.
+
+Dependency Scanning & Security Tools
+
+- Dependabot (GitHub), Snyk, WhiteSource
+- Purpose: detect vulnerable dependencies and license issues.
+- Integrate scans into CI and require fixes for critical vulnerabilities.
+
+Secrets Manager
+
+- Purpose: store runtime secrets securely (AWS Secrets Manager, GCP Secret Manager, Vault).
+- Integration:
+  - Fetch secrets at runtime with minimal latency and caching.
+  - Audit access and rotate secrets regularly.
+
+CI Artifacts & Storage (GitHub Packages, GHCR)
+
+- Purpose: store built container images and artifacts.
+- Policies:
+  - Retention and cleanup policy for storage.
+  - Access control and package permissions.
+
+Analytics / BI (Snowflake, BigQuery, Redshift, or ETL to CSV)
+
+- Purpose: heavy analytics and reporting (materialized views or periodic exports).
+- Considerations:
+  - Use scheduled exports to avoid hitting the transactional DB during work hours.
+  - Sanitize PII before exporting.
+
+Payment Providers (if future monetization)
+
+- Purpose: handle in-app purchases or subscriptions.
+- Considerations:
+  - PCI compliance is out of scope until payments are added. Plan carefully before adding.
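+
+To ground the Object Storage entry above, a minimal archival-upload sketch (AWS SDK v3; the bucket env var, key layout
+and checksum metadata field are assumptions for illustration, not a fixed contract):
+
+```ts
+import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
+import { createHash } from 'node:crypto';
+
+const s3 = new S3Client({ region: process.env.AWS_REGION });
+
+export async function archiveSnapshot(snapshotId: string, rawJson: string): Promise<string> {
+  const checksum = createHash('sha256').update(rawJson).digest('hex');
+  // Date-prefixed keys keep lifecycle rules simple (e.g., transition or expire by prefix).
+  const key = `snapshots/${new Date().toISOString().slice(0, 10)}/${snapshotId}.json`;
+  await s3.send(
+    new PutObjectCommand({
+      Bucket: process.env.SNAPSHOT_ARCHIVE_BUCKET, // hypothetical env var
+      Key: key,
+      Body: rawJson,
+      ContentType: 'application/json',
+      ServerSideEncryption: 'AES256', // request SSE explicitly
+      Metadata: { 'content-sha256': checksum },
+    }),
+  );
+  return key; // record as s3_path (with the checksum) in the Postgres audit table
+}
+```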
+ +CDN (Cloudflare, Fastly) + +- Purpose: accelerate static assets and protect APIs (WAF). +- Use for: hosting web UI assets, protecting API endpoints with rate-limiting and WAF rules. + +License & Third-party dependency inventory + +- Maintain a list of third-party services, versions and licenses in docs/THIRD_PARTY.md and enforce policies for + acceptable licenses. + +--- + +Integration checklist (practical) + +- [ ] Define credentials & secrets for each service and store them in secrets manager / GitHub Secrets. +- [ ] Document per-service IAM roles and scopes. +- [ ] Add health checks for each external dependency (DB, Redis, S3, Discord). +- [ ] Add monitoring (metrics & alerts) for third-party interactions: S3 failures, Redis memory, Discord rate-limit + events. +- [ ] Add tests & CI steps that validate integrations in sandbox environments. + +References + +- Discord developer docs: https://discord.com/developers/docs/intro +- Supabase docs: https://supabase.com/docs +- GitHub Actions docs & Environments: https://docs.github.com/en/actions +- Prometheus/Grafana best practices and sample dashboards in docs/OBSERVABILITY.md + +--- + +## 11. Architecture & Technical Design + +This section describes the system architecture and technical design decisions for the Player Profile & DB Foundation +project. It covers a high-level architecture overview, component diagrams and data flow (described), service boundaries, +storage and caching strategy, queueing and background jobs, deployment topology for staging/production, CI/CD pipeline +summary pointers, and failover & disaster recovery approach. + +--- + +### 11.1 System architecture overview + +Goal + +- Provide a scalable, observable and secure platform to ingest large player profile JSON snapshots, persist raw + snapshots for audit, normalize important fields into relational tables, serve low‑latency profile reads (bot/UI) and + support analytics/backfill operations. +- Keep components modular so we can scale each independently: API, ETL workers, Discord bot, admin UI, and + analytics/export jobs. + +Primary components + +- API service (stateless): receives snapshot ingestion requests, profile read endpoints, admin endpoints. +- ETL worker(s) (stateless compute): background consumers that claim hero_snapshots and upsert normalized tables. +- Queue broker: Redis (BullMQ) by default; optionally replaceable with durable queue (AWS SQS / GCP Pub/Sub) for higher + durability. +- Postgres primary DB: stores hero_snapshots (JSONB), normalized tables and metadata. Managed provider (Supabase or + equivalent) recommended. +- Redis cache / connection pool: used for job queue, ephemeral caches and rate limiting. +- Object storage: S3 or S3-compatible for snapshot archival and exports. +- Discord bot: separate process connecting to Discord Gateway and calling the API. +- Admin UI: read-only/operational UI for ETL dashboard, raw snapshot viewer and migration runbook links. +- CI/CD: GitHub Actions building images and running migrations (manual bootstrap for prod). +- Observability: Prometheus metrics, Grafana dashboards, Sentry for error tracking, centralized structured logging. + +High-level flow (summary) + +1. Client (CLI / bot / UI) posts snapshot to API or uploads snapshot file. +2. API validates, computes content hash, stores raw payload into hero_snapshots JSONB, and enqueues a job with snapshot + id. +3. 
ETL worker dequeues job, atomically claims the snapshot, streams/parses payload, upserts normalized rows, writes + profile_summary, and marks snapshot processed or records errors. +4. Bot/UI queries profile_summary for fast reads; falls back to latest processed hero_snapshot when summary missing. +5. Retention job archives snapshots to S3 and removes them from DB per policy. + +--- + +### 11.2 Component diagrams & data flow + +This section describes the component interactions and the core data flow as sequences and call graphs (textual). Keep a +copy of the diagram in docs/ERD.svg or docs/architecture/diagram.svg (recommended). + +Component interaction (textual diagram) + +- Client (CLI / Web / Bot) + -> API Service (Ingress) + -> Postgres (hero_snapshots) + -> Redis (enqueue snapshot id) + -> Response to Client (snapshot queued) +- Worker pool (n instances) + -> Redis (dequeue) + -> Postgres (upsert normalized tables) + -> Prometheus / Metrics + -> Sentry / Logs + -> Postgres (update hero_snapshot.processed_at) +- API Service + -> Postgres (read user_profile_summary) + -> Redis (cache profile_summary) + -> S3 (download archived snapshot when requested) +- Admin UI + -> API Service (admin endpoints) + -> GitHub Actions (link to migration runs and bootstrap) + +Sequence for snapshot ingestion & processing + +1. POST /api/v1/internal/snapshots -> API validates payload, computes SHA256(content). +2. API inserts hero_snapshots row with content_hash, size_bytes, raw JSONB and returns snapshot_id. +3. API enqueues snapshot_id into Redis/BullMQ queue (job payload minimal: snapshot_id). +4. Worker picks job: + - atomic SQL claim: UPDATE hero_snapshots SET processing=true WHERE id=$1 AND processing=false RETURNING id + - read raw JSONB from hero_snapshots by id (streaming/parsing) + - upsert users (ON CONFLICT), user_troops (ON CONFLICT DO UPDATE), etc. + - write user_profile_summary (INSERT ... ON CONFLICT DO UPDATE) + - update hero_snapshots (processed_at=now(), processing=false) and write alerts/metrics +5. API reads user_profile_summary for client requests; if missing, optionally compute best-effort response using latest + processed hero_snapshot. + +Data flow security & telemetry + +- All messages and DB writes include correlation id (X-Request-Id) to link API request -> queue job -> worker logs -> + final write. +- Telemetry produced: job_duration_ms, processed_rows_count, failure_count, queue_wait_time. + +--- + +### 11.3 Service boundaries (microservices / monolith) + +Recommended decomposition + +- api-service (stateless) + - Responsibilities: authentication & authorization, snapshot ingestion endpoint, read endpoints (profile summary), + admin endpoints, health & metrics. + - Tech: Node.js/TypeScript (existing repo), express/fastify, pg client, BullMQ client. +- etl-worker (stateless) + - Responsibilities: consume jobs, parse JSON, perform domain upserts, emit metrics, handle retries and error + recording. + - Tech: Node.js/TypeScript (same codebase or separate package), worker framework using BullMQ or alternative. +- discord-bot (separate process) + - Responsibilities: register slash commands, respond to users, call API for summary reads. + - Tech: discord.js or equivalent. +- admin-ui (optional separate frontend) + - Responsibilities: operational dashboard, raw snapshot viewer, migration links, reprocess UI. + - Tech: React / Next.js (deployed as static + API calls). 
+- analytics/export workers (batch) + - Responsibilities: materialized view refresh, export scheduled jobs to S3 (CSV/Parquet). +- orchestration/runtime + - Responsibilities: deploy & scale instances, scheduling, secrets, and monitoring. + +Why separate services + +- Scaling flexibility: ETL workers have different scaling needs (CPU/memory) compared to API service. +- Operational isolation: crashes or heavy ETL workloads should not impact API latency. +- Security: admin UI & migration operations isolated and access-restricted. + +Monorepo / single-repo approach + +- Keep services in the same repository (monorepo) for shared types and utilities, but package them as separate + containers. Share CI pipelines and consistent linting/tests. + +Boundaries & contracts + +- API <-> Worker: queue messages contain minimal payload (snapshot_id + correlation metadata). Workers rely only on + hero_snapshots table schema and documented upsert contracts. +- API <-> Bot: public read API with rate limiting; bot must not rely on slow snapshot ingestion synchronously. +- Admin UI <-> API: admin endpoints protected by RBAC and audit logging. + +--- + +### 11.4 Storage & caching strategy (Postgres, Redis, S3) + +Postgres (primary datastore) + +- Use managed Postgres (Supabase or equivalent) for transactional data and JSONB snapshots. +- Store raw snapshots in hero_snapshots JSONB (TOAST compression). +- Normalize frequently queried entities (user_troops, user_profile_summary) for performant queries. +- Index strategy: + - GIN index on hero_snapshots.raw for ad-hoc search (used sparingly). + - B-tree and partial indexes for normalized tables as documented in Section 7. +- Partition hero_snapshots by created_at (monthly) when dataset grows (recommended after threshold). + +Redis (cache and queue) + +- Primary usage: job broker (BullMQ) and ephemeral caching (profile_summary cache, rate-limits). +- Configure persistence and sizing appropriate for job retention; if Redis is not durable enough, consider SQS/Cloud + PubSub for critical jobs. +- Use Redis for a short-term summary cache (TTL e.g., 30–60s) to reduce DB reads during bursts (cache invalidation when + ETL updates summary). +- Use Redis for distributed locks if needed (but prefer DB atomic claims for snapshot processing to keep single source + of truth). + +Object storage (S3 or compatible) + +- Store archived snapshots and analytic exports in S3 bucket with lifecycle rules and encryption (SSE). +- Keep metadata (s3_path, checksum, archived_at) in Postgres audit tables. +- For large backfills / exports, write Parquet files to S3 for BI ingestion. + +Other caches + +- Optional application-level caches (in-memory) for extremely hot reads (but prefer Redis or read replicas for scale). +- Read replicas for Postgres to scale heavy reads without impacting writes. + +Backup & retention storage + +- Rely on provider-managed backups for Postgres. Additionally archive snapshots to S3 to control DB size. +- Set retention policy and lifecycle to reduce cost (e.g., archive to Glacier/cold tier after N days). + +--- + +### 11.5 Queueing & background jobs (worker design) + +Queue selection + +- Primary choice: Redis + BullMQ for job queueing (popular Node.js ecosystem). +- Alternative: AWS SQS or Google Pub/Sub for durability and managed scaling (easy to swap the queue adapter). + +Job design and payloads + +- Job payload minimal: { snapshot_id: UUID, correlation_id: UUID, attempt: n } +- Keep job small to avoid large payload serialization overhead. 
+- Use Idempotency-Key pattern for any jobs that may be retried or retriggered. + +Worker lifecycle & claim semantics + +- Claiming snapshot: + - Use atomic DB claim to avoid race conditions: UPDATE hero_snapshots SET processing=true, + processing_started_at=now() WHERE id=$1 AND (processing IS NULL OR processing=false) RETURNING id + - If claim succeeded, proceed; otherwise, skip job (another worker claimed it). +- Processing model: + - Stream/parse JSON payload (avoid loading entire 3MB object into memory if possible). + - For large arrays (troops): process in batches; build multi-row upserts for efficiency. + - Upsert per entity: + - users: upsert by unique key (namecode or discord_user_id) + - user_troops: ON CONFLICT (user_id, troop_id) DO UPDATE set amount/level/extra + - Commit by entity group (user, troops, pets) to keep transactions small and reduce lock contention. +- Error handling & retry policy: + - Transient DB/network errors: exponential backoff with jitter; use BullMQ retry features. + - Parsing/validation errors: write to etl_errors with details and mark snapshot as failed if not recoverable. + - After N retried attempts, mark snapshot as failed and escalate (alert). +- Idempotency: + - Design database upserts to be idempotent. Use snapshot_id for audit and include last_processed_snapshot_id on + summary rows if desired. +- Scalability & parallelism: + - Worker pool scales horizontally; use queue length to drive autoscaling. + - Limit concurrency per worker to avoid too many DB connections (respect pgbouncer limits). +- Observability: + - Expose /metrics for Prometheus: processed_count, success_count, failure_count, processing_time_histogram. + - Emit structured logs with snapshot_id and correlation_id for tracing. + +Admin & auxiliary jobs + +- Reprocess jobs: same job processing path but require admin-auth enqueues. +- Retention/archival job: scheduled job that selects snapshot partitions to archive to S3 and updates DB. +- Export jobs: scheduled or on-demand jobs that materialize queries and write to S3. + +--- + +### 11.6 Deployment topology (staging, production) + +Environments + +- local: developer machine via docker-compose (Postgres, Redis, local S3 emulator) with scripts/bootstrap-db.sh for + migrations. +- ci: GitHub Actions ephemeral environment for tests and migration preflight checks (use ephemeral DB). +- staging: production-like environment for integration testing, smoke tests and validation; runs same containers as prod + but smaller resources. +- production: high-availability deployment across availability zones/regions as required. + +Deployment model options + +- Container orchestration (recommended): + - Kubernetes (managed): EKS/GKE/AKS or Fly/Koyeb for small teams — gives autoscaling, health checks and service + discovery. + - Alternative: ECS/Fargate for simpler serverless container runtime. +- Simpler options: + - Single-container managed platforms (Fly.io, Render, Heroku) for smaller teams, acceptable if traffic low. + +Service placement & redundancy + +- API service: multiple replicas behind load balancer, autoscaled by CPU/RPS. +- ETL workers: autoscaled group sized by queue depth and desired ETL throughput. +- Discord bot: one or several instances (gateway sharding when required). +- Admin UI: static frontend served via CDN, backend admin APIs protected. +- Database: managed Postgres with read-replicas; multi-AZ recommended. +- Redis: managed (cluster or HA) with persistence enabled if needed. 
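+
+To make the claim semantics from Section 11.5 concrete before moving to traffic flow, a minimal worker-side sketch
+(table and column names follow the claim statement above; the upsert details and retry/backoff are elided):
+
+```ts
+import { Pool } from 'pg';
+
+const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 5 });
+
+export async function processSnapshot(snapshotId: string): Promise<void> {
+  // Atomic claim: only one worker's UPDATE can match the WHERE clause.
+  const claim = await pool.query(
+    `UPDATE hero_snapshots
+        SET processing = true, processing_started_at = now()
+      WHERE id = $1 AND (processing IS NULL OR processing = false)
+      RETURNING id, raw`,
+    [snapshotId],
+  );
+  if (claim.rowCount === 0) return; // another worker already claimed this snapshot
+
+  try {
+    const raw = claim.rows[0].raw; // JSONB payload to normalize
+    // ... idempotent ON CONFLICT upserts for users / user_troops / profile summary go here ...
+    await pool.query(
+      `UPDATE hero_snapshots SET processed_at = now(), processing = false WHERE id = $1`,
+      [snapshotId],
+    );
+  } catch (err) {
+    // Release the claim so a retry (with backoff) can pick the snapshot up again.
+    await pool.query(`UPDATE hero_snapshots SET processing = false WHERE id = $1`, [snapshotId]);
+    throw err;
+  }
+}
+```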
+
+Traffic flow & ingress
+
+- Use a managed API gateway / load balancer to terminate TLS and route to API services.
+- Enforce WAF rules and rate limiting at the edge if necessary.
+
+Deployment & release practices
+
+- Build container images in CI and push them to GHCR.
+- Use immutable tags (`sha-<commit>`) and promote images between environments.
+- Use staged deployment:
+  - CI build -> staging deploy -> smoke tests -> manual approval -> production deploy
+- Rollout strategies:
+  - Canary or blue/green deploys for API and worker updates where possible.
+  - For DB schema changes, follow migration best practices: add columns, backfill, convert and cut over in separate
+    steps.
+
+Secrets and configuration
+
+- Runtime secrets are loaded from a secrets manager (or a provider-specific secret store) and not stored in containers
+  as plain text.
+- Use environment-specific configuration (12-factor app) and avoid baked-in credentials.
+
+---
+
+### 11.7 CI/CD pipeline summary (link to CI_CD.md)
+
+Summary of CI/CD responsibilities (details in docs/CI_CD.md)
+
+- Tests: run unit tests, linting, typechecks and integration tests against an ephemeral DB in CI.
+- Build: build container artifacts and run container image scanning.
+- Publish: push images to GHCR with immutability patterns.
+- Migrations: run migration preflight in CI; production migrations are performed by a manual GitHub Actions workflow
+  (db-bootstrap.yml) requiring environment approval.
+- Releases: create GitHub release notes and tag images; deploy to staging automatically on main, to production on
+  manual approval.
+- Rollbacks: use a prior immutable image tag to roll back services; run a DB rollback only if reversible, or follow the
+  restore-from-backup runbook.
+
+Link: docs/CI_CD.md (see that document for full workflow definitions, GitHub Actions config examples and required
+secrets).
+
+---
+
+### 11.8 Failover & disaster recovery approach
+
+Objectives
+
+- Meet RTO and RPO targets (RTO target: 1 hour for critical reads; RPO: ~1 hour as defined in the NFRs).
+- Ensure data durability and the ability to restore service in degraded mode.
+
+Backup strategy
+
+- Managed DB automated daily full snapshots + continuous WAL archiving (where available).
+- Regular snapshot backups validated by automated restore tests (quarterly full restore, monthly partial restore
+  testing).
+- Archive raw snapshots to S3 with checksum and metadata as an additional data source for recovery.
+
+Failover & redundancy
+
+- Postgres:
+  - Use managed-provider multi-AZ deployments with automated failover (enable read-replica promotion if the provider
+    supports it).
+  - Maintain standby read-replicas across AZs (optionally cross-region for stronger DR).
+- API & Workers:
+  - Multi-AZ replicas behind the LB; implement health probes and automatic restart.
+  - Autoscaling groups should be configured with a minimum replica count (ideally >1).
+- Redis:
+  - Use an HA/cluster deployment with failover, or use a managed queue service (SQS) as a fallback.
+- Object storage:
+  - Use a high-durability provider (S3) and enable versioning if needed.
+
+DR runbooks & steps
+
+- Emergency flow examples:
+  1. Detection: monitoring alerts (DB down, massive ETL failures).
+  2. Triage: follow the on-call runbook (docs/OP_RUNBOOKS/INCIDENT_RESPONSE.md).
+  3. Mitigation:
+     - Stop workers to prevent further writes if the DB is in an inconsistent state.
+     - Switch API traffic to read-only or to a fallback read-replica if the primary is degraded.
+     - If the primary DB is irrecoverable, promote a read-replica or restore to a new instance from backup.
+     - If a snapshot backfill is required, use archived snapshots from S3 to re-ingest in a controlled backfill job.
+  4. Recovery: bring up services against the restored DB, run smoke tests, resume processing.
+  5. Postmortem: produce an incident report, RCA and preventive actions.
+- Pre-approved emergency steps:
+  - Restore from the latest good snapshot -> replay schema migrations (if needed).
+  - Use point-in-time restore to the RPO target.
+
+Testing & drills
+
+- Schedule periodic DR drills to verify restore procedures and runbook clarity.
+- Include simulated failover tests for read-replica promotion and application failover.
+- Keep runbooks current and versioned in docs/OP_RUNBOOKS/.
+
+Data integrity & consistency
+
+- Design the ETL with idempotent operations and audit logs so that reprocessing archived snapshots yields a consistent
+  state.
+- Record which snapshot versions were used to build profile_summary so rebuilds are traceable.
+
+Cost vs availability tradeoffs
+
+- Evaluate cross-region replication and multi-region deployments against cost and the required SLA. Use a tiered
+  approach:
+  - Starter: single-region multi-AZ with backups and manual restore.
+  - Higher availability: cross-region read-replicas and automated failover.
+
+---
+
+References & artifacts to maintain
+
+- docs/DB_MODEL.md (schema & DDL)
+- docs/ETL_AND_WORKER.md (worker internals)
+- docs/MIGRATIONS.md and docs/OP_RUNBOOKS/MIGRATION_ROLLBACK.md
+- docs/CI_CD.md (CI/CD pipeline)
+- docs/OBSERVABILITY.md (monitoring & alerting)
+- diagrams/architecture.svg (visual architecture diagram)
+
+---
+
+## 12. Operational & Runbook Items
+
+This section documents day‑to‑day operational responsibilities, on‑call and escalation paths, the short list of required
+runbooks with actionable steps for common incidents, maintenance/upgrade procedure guidance and cost/budget monitoring
+practices. The runbook content here is intended to be concise and actionable — each runbook below should be copied into
+its own dedicated file under docs/OP_RUNBOOKS/ for expansion and sign‑off.
+
+---
+
+### 12.1 Day-to-day operations (who does what)
+
+Role matrix (high level)
+
+- Product Owner (PO)
+  - Prioritize operational work and approve maintenance windows.
+  - Communicate incidents and planned downtime to the community/stakeholders.
+- Technical Lead / Engineering Lead
+  - Technical decisions and approvals for schema changes and major deploys.
+  - Triage complex incidents and coordinate the engineering response.
+- Backend Engineers / Devs
+  - Implement features, fixes and ETL improvements.
+  - Respond to issues assigned from monitoring (level 2).
+  - Maintain CI pipelines; ensure migrations and seeds are correct.
+- DevOps / SRE
+  - Maintain infrastructure (DB, Redis, queues, S3) and CI/CD workflows.
+  - Responsible for runbook maintenance, backups, and recovery drills.
+  - Implement and tune autoscaling, alerts and service accounts.
+- On‑call Engineer (rotating)
+  - First responder for P0/P1 alerts per the on‑call schedule.
+  - Execute runbooks for common incidents; escalate as needed.
+- QA Engineer
+  - Validate fixes in staging, run deployment smoke tests, verify rollbacks succeed.
+- Data Analyst
+  - Maintain analytics jobs, materialized views and exports; assist on data-integrity incidents.
+- Community Manager / Bot Operator
+  - Communicate outages, respond to community reports and coordinate with the PO on user messaging.
+- Security Officer / Privacy Officer
+  - Advise on incidents with potential leakage; coordinate disclosure and legal steps.
+
+Daily operational tasks
+
+- Morning check: review dashboards for overnight anomalies (ETL failure rate, queue depth, DB replication lag).
+- Alert triage: acknowledge and assign alerts within the defined SLA.
+- Backup check: verify daily backup job success and retention logs.
+- Queue health: inspect queue depth, worker heartbeats and recently failed jobs.
+- Deployments: apply small, tested changes during working hours to non-prod; schedule production migrations.
+- Documentation: update runbooks after any incident or change.
+
+Operational dashboards to monitor daily
+
+- ETL health: processed_count, failure_rate, queue_depth, avg_processing_time.
+- Snapshot ingestion: ingestion rate, size distribution, duplicate rate.
+- DB health: connections, slow queries, replication lag, bloat.
+- Cost dashboards: monthly spend trends and top cost centers.
+
+---
+
+### 12.2 On-call & escalation paths
+
+On-call model and contact flows
+
+- On-call schedule
+  - Maintain a weekly rotating on‑call roster (e.g., 1 week per engineer). Store the roster in
+    docs/OP_RUNBOOKS/ONCALL.md and keep it accessible to maintainers.
+- Alerting channels
+  - Primary alerts: PagerDuty (or equivalent) for P0 incidents (pages).
+  - Secondary notifications: Slack channel #ops-alerts (read‑only for automated alerts).
+  - Email for lower‑severity notifications and billing alerts.
+- Severity definitions
+  - P0 — Critical: production outage affecting many or all users (API down, DB inaccessible, ETL completely halted).
+    Immediate page to on‑call; 15 min response expectation.
+  - P1 — High: significant degradation (high ETL failure rate, degraded performance, large queue backlog). Page or
+    high-priority Slack ping; 30 min response expectation.
+  - P2 — Medium: partial loss of functionality or non-urgent failures (single feature affected). Slack notify; action
+    within a business day.
+  - P3 — Low: informational issues, minor UX bugs, scheduled maintenance notifications.
+- Escalation path
+  1. On‑call engineer (first responder). Acknowledge within 15 minutes for P0.
+  2. If unresolved in 30 minutes or severity increases, escalate to the Engineering Lead / Technical Lead.
+  3. If unresolved in 60 minutes or the incident impacts SLA/customers, escalate to the Product Owner and DevOps lead.
+  4. For security incidents or suspected data breaches, notify the Security Officer immediately (do not postpone).
+- Communication flow
+  - Use an incident channel (#incident-YYYYMMDD-XYZ) to coordinate the response (create via the standard template).
+  - Provide regular status updates (every 15–30 minutes) until mitigation.
+  - Once stabilized, perform a postmortem and publish the RCA within the agreed SLA (e.g., 3 business days).
+
+Escalation contact details (placeholder)
+
+- On‑call rotation / PagerDuty schedule: docs/OP_RUNBOOKS/ONCALL.md
+- Slack channels: #ops-alerts, #incident-management
+- Emergency escalation: Engineering Lead (name/email), PO (name/email), Security Officer (name/email)
+  (Replace placeholders with actual names/emails in the production doc)
+
+---
+
+### 12.3 Runbooks (short list of required runbooks)
+
+Each runbook below must exist as a dedicated document in docs/OP_RUNBOOKS/ with step‑by‑step commands, required
+credentials (location reference), verification queries and "when to escalate" rules. The abbreviated runbook summaries
+below provide the core actions and checks.
+
+Runbook: Incident Response (critical production incident)
+
+- Purpose: triage and mitigate production incidents quickly and safely.
+- Preconditions:
+  - Alert triggered (P0/P1).
On‑call engineer available. +- Quick checklist: + 1. Acknowledge the alert in PagerDuty and create an incident channel (#incident-YYYYMMDD-XYZ). + 2. Capture initial facts: time, alert name, affected services, scope (percent users), first observed. + 3. Establish incident commander (IC) and roles: scribe (notes), comms (external comms), tech leads. + 4. Triage: run health checks (GET /api/v1/health, DB connectivity check, queue depth, worker heartbeats). + 5. Contain: + - If API overloaded: enable read‑only mode or scale API replicas. + - If DB is failing: stop ETL workers to avoid further load; switch reads to replica if available. + - If queue backlog: scale workers cautiously or pause enqueueing non-critical jobs. + 6. Mitigate: apply hotfix or rollback to previous stable image; if schema related, stop and coordinate with + migrations owner. + 7. Communication: post periodic updates to stakeholders and public status page if applicable. + 8. Post-incident: collect logs, assign RCA owner, schedule postmortem publication (with timeline, root cause, + fixes). +- Verification: + - Confirm service health (API 200 on /health), ETL failure rates dropped, queue depth stabilized. +- Escalation: + - If suspected data loss or leak, notify Security Officer immediately. If SLA breach likely, notify PO and + stakeholders. + +Runbook: DB Restore (postgres recovery) + +- Purpose: restore Postgres to a known good state from backups (partial or full). +- Preconditions: + - Confirm backup availability and most recent successful backup (check backup logs). + - Ensure sufficient permissions to perform restore (DB admin). Coordinate maintenance window and approvals. +- Quick checklist: + 1. Stop ETL workers and pause incoming snapshot ingestion (set API to return 503 for new writes if necessary). + 2. Note current DB state and take diagnostic dumps (if possible). + 3. Choose restore point: + - Full snapshot restore: use most recent full backup. + - Point-in-time restore: compute desired target_time (RPO). + 4. Restore into a new DB instance (do not overwrite primary until validated). + - Provider-managed: use provider console (Supabase/GCP/AWS) to restore snapshot or perform PITR restore. + - Manual: pg_restore / psql restore from dump; apply WAL segments as required. + 5. Run smoke tests on restored DB: + - Schema validity: SELECT count(*) FROM users; sample user_profile_summary. + - Application smoke tests: run minimal end-to-end ingestion flow against restored DB in staging. + 6. If validation passes, promote restored DB to primary (follow provider-specific steps) or swap connection string + with minimal downtime. + 7. Restart workers and ingestion; monitor metrics closely. +- Verification: + - Successful sample queries, ETL run for small sample snapshot, alerting cleared. +- Rollback: + - If restored DB is invalid, revert to previous step and consult backups or escalate to DB admin. +- Postmortem: + - Document root cause, time to restore, data lost (if any), and process improvements. + +Runbook: Scaling Up (ETL workers / DB / Redis) + +- Purpose: increase capacity for worker throughput, API replicas, DB resources, or Redis. +- Preconditions: + - Observed sustained queue depth > threshold or high CPU/memory on workers/API or DB metrics triggers. +- Quick checklist: + 1. Assess current capacity and bottleneck (CPU, memory, DB connections, queue depth). + 2. For workers: + - Increase worker replicas (k8s HPA or start additional instances) or increase worker process concurrency env + var. 
+ - Monitor DB connections and set per‑worker max connections to avoid exhausting DB. + - If needed, scale DB vertically or add read replicas for read traffic. + 3. For API: + - Scale API replicas using CI/CD or HPA based on CPU/RPS. + - Ensure load balancer health checks ok. + 4. For Postgres: + - Vertical scaling: increase instance class (CPU/RAM) via provider console. Plan for brief failover if provider + has maintenance windows. + - Read scale: add or promote read replica for analytics. + - Partition hero_snapshots if write volume extremely high. + 5. For Redis: + - Increase memory/instance class or add cluster nodes. Validate persistence settings. + 6. Verify: + - Monitor queue depth, processing rate, latency, DB connections and worker error rates. + 7. De‑scale when load returns to normal to control cost. +- Safety: + - Avoid scaling DB schema changes at same time as scale-up events; separate concerns. +- Post action: + - Update capacity planning docs and autoscaling thresholds. + +Other required runbooks (short titles & purpose) + +- Runbook: Reprocess Snapshot (admin flow) — step-by-step to re-enqueue and monitor. +- Runbook: Apply Migrations (preflight → apply → validate) — checklist for manual GitHub Action bootstrap approval. +- Runbook: Backup Verification & Restore Drill — schedule and how to run a restore drill. +- Runbook: Secrets Compromise / Rotation — how to rotate and revoke secrets quickly. +- Runbook: Cost Spike Investigation — identify services causing cost increase and emergency mitigation. + +--- + +### 12.4 Maintenance windows & upgrade procedures + +Maintenance windows + +- Policy: + - Routine maintenance window: weekly window for non-disruptive updates (e.g., Tuesdays 02:00–04:00 UTC) for + non-production environments and low-traffic production tasks. + - High-risk changes (schema-altering, large index builds): schedule during pre-approved maintenance windows with 48h + notice to stakeholders and community (if user-facing). + - Emergency maintenance: allowed outside windows for severe incidents, but must be communicated as soon as possible. +- Communication: + - Announce planned downtime/maintenance at least 48 hours in advance via Slack, status page and community channels. + - Publish expected impact, start/end time, contact point and rollback plan. + +Upgrade / release procedure (high level) + +1. Prepare + - Create migration PR and run migration preflight in CI against a disposable DB. + - Prepare rollback plan and ensure backups taken immediately before production migration. + - Prepare runbook and designate approvers. +2. Approve + - Get Product & Engineering Lead approval and schedule maintenance window if required. +3. Execute (during maintenance window) + - Run preflight checks (extensions availability, estimated index time). + - Trigger db-bootstrap GitHub Action (requires environment approval). + - Run schema migrations in staging and smoke tests; then apply to production with approvals. + - Apply application deployment (canary/blue-green). +4. Validate + - Run smoke tests (API /health, sample ingest & ETL processing). + - Monitor metrics and logs for regressions for at least agreed post-deploy window (e.g., 1–2 hours). +5. Rollback (if needed) + - If critical failure occurs, follow rollback runbook: stop workers, restore DB from backup if migration + irreversible, or deploy previous image and run compensating migration if safe. +6. Post-upgrade + - Publish post-deploy report summarizing changes and any observed issues. 
+ - Update runbooks if new steps were required. + +Guidelines for DB migrations + +- Always take a fresh backup before applying production migrations. +- Avoid long-running exclusive locks; use phased migration strategy (add columns → backfill → enforce constraints). +- For large index creation use CREATE INDEX CONCURRENTLY and monitor the index build progress; schedule during + low-traffic windows. + +--- + +### 12.5 Cost monitoring & budget alerts + +Cost visibility & ownership + +- Assign cost owner per environment (staging, production) and per major service (DB, Redis, S3). +- Tag cloud resources (where possible) with project and environment tags to allow cost breakdowns. + +Monitoring and budgets + +- Set up billing alerts in cloud provider (monthly spend thresholds) and a billing dashboard with expected monthly run + rate. +- Configure budget alerts at multiple thresholds (e.g., 50%, 75%, 90%, 100% of monthly budget). +- Create a Slack channel #billing-alerts to forward budget notifications. + +Automated cost control measures + +- Autoscaling policies tuned to limit max replicas to reasonable levels and avoid runaway scaling. +- Implement lifecycle rules for S3 to move old archives to colder tiers and delete older-than-N-days. +- Scheduled job to prune or archive large volumes (hero_snapshots) per retention policy to control DB storage costs. +- Enforce image retention policy on GHCR (clean up old images). + +Action plan on cost spike + +1. Immediate triage: identify resource causing spike via cost dashboard (DB egress, large S3 writes, over-provisioned + instances). +2. Short-term mitigation: + - Scale down non-critical services, stop bulk backfill jobs, pause expensive scheduled exports. + - Apply aggressive retention / archival to remove hot storage. +3. Long-term: + - Rightsize instances, implement caching read paths, optimize ETL to reduce DB write churn, schedule heavy jobs + off-peak. +4. Post incident: + - Produce a cost RCA and update capacity plans. + +Billing & forecast review cadence + +- Weekly cost snapshot in the ops meeting. +- Monthly finance review and adjustment of budgets for upcoming events (community backfills, marketing events). +- Quarterly cost optimization audit. + +--- + +References & next actions + +- Create individual runbook files under docs/OP_RUNBOOKS/: + - ONCALL.md + - INCIDENT_RESPONSE.md + - DB_RESTORE.md + - SCALING_UP.md + - APPLY_MIGRATIONS.md + - BACKUP_DRILL.md + - COST_SPIKE.md +- Link runbooks from the main operations dashboard and ensure each runbook lists required permissions/secrets location, + contact list and verification queries. + +--- + +## 13. Testing & QA Strategy + +This section documents the testing strategy for the Player Profile & DB Foundation project: test levels, environments, +test data management, mapping of acceptance criteria to user stories, CI gating, and the QA sign‑off process. The goal +is to ensure high confidence when shipping changes that affect ingestion, ETL, schemas and read surfaces. + +--- + +### 13.1 Testing pyramid / levels + +We follow the standard testing pyramid and expand it with performance and reliability tests. Each level has +responsibilities, example tools and target coverage. + +1. Unit tests (fast, many) + - Purpose: verify individual functions, parsing logic, small utilities and business rules (e.g., content_hash + calculation, JSON mapping helpers, validation). + - Scope: + - JSON parsers / transformers for get_hero_profile. + - Upsert SQL generation helpers. 
+ - Small utilities used by CLI and API. + - Tools: Jest / Vitest (Node/TypeScript), sinon/mock for time, quick DB mocks where needed. + - Targets: + - Fast (<< 1s per test). + - Coverage target: team-defined minimum (e.g., 70–80% overall; critical modules 90%+). + +2. Integration tests (medium, moderate speed) + - Purpose: verify interactions between components (API ↔ Postgres, worker ↔ DB, queue integration). + - Scope: + - API endpoints exercising DB (ephemeral test DB). + - Worker logic processing sample snapshots and persisting normalized rows. + - Migration preflight tests applying migrations to a fresh DB. + - Tools: Jest + Supertest for HTTP endpoints, testcontainers or Docker Compose for ephemeral Postgres/Redis, + node-pg-migrate test harness. + - Targets: + - Run in CI per PR; reasonably fast (~30–120s depending on setup). + - Exercise key happy-paths and common error paths. + +3. End-to-end (E2E) tests (slower, representative) + - Purpose: validate full vertical flows in an environment similar to staging (ingest → queue → worker → summary + read). + - Scope: + - Ingest a representative get_hero_profile sample file, ensure hero_snapshots inserted, worker processes it, and + profile_summary is readable via API and bot behaviour simulated. + - Admin flows: reprocess snapshot, run retention job (simulation). + - Tools: Playwright / Cypress for UI interactions (if Admin UI exists), or scripts using HTTP clients; + testcontainers/staging for infrastructure. + - Targets: + - Run in CI on merge-to-main or nightly; gating for release on staging success. + +4. Load / performance tests (wide, scheduled) + - Purpose: ensure system meets NFRs and scales under expected and spike loads. + - Scope: + - API ingestion throughput (concurrent snapshot POSTs). + - Worker throughput & memory profiling with large snapshots (2–3MB). + - Read latency for profile_summary under concurrent reads. + - Backfill / bulk ingestion scenarios to validate throttling and scaling behavior. + - Tools: k6, Gatling, Locust for HTTP load; custom scripts to simulate queue and worker scaling. + - Targets: + - Run on demand and scheduled (weekly or before major releases). + - Define baselines (e.g., 100 snapshots/hour per worker instance) and SLAs (p95 read <200ms). + - Observability: + - Collect metrics (CPU/memory, DB locks, queue depth) and use these for sizing and autoscaling policy tuning. + +Cross-cutting tests + +- Security tests: static analysis (SAST), dependency scanning (npm audit / Snyk), secret scanning. Run in CI. +- Contract tests: ensure API and worker contracts remain stable (OpenAPI contract tests, schema validation). +- Chaos / resilience tests (optional): simulate worker crashes or DB failover in staging to validate failover runbooks. + +--- + +### 13.2 Test data & environment + +Environment types + +- local-dev: developer machine with docker-compose (Postgres, Redis, optional S3 emulator) for fast iteration. +- ci: ephemeral environment spun up in GitHub Actions (testcontainers or ephemeral cloud DB) for PR checks. +- staging: production-like environment (managed Postgres, Redis, S3) used for smoke tests and QA sign-off. +- production: live environment with guarded deployments and manual approval for migrations. + +Test data management + +- Representative sample payloads: + - Keep canonical test files in repository: /examples/get_hero_profile_COCORIDER_JQGB.json and other variants (small, + medium, large, malformed). + - Use anonymized or synthetic payloads for tests to avoid PII in repo. 
+- Fixtures & seeds: + - database/seeds/ contains idempotent seeds for catalogs (troop_catalog, pet_catalog) and a few test users. + - In tests, use seeds to create required catalog rows before running ETL flows. +- Data isolation + - Each CI/integration test run should use a fresh DB schema or a disposable DB instance to avoid cross-test + contamination. + - Use randomized namecode values or test-specific UUIDs in fixtures. +- Sensitive data handling + - Never include real user credentials or PII in test repositories. + - If using production sample data for deeper tests, anonymize and audit the dataset, and restrict access (see + TEST_DATA_POLICY.md). +- Snapshot fixtures and golden files + - Maintain "golden" expected outputs for transformations (small JSON or SQL query results) to assert mapping + correctness. + - Keep versioned fixtures aligned with migration versions (if schema evolves, update fixtures). + +Environment provisioning & teardown + +- Use Docker Compose for local developer flows (fast start scripts). +- Use testcontainers or ephemeral cloud instances in CI to run integration and migration tests. +- CI must clean up resources after test run to avoid cost leaks. + +Test data lifecycle & retention + +- Keep test artifacts (failing test snapshots, logs) as CI artifacts for troubleshooting, but limit retention (e.g., + 7–30 days). +- CI should prune old test databases and S3 test artifacts per budget/policy. + +--- + +### 13.3 Acceptance criteria & test cases mapping (link to USER_STORIES) + +Traceability and mapping + +- All user stories in docs/USER_STORIES.md (or Section 4 user stories) must map to one or more test cases. +- Maintain a traceability matrix (simple CSV or doc) that links: + - Story ID → Acceptance Criteria → Test case IDs → Test type (unit/integration/e2e) → Automated? (yes/no) → + Location (test file path or test case management tool) + +Example mapping (samples) + +- STORY-DB-001 (migrations) + - Acceptance: running pnpm migrate:up creates expected tables + - Test cases: integration/migrations.test.ts (CI), manual smoke-check script for staging +- STORY-ETL-001 (worker idempotency) + - Acceptance: worker sets processed_at and normalized tables reflect snapshot; reprocessing is idempotent + - Test cases: integration/worker/idempotency.test.ts (inserts a snapshot, runs worker, asserts tables, re-runs + worker) +- STORY-API-001 (profile summary endpoint) + - Acceptance: GET /profile/summary/:namecode returns summary in <200ms p95 (staging) + - Test cases: e2e/profile_summary.test.ts; performance test scenario in k6 + +Acceptance test design + +- Format each acceptance case with Given/When/Then and an automated test that can be executed in CI or during staging + validation. +- Include negative tests (bad payloads, malformed JSON, duplicate payloads) to assert expected error responses and safe + behavior. + +Test case repository & management + +- Keep automated tests alongside code (monorepo) under /tests/ with clear naming: + - tests/unit/** + - tests/integration/** + - tests/e2e/** + - scripts/perf/** for load tests +- Consider a lightweight test management spreadsheet or a GitHub Project board to track manual test cases and QA + progress. + +--- + +### 13.4 CI test automation (gating, required checks) + +CI gating policy + +- All pull requests must pass required CI checks before merge into main. 
Required checks include: + - Linting (ESLint, Prettier) + - Type checking (TypeScript tsc) + - Unit tests (fast) + - Integration tests against ephemeral DB (target lightweight subset for PRs) + - Migration preflight (apply migrations to ephemeral DB and rollback if possible) + - Security scans: dependency audit (npm audit / Snyk) and secret scanning + - Code quality checks (optional): coverage guard, static analysis +- Merge block: protect main branch with required checks enforced by GitHub branch protection rules. + +Pipeline stages & examples + +- PR / Push pipeline: + 1. Install deps (pnpm install --frozen-lockfile) + 2. Lint, typecheck + 3. Unit tests + 4. Quick integration tests (single worker + test DB) + 5. Report coverage and test results +- Merge-to-main pipeline: + 1. Full integration suite (longer) + 2. E2E smoke tests against staging (or ephemeral staging) + 3. Build and publish container images to GHCR (with tags) + 4. Run migration preflight job (simulate or run migrations in disposable DB) +- Pre-release pipeline: + - Run performance tests (k6) against a staging deployment and generate performance report + - Run dependency scans and SCA checks +- Production deployment: + - Manual approval required for DB migrations and production deploy (GitHub Environment approval) + - Post-deploy smoke tests run automatically + +Gating specifics for migrations + +- Migrations must include up and down where reasonable. +- CI must run migration preflight (apply to fresh DB) and run a small smoke ETL to verify compatibility. +- Production migrations require manual approval and a pre-deploy backup step. + +Test flakiness management + +- Detect flaky tests via CI (re-run once automatically if intermittent, but flag and require fix). +- Maintain a flakiness dashboard and tag flaky tests for prioritization. + +Test artifacts & reporting + +- Upload test logs, failing request payloads, and sample DB dumps as CI artifacts on failures. +- Report summary: pass/fail, coverage percentage, test durations, and a link to failing logs. + +--- + +### 13.5 QA sign-off process + +Purpose + +- Define minimum criteria and process for QA/PO sign-off before a feature is considered releasable to production. + +Sign-off prerequisites + +- All required CI checks passed (lint, unit, integration, migration preflight). +- E2E smoke tests in staging passed. +- Performance tests for critical paths executed with results meeting NFRs in staging (or a documented known limitation + with mitigation). +- Security & SCA scans no critical vulnerabilities (or documented exception with mitigation). +- Documentation updated: README, docs/DB_MIGRATIONS.md, docs/ETL_AND_WORKER.md and user-facing docs if relevant. +- Runbooks/operational docs updated for any operational impacts (migration, retention policy changes, large-scale + backfills). +- Migration & backup validated: a pre-migration backup exists (or provider snapshot), and rollback plan documented. + +Sign-off actors & responsibilities + +- QA Lead: + - Executes and verifies acceptance tests in staging. + - Confirms regression checklist and documents any open non-blocking issues. +- Technical Lead / Engineering Lead: + - Reviews code changes and signs off on architecture and migration implications. +- Product Owner: + - Confirms feature behavior meets product requirements and acceptance criteria. +- Security Officer (for releases that touch sensitive flows): + - Reviews security findings and approves release if no critical risk exists. 
+- DevOps / SRE: + - Confirms required infrastructure and backup readiness and approves migration window. + +Sign-off checklist (example) + +- [ ] All required CI checks passed +- [ ] Migration preflight executed and backup taken +- [ ] E2E smoke tests passed in staging +- [ ] Performance targets validated or exception documented +- [ ] Documentation & runbooks updated +- [ ] Release notes drafted +- [ ] PO, TL, QA and SRE approvals recorded (names + timestamp) + +Recording sign-off + +- Use PR approvals (GitHub) plus a release checklist issue that includes approvals and links to test results and logs. +- For production migrations, require GitHub Environment approval and record approver(s) in the workflow run. + +Release / roll-out steps after sign-off + +- Promote image to production with manual approval. +- Apply production migrations via protected workflow (db-bootstrap.yml). +- Run post-deploy smoke tests and monitor dashboards for at least defined post-deploy window (e.g., 60–120 minutes). +- If issues, follow rollback plan and file an incident. + +Post-release validation & retrospective + +- After release, QA runs a small regression suite and monitors for 24–72 hours depending on impact. +- Hold a short retrospective to capture lessons learned and update tests/runbooks accordingly. + +--- + +## 14. Rollout & Release Plan + +This section defines the staged release process for features and schema changes, the controlled rollout strategy using +feature flags, rollback rules and criteria, communication plans for internal and external stakeholders, and the +post‑release monitoring checklist. The goal is a safe, observable, and auditable path from development to general +availability. + +--- + +### 14.1 Release phases (alpha → beta → canary → general) + +Define clear phases with entry/exit criteria so teams know when to progress a change. For DB schema changes the process +is stricter (preflight, backup, manual approval). + +Phase: Alpha (internal) + +- Audience: core developers, internal testers, trusted community contributors. +- Scope: early engineering validation of feature and schema changes; may contain telemetry instrumentation and debug + logs. +- Duration: short (days). +- Criteria to enter: + - Implemented feature with unit tests and integration tests passing locally. + - Migration preflight ran successfully on disposable DB. +- Exit criteria: + - No critical functional bugs in alpha test cases. + - Basic ETL smoke test passes (ingest → ETL → summary). + - Observability metrics created and collected (APM, metrics, logs). +- Controls: + - Feature flag default = off for all users except whitelisted test accounts. + - Enable debug logging for alpha actors only. + +Phase: Beta (broader, opt-in) + +- Audience: larger QA group, select community beta testers. +- Scope: more real-world validation, UX polish, performance profiling. +- Duration: 1–2+ weeks depending on risk. +- Criteria to enter: + - All unit & integration tests pass in CI. + - E2E smoke tests on staging pass. + - Performance test baseline executed and acceptable. +- Exit criteria: + - No P0/P1 regressions for a defined observation window (e.g., 48–72 hours). + - Telemetry shows stable error and latency rates. +- Controls: + - Feature flag rollout to a controlled audience (list of namecodes or Discord guilds). + - Beta telemetry and user feedback channels enabled. + +Phase: Canary (small percentage of production traffic) + +- Audience: a small fraction of production traffic/users. 
+- Scope: production validation under realistic load, confirm migration effects and scaling. +- Duration: staged, multiple steps (see canary steps below). +- Criteria to enter: + - Production preflight: manual approval, backup taken, migrations validated in staging. + - Deployment artifacts built and promoted to a canary tag. +- Canary steps (recommended): + 1. Deploy to 1% of traffic (or 1–5 users/entities depending on user cardinality) for 1–2 hours; monitor. + 2. If stable, increase to 5% for 2–4 hours; monitor. + 3. If stable, increase to 25% for 6–12 hours; monitor. + 4. If stable, increase to 100% (general rollout). +- Exit criteria: + - No critical errors or unacceptable metric regressions during each step (see thresholds in monitoring checklist). +- Controls: + - Use feature flags and routing (load balancer / gateway) to limit traffic. + - Have kill-switch procedure and runbook ready. + +Phase: General Availability (GA) + +- Audience: all users. +- Scope: fully enabled feature and final cleanup (remove temporary flags, adjust logging). +- Criteria to enter: + - Canary completed successfully and product owner + engineering lead sign-off. + - Migration and data changes validated in production; backup retention confirmed. +- Post-GA: + - Monitor for at least 24–72 hours with heightened observability. + - Plan cleanup: remove old feature flags, debug logging, and alpha-only instrumentation. + +Special handling: Database Migrations + +- DB migrations require an additional safety path: + - Migration preflight in CI & staging. + - Full backup/snapshot taken immediately before applying to production. + - Production migrations run via manual GitHub Action with environment approval. + - Prefer phased migrations: add columns nullable → backfill asynchronously → set NOT NULL later. +- If migration is destructive (drop/rename), require extended Canary and fallback plan including point-in-time restore + readiness. + +--- + +### 14.2 Feature flags & controlled rollout strategy + +Use feature flags to decouple deploy from launch and enable safe incremental rollouts. + +Flag types + +- Boolean flags: simple on/off for a feature. +- Percentage rollout flags: allow rolling out to X% of users. +- Targeted flags: enable for specific users, guilds, or environments (whitelists). +- Ops flags: control operational behavior (worker concurrency, ETL throttling). + +Storage & implementation + +- Store flags in feature_flags table (see DB model) with fields: name, enabled, rollout_percentage, data JSONB. +- Provide a lightweight SDK or helper in the backend to evaluate flags (deterministic hashing by user_id/namecode). +- Cache flags in Redis for fast evaluation with a short TTL; invalidate on change. + +Rollout patterns + +- Canary via flags: enable for a deterministic subset of users (hash-based) to ensure reproducibility. +- Guild-based pilot: enable for specific Discord guild IDs for community pilots. +- Manual whitelists: use for alpha/beta testers identified by namecode or user_id. +- Progressive percent rollout: + - Start at 0% → 1% → 5% → 25% → 100%, monitor between steps. + - Use automated gates: promote to next step only if monitoring checks pass. + +Kill-switch and emergency rollback + +- Always include a kill-switch flag that immediately disables the new feature or routes behavior to safe default. +- Flags should be actionable from admin UI and also via infra (direct DB update + cache invalidation) for emergency use. 
+- Document the exact steps to flip flags and confirm effect (e.g., clear local caches, restart nodes if needed). + +Telemetry & validation + +- Emit feature flag evaluation metrics (evaluations, enabled_count, latency) and track adoption. +- Attach correlation IDs to events produced while the flag is enabled for traceability. + +Governance + +- Require PRs that add feature flags to include a short plan: rollout steps, metrics to watch, rollback criteria, and + owner (engineer + PO). + +--- + +### 14.3 Rollback strategy and criteria + +Define clear, fast, and safe rules for rolling back both code and data changes. + +Rollback triggers (criteria) + +- Functional trigger: a P0 user-facing outage or feature-caused data corruption detected. +- Performance trigger: ETL failure rate or API error rate exceeds pre-defined thresholds (e.g., ETL failure rate > 1% + sustained for 5 min, p95 API latency > 2× baseline). +- Data integrity trigger: evidence of incorrect writes, FK violations or lost records traceable to a new change. +- Security trigger: any suspected data leak or credential exposure. + +Rollback types & steps + +A. Code-only rollback (safe, quick) + +- When to use: + - New service container causes errors but DB schema is unchanged. +- Steps: + 1. Flip feature flag to disable the feature (kill-switch). If that resolves issue, continue rollback verification. + 2. If kill-switch insufficient: deploy prior stable image (immutable tag) to replace the new release (canary or + full). + 3. Monitor health and metrics. + 4. Postmortem and root cause analysis. + +B. App + small reversible migration rollback + +- When to use: + - Migration added a non-destructive column or index and is reversible via down migration. +- Steps: + 1. Stop workers if writes could be inconsistent. + 2. Deploy previous application version. + 3. Run migration down if reversible and safe. + 4. Validate data integrity and resume workers. + 5. If migration down is risky, restore from backup instead (see C). + +C. Destructive migration / data rollback (complex) + +- When to use: + - Migration dropped or transformed data, or produced corruption; down migration not feasible. +- Steps: + 1. Stop ingestion and workers immediately to prevent further writes. + 2. Restore DB from pre‑migration backup or use point‑in‑time recovery (PITR) to restore to a time before the change. + 3. Apply tested migration path or compensating scripts against restored DB in an isolated environment first. + 4. Promote restored DB to production after validation or perform carefully orchestrated in-place corrective + migration. + 5. Re-run ETL/batch jobs if needed to repopulate derived tables. + 6. Communicate incident and data impact per communication plan. +- Notes: + - Always take a fresh backup prior to running any production migration; record the backup id and retention. + +Operational considerations + +- Always prefer feature-flag-based disablement before full rollback where possible (least impact). +- Maintain a changelog mapping release → migration id(s) → backup snapshot id used before migration. +- For rollback requiring DB restore, allocate a maintenance window and coordinate stakeholders. + +Verification after rollback + +- Run smoke tests: GET /api/v1/health, sample ingest & ETL flow, validate critical queries. +- Compare key metrics (error rate, latency) to baseline and ensure stability before resuming normal operations. 
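
As a concrete illustration of the verification step, a post-rollback smoke check could look like the sketch below. It
assumes Node 18+ (built-in fetch), the pg client, an API_BASE_URL environment variable, and the endpoint/table names
used elsewhere in this PRD:

```ts
// smoke-check.ts: run after a rollback (or deploy) to confirm the basics.
import { Client } from 'pg';

async function main(): Promise<void> {
  const api = process.env.API_BASE_URL ?? 'http://localhost:3000';

  // 1. Health endpoint must return 200.
  const health = await fetch(`${api}/api/v1/health`);
  if (!health.ok) throw new Error(`health check failed: ${health.status}`);

  // 2. DB connectivity: the lightest possible query.
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  await db.query('SELECT 1');

  // 3. A critical read path: at least one profile summary should be readable.
  const sample = await db.query('SELECT user_id, updated_at FROM profile_summary LIMIT 1');
  if (sample.rows.length === 0) console.warn('no summaries yet; verify ETL is running');

  await db.end();
  console.log('smoke checks passed');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```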

---

### 14.4 Communication plan (users, stakeholders)

Clear, timely communication is crucial for releases and incidents. Use the templates and channels below.

Stakeholders & channels

- Internal stakeholders:
  - Engineering team (Slack #engineering)
  - SRE/DevOps (Slack #ops-alerts)
  - Product (email/Slack)
  - QA (Slack #qa)
- External stakeholders:
  - Maintainers / beta testers (Discord private channel)
  - Public users (status page, Discord announcements)
- Incident escalation:
  - PagerDuty for P0 pages
  - A per-incident Slack channel (`#incident-<id>`) for coordination

Pre-release communication

- For releases that may impact users (schema changes, potential downtime):
  - Announce at least 48 hours in advance to stakeholders and community (Discord + status page).
  - Provide release notes including expected impact, maintenance window, rollback plan, and contact info.
  - Share required action items for the community (e.g., "do not run bulk fetches between 02:00–04:00 UTC").

Release-day communication

- Before deployment:
  - Post a short "deploy starting" message to #ops-alerts and the status page.
- During deployment:
  - Post progress updates if the deploy is long (>10–15 minutes) or if human approvals are required.
- After deployment:
  - Post a completion message with a summary and links to the release notes and monitoring dashboard.

Post-release & incident communication

- For incidents:
  - Initial message: short description, scope, ETA for the next update.
  - Updates: every 15–30 minutes until mitigated.
  - Resolution message: summary of cause, actions taken, and next steps.
  - Postmortem: publish within the agreed SLA (e.g., 3 business days) with RCA and remediation.
- External user messages:
  - Use the status page for system-wide issues, and Discord for community-specific messages.
  - Keep external communications factual and avoid technical jargon; include mitigation steps and timelines.

Templates (examples)

- Release announcement (short)
  - Title: `Release: Player Profile & DB Foundation — vX.Y — Scheduled <date>`
  - Body: summary, affected features, maintenance window, expected impact, contact
- Incident update
  - Title: `Incident <id>: <short summary> — Update #n`
  - Body: summary of the observed issue, impact, actions taken, ETA, contact point

Documentation & release notes

- Publish release notes per release in GitHub Releases and link to docs/CHANGELOG.md.
- Update docs/DB_MIGRATIONS.md and docs/ETL_AND_WORKER.md when relevant.

---

### 14.5 Post-release monitoring checklist

A concise, actionable checklist to run immediately after a release/canary promotion. Each item should be validated
within the post-deploy observation window (first 1–2 hours, then periodically during the first 24–72 hours).

Immediate smoke checks (0–15 minutes)

- [ ] /health returns 200 from each service replica.
- [ ] DB connectivity check: run a light query (SELECT 1) and confirm the response.
- [ ] Verify the latest migrations applied and the migration id is recorded in the schema_migrations table.
- [ ] Check worker heartbeats and ensure workers are processing queue items.
- [ ] Run a manual sample end-to-end test: ingest a sample snapshot and verify profile_summary is populated.

Metrics & alerts (0–60 minutes)

- [ ] ETL failure rate: confirm it is below the configured threshold (e.g., <1%); investigate spikes.
- [ ] Queue depth: confirm it is within the expected range and the draining rate is healthy.
- [ ] API latencies: p95 and p99 within expected targets; errors per minute near baseline.
+- [ ] DB metrics: connections < threshold, replication lag (if any) acceptable, no long-running locks. +- [ ] Error tracking: Sentry error count not spiking; new error types triaged. + +Logs & traces (0–60 minutes) + +- [ ] Inspect logs for repeated error patterns related to new code/migration. +- [ ] Search for any warnings about schema mismatches or unhandled JSON shapes. +- [ ] Verify correlation ids in traces for a sample request flow. + +Data integrity & validation (0–24 hours) + +- [ ] Sample data verification: compare key fields between raw snapshot and normalized tables for a set of sample users. +- [ ] Confirm no duplicate snapshots inserted unexpectedly (validate content_hash duplicates). +- [ ] Validate expected counts in materialized views or aggregates (if backfill executed). + +Operational & governance checks (0–24 hours) + +- [ ] Backups recorded: confirm successful backup/snapshot taken just prior to production migration. +- [ ] Feature flags: verify they’re in expected state; confirm ability to flip flags quickly if needed. +- [ ] Approvers on standby: confirm contact persons available in case rollback required. + +User-facing verification (0–72 hours) + +- [ ] Monitor support channels (Discord) for user reports and triage quickly. +- [ ] Validate a handful of reported user flows (bot commands) succeed. + +Post-release follow-up (within 72 hours) + +- [ ] Compile release metrics summary and circulate to stakeholders. +- [ ] Open tickets for any improvements, follow-ups or cleanup tasks (remove temporary flags, reduce debug logging). +- [ ] Schedule a short retrospective to capture lessons learned and action items. + +Automation & dashboards + +- Provide a release dashboard that aggregates the key monitoring signals (ETL, queue depth, API errors, DB metrics) for + easy at-a-glance verification. +- Configure automated gating: promote canary to next step only if all gate checks pass (automated checks + manual + approval). + +--- + +## 15. Migration & Backfill Plan + +This section describes the planned approach to apply schema migrations and to backfill historical/profile data into the +new normalized schema. It covers the migration/backfill strategy, risk & impact assessment, detailed migration steps +with rollback guidance, and dry‑run and validation checks to prove correctness before and after production runs. + +--- + +### 15.1 Data migration approach (backfill strategy) + +Goal + +- Safely evolve the database schema and populate normalized tables (users, user_troops, user_pets, user_artifacts, + user_teams, user_profile_summary, etc.) from existing raw snapshots, minimizing downtime and risk while preserving + full auditability. + +Principles + +- Non‑destructive first: prefer additive, reversible schema changes. Avoid blocking ALTERs that acquire long exclusive + locks. +- Idempotence: backfill and ETL upserts are idempotent — reprocessing the same source should not create duplicates. +- Small transactions: perform backfill in small batches per user or per snapshot to limit contention and expedite + recovery. +- Auditability: record backfill progress, source snapshot ids, and checksums so every transformed row can be traced back + to the original snapshot. +- Observable: emit metrics for backfill progress, throughput, errors and slowest operations. +- Throttled & controlled: support concurrency limits and backpressure to protect primary DB and upstream providers. + +Backfill sources + +- Primary source: existing hero_snapshots table containing raw JSONB (preferred). 
  Backfill reads hero_snapshots rows (either the latest snapshot per user, or historical snapshots per the retention
  strategy).
- Secondary source (if hero_snapshots is incomplete): ingest additional JSON files from archives (S3) or re-run fetch
  scripts where upstream access is available.
- Catalog seeds: ensure troop_catalog, pet_catalog and artifact_catalog are seeded before backfill so FK-based upserts
  succeed (or use placeholder catalogs with deferred resolution).

Backfill modes

- Incremental / live backfill (recommended for large datasets):
  - Process newest snapshots first (most likely to be accessed), then older ones.
  - Use a per-user incremental approach: upsert current entity rows and mark the snapshot as backfilled.
  - Keep the ingestion and worker pipeline running; new snapshots continue to be processed normally.
- Bulk backfill (for initial fill or a one-off full rebuild):
  - Run in a controlled window on a dedicated worker fleet with concurrency limits.
  - Use read replicas or a maintenance replica if the provider supports it (avoid impacting the primary).
  - Consider restoring a backup into a dedicated backfill cluster, performing the transformation there, then importing
    the results into the primary DB if a low-impact rollout is required.
- Hybrid:
  - Perform the initial bulk backfill over history in an offline prepared environment, then merge incremental changes
    made in production by reprocessing recent snapshots.

Batching & parallelism

- Batch granularity: process per user or per snapshot in batches of configurable size (e.g., 100–1000 user snapshots
  per job depending on complexity).
- Parallelism: use a worker pool sized to DB capacity; autoscale based on queue depth and target DB connection limits.
- Rate control: pause or reduce concurrency when DB metrics (CPU, connections, locks) exceed thresholds.

Data transformation approach

- Use the ETL worker logic already designed for snapshot processing to perform the backfill; reuse the same mapping
  rules to guarantee parity between real-time ETL and backfill.
- For each snapshot:
  - Parse the raw JSON and extract user-level facts (namecode, user metadata).
  - Upsert the user row (ON CONFLICT by namecode or unique external id).
  - Upsert user_troops, user_pets and user_artifacts with ON CONFLICT DO UPDATE.
  - Update or create user_profile_summary with denormalized quick-read fields.
  - Record the mapping: write a backfill_audit (or etl_audit) row with snapshot_id, processed_by (backfill_job_id),
    processed_at and status.
- Preserve unknown fields in extra JSONB columns to avoid data loss.

Progress tracking & resume

- Use a backfill_jobs table:
  - id, started_at, completed_at, job_state, total_snapshots, processed_count, error_count, config (batch_size,
    concurrency), owner
- For each processed snapshot, record job_id and processed_at in hero_snapshots (or the backfill_audit table) to allow
  restarts from the last processed id.
- Support resumable jobs: if a job stops, it can resume from the last processed snapshot id for that job, or process
  only snapshots with processed_at IS NULL.

Cost & time estimation

- Estimate the average per-snapshot processing time from a sample and use it to forecast the total:
  `total backfill time ≈ avg_time * number_of_snapshots / concurrency`.
- Include overhead for downtime windows and provider constraints. Provide a budget estimate for compute, DB IOPS and S3
  egress if reading from archived storage.

---

### 15.2 Risk & impact assessment

Summary of key risks and mitigations

1.
Risk: Long-running migrations locking tables and blocking production traffic + - Impact: degraded API response or downtime. + - Mitigations: + - Avoid exclusive locks; use non-blocking patterns (add columns NULLable, backfill, convert). + - Use CREATE INDEX CONCURRENTLY for large indexes. + - Schedule high-impact changes in maintenance windows. + - Run migration preflight in staging and estimate index build times. + +2. Risk: ETL/backfill job overload causing DB connection exhaustion or lock contention + - Impact: production slowdowns or failures. + - Mitigations: + - Throttle worker concurrency and use connection pooling (pgbouncer). + - Use small batches and per-entity transactions. + - Monitor DB metrics and pause backfill if thresholds exceeded. + +3. Risk: Data corruption or incorrect mapping during backfill + - Impact: incorrect normalized state and downstream errors. + - Mitigations: + - Run dry-run and checksum comparisons in staging. + - Use golden-record tests on sample datasets and compare normalized outputs to expected values. + - Preserve raw snapshots and unmapped fields; write backfill_audit entries for traceability. + - Backfill into separate schema or branch before merge if high risk (validate then merge). + +4. Risk: Duplicates or inconsistent upserts due to non-idempotent logic + - Impact: duplicate rows or inconsistent aggregates. + - Mitigations: + - Use robust unique constraints and ON CONFLICT upserts keyed by (user_id, troop_id). + - Ensure worker is idempotent and uses snapshot_id-based audit tags. + +5. Risk: Storage & cost spike (DB size, S3, IO) + - Impact: billing surprises and throttling. + - Mitigations: + - Estimate storage needs; apply retention & archival policies to older snapshots. + - Use economical instance sizing and scale out/in during backfill. + - Monitor billing and set budget alerts. + +6. Risk: Extension/privilege limitations (CREATE EXTENSION denied) + - Impact: migrations failing in the provider environment. + - Mitigations: + - Have fallback code paths (generate UUIDs app-side). + - Document required provider permissions and request elevated privileges via ops ticket prior to migration. + +7. Risk: Upstream format changes mid-backfill + - Impact: mapping logic break or partial failures. + - Mitigations: + - Preserve raw JSON; mark failures and quarantine snapshots for manual review. + - Implement schema versioning field in hero_snapshots.raw (if upstream provides no version). + +Stakeholder impact matrix + +- Players (end users): minimal for read-only backfill; potential transient slowdowns — communicate maintenance windows. +- Bot operators: may see delays in summary availability during heavy backfill — provide ETA & status messages. +- Analytics team: will gain access to normalized data after backfill; may need a schedule for exports. +- Dev/DevOps: responsible for monitoring and rolling back if needed. + +Acceptance criteria for backfill run + +- All processed snapshots have backfill_audit entries and processed_at timestamps. +- Normalized counts for a sampled subset match expected derived values from raw snapshots. +- ETL failure rate under acceptable threshold (e.g., <1% with actionable failures recorded). +- No significant production latency regressions observed while backfill runs. + +--- + +### 15.3 Migration steps & rollback plan + +Pre-migration checklist (must be completed before any production migration/backfill) + +- Review & approve migration plan with Product and Engineering lead. 
+- Ensure database backups / snapshots are available and verify restore test result id. +- Determine maintenance window if required and notify stakeholders at least 48 hours prior. +- Ensure all required secrets and environment approvals present in GitHub Actions environment. +- Run migration preflight on a staging environment and validate migration up/down if feasible. +- Seed catalog tables (troop_catalog, pet_catalog, artifact_catalog) and verify referential integrity. +- Ensure runbooks and on-call staff are available during migration window. + +Production migration & backfill steps (example safe procedure) + +1. Preflight & Backup + - Take a full DB snapshot/backup and record backup id. + - Run migration preflight checks in CI/staging to ensure no immediate issues. + - Validate required extensions and privileges. + +2. Apply Additive Migrations (DDL) + - Apply non-blocking migrations via node-pg-migrate using the manual GitHub Action (db-bootstrap.yml) with + environment approval. + - Examples: + - CREATE EXTENSION IF NOT EXISTS pgcrypto; (separate migration) + - CREATE TABLE hero_snapshots (...); + - CREATE TABLE user_troops (...) (add indexes with CONCURRENTLY if large) + - Add user_profile_summary table + - Run post-migration sanity checks (expected tables exist, sample queries succeed). + +3. Seed catalogs + - Insert required catalog rows using idempotent seeds. + +4. Small-scale smoke backfill (canary) + - Run backfill on a small subset (e.g., 100 recent snapshots or selected test accounts). + - Validate mapping, idempotency and impact on DB metrics. If problems found, abort and fix. + +5. Progressive full backfill + - Start incremental backfill jobs with conservative concurrency. + - Monitor metrics and adjust concurrency. + - Use job-level checkpoints and backfill_audit table to resume if stopped. + +6. Post-backfill cleanup & verification + - Verify sample data and aggregates against expected values (see validation queries below). + - Remove temporary columns/flags in subsequent migrations only after verification. + - Update monitoring & dashboards to use normalized tables for production reads where applicable. + +Rollback plan + +A. If non-critical issue discovered (data mapping bug or isolated failures) + +- Pause backfill jobs. +- Fix ETL mapping or seed catalog and re-run backfill for affected snapshot id(s) via admin reprocess endpoint. +- No DB restore required if issue limited and fix is idempotent. + +B. If production performance degraded (DB connection exhaustion / high locks) + +- Pause backfill workers immediately (disable worker autoscaling or set concurrency to 0). +- If performance not restored: + - Revert application code to previous container image (code rollback). + - If migration caused the issue (e.g., index creation), consider reverting the migration if reversible or restoring + DB from backup if destructive. +- Resume normal operations and re-run backfill at reduced concurrency. + +C. If data corruption / destructive migration failure + +- Immediately stop ingestion and workers. +- Restore DB from the pre-migration backup / point-in-time restore to the last known good state. +- Re-evaluate migration plan, apply safe migration sequence (e.g., add columns, backfill then drop). +- Re-run backfill on restored DB as needed. + +Post-rollback actions + +- Conduct incident review and root cause analysis. +- Update mapping/tests to prevent recurrence. +- Communicate impact and remediation to stakeholders. 
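
To make steps 4–5 of the procedure above concrete, here is a hedged sketch of a resumable, throttled backfill loop. It
assumes the hero_snapshots and backfill_jobs columns described in §15.1 (processed_at, backfill_job_id,
processed_count, job_state) plus a created_at timestamp, and imports processSnapshot() as a stand-in for the existing
idempotent ETL mapping:

```ts
// backfill.ts: resumable, throttled backfill loop (error quarantine omitted for brevity).
import { Pool } from 'pg';
import { processSnapshot } from './etl'; // assumed module: the existing ETL mapping

const BATCH_SIZE = 100;        // snapshots per batch; tune during the dry-run
const MAX_DB_CONNECTIONS = 20; // pause threshold to protect the primary

const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 5 });

async function dbUnderPressure(): Promise<boolean> {
  const res = await pool.query('SELECT count(*)::int AS n FROM pg_stat_activity');
  return res.rows[0].n > MAX_DB_CONNECTIONS;
}

export async function runBackfill(jobId: string): Promise<void> {
  for (;;) {
    // Backpressure: pause instead of competing with production traffic.
    while (await dbUnderPressure()) await new Promise((r) => setTimeout(r, 30_000));

    // Claim the next batch of unprocessed snapshots, newest first.
    const batch = await pool.query(
      `SELECT id, raw FROM hero_snapshots
       WHERE processed_at IS NULL
       ORDER BY created_at DESC
       LIMIT $1`,
      [BATCH_SIZE]
    );
    if (batch.rows.length === 0) break; // nothing left; mark the job complete below

    for (const row of batch.rows) {
      await processSnapshot(row.id, row.raw); // idempotent upserts
      await pool.query(
        'UPDATE hero_snapshots SET processed_at = now(), backfill_job_id = $2 WHERE id = $1',
        [row.id, jobId]
      );
      await pool.query(
        'UPDATE backfill_jobs SET processed_count = processed_count + 1 WHERE id = $1',
        [jobId]
      );
    }
  }
  await pool.query(
    `UPDATE backfill_jobs SET completed_at = now(), job_state = 'done' WHERE id = $1`,
    [jobId]
  );
}
```

Because progress is checkpointed per snapshot, stopping the job (manually or via the pressure check) loses at most the
in-flight row, and a restart simply resumes on processed_at IS NULL.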

Permissions & approvals

- Require at least two approvers for production migrations: Engineering Lead + Product (or SRE).
- Ensure someone with DB admin permissions is on call during migration windows.

---

### 15.4 Dry-run and validation checks

Dry-run goals

- Prove the backfill process and mapping logic on a safe dataset and environment.
- Measure performance characteristics (per-snapshot processing time, DB impact).
- Validate idempotency and correctness of transforms.

Dry-run environments

- Use a staging environment running the same migration code and a representative subset of production snapshot data
  (anonymized).
- Optionally restore a recent production backup into an isolated environment for a full-scale dry-run if resources
  permit.

Dry-run steps

1. Select a representative dataset
   - Choose small sets: (a) 100 recent snapshots, (b) 100 large login snapshots (2–3 MB), (c) a few malformed/edge
     snapshots.
   - Optionally include a small random historical sample to catch regressions.

2. Run the backfill in staging using the exact code & worker config planned for production.
   - Use the same migration versions and config (batch_size, concurrency).
   - Enable verbose logging and extra metrics during the dry-run.

3. Validate correctness & idempotency
   - For each processed snapshot in the sample, run the validation queries below.
   - Re-run the backfill for the same snapshots and assert no duplicate rows / the same resulting normalized state.

4. Performance & resource profiling
   - Monitor DB CPU, memory, connection counts, locks and I/O; measure per-snapshot average time and memory usage.
   - Tune batch_size and concurrency accordingly.

Validation checks (automated)

- Schema & migration checks
  - Confirm all expected tables and indexes exist.
  - Validate the migration version recorded in the migrations table.

- Row-level validation (examples; `<job_id>`, `<snapshot_id>` and `<sample_ids>` are placeholders)
  - Snapshot count processed:
    - `SELECT count(*) FROM hero_snapshots WHERE processed_at IS NOT NULL AND backfill_job_id = '<job_id>';`
  - User mapping verification (for a sample snapshot id, compare the namecode in the raw JSON to users.namecode):
    - `SELECT raw ->> 'NameCode' AS namecode_raw FROM hero_snapshots WHERE id = '<snapshot_id>';`
    - `SELECT namecode FROM users WHERE id = (SELECT user_id FROM hero_snapshots WHERE id = '<snapshot_id>');`
  - Troop counts parity:
    - Parse the raw snapshot troops and compare the aggregated sums with user_troops for that user (pseudocode):
      - From raw: sum Amount for TroopId = X
      - From normalized: `SELECT amount FROM user_troops WHERE user_id = ... AND troop_id = X`
  - Check uniqueness constraints:
    - `SELECT user_id, troop_id, count(*) FROM user_troops GROUP BY user_id, troop_id HAVING count(*) > 1;`
    - Expect zero rows.
  - Validate profile_summary correctness:
    - For sampled users, compute top troops from user_troops and compare to user_profile_summary.top_troops JSON.
  - Check etl_errors:
    - `SELECT * FROM etl_errors WHERE snapshot_id IN (<sample_ids>);` — ensure no unexpected errors remain.

- Idempotency test
  - Reprocess the same sample snapshots and assert:
    - processed_at is updated (or unchanged, if the logic preserves the timestamp), but the normalized rows are
      identical (check checksums).
    - No rows are duplicated.
    - The etl_errors count is unchanged, or increases only for genuinely new failures.

- Consistency checks
  - Referential integrity: no user_troops with user_id NULL.
+ - SELECT count(*) FROM user_troops WHERE user_id IS NULL; EXPECT 0 + - Catalog foreign key checks (if FK present): + - SELECT ut.* FROM user_troops ut LEFT JOIN troop_catalog tc ON ut.troop_id = tc.id WHERE tc.id IS NULL LIMIT + 10; + +- Performance acceptance + - Per-snapshot avg processing time and P95 within planned target for staging and adjusted for prod. + - DB connection usage and CPU consumption below alert thresholds during backfill. + +Automated regression comparison + +- Keep golden output for sample snapshots and compare output JSON or joined SQL results using a diff tool or automated + test script. +- Store validation reports for each dry-run as CI artifacts. + +Pre-production checklist (pass required to run production backfill) + +- All dry-run validation checks pass for representative dataset. +- Backups verified and restore tested. +- Capacity checks completed and concurrency limits configured. +- Observability dashboards and alerts in place. +- Approvals recorded (Engineering Lead and PO or SRE). + +Post-backfill validation (production) + +- Run a reduced set of validation queries (sampling) immediately after backfill completes. +- Monitor metrics for 24–72 hours to detect delayed regressions. +- Keep backfill job logs as artifacts and ensure etl_errors are triaged. + +--- + +## 16. Security & Compliance Details + +This section summarizes the security posture, controls and operational policies for the Player Profile & DB Foundation +project. It focuses on threat modelling, sensitive data handling, required testing and audits, evidence and logs +retention policies, and a secrets rotation policy. These items should be used as inputs to the security review and to +the operational runbooks. + +--- + +### 16.1 Threat model summary + +Scope + +- Assets in scope: + - Raw player profile snapshots (hero_snapshots JSONB) that may contain PII or tokens. + - Normalized user data (users, user_troops, user_profile_summary). + - Infrastructure credentials (DATABASE_URL, REDIS_URL, cloud keys). + - CI/CD pipelines, GitHub Actions secrets and container images in GHCR. + - Discord bot tokens and any linked OAuth tokens. + - Backups and archived snapshots stored in S3. + +Key threats + +1. Credential leakage + - Cause: accidental commits, misconfigured CI logs, compromised GitHub secrets, or leaked service account keys. + - Impact: unauthorized DB or cloud access, data exfiltration, impersonation of bot. + +2. Data exfiltration / unauthorized access + - Cause: compromised application server, weak RBAC, exposed DB ports, misconfigured S3 buckets. + - Impact: PII or sensitive tokens exposed to external parties. + +3. Injection / data-driven attacks + - Cause: unvalidated input, direct JSON injection into SQL or unsafe query building. + - Impact: data corruption, privilege escalation, SQL injection. + +4. Supply-chain / dependency compromise + - Cause: malicious NPM package or exploitable transitive dependency. + - Impact: remote code execution, exfiltrate secrets from CI or runtime. + +5. Abuse & DoS + - Cause: high-frequency snapshot ingest from many clients or upstream spike, or malicious bot commands. + - Impact: DB overload, worker OOM, elevated infra costs. + +6. Privilege misuse & insider risk + - Cause: overly-broad credentials, locally stored secrets, or un-audited admin actions. + - Impact: unauthorized changes, accidental data deletion or migration misapplication. + +7. Data leakage via logs/telemetry + - Cause: logging raw snapshots or tokens to Sentry / structured logs. 
+ - Impact: PII or tokens visible in logs retained widely. + +Mitigations / Controls (high level) + +- Least privilege: separate roles for migrations, app writes, reads, analytics. +- Secrets management: do not keep long-lived keys in code; use secrets manager / GitHub Secrets; avoid printing secrets + to logs. +- Network restrictions: restrict DB access to trusted hosts / VPCs and limit S3 access via IAM policies. +- Input validation & safe DB access: parameterized queries / prepared statements for all DB writes. +- ETL safeguards: parse JSON defensively, preserve raw snapshot for auditing, store unknown fields in extra JSONB, and + keep idempotent upserts. +- Rate limiting & quotas: throttle ingestion and bot commands; reject abusive traffic. +- Observability & alerting: monitor for abnormal activity (spikes in ingestion rate, high failure/etl error rates). +- Dependency management: enable Dependabot, run SCA and SAST checks in CI. +- Incident response: defined runbooks, PagerDuty escalation, and forensic playbooks for compromise scenarios. + +Threat modelling outputs to maintain + +- Asset inventory (what is stored and where) +- Data classification table (PII, Sensitive, Internal, Public) +- Attack surface map (APIs, bot gateway, CI, DB, S3) +- Risk register with probability/impact and mitigation owners + +--- + +### 16.2 Sensitive data handling checklist + +Use this checklist when designing features, accepting snapshots, or adding a new data store. Items marked "MUST" are +mandatory controls; "SHOULD" are strongly recommended. + +Data classification + +- [MUST] Define which fields in snapshots are PII or sensitive (emails, real names, device ids, auth tokens). Document + mapping in docs/DATA_PRIVACY.md. +- [MUST] Classify all tables and archived stores with data sensitivity labels. + +Ingest & storage controls + +- [MUST] Do not persist user plaintext passwords; login flows must keep credentials only in local client or ephemeral + memory. +- [MUST] Redact or remove auth tokens from hero_snapshots if not required; if stored, encrypt and mark as secrets in + metadata. +- [SHOULD] Normalize and store minimum PII required; avoid duplicating PII across tables. +- [MUST] Use JSONB storage for raw snapshots but mask sensitive fields on disk copies exported as artifacts. + +Access control + +- [MUST] Enforce RBAC for admin endpoints (reprocess, raw snapshot access, migrations). +- [MUST] Use dedicated DB roles for migrations, app writes and analytics reads. +- [SHOULD] Enforce context-aware access (e.g., only certain GitHub environments can run production migrations). + +Logs & telemetry + +- [MUST] Redact tokens, passwords, and PII from logs and Sentry events. Apply automated scrubbers where feasible. +- [SHOULD] Flag logs that may include identifiable fields and restrict access to operations staff. + +Backups & archives + +- [MUST] Encrypt backups and S3 objects (SSE). +- [MUST] Keep an audit trail for backups and restores (who triggered, when, and restoration id). +- [SHOULD] Apply access controls for archived snapshots; require elevated roles to retrieve raw archived data. + +Data subject rights & deletion + +- [MUST] Implement processes to locate and delete personal data on request (right-to-be-forgotten). This includes + deletion from DB and from S3 archives (or mark for purge and follow up). +- [SHOULD] Provide an API/admin path to request user data removal; document expected SLAs for erasure. 
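
A minimal sketch of the erasure path required above, under the assumption that deletion order follows the FK graph and
that every erasure is audited; data_deletion_log is a hypothetical audit table, and actual S3 archive purging would be
queued for a separate job:

```ts
// dsr-delete.ts: right-to-be-forgotten flow over the tables described in this PRD.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function eraseUser(namecode: string, requestedBy: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    const user = await client.query('SELECT id FROM users WHERE namecode = $1', [namecode]);
    if (user.rows.length === 0) throw new Error(`no user for namecode ${namecode}`);
    const userId: string = user.rows[0].id;

    // Child tables first (FK order), then raw snapshots, then the user row itself.
    for (const table of ['user_troops', 'user_pets', 'user_artifacts', 'user_teams']) {
      await client.query(`DELETE FROM ${table} WHERE user_id = $1`, [userId]);
    }
    await client.query('DELETE FROM user_profile_summary WHERE user_id = $1', [userId]);
    await client.query('DELETE FROM hero_snapshots WHERE user_id = $1', [userId]);
    await client.query('DELETE FROM users WHERE id = $1', [userId]);

    // Audit trail: what was deleted, when, and by whom (hypothetical table).
    await client.query(
      `INSERT INTO data_deletion_log (user_id, namecode, requested_by, deleted_at)
       VALUES ($1, $2, $3, now())`,
      [userId, namecode, requestedBy]
    );

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```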
+ +Transport & storage encryption + +- [MUST] TLS for all network communications (HTTPS, SSL for DB). +- [MUST] Rely on provider encryption for data at rest and, for extra-sensitive fields, apply application-level + encryption. + +Developer & CI hygiene + +- [MUST] Scan commits for accidental secrets (git-secrets, pre-commit hooks) and block commits with secrets. +- [MUST] Use ephemeral credentials in CI where possible (OIDC federation). Avoid storing long-lived JSON service keys in + repo. +- [SHOULD] Enforce protected branches, code review and signed commits for infrastructure/config changes. + +Incident handling + +- [MUST] On suspected exposure of sensitive data or secrets, follow SECRET_COMPROMISE runbook: rotate secrets, revoke + tokens, notify security lead, and perform forensic log capture. +- [MUST] Keep an incident log for any PII exposure with timeline and actions taken. + +Compliance & audit + +- [SHOULD] Maintain a mapping of sensitive fields to legal obligations (GDPR, CCPA). +- [MUST] Retain logs/audit trails long enough for investigations and compliance obligations (see next section). + +--- + +### 16.3 Penetration testing / audits required + +Scope & cadence + +- Initial security assessment: + - [REQUIRED] External penetration test (black-box) before major production release (GA). + - Scope: public APIs (/api/v1/*), admin endpoints, Discord bot public surface, OAuth flows, authentication & session + handling. +- Ongoing testing: + - [RECOMMENDED] Annual external pentest (or after major infra changes). + - [RECOMMENDED] Quarterly internal security review (dependency checks, SAST scans, config review). +- Triggered tests: + - [REQUIRED] Re-run pentest or focused retest after any significant infrastructure change that affects the public + attack surface (new public endpoint, major migration that exposes data in new ways). + - [REQUIRED] If a high-severity vulnerability is discovered in a dependency (critical CVE), perform a targeted + security audit. + +Penetration test scope items + +- External network perimeter: API ingress, rate limiting, WAF rules. +- Authentication & authorization: token issuance, scope checks, RBAC, API key lifecycle. +- Input validation and injection vectors: JSON handling, SQL injection, stored XSS in data consumed by admin UI. +- Data exposure: attempts to access raw hero_snapshots, backups or S3 artifacts without authorization. +- Business logic flaws: unauthorized reprocessing, duplication, or data overwrite. +- CI/CD & supply chain: checks on GitHub Actions secrets, packaging, GHCR permission configuration, and dependency + provenance. +- Social engineering / ops procedures: review of runbooks and approval gating to ensure no weak human-process vectors. + +Deliverables & remediation + +- Pen test report with findings categorized by severity (Critical, High, Medium, Low). +- Fix timelines: + - Critical: fix or mitigation within 24–72 hours (depending on exploitability). + - High: fix within 7 calendar days. + - Medium/Low: tracked and scheduled per roadmap. +- Post-remediation verification: targeted retest for critical & high issues. + +Audit log & compliance review + +- [RECOMMENDED] SOC2 readiness assessment if project intends to serve enterprise customers. +- [RECOMMENDED] Data Protection Impact Assessment (DPIA) if processing sensitive PII or large-scale profiling. +- Keep copies of pentest and audit reports in a secure internal docs area with controlled access. 
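
Complementing the [MUST] log-redaction items in §16.2, a `beforeSend` scrubber for Sentry could look like the sketch
below (using @sentry/node; the key denylist is illustrative and should track the field classification in
docs/DATA_PRIVACY.md):

```ts
// sentry-scrub.ts: redact sensitive keys before events leave the process.
import * as Sentry from '@sentry/node';

const SENSITIVE_KEY = /token|password|secret|authorization|cookie|email/i;

// Recursively replace values whose keys look sensitive.
function scrub(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(scrub);
  if (value && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEY.test(k) ? '[REDACTED]' : scrub(v);
    }
    return out;
  }
  return value;
}

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  beforeSend(event) {
    if (event.extra) event.extra = scrub(event.extra) as typeof event.extra;
    if (event.request?.data) event.request.data = scrub(event.request.data);
    return event; // return null instead to drop the event entirely
  },
});
```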
+ +--- + +### 16.4 Compliance evidence & logs retention + +Retention policy (recommended baseline) + +- Application logs (error/transactional): + - Retain detailed logs for 30 days in primary logging store for debugging and triage. + - Retain aggregated/rollup metrics for 365 days for trends and capacity planning. +- Audit & security logs (admin actions, migration runs, audit trail for data deletion): + - Retain for minimum 1 year (or longer if regulatory requirements demand); ideally 3–7 years depending on legal + context. +- ETL & processing metadata (etl_errors, backfill_audit, backfill_jobs): + - Retain for 365 days by default. Archive older entries to cold storage if needed. +- Backups & archived snapshots (S3): + - Keep production backups for at least 90 days online. Move older backups to cold storage tiers per retention + policy. + - For legal holds or compliance requests, preserve required data longer as directed by legal. +- Security & compliance artifacts (pentest reports, DPIA, SOC2 evidence): + - Keep indefinitely in a secured, access-controlled repository for audit purposes. + +Access & integrity + +- [MUST] Audit access to logs and backups: log who accessed, when, and reason. +- [MUST] Protect log storage with ACLs and encryption. +- [SHOULD] Implement tamper-evident storage or immutability where required for forensic chain-of-custody (e.g., + retention vaults). + +Evidence for audits + +- Maintain the following evidence for compliance or audit requests: + - Migration runbooks and execution logs (who ran migrations, when, and output). + - Backup and restore logs with backup ids and successful restore proof. + - Pentest & remediation reports with timelines and verification. + - RBAC and secrets inventory (who has access to which secrets). + - Data deletion logs for user DSR requests (what was deleted, when, and by whom). + +Privacy & DSR logging + +- Log DSR requests (Right to access, right to be forgotten) and the action taken; keep proof of deletion/archival and + any correspondence with the user. +- Section docs/DATA_PRIVACY.md should specify SLA for responding to DSRs (e.g., 30 days as per GDPR). + +Legal hold & e-discovery + +- Provide a means to suspend deletion and retention policies for data subject to legal hold; document the procedure, + authorization steps and access control. + +--- + +### 16.5 Secrets rotation policy + +Purpose & goals + +- Minimize risk from leaked or compromised secrets by enforcing regular rotation, limiting secret lifetime and enabling + rapid revocation and re-issuance. + +Secret types & owners + +- CI/CD secrets (GitHub Secrets, GHCR tokens): owner = SRE/DevOps +- Runtime secrets (DATABASE_URL, REDIS_URL): owner = SRE/DevOps, application config +- Service accounts & cloud keys (AWS/GCP): owner = Infrastructure/Cloud team +- Bot tokens (Discord): owner = Bot operator +- API keys for external partners: owner = Integrations lead + +Rotation frequency & rules + +- Short-lived tokens (recommended where supported, e.g., OIDC / ephemeral credentials): + - Rotate automatically based on provider; prefer ephemeral credentials. +- Long-lived secrets (where unavoidable): + - Database credentials: rotate at least every 90 days. + - Service account keys (JSON): rotate at least every 90 days or migrate to OIDC federation to avoid keys. + - Bot tokens: rotate every 90 days or immediately on suspected compromise. + - API keys for third parties: follow vendor guidance; rotate at least every 180 days. 
+- Access keys with high privilege (migrations, admin): rotation and multi-person approval for re-issuance. + +Automated rotation practices + +- Use a secrets manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) that supports programmatic rotation + and versioning. +- Where possible, use cloud provider OIDC federation (GitHub Actions -> cloud) to avoid storing static credentials. +- Integrate secret rotation into CI/CD pipelines: deploy rotated secret, run smoke tests, then retire previous secret. + +Revocation & incident steps + +- On suspected or confirmed compromise: + 1. Immediately revoke the secret (disable token or change password). + 2. Issue a new secret and update runtime config via secrets manager + CI deployment. + 3. Re-deploy or restart services that consume the secret in a controlled manner. + 4. Run the SECRET_COMPROMISE runbook (rotate, audit logs, notify stakeholders, escalate to security). + 5. Record incident, root cause, and mitigation steps. + +Operational considerations + +- Maintain an inventory of all secrets, owners and last-rotation timestamps (securely). +- Use automation to detect secrets that are past rotation windows and create tickets/alerts. +- Limit the blast radius by scoping secrets narrowly (least privilege) and using different secrets per environment ( + dev/staging/prod). +- Avoid secret sprawl: prefer a small number of managed secrets references (e.g., one DATABASE_URL per environment) + rather than multiple copies. + +Audit & verification + +- Periodically run automated scans for secret usage in code, history, and CI logs. +- Track rotation events and provide proof of rotation for audits (timestamped events from secrets manager). +- Include secret rotation checks in regular compliance evidence package. + +--- + +References & next steps + +- Add a dedicated docs/SECURITY_REVIEW.md containing the threat model diagram, asset inventory and signed-off + mitigations. +- Add /docs/OP_RUNBOOKS/SECRET_COMPROMISE.md with step-by-step rotation & incident handling procedures. +- Implement automatic log redaction and Sentry scrubbing for PII and tokens. +- Schedule an external pentest prior to GA and add remediation timelines into the project plan. + +--- + +## 17. Risks, Assumptions & Mitigations + +This section collects the project’s core assumptions, a prioritized risk register with mitigation plans, and the open +questions or decisions that still need resolution. Use this section to drive risk reviews, prioritize spikes, and record +stakeholder decisions. + +--- + +### 17.1 Key assumptions + +These assumptions underlie the designs, timelines and choices in this PRD. Treat them as validated premises or items +requiring early verification. + +- The upstream get_hero_profile JSON shape is reasonably stable for core fields (namecode, troops, pets, teams); + occasional new fields may appear but not frequent breaking shape changes. +- Raw snapshots are available to be stored as JSONB in Postgres and storing them (TOAST) is acceptable cost-wise for the + short term. +- The project will use a managed Postgres provider (e.g., Supabase) that allows pgcrypto or provides an acceptable + fallback for UUID generation. +- Node.js + pnpm is the primary runtime and packaging tool for API and worker services. +- Team can enforce manual approval for production migrations via GitHub Environments (no automatic apply-on-merge). +- Worker compute (memory/CPU) can be autoscaled and Redis (or alternative queue) is available and permitted by ops + budget. 
+- Backups and point-in-time restore (PITR) are available through the DB provider and can be executed in case of + migration issues. +- Community users will run CLI-based login flows locally for large payloads; credentials will not be persisted by the + backend. +- Performance targets (p95 read < 200ms, ETL interactive p95 < 30s) are achievable after normalization and indexing. +- The team has or will obtain authority to enable required DB extensions or accept documented fallback solutions if + provider restricts extensions. +- Security & compliance requirements (GDPR/DSR) are implementable within the project scope and timeline. + +--- + +### 17.2 Risk register (ID, description, probability, impact, mitigation) + +Entries are ordered roughly by priority (high to low). Probability and impact use High / Medium / Low. + +- RISK-001 — Migration causes production outage + - Probability: Medium + - Impact: High + - Description: A migration (DDL or index build) acquires locks or triggers a long-running operation that blocks + production reads/writes. + - Mitigation: + - Preflight migrations against staging; estimate index build times. + - Use phased migrations (add columns nullable → backfill → enforce constraints). + - Create backups immediately before production migration. + - Run production migrations only with manual workflow and approver present. + - Have rollback runbook and on-call DB admin available. + - Owner: Engineering Lead + +- RISK-002 — ETL backfill overloads DB (connection/IO exhaustion) + - Probability: Medium + - Impact: High + - Description: Large-scale backfill or poorly throttled workers overwhelm DB causing degraded production + performance. + - Mitigation: + - Throttle backfill concurrency; process in small batches. + - Use connection pooling (pgbouncer) and limit per-worker DB connections. + - Monitor DB metrics and implement automatic pause if thresholds exceeded. + - Consider using replica/isolated backfill cluster if volume is very large. + - Owner: SRE / DevOps + +- RISK-003 — Worker OOM or excessive memory use when parsing large snapshots + - Probability: Medium + - Impact: High + - Description: 2–3MB snapshots parsed naively can cause worker OOM and instability. + - Mitigation: + - Implement stream-aware or chunked parsers; process large arrays in batches. + - Enforce maximum snapshot size limit and handle larger payloads asynchronously. + - Add memory monitoring and autoscale worker pods based on memory/queue metrics. + - Owner: Backend Engineer (ETL) + +- RISK-004 — Provider denies required DB extension + - Probability: Medium + - Impact: Medium + - Description: Managed Postgres provider refuses CREATE EXTENSION pgcrypto or other needed extensions. + - Mitigation: + - Prefer pgcrypto but implement fallback (generate UUIDs in app). + - Isolate extensions into a single migration and detect permission errors early in preflight. + - Document provider requirements and request operations team/provider to enable extension where possible. + - Owner: DevOps + +- RISK-005 — Sensitive data (tokens/PII) leaked in logs or artifacts + - Probability: Low + - Impact: High + - Description: Raw snapshots or tokens are accidentally logged or included in CI artifacts leading to exposure. + - Mitigation: + - Enforce automatic log scrubbing and Sentry scrubbing rules. + - Pre-commit hooks and CI secret scanning (git-secrets, GitHub secret scanning). + - Do not include raw snapshots in public artifacts; redact in stored artifacts. 
+ - Maintain SECRET_COMPROMISE runbook and rotate secrets promptly. + - Owner: Security Officer + +- RISK-006 — Upstream API rate limits or changes disrupt ingestion + - Probability: Medium + - Impact: Medium + - Description: Upstream service rate-limits or modifies API causing fetch failures or inconsistent payloads. + - Mitigation: + - Implement exponential backoff and retry logic in clients. + - Record upstream rate-limit responses and surface friendly messages to users. + - Preserve raw snapshots and build ETL tolerant to missing/extra fields; employ quarantining for malformed + snapshots. + - Owner: Backend Engineer / Integrations lead + +- RISK-007 — Duplicate snapshots flooding DB + - Probability: Medium + - Impact: Low/Medium (storage growth) + - Description: Clients may submit identical snapshots repeatedly causing storage growth and repeated ETL work. + - Mitigation: + - Compute and check content_hash (SHA256) on insert; dedupe within a configurable window. + - Record duplicate attempts without full duplicate insertion. + - Monitor duplicate rate and surface alerts when threshold exceeded. + - Owner: Backend Engineer + +- RISK-008 — Dependency vulnerability or supply-chain attack + - Probability: Medium + - Impact: High + - Description: Malicious or vulnerable NPM package compromises CI/runtime. + - Mitigation: + - Enable Dependabot and SCA tools; run SAST in CI. + - Pin critical dependencies and use reproducible builds. + - Limit GitHub Actions permissions and use OIDC for cloud credentials. + - Owner: Engineering Lead / Security + +- RISK-009 — Unauthorized access to raw snapshots or admin endpoints + - Probability: Low + - Impact: High + - Description: Misconfigured RBAC or leaked admin keys give external access to sensitive admin APIs or snapshots. + - Mitigation: + - Enforce RBAC and scoped tokens for admin endpoints. + - Require multi-person approval for migration workflows and limit environment access. + - Audit admin actions and keep audit logs with retention. + - Owner: Security / DevOps + +- RISK-010 — Cost overrun due to high storage or backfill compute + - Probability: Medium + - Impact: Medium + - Description: Storing many raw snapshots and running heavy backfills increase monthly infra costs unexpectedly. + - Mitigation: + - Implement and enforce retention/archival policies. + - Estimate costs before backfill and run within budget alerts. + - Use cold storage (S3 infrequent/Glacier) for older archives. + - Add budget alerts and automated throttles for heavy jobs. + - Owner: Finance / DevOps + +- RISK-011 — Data integrity mismatch between raw snapshot and normalized tables + - Probability: Low/Medium + - Impact: Medium + - Description: Mapping bugs or schema mismatches cause normalized records to diverge from raw content. + - Mitigation: + - Provide extensive unit/integration tests against sample snapshots. + - Perform dry-runs with validation queries and golden outputs. + - Preserve raw JSON and unmapped fields to enable replay and corrections. + - Add backfill_audit and etl_errors for traceability. + - Owner: Data Engineer / Backend + +--- + +### 17.3 Open questions / decisions pending + +List of unresolved items that require explicit decisions; include an owner and suggested resolution approach. 

- Q-001: Exact ETL queue technology choice
  - Status: Pending
  - Options: Redis + BullMQ (current preference) vs AWS SQS / PubSub (managed)
  - Impact: operational complexity, persistence guarantees, cost
  - Owner: Engineering Lead
  - Suggested resolution: Evaluate Redis vs SQS using a short spike comparing durability, ease of retries and cost;
    pick SQS if long-term durability and operational simplicity are prioritized.

- Q-002: Default retention policy for snapshots (days and per-user N)
  - Status: Pending
  - Options: 30 / 90 / 180 days and keep last N snapshots per user
  - Owner: Product Owner + Legal
  - Suggested resolution: Default to 90 days + keep last 30 snapshots per user; confirm with Legal for GDPR.

- Q-003: Whether the profile_summary endpoint will be public (no auth) or require authentication
  - Status: Pending
  - Impact: Rate limiting, privacy, caching, user discoverability
  - Owner: Product Owner
  - Suggested resolution: Make the read endpoint public but rate-limited; sensitive raw snapshot access remains
    restricted.

- Q-004: DB provider selection and extension availability
  - Status: Pending
  - Impact: UUID strategy, migration scripts
  - Owner: DevOps
  - Suggested resolution: Confirm Supabase/managed Postgres capability and whether pgcrypto is allowed; if not, adopt
    app-side UUIDs.

- Q-005: Backup & restore SLAs (RTO/RPO) to target for production
  - Status: Pending
  - Owner: SRE / Product
  - Suggested resolution: Define RTO = 1 hour and RPO = 1 hour as initial targets; confirm the provider can support
    them and budget accordingly.

- Q-006: Feature flag system to use (in-house DB flags vs third-party)
  - Status: Pending
  - Impact: rollout control, SDK complexity, cost
  - Owner: Engineering Lead + Product
  - Suggested resolution: Start with an in-house feature_flags table + simple SDK; revisit third-party if flag and
    experimentation needs grow.

- Q-007: Decision on using read replicas for scaling reads vs caching (Redis) for profile_summary
  - Status: Pending
  - Impact: consistency, cost, operational complexity
  - Owner: SRE
  - Suggested resolution: Implement short-term Redis caching for summaries; plan a read-replica architecture as load
    grows.

- Q-008: Level of sensitivity for fields in snapshots (what is considered PII)
  - Status: Pending
  - Owner: Legal / Security
  - Suggested resolution: Produce a field-level mapping from sample snapshots and have Legal classify fields; then
    implement redaction rules.

- Q-009: Policy for storing tokens that appear in snapshots (if upstream includes session tokens)
  - Status: Pending
  - Owner: Security Officer
  - Suggested resolution: Default policy is "do not store"; if storing is required, encrypt and restrict access, and
    document retention and rotation.

- Q-010: Canary rollout thresholds and automated gating criteria
  - Status: Pending
  - Owner: Engineering Lead + SRE + PO
  - Suggested resolution: Define concrete gates (ETL failure rate < 0.5%, API p95 within +20% of baseline, queue
    depth < threshold) to allow automated promotion; add a manual approval step for production migration.

---

Action items

- Resolve open questions Q-002, Q-004, Q-005 and Q-008 as highest priority before any large-scale backfill or
  production migration.
- Assign owners to outstanding questions and schedule decision checkpoints during the next planning meeting.
- Add high-priority mitigation tasks for RISK-001 through RISK-004 to the immediate backlog (spikes & tests).

---

## 18. Dependencies & Stakeholders

This section lists the internal and external dependencies required to deliver the Player Profile & DB Foundation work
and a RACI (Responsible / Accountable / Consulted / Informed) matrix describing stakeholder responsibilities for the
major activities. Use this section to coordinate cross-team work, request permissions, and track who to contact for
approvals.

---

### 18.1 Internal dependencies (teams, services)

Teams

- Product
  - Prioritizes features, defines acceptance criteria and approves releases.
- Backend / Platform Engineering
  - Implements the API, ETL worker, migrations, and data model.
- DevOps / SRE
  - Manages CI/CD, production infrastructure, backups, scaling, secrets and runbooks.
- Data Engineering / Analytics
  - Seeds catalogs, builds materialized views, validates backfill and exports.
- QA / Test Engineering
  - Creates test plans, runs integration/E2E tests and validates staging.
- Security & Privacy
  - Reviews the threat model, approves retention/PII handling, coordinates the pentest.
- Community / Bot Operator
  - Runs and monitors the Discord bot, coordinates community betas and feedback.
- Legal / Compliance
  - Advises on GDPR/CCPA and data residency/DSR obligations.
- Documentation / Developer Experience
  - Maintains docs: DB_MIGRATIONS.md, ETL_AND_WORKER.md, OP_RUNBOOKS, onboarding.

Internal services & components

- GitHub (Repositories, Actions, Environments)
  - Workflows, protected environments, secret storage, manual approvals.
- CI runners (GitHub Actions)
  - Build, test, migration preflight, publish artifacts.
- Container Registry (GHCR)
  - Stores built images for the API, worker, bot and jobs.
- Observability stack (Prometheus, Grafana, Sentry, logging pipeline)
  - Metrics, dashboards, alerting and error aggregation.
- Internal secret manager (if in use) or GitHub Secrets
  - Storage for DATABASE_URL, DISCORD_TOKEN, REDIS_URL, GOOGLE_SA_JSON, GHCR_PAT.
- Internal artifact/storage (S3/GCS)
  - Archives snapshots, export files and backup artifacts.

Cross-team coordination points

- DB extension & provider constraints — coordinate DevOps and Backend to confirm provider capabilities.
- Migration schedule — Product, Engineering and SRE must agree on windows and approvers.
- Backfill windows & capacity — Data Engineering + SRE to plan concurrency and cost.
- Security review & pentest scheduling — Security, Product and Engineering to approve scope and remediation timelines.

---

### 18.2 External dependencies (third-party services)

Primary external services

- Managed Postgres (Supabase or equivalent)
  - Stores hero_snapshots JSONB and normalized tables. Dependencies: extension support (pgcrypto), backups, connection
    limits and pricing.
- Redis / Queue provider (self-hosted Redis or a managed provider such as Upstash/RedisLabs)
  - Job queue backend (BullMQ) or small cache for flags and rate limiting.
- Object storage (AWS S3, DigitalOcean Spaces, GCS or S3-compatible)
  - Archives raw snapshots, exports and large artifacts.
- Discord (API & Gateway)
  - Bot integration, slash commands, webhooks and guild interactions.
- Upstream game API (get_hero_profile)
  - Source of the raw profile snapshots; rate limits and format stability are external constraints.
- GitHub (Actions, Environments, GHCR)
  - CI/CD, protected workflows, secret storage and image registry.
+- Monitoring / error tracking providers (Prometheus + Grafana, Sentry or hosted alternatives) + - Metrics aggregation, dashboards and exception tracking. +- Cloud provider APIs (AWS/GCP) or platform services + - If used for storage, backups, or additional compute; may require service accounts and billing. +- Third-party feature-flag or experimentation providers (optional) + - If adopted later for rollout control (e.g., LaunchDarkly). +- Dependency/security scanners (Dependabot, Snyk) + - Supply-chain monitoring and vuln alerts. + +Operational constraints & SLAs to confirm + +- Rate limits and quotas for Discord and upstream game API. +- Connection limits and available extensions for chosen Postgres provider. +- Cost/usage quotas for GitHub Actions, GHCR and cloud services. +- Data residency constraints for object storage (region availability). + +Third-party contact & support expectations + +- Identify account owners and support tiers for each provider (e.g., Supabase contact, S3 support plan). +- Plan for support escalation for production incidents affecting third-party services. + +--- + +### 18.3 Stakeholder RACI / responsibility matrix + +This RACI matrix maps key activities to stakeholders. Use it to determine who executes work, who signs off, who should +be consulted, and who must be informed. + +Key: R = Responsible (do the work), A = Accountable (final sign-off), C = Consulted (advised/inputs), I = Informed (kept +up-to-date) + +Activities / Stakeholders + +- PO = Product Owner +- TL = Technical Lead / Engineering Lead +- BE = Backend Engineers +- DE = Data Engineering / Analytics +- SRE = DevOps / SRE +- QA = Quality Assurance +- SEC = Security & Privacy +- BOT = Bot Operator / Community Manager +- LEG = Legal / Compliance +- DOC = Documentation / DevEx + +1) Schema migrations & DB bootstrap + +- Responsible: BE, SRE (BE writes migrations, SRE runs/provisions) +- Accountable: TL +- Consulted: PO, QA, SEC, LEG +- Informed: DOC, BOT + +2) Snapshot ingestion endpoint & CLI integration + +- Responsible: BE +- Accountable: TL +- Consulted: BOT, SRE, QA +- Informed: PO, DOC + +3) ETL worker implementation & backfill + +- Responsible: BE, DE +- Accountable: TL +- Consulted: SRE, QA, SEC +- Informed: PO, DOC + +4) Backfill execution & operational run + +- Responsible: DE, SRE +- Accountable: TL +- Consulted: PO, BE, QA +- Informed: BOT, DOC, LEG + +5) Profile summary API & bot commands + +- Responsible: BE, BOT +- Accountable: TL +- Consulted: PO, QA +- Informed: SRE, DOC + +6) Admin UI & reprocess endpoints + +- Responsible: BE, DOC (API + UI) +- Accountable: TL +- Consulted: SRE, QA, SEC +- Informed: PO, BOT + +7) Retention & archival jobs + +- Responsible: SRE, BE +- Accountable: TL +- Consulted: DE, LEG, SEC +- Informed: PO, DOC + +8) Observability, metrics & alerts + +- Responsible: SRE +- Accountable: SRE Lead (or TL if SRE embedded) +- Consulted: BE, DE, QA +- Informed: PO, BOT, SEC + +9) Security review & pentest + +- Responsible: SEC +- Accountable: SEC Lead +- Consulted: BE, SRE, TL +- Informed: PO, LEG + +10) CI/CD & production deployments (including db-bootstrap workflow) + +- Responsible: SRE, BE (CI config) +- Accountable: SRE Lead +- Consulted: TL, PO, QA +- Informed: DOC, BOT + +11) Incident response & runbook execution + +- Responsible: On-call Engineer (SRE/BE) +- Accountable: SRE Lead / TL +- Consulted: PO, SEC, LEG +- Informed: All stakeholders via incident channel + +12) Legal / Compliance sign-off (retention, DSR) + +- Responsible: LEG +- Accountable: LEG lead +- 
Consulted: PO, SEC, SRE
- Informed: TL, DOC

13) Developer docs & onboarding

- Responsible: DOC, BE
- Accountable: TL
- Consulted: QA, SRE
- Informed: PO

Example RACI table (condensed)

| Activity                | R      | A        | C             | I       |
|-------------------------|--------|----------|---------------|---------|
| Migrations & bootstrap  | BE/SRE | TL       | PO/QA/SEC/LEG | DOC     |
| ETL worker              | BE/DE  | TL       | SRE/QA/SEC    | PO      |
| Backfill                | DE/SRE | TL       | BE/QA/PO      | LEG/BOT |
| Profile API             | BE     | TL       | BOT/QA        | SRE/PO  |
| Observability           | SRE    | SRE Lead | BE/DE         | PO      |

Notes & recommendations

- Assign named owners for each role as the project matures (replace generic roles with specific people).
- For critical actions (production migration, data deletion/DSR), require explicit multi-person approvals (Accountable
  + one approver).
- Keep the RACI matrix in docs/OWNERS.md and update it when responsibilities change.
- Ensure all teams have a clear escalation contact and a primary + secondary on-call rotation.

---

## 19. Acceptance Criteria & Definition of Done

This section defines the concrete acceptance criteria that must be met before features, epics or the project can be
considered done. It covers functional story-level criteria, non‑functional requirements (SLAs, performance, security),
compliance items required to mark work complete, and the final sign‑off process (who must approve).

---

### 19.1 Functional acceptance criteria (per epic / story)

Below are the core epics and their minimal functional acceptance criteria. Each story derived from these epics must map
to one or more of the criteria below or have its own Given/When/Then acceptance steps recorded in the backlog.

EPIC-DB-FOUNDATION

- node-pg-migrate is integrated and the initial migrations are present in database/migrations/.
- Running migrations locally:
  - Given a fresh Postgres instance and DATABASE_URL, when `pnpm run migrate:up` is executed, then the tables listed in
    DB_MODEL.md are created without error.
- Bootstrap workflow:
  - Given repository secrets are present, when the manual db-bootstrap GitHub Action is triggered, then migrations and
    seeds complete and a sanity query returns the expected tables.
- Seeds are idempotent:
  - Re-running seeds should not create duplicate seed rows.

EPIC-SNAPSHOT-INGESTION

- Raw snapshot persistence:
  - Given a valid get_hero_profile payload, when POST /api/v1/internal/snapshots is called, then hero_snapshots
    contains a JSONB row with size_bytes and content_hash.
- Duplicate detection:
  - When the same payload is posted within the dedupe_window, the system returns a duplicate response and does not
    insert a duplicate raw row (the duplicate attempt is recorded).
- CLI integration:
  - The CLI can save payloads locally and optionally POST to the ingestion endpoint; a successful run prints the
    snapshot id.

EPIC-ETL-WORKER (a minimal code sketch follows this list)

- Worker idempotency:
  - Given a hero_snapshot row, when the worker processes it, then processed_at is set and the normalized tables (users,
    user_troops, user_pets, user_profile_summary) reflect the expected data.
  - Reprocessing the same snapshot does not create duplicate rows and leaves the normalized state consistent.
- Large payload handling:
  - Given a ~3MB snapshot and a constrained memory environment, when the worker runs, then it completes without OOM and
    updates processed_at (or marks the snapshot for retry on transient failure).
- Error handling:
  - Malformed snapshots are recorded in etl_errors and do not crash the worker; admins can reprocess after a fix.
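
To make the worker idempotency criteria above concrete, here is a minimal sketch of one processing pass, assuming
node-postgres (`pg`), the table names from DB_MODEL.md, and unique constraints on `users.namecode` and
`user_troops (user_id, troop_id)`; the payload field names are illustrative, and the real mapping belongs in
docs/ETL_AND_WORKER.md.

```typescript
import { Pool, PoolClient } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Illustrative payload shape; the real field mapping lives in docs/ETL_AND_WORKER.md.
interface HeroProfile {
  namecode: string;
  troops?: Array<{ id: number; level: number; count: number }>;
}

// Idempotent processing: running this twice for the same snapshot converges on the
// same normalized state and never duplicates rows.
export async function processSnapshot(snapshotId: string): Promise<void> {
  const client: PoolClient = await pool.connect();
  try {
    await client.query("BEGIN");
    const res = await client.query(
      "SELECT payload FROM hero_snapshots WHERE id = $1 AND processed_at IS NULL FOR UPDATE",
      [snapshotId],
    );
    if (res.rowCount === 0) {
      await client.query("ROLLBACK"); // already processed or unknown id
      return;
    }
    const profile = res.rows[0].payload as HeroProfile;

    // Upsert the user; the no-op DO UPDATE makes RETURNING work on conflict.
    const user = await client.query(
      `INSERT INTO users (namecode) VALUES ($1)
       ON CONFLICT (namecode) DO UPDATE SET namecode = EXCLUDED.namecode
       RETURNING id`,
      [profile.namecode],
    );
    const userId: string = user.rows[0].id;

    // Row-by-row for clarity; a real worker would batch these upserts to bound
    // round-trips and memory on 2-3 MB payloads.
    for (const t of profile.troops ?? []) {
      await client.query(
        `INSERT INTO user_troops (user_id, troop_id, level, count)
         VALUES ($1, $2, $3, $4)
         ON CONFLICT (user_id, troop_id)
         DO UPDATE SET level = EXCLUDED.level, count = EXCLUDED.count`,
        [userId, t.id, t.level, t.count],
      );
    }

    // Mark processed last, in the same transaction as the upserts.
    await client.query(
      "UPDATE hero_snapshots SET processed_at = now() WHERE id = $1",
      [snapshotId],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err; // surfaces to the queue for retry / etl_errors recording
  } finally {
    client.release();
  }
}
```

Setting `processed_at` inside the same transaction as the upserts is what makes retries safe: a crash before COMMIT
leaves the snapshot unmarked and eligible for retry, and an admin reprocess flow can simply clear `processed_at`
before re-enqueueing.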

EPIC-API-BACKEND & BOT

- Profile summary endpoint:
  - Given a processed profile, when GET /api/v1/profile/summary/:namecode is called, then it returns the denormalized
    user_profile_summary.
- Fallback behavior:
  - If the summary is missing but a processed snapshot exists, the endpoint returns best-effort data or a 202 with an
    ETA message.
- Bot command:
  - The slash command `/profile <namecode>` returns the summary or a friendly "processing" message; it respects Discord
    timeouts.

EPIC-ANALYTICS

- Materialized view:
  - Given user_troops is populated, when the materialized view is refreshed, then a query for troop ownership runs
    within an acceptable time for the staging dataset.

EPIC-DEVEX & DOCS

- Documentation:
  - docs/DB_MIGRATIONS.md, docs/ETL_AND_WORKER.md and .env.example exist and show a step-by-step local bootstrap.
- Sample payload:
  - examples/get_hero_profile_*.json are present and ingest-sample.sh populates the normalized tables in a local
    environment.

EPIC-SECURITY & PRIVACY

- No plaintext passwords stored:
  - When a login-based payload is processed, no user passwords are persisted in the DB or logs.
- Token redaction:
  - Logs and Sentry events do not contain unredacted tokens or credentials.

Operational acceptance

- Admin reprocess endpoint:
  - Given a snapshot id and admin auth, POST /admin/snapshots/:id/reprocess enqueues a job and returns 202 with a job
    id.

---

### 19.2 Non-functional acceptance criteria (SLA, perf, security)

These measurable NFRs must be met before marking a release done for production.

Performance & availability

- Profile summary read latency:
  - p95 < 200ms, p99 < 500ms for the typical staging dataset when serving user_profile_summary.
- Ingestion ack latency:
  - POST /internal/snapshots ack p95 < 1s, p99 < 3s.
- ETL processing SLA (interactive):
  - Median processing time < 10s, p95 < 30s for small/average snapshots. Treat large login payloads (2–3MB) as
    asynchronous; aim for p95 < 5 minutes initially.
- Throughput baseline:
  - The system can sustain at least 100 snapshots/hour per worker pool configuration; autoscaling must allow scaling
    beyond the baseline for spikes.

Reliability & resilience

- Uptime / SLA:
  - Target 99.9% availability for the read API during business hours.
- Error rates:
  - ETL job failure rate < 1% (transient errors retried automatically).
- Backups:
  - Automated backups configured; a verified restore procedure documented and a successful restore drill performed
    before GA.

Security & compliance

- Authentication & RBAC:
  - Admin endpoints require scoped tokens and role checks; unauthenticated access to raw snapshots is disallowed.
- Secrets handling:
  - No secrets present in the repo; GitHub Actions secrets are used and not printed to logs.
- Pentest:
  - External pentest scheduled and medium/high issues resolved or covered by documented mitigations before GA.

Observability

- Metrics:
  - ETL processed_count, failure_count, processing_latency, and queue_depth are exported and visible on dashboards.
- Alerts:
  - Alerts for ETL failure spikes, high queue depth, and DB connection saturation are configured with on-call
    recipients.

Scalability & capacity

- Connection limits:
  - Worker concurrency settings must respect DB connection limits (no more than the configured max connections); see
    the sketch after this list.
- Partitioning / indexing:
  - Indexes required for the primary query patterns are present and validated by explain plans for critical queries.
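
One way to satisfy the connection-limit criterion above is to derive both the `pg` pool size and the queue concurrency
from a single budget. The sketch below assumes BullMQ (the current Q-001 preference) and node-postgres; the environment
variable names are illustrative, not an existing config contract.

```typescript
import { Worker } from "bullmq";
import IORedis from "ioredis";
import { Pool } from "pg";

// Illustrative knobs, not an existing config contract.
const DB_MAX_CONNECTIONS = Number(process.env.DB_MAX_CONNECTIONS ?? "100");
const WORKER_REPLICAS = Number(process.env.WORKER_REPLICAS ?? "4");
const RESERVED_FOR_API = Number(process.env.RESERVED_FOR_API ?? "40");

// Budget: the API reservation plus all worker replicas must stay under the DB cap.
const poolSize = Math.max(
  1,
  Math.floor((DB_MAX_CONNECTIONS - RESERVED_FOR_API) / WORKER_REPLICAS),
);

const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: poolSize });

// BullMQ workers require maxRetriesPerRequest: null on their Redis connection.
const connection = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  maxRetriesPerRequest: null,
});

// Concurrency equals the pool size, so excess jobs wait in the queue instead of
// piling up against the database's connection limit.
export const etlWorker = new Worker(
  "etl_default",
  async (job) => {
    const client = await pool.connect();
    try {
      // ... parse the snapshot and upsert normalized rows (see the ETL sketch in 19.1)
    } finally {
      client.release();
    }
  },
  { connection, concurrency: poolSize },
);
```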
+ +Privacy & data retention + +- Retention policy: + - Snapshot retention configuration implemented (default 90 days or as approved), archival to S3 validated. +- DSR handling: + - Data deletion flow for user erasure requests tested end-to-end and logged. + +--- + +### 19.3 Compliance checklist to mark "Done" + +Before marking an Epic/Release "Done" for production, the following compliance items must be completed and evidence +attached (logs, links or runbook entries): + +General compliance + +- [ ] Data classification documented and approved (docs/DATA_PRIVACY.md). +- [ ] Retention policy defined and implemented (docs/DATA_RETENTION.md). +- [ ] Encryption at rest and in transit confirmed for DB and S3. + +Security & auditing + +- [ ] RBAC policies applied for admin endpoints; list of privileged accounts recorded. +- [ ] Secrets stored in secrets manager / GitHub Secrets; rotation policy documented. +- [ ] Logging redaction implemented (tokens/PII not present in logs/Sentry). +- [ ] Pentest scheduled or completed; critical/high findings resolved or have mitigation timeline. + +Operational & backups + +- [ ] Daily backup configured; last successful backup ID documented. +- [ ] Restore drill performed successfully in staging (date and outcome recorded). +- [ ] Runbooks created and reviewed for: INCIDENT_RESPONSE.md, DB_RESTORE.md, APPLY_MIGRATIONS.md. + +Privacy compliance + +- [ ] DSR (data subject request) flow implemented and tested (deletion and verification sample). +- [ ] Data processing agreements (DPA) in place with cloud providers if EU data is involved. + +Legal & policy + +- [ ] Legal sign-off obtained for retention and data residency constraints (if applicable). +- [ ] Documentation for any third‑party contracts (Discord, provider SLAs) linked. + +Testing & documentation + +- [ ] All required automated tests pass in CI (unit, integration, migration preflight). +- [ ] E2E smoke tests passed in staging with screenshots/logs attached. +- [ ] Documentation updated (README, migration docs, ETL mapping) and links included in release notes. + +Evidence & artifacts + +- [ ] Attach links to: migration run logs, backup id, CI job runs, pentest report (or ticket), dashboards used for + monitoring. + +Only after all the checked items above have corresponding evidence linked to the release ticket should the release be +considered compliant and ready for GA. + +--- + +### 19.4 Sign-off: Product, Engineering, Security, QA + +Final approval requires explicit sign-off from the following roles. Record approver name, date, and any notes in the +release checklist. + +Sign-off table + +- Product Owner (PO) + - Responsibility: Accept functional behavior and user-impacting decisions. + - Sign-off required: Yes + - Example signature line: PO: __________________ Date: ______ Notes: __________________ + +- Engineering Lead / Technical Lead (TL) + - Responsibility: Confirm architecture, migration safety and rollout readiness. + - Sign-off required: Yes + - Example signature line: TL: __________________ Date: ______ Notes: __________________ + +- Security & Privacy Officer (SEC) + - Responsibility: Verify security controls, secret management, pentest remediation and data protection measures. + - Sign-off required: Yes (or documented exceptions) + - Example signature line: SEC: __________________ Date: ______ Notes: __________________ + +- QA Lead + - Responsibility: Confirm tests passed, E2E smoke tests in staging and regression checklist. 
+ - Sign-off required: Yes + - Example signature line: QA: __________________ Date: ______ Notes: __________________ + +- SRE / DevOps Lead (optional but recommended) + - Responsibility: Confirm backups, restore capability, monitoring & alerts and deployment readiness. + - Sign-off required: Recommended + - Example signature line: SRE: __________________ Date: ______ Notes: __________________ + +Sign-off process + +1. Populate release checklist with links to CI runs, test artifacts, migration logs, backup ids and monitoring + dashboards. +2. Each approver reviews checklist items and evidence. +3. Approver signs (digital approval in ticketing system or a GitHub Environment approval) and adds comments if any + conditions apply. +4. If any approver refuses sign-off, the release is blocked until conditions are met or an explicit escalation/exception + is recorded and approved by senior leadership. + +Definition of Done (DoD) summary + +- All functional acceptance criteria satisfied and tests pass. +- Non-functional SLAs and monitoring configured and validated. +- Compliance checklist items completed with evidence. +- Required sign-offs obtained (PO, TL, SEC, QA). +- Release notes and runbooks updated and accessible. + +Once all items are complete and sign-offs recorded, the feature/release may be promoted to production following the +rollout plan in Section 14. + +--- + +## 20. Metrics, Observability & Analytics + +This section specifies what to measure, how to expose it, alerting rules, logging/tracing conventions, telemetry event +schemas and retention policies. Use these guidelines to instrument code, build Grafana dashboards, and satisfy +operational & compliance needs. + +--- + +### 20.1 Business metrics to monitor (dashboards) + +Purpose: track product health, adoption and business outcomes. Dashboards should be organized by audience (Product / +Ops / Data). + +Suggested dashboards and widgets + +1) Snapshot ingestion & freshness (Product / Ops) + +- Ingest rate: snapshots received per minute / hour +- Enqueued vs processed ratio (per minute) +- Average / median snapshot size (bytes) +- Content-hash duplicate rate (%) over window +- Profile freshness: percentage of active users with profile_summary cached within last 5/30/90 minutes +- SLA compliance: percent of snapshot ACKs under 1s + +2) Bot & user engagement (Product) + +- Bot command usage: /profile commands per minute, per guild +- Bot success rate: percentage of commands that return summary vs "processing" errors +- Active unique namecodes per day / week +- Top N guilds by command volume + +3) ETL throughput & business outcomes (Data) + +- Snapshots processed per hour (by worker pool) +- Number of user records created/updated per hour +- Troop ownership changes: top changed troop_ids (useful for product analytics) +- Materialized view refresh status and last refresh timestamp + +4) Storage & cost monitoring (Finance/OPS) + +- DB storage growth (MB/day) for hero_snapshots table +- S3 archival bytes written per day +- Estimate monthly cost for storage/compute (if available from cloud provider) + +5) Backfill & migration progress (Ops/Data) + +- Backfill job progress: total snapshots, processed, failed +- Backfill throughput (snapshots/hour) +- Migration run status and applied migration id + +Visualization tips + +- Use heatmaps or time-series with p50/p95/p99 shading. +- Add annotations for deployments, migration runs and manual bootstrap events to correlate with metric spikes. 
+- Provide a succinct "At-a-glance" status tile showing overall system health: OK / Degraded / Critical derived from key + alerts. + +--- + +### 20.2 Technical metrics & alerts + +Instrument the system with metrics exposed for Prometheus (or equivalent). Provide clear alerting rules with playbooks. + +Core metric types (per service) + +- Counters: + - snapshots_received_total + - snapshots_enqueued_total + - snapshots_processed_total + - snapshots_failed_total + - etl_entity_upserts_total (tagged by entity: users, user_troops, user_pets, etc.) + - bot_commands_total (tag: guild, command) +- Gauges: + - queue_depth (per queue) + - worker_pool_instances + - worker_memory_bytes / worker_cpu_seconds + - db_connections_current + - db_replication_lag_seconds + - hero_snapshots_table_size_bytes +- Histograms / Summaries: + - api_request_duration_seconds (labels: endpoint, method, status) + - etl_processing_duration_seconds (labels: success/failure) + - snapshot_size_bytes distribution + - etl_entity_upsert_latency_seconds + +Recommended alert rules (with suggested thresholds and urgency) + +- P0: ETL failure rate spike + - Condition: rate(snapshots_failed_total[5m]) / rate(snapshots_processed_total[5m]) > 0.01 (i.e., >1%) + - Action: page on-call, runbook INCIDENT_RESPONSE +- P0: Queue depth high (backlog) + - Condition: queue_depth > X (configurable; e.g., >500) for >10 minutes OR queue depth growth > 3x baseline + - Action: page on-call, investigate worker health and scale +- P0: DB connection exhaustion + - Condition: db_connections_current > 0.9 * db_connections_max + - Action: page on-call, reduce worker concurrency, scale DB if necessary +- P0: DB replication lag critical + - Condition: db_replication_lag_seconds > 30s (tunable) + - Action: page; investigate replica health +- P1: API error rate increase + - Condition: rate(http_requests_total{status=~"5.."}[5m]) > baseline * 5 or error rate > 1% of requests + - Action: notify Slack & on-call +- P1: API latency regression + - Condition: api_request_duration_seconds{endpoint="/api/v1/profile/summary"} p95 > 2× baseline for 10m + - Action: notify SRE; consider temporary throttling +- P1: Worker OOM or repeated process restarts + - Condition: rate(worker_restarts_total[10m]) > 3 + - Action: page & investigate memory usage +- P1: Snapshot ACK latency high + - Condition: api snapshot ack p95 > 3s + - Action: notify and investigate upstream or DB latency +- P2: Duplicate snapshot rate increase + - Condition: rate(duplicate_snapshots_total[1h]) / rate(snapshots_received_total[1h]) > 0.2 + - Action: notify product and review client behavior + +Alert content should include: summary, affected service, recent metric snippets, runbook link, and suggested remediation +steps. Tie alerts to runbooks in docs/OP_RUNBOOKS/*. + +Noise reduction & escalation + +- Use multi-window evaluation (5m & 1h) to avoid transient noise. +- Require consecutive alerts or burst detection before paging for non-critical metrics. +- Automatically escalate to TL if alert unresolved for defined time windows (e.g., 30/60 minutes). + +--- + +### 20.3 Logging & tracing guidelines + +Goal: consistent, actionable logs and traces to speed triage and preserve privacy. + +Logging guidelines + +- Structured logs (JSON) only. 
Each log record should include standardized fields: + - timestamp (ISO 8601) + - service (api | worker | bot | admin-ui) + - env (staging | production) + - level (DEBUG | INFO | WARN | ERROR) + - message (short human-readable) + - request_id (X-Request-Id) — correlate API request + - trace_id / span_id — if tracing available + - snapshot_id (where applicable) + - user_id, namecode (only if not PII; prefer namecode as identifier) + - job_id (background jobs) + - module/component + - error_code (if error) + - details (JSON) — any non-sensitive structured context +- Do NOT log: + - Plaintext passwords, raw tokens, secrets or full PII fields (email, real name) unless redacted. + - Full raw snapshot payloads in standard logs; use a debug-only path that is disabled in production. If raw + snapshots are required for troubleshooting, log a reference (snapshot_id and s3_path) only. +- Log levels: + - DEBUG: verbose dev info (local & staging only or gated by feature flag) + - INFO: normal operational events (snapshot_enqueued, job_started) + - WARN: recoverable issues (transient upstream error) + - ERROR: failures requiring investigation (etl failure, DB errors) +- Sampling: + - For high-volume endpoints or repetitive errors, sample logs (e.g., 1%) and always log the first N occurrences. +- Centralization: + - Ship logs to a centralized store (ELK, Datadog logs, Logflare). Protect access to logs and ensure redact/scrub + pipelines before long-term storage. + +Tracing guidelines + +- Use OpenTelemetry-compatible tracing libraries. +- Propagate trace_id across: + - API request -> queue enqueue (include request_id and trace_id in job payload) -> worker processing -> DB writes -> + subsequent API calls. +- Typical spans: + - http.server (API ingress) + - queue.enqueue + - queue.dequeue / worker.process + - etl.parse + - db.upsert.users, db.upsert.user_troops, db.index.create + - external.call (call to upstream get_hero_profile) +- Trace retention: keep high-resolution traces for 7 days; store sampling traces (1-5%) for 30 days if allowed. +- Link traces to logs with trace_id to provide full context during incident triage. + +Sentry / error tracking + +- Capture exceptions with structured context (service, request_id, snapshot_id). +- Configure scrubbing rules to remove any PII or tokens from event data before sending. +- Set error alerting for new error types (regression), increasing frequency, or critical severity. + +Operational notes + +- Inject correlation ids early (API middleware) and return X-Request-Id to clients. +- Make request_id visible in user-facing error messages (support code) so users can report issues with trace context. +- Ensure job retries include the original trace metadata where useful (but avoid spamming traces for retry loops). + +--- + +### 20.4 Telemetry events (schema + examples) + +Ship meaningful business and operational events to analytics and event pipelines (e.g., Kafka, Segment, BigQuery). Keep +event schemas versioned. + +Event design principles + +- Events are immutable facts (e.g., snapshot_received). Keep schema small and stable. +- Use snake_case for event names and fields. 
+- Include common header fields in every event: + - event_name + - event_version (semver or integer) + - event_timestamp (ISO 8601 UTC) + - env (staging|production) + - service + - request_id + - trace_id (optional) + - user_id (nullable) + - namecode (nullable) + - snapshot_id (nullable) + - source (cli|bot|ui|upstream) + +Core event schemas (examples) + +1) snapshot_received (v1) + +- Purpose: recorded when API receives a snapshot payload +- Schema: + { + "event_name": "snapshot_received", + "event_version": 1, + "event_timestamp": "2025-11-28T12:34:56Z", + "env": "production", + "service": "api", + "request_id": "uuid", + "trace_id": "trace-uuid", + "user_id": "uuid|null", + "namecode": "COCORIDER_JQGB|null", + "snapshot_id": "uuid", + "source": "fetch_by_namecode|login|cli_upload", + "size_bytes": 234567, + "content_hash": "sha256-hex", + "ingest_latency_ms": 123 + } + +2) snapshot_enqueued (v1) + +- Purpose: a lightweight event after enqueue to queue +- Schema: + { + "event_name": "snapshot_enqueued", + "event_version": 1, + "event_timestamp": "...", + "service": "api", + "snapshot_id": "uuid", + "queue_name": "etl_default", + "queue_depth_at_enqueue": 42 + } + +3) snapshot_processed (v1) + +- Purpose: emitted when worker finishes processing a snapshot (success) +- Schema: + { + "event_name": "snapshot_processed", + "event_version": 1, + "event_timestamp": "...", + "service": "worker", + "snapshot_id": "uuid", + "user_id": "uuid|null", + "namecode": "COCORIDER_JQGB|null", + "processing_time_ms": 5432, + "troops_count": 120, + "pets_count": 3, + "entities_upserted": { "users": 1, "user_troops": 120, "user_pets": 3 }, + "worker_instance": "worker-1", + "success": true + } + +4) snapshot_failed (v1) + +- Purpose: processing failure with limited context +- Schema: + { + "event_name": "snapshot_failed", + "event_version": 1, + "event_timestamp": "...", + "service": "worker", + "snapshot_id": "uuid", + "error_code": "PARSE_ERROR|DB_ERROR|FK_VIOLATION|OOM", + "error_message": "short message (sanitized)", + "retry_count": 2 + } + +5) etl_entity_upserted (v1) + +- Purpose: emitted per entity upsert aggregation (useful for analytics) +- Schema: + { + "event_name": "etl_entity_upserted", + "event_version": 1, + "event_timestamp": "...", + "service": "worker", + "snapshot_id":"uuid", + "entity": "user_troops", + "rows_upserted": 120, + "rows_updated": 10 + } + +6) api_request (v1) + +- Purpose: generic API access logging for analytics and rate-limiting metrics +- Schema: + { + "event_name": "api_request", + "event_version": 1, + "event_timestamp": "...", + "service": "api", + "endpoint": "/api/v1/profile/summary/:namecode", + "method": "GET", + "status_code": 200, + "latency_ms": 120, + "client": "bot|cli|web", + "user_id": "uuid|null", + "namecode": "COCORIDER_JQGB|null" + } + +7) bot_command_executed (v1) + +- Purpose: track bot command usage +- Schema: + { + "event_name": "bot_command_executed", + "event_version": 1, + "event_timestamp": "...", + "service": "bot", + "guild_id": "1234567890", + "channel_id": "9876543210", + "command_name": "/profile", + "user_discord_id": "discord-id", + "namecode": "COCORIDER_JQGB|null", + "response_type": "summary|processing|error", + "latency_ms": 500 + } + +Schema versioning & evolution + +- Start event_version = 1 and increment if breaking changes occur. +- Always make new fields optional; maintain backward compatibility in consumers. +- Store event schemas in docs/events/ or a schema registry so downstream consumers can validate. 
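
As an illustration of the header contract above, the shared fields and one event can be captured as TypeScript types
with a small builder so producers cannot mislabel `event_name` or `event_version`; this is a sketch, not a committed
SDK.

```typescript
// Shared header carried by every telemetry event (mirrors the field list above).
interface EventHeader {
  event_name: string;
  event_version: number;
  event_timestamp: string; // ISO 8601 UTC
  env: "staging" | "production";
  service: "api" | "worker" | "bot" | "admin-ui";
  request_id?: string;
  trace_id?: string;
  user_id?: string | null;
  namecode?: string | null;
  snapshot_id?: string | null;
  source?: "cli" | "bot" | "ui" | "upstream";
}

// snapshot_processed v1, mirroring the schema above.
interface SnapshotProcessedV1 extends EventHeader {
  event_name: "snapshot_processed";
  event_version: 1;
  processing_time_ms: number;
  troops_count: number;
  pets_count: number;
  entities_upserted: Record<string, number>;
  worker_instance: string;
  success: boolean;
}

// The builder stamps the immutable parts so producers cannot mislabel events.
function snapshotProcessed(
  fields: Omit<SnapshotProcessedV1, "event_name" | "event_version" | "event_timestamp">,
): SnapshotProcessedV1 {
  return {
    event_name: "snapshot_processed",
    event_version: 1,
    event_timestamp: new Date().toISOString(),
    ...fields,
  };
}

// Usage: build the event, then hand it to whatever transport is chosen below.
const evt = snapshotProcessed({
  env: "production",
  service: "worker",
  snapshot_id: "uuid",
  processing_time_ms: 5432,
  troops_count: 120,
  pets_count: 3,
  entities_upserted: { users: 1, user_troops: 120, user_pets: 3 },
  worker_instance: "worker-1",
  success: true,
});
console.log(JSON.stringify(evt));
```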
+ +Transport & storage + +- Export events to a streaming system (Kafka / Kinesis / PubSub) or directly to an analytics store (Segment, Snowplow). +- Ensure event pipeline scrubs PII fields per DATA_PRIVACY policy before writing to long-term analytics stores. + +Examples: small sequence + +- User triggers CLI upload → snapshot_received → snapshot_enqueued → snapshot_processed → etl_entity_upserted events + emitted. Downstream dashboards aggregate these into KPIs. + +--- + +### 20.5 Data retention for metrics/logs + +Retention policy goals: balance operational usefulness, storage cost, and compliance. + +Recommended baseline retention (tunable per org policy) + +Metrics (Prometheus / TSDB) + +- High-resolution metrics (raw samples): + - Retention: 30 days at full resolution (default) +- Medium-resolution rollups: + - Retain 1m/5m aggregates for 365 days for capacity planning and trending +- Long-term aggregates: + - Retain monthly rollups for > 3 years if required for audits + +Traces + +- Full traces: + - Retain for 7 days at full fidelity +- Sampled traces: + - Retain a sampled set (1–5%) for 30 days for longer-term debugging of regressions + +Logs + +- Application logs (structured): + - Hot store (searchable): 30 days + - Warm/Cold archive (compressed): 365 days (or longer per compliance) +- Security & audit logs (admin actions, migration runs, DSR events): + - Retain for a minimum of 1 year; consider 3–7 years for legal/contractual needs +- Error tracking (Sentry): + - Retain raw events per Sentry plan; keep organization-level issues and resolution history for at least 365 days + +Event / telemetry data (analytics) + +- Raw event stream: + - Keep raw events in data lake for 90–180 days depending on cost and analytics needs +- Processed/aggregated analytics tables: + - Retain 365 days to support reporting, longer if storage and compliance allow + +Archival & deletion strategy + +- Use lifecycle policies (S3) to move logs and events to cheaper tiers (Infrequent Access, Glacier) after N days. +- Provide an archival index in the DB linking artifact ids to archived S3 paths to retrieve data for audits when needed. +- Implement deletion workflows tied to retention rules and DSR requests; log every deletion action for audit. + +Privacy & compliance notes + +- Before storing telemetry or logs that include user-identifiable information, ensure it is either necessary, + pseudonymized, or covered by explicit consent. +- For DSR requests requiring deletion, purge or mark related telemetry per policy and propagate deletions to archived + stores where feasible (or flag for legal hold if deletion not possible due to regulatory reasons). + +--- + +Next steps / implementation checklist + +- Instrument API & worker code to emit the metrics and events listed above. +- Create Grafana dashboard templates and alerting rules (Prometheus alertmanager) based on thresholds here. +- Implement structured logging and OpenTelemetry tracing in API and worker, and ensure correlation ids are propagated + into job payloads. +- Define concrete retention settings in monitoring stack and S3 lifecycle rules in infra-as-code. + +--- + +## 21. Documentation & Training + +This section lists required developer documentation, onboarding checklists, runbooks and operations documentation, +user‑facing help articles, and a training plan for support teams. The goal is to ensure engineers, SREs, QA and +community support have the information and playbooks they need to operate, troubleshoot and explain the system. 
+ +--- + +### 21.1 Developer docs & onboarding checklist + +Purpose: get a new developer productive quickly and reduce cognitive load when making changes that touch ingestion, ETL, +migrations or the profile APIs. + +Minimum docs to include (location suggestions) + +- docs/README.md — project overview, architecture summary, where to start +- docs/DEVELOPER_ONBOARDING.md — step-by-step local setup +- docs/DB_MIGRATIONS.md — migration conventions, node-pg-migrate usage, preflight checks +- docs/ETL_AND_WORKER.md — ETL design, idempotency rules, upsert patterns, error handling +- docs/DB_MODEL.md — canonical schema, ERD, table definitions and DDL snippets +- docs/CI_CD.md — CI jobs, GitHub Actions workflows, protected environments +- docs/OBSERVABILITY.md — metrics, dashboards and alert guide +- docs/DATA_PRIVACY.md — PII classification and handling rules +- docs/EXAMPLES.md — example requests, sample payloads and sample CLI usage +- docs/CHANGELOG.md — release notes and migration mapping + +Onboarding checklist (developer) + +1. Access & accounts + - Request GitHub access and membership to required repos/teams. + - Request access to required secrets manager entries (read-only where appropriate) and test credentials for staging. +2. Local environment + - Install Node.js (supported LTS), pnpm, Docker. + - Clone repo and run pnpm install. + - Copy .env.example -> .env.local and populate required values for local Postgres/Redis emulators. +3. Bootstrap DB locally + - Run ./scripts/bootstrap-db.sh (or pnpm run db:bootstrap) and verify migrations applied. + - Run database seeds and validate sample data present. +4. Run worker & API locally + - Start API server in dev mode and start a local worker; run ingest-sample.sh to process a sample snapshot. + - Verify user_profile_summary exists and API GET /api/v1/profile/summary/:namecode returns expected output. +5. Tests & CI + - Run unit and integration tests locally (pnpm test); ensure familiarity with testcontainers usage. +6. Observability & debugging + - Learn to read logs, use X-Request-Id to correlate flows and run basic Prometheus queries against local dev + metrics (if available). +7. PR workflow + - Follow GitHub branching, PR, code review and CI requirements; ensure migrations include up/down where feasible. + +Developer docs helpful additions + +- Quick troubleshooting FAQ (common errors and resolutions). +- Common SQL snippets for debugging (e.g., find latest snapshot, reprocess a snapshot). +- Packaging & release notes template for PRs that include migration changes. + +--- + +### 21.2 Runbooks & operations docs required + +Runbooks should be short, actionable and kept in docs/OP_RUNBOOKS/*. Each must list prerequisites, exact commands, +expected outputs and "when to escalate". + +Priority runbooks (minimum) + +- INCIDENT_RESPONSE.md — triage steps, create incident channel, initial checks, containment and mitigation. +- DB_RESTORE.md — step-by-step restore from provider snapshot or pg_restore with verification queries. +- APPLY_MIGRATIONS.md — preflight checklist, how to trigger the protected db-bootstrap GitHub Action, required approvals + and post-checks. +- SCALING_UP.md — how to scale workers/API, safe concurrency increments, DB connection considerations. +- REPROCESS_SNAPSHOT.md — how to re-enqueue a snapshot (API & manual DB option) and verify results. +- SECRET_COMPROMISE.md — immediate revocation/rotation steps, who to notify, short-term containment. 
- BACKUP_DRILL.md — how to run and verify a restore drill; checklist for sign-off.
- COST_SPIKE.md — identify cost drivers, throttle jobs and communicate with Finance.
- MAINTENANCE_WINDOW.md — how to schedule, communicate and run maintenance with a rollback plan.

Formatting & maintenance

- Each runbook: Purpose, Preconditions, Step-by-step commands (copyable), Verification queries, Post-action steps,
  Contacts (owners & backup), Audit logging requirements.
- Store runbooks in Markdown with code examples and links to related docs.
- Review cadence: runbooks are reviewed quarterly and after each major incident.

Runbook automation

- Include scripts or helper CLI commands where safe (e.g., enqueue-reprocess.sh) and ensure they respect RBAC and
  require approver confirmation for production operations.

---

### 21.3 User-facing documentation / help articles

Audience: end users (players), community managers and bot operators. Docs should be accessible, concise and linked from
Discord bot messages and the project site.

Essential articles (docs/USER_DOCS/*)

- Quick Start: How to fetch your profile
  - Steps for a NameCode fetch using the CLI, web UI or bot command.
  - Explanation of interactive (--login) vs NameCode fetch; privacy notes.
- Understanding Profile Summary
  - What the summary shows (level, top troops, pet), freshness and how to request a refresh.
- How to use the Discord bot
  - List of slash commands and examples (/profile <namecode>), permissions and rate limits.
- Privacy & Data Handling for players
  - What is stored, retention windows, how to request deletion (DSR flow) and contact.
- Troubleshooting & FAQ
  - Common errors: "Profile pending", "No profile found", rate-limited upstream, and what to do.
- Developer-facing: CLI usage
  - Detailed docs for get_hero_profile.sh, ingest options, Idempotency-Key usage, extracting the NameCode with jq.
- Community admin guide
  - How guild leaders can view guild-level reports, request bulk fetches, and best practices for coordinating
    community snapshot runs.
- Change log & release notes (user friendly)
  - Short, clear notes about feature additions and any user-impacting maintenance.

Delivery & discoverability

- Host docs on a docs site (GitHub Pages or a small static site) and link them from bot embeds and the repository
  README.
- Embed short help text in bot replies (with a link to fuller docs).
- Keep a short status / known-issues / planned-maintenance page for the community.

Guidelines for user docs

- Use plain language and examples.
- Make privacy implications explicit for login flows.
- Provide expected ETA guidance for processing and tips to reduce failures (e.g., run the CLI from a stable
  connection).

---

### 21.4 Training plan for support teams

Audience: Community moderators, Bot operators, Support reps and first-line troubleshooters.

Goals

- Enable support staff to triage common user questions, interpret basic metrics, perform safe non-destructive recovery
  steps, and escalate incidents properly.

Training components

1. Training materials
  - Slide deck: "StarForge Profiles: Architecture & Troubleshooting" covering the end-to-end flow, common failure
    points, and tools.
+ - Short video walkthroughs (5–10 min) for: + - How to fetch profile (user perspective) + - How to read the Admin ETL Dashboard + - How to reprocess a snapshot (admin flow) + - How to interpret status messages and error codes + - One‑page cheat sheets: + - Quick triage steps (first 5 checks) + - Error codes & recommended actions + - Contact & escalation list (PagerDuty/Slack/Email) + +2. Hands-on sessions + - Run a live 60–90 minute training for initial cohort: demo ingestion, show raw snapshot view, reprocess flow and + run a simulated incident drill. + - Provide a sandbox environment where support reps can practice reprocessing and viewing snapshots safely. + +3. Knowledge base & playbooks + - Support playbook: step-by-step for common cases (profile pending, snapshot malformed, user asks to delete data). + - FAQ maintained in docs/USER_DOCS/FAQ.md with contributor access for community managers. + +4. Certification & sign-off + - A lightweight quiz or checklist to verify competency (e.g., 10-question quiz + practical task: reprocess snapshot + in staging). + - Maintain a list of trained and certified support reps. + +5. Ongoing refresh + - Monthly 30-min brown-bag updates for new features, major incidents and process changes. + - Immediate ad-hoc training for significant process changes (migrations that affect admin workflows). + +Support escalation & SLA + +- Define support Tier 1 responsibilities: user communication, initial triage, run known non-invasive fixes (request + snapshot fetch, retry). +- Tier 2: Engineers handle reprocessing issues, data anomalies and runbook execution. +- Ensure support team knows how to open an incident channel and whom to page for P0 incidents. + +Onboarding timeline (example) + +- Week 0: Documentation provided; basic self-study. +- Week 1: Live hands-on training + sandbox practice. +- Week 2: Shadowing: support rep observes two real incidents with engineer present. +- Week 3: Practical sign-off: support rep reprocesses a snapshot and verifies result in staging. + +Measurement & feedback + +- Track support KPIs: time-to-first-response, time-to-resolution for common issues, number of escalations to + engineering. +- Collect support feedback and update docs/runbooks quarterly based on real incidents and recurring support questions. + +--- + +## 22. Legal, Licensing & Third‑party Notices + +This section summarizes the legal and licensing considerations relevant to the Player Profile & DB Foundation project: +license choices for the project, impacts on Terms of Service and Privacy Policy, and obligations that arise from +third‑party dependencies. Use this as guidance for engineering, product, and legal teams to ensure compliance before +public releases. + +--- + +### 22.1 License considerations + +Purpose + +- Provide clear guidance on what license the project will be published under, how contributors and consumers are + affected, and how to handle mixed‑license third‑party components. + +Key questions to decide + +- What primary license will the project use? (e.g., MIT, Apache 2.0, BSD-3, GPLv3) +- Will contributors sign a Contributor License Agreement (CLA) or Developer Certificate of Origin (DCO)? +- Are there any components we cannot ship under the chosen license due to dependency compatibility? + +License selection guidance + +- Permissive (recommended for tools, SDKs, and developer utilities): + - MIT or BSD-3: minimal obligations, broad adoption, simple attribution requirements. 
+ - Apache 2.0: permissive + explicit patent grant; recommended if patent concerns exist. +- Copyleft (use with care): + - GPLv2/GPLv3: enforces distribution of derivative source; avoid if you want to allow proprietary downstream use. + - LGPL: weaker copyleft for libraries; still has obligations. +- Commercial / dual-licensing: + - Consider only if organization plans to enforce paid licensing. + +Recommended default + +- For this project, choose Apache 2.0 or MIT: + - Apache 2.0 if you want explicit patent protections and stronger contributor grant language. + - MIT if you prefer the simplest permissive terms. +- Record the license in a top-level LICENSE file and include SPDX identifier (e.g., "Apache-2.0", "MIT"). + +Contributor policy + +- Use a DCO or lightweight CLA to ensure contribution ownership and grant of rights: + - DCO is simpler (contributors sign-off on commits). + - CLA provides stronger legal certainty for corporate contributors. +- Document contribution process in CONTRIBUTING.md, requiring sign-off and specifying license acceptance. + +Third‑party license compatibility + +- Audit all direct and transitive dependencies for license compatibility with the chosen project license. +- Watch for viral/copy-left licenses (GPL family) that may impose distribution obligations. +- If a dependency is GPL and used in a way that triggers distribution (e.g., linked into a distributed binary), consult + legal before including. + +Attribution & NOTICE file + +- For Apache 2.0 projects, maintain a NOTICE file with required attributions for included third‑party components. +- For any license that requires attribution, include the attribution block in docs or README and ensure it is packaged + with releases. + +Binary / Release packaging + +- When shipping compiled artifacts or containers, include: + - LICENSE (project license) + - Third‑party licenses (LICENSES-THIRD-PARTY.txt) + - NOTICE (if required) + - Source link (if required by license) or instructions on how to obtain source + +License scanning & automation + +- Integrate license scanning in CI: + - Tools: FOSSA, licensee, scancode-toolkit, OSS Review Toolkit, or GitHub's dependency graph/license detection. +- CI should fail or require manual review on detection of disallowed licenses. +- Record license approvals and exceptions in a dependency inventory (docs/THIRD_PARTY.md). + +Export controls & cryptography + +- If using cryptography (e.g., pgcrypto, client-side encryption), verify export control obligations in your + jurisdictions. +- Include notice if the project contains cryptographic code that may be subject to export/import restrictions. + +Practical checklist + +- [ ] Choose project license and add LICENSE file (with SPDX identifier). +- [ ] Add CONTRIBUTING.md with sign-off policy (DCO/CLA). +- [ ] Add third‑party license aggregation file (LICENSES-THIRD-PARTY.txt). +- [ ] Enable license scanning in CI and define a policy for disallowed licenses. +- [ ] Ensure NOTICE file present if using Apache 2.0 and third‑party components require it. + +--- + +### 22.2 Terms of Service / Privacy policy impacts + +Purpose + +- Capture the product‑level legal impacts that arise from accepting, storing and processing user snapshots and from + public access to profile data; prepare language for Terms of Service (ToS) and Privacy Policy. + +Privacy & data processing considerations + +- Data processed: + - Raw snapshots potentially include personal data (emails, real names, device identifiers, tokens). 
Export controls & cryptography

- If using cryptography (e.g., pgcrypto, client-side encryption), verify export control obligations in the relevant
  jurisdictions.
- Include a notice if the project contains cryptographic code that may be subject to export/import restrictions.

Practical checklist

- [ ] Choose the project license and add a LICENSE file (with SPDX identifier).
- [ ] Add CONTRIBUTING.md with a sign-off policy (DCO/CLA).
- [ ] Add a third‑party license aggregation file (LICENSES-THIRD-PARTY.txt).
- [ ] Enable license scanning in CI and define a policy for disallowed licenses.
- [ ] Ensure a NOTICE file is present if using Apache 2.0 and third‑party components require it.

---

### 22.2 Terms of Service / Privacy policy impacts

Purpose

- Capture the product‑level legal impacts that arise from accepting, storing and processing user snapshots and from
  public access to profile data; prepare language for the Terms of Service (ToS) and Privacy Policy.

Privacy & data processing considerations

- Data processed:
  - Raw snapshots potentially include personal data (emails, real names, device identifiers, tokens). Audit sample
    payloads to enumerate PII fields.
- Lawful basis:
  - Document the lawful basis for processing (consent, legitimate interest) depending on product design and
    jurisdiction.
  - For data captured during login flows, explicit user consent is recommended and should be built into the UI/CLI
    flow.
- Retention:
  - Encode the retention policy in the ToS/Privacy Policy (e.g., snapshots retained 90 days by default; the last N
    snapshots kept per user).
  - Communicate retention periods and archival behavior to users.
- Data subject rights:
  - Describe procedures for access, correction, portability, and deletion (DSR). Indicate expected SLAs (e.g., 30
    days).
  - Provide an easy way for users to request deletion — record requests and audit actions.
- Third‑party disclosures:
  - If you send data to analytics, monitoring or backup providers, list those categories of processors and link to
    DPAs where applicable.
- Children’s data:
  - If the product may be used by minors, comply with COPPA and local laws; consider blocking minors or giving their
    data special treatment.

Terms of Service (ToS) impacts

- Usage rules:
  - Define acceptable use (no abuse or rate‑limit circumvention, no scraping of other users' data).
  - State that users must not submit others' credentials or private data without consent.
- Liability & disclaimers:
  - Disclaim the accuracy of third‑party data and limit liability for damages arising from use of aggregated profile
    data.
- Intellectual property:
  - Clarify ownership of snapshots uploaded by users (the user retains rights but grants the service limited rights to
    store/process/display them).
  - Include a license grant from the user to the service to process snapshots for the features described.
- Termination & account handling:
  - Describe the consequences of ToS violations (removal of profiles, account suspension) and data retention after
    termination.

Consent flows & UX

- For login-based ingestion (where credentials or tokens are used locally), ensure:
  - Users explicitly consent before uploading snapshots to the service.
  - Clear on-screen/CLI notices explain what will be uploaded and how long it is retained.
  - The CLI never persists credentials unless explicitly requested, and then only encrypted — document best practices.

Cross-border / data residency

- If serving EU users or subject to EU hosting requirements:
  - State data transfer practices in the Privacy Policy.
  - Consider limiting storage/processing to EU regions for EU users, or provide opt‑in/opt‑out options.
  - Execute DPAs with cloud providers where required.

Security responsibilities

- Communicate security measures in high-level terms (encryption in transit and at rest, access controls) in the
  Privacy Policy.
- Avoid including implementation details that could aid attackers (e.g., exact rotation schedules).

Incident disclosure & breach notification

- Define breach notification timelines (e.g., notify affected users and supervisory authorities within statutory
  deadlines — under GDPR, 72 hours where applicable).
- Document how users will be informed (email, status page, direct contact) and whom to contact with questions.

Practical checklist for ToS/Privacy updates

- [ ] Map PII fields and decide what is stored vs. redacted.
- [ ] Draft the Privacy Policy section covering snapshots, retention, the DSR process and data transfers.
- [ ] Add ToS clauses for user uploads, acceptable use, IP grants and liability limits.
- [ ] Ensure the CLI and UI show concise consent notices for uploading snapshots.
- [ ] Prepare a template DSR response and deletion verification logs (see the sketch below).
- [ ] Obtain Legal sign-off and publish policies with version and date.
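As a concrete illustration of deletion verification, a minimal erasure flow might look like the following sketch.
The `users` and `hero_snapshots` tables come from this PRD's data model; the `dsr_requests` audit table and all
column names are hypothetical placeholders, to be adapted to the canonical DDL in docs/DB_MODEL.md.

```bash
#!/usr/bin/env bash
# Sketch: erase a user's data for a DSR request while keeping an audit trail.
# `users` and `hero_snapshots` are from this PRD's data model; `dsr_requests`
# and all column names are hypothetical. Align with docs/DB_MODEL.md.
set -euo pipefail

USER_ID="$1"  # UUID of the user being erased; validate before interpolating

psql "$DATABASE_URL" <<SQL
BEGIN;
-- Record the request before destroying data, so the audit trail survives.
INSERT INTO dsr_requests (user_id, request_type, processed_at)
VALUES ('${USER_ID}', 'erasure', now());

-- Remove raw snapshots first (largest PII surface), then the user row.
DELETE FROM hero_snapshots WHERE user_id = '${USER_ID}';
DELETE FROM users WHERE id = '${USER_ID}';
COMMIT;
SQL
```

Remember that snapshots archived to S3 under the retention policy must be purged separately for the deletion to be
complete.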
---

### 22.3 Third-party dependency licenses and obligations

Purpose

- Outline the obligations arising from third‑party libraries, services and tools used in the project, and how to
  comply with their license terms and contractual obligations.

Types of third‑party items

- Open-source libraries (npm packages, build tooling)
- System libraries and DB extensions (e.g., pgcrypto)
- Hosted services (Supabase, Redis provider, S3, Sentry, Grafana Cloud)
- Vendor SDKs (Google Cloud, AWS SDKs)
- Fonts, images, and UI assets (may carry separate license obligations)

Common license obligations and practical steps

- MIT / BSD / Apache 2.0:
  - Typically require inclusion of the copyright and license text; Apache 2.0 requires NOTICE attribution for some
    components.
  - Action: include each dependency’s license block in LICENSES-THIRD-PARTY.txt and record its package name and
    version.
- LGPL:
  - If used in library form, ensure you meet the requirements for relinking or providing source for the LGPL'd library
    when distributing compiled binaries.
  - Action: avoid linking proprietary code in ways that would trigger LGPL obligations without legal review.
- GPL family:
  - Copyleft ("viral") licenses: these can require distribution of derivative source code under the GPL if a GPL
    component is combined/linked into distributed binaries.
  - Action: avoid GPL libraries for server-side components unless you accept the obligations; consult legal for any
    transitive GPL.
- Commercial SDKs / APIs:
  - Adhere to their terms of service: usage limits, attribution, branding constraints and paid-plan obligations.
  - Action: store and review the ToS for each vendor and ensure quotas/monitoring are in place.

Obligations for hosted services

- Data processing agreements (DPA):
  - For processors handling personal data (Supabase, S3, Sentry), ensure a DPA exists and is accessible in contract
    records.
- Security & compliance:
  - Some providers require specific configurations to maintain compliance (e.g., encryption settings, region
    settings).
  - Action: record a provider hardening checklist and evidence.

Attribution and bundling

- When distributing binaries/containers or publishing the project:
  - Include all required license texts and attributions in the distribution.
  - Provide a clear third‑party license notice file in the repository and release artifacts.

Practical steps for dependency compliance

- Maintain a dependency inventory (see the sketch after this list):
  - Tool-driven: `npm ls --json`, `yarn licenses`, scancode-toolkit, or a dedicated SBOM tool.
  - Record: package name, version, license, license URL, author, and any special obligations.
- Automate detection:
  - CI integration: fail builds on unknown or disallowed license types, or flag them for human review.
- Handle exceptions:
  - If a dependency is required but its license is incompatible, evaluate:
    - Replacing it with an alternative library.
    - Isolating its use so distribution obligations are not triggered.
    - Seeking legal approval and documenting an exception.
- Keep up to date:
  - Track security advisories and licensing changes for dependencies; update dependencies and re-scan regularly.
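A minimal sketch for generating these inventory artifacts, assuming the npm package `license-checker` (the same tool
used in the CI gate sketch in 22.1) and a reasonably recent npm:

```bash
#!/usr/bin/env bash
# Sketch: generate dependency inventory artifacts for release records.
# Assumes `license-checker` (npm); `npm sbom` requires npm >= 9.5.
set -euo pipefail

# CSV inventory: package, version, license, repository, publisher.
npx license-checker --production --csv --out DEPENDENCY_INVENTORY.csv

# Full dependency tree as JSON, useful for diffing between releases.
npm ls --all --json > deps.json

# SPDX software bill of materials (newer npm only).
npm sbom --sbom-format spdx > sbom.spdx.json
```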
Templates and artifacts to maintain

- LICENSE (project license)
- LICENSES-THIRD-PARTY.txt (aggregate of all bundled licenses)
- THIRD_PARTY_NOTICES.md (short list of critical dependencies and their required attributions)
- DEPENDENCY_INVENTORY.csv or an SBOM (software bill of materials)
- DPA and vendor contract references in a secure legal drive

Audit readiness & periodic review

- Schedule periodic audits:
  - Quarterly: dependency license scan + security vulnerability scan.
  - Before GA: full license audit and a list of third‑party obligations to satisfy packaging/release requirements.
- Keep audit logs and approvals for any license exceptions.

---

### Final checklist (legal readiness)

- [ ] Project license chosen and LICENSE file added.
- [ ] Contributor policy (DCO/CLA) chosen and CONTRIBUTING.md updated.
- [ ] Third‑party license inventory created and included with releases.
- [ ] NOTICE file (if required) assembled and included.
- [ ] Privacy Policy and Terms of Service updated to reflect snapshot processing, retention and DSR flows.
- [ ] DPAs signed with cloud vendors processing personal data, if applicable.
- [ ] License scanning configured in CI and exceptions tracked.
- [ ] Legal sign-off obtained prior to public GA release.

---

## 23. Appendices

This appendix collects supporting material, definitions, diagrams and references useful when implementing, operating
and auditing the Player Profile & DB Foundation project. Link targets point to files and folders in the repository; if
any are missing, create the referenced file or update the links.

---

### 23.1 Glossary & abbreviations

- API — Application Programming Interface.
- ETL — Extract, Transform, Load. The background processing that turns raw snapshots into normalized rows.
- P0/P1/P2 — Priority/Severity levels used for incidents (P0 = critical).
- DSR — Data Subject Request (a request under GDPR/CCPA to access or delete personal data).
- DPA — Data Processing Agreement.
- DB — Database (Postgres in this project).
- JSONB — PostgreSQL JSON binary column type.
- GIN — Generalized Inverted Index (Postgres index type used for JSONB columns).
- TTL — Time To Live (e.g., cache lifetime).
- SLA — Service Level Agreement.
- SLO — Service Level Objective.
- RTO — Recovery Time Objective.
- RPO — Recovery Point Objective.
- PITR — Point-In-Time Recovery.
- RBAC — Role-Based Access Control.
- CI / CD — Continuous Integration / Continuous Delivery.
- GHCR — GitHub Container Registry.
- DCO / CLA — Developer Certificate of Origin / Contributor License Agreement.
- p95 / p99 — 95th / 99th percentile latency.
- S3 — Object storage API (AWS S3 or an S3-compatible service).
- OIDC — OpenID Connect.
- PCI / SOC2 — Compliance standards (Payment Card Industry / Service Organization Controls).
- PID — Process Identifier (for workers) or Product ID; state the intended meaning wherever the term is used.
- Idempotency-Key — HTTP header used to make POST operations idempotent.
- NameCode — user-identifying code used in the upstream game (example: COCORIDER_JQGB).

---

### 23.2 Reference documents / links

Core repository documents:

- DB model & ERD: docs/DB_MODEL.md — canonical DDL, ER diagrams and table descriptions.
- ETL design & worker contract: docs/ETL_AND_WORKER.md
- Migration conventions: docs/MIGRATIONS.md
- Observability & alerts: docs/OBSERVABILITY.md
- Data privacy & DSR: docs/DATA_PRIVACY.md
- CI/CD and deployment: docs/CI_CD.md
- Runbooks (operations): docs/OP_RUNBOOKS/ (directory with INCIDENT_RESPONSE.md, DB_RESTORE.md, etc.)
- Change log: docs/CHANGELOG.md

External reference links:

- Discord Developer Docs: https://discord.com/developers/docs/intro
- Postgres Documentation (jsonb, GIN, partitioning): https://www.postgresql.org/docs/
- OpenTelemetry: https://opentelemetry.io/
- Prometheus: https://prometheus.io/
- Grafana: https://grafana.com/
- OWASP Top 10: https://owasp.org/www-project-top-ten/

Repository pointers (replace with actual repo links if needed):

- Project repo root: https://github.com/CorentynDevPro/StarForge
- Examples folder: https://github.com/CorentynDevPro/StarForge/tree/main/docs/examples
- PRD main file: docs/PRD.md — this document (project requirements)

---

### 23.3 Example payloads (e.g., hero profile JSON) — link to examples folder

Canonical example payloads and sample test fixtures live in the examples folder. Use these as fixtures for unit,
integration and E2E tests, and for dry runs of backfills.

Repository examples folder:

- docs/examples/
  - get_hero_profile_sample_small.json
  - get_hero_profile_sample_large.json
  - get_hero_profile_malformed_example.json
  - ingest-sample.sh (helper script)

  Link: https://github.com/CorentynDevPro/StarForge/tree/main/docs/examples

Notes:

- All example payloads must be synthetic or scrubbed of real PII; a scrubbing sketch follows below. Do not commit real
  user credentials or tokens.
- Keep a short README in the examples folder describing each fixture, expected test outcomes and any special parsing
  caveats.
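As one way to satisfy the scrubbing rule, the sketch below derives a committable fixture from a locally fetched
snapshot using `jq`. The field names (`auth_token`, `email`, `real_name`, `name_code`) are hypothetical examples, not
the real payload schema; enumerate the actual PII fields first (see docs/DATA_PRIVACY.md).

```bash
#!/usr/bin/env bash
# Sketch: derive a committable fixture from a real snapshot by scrubbing PII.
# All field names below are hypothetical placeholders; map them to the PII
# fields enumerated in docs/DATA_PRIVACY.md before relying on this.
set -euo pipefail

IN="get_hero_profile_raw.json"            # local fetch result; never commit
OUT="get_hero_profile_sample_small.json"  # safe, scrubbed fixture

jq 'del(.auth_token)
    | .email = "player@example.com"
    | .real_name = "Sample Player"
    | .name_code = "SAMPLE_0000"' "$IN" > "$OUT"
```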
---

### 23.4 ER diagrams / sequence diagrams

Canonical diagrams for the architecture, data model and core flows are stored under docs/diagrams/ (or docs/ as noted
below). Use these for onboarding, design reviews and runbooks.

Suggested diagram files (placeholders):

- docs/ERD.svg — high-level Entity Relationship Diagram (visual).
- docs/architecture/ingest_sequence.svg — sequence diagram: client -> API -> hero_snapshots -> enqueue -> worker ->
  normalized tables.
- docs/architecture/system_component_diagram.svg — components and dataflow (API, Worker, Redis/Queue, Postgres, S3,
  Discord).
- docs/diagrams/admin_workflow.svg — admin reprocess / retention / migration workflow.

Repository pointers:

- ERD & diagrams folder: https://github.com/CorentynDevPro/StarForge/tree/main/docs/diagrams
- If diagrams are not yet present, export SVG/PNG from your design tools (Figma / draw.io / diagrams.net), add them to
  docs/diagrams/, then reference the files above.

Best practices:

- Keep source files (draw.io / Figma links) in addition to exported images for easy edits.
- Add a short caption to each diagram explaining the intended audience and version (diagram versioning helps track
  schema changes).

---

### 23.5 Relevant meeting notes / decisions

Keep a curated record of architecture decisions and meeting notes in a single place for traceability. Suggested
locations:

- docs/DECISIONS.md — Architecture Decision Records (ADRs) listing decisions, rationale, consequences and date/owner.
- docs/MEETINGS/ — directory with meeting notes (YYYY-MM-DD_topic.md). Example files:
  - docs/MEETINGS/2025-10-01_architecture_kickoff.md
  - docs/MEETINGS/2025-10-15_migration_plan_review.md
  - docs/MEETINGS/2025-11-01_security_review.md

Meeting notes template (example)

- Date: YYYY-MM-DD
- Attendees: list
- Purpose / agenda:
- Decisions made:
- Action items (owner, due date)
- Links to related PRs / tickets / docs

Why this matters:

- ADRs and meeting notes provide context for future engineers, support legal/compliance audits, and document
  trade-offs (e.g., choice of queue, retention defaults, extension fallbacks).

---

### 23.6 Change log (pointer to docs/CHANGELOG.md)

Record release notes, migration mappings and notable changes in a CHANGELOG so that operations and users understand
what changed and why.

Location:

- docs/CHANGELOG.md — canonical changelog for releases and major infra/migration events

  Link: https://github.com/CorentynDevPro/StarForge/blob/main/docs/CHANGELOG.md

Changelog guidance:

- Follow the "Keep a Changelog" format: date-stamped entries, an Unreleased section, and per-release notes (Added,
  Changed, Deprecated, Removed, Fixed, Security).
- For migration-related releases, include:
  - Migration IDs applied
  - The backup snapshot ID used before the migration
  - Links to any runbooks for rollback or verification
- Record who approved the release and link to the release ticket/PR for traceability.

---

## 24. Approvals

- Product Owner: **_Star Tiflette_**
- Engineering Lead: name / date
- Security Review: name / date
- QA: name / date
- Operations: name / date

---