[Feature Request]: Implement 4-Database Architecture for scaling and separation of METRICS, LOGS and OBSERVABILITY data #1952

Overview

Split the 59 application tables across up to 4 specialized PostgreSQL instances. The three secondary databases are optional - by default everything runs in the single core database, and you can progressively offload workloads as needed.

| Database | Purpose | Tables | Write Volume | Optional |
|----------|---------|--------|--------------|----------|
| DATABASE_URL | Core OLTP | 36 (or all 59) | Low-Medium | Required |
| METRICS_DATABASE_URL | Execution metrics | 14 | Very High | Yes |
| LOGS_DATABASE_URL | Audit & compliance | 4 | High | Yes |
| OBSERVABILITY_DATABASE_URL | Distributed tracing | 5 | Very High | Yes |

Deployment Scenarios

| Scenario | Configuration | Tables in CORE |
|----------|---------------|----------------|
| Simple (default) | Only DATABASE_URL | All 59 |
| Offload Metrics | + METRICS_DATABASE_URL | 45 |
| Offload Logs | + LOGS_DATABASE_URL | 55 |
| Offload Observability | + OBSERVABILITY_DATABASE_URL | 54 |
| Offload Metrics + Logs | + both | 41 |
| Full Split | All 4 URLs | 36 |

Architecture Diagram

                                             ┌─────────────────┐
                                        ┌───►│ postgres:5432   │ (CORE)
                                        │    │ database: mcp   │
┌─────────────┐    ┌────────────────┐   │    └─────────────────┘
│   Gateway   │───►│   PgBouncer    │───┤    ┌─────────────────┐
│  Workers    │    │     :6432      │   ├───►│ metrics-pg:5432 │ (METRICS) [optional]
└─────────────┘    │                │   │    │ db: mcp_metrics │
                   │ Routes by DB   │   │    └─────────────────┘
                   │ name in URL    │   │    ┌─────────────────┐
                   └────────────────┘   ├───►│ logs-pg:5432    │ (LOGS) [optional]
                                        │    │ db: mcp_logs    │
                                        │    └─────────────────┘
                                        │    ┌─────────────────┐
                                        └───►│ obs-pg:5432     │ (OBSERVABILITY) [optional]
                                             │ db: mcp_obs     │
                                             └─────────────────┘

PgBouncer Configuration

The edoburu/pgbouncer image supports multiple databases via the DATABASE_URLS environment variable (comma-separated).

Minimal Setup (Single Database)

# docker-compose.yml - Everything in one database
pgbouncer:
  image: edoburu/pgbouncer:latest
  environment:
    - DATABASE_URL=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp
    - LISTEN_PORT=6432
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=5000
    - DEFAULT_POOL_SIZE=600
    - MAX_DB_CONNECTIONS=700

Full Split (4 Databases)

# docker-compose.yml - All databases through single PgBouncer
pgbouncer:
  image: edoburu/pgbouncer:latest
  environment:
    # Multiple databases via comma-separated URLs
    - DATABASE_URLS=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp,postgres://postgres:${POSTGRES_PASSWORD}@metrics-pg:5432/mcp_metrics,postgres://postgres:${POSTGRES_PASSWORD}@logs-pg:5432/mcp_logs,postgres://postgres:${POSTGRES_PASSWORD}@obs-pg:5432/mcp_obs
    - LISTEN_PORT=6432
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=5000
    - DEFAULT_POOL_SIZE=600
    - MAX_DB_CONNECTIONS=700
    - MIN_POOL_SIZE=100
    - RESERVE_POOL_SIZE=150
    - SERVER_RESET_QUERY=DISCARD ALL
    - SERVER_RESET_QUERY_ALWAYS=1
  depends_on:
    postgres:
      condition: service_healthy
    metrics-pg:
      condition: service_healthy
    logs-pg:
      condition: service_healthy
    obs-pg:
      condition: service_healthy

This generates the following PgBouncer config:

[databases]
mcp = host=postgres port=5432 auth_user=postgres
mcp_metrics = host=metrics-pg port=5432 auth_user=postgres
mcp_logs = host=logs-pg port=5432 auth_user=postgres
mcp_obs = host=obs-pg port=5432 auth_user=postgres

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 600
max_db_connections = 700
...

Gateway Configuration

All URLs Through PgBouncer

# Gateway environment - all connections pooled
DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_metrics
LOGS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_logs
OBSERVABILITY_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_obs

Fallback Behavior

When optional URLs are not set, the application uses DATABASE_URL for those tables:

# Minimal - everything in one DB (current behavior, fully backward compatible)
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp

# Partial offload - only metrics separate
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_metrics
# LOGS_DATABASE_URL not set → uses DATABASE_URL
# OBSERVABILITY_DATABASE_URL not set → uses DATABASE_URL

# Offload metrics and logs, keep observability in core
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_metrics
LOGS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_logs
# OBSERVABILITY_DATABASE_URL not set → uses DATABASE_URL

Database 1: CORE (DATABASE_URL) - Required

Purpose: Transactional data requiring ACID consistency and complex joins.

Tables (36)

ENTITIES (9)
├── tools
├── servers
├── resources
├── prompts
├── gateways
├── a2a_agents
├── grpc_services
├── llm_models
└── llm_providers

AUTH/IDENTITY (15)
├── email_users
├── email_teams
├── email_team_members
├── email_team_member_history
├── email_team_invitations
├── email_team_join_requests
├── email_api_tokens
├── email_auth_events          # Auth events stay in CORE (not LOGS)
├── roles
├── user_roles
├── pending_user_approvals
├── token_revocations
├── sso_providers
├── sso_auth_sessions
└── registered_oauth_clients

OAUTH (2)
├── oauth_states
└── oauth_tokens

ASSOCIATIONS (4)
├── server_tool_association
├── server_resource_association
├── server_prompt_association
└── server_a2a_association

OTHER (6)
├── mcp_sessions
├── mcp_messages
├── resource_subscriptions
├── global_config
├── alembic_version
└── toolops_test_cases

PostgreSQL Tuning

postgres:
  image: postgres:18
  command:
    - "-c"
    - "max_connections=800"
    - "-c"
    - "shared_buffers=512MB"
    - "-c"
    - "synchronous_commit=on"          # Data integrity required
    - "-c"
    - "wal_level=replica"              # Support replication
    - "-c"
    - "checkpoint_completion_target=0.9"
    - "-c"
    - "random_page_cost=1.1"           # SSD optimized
  deploy:
    resources:
      limits:
        cpus: '4'
        memory: 8G

Database 2: METRICS (METRICS_DATABASE_URL) - Optional

Purpose: High-volume execution metrics with time-based aggregations.

Tables (14)

ENTITY METRICS - Raw (5) - denormalized with entity name
├── tool_metrics          → tool_id + tool_name (soft reference, no FK)
├── server_metrics        → server_id + server_name (soft reference, no FK)
├── resource_metrics      → resource_id + resource_name (soft reference, no FK)
├── prompt_metrics        → prompt_id + prompt_name (soft reference, no FK)
└── a2a_agent_metrics     → a2a_agent_id + agent_name (soft reference, no FK)

ENTITY METRICS - Hourly Rollups (5) - denormalized, no FK
├── tool_metrics_hourly          (tool_id + tool_name, no FK)
├── server_metrics_hourly        (server_id + server_name, no FK)
├── resource_metrics_hourly      (resource_id + resource_name, no FK)
├── prompt_metrics_hourly        (prompt_id + prompt_name, no FK)
└── a2a_agent_metrics_hourly     (a2a_agent_id + agent_name, no FK)

PERFORMANCE (3)
├── performance_metrics
├── performance_snapshots
└── performance_aggregates

LLM USAGE (1)
└── token_usage_logs

FK Strategy

Both raw and hourly metrics tables require FK removal for the database split:

Raw metrics tables use soft references (VARCHAR ID without FK constraint):

  • Allows entity deletion without CASCADE
  • Metrics history preserved even if entity removed
  • No cross-database FK needed
  • NEW: Include denormalized {entity}_name column captured at write time

Hourly rollup tables also require FK removal:

  • Currently have nullable FK to entity tables (with ON DELETE SET NULL)
  • For split: Remove FK constraint, keep soft reference (ID as VARCHAR)
  • Already have denormalized {entity}_name column - no schema change needed
  • Self-sufficient for reporting without joins to CORE database
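
A minimal Alembic sketch of these changes for one entity (the constraint names here are assumptions - take the real ones from your schema or an autogenerate run; repeat per table):

# Sketch only: constraint names below are assumptions, not taken
# from the actual schema - verify before applying.
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Raw metrics: drop the FK and add the denormalized name column
    op.drop_constraint("tool_metrics_tool_id_fkey", "tool_metrics", type_="foreignkey")
    op.add_column("tool_metrics", sa.Column("tool_name", sa.String(255), nullable=True))
    # Backfill from the entity table while both still live in one database
    op.execute(
        "UPDATE tool_metrics m SET tool_name = t.name "
        "FROM tools t WHERE m.tool_id = t.id"
    )
    op.execute("UPDATE tool_metrics SET tool_name = 'unknown' WHERE tool_name IS NULL")
    op.alter_column("tool_metrics", "tool_name", existing_type=sa.String(255), nullable=False)
    # Hourly rollups: only the FK goes (name column already exists)
    op.drop_constraint("tool_metrics_hourly_tool_id_fkey", "tool_metrics_hourly", type_="foreignkey")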

Entity Name Capture (Denormalization Strategy)

When is entity_name captured?

The entity name is captured at write time when recording a metric, not during rollup:

# In tool_service.py when recording a metric (using dependency injection)
async def record_tool_execution(
    self,
    tool_id: str,
    response_time: float,
    success: bool,
    db: Session = Depends(get_db),              # CORE DB for tool lookup
    metrics_db: Session = Depends(get_metrics_db)  # METRICS DB for write
):
    # Lookup tool name from CORE DB (or cache)
    tool = db.query(Tool).filter(Tool.id == tool_id).first()
    tool_name = tool.name if tool else "unknown"

    # Write to METRICS DB with denormalized name
    metric = ToolMetric(
        tool_id=tool_id,
        tool_name=tool_name,  # Captured at write time
        timestamp=datetime.now(UTC),
        response_time=response_time,
        is_success=success,
    )
    metrics_db.add(metric)
    metrics_db.commit()

Schema change required:

class ToolMetric(Base):
    __tablename__ = "tool_metrics"
    tool_id: Mapped[str] = mapped_column(String(36), nullable=False, index=True)
    tool_name: Mapped[str] = mapped_column(String(255), nullable=False)  # NEW
    timestamp: Mapped[datetime] = ...

How are entity renames handled?

  • Raw metrics: Capture name at write time. Historical metrics show the name at time of execution.
  • Hourly rollups: Inherit name from raw metrics. Shows name at time each metric was recorded.
  • Dashboard queries: Use denormalized {entity}_name column (e.g., tool_name, server_name) directly - no join to CORE needed.
  • Trade-off: After rename, historical data shows old name. This is intentional for audit accuracy.

If you need current names in reports, add a lookup cache:

# Cache entity_id → current_name mapping, refresh every 5 min
ENTITY_NAME_CACHE: Dict[str, str] = {}
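
A slightly fuller sketch of that cache with TTL handling (the Tool lookup mirrors the write-time example above; names here are illustrative):

import time
from typing import Dict, Tuple

CACHE_TTL_SECONDS = 300  # refresh entries every 5 min
_name_cache: Dict[str, Tuple[str, float]] = {}  # entity_id -> (name, fetched_at)

def current_tool_name(tool_id: str, db) -> str:
    """Return the current tool name from CORE, cached with a TTL."""
    now = time.monotonic()
    hit = _name_cache.get(tool_id)
    if hit and now - hit[1] < CACHE_TTL_SECONDS:
        return hit[0]
    tool = db.query(Tool).filter(Tool.id == tool_id).first()  # CORE session
    name = tool.name if tool else "unknown"
    _name_cache[tool_id] = (name, now)
    return name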

PostgreSQL Tuning

metrics-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_metrics
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=200"
    - "-c"
    - "shared_buffers=256MB"
    - "-c"
    - "synchronous_commit=off"         # Async WAL for speed
    - "-c"
    - "wal_level=minimal"              # No replication needed
    - "-c"
    - "max_wal_senders=0"              # Disable replication
    - "-c"
    - "checkpoint_timeout=15min"       # Less frequent checkpoints
    - "-c"
    - "max_wal_size=4GB"               # Buffer more WAL
    - "-c"
    - "work_mem=32MB"                  # Larger for aggregations
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G

Partitioning (Recommended)

-- Partition raw metrics by week for easy retention management
CREATE TABLE tool_metrics (
    id SERIAL,
    tool_id VARCHAR(36) NOT NULL,
    tool_name VARCHAR(255) NOT NULL,  -- Denormalized entity name
    timestamp TIMESTAMPTZ NOT NULL,
    response_time DOUBLE PRECISION,
    is_success BOOLEAN
) PARTITION BY RANGE (timestamp);

-- Create index on tool_id for lookups
CREATE INDEX idx_tool_metrics_tool_id ON tool_metrics (tool_id);

CREATE TABLE tool_metrics_2026_w01 PARTITION OF tool_metrics
    FOR VALUES FROM ('2026-01-01') TO ('2026-01-08');

-- Drop old partitions for retention
DROP TABLE tool_metrics_2025_w40;  -- Drop data older than 90 days
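
Creating weekly partitions by hand does not scale; a scheduled maintenance statement can pre-create the next window. A sketch following the naming scheme above:

-- Sketch: pre-create next week's partition (run weekly from a scheduler).
DO $$
DECLARE
    start_day date := date_trunc('week', now() + interval '1 week')::date;
    end_day   date := start_day + 7;
    part_name text := format('tool_metrics_%s_w%s',
                             to_char(start_day, 'IYYY'), to_char(start_day, 'IW'));
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF tool_metrics
             FOR VALUES FROM (%L) TO (%L)',
        part_name, start_day, end_day);
END $$;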

Database 3: LOGS (LOGS_DATABASE_URL) - Optional

Purpose: Compliance audit trails and application logs.

Durability: Configurable via LOGS_DB_DURABLE environment variable.

  • true (default): Full durability (synchronous_commit=on) for SOC2/HIPAA compliance
  • false: Relaxed durability (synchronous_commit=off) for development/non-compliance environments

Tables (4)

AUDIT (2) - No FK
├── audit_trails              # SOC2/HIPAA compliance
└── permission_audit_log      # RBAC changes

STRUCTURED LOGS (2) - Internal FK only
├── structured_log_entries    # Application logs
└── security_events           # FK → structured_log_entries (internal)

Note: email_auth_events remains in CORE database with other auth tables.

FK Strategy

  • security_events.log_entry_id → structured_log_entries.id (internal FK, stays valid)
  • No external dependencies - entire cluster is self-contained

PostgreSQL Tuning

Choose the appropriate configuration based on your durability requirements:

Production / Compliance (default):

logs-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_logs
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=100"
    - "-c"
    - "shared_buffers=128MB"
    - "-c"
    - "synchronous_commit=on"              # Full durability
    - "-c"
    - "wal_level=replica"                  # Supports backup/replication
    - "-c"
    - "wal_compression=on"
    - "-c"
    - "autovacuum_vacuum_scale_factor=0.1"
    - "-c"
    - "checkpoint_completion_target=0.9"
  deploy:
    resources:
      limits:
        cpus: '1'
        memory: 2G

Development / Non-compliance (set LOGS_DB_DURABLE=false):

logs-pg:
  command:
    # ... same base settings, but with relaxed durability:
    - "-c"
    - "synchronous_commit=off"             # Faster writes, risk of loss
    - "-c"
    - "wal_level=minimal"                  # No replication support

Note: LOGS_DB_DURABLE is an application-level setting. When set to false, the application accepts that audit data may be lost on crash. The PostgreSQL container must be configured to match (use the development config above).
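
Since synchronous_commit is a session-settable GUC, one way the application could honor the flag is per connection; a sketch (note that with PgBouncer in transaction pooling mode and SERVER_RESET_QUERY=DISCARD ALL, a session-level SET may not stick, so the container flag remains the reliable path):

# Sketch: relax durability per connection when LOGS_DB_DURABLE=false.
from sqlalchemy import event

@event.listens_for(logs_engine, "connect")
def _set_logs_durability(dbapi_conn, connection_record):
    if not settings.logs_db_durable:
        with dbapi_conn.cursor() as cur:
            cur.execute("SET synchronous_commit = off")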

Retention Policy

-- Delete logs older than 90 days (configurable)
DELETE FROM structured_log_entries
WHERE timestamp < NOW() - INTERVAL '90 days';

-- Keep audit trails longer (365 days for compliance)
DELETE FROM audit_trails
WHERE timestamp < NOW() - INTERVAL '365 days';
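
These deletes need a scheduler. If the pg_cron extension is available (an assumption - any external cron job works equally well), the retention policy can live inside the LOGS database:

-- Sketch: schedule retention with pg_cron (extension must be installed).
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule('trim-structured-logs', '0 3 * * *',
    $$DELETE FROM structured_log_entries
      WHERE timestamp < NOW() - INTERVAL '90 days'$$);

SELECT cron.schedule('trim-audit-trails', '30 3 * * *',
    $$DELETE FROM audit_trails
      WHERE timestamp < NOW() - INTERVAL '365 days'$$);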

Database 4: OBSERVABILITY (OBSERVABILITY_DATABASE_URL) - Optional

Purpose: Distributed tracing with OpenTelemetry-style spans.

Tables (5)

TRACES - Self-contained FK cluster
├── observability_traces          # Root (no FK)
│   ├── observability_spans       # FK → traces, FK → spans (self-ref)
│   │   └── observability_events  # FK → spans
│   └── observability_metrics     # FK → traces
└── observability_saved_queries   # No FK (user queries)

FK Strategy

All FKs are internal to this database - the cluster is self-contained:

  • spans.trace_id → traces.trace_id
  • spans.parent_span_id → spans.span_id
  • events.span_id → spans.span_id
  • metrics.trace_id → traces.trace_id
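
Because every FK is internal, a full trace can be reconstructed in one recursive query. A sketch (key columns follow the FK list above; the depth column is illustrative):

-- Sketch: rebuild the span tree for one trace.
WITH RECURSIVE span_tree AS (
    SELECT span_id, parent_span_id, trace_id, 0 AS depth
    FROM observability_spans
    WHERE trace_id = :trace_id AND parent_span_id IS NULL
    UNION ALL
    SELECT s.span_id, s.parent_span_id, s.trace_id, t.depth + 1
    FROM observability_spans s
    JOIN span_tree t ON s.parent_span_id = t.span_id
)
SELECT * FROM span_tree ORDER BY depth;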

PostgreSQL Tuning

obs-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_obs
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=100"
    - "-c"
    - "shared_buffers=256MB"
    - "-c"
    - "synchronous_commit=off"
    - "-c"
    - "wal_level=minimal"
    - "-c"
    - "max_wal_senders=0"
    - "-c"
    - "work_mem=64MB"                  # Large for trace reconstruction
    - "-c"
    - "effective_cache_size=1GB"       # Cache hot traces
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G

Alternative: TimescaleDB

For high-volume tracing, consider TimescaleDB:

obs-pg:
  image: timescale/timescaledb:latest-pg16
  # Automatic time-based partitioning
  # Better compression for time-series
  # Continuous aggregates for dashboards
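
Adopting it would mean converting the hot tables to hypertables after the schema migration; a sketch (the start_time column name is an assumption):

-- Sketch: convert spans to a hypertable and compress old chunks.
SELECT create_hypertable('observability_spans', 'start_time',
                         chunk_time_interval => INTERVAL '1 day');
ALTER TABLE observability_spans SET (timescaledb.compress);
SELECT add_compression_policy('observability_spans', INTERVAL '7 days');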

Application Configuration

config.py

class Settings(BaseSettings):
    # Primary database (required)
    database_url: str = Field(
        default="sqlite:///./mcp.db",
        description="Primary database for core entities and auth"
    )

    # Secondary databases (optional - fall back to database_url if not set)
    metrics_database_url: Optional[str] = Field(
        default=None,
        description="Database for execution metrics. Falls back to database_url."
    )
    logs_database_url: Optional[str] = Field(
        default=None,
        description="Database for audit/compliance logs. Falls back to database_url."
    )
    observability_database_url: Optional[str] = Field(
        default=None,
        description="Database for distributed tracing. Falls back to database_url."
    )

    # LOGS database durability setting
    logs_db_durable: bool = Field(
        default=True,
        description="Enable full durability for LOGS database (synchronous_commit=on). "
                    "Set to False for development/non-compliance environments."
    )

Table Routing (db.py)

# Which database each table belongs to
METRICS_TABLES = {
    'tool_metrics', 'tool_metrics_hourly',
    'server_metrics', 'server_metrics_hourly',
    'resource_metrics', 'resource_metrics_hourly',
    'prompt_metrics', 'prompt_metrics_hourly',
    'a2a_agent_metrics', 'a2a_agent_metrics_hourly',
    'performance_metrics', 'performance_snapshots', 'performance_aggregates',
    'token_usage_logs',
}

LOGS_TABLES = {
    'audit_trails', 'permission_audit_log',
    'structured_log_entries', 'security_events',
}  # Note: email_auth_events stays in CORE with other auth tables

OBSERVABILITY_TABLES = {
    'observability_traces', 'observability_spans',
    'observability_events', 'observability_metrics',
    'observability_saved_queries',
}

# Create engines (optional DBs fall back to primary)
engine = create_engine(settings.database_url, ...)

metrics_engine = (
    create_engine(settings.metrics_database_url, ...)
    if settings.metrics_database_url else engine
)
logs_engine = (
    create_engine(settings.logs_database_url, ...)
    if settings.logs_database_url else engine
)
observability_engine = (
    create_engine(settings.observability_database_url, ...)
    if settings.observability_database_url else engine
)

# Session factories per database
SessionLocal = sessionmaker(bind=engine)
MetricsSessionLocal = sessionmaker(bind=metrics_engine)
LogsSessionLocal = sessionmaker(bind=logs_engine)
ObservabilitySessionLocal = sessionmaker(bind=observability_engine)

def get_db():
    """Get CORE database session."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_metrics_db():
    """Get METRICS database session (falls back to CORE if not configured)."""
    db = MetricsSessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_logs_db():
    """Get LOGS database session (falls back to CORE if not configured)."""
    db = LogsSessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_observability_db():
    """Get OBSERVABILITY database session (falls back to CORE if not configured)."""
    db = ObservabilitySessionLocal()
    try:
        yield db
    finally:
        db.close()
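
A cheap guard against a table silently routing to the wrong database is a test that the sets above stay disjoint and only name real tables; a sketch, assuming db.py exports the sets and Base is the declarative base:

# Sketch: routing sanity checks (set/Base imports are assumptions).
def test_routing_sets_are_disjoint():
    assert not METRICS_TABLES & LOGS_TABLES
    assert not METRICS_TABLES & OBSERVABILITY_TABLES
    assert not LOGS_TABLES & OBSERVABILITY_TABLES

def test_routed_tables_exist_in_metadata():
    known = set(Base.metadata.tables)
    for routed in (METRICS_TABLES, LOGS_TABLES, OBSERVABILITY_TABLES):
        assert routed <= known, routed - known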

Avoiding Cross-Database Queries

Critical: Never join tables across databases. Each service must use the correct session:

# WRONG - Uses CORE session for metrics table
async def get_tool_metrics(db: Session = Depends(get_db)):
    return db.query(ToolMetric).all()  # Will fail if metrics in separate DB!

# CORRECT - Uses METRICS session
async def get_tool_metrics(metrics_db: Session = Depends(get_metrics_db)):
    return metrics_db.query(ToolMetric).all()

# CORRECT - Cross-DB lookup via separate sessions
async def get_tool_with_metrics(
    tool_id: str,
    db: Session = Depends(get_db),
    metrics_db: Session = Depends(get_metrics_db)
):
    tool = db.query(Tool).filter(Tool.id == tool_id).first()  # CORE DB
    metrics = metrics_db.query(ToolMetric).filter(ToolMetric.tool_id == tool_id).all()  # METRICS DB
    return {"tool": tool, "metrics": metrics}

Complete docker-compose.yml Example

version: "3.8"

services:
  # ==========================================================================
  # Gateway Application
  # ==========================================================================
  gateway:
    image: mcpgateway:latest
    environment:
      # All database URLs go through PgBouncer
      - DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp
      - METRICS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_metrics
      - LOGS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_logs
      - OBSERVABILITY_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_obs
      # ... other settings
    depends_on:
      pgbouncer:
        condition: service_healthy

  # ==========================================================================
  # PgBouncer - Connection Pooler (routes to all databases)
  # ==========================================================================
  pgbouncer:
    image: edoburu/pgbouncer:latest
    environment:
      # Multiple databases via comma-separated URLs
      - DATABASE_URLS=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp,postgres://postgres:${POSTGRES_PASSWORD}@metrics-pg:5432/mcp_metrics,postgres://postgres:${POSTGRES_PASSWORD}@logs-pg:5432/mcp_logs,postgres://postgres:${POSTGRES_PASSWORD}@obs-pg:5432/mcp_obs
      - LISTEN_PORT=6432
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=5000
      - DEFAULT_POOL_SIZE=600
      - MAX_DB_CONNECTIONS=700
      - MIN_POOL_SIZE=100
      - RESERVE_POOL_SIZE=150
      - RESERVE_POOL_TIMEOUT=2
      - SERVER_RESET_QUERY=DISCARD ALL
      - SERVER_RESET_QUERY_ALWAYS=1
      - SERVER_LIFETIME=3600
      - SERVER_IDLE_TIMEOUT=600
      - QUERY_WAIT_TIMEOUT=60
      - CLIENT_IDLE_TIMEOUT=300
      - IDLE_TRANSACTION_TIMEOUT=300
    ports:
      - "6432:6432"
    depends_on:
      postgres:
        condition: service_healthy
      metrics-pg:
        condition: service_healthy
      logs-pg:
        condition: service_healthy
      obs-pg:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "pg_isready", "-h", "localhost", "-p", "6432"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Primary Database - CORE (OLTP)
  # ==========================================================================
  postgres:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=800"
      - "-c"
      - "shared_buffers=512MB"
      - "-c"
      - "synchronous_commit=on"
      - "-c"
      - "wal_level=replica"
      - "-c"
      - "checkpoint_completion_target=0.9"
    volumes:
      - pgdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Metrics Database - High Write Volume (OLAP)
  # ==========================================================================
  metrics-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_metrics
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=200"
      - "-c"
      - "shared_buffers=256MB"
      - "-c"
      - "synchronous_commit=off"
      - "-c"
      - "wal_level=minimal"
      - "-c"
      - "max_wal_senders=0"
      - "-c"
      - "checkpoint_timeout=15min"
      - "-c"
      - "max_wal_size=4GB"
    volumes:
      - metricsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_metrics"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Logs Database - Audit & Compliance (FULL DURABILITY REQUIRED)
  # ==========================================================================
  logs-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_logs
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=100"
      - "-c"
      - "shared_buffers=128MB"
      - "-c"
      - "synchronous_commit=on"        # REQUIRED for audit compliance
      - "-c"
      - "wal_level=replica"            # Allow backup/replication
      - "-c"
      - "wal_compression=on"
      - "-c"
      - "checkpoint_completion_target=0.9"
    volumes:
      - logsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_logs"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Observability Database - Distributed Tracing
  # ==========================================================================
  obs-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_obs
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=100"
      - "-c"
      - "shared_buffers=256MB"
      - "-c"
      - "synchronous_commit=off"
      - "-c"
      - "wal_level=minimal"
      - "-c"
      - "max_wal_senders=0"
      - "-c"
      - "work_mem=64MB"
    volumes:
      - obsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_obs"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  pgdata:
  metricsdata:
  logsdata:
  obsdata:

Expected Impact

Primary Database Relief

| Metric | Before | After (Full Split) |
|--------|--------|--------------------|
| Tables | 59 | 36 |
| Write I/O | 100% | ~25% |
| CPU Usage | 99% | ~50-60% |
| Connection Pressure | High | Medium |

Workload Isolation

| Database | Optimized For | sync_commit | Durability |
|----------|---------------|-------------|------------|
| CORE | Low-latency reads, complex joins | ON | Full |
| METRICS | High-throughput writes, aggregations | OFF | Relaxed (re-computable) |
| LOGS | Append-only, compliance audit | Configurable | Configurable (default: full) |
| OBSERVABILITY | Time-series, trace reconstruction | OFF | Relaxed (ephemeral) |

LOGS durability: Set LOGS_DB_DURABLE=false for development. Production compliance environments should use the default (true).


Implementation Requirements

Schema Changes (Required Before Multi-DB Support)

These changes enable the database split and must be applied via Alembic migration:

  1. Remove FK constraints from both raw and hourly metrics tables:

    • Raw metrics: tool_metrics, server_metrics, resource_metrics, prompt_metrics, a2a_agent_metrics
    • Hourly rollups: tool_metrics_hourly, server_metrics_hourly, resource_metrics_hourly, prompt_metrics_hourly, a2a_agent_metrics_hourly
  2. Add {entity}_name column to raw metrics tables for denormalization (hourly tables already have this)

  3. Update service layer to use correct session factories per table group

New Deployment Setup

For new installations with multi-database support:

  1. Configure environment variables:

    # Database URLs
    DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp
    METRICS_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_metrics    # Optional
    LOGS_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_logs          # Optional
    OBSERVABILITY_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_obs  # Optional
    
    # LOGS durability (default: true for compliance; set false for dev only)
    LOGS_DB_DURABLE=true
  2. Deploy database containers (see docker-compose example above)

  3. Run Alembic migrations on each configured database:

    # Core (always)
    alembic upgrade head
    
    # Secondary databases (if configured)
    ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head
    ALEMBIC_TARGET_DB=logs DATABASE_URL=$LOGS_DATABASE_URL alembic upgrade head
    ALEMBIC_TARGET_DB=observability DATABASE_URL=$OBSERVABILITY_DATABASE_URL alembic upgrade head
  4. Start application - routing happens automatically based on configured URLs

Existing installations: If enabling multi-database on an existing single-database deployment, historical data for metrics/logs/observability remains in the CORE database and won't be visible in the new secondary databases. See Appendix: Migrating Existing Single-DB Installations for data migration steps.


Alembic Per-Database Targeting

Use include_object filter to scope migrations per database:

# alembic/env.py

import os

METRICS_TABLES = {'tool_metrics', 'tool_metrics_hourly', ...}  # Full list
LOGS_TABLES = {'audit_trails', 'permission_audit_log', ...}
OBSERVABILITY_TABLES = {'observability_traces', ...}

def include_object(object, name, type_, reflected, compare_to):
    """Filter tables based on ALEMBIC_TARGET_DB environment variable."""
    if type_ != "table":
        return True

    target_db = os.environ.get("ALEMBIC_TARGET_DB", "core")

    if target_db == "metrics":
        return name in METRICS_TABLES
    elif target_db == "logs":
        return name in LOGS_TABLES
    elif target_db == "observability":
        return name in OBSERVABILITY_TABLES
    else:  # "core" or unset
        return name not in (METRICS_TABLES | LOGS_TABLES | OBSERVABILITY_TABLES)

def run_migrations_online():
    # ... existing code ...
    context.configure(
        connection=connection,
        target_metadata=target_metadata,
        include_object=include_object,  # Add this
        # ...
    )

Usage:

# Migrate CORE database (default)
alembic upgrade head

# Migrate METRICS database
ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head

# Migrate LOGS database
ALEMBIC_TARGET_DB=logs DATABASE_URL=$LOGS_DATABASE_URL alembic upgrade head

# Migrate OBSERVABILITY database
ALEMBIC_TARGET_DB=observability DATABASE_URL=$OBSERVABILITY_DATABASE_URL alembic upgrade head

Appendix: Migrating Existing Single-DB Installations

Note: This section is reference documentation for administrators who want to migrate an existing single-database deployment to multi-database. This is not part of the core feature implementation.

Migration Steps (Admin Reference)

  1. Backup existing database

    pg_dump -h postgres -p 5432 -U postgres -d mcp > backup.sql
  2. Deploy secondary database containers

  3. Run Alembic migrations on secondary databases

    ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head
    # Repeat for logs, observability
  4. Copy data from primary to secondary (connect directly, not via PgBouncer)

    # Example for METRICS tables
    pg_dump -h postgres -p 5432 -U postgres -d mcp \
      --data-only \
      -t tool_metrics -t tool_metrics_hourly \
      -t server_metrics -t server_metrics_hourly \
      -t resource_metrics -t resource_metrics_hourly \
      -t prompt_metrics -t prompt_metrics_hourly \
      -t a2a_agent_metrics -t a2a_agent_metrics_hourly \
      -t performance_metrics -t performance_snapshots -t performance_aggregates \
      -t token_usage_logs \
      | psql -h metrics-pg -p 5432 -U postgres -d mcp_metrics
  5. Stop application, configure URLs, restart

  6. Verify data integrity by comparing row counts (see the snippet after this list)

  7. (Optional) Drop migrated tables from primary after verification
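
For step 6, a quick row-count comparison (table list abbreviated; repeat for every migrated table):

# Compare row counts between primary and secondary
for t in tool_metrics tool_metrics_hourly token_usage_logs; do
  src=$(psql -h postgres   -U postgres -d mcp         -tAc "SELECT count(*) FROM $t")
  dst=$(psql -h metrics-pg -U postgres -d mcp_metrics -tAc "SELECT count(*) FROM $t")
  echo "$t: primary=$src secondary=$dst"
done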

Reverting to Single-DB

If migrated tables still exist in primary: Simply unset the optional URLs:

# Only DATABASE_URL configured = single-database mode
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
# METRICS_DATABASE_URL=  (unset)
# LOGS_DATABASE_URL=     (unset)
# OBSERVABILITY_DATABASE_URL=  (unset)

The application falls back to using DATABASE_URL for all tables.

If migrated tables were dropped from primary: You must restore data first:

# 1. Copy data from secondary back to primary
pg_dump -h metrics-pg -p 5432 -U postgres -d mcp_metrics --data-only \
  | psql -h postgres -p 5432 -U postgres -d mcp

# 2. Then unset URLs and restart

Warning: If you dropped tables from primary without a backup and the secondary databases are unavailable, data recovery is not possible. Always maintain backups before dropping tables.
