[Feature Request]: Implement 4-Database Architecture for scaling and separation of METRICS, LOGS and OBSERVABILITY data #1952

Overview

Split the 59 application tables across up to 4 specialized PostgreSQL instances. The three secondary databases are optional - by default everything runs in the single core database, and you can progressively offload workloads as needed.

| Database | Purpose | Tables | Write Volume | Optional |
|----------|---------|--------|--------------|----------|
| DATABASE_URL | Core OLTP | 36 (or all 59) | Low-Medium | Required |
| METRICS_DATABASE_URL | Execution metrics | 14 | Very High | Yes |
| LOGS_DATABASE_URL | Audit & compliance | 4 | High | Yes |
| OBSERVABILITY_DATABASE_URL | Distributed tracing | 5 | Very High | Yes |

Deployment Scenarios

| Scenario | Configuration | Tables in CORE |
|----------|---------------|----------------|
| Simple (default) | Only DATABASE_URL | All 59 |
| Offload Metrics | + METRICS_DATABASE_URL | 45 |
| Offload Logs | + LOGS_DATABASE_URL | 55 |
| Offload Observability | + OBSERVABILITY_DATABASE_URL | 54 |
| Offload Metrics + Logs | + both | 41 |
| Full Split | All 4 URLs | 36 |

Architecture Diagram

                                             ┌─────────────────┐
                                        ┌───►│ postgres:5432   │ (CORE)
                                        │    │ database: mcp   │
┌─────────────┐    ┌────────────────┐   │    └─────────────────┘
│   Gateway   │───►│   PgBouncer    │───┤    ┌─────────────────┐
│  Workers    │    │     :6432      │   ├───►│ metrics-pg:5432 │ (METRICS) [optional]
└─────────────┘    │                │   │    │ db: mcp_metrics │
                   │ Routes by DB   │   │    └─────────────────┘
                   │ name in URL    │   │    ┌─────────────────┐
                   └────────────────┘   ├───►│ logs-pg:5432    │ (LOGS) [optional]
                                        │    │ db: mcp_logs    │
                                        │    └─────────────────┘
                                        │    ┌─────────────────┐
                                        └───►│ obs-pg:5432     │ (OBSERVABILITY) [optional]
                                             │ db: mcp_obs     │
                                             └─────────────────┘

PgBouncer Configuration

The edoburu/pgbouncer image supports multiple databases via the DATABASE_URLS environment variable (comma-separated).

Minimal Setup (Single Database)

# docker-compose.yml - Everything in one database
pgbouncer:
  image: edoburu/pgbouncer:latest
  environment:
    - DATABASE_URL=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp
    - LISTEN_PORT=6432
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=5000
    - DEFAULT_POOL_SIZE=600
    - MAX_DB_CONNECTIONS=700

Full Split (4 Databases)

# docker-compose.yml - All databases through single PgBouncer
pgbouncer:
  image: edoburu/pgbouncer:latest
  environment:
    # Multiple databases via comma-separated URLs
    - DATABASE_URLS=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp,postgres://postgres:${POSTGRES_PASSWORD}@metrics-pg:5432/mcp_metrics,postgres://postgres:${POSTGRES_PASSWORD}@logs-pg:5432/mcp_logs,postgres://postgres:${POSTGRES_PASSWORD}@obs-pg:5432/mcp_obs
    - LISTEN_PORT=6432
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=5000
    - DEFAULT_POOL_SIZE=600
    - MAX_DB_CONNECTIONS=700
    - MIN_POOL_SIZE=100
    - RESERVE_POOL_SIZE=150
    - SERVER_RESET_QUERY=DISCARD ALL
    - SERVER_RESET_QUERY_ALWAYS=1
  depends_on:
    postgres:
      condition: service_healthy
    metrics-pg:
      condition: service_healthy
    logs-pg:
      condition: service_healthy
    obs-pg:
      condition: service_healthy

This generates the following PgBouncer config:

[databases]
mcp = host=postgres port=5432 auth_user=postgres
mcp_metrics = host=metrics-pg port=5432 auth_user=postgres
mcp_logs = host=logs-pg port=5432 auth_user=postgres
mcp_obs = host=obs-pg port=5432 auth_user=postgres

[pgbouncer]
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 600
max_db_connections = 700
...

Gateway Configuration

All URLs Through PgBouncer

# Gateway environment - all connections pooled
DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_metrics
LOGS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_logs
OBSERVABILITY_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_obs

Fallback Behavior

When optional URLs are not set, the application uses DATABASE_URL for those tables:

# Minimal - everything in one DB (current behavior, fully backward compatible)
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp

# Partial offload - only metrics separate
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_metrics
# LOGS_DATABASE_URL not set → uses DATABASE_URL
# OBSERVABILITY_DATABASE_URL not set → uses DATABASE_URL

# Offload metrics and logs, keep observability in core
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
METRICS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_metrics
LOGS_DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp_logs
# OBSERVABILITY_DATABASE_URL not set → uses DATABASE_URL

Database 1: CORE (DATABASE_URL) - Required

Purpose: Transactional data requiring ACID consistency and complex joins.

Tables (36)

ENTITIES (9)
├── tools
├── servers
├── resources
├── prompts
├── gateways
├── a2a_agents
├── grpc_services
├── llm_models
└── llm_providers

AUTH/IDENTITY (15)
├── email_users
├── email_teams
├── email_team_members
├── email_team_member_history
├── email_team_invitations
├── email_team_join_requests
├── email_api_tokens
├── email_auth_events          # Auth events stay in CORE (not LOGS)
├── roles
├── user_roles
├── pending_user_approvals
├── token_revocations
├── sso_providers
├── sso_auth_sessions
└── registered_oauth_clients

OAUTH (2)
├── oauth_states
└── oauth_tokens

ASSOCIATIONS (4)
├── server_tool_association
├── server_resource_association
├── server_prompt_association
└── server_a2a_association

OTHER (6)
├── mcp_sessions
├── mcp_messages
├── resource_subscriptions
├── global_config
├── alembic_version
└── toolops_test_cases

PostgreSQL Tuning

postgres:
  image: postgres:18
  command:
    - "-c"
    - "max_connections=800"
    - "-c"
    - "shared_buffers=512MB"
    - "-c"
    - "synchronous_commit=on"          # Data integrity required
    - "-c"
    - "wal_level=replica"              # Support replication
    - "-c"
    - "checkpoint_completion_target=0.9"
    - "-c"
    - "random_page_cost=1.1"           # SSD optimized
  deploy:
    resources:
      limits:
        cpus: '4'
        memory: 8G

Database 2: METRICS (METRICS_DATABASE_URL) - Optional

Purpose: High-volume execution metrics with time-based aggregations.

Tables (14)

ENTITY METRICS - Raw (5) - denormalized with entity name
├── tool_metrics          → tool_id + tool_name (soft reference, no FK)
├── server_metrics        → server_id + server_name (soft reference, no FK)
├── resource_metrics      → resource_id + resource_name (soft reference, no FK)
├── prompt_metrics        → prompt_id + prompt_name (soft reference, no FK)
└── a2a_agent_metrics     → a2a_agent_id + agent_name (soft reference, no FK)

ENTITY METRICS - Hourly Rollups (5) - denormalized, no FK
├── tool_metrics_hourly          (tool_id + tool_name, no FK)
├── server_metrics_hourly        (server_id + server_name, no FK)
├── resource_metrics_hourly      (resource_id + resource_name, no FK)
├── prompt_metrics_hourly        (prompt_id + prompt_name, no FK)
└── a2a_agent_metrics_hourly     (a2a_agent_id + agent_name, no FK)

PERFORMANCE (3)
├── performance_metrics
├── performance_snapshots
└── performance_aggregates

LLM USAGE (1)
└── token_usage_logs

FK Strategy

Both raw and hourly metrics tables require FK removal for the database split:

Raw metrics tables use soft references (VARCHAR ID without FK constraint):

  • Allows entity deletion without CASCADE
  • Metrics history preserved even if entity removed
  • No cross-database FK needed
  • NEW: Include denormalized {entity}_name column captured at write time

Hourly rollup tables also require FK removal:

  • Currently have nullable FK to entity tables (with ON DELETE SET NULL)
  • For split: Remove FK constraint, keep soft reference (ID as VARCHAR)
  • Already have denormalized {entity}_name column - no schema change needed
  • Self-sufficient for reporting without joins to CORE database
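
A minimal Alembic sketch of these changes for one entity (the constraint names here are assumptions - take the real ones from your schema or an autogenerate run; repeat per table):

# Sketch only: constraint names below are assumptions, not taken
# from the actual schema - verify before applying.
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Raw metrics: drop the FK and add the denormalized name column
    op.drop_constraint("tool_metrics_tool_id_fkey", "tool_metrics", type_="foreignkey")
    op.add_column("tool_metrics", sa.Column("tool_name", sa.String(255), nullable=True))
    # Backfill from the entity table while both still live in one database
    op.execute(
        "UPDATE tool_metrics m SET tool_name = t.name "
        "FROM tools t WHERE m.tool_id = t.id"
    )
    op.execute("UPDATE tool_metrics SET tool_name = 'unknown' WHERE tool_name IS NULL")
    op.alter_column("tool_metrics", "tool_name", existing_type=sa.String(255), nullable=False)
    # Hourly rollups: only the FK goes (name column already exists)
    op.drop_constraint("tool_metrics_hourly_tool_id_fkey", "tool_metrics_hourly", type_="foreignkey")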

Entity Name Capture (Denormalization Strategy)

When is entity_name captured?

The entity name is captured at write time when recording a metric, not during rollup:

# In tool_service.py when recording a metric (using dependency injection)
async def record_tool_execution(
    self,
    tool_id: str,
    response_time: float,
    success: bool,
    db: Session = Depends(get_db),              # CORE DB for tool lookup
    metrics_db: Session = Depends(get_metrics_db)  # METRICS DB for write
):
    # Lookup tool name from CORE DB (or cache)
    tool = db.query(Tool).filter(Tool.id == tool_id).first()
    tool_name = tool.name if tool else "unknown"

    # Write to METRICS DB with denormalized name
    metric = ToolMetric(
        tool_id=tool_id,
        tool_name=tool_name,  # Captured at write time
        timestamp=datetime.now(UTC),
        response_time=response_time,
        is_success=success,
    )
    metrics_db.add(metric)
    metrics_db.commit()

Schema change required:

class ToolMetric(Base):
    __tablename__ = "tool_metrics"
    tool_id: Mapped[str] = mapped_column(String(36), nullable=False, index=True)
    tool_name: Mapped[str] = mapped_column(String(255), nullable=False)  # NEW
    timestamp: Mapped[datetime] = ...

How are entity renames handled?

  • Raw metrics: Capture name at write time. Historical metrics show the name at time of execution.
  • Hourly rollups: Inherit name from raw metrics. Shows name at time each metric was recorded.
  • Dashboard queries: Use denormalized {entity}_name column (e.g., tool_name, server_name) directly - no join to CORE needed.
  • Trade-off: After rename, historical data shows old name. This is intentional for audit accuracy.

If you need current names in reports, add a lookup cache:

# Cache entity_id → current_name mapping, refresh every 5 min
ENTITY_NAME_CACHE: Dict[str, str] = {}
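
A slightly fuller sketch of that cache with TTL handling (the Tool lookup mirrors the write-time example above; names here are illustrative):

import time
from typing import Dict, Tuple

CACHE_TTL_SECONDS = 300  # refresh entries every 5 min
_name_cache: Dict[str, Tuple[str, float]] = {}  # entity_id -> (name, fetched_at)

def current_tool_name(tool_id: str, db) -> str:
    """Return the current tool name from CORE, cached with a TTL."""
    now = time.monotonic()
    hit = _name_cache.get(tool_id)
    if hit and now - hit[1] < CACHE_TTL_SECONDS:
        return hit[0]
    tool = db.query(Tool).filter(Tool.id == tool_id).first()  # CORE session
    name = tool.name if tool else "unknown"
    _name_cache[tool_id] = (name, now)
    return name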

PostgreSQL Tuning

metrics-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_metrics
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=200"
    - "-c"
    - "shared_buffers=256MB"
    - "-c"
    - "synchronous_commit=off"         # Async WAL for speed
    - "-c"
    - "wal_level=minimal"              # No replication needed
    - "-c"
    - "max_wal_senders=0"              # Disable replication
    - "-c"
    - "checkpoint_timeout=15min"       # Less frequent checkpoints
    - "-c"
    - "max_wal_size=4GB"               # Buffer more WAL
    - "-c"
    - "work_mem=32MB"                  # Larger for aggregations
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G

Partitioning (Recommended)

-- Partition raw metrics by week for easy retention management
CREATE TABLE tool_metrics (
    id SERIAL,
    tool_id VARCHAR(36) NOT NULL,
    tool_name VARCHAR(255) NOT NULL,  -- Denormalized entity name
    timestamp TIMESTAMPTZ NOT NULL,
    response_time DOUBLE PRECISION,
    is_success BOOLEAN
) PARTITION BY RANGE (timestamp);

-- Create index on tool_id for lookups
CREATE INDEX idx_tool_metrics_tool_id ON tool_metrics (tool_id);

CREATE TABLE tool_metrics_2026_w01 PARTITION OF tool_metrics
    FOR VALUES FROM ('2026-01-01') TO ('2026-01-08');

-- Drop old partitions for retention
DROP TABLE tool_metrics_2025_w40;  -- Drop data older than 90 days
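
Creating weekly partitions by hand does not scale; a scheduled maintenance statement can pre-create the next window. A sketch following the naming scheme above:

-- Sketch: pre-create next week's partition (run weekly from a scheduler).
DO $$
DECLARE
    start_day date := date_trunc('week', now() + interval '1 week')::date;
    end_day   date := start_day + 7;
    part_name text := format('tool_metrics_%s_w%s',
                             to_char(start_day, 'IYYY'), to_char(start_day, 'IW'));
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF tool_metrics
             FOR VALUES FROM (%L) TO (%L)',
        part_name, start_day, end_day);
END $$;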

Database 3: LOGS (LOGS_DATABASE_URL) - Optional

Purpose: Compliance audit trails and application logs.

Durability: Configurable via LOGS_DB_DURABLE environment variable.

  • true (default): Full durability (synchronous_commit=on) for SOC2/HIPAA compliance
  • false: Relaxed durability (synchronous_commit=off) for development/non-compliance environments

Tables (4)

AUDIT (2) - No FK
├── audit_trails              # SOC2/HIPAA compliance
└── permission_audit_log      # RBAC changes

STRUCTURED LOGS (2) - Internal FK only
├── structured_log_entries    # Application logs
└── security_events           # FK → structured_log_entries (internal)

Note: email_auth_events remains in CORE database with other auth tables.

FK Strategy

  • security_events.log_entry_id → structured_log_entries.id (internal FK, stays valid)
  • No external dependencies - entire cluster is self-contained

PostgreSQL Tuning

Choose the appropriate configuration based on your durability requirements:

Production / Compliance (default):

logs-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_logs
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=100"
    - "-c"
    - "shared_buffers=128MB"
    - "-c"
    - "synchronous_commit=on"              # Full durability
    - "-c"
    - "wal_level=replica"                  # Supports backup/replication
    - "-c"
    - "wal_compression=on"
    - "-c"
    - "autovacuum_vacuum_scale_factor=0.1"
    - "-c"
    - "checkpoint_completion_target=0.9"
  deploy:
    resources:
      limits:
        cpus: '1'
        memory: 2G

Development / Non-compliance (set LOGS_DB_DURABLE=false):

logs-pg:
  command:
    # ... same base settings, but with relaxed durability:
    - "-c"
    - "synchronous_commit=off"             # Faster writes, risk of loss
    - "-c"
    - "wal_level=minimal"                  # No replication support

Note: LOGS_DB_DURABLE is an application-level setting. When set to false, the application accepts that audit data may be lost on crash. The PostgreSQL container must be configured to match (use the development config above).
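
Since synchronous_commit is a session-settable GUC, one way the application could honor the flag is per connection; a sketch (note that with PgBouncer in transaction pooling mode and SERVER_RESET_QUERY=DISCARD ALL, a session-level SET may not stick, so the container flag remains the reliable path):

# Sketch: relax durability per connection when LOGS_DB_DURABLE=false.
from sqlalchemy import event

@event.listens_for(logs_engine, "connect")
def _set_logs_durability(dbapi_conn, connection_record):
    if not settings.logs_db_durable:
        with dbapi_conn.cursor() as cur:
            cur.execute("SET synchronous_commit = off")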

Retention Policy

-- Delete logs older than 90 days (configurable)
DELETE FROM structured_log_entries
WHERE timestamp < NOW() - INTERVAL '90 days';

-- Keep audit trails longer (365 days for compliance)
DELETE FROM audit_trails
WHERE timestamp < NOW() - INTERVAL '365 days';
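
These deletes need a scheduler. If the pg_cron extension is available (an assumption - any external cron job works equally well), the retention policy can live inside the LOGS database:

-- Sketch: schedule retention with pg_cron (extension must be installed).
CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule('trim-structured-logs', '0 3 * * *',
    $$DELETE FROM structured_log_entries
      WHERE timestamp < NOW() - INTERVAL '90 days'$$);

SELECT cron.schedule('trim-audit-trails', '30 3 * * *',
    $$DELETE FROM audit_trails
      WHERE timestamp < NOW() - INTERVAL '365 days'$$);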

Database 4: OBSERVABILITY (OBSERVABILITY_DATABASE_URL) - Optional

Purpose: Distributed tracing with OpenTelemetry-style spans.

Tables (5)

TRACES - Self-contained FK cluster
├── observability_traces          # Root (no FK)
│   ├── observability_spans       # FK → traces, FK → spans (self-ref)
│   │   └── observability_events  # FK → spans
│   └── observability_metrics     # FK → traces
└── observability_saved_queries   # No FK (user queries)

FK Strategy

All FKs are internal to this database - the cluster is self-contained:

  • spans.trace_id → traces.trace_id
  • spans.parent_span_id → spans.span_id
  • events.span_id → spans.span_id
  • metrics.trace_id → traces.trace_id
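
Because every FK is internal, a full trace can be reconstructed in one recursive query. A sketch (key columns follow the FK list above; the depth column is illustrative):

-- Sketch: rebuild the span tree for one trace.
WITH RECURSIVE span_tree AS (
    SELECT span_id, parent_span_id, trace_id, 0 AS depth
    FROM observability_spans
    WHERE trace_id = :trace_id AND parent_span_id IS NULL
    UNION ALL
    SELECT s.span_id, s.parent_span_id, s.trace_id, t.depth + 1
    FROM observability_spans s
    JOIN span_tree t ON s.parent_span_id = t.span_id
)
SELECT * FROM span_tree ORDER BY depth;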

PostgreSQL Tuning

obs-pg:
  image: postgres:18
  environment:
    - POSTGRES_DB=mcp_obs
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  command:
    - "-c"
    - "max_connections=100"
    - "-c"
    - "shared_buffers=256MB"
    - "-c"
    - "synchronous_commit=off"
    - "-c"
    - "wal_level=minimal"
    - "-c"
    - "max_wal_senders=0"
    - "-c"
    - "work_mem=64MB"                  # Large for trace reconstruction
    - "-c"
    - "effective_cache_size=1GB"       # Cache hot traces
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 4G

Alternative: TimescaleDB

For high-volume tracing, consider TimescaleDB:

obs-pg:
  image: timescale/timescaledb:latest-pg16
  # Automatic time-based partitioning
  # Better compression for time-series
  # Continuous aggregates for dashboards
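
Adopting it would mean converting the hot tables to hypertables after the schema migration; a sketch (the start_time column name is an assumption):

-- Sketch: convert spans to a hypertable and compress old chunks.
SELECT create_hypertable('observability_spans', 'start_time',
                         chunk_time_interval => INTERVAL '1 day');
ALTER TABLE observability_spans SET (timescaledb.compress);
SELECT add_compression_policy('observability_spans', INTERVAL '7 days');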

Application Configuration

config.py

class Settings(BaseSettings):
    # Primary database (required)
    database_url: str = Field(
        default="sqlite:///./mcp.db",
        description="Primary database for core entities and auth"
    )

    # Secondary databases (optional - fall back to database_url if not set)
    metrics_database_url: Optional[str] = Field(
        default=None,
        description="Database for execution metrics. Falls back to database_url."
    )
    logs_database_url: Optional[str] = Field(
        default=None,
        description="Database for audit/compliance logs. Falls back to database_url."
    )
    observability_database_url: Optional[str] = Field(
        default=None,
        description="Database for distributed tracing. Falls back to database_url."
    )

    # LOGS database durability setting
    logs_db_durable: bool = Field(
        default=True,
        description="Enable full durability for LOGS database (synchronous_commit=on). "
                    "Set to False for development/non-compliance environments."
    )

Table Routing (db.py)

# Which database each table belongs to
METRICS_TABLES = {
    'tool_metrics', 'tool_metrics_hourly',
    'server_metrics', 'server_metrics_hourly',
    'resource_metrics', 'resource_metrics_hourly',
    'prompt_metrics', 'prompt_metrics_hourly',
    'a2a_agent_metrics', 'a2a_agent_metrics_hourly',
    'performance_metrics', 'performance_snapshots', 'performance_aggregates',
    'token_usage_logs',
}

LOGS_TABLES = {
    'audit_trails', 'permission_audit_log',
    'structured_log_entries', 'security_events',
}  # Note: email_auth_events stays in CORE with other auth tables

OBSERVABILITY_TABLES = {
    'observability_traces', 'observability_spans',
    'observability_events', 'observability_metrics',
    'observability_saved_queries',
}

# Create engines (optional DBs fall back to primary)
engine = create_engine(settings.database_url, ...)

metrics_engine = (
    create_engine(settings.metrics_database_url, ...)
    if settings.metrics_database_url else engine
)
logs_engine = (
    create_engine(settings.logs_database_url, ...)
    if settings.logs_database_url else engine
)
observability_engine = (
    create_engine(settings.observability_database_url, ...)
    if settings.observability_database_url else engine
)

# Session factories per database
SessionLocal = sessionmaker(bind=engine)
MetricsSessionLocal = sessionmaker(bind=metrics_engine)
LogsSessionLocal = sessionmaker(bind=logs_engine)
ObservabilitySessionLocal = sessionmaker(bind=observability_engine)

def get_db():
    """Get CORE database session."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_metrics_db():
    """Get METRICS database session (falls back to CORE if not configured)."""
    db = MetricsSessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_logs_db():
    """Get LOGS database session (falls back to CORE if not configured)."""
    db = LogsSessionLocal()
    try:
        yield db
    finally:
        db.close()

def get_observability_db():
    """Get OBSERVABILITY database session (falls back to CORE if not configured)."""
    db = ObservabilitySessionLocal()
    try:
        yield db
    finally:
        db.close()
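
A cheap guard against a table silently routing to the wrong database is a test that the sets above stay disjoint and only name real tables; a sketch, assuming db.py exports the sets and Base is the declarative base:

# Sketch: routing sanity checks (set/Base imports are assumptions).
def test_routing_sets_are_disjoint():
    assert not METRICS_TABLES & LOGS_TABLES
    assert not METRICS_TABLES & OBSERVABILITY_TABLES
    assert not LOGS_TABLES & OBSERVABILITY_TABLES

def test_routed_tables_exist_in_metadata():
    known = set(Base.metadata.tables)
    for routed in (METRICS_TABLES, LOGS_TABLES, OBSERVABILITY_TABLES):
        assert routed <= known, routed - known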

Avoiding Cross-Database Queries

Critical: Never join tables across databases. Each service must use the correct session:

# WRONG - Uses CORE session for metrics table
async def get_tool_metrics(db: Session = Depends(get_db)):
    return db.query(ToolMetric).all()  # Will fail if metrics in separate DB!

# CORRECT - Uses METRICS session
async def get_tool_metrics(metrics_db: Session = Depends(get_metrics_db)):
    return metrics_db.query(ToolMetric).all()

# CORRECT - Cross-DB lookup via separate sessions
async def get_tool_with_metrics(
    tool_id: str,
    db: Session = Depends(get_db),
    metrics_db: Session = Depends(get_metrics_db)
):
    tool = db.query(Tool).filter(Tool.id == tool_id).first()  # CORE DB
    metrics = metrics_db.query(ToolMetric).filter(ToolMetric.tool_id == tool_id).all()  # METRICS DB
    return {"tool": tool, "metrics": metrics}

Complete docker-compose.yml Example

version: "3.8"

services:
  # ==========================================================================
  # Gateway Application
  # ==========================================================================
  gateway:
    image: mcpgateway:latest
    environment:
      # All database URLs go through PgBouncer
      - DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp
      - METRICS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_metrics
      - LOGS_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_logs
      - OBSERVABILITY_DATABASE_URL=postgresql+psycopg://postgres:${POSTGRES_PASSWORD}@pgbouncer:6432/mcp_obs
      # ... other settings
    depends_on:
      pgbouncer:
        condition: service_healthy

  # ==========================================================================
  # PgBouncer - Connection Pooler (routes to all databases)
  # ==========================================================================
  pgbouncer:
    image: edoburu/pgbouncer:latest
    environment:
      # Multiple databases via comma-separated URLs
      - DATABASE_URLS=postgres://postgres:${POSTGRES_PASSWORD}@postgres:5432/mcp,postgres://postgres:${POSTGRES_PASSWORD}@metrics-pg:5432/mcp_metrics,postgres://postgres:${POSTGRES_PASSWORD}@logs-pg:5432/mcp_logs,postgres://postgres:${POSTGRES_PASSWORD}@obs-pg:5432/mcp_obs
      - LISTEN_PORT=6432
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=5000
      - DEFAULT_POOL_SIZE=600
      - MAX_DB_CONNECTIONS=700
      - MIN_POOL_SIZE=100
      - RESERVE_POOL_SIZE=150
      - RESERVE_POOL_TIMEOUT=2
      - SERVER_RESET_QUERY=DISCARD ALL
      - SERVER_RESET_QUERY_ALWAYS=1
      - SERVER_LIFETIME=3600
      - SERVER_IDLE_TIMEOUT=600
      - QUERY_WAIT_TIMEOUT=60
      - CLIENT_IDLE_TIMEOUT=300
      - IDLE_TRANSACTION_TIMEOUT=300
    ports:
      - "6432:6432"
    depends_on:
      postgres:
        condition: service_healthy
      metrics-pg:
        condition: service_healthy
      logs-pg:
        condition: service_healthy
      obs-pg:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "pg_isready", "-h", "localhost", "-p", "6432"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Primary Database - CORE (OLTP)
  # ==========================================================================
  postgres:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=800"
      - "-c"
      - "shared_buffers=512MB"
      - "-c"
      - "synchronous_commit=on"
      - "-c"
      - "wal_level=replica"
      - "-c"
      - "checkpoint_completion_target=0.9"
    volumes:
      - pgdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Metrics Database - High Write Volume (OLAP)
  # ==========================================================================
  metrics-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_metrics
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=200"
      - "-c"
      - "shared_buffers=256MB"
      - "-c"
      - "synchronous_commit=off"
      - "-c"
      - "wal_level=minimal"
      - "-c"
      - "max_wal_senders=0"
      - "-c"
      - "checkpoint_timeout=15min"
      - "-c"
      - "max_wal_size=4GB"
    volumes:
      - metricsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_metrics"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Logs Database - Audit & Compliance (FULL DURABILITY REQUIRED)
  # ==========================================================================
  logs-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_logs
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=100"
      - "-c"
      - "shared_buffers=128MB"
      - "-c"
      - "synchronous_commit=on"        # REQUIRED for audit compliance
      - "-c"
      - "wal_level=replica"            # Allow backup/replication
      - "-c"
      - "wal_compression=on"
      - "-c"
      - "checkpoint_completion_target=0.9"
    volumes:
      - logsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 2G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_logs"]
      interval: 10s
      timeout: 5s
      retries: 5

  # ==========================================================================
  # Observability Database - Distributed Tracing
  # ==========================================================================
  obs-pg:
    image: postgres:18
    environment:
      - POSTGRES_DB=mcp_obs
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    command:
      - "-c"
      - "max_connections=100"
      - "-c"
      - "shared_buffers=256MB"
      - "-c"
      - "synchronous_commit=off"
      - "-c"
      - "wal_level=minimal"
      - "-c"
      - "max_wal_senders=0"
      - "-c"
      - "work_mem=64MB"
    volumes:
      - obsdata:/var/lib/postgresql/data
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d mcp_obs"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  pgdata:
  metricsdata:
  logsdata:
  obsdata:

Expected Impact

Primary Database Relief

| Metric | Before | After (Full Split) |
|--------|--------|--------------------|
| Tables | 59 | 36 |
| Write I/O | 100% | ~25% |
| CPU Usage | 99% | ~50-60% |
| Connection Pressure | High | Medium |

Workload Isolation

| Database | Optimized For | sync_commit | Durability |
|----------|---------------|-------------|------------|
| CORE | Low-latency reads, complex joins | ON | Full |
| METRICS | High-throughput writes, aggregations | OFF | Relaxed (re-computable) |
| LOGS | Append-only, compliance audit | Configurable | Configurable (default: full) |
| OBSERVABILITY | Time-series, trace reconstruction | OFF | Relaxed (ephemeral) |

LOGS durability: Set LOGS_DB_DURABLE=false for development. Production compliance environments should use the default (true).


Implementation Requirements

Schema Changes (Required Before Multi-DB Support)

These changes enable the database split and must be applied via Alembic migration:

  1. Remove FK constraints from both raw and hourly metrics tables:

    • Raw metrics: tool_metrics, server_metrics, resource_metrics, prompt_metrics, a2a_agent_metrics
    • Hourly rollups: tool_metrics_hourly, server_metrics_hourly, resource_metrics_hourly, prompt_metrics_hourly, a2a_agent_metrics_hourly
  2. Add {entity}_name column to raw metrics tables for denormalization (hourly tables already have this)

  3. Update service layer to use correct session factories per table group

New Deployment Setup

For new installations with multi-database support:

  1. Configure environment variables:

    # Database URLs
    DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp
    METRICS_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_metrics    # Optional
    LOGS_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_logs          # Optional
    OBSERVABILITY_DATABASE_URL=postgresql+psycopg://...@pgbouncer:6432/mcp_obs  # Optional
    
    # LOGS durability (default: true for compliance; set false for dev only)
    LOGS_DB_DURABLE=true
  2. Deploy database containers (see docker-compose example above)

  3. Run Alembic migrations on each configured database:

    # Core (always)
    alembic upgrade head
    
    # Secondary databases (if configured)
    ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head
    ALEMBIC_TARGET_DB=logs DATABASE_URL=$LOGS_DATABASE_URL alembic upgrade head
    ALEMBIC_TARGET_DB=observability DATABASE_URL=$OBSERVABILITY_DATABASE_URL alembic upgrade head
  4. Start application - routing happens automatically based on configured URLs

Existing installations: If enabling multi-database on an existing single-database deployment, historical data for metrics/logs/observability remains in the CORE database and won't be visible in the new secondary databases. See Appendix: Migrating Existing Single-DB Installations for data migration steps.


Alembic Per-Database Targeting

Use include_object filter to scope migrations per database:

# alembic/env.py

import os

METRICS_TABLES = {'tool_metrics', 'tool_metrics_hourly', ...}  # Full list
LOGS_TABLES = {'audit_trails', 'permission_audit_log', ...}
OBSERVABILITY_TABLES = {'observability_traces', ...}

def include_object(object, name, type_, reflected, compare_to):
    """Filter tables based on ALEMBIC_TARGET_DB environment variable."""
    if type_ != "table":
        return True

    target_db = os.environ.get("ALEMBIC_TARGET_DB", "core")

    if target_db == "metrics":
        return name in METRICS_TABLES
    elif target_db == "logs":
        return name in LOGS_TABLES
    elif target_db == "observability":
        return name in OBSERVABILITY_TABLES
    else:  # "core" or unset
        return name not in (METRICS_TABLES | LOGS_TABLES | OBSERVABILITY_TABLES)

def run_migrations_online():
    # ... existing code ...
    context.configure(
        connection=connection,
        target_metadata=target_metadata,
        include_object=include_object,  # Add this
        # ...
    )

Usage:

# Migrate CORE database (default)
alembic upgrade head

# Migrate METRICS database
ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head

# Migrate LOGS database
ALEMBIC_TARGET_DB=logs DATABASE_URL=$LOGS_DATABASE_URL alembic upgrade head

# Migrate OBSERVABILITY database
ALEMBIC_TARGET_DB=observability DATABASE_URL=$OBSERVABILITY_DATABASE_URL alembic upgrade head

Appendix: Migrating Existing Single-DB Installations

Note: This section is reference documentation for administrators who want to migrate an existing single-database deployment to multi-database. This is not part of the core feature implementation.

Migration Steps (Admin Reference)

  1. Backup existing database

    pg_dump -h postgres -p 5432 -U postgres -d mcp > backup.sql
  2. Deploy secondary database containers

  3. Run Alembic migrations on secondary databases

    ALEMBIC_TARGET_DB=metrics DATABASE_URL=$METRICS_DATABASE_URL alembic upgrade head
    # Repeat for logs, observability
  4. Copy data from primary to secondary (connect directly, not via PgBouncer)

    # Example for METRICS tables
    pg_dump -h postgres -p 5432 -U postgres -d mcp \
      --data-only \
      -t tool_metrics -t tool_metrics_hourly \
      -t server_metrics -t server_metrics_hourly \
      -t resource_metrics -t resource_metrics_hourly \
      -t prompt_metrics -t prompt_metrics_hourly \
      -t a2a_agent_metrics -t a2a_agent_metrics_hourly \
      -t performance_metrics -t performance_snapshots -t performance_aggregates \
      -t token_usage_logs \
      | psql -h metrics-pg -p 5432 -U postgres -d mcp_metrics
  5. Stop application, configure URLs, restart

  6. Verify data integrity by comparing row counts (see the snippet after this list)

  7. (Optional) Drop migrated tables from primary after verification
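
For step 6, a quick row-count comparison (table list abbreviated; repeat for every migrated table):

# Compare row counts between primary and secondary
for t in tool_metrics tool_metrics_hourly token_usage_logs; do
  src=$(psql -h postgres   -U postgres -d mcp         -tAc "SELECT count(*) FROM $t")
  dst=$(psql -h metrics-pg -U postgres -d mcp_metrics -tAc "SELECT count(*) FROM $t")
  echo "$t: primary=$src secondary=$dst"
done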

Reverting to Single-DB

If migrated tables still exist in primary: Simply unset the optional URLs:

# Only DATABASE_URL configured = single-database mode
DATABASE_URL=postgresql+psycopg://postgres:pass@pgbouncer:6432/mcp
# METRICS_DATABASE_URL=  (unset)
# LOGS_DATABASE_URL=     (unset)
# OBSERVABILITY_DATABASE_URL=  (unset)

The application falls back to using DATABASE_URL for all tables.

If migrated tables were dropped from primary: You must restore data first:

# 1. Copy data from secondary back to primary
pg_dump -h metrics-pg -p 5432 -U postgres -d mcp_metrics --data-only \
  | psql -h postgres -p 5432 -U postgres -d mcp

# 2. Then unset URLs and restart

Warning: If you dropped tables from primary without a backup and the secondary databases are unavailable, data recovery is not possible. Always maintain backups before dropping tables.
