Dream easy while AI takes your on-call duty - Intelligent incident response and infrastructure management powered by Claude AI
- TODO List
- Project Overview
- Quick Start Guide
- Architecture & Technical Details
- Installation & Setup
- Configuration
- Features & Integrations
- Alert System
- Deployment
- Testing & Development
- CI/CD
- Security Considerations
- Performance Optimization
- Migration Guides
- Troubleshooting
- API Reference
- Contributing
- Issue: Agent logs were not appearing in the frontend UI despite showing "Connected" status
- Root Cause: Next.js rewrites buffer HTTP responses, breaking Server-Sent Events (SSE) streaming
- Solution: Changed SSE connections to connect directly to the backend API (`NEXT_PUBLIC_API_URL`) instead of going through Next.js rewrites
- Files Changed: `frontend/lib/hooks/use-agent-logs.ts`
- ✅ Real-time AI agent log streaming to frontend
- ✅ Test Event button with auto-resolve (prevents disturbing on-call engineers)
- ✅ Claude model configuration (claude-sonnet-4-5-20250929)
- ✅ PagerDuty webhook integration (V3 format)
- ✅ Kubernetes MCP server setup (requires manual kubeconfig on server)
- ✅ Docker deployment with proper Node.js for MCP
- ✅ SSE streaming from backend to frontend
- ⚠️ Incident Report Generation - needs proper implementation connected to AI agent output
- ⚠️ UI Revamp - needs polish and cleanup for production readiness
- ⚠️ Manual kubeconfig setup required on production server
See todo.md for detailed task tracking.
DreamOps is an intelligent AI-powered incident response and infrastructure management platform that automates on-call duties using Claude AI and Model Context Protocol (MCP) integrations.
- 🤖 AI-Powered Incident Response: Automatic alert analysis and remediation using Claude AI
- 🔧 YOLO Mode: Autonomous operation that executes fixes without human approval
- 🎯 Smart Alert Routing: Intelligent alert categorization and prioritization
- 🔌 MCP Integrations: Kubernetes, GitHub, PagerDuty, Notion, and Grafana
- 💳 Flexible Alert System: Free tier with 3 alerts/month
- 📊 Real-time Dashboard: Next.js frontend with live incident tracking
- 🚀 Cloud-Native: Docker, Terraform, and AWS deployment ready
- 🔒 Enterprise Security: Complete environment separation and secure secrets management
- 📈 Chaos to Insights: Turn chaos engineering results into actionable recommendations
- Backend: FastAPI, Python AsyncIO, uv package manager
- Frontend: Next.js 15, TypeScript, TailwindCSS, Drizzle ORM
- AI: Claude 3.5 Sonnet API, Model Context Protocol (MCP)
- Database: Neon PostgreSQL with environment separation
- Infrastructure: Docker, Terraform, AWS (ECS Fargate, S3, CloudFront)
- Authentication: Handled by Authentik reverse proxy (no built-in auth)
- Monitoring: CloudWatch, custom metrics and dashboards
- Docker and Docker Compose
- Anthropic API key for Claude
- Python 3.12+
- Node.js 18+
- PostgreSQL database (we use Neon)
# Clone the repository
git clone https://github.com/yourusername/oncall-agent.git
cd oncall-agent
# Start all services with Docker
./docker-dev.sh up
# Access the application:
# - Frontend: http://localhost:3000
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs

# Clone the repository
git clone https://github.com/yourusername/oncall-agent.git
cd oncall-agent
# Backend setup
cd backend
pip install uv
uv sync
cp .env.example .env.local
# Edit .env.local with your API keys
# Frontend setup
cd ../frontend
npm install
cp .env.example .env.local
# Edit .env.local with your database URL
# Start development servers
npm run dev

Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ │ │ │ │ │
│ Next.js │────▶│ FastAPI │────▶│ Claude AI │
│ Frontend │ │ Backend │ │ (Anthropic) │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ │ │ │ │ │
│ Neon │ │ MCP │ │ Alert │
│ PostgreSQL │ │ Integrations │ │ Processing │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
- Modular MCP Integrations: All integrations extend the `MCPIntegration` base class (see the sketch below)
- Async-First: All operations use async/await for concurrent processing
- Configuration-Driven: Pydantic for validation and environment variables
- Type-Safe: Extensive TypeScript and Python type hints
- Retry Logic: Built-in exponential backoff for network operations
- Environment Separation: Complete isolation between local/staging/production
- YOLO Mode: Autonomous remediation with safety mechanisms
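To make the first two principles concrete, here is a minimal sketch of an integration base class with built-in backoff. The class shape, method names, and retry parameters are illustrative assumptions, not the actual `MCPIntegration` interface:

```python
# Hypothetical sketch -- names and retry parameters are assumptions,
# not the actual MCPIntegration interface.
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class MCPIntegration(ABC):
    """Base class that all MCP integrations extend."""

    max_retries: int = 3

    @abstractmethod
    async def call_tool(self, tool: str, params: dict[str, Any]) -> Any:
        """Invoke a single MCP tool call."""

    async def call_with_retry(self, tool: str, params: dict[str, Any]) -> Any:
        """Retry network operations with exponential backoff."""
        delay = 1.0
        for attempt in range(self.max_retries):
            try:
                return await self.call_tool(tool, params)
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise
                await asyncio.sleep(delay)
                delay *= 2  # exponential backoff
```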
oncall-agent/
├── backend/
│ ├── src/oncall_agent/
│ │ ├── agent.py # Core agent logic
│ │ ├── agent_enhanced.py # Enhanced agent with YOLO mode
│ │ ├── agent_executor.py # Command execution engine
│ │ ├── api/ # FastAPI routes and schemas
│ │ ├── mcp_integrations/ # MCP integration modules
│ │ ├── services/ # Business logic services
│ │ └── strategies/ # Resolution strategies
│ ├── tests/ # Test files
│ └── Dockerfile # Production Docker image
├── frontend/
│ ├── app/ # Next.js app router
│ ├── components/ # React components
│ ├── lib/ # Utilities and database
│ └── scripts/ # Build and deployment scripts
├── terraform/ # Infrastructure as Code
└── docs/ # Documentation
- Install Dependencies:
cd backend
pip install uv
uv sync
- Configure Environment:
cp .env.example .env.local
Edit .env.local:
# Core Configuration
ANTHROPIC_API_KEY=your-anthropic-api-key
CLAUDE_MODEL=claude-3-5-sonnet-20241022
ENVIRONMENT=local
LOG_LEVEL=INFO
# Database
DATABASE_URL=postgresql://user:pass@host/dbname?sslmode=require
# API Server
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
# PagerDuty Integration
PAGERDUTY_ENABLED=true
PAGERDUTY_API_KEY=your-pagerduty-api-key
PAGERDUTY_WEBHOOK_SECRET=your-webhook-secret
# Kubernetes Integration
K8S_ENABLED=true
K8S_CONFIG_PATH=~/.kube/config
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=false
# Alert Settings
# Free tier: 3 alerts/month
- Run the Backend:
uv run python api_server.py
- Install Dependencies:
cd frontend
npm install
- Configure Environment:
cp .env.example .env.local
Edit .env.local:
# Database Configuration
POSTGRES_URL=postgresql://user:pass@host/dbname?sslmode=require
# API Configuration
NEXT_PUBLIC_API_URL=http://localhost:8000
NODE_ENV=development
AUTH_SECRET=development-secret-key
- Run Database Migrations:
npm run db:migrate:local
- Run the Frontend:
npm run dev

The project uses Neon PostgreSQL with complete environment separation:
- Create Neon Projects:
  - Create separate projects for local, staging, and production
  - Each environment has its own database instance
- Configure Connection Strings:
# Local (.env.local)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-xxx.region.neon.tech/neondb?sslmode=require
# Staging (.env.staging)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-yyy.region.neon.tech/neondb?sslmode=require
# Production (.env.production)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-zzz.region.neon.tech/neondb?sslmode=require
- Run Migrations:
# Local
npm run db:migrate:local
# Staging
npm run db:migrate:staging
# Production (requires confirmation)
npm run db:migrate:production
- Go to Services → Service Directory
- Select your service (e.g., `frai-backend`)
- Click Integrations tab
- Click Add Integration
- Search for Events API V2
- Copy the Integration Key (routing key for sending events)
- Go to Integrations → Generic Webhooks (v3)
- Click New Webhook
- Configure:
- Webhook URL: `http://oncall.frai.pro:8001/webhook/pagerduty`
- Description: DreamOps AI Agent Webhook
- Scope Type: Service
- Scope: Select your service (e.g., `frai-backend`)
- Event Subscription - Select these events:
  - ✅ incident.triggered
  - ✅ incident.acknowledged
  - ✅ incident.escalated
  - ✅ incident.resolved
  - ✅ incident.priority_updated
- Click Add Webhook
- Important: Copy the webhook secret provided (optional, for signature verification)
⚠️ IMPORTANT: The webhook URL is `/webhook/pagerduty` (NOT `/api/v1/webhook/pagerduty`). The webhook router is mounted at the root level, not under the `/api/v1` prefix.
# PagerDuty Configuration
PAGERDUTY_ENABLED=true
PAGERDUTY_API_KEY=your-api-key # Optional: For API operations (acknowledge, resolve)
PAGERDUTY_USER_EMAIL=your-email@company.com # Optional: Required if using API
PAGERDUTY_WEBHOOK_SECRET=your-webhook-secret # Optional: For signature verification

Note: The webhook integration works without an API key. The API key is only needed if you want DreamOps to acknowledge/resolve incidents in PagerDuty.
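When `PAGERDUTY_WEBHOOK_SECRET` is set, incoming webhooks can be checked against the `X-PagerDuty-Signature` header. A minimal verification sketch, assuming PagerDuty's documented v3 scheme (HMAC-SHA256 of the raw request body, hex-encoded with a `v1=` prefix):

```python
import hashlib
import hmac


def verify_pagerduty_signature(body: bytes, secret: str, signature_header: str) -> bool:
    """Check the v3 webhook signature (X-PagerDuty-Signature header).

    The header may carry several comma-separated signatures; accept the
    request if any "v1=..." entry matches our HMAC-SHA256 of the raw body.
    """
    expected = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    candidates = [s.strip() for s in signature_header.split(",")]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```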
# Replace with your Integration Key from step 1
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H "Content-Type: application/json" \
-d '{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "trigger",
"dedup_key": "test-'$(date +%s)'",
"payload": {
"summary": "Test: High CPU usage on production server",
"severity": "critical",
"source": "monitoring-system",
"custom_details": {
"cpu_usage": "95%",
"server": "prod-api-01"
}
}
}'

curl -X POST http://localhost:8000/webhook/pagerduty \
-H "Content-Type: application/json" \
-d '{
"event": {
"id": "test-event-123",
"event_type": "incident.triggered",
"resource_type": "incident",
"occurred_at": "2025-11-25T14:00:00Z",
"data": {
"id": "TEST123",
"type": "incident",
"status": "triggered",
"title": "Test Alert",
"service": {
"id": "PSVC123",
"summary": "Test Service"
},
"urgency": "high"
}
}
}'

For bare metal deployment on port 8001:
# Docker Compose Configuration
# Expose port 8001 for webhook delivery
ports:
- "8001:80" # nginx → backend routingImportant Notes:
- PagerDuty's "Send Test Event" button does NOT actually send webhooks - it only validates the URL format
- Use Events API v2 (Option A above) to trigger real incidents that will send webhooks
- Webhook URL must be publicly accessible (no localhost)
- Custom ports like 8001 are supported
- Both HTTP and HTTPS are supported
⚠️ CRITICAL: ALWAYS RESOLVE TEST INCIDENTS IMMEDIATELY!

Test incidents trigger real alerts to the on-call engineer. NEVER leave test incidents open.
After triggering a test, immediately resolve it:
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"<YOUR_KEY>","event_action":"resolve","dedup_key":"<SAME_DEDUP_KEY>"}'
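One way to make this habit automatic is a small helper that triggers and then resolves the same test event in one run. A sketch against the public Events API v2 (requires the `requests` package; the summary text and sleep interval are arbitrary):

```python
import time

import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def send_auto_resolved_test(routing_key: str) -> str:
    """Trigger a test incident, then resolve it using the same dedup_key."""
    dedup_key = f"test-{int(time.time())}"
    trigger = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": "Test: auto-resolved event",
            "severity": "critical",
            "source": "dreamops-test",
        },
    }
    requests.post(EVENTS_URL, json=trigger, timeout=10).raise_for_status()
    time.sleep(5)  # give the webhook pipeline a moment to fire
    resolve = {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }
    requests.post(EVENTS_URL, json=resolve, timeout=10).raise_for_status()
    return dedup_key
```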
DreamOps posts AI analysis results as thread replies under PagerDuty incident messages in Slack.
- Go to https://api.slack.com/apps
- Click "Create New App" → "From scratch"
- Name it (e.g., "ONCALL AI") and select your workspace
In OAuth & Permissions, add these Bot Token Scopes:
- `chat:write` - Post messages
- `channels:history` - Read channel messages (to find PagerDuty threads)
Click "Install to Workspace" and authorize the app.
- Bot Token: Copy from the OAuth & Permissions page (starts with `xoxb-`)
- Channel ID: Right-click channel → View details → Copy Channel ID
In Slack, go to your incidents channel and type:
/invite @ONCALL AI
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx # Optional fallback
SLACK_BOT_TOKEN=xoxb-your-bot-token
SLACK_CHANNEL_ID=C07A3NZAYSD
SLACK_CHANNEL=#oncall
SLACK_ENABLED=true

When AI analysis completes, a concise thread reply is posted:
🤖 AI Analysis
Cause: Out of Memory (OOM) - Pod exceeded memory limits
Recommended Fixes:
• kubectl get pods -n production --field-selector=status.phase=Failed
• kubectl rollout restart deployment api-service -n production
View Full Report (clickable link to incident)
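A rough sketch of how such a reply can be posted with `slack_sdk`. The way the parent PagerDuty message is located here (matching on the bot username) is an assumption, not necessarily how DreamOps does it:

```python
from slack_sdk import WebClient


def post_analysis_reply(token: str, channel_id: str, analysis: str) -> None:
    """Reply in the thread of the most recent PagerDuty message in the channel."""
    client = WebClient(token=token)
    history = client.conversations_history(channel=channel_id, limit=50)
    # Assumption: PagerDuty messages are identified by their bot username.
    parent = next(
        (m for m in history["messages"] if "pagerduty" in m.get("username", "").lower()),
        None,
    )
    if parent is None:
        # Fall back to a top-level message if no PagerDuty thread is found.
        client.chat_postMessage(channel=channel_id, text=analysis)
        return
    client.chat_postMessage(channel=channel_id, thread_ts=parent["ts"], text=analysis)
```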
⚠️ CURRENT STATUS: AI Agent is DISABLED on production via `AI_AGENT_ENABLED=false`

The AI agent is currently disabled. Incoming PagerDuty incidents are logged but NOT analyzed. To enable AI analysis, set `AI_AGENT_ENABLED=true` in the environment.
DreamOps provides two ways to control the AI agent:
# Master toggle - set to false to completely disable AI analysis
# When false, all incoming incidents are logged but NOT analyzed (no AI, no Slack messages)
AI_AGENT_ENABLED=false  # Currently DISABLED on production

This is the recommended way to disable AI in production. It takes precedence over the UI toggle.
The AI Control Panel (/ai-control) has a toggle switch to enable/disable AI analysis.
However, if AI_AGENT_ENABLED=false is set in the environment, the UI toggle has no effect.
- ENV VAR (`AI_AGENT_ENABLED`) - Checked first, takes precedence
- UI Toggle - Only checked if the ENV VAR is true
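The precedence boils down to a one-line check, sketched here (the default when the variable is unset is an assumption):

```python
import os


def ai_agent_effectively_enabled(ui_toggle_enabled: bool) -> bool:
    """ENV VAR takes precedence; the UI toggle only matters when it is true."""
    # Assumption: the agent defaults to enabled when the variable is unset.
    env_enabled = os.getenv("AI_AGENT_ENABLED", "true").lower() == "true"
    return env_enabled and ui_toggle_enabled
```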
# Check current status
curl http://oncall.frai.pro:8001/api/v1/agent/toggle
# Response shows both ENV VAR and UI status:
# {
# "ai_agent_enabled": false, # Effective status
# "env_var_enabled": false, # AI_AGENT_ENABLED env var
# "ui_toggle_enabled": true, # UI toggle (ignored if env_var_enabled is false)
# "disabled_by": "environment_variable",
# "message": "AI agent is DISABLED via environment variable (AI_AGENT_ENABLED=false)"
# }
# Toggle via UI (won't work if ENV VAR is false)
curl -X POST "http://oncall.frai.pro:8001/api/v1/agent/toggle?enabled=true"DreamOps offers multiple ways to integrate with Kubernetes clusters:
The basic integration provides comprehensive cluster management:
K8S_ENABLED=true
K8S_CONFIG_PATH=~/.kube/config
K8S_CONTEXT=your-context-name
K8S_NAMESPACE=default
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=false  # Set true for YOLO mode

Available Actions:
- Pod management (list, logs, describe, restart)
- Deployment operations (status, scale, rollback)
- Service monitoring and health checks
- Event retrieval and analysis
- Automated resolution strategies
- Resource constraint analysis
The enhanced integration adds intelligent cluster discovery:
K8S_ENHANCED_ENABLED=true
K8S_ENHANCED_MULTI_CONTEXT=true
K8S_ENHANCED_AUTO_DISCOVER=true
K8S_ENHANCED_PERMISSION_CHECK=true

Features:
- Automatic context discovery from kubeconfig
- Multi-context support
- Permission verification
- Frontend configuration UI
- Namespace auto-discovery
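Context auto-discovery amounts to reading the contexts out of the kubeconfig. A sketch using the official `kubernetes` Python client (the function name `discover_contexts` is illustrative):

```python
import os

from kubernetes import config


def discover_contexts(kubeconfig: str = "~/.kube/config") -> list[str]:
    """List every context defined in the kubeconfig, noting the active one."""
    contexts, active = config.list_kube_config_contexts(
        config_file=os.path.expanduser(kubeconfig)
    )
    print(f"active context: {active['name']}")
    return [c["name"] for c in contexts]
```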
For remote Kubernetes management via Agno:
AGNO_ENABLED=true
AGNO_GITHUB_TOKEN=ghp_your_github_token
AGNO_CONFIG_REPO=your-org/kubernetes-configs

Connection Methods:
- Service Account authentication
- Kubeconfig file authentication
- Client certificate authentication
Setup:
# Configure remote connection
POST /api/v1/agno/configure
{
"cluster_name": "production",
"auth_method": "service_account",
"credentials": {
"token": "your-sa-token",
"ca_cert": "base64-encoded-cert",
"server": "https://k8s-api.example.com"
}
}

Run Kubernetes operations via MCP protocol:
# Start MCP server
./start-kubernetes-mcp-server.sh
# Or run directly
uv run python -m src.oncall_agent.mcp_integrations.kubernetes_mcp_server

Available Operations:
- get_pods, describe_pod, get_pod_logs
- get_deployments, scale_deployment, restart_deployment
- get_services, get_endpoints
- apply_manifest, delete_resource
- get_events, get_nodes
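Assuming these operations are exposed through the same `call_tool` interface as the other integrations, a remediation step might look like this sketch (the response shape with `name` and `replicas` keys is an assumption):

```python
from typing import Any


async def scale_up(k8s_integration: Any, deployment: str, namespace: str) -> None:
    """Double the replica count of a deployment via MCP tools (illustrative)."""
    deployments = await k8s_integration.call_tool(
        "get_deployments", {"namespace": namespace}
    )
    # Assumption: each entry is a dict with "name" and "replicas" keys.
    current = next(d for d in deployments if d["name"] == deployment)
    await k8s_integration.call_tool(
        "scale_deployment",
        {"name": deployment, "namespace": namespace,
         "replicas": current["replicas"] * 2},
    )
```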
- Go to https://www.notion.so/my-integrations
- Click "New integration"
- Give it a name (e.g., "DreamOps AI Agent")
- Select the workspace
- Copy the "Internal Integration Token" (starts with
secret_)
- In Notion, create a new page
- Add a database (Table, Board, etc.)
- Add these properties:
- Title (default)
- Status (Select: Open, In Progress, Resolved)
- Priority (Select: Low, Medium, High, Critical)
- Created (Date)
- Description (Text)
- Open your database in Notion
- Copy the URL: `https://www.notion.so/your-workspace/[DATABASE_ID]?v=...`
- The DATABASE_ID is the 32-character string after the workspace name
- In your database, click "..." → "Add connections"
- Search for your integration name
- Click to add it
# Notion Integration
NOTION_TOKEN=secret_YOUR_INTEGRATION_TOKEN_HERE
NOTION_DATABASE_ID=YOUR_DATABASE_ID_HERE
NOTION_VERSION=2022-06-28

# Check integration status
curl http://localhost:8000/api/v1/integrations
# Test Notion specifically
curl -X POST http://localhost:8000/api/v1/integrations/notion/test

The Grafana integration provides metric retrieval and dashboard analysis:
GRAFANA_URL=http://localhost:3000
GRAFANA_API_KEY=your-grafana-api-key

Features:
- Metric retrieval
- Dashboard analysis
- Alert correlation
- Performance insights
Testing: The project includes a comprehensive Grafana test suite:
cd backend/tests/integrations/grafana
docker-compose up -d
pytest test_grafana_integration.py -v

DreamOps uses strict environment separation to ensure development features don't leak into production.
The system uses two environment variables to determine the current mode:
- `NODE_ENV` - Standard Node.js environment variable
  - `development` - Local development
  - `staging` - Staging environment
  - `production` - Production environment
- `NEXT_PUBLIC_DEV_MODE` - Explicit dev mode flag
  - `true` - Enable development features
  - `false` - Disable development features (default)
When NEXT_PUBLIC_DEV_MODE=true OR NODE_ENV=development:
- Automatic Pro Plan: All new users start with Pro plan
- All Integrations Enabled: No plan restrictions for integrations
- Unlimited Alerts: No alert limits in development
- Debug Logging: Enhanced logging for debugging
- Hot Reload: API server auto-reloads on file changes
.env.local # Local development (NEXT_PUBLIC_DEV_MODE=true)
.env.staging # Staging environment (NEXT_PUBLIC_DEV_MODE=false)
.env.production # Production environment (NEXT_PUBLIC_DEV_MODE=false)
The config loader checks for environment files in this order:
1. `.env.{NODE_ENV}` (e.g., .env.production)
2. `.env.local`
3. `.env`
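That lookup order can be reproduced with `python-dotenv`, as in this sketch; earlier files win because `override=False` leaves already-set variables untouched:

```python
import os

from dotenv import load_dotenv


def load_environment() -> None:
    """Load env files in priority order; the first file to set a key wins."""
    node_env = os.getenv("NODE_ENV", "development")
    for candidate in (f".env.{node_env}", ".env.local", ".env"):
        if os.path.exists(candidate):
            load_dotenv(candidate, override=False)
```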
Explicit Production Settings:
# .env.production
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false

Code Checks:
# Check if in development mode
is_dev_mode = (
os.getenv("NEXT_PUBLIC_DEV_MODE", "false").lower() == "true"
or os.getenv("NODE_ENV", "") == "development"
)

Local Development:
NODE_ENV=development ./start-dev-server.sh
# OR
NODE_ENV=development uv run python api_server.py

Production Deployment:
NODE_ENV=production uv run python api_server.py

AWS/Render Environment Variables:
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false
| Integration | Free/Starter | Pro/Enterprise | Dev Mode |
|---|---|---|---|
| Kubernetes | ✅ | ✅ | ✅ |
| PagerDuty | ✅ | ✅ | ✅ |
| Notion | ❌ | ✅ | ✅ |
| GitHub | ❌ | ✅ | ✅ |
| Grafana | ❌ | ✅ | ✅ |
| Datadog | ❌ | ✅ | ✅ |
# Check current environment
curl http://localhost:8000/api/v1/alert-tracking/usage/test-user | jq .account_tier
# Dev mode: "pro", Prod mode: "free"
# Check integration access
curl "http://localhost:8000/api/v1/alert-tracking/check-integration-access/test-user/notion"
# Dev mode: {"has_access": true, "reason": "Development mode - all integrations enabled"}
# Prod mode: {"has_access": false, "reason": "Integration 'notion' is not allowed on free plan"}Enhanced Kubernetes MCP with intelligent error detection and automated remediation:
Features:
- Real-time pod monitoring and management
- Automatic error detection (CrashLoopBackOff, OOM, ImagePullBackOff)
- Intelligent remediation strategies
- Resource usage analysis
- Deployment management and rollbacks
Example Usage:
# The agent automatically detects and fixes Kubernetes issues
alert = {
"service": "payment-service",
"description": "Pod CrashLoopBackOff detected",
"severity": "high"
}
# Agent will analyze logs, identify root cause, and execute fixes

Configuration:
GITHUB_TOKEN=ghp_your_github_token
GITHUB_MCP_SERVER_PATH=../../github-mcp-server/github-mcp-server

Features:
- Repository management
- Issue and PR creation
- Code search and analysis
- Automated fixes with commits
Features:
- Incident documentation
- Knowledge base updates
- Runbook management
- Post-mortem automation
Automated Documentation: Each incident is documented with:
- Incident title and ID
- Service affected
- Issue type and severity
- Timestamp
- Detailed metadata
- Investigation checklist
- Resolution placeholder
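For reference, creating such a page with the official `notion-client` package looks roughly like this; the property names mirror the database schema described above, but the exact payload DreamOps sends is not shown here:

```python
from notion_client import Client


def document_incident(token: str, database_id: str, incident: dict) -> None:
    """Create an incident page in the configured Notion database (sketch)."""
    notion = Client(auth=token)
    notion.pages.create(
        parent={"database_id": database_id},
        properties={
            "Title": {"title": [{"text": {"content": incident["title"]}}]},
            "Status": {"select": {"name": "Open"}},
            "Priority": {"select": {"name": incident.get("priority", "High")}},
            "Created": {"date": {"start": incident["created_at"]}},
            "Description": {
                "rich_text": [{"text": {"content": incident["description"]}}]
            },
        },
    )
```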
Features:
- Metric retrieval
- Dashboard analysis
- Alert correlation
- Performance insights
YOLO (You Only Launch Once) mode enables fully autonomous operation:
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=true
ALERT_AUTO_ACKNOWLEDGE=true

Safety Mechanisms:
- Action logging before execution
- Rollback capability
- Dry-run mode for testing
- Configurable action limits
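In spirit, those mechanisms combine into a wrapper like the following sketch (illustrative names, not the actual `agent_executor.py` code):

```python
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger("yolo")


class SafeExecutor:
    """Log every action, honour dry-run, and stop at a configurable limit."""

    def __init__(self, dry_run: bool = True, max_actions: int = 5):
        self.dry_run = dry_run
        self.max_actions = max_actions
        self.executed: list[str] = []

    async def execute(self, action: str, run: Callable[[], Awaitable[None]]) -> None:
        if len(self.executed) >= self.max_actions:
            raise RuntimeError(f"action limit ({self.max_actions}) reached")
        logger.info("about to execute: %s", action)  # logged before execution
        self.executed.append(action)
        if self.dry_run:
            return  # dry-run: record the action but change nothing
        await run()
```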
Testing YOLO Mode:
# Simulate Kubernetes failures
./fuck_kubernetes.sh [1-5|all|random|clean]
# Scenarios:
# 1 - Pod crashes (CrashLoopBackOff)
# 2 - Image pull errors (ImagePullBackOff)
# 3 - OOM kills
# 4 - Deployment failures
# 5 - Service unavailability

The platform turns chaos engineering results into actionable insights:
# From frontend incidents page - click "Nuke Infrastructure"
# Or run directly:
./fuck_kubernetes.sh [1-5|all|random]

- Chaos script creates Kubernetes issues
- Alerts sent to PagerDuty
- AI agent analyzes and remediates
- Creates detailed Notion documentation
Real-time Analysis:
POST /api/v1/insights/analyze-chaos

- Incidents from last 2 hours
- Services affected
- Issue types detected
- Specific recommendations
Infrastructure Health Report:
GET /api/v1/insights/report

- Total incidents over time
- Most problematic services
- Incident type distribution
- Trend analysis
- Prioritized recommendations
- Identifies recurring issues
- Detects time-based patterns
- Tracks incident frequency trends
- Builds knowledge base over time
DreamOps uses a freemium model:
- Free Tier: 3 alerts per month
- Starter: 50 alerts/month
- Professional: Unlimited alerts
- Enterprise: Custom limits
The system tracks alert usage per user:
interface AlertUsage {
user_id: string;
alerts_used: number;
alerts_limit: number;
billing_cycle_start: Date;
account_tier: 'free' | 'starter' | 'professional' | 'enterprise';
}

When the alert count exceeds the limit, users are prompted to upgrade.
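Server-side, the gate is a simple comparison, sketched below; representing unlimited tiers as `None` is an implementation assumption:

```python
# Limits taken from the pricing tiers above; None stands in for
# "unlimited" (professional) and "custom" (enterprise).
TIER_LIMITS: dict[str, int | None] = {
    "free": 3,
    "starter": 50,
    "professional": None,
    "enterprise": None,
}


def can_process_alert(tier: str, alerts_used: int) -> bool:
    """Return True if the team may consume another alert this billing cycle."""
    limit = TIER_LIMITS.get(tier, 0)
    return limit is None or alerts_used < limit
```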
# Start backend
cd backend && uv run python api_server.py
# Start frontend (in another terminal)
cd frontend && npm run dev

The project includes a complete Docker setup for local development with hot reload, automatic database setup, and all services configured.
# Start all services (frontend, backend, postgres, redis)
./docker-dev.sh up
# View logs
./docker-dev.sh logs
# Stop all services
./docker-dev.sh down

- Backend Service (`backend/Dockerfile.dev`):
  - Python 3.12 with uv package manager
  - FastAPI with hot reload enabled
  - Kubectl installed for Kubernetes operations
  - Development environment variables pre-configured
- Frontend Service (`frontend/Dockerfile.dev`):
  - Node.js 18 Alpine
  - Next.js development server
  - Hot reload enabled
  - Automatic database migrations
- PostgreSQL Database:
  - PostgreSQL 16 Alpine
  - Pre-configured with development credentials
  - Persistent volume for data
- Redis Cache:
  - Redis 7 Alpine
  - Used for caching and real-time features
# Build images
./docker-dev.sh build
# Rebuild from scratch
./docker-dev.sh rebuild
# Access database
./docker-dev.sh db
# Run migrations
./docker-dev.sh migrate
# Open Drizzle Studio
./docker-dev.sh studio
# Open shell in container
./docker-dev.sh shell backend
./docker-dev.sh shell frontend

# Run automated test suite
./test-docker-setup.sh
# Manual health checks
curl http://localhost:8000/health
curl http://localhost:8000/api/v1/payments/debug/environment

The Docker setup automatically configures:
- `NODE_ENV=development` - Development mode
- `NEXT_PUBLIC_DEV_MODE=true` - Enable all features
- `ALERTS_LIMIT=100` - Increased alert limit for development
- `CORS_ORIGINS` - Configured for frontend access
- Database connections pre-configured
Port conflicts:
# Kill processes using required ports
lsof -ti:8000 | xargs kill -9 # Backend
lsof -ti:3000 | xargs kill -9 # Frontend
lsof -ti:5432 | xargs kill -9 # PostgreSQL
lsof -ti:6379 | xargs kill -9  # Redis

View container logs:
docker-compose logs -f backend # Backend logs
docker-compose logs -f frontend # Frontend logs
docker-compose logs -f postgres  # Database logs

Reset everything:
# Stop and remove all containers, networks, volumes
docker-compose down -v
# Remove images too
docker-compose down --rmi all

# Build and run with Docker Compose
docker-compose -f docker-compose.production.yml up -d
# Environment-specific:
docker-compose -f docker-compose.staging.yml up -d

- Prerequisites:
  - AWS CLI configured
  - Terraform installed
  - Domain name (optional)
- Deploy Infrastructure:
cd terraform
terraform init
terraform plan -var-file=production.tfvars
terraform apply -var-file=production.tfvars
- Components Deployed:
  - ECS Fargate for backend
  - S3 + CloudFront for frontend
  - ALB for load balancing
  - RDS/Aurora for database (optional)
  - CloudWatch for monitoring
  - Secrets Manager for credentials
For frontend deployment via Amplify:
version: 1
applications:
- appRoot: frontend
frontend:
phases:
preBuild:
commands:
- npm ci
build:
commands:
- npm run build
artifacts:
baseDirectory: .next
files:
- '**/*'
cache:
paths:
- 'node_modules/**/*'

Deploy to Render.com for a managed cloud platform experience:
- Render.com account
- GitHub repository connected to Render
- Neon database (or other PostgreSQL)
- Create Web Service:
  - Name: `dreamops-backend`
  - Environment: Python
  - Build Command: `pip install uv && uv sync`
  - Start Command: `uv run python api_server.py`
- Environment Variables:
# Core
ANTHROPIC_API_KEY=sk-ant-xxx
CLAUDE_MODEL=claude-3-5-sonnet-20241022
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false
# Database
DATABASE_URL=postgresql://xxx
# API Configuration
API_HOST=0.0.0.0
API_PORT=10000
CORS_ORIGINS=https://your-frontend.onrender.com
# Integrations (as needed)
PAGERDUTY_API_KEY=xxx
K8S_ENABLED=true
- Advanced Settings:
  - Instance Type: Standard or higher
  - Health Check Path: `/health`
  - Auto-Deploy: Yes
- Create Static Site:
  - Name: `dreamops-frontend`
  - Build Command: `npm install && npm run build`
  - Publish Directory: `out`
- Environment Variables:
POSTGRES_URL=postgresql://xxx
NEXT_PUBLIC_API_URL=https://dreamops-backend.onrender.com
NODE_ENV=production
- Headers (`render.yaml`):
headers:
  - path: /*
    name: X-Frame-Options
    value: DENY
  - path: /*
    name: X-Content-Type-Options
    value: nosniff
- Verify Services:
curl https://dreamops-backend.onrender.com/health
curl https://dreamops-frontend.onrender.com
- Configure Webhooks: Update PagerDuty webhook URL to the Render backend URL
- Monitor Logs: Check the Render dashboard for deployment and runtime logs
The project includes CI/CD workflows:
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Deploy Backend
# Deploy to ECS
- name: Deploy Frontend
# Deploy to Amplify/S3

# Start all services
./docker-dev.sh up
# View logs in real-time
./docker-dev.sh logs
# Run database migrations
./docker-dev.sh migrate
# Open Drizzle Studio
./docker-dev.sh studio
# Test the setup
./test-docker-setup.sh

- Backend Development:
cd backend
uv run python main.py        # Run CLI
uv run python api_server.py  # Run API server
- Frontend Development:
cd frontend
npm run dev        # Start development server
npm run db:studio  # Open Drizzle Studio
- Testing Commands:
# Backend
uv run pytest tests/
uv run ruff check . --fix
uv run mypy . --ignore-missing-imports
# Frontend
npm run lint
npm run type-check
npm test
# Test PagerDuty webhook
curl -X POST http://localhost:8000/webhook/pagerduty \
-H "Content-Type: application/json" \
-d @test_webhook_payload.json
# Test Kubernetes integration
uv run python test_k8s_pagerduty_integration.py

# Create test namespace
kubectl create namespace fuck-kubernetes-test
# Run failure simulations
./fuck_kubernetes.sh all
# Monitor agent response
tail -f logs/agent.log
# Clean up
./fuck_kubernetes.sh clean

The project includes comprehensive Grafana integration tests:
cd backend/tests/integrations/grafana
# Start test environment
docker-compose up -d
# Run tests
pytest test_grafana_integration.py -v
# Performance benchmarks
pytest test_grafana_integration.py::test_performance -vTest Categories:
- Connection tests
- Metric retrieval tests
- Alert integration tests
- Performance benchmarks
- Error handling tests
- Backend CI (`.github/workflows/backend-ci.yml`):
  - Python linting and formatting
  - Type checking with mypy
  - Unit tests with pytest
  - Security scanning
  - Docker image build
- Frontend CI (`.github/workflows/frontend-ci.yml`):
  - ESLint and TypeScript checks
  - Unit and integration tests
  - Build verification
  - Bundle size analysis
- Security Scanning (`.github/workflows/security-scan.yml`):
  - Dependency vulnerability scanning
  - SAST with Semgrep
  - Container image scanning
  - Secret detection
- Deployment (`.github/workflows/deploy.yml`):
  - Environment-specific deployments
  - Database migrations
  - Health checks
  - Rollback capability
# GitHub Secrets Required
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
ANTHROPIC_API_KEY
NEON_DATABASE_URL_STAGING
NEON_DATABASE_URL_PROD
AMPLIFY_APP_ID

- Authentication is handled by Authentik reverse proxy
- No built-in authentication in the application
- User identity is provided via headers from Authentik
- Ensure Authentik is properly configured and secured
- Never commit `.env` files to version control
- Use different secrets for each environment
- Rotate API keys regularly
- Use least-privilege access for service accounts
- Each environment uses completely separate databases
- Connection strings include SSL requirements
- Database users have minimal required permissions
- Regular security updates for database instances
- RBAC permissions are minimally scoped
- Destructive operations require explicit enablement
- All kubectl commands are logged
- Namespace isolation for testing
- Request validation using Pydantic models
- Rate limiting implemented
- Comprehensive audit logging
- All I/O operations use async/await
- Concurrent processing of multiple alerts
- Connection pooling for database operations
- Efficient resource cleanup
- Configuration cached in memory
- Database query results cached where appropriate
- Static assets served from CDN
- Browser caching for frontend resources
- CloudWatch integration for metrics
- Custom dashboards for system health
- Alerting on error rates and performance
- Real-time log streaming
When migrating from direct kubectl commands to MCP-based operations:
| kubectl Command | MCP Tool | Parameters |
|---|---|---|
| `kubectl get pods` | `get_pods` | namespace, label_selector |
| `kubectl describe pod` | `describe_pod` | name, namespace |
| `kubectl logs` | `get_pod_logs` | name, namespace, container |
| `kubectl get deployments` | `get_deployments` | namespace |
| `kubectl scale` | `scale_deployment` | name, namespace, replicas |
| `kubectl rollout restart` | `restart_deployment` | name, namespace |
| `kubectl apply -f` | `apply_manifest` | manifest |
| `kubectl delete` | `delete_resource` | resource_type, name, namespace |
Before:
result = subprocess.run(["kubectl", "get", "pods", "-n", "default"], capture_output=True)
After:
result = await k8s_integration.call_tool("get_pods", {"namespace": "default"})
- Replace subprocess kubectl calls with MCP tools
- Update error handling for MCP responses
- Add proper async/await for MCP calls
- Update logging to use MCP context
- Test rollback scenarios
- Verify permissions work with MCP
Problem: TypeError: Object of type ResolutionAction is not JSON serializable
Solution:
# Add a to_dict() method to dataclasses that appear in API responses
from dataclasses import asdict
from typing import Any

def to_dict(self) -> dict[str, Any]:
    return asdict(self)

Problem: "Requester User Not Found"
Solution: Ensure PAGERDUTY_USER_EMAIL is valid in your PagerDuty account
Problem: Connection timeouts or SSL errors
Solution:
- Include `?sslmode=require` in the connection string
- Check that the Neon project is active
- Remove `&channel_binding=require` if present
Problem: kubectl connection test failed
Solution:
- Verify kubectl is installed
- Check that `~/.kube/config` exists
- Set the correct context: `K8S_CONTEXT=your-context`
Common Problems:
- "notion integration requires NOTION_TOKEN and NOTION_DATABASE_ID"
  - Make sure both environment variables are set
  - Restart the backend server after adding them
- "Failed to connect to Notion API"
  - Check your integration token is correct
  - Ensure the token starts with `secret_`
- "Database not found"
  - Verify the database ID is correct (32 characters)
  - Make sure you've shared the database with your integration
- "Insufficient permissions"
  - The integration needs read and write access
  - Re-share the database with the integration
# Check API health
curl http://localhost:8000/health
# View payment environment
curl http://localhost:8000/api/v1/payments/debug/environment
# Test database connections
cd frontend && npm run test:db
# View agent logs
tail -f backend/logs/agent.log
# Check integration status
curl http://localhost:8000/api/v1/integrations
# Monitor webhook traffic (if using ngrok)
curl http://localhost:4040/inspect/http

GET /health
Response: {"status": "healthy", "version": "1.0.0"}

POST /webhook/pagerduty
Content-Type: application/json
X-Webhook-Secret: your-secret
{
"event": {
"event_type": "incident.triggered",
"data": {...}
}
}

GET /api/v1/alert-tracking/usage/{team_id}
Response: {
"team_id": "team_123",
"alerts_used": 2,
"alerts_limit": 3,
"account_tier": "free"
}

POST /api/v1/alert-tracking/alerts
{
"team_id": "team_123",
"title": "Database connection failed",
"severity": "high"
}

GET /api/v1/alert-tracking/check-integration-access/{team_id}/{integration_name}
Response: {
"has_access": true,
"reason": "Integration allowed on pro plan"
}

GET /api/v1/integrations
Response: {
"integrations": [
{"name": "kubernetes", "enabled": true, "status": "connected"},
{"name": "pagerduty", "enabled": true, "status": "connected"},
{"name": "notion", "enabled": false, "status": "not_configured"}
]
}

POST /api/v1/integrations/{name}/test
Response: {
"success": true,
"message": "Integration test successful"
}

POST /api/v1/integrations/{name}/health
Response: {
"healthy": true,
"details": {...}
}

GET /api/v1/incidents?team_id={team_id}
Response: {
"incidents": [{
"id": "INC_123",
"title": "Pod CrashLoopBackOff",
"status": "resolved",
"created_at": "2024-01-01T00:00:00Z"
}]
}

GET /api/v1/dashboard/metrics?team_id={team_id}
Response: {
"total_incidents": 45,
"resolved_incidents": 40,
"avg_resolution_time": 300,
"uptime_percentage": 99.9
}

GET /api/v1/dashboard/ai-actions?team_id={team_id}
Response: {
"actions": [{
"id": "ACT_123",
"type": "pod_restart",
"status": "completed",
"timestamp": "2024-01-01T00:00:00Z"
}]
}

POST /api/v1/insights/analyze-chaos
Response: {
"incidents_created": 3,
"services_affected": ["oom-app", "bad-image-app"],
"insights": ["Memory issues detected", "Image pull failures"],
"recommendations": ["Increase memory limits", "Check registry access"]
}

GET /api/v1/insights/report
Response: Markdown report with analysis and recommendations

GET /api/v1/insights/analysis?service=oom-app
Response: {
"service": "oom-app",
"incident_count": 5,
"issue_types": ["oom"],
"recommendations": ["Urgent: Increase memory limits"]
}

POST /api/v1/agno/configure
{
"cluster_name": "production",
"auth_method": "service_account",
"credentials": {
"token": "your-sa-token",
"ca_cert": "base64-encoded-cert",
"server": "https://k8s-api.example.com"
}
}

POST /api/v1/agno/execute
{
"cluster": "production",
"operation": "get_pods",
"namespace": "default"
}

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
When working with this codebase:
- Always check the latest documentation before making changes
- Test all changes locally before suggesting them
- Follow established patterns for error handling and logging
- Ensure JSON serialization compatibility for all API responses
- Run the pre-commit checklist:
uv run ruff check . --fix
uv run mypy . --ignore-missing-imports
uv run pytest tests/
uv run python main.py
uv run python api_server.py
- Python: PEP 8, type hints, async/await
- TypeScript: ESLint config, proper types
- Git: Conventional commits
- Documentation: Clear, concise, with examples
MIT License - see LICENSE file for details
Built with ❤️ by the DreamOps Team
Dream easy while AI takes your on-call duty