Dream easy while AI takes your on-call duty - Intelligent incident response and infrastructure management powered by Claude AI
- TODO List
- Project Overview
- Quick Start Guide
- Architecture & Technical Details
- Installation & Setup
- Configuration
- Features & Integrations
- Alert System
- Deployment
- Testing & Development
- CI/CD
- Security Considerations
- Performance Optimization
- Migration Guides
- Troubleshooting
- API Reference
- Contributing
- Issue: Agent logs were not appearing in the frontend UI despite showing "Connected" status
- Root Cause: Next.js rewrites buffer HTTP responses, breaking Server-Sent Events (SSE) streaming
- Solution: Changed SSE connections to connect directly to the backend API (`NEXT_PUBLIC_API_URL`) instead of going through Next.js rewrites
- Files Changed: `frontend/lib/hooks/use-agent-logs.ts`
- ✅ Real-time AI agent log streaming to frontend
- ✅ Test Event button with auto-resolve (prevents disturbing on-call engineers)
- ✅ Claude model configuration (claude-sonnet-4-5-20250929)
- ✅ PagerDuty webhook integration (V3 format)
- ✅ Kubernetes MCP server setup (requires manual kubeconfig on server)
- ✅ Docker deployment with proper Node.js for MCP
- ✅ SSE streaming from backend to frontend
- ⚠️ Incident Report Generation - needs proper implementation connected to AI agent output
- ⚠️ UI Revamp - needs polish and cleanup for production readiness
- ⚠️ Manual kubeconfig setup required on production server
See todo.md for detailed task tracking.
DreamOps is an intelligent AI-powered incident response and infrastructure management platform that automates on-call duties using Claude AI and Model Context Protocol (MCP) integrations.
- 🤖 AI-Powered Incident Response: Automatic alert analysis and remediation using Claude AI
- 🔧 YOLO Mode: Autonomous operation that executes fixes without human approval
- 🎯 Smart Alert Routing: Intelligent alert categorization and prioritization
- 🔌 MCP Integrations: Kubernetes, GitHub, PagerDuty, Notion, and Grafana
- 💳 Flexible Alert System: Free tier with 3 alerts/month
- 📊 Real-time Dashboard: Next.js frontend with live incident tracking
- 🚀 Cloud-Native: Docker, Terraform, and AWS deployment ready
- 🔒 Enterprise Security: Complete environment separation and secure secrets management
- 📈 Chaos to Insights: Turn chaos engineering results into actionable recommendations
- Backend: FastAPI, Python AsyncIO, uv package manager
- Frontend: Next.js 15, TypeScript, TailwindCSS, Drizzle ORM
- AI: Claude 3.5 Sonnet API, Model Context Protocol (MCP)
- Database: Neon PostgreSQL with environment separation
- Infrastructure: Docker, Terraform, AWS (ECS Fargate, S3, CloudFront)
- Authentication: Handled by Authentik reverse proxy (no built-in auth)
- Monitoring: CloudWatch, custom metrics and dashboards
- Docker and Docker Compose
- Anthropic API key for Claude
- Python 3.12+
- Node.js 18+
- PostgreSQL database (we use Neon)
# Clone the repository
git clone https://github.com/yourusername/oncall-agent.git
cd oncall-agent
# Start all services with Docker
./docker-dev.sh up
# Access the application:
# - Frontend: http://localhost:3000
# - Backend API: http://localhost:8000
# - API Docs: http://localhost:8000/docs

# Clone the repository
git clone https://github.com/yourusername/oncall-agent.git
cd oncall-agent
# Backend setup
cd backend
pip install uv
uv sync
cp .env.example .env.local
# Edit .env.local with your API keys
# Frontend setup
cd ../frontend
npm install
cp .env.example .env.local
# Edit .env.local with your database URL
# Start development servers
npm run dev

Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ │ │ │ │ │
│ Next.js │────▶│ FastAPI │────▶│ Claude AI │
│ Frontend │ │ Backend │ │ (Anthropic) │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ │ │ │ │ │
│ Neon │ │ MCP │ │ Alert │
│ PostgreSQL │ │ Integrations │ │ Processing │
│ │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
- Modular MCP Integrations: All integrations extend the `MCPIntegration` base class (see the sketch below)
- Async-First: All operations use async/await for concurrent processing
- Configuration-Driven: Pydantic for validation and environment variables
- Type-Safe: Extensive TypeScript and Python type hints
- Retry Logic: Built-in exponential backoff for network operations
- Environment Separation: Complete isolation between local/staging/production
- YOLO Mode: Autonomous remediation with safety mechanisms
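To make the first two principles concrete, here is a minimal sketch of an integration base class with built-in backoff. The class shape, method names, and retry parameters are illustrative assumptions, not the actual `MCPIntegration` interface:

```python
# Hypothetical sketch -- names and retry parameters are assumptions,
# not the actual MCPIntegration interface.
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class MCPIntegration(ABC):
    """Base class that all MCP integrations extend."""

    max_retries: int = 3

    @abstractmethod
    async def call_tool(self, tool: str, params: dict[str, Any]) -> Any:
        """Invoke a single MCP tool call."""

    async def call_with_retry(self, tool: str, params: dict[str, Any]) -> Any:
        """Retry network operations with exponential backoff."""
        delay = 1.0
        for attempt in range(self.max_retries):
            try:
                return await self.call_tool(tool, params)
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise
                await asyncio.sleep(delay)
                delay *= 2  # exponential backoff
```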
oncall-agent/
├── backend/
│ ├── src/oncall_agent/
│ │ ├── agent.py # Core agent logic
│ │ ├── agent_enhanced.py # Enhanced agent with YOLO mode
│ │ ├── agent_executor.py # Command execution engine
│ │ ├── api/ # FastAPI routes and schemas
│ │ ├── mcp_integrations/ # MCP integration modules
│ │ ├── services/ # Business logic services
│ │ └── strategies/ # Resolution strategies
│ ├── tests/ # Test files
│ └── Dockerfile # Production Docker image
├── frontend/
│ ├── app/ # Next.js app router
│ ├── components/ # React components
│ ├── lib/ # Utilities and database
│ └── scripts/ # Build and deployment scripts
├── terraform/ # Infrastructure as Code
└── docs/ # Documentation
- Install Dependencies:
cd backend
pip install uv
uv sync
- Configure Environment:
cp .env.example .env.local
Edit .env.local:
# Core Configuration
ANTHROPIC_API_KEY=your-anthropic-api-key
CLAUDE_MODEL=claude-3-5-sonnet-20241022
ENVIRONMENT=local
LOG_LEVEL=INFO
# Database
DATABASE_URL=postgresql://user:pass@host/dbname?sslmode=require
# API Server
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=true
# PagerDuty Integration
PAGERDUTY_ENABLED=true
PAGERDUTY_API_KEY=your-pagerduty-api-key
PAGERDUTY_WEBHOOK_SECRET=your-webhook-secret
# Kubernetes Integration
K8S_ENABLED=true
K8S_CONFIG_PATH=~/.kube/config
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=false
# Alert Settings
# Free tier: 3 alerts/month
- Run the Backend:
uv run python api_server.py
- Install Dependencies:
cd frontend
npm install
- Configure Environment:
cp .env.example .env.local
Edit .env.local:
# Database Configuration
POSTGRES_URL=postgresql://user:pass@host/dbname?sslmode=require
# API Configuration
NEXT_PUBLIC_API_URL=http://localhost:8000
NODE_ENV=development
AUTH_SECRET=development-secret-key
- Run Database Migrations:
npm run db:migrate:local
- Run the Frontend:
npm run dev

The project uses Neon PostgreSQL with complete environment separation:
- Create Neon Projects:
  - Create separate projects for local, staging, and production
  - Each environment has its own database instance
- Configure Connection Strings:
# Local (.env.local)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-xxx.region.neon.tech/neondb?sslmode=require
# Staging (.env.staging)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-yyy.region.neon.tech/neondb?sslmode=require
# Production (.env.production)
POSTGRES_URL=postgresql://neondb_owner:xxx@ep-zzz.region.neon.tech/neondb?sslmode=require
- Run Migrations:
# Local
npm run db:migrate:local
# Staging
npm run db:migrate:staging
# Production (requires confirmation)
npm run db:migrate:production
- Go to Services → Service Directory
- Select your service (e.g., `frai-backend`)
- Click Integrations tab
- Click Add Integration
- Search for Events API V2
- Copy the Integration Key (routing key for sending events)
- Go to Integrations → Generic Webhooks (v3)
- Click New Webhook
- Configure:
- Webhook URL: `http://oncall.frai.pro:8001/webhook/pagerduty`
- Description: DreamOps AI Agent Webhook
- Scope Type: Service
- Scope: Select your service (e.g., `frai-backend`)
- Event Subscription - Select these events:
  - ✅ incident.triggered
  - ✅ incident.acknowledged
  - ✅ incident.escalated
  - ✅ incident.resolved
  - ✅ incident.priority_updated
- Click Add Webhook
- Important: Copy the webhook secret provided (optional, for signature verification)
⚠️ IMPORTANT: The webhook URL is `/webhook/pagerduty` (NOT `/api/v1/webhook/pagerduty`). The webhook router is mounted at the root level, not under the `/api/v1` prefix.
# PagerDuty Configuration
PAGERDUTY_ENABLED=true
PAGERDUTY_API_KEY=your-api-key # Optional: For API operations (acknowledge, resolve)
PAGERDUTY_USER_EMAIL=your-email@company.com # Optional: Required if using API
PAGERDUTY_WEBHOOK_SECRET=your-webhook-secret # Optional: For signature verification

Note: The webhook integration works without an API key. The API key is only needed if you want DreamOps to acknowledge/resolve incidents in PagerDuty.
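When `PAGERDUTY_WEBHOOK_SECRET` is set, incoming webhooks can be checked against the `X-PagerDuty-Signature` header. A minimal verification sketch, assuming PagerDuty's documented v3 scheme (HMAC-SHA256 of the raw request body, hex-encoded with a `v1=` prefix):

```python
import hashlib
import hmac


def verify_pagerduty_signature(body: bytes, secret: str, signature_header: str) -> bool:
    """Check the v3 webhook signature (X-PagerDuty-Signature header).

    The header may carry several comma-separated signatures; accept the
    request if any "v1=..." entry matches our HMAC-SHA256 of the raw body.
    """
    expected = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    candidates = [s.strip() for s in signature_header.split(",")]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```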
# Replace with your Integration Key from step 1
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H "Content-Type: application/json" \
-d '{
"routing_key": "YOUR_INTEGRATION_KEY",
"event_action": "trigger",
"dedup_key": "test-'$(date +%s)'",
"payload": {
"summary": "Test: High CPU usage on production server",
"severity": "critical",
"source": "monitoring-system",
"custom_details": {
"cpu_usage": "95%",
"server": "prod-api-01"
}
}
}'

curl -X POST http://localhost:8000/webhook/pagerduty \
-H "Content-Type: application/json" \
-d '{
"event": {
"id": "test-event-123",
"event_type": "incident.triggered",
"resource_type": "incident",
"occurred_at": "2025-11-25T14:00:00Z",
"data": {
"id": "TEST123",
"type": "incident",
"status": "triggered",
"title": "Test Alert",
"service": {
"id": "PSVC123",
"summary": "Test Service"
},
"urgency": "high"
}
}
}'

For bare metal deployment on port 8001:
# Docker Compose Configuration
# Expose port 8001 for webhook delivery
ports:
- "8001:80" # nginx → backend routingImportant Notes:
- PagerDuty's "Send Test Event" button does NOT actually send webhooks - it only validates the URL format
- Use Events API v2 (Option A above) to trigger real incidents that will send webhooks
- Webhook URL must be publicly accessible (no localhost)
- Custom ports like 8001 are supported
- Both HTTP and HTTPS are supported
⚠️ CRITICAL: ALWAYS RESOLVE TEST INCIDENTS IMMEDIATELY!

Test incidents trigger real alerts to the on-call engineer. NEVER leave test incidents open.
After triggering a test, immediately resolve it:
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"<YOUR_KEY>","event_action":"resolve","dedup_key":"<SAME_DEDUP_KEY>"}'
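One way to make this habit automatic is a small helper that triggers and then resolves the same test event in one run. A sketch against the public Events API v2 (requires the `requests` package; the summary text and sleep interval are arbitrary):

```python
import time

import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def send_auto_resolved_test(routing_key: str) -> str:
    """Trigger a test incident, then resolve it using the same dedup_key."""
    dedup_key = f"test-{int(time.time())}"
    trigger = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": "Test: auto-resolved event",
            "severity": "critical",
            "source": "dreamops-test",
        },
    }
    requests.post(EVENTS_URL, json=trigger, timeout=10).raise_for_status()
    time.sleep(5)  # give the webhook pipeline a moment to fire
    resolve = {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }
    requests.post(EVENTS_URL, json=resolve, timeout=10).raise_for_status()
    return dedup_key
```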
DreamOps posts AI analysis results as thread replies under PagerDuty incident messages in Slack.
- Go to https://api.slack.com/apps
- Click "Create New App" → "From scratch"
- Name it (e.g., "ONCALL AI") and select your workspace
In OAuth & Permissions, add these Bot Token Scopes:
- `chat:write` - Post messages
- `channels:history` - Read channel messages (to find PagerDuty threads)
Click "Install to Workspace" and authorize the app.
- Bot Token: Copy from the OAuth & Permissions page (starts with `xoxb-`)
- Channel ID: Right-click channel → View details → Copy Channel ID
In Slack, go to your incidents channel and type:
/invite @ONCALL AI
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxx # Optional fallback
SLACK_BOT_TOKEN=xoxb-your-bot-token
SLACK_CHANNEL_ID=C07A3NZAYSD
SLACK_CHANNEL=#oncall
SLACK_ENABLED=true

When AI analysis completes, a concise thread reply is posted:
🤖 AI Analysis
Cause: Out of Memory (OOM) - Pod exceeded memory limits
Recommended Fixes:
• kubectl get pods -n production --field-selector=status.phase=Failed
• kubectl rollout restart deployment api-service -n production
View Full Report (clickable link to incident)
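A rough sketch of how such a reply can be posted with `slack_sdk`. The way the parent PagerDuty message is located here (matching on the bot username) is an assumption, not necessarily how DreamOps does it:

```python
from slack_sdk import WebClient


def post_analysis_reply(token: str, channel_id: str, analysis: str) -> None:
    """Reply in the thread of the most recent PagerDuty message in the channel."""
    client = WebClient(token=token)
    history = client.conversations_history(channel=channel_id, limit=50)
    # Assumption: PagerDuty messages are identified by their bot username.
    parent = next(
        (m for m in history["messages"] if "pagerduty" in m.get("username", "").lower()),
        None,
    )
    if parent is None:
        # Fall back to a top-level message if no PagerDuty thread is found.
        client.chat_postMessage(channel=channel_id, text=analysis)
        return
    client.chat_postMessage(channel=channel_id, thread_ts=parent["ts"], text=analysis)
```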
⚠️ CURRENT STATUS: AI Agent is DISABLED on production via `AI_AGENT_ENABLED=false`

The AI agent is currently disabled. Incoming PagerDuty incidents are logged but NOT analyzed. To enable AI analysis, set `AI_AGENT_ENABLED=true` in the environment.
DreamOps provides two ways to control the AI agent:
# Master toggle - set to false to completely disable AI analysis
# When false, all incoming incidents are logged but NOT analyzed (no AI, no Slack messages)
AI_AGENT_ENABLED=false  # Currently DISABLED on production

This is the recommended way to disable AI in production. It takes precedence over the UI toggle.
The AI Control Panel (/ai-control) has a toggle switch to enable/disable AI analysis.
However, if AI_AGENT_ENABLED=false is set in the environment, the UI toggle has no effect.
- ENV VAR (`AI_AGENT_ENABLED`) - Checked first, takes precedence
- UI Toggle - Only checked if the ENV VAR is true
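The precedence boils down to a one-line check, sketched here (the default when the variable is unset is an assumption):

```python
import os


def ai_agent_effectively_enabled(ui_toggle_enabled: bool) -> bool:
    """ENV VAR takes precedence; the UI toggle only matters when it is true."""
    # Assumption: the agent defaults to enabled when the variable is unset.
    env_enabled = os.getenv("AI_AGENT_ENABLED", "true").lower() == "true"
    return env_enabled and ui_toggle_enabled
```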
# Check current status
curl http://oncall.frai.pro:8001/api/v1/agent/toggle
# Response shows both ENV VAR and UI status:
# {
# "ai_agent_enabled": false, # Effective status
# "env_var_enabled": false, # AI_AGENT_ENABLED env var
# "ui_toggle_enabled": true, # UI toggle (ignored if env_var_enabled is false)
# "disabled_by": "environment_variable",
# "message": "AI agent is DISABLED via environment variable (AI_AGENT_ENABLED=false)"
# }
# Toggle via UI (won't work if ENV VAR is false)
curl -X POST "http://oncall.frai.pro:8001/api/v1/agent/toggle?enabled=true"DreamOps offers multiple ways to integrate with Kubernetes clusters:
The basic integration provides comprehensive cluster management:
K8S_ENABLED=true
K8S_CONFIG_PATH=~/.kube/config
K8S_CONTEXT=your-context-name
K8S_NAMESPACE=default
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=false  # Set true for YOLO mode

Available Actions:
- Pod management (list, logs, describe, restart)
- Deployment operations (status, scale, rollback)
- Service monitoring and health checks
- Event retrieval and analysis
- Automated resolution strategies
- Resource constraint analysis
The enhanced integration adds intelligent cluster discovery:
K8S_ENHANCED_ENABLED=true
K8S_ENHANCED_MULTI_CONTEXT=true
K8S_ENHANCED_AUTO_DISCOVER=true
K8S_ENHANCED_PERMISSION_CHECK=true

Features:
- Automatic context discovery from kubeconfig
- Multi-context support
- Permission verification
- Frontend configuration UI
- Namespace auto-discovery
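Context auto-discovery amounts to reading the contexts out of the kubeconfig. A sketch using the official `kubernetes` Python client (the function name `discover_contexts` is illustrative):

```python
import os

from kubernetes import config


def discover_contexts(kubeconfig: str = "~/.kube/config") -> list[str]:
    """List every context defined in the kubeconfig, noting the active one."""
    contexts, active = config.list_kube_config_contexts(
        config_file=os.path.expanduser(kubeconfig)
    )
    print(f"active context: {active['name']}")
    return [c["name"] for c in contexts]
```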
For remote Kubernetes management via Agno:
AGNO_ENABLED=true
AGNO_GITHUB_TOKEN=ghp_your_github_token
AGNO_CONFIG_REPO=your-org/kubernetes-configs

Connection Methods:
- Service Account authentication
- Kubeconfig file authentication
- Client certificate authentication
Setup:
# Configure remote connection
POST /api/v1/agno/configure
{
"cluster_name": "production",
"auth_method": "service_account",
"credentials": {
"token": "your-sa-token",
"ca_cert": "base64-encoded-cert",
"server": "https://k8s-api.example.com"
}
}

Run Kubernetes operations via MCP protocol:
# Start MCP server
./start-kubernetes-mcp-server.sh
# Or run directly
uv run python -m src.oncall_agent.mcp_integrations.kubernetes_mcp_server

Available Operations:
- get_pods, describe_pod, get_pod_logs
- get_deployments, scale_deployment, restart_deployment
- get_services, get_endpoints
- apply_manifest, delete_resource
- get_events, get_nodes
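Assuming these operations are exposed through the same `call_tool` interface as the other integrations, a remediation step might look like this sketch (the response shape with `name` and `replicas` keys is an assumption):

```python
from typing import Any


async def scale_up(k8s_integration: Any, deployment: str, namespace: str) -> None:
    """Double the replica count of a deployment via MCP tools (illustrative)."""
    deployments = await k8s_integration.call_tool(
        "get_deployments", {"namespace": namespace}
    )
    # Assumption: each entry is a dict with "name" and "replicas" keys.
    current = next(d for d in deployments if d["name"] == deployment)
    await k8s_integration.call_tool(
        "scale_deployment",
        {"name": deployment, "namespace": namespace,
         "replicas": current["replicas"] * 2},
    )
```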
- Go to https://www.notion.so/my-integrations
- Click "New integration"
- Give it a name (e.g., "DreamOps AI Agent")
- Select the workspace
- Copy the "Internal Integration Token" (starts with
secret_)
- In Notion, create a new page
- Add a database (Table, Board, etc.)
- Add these properties:
- Title (default)
- Status (Select: Open, In Progress, Resolved)
- Priority (Select: Low, Medium, High, Critical)
- Created (Date)
- Description (Text)
- Open your database in Notion
- Copy the URL: `https://www.notion.so/your-workspace/[DATABASE_ID]?v=...`
- The DATABASE_ID is the 32-character string after the workspace name
- In your database, click "..." → "Add connections"
- Search for your integration name
- Click to add it
# Notion Integration
NOTION_TOKEN=secret_YOUR_INTEGRATION_TOKEN_HERE
NOTION_DATABASE_ID=YOUR_DATABASE_ID_HERE
NOTION_VERSION=2022-06-28

# Check integration status
curl http://localhost:8000/api/v1/integrations
# Test Notion specifically
curl -X POST http://localhost:8000/api/v1/integrations/notion/test

The Grafana integration provides metric retrieval and dashboard analysis:
GRAFANA_URL=http://localhost:3000
GRAFANA_API_KEY=your-grafana-api-key

Features:
- Metric retrieval
- Dashboard analysis
- Alert correlation
- Performance insights
Testing: The project includes a comprehensive Grafana test suite:
cd backend/tests/integrations/grafana
docker-compose up -d
pytest test_grafana_integration.py -v

DreamOps uses strict environment separation to ensure development features don't leak into production.
The system uses two environment variables to determine the current mode:
- `NODE_ENV` - Standard Node.js environment variable
  - `development` - Local development
  - `staging` - Staging environment
  - `production` - Production environment
- `NEXT_PUBLIC_DEV_MODE` - Explicit dev mode flag
  - `true` - Enable development features
  - `false` - Disable development features (default)
When NEXT_PUBLIC_DEV_MODE=true OR NODE_ENV=development:
- Automatic Pro Plan: All new users start with Pro plan
- All Integrations Enabled: No plan restrictions for integrations
- Unlimited Alerts: No alert limits in development
- Debug Logging: Enhanced logging for debugging
- Hot Reload: API server auto-reloads on file changes
.env.local # Local development (NEXT_PUBLIC_DEV_MODE=true)
.env.staging # Staging environment (NEXT_PUBLIC_DEV_MODE=false)
.env.production # Production environment (NEXT_PUBLIC_DEV_MODE=false)
The config loader checks for environment files in this order:
1. `.env.{NODE_ENV}` (e.g., .env.production)
2. `.env.local`
3. `.env`
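That lookup order can be reproduced with `python-dotenv`, as in this sketch; earlier files win because `override=False` leaves already-set variables untouched:

```python
import os

from dotenv import load_dotenv


def load_environment() -> None:
    """Load env files in priority order; the first file to set a key wins."""
    node_env = os.getenv("NODE_ENV", "development")
    for candidate in (f".env.{node_env}", ".env.local", ".env"):
        if os.path.exists(candidate):
            load_dotenv(candidate, override=False)
```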
Explicit Production Settings:
# .env.production
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false

Code Checks:
# Check if in development mode
is_dev_mode = (
os.getenv("NEXT_PUBLIC_DEV_MODE", "false").lower() == "true"
or os.getenv("NODE_ENV", "") == "development"
)

Local Development:
NODE_ENV=development ./start-dev-server.sh
# OR
NODE_ENV=development uv run python api_server.py

Production Deployment:
NODE_ENV=production uv run python api_server.py

AWS/Render Environment Variables:
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false
| Integration | Free/Starter | Pro/Enterprise | Dev Mode |
|---|---|---|---|
| Kubernetes | ✅ | ✅ | ✅ |
| PagerDuty | ✅ | ✅ | ✅ |
| Notion | ❌ | ✅ | ✅ |
| GitHub | ❌ | ✅ | ✅ |
| Grafana | ❌ | ✅ | ✅ |
| Datadog | ❌ | ✅ | ✅ |
# Check current environment
curl http://localhost:8000/api/v1/alert-tracking/usage/test-user | jq .account_tier
# Dev mode: "pro", Prod mode: "free"
# Check integration access
curl "http://localhost:8000/api/v1/alert-tracking/check-integration-access/test-user/notion"
# Dev mode: {"has_access": true, "reason": "Development mode - all integrations enabled"}
# Prod mode: {"has_access": false, "reason": "Integration 'notion' is not allowed on free plan"}Enhanced Kubernetes MCP with intelligent error detection and automated remediation:
Features:
- Real-time pod monitoring and management
- Automatic error detection (CrashLoopBackOff, OOM, ImagePullBackOff)
- Intelligent remediation strategies
- Resource usage analysis
- Deployment management and rollbacks
Example Usage:
# The agent automatically detects and fixes Kubernetes issues
alert = {
"service": "payment-service",
"description": "Pod CrashLoopBackOff detected",
"severity": "high"
}
# Agent will analyze logs, identify root cause, and execute fixes

Configuration:
GITHUB_TOKEN=ghp_your_github_token
GITHUB_MCP_SERVER_PATH=../../github-mcp-server/github-mcp-server

Features:
- Repository management
- Issue and PR creation
- Code search and analysis
- Automated fixes with commits
Features:
- Incident documentation
- Knowledge base updates
- Runbook management
- Post-mortem automation
Automated Documentation: Each incident is documented with:
- Incident title and ID
- Service affected
- Issue type and severity
- Timestamp
- Detailed metadata
- Investigation checklist
- Resolution placeholder
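For reference, creating such a page with the official `notion-client` package looks roughly like this; the property names mirror the database schema described above, but the exact payload DreamOps sends is not shown here:

```python
from notion_client import Client


def document_incident(token: str, database_id: str, incident: dict) -> None:
    """Create an incident page in the configured Notion database (sketch)."""
    notion = Client(auth=token)
    notion.pages.create(
        parent={"database_id": database_id},
        properties={
            "Title": {"title": [{"text": {"content": incident["title"]}}]},
            "Status": {"select": {"name": "Open"}},
            "Priority": {"select": {"name": incident.get("priority", "High")}},
            "Created": {"date": {"start": incident["created_at"]}},
            "Description": {
                "rich_text": [{"text": {"content": incident["description"]}}]
            },
        },
    )
```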
Features:
- Metric retrieval
- Dashboard analysis
- Alert correlation
- Performance insights
YOLO (You Only Launch Once) mode enables fully autonomous operation:
K8S_ENABLE_DESTRUCTIVE_OPERATIONS=true
ALERT_AUTO_ACKNOWLEDGE=true

Safety Mechanisms:
- Action logging before execution
- Rollback capability
- Dry-run mode for testing
- Configurable action limits
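In spirit, those mechanisms combine into a wrapper like the following sketch (illustrative names, not the actual `agent_executor.py` code):

```python
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger("yolo")


class SafeExecutor:
    """Log every action, honour dry-run, and stop at a configurable limit."""

    def __init__(self, dry_run: bool = True, max_actions: int = 5):
        self.dry_run = dry_run
        self.max_actions = max_actions
        self.executed: list[str] = []

    async def execute(self, action: str, run: Callable[[], Awaitable[None]]) -> None:
        if len(self.executed) >= self.max_actions:
            raise RuntimeError(f"action limit ({self.max_actions}) reached")
        logger.info("about to execute: %s", action)  # logged before execution
        self.executed.append(action)
        if self.dry_run:
            return  # dry-run: record the action but change nothing
        await run()
```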
Testing YOLO Mode:
# Simulate Kubernetes failures
./fuck_kubernetes.sh [1-5|all|random|clean]
# Scenarios:
# 1 - Pod crashes (CrashLoopBackOff)
# 2 - Image pull errors (ImagePullBackOff)
# 3 - OOM kills
# 4 - Deployment failures
# 5 - Service unavailability

The platform turns chaos engineering results into actionable insights:
# From frontend incidents page - click "Nuke Infrastructure"
# Or run directly:
./fuck_kubernetes.sh [1-5|all|random]

- Chaos script creates Kubernetes issues
- Alerts sent to PagerDuty
- AI agent analyzes and remediates
- Creates detailed Notion documentation
Real-time Analysis:
POST /api/v1/insights/analyze-chaos

- Incidents from last 2 hours
- Services affected
- Issue types detected
- Specific recommendations
Infrastructure Health Report:
GET /api/v1/insights/report

- Total incidents over time
- Most problematic services
- Incident type distribution
- Trend analysis
- Prioritized recommendations
- Identifies recurring issues
- Detects time-based patterns
- Tracks incident frequency trends
- Builds knowledge base over time
DreamOps uses a freemium model:
- Free Tier: 3 alerts per month
- Starter: 50 alerts/month
- Professional: Unlimited alerts
- Enterprise: Custom limits
The system tracks alert usage per user:
interface AlertUsage {
user_id: string;
alerts_used: number;
alerts_limit: number;
billing_cycle_start: Date;
account_tier: 'free' | 'starter' | 'professional' | 'enterprise';
}

When the alert count exceeds the limit, users are prompted to upgrade.
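Server-side, the gate is a simple comparison, sketched below; representing unlimited tiers as `None` is an implementation assumption:

```python
# Limits taken from the pricing tiers above; None stands in for
# "unlimited" (professional) and "custom" (enterprise).
TIER_LIMITS: dict[str, int | None] = {
    "free": 3,
    "starter": 50,
    "professional": None,
    "enterprise": None,
}


def can_process_alert(tier: str, alerts_used: int) -> bool:
    """Return True if the team may consume another alert this billing cycle."""
    limit = TIER_LIMITS.get(tier, 0)
    return limit is None or alerts_used < limit
```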
# Start backend
cd backend && uv run python api_server.py
# Start frontend (in another terminal)
cd frontend && npm run dev

The project includes a complete Docker setup for local development with hot reload, automatic database setup, and all services configured.
# Start all services (frontend, backend, postgres, redis)
./docker-dev.sh up
# View logs
./docker-dev.sh logs
# Stop all services
./docker-dev.sh down

- Backend Service (`backend/Dockerfile.dev`):
  - Python 3.12 with uv package manager
  - FastAPI with hot reload enabled
  - Kubectl installed for Kubernetes operations
  - Development environment variables pre-configured
- Frontend Service (`frontend/Dockerfile.dev`):
  - Node.js 18 Alpine
  - Next.js development server
  - Hot reload enabled
  - Automatic database migrations
- PostgreSQL Database:
  - PostgreSQL 16 Alpine
  - Pre-configured with development credentials
  - Persistent volume for data
- Redis Cache:
  - Redis 7 Alpine
  - Used for caching and real-time features
# Build images
./docker-dev.sh build
# Rebuild from scratch
./docker-dev.sh rebuild
# Access database
./docker-dev.sh db
# Run migrations
./docker-dev.sh migrate
# Open Drizzle Studio
./docker-dev.sh studio
# Open shell in container
./docker-dev.sh shell backend
./docker-dev.sh shell frontend

# Run automated test suite
./test-docker-setup.sh
# Manual health checks
curl http://localhost:8000/health
curl http://localhost:8000/api/v1/payments/debug/environment

The Docker setup automatically configures:
- `NODE_ENV=development` - Development mode
- `NEXT_PUBLIC_DEV_MODE=true` - Enable all features
- `ALERTS_LIMIT=100` - Increased alert limit for development
- `CORS_ORIGINS` - Configured for frontend access
- Database connections pre-configured
Port conflicts:
# Kill processes using required ports
lsof -ti:8000 | xargs kill -9 # Backend
lsof -ti:3000 | xargs kill -9 # Frontend
lsof -ti:5432 | xargs kill -9 # PostgreSQL
lsof -ti:6379 | xargs kill -9  # Redis

View container logs:
docker-compose logs -f backend # Backend logs
docker-compose logs -f frontend # Frontend logs
docker-compose logs -f postgres  # Database logs

Reset everything:
# Stop and remove all containers, networks, volumes
docker-compose down -v
# Remove images too
docker-compose down --rmi all

# Build and run with Docker Compose
docker-compose -f docker-compose.production.yml up -d
# Environment-specific:
docker-compose -f docker-compose.staging.yml up -d

- Prerequisites:
  - AWS CLI configured
  - Terraform installed
  - Domain name (optional)
- Deploy Infrastructure:
cd terraform
terraform init
terraform plan -var-file=production.tfvars
terraform apply -var-file=production.tfvars
- Components Deployed:
  - ECS Fargate for backend
  - S3 + CloudFront for frontend
  - ALB for load balancing
  - RDS/Aurora for database (optional)
  - CloudWatch for monitoring
  - Secrets Manager for credentials
For frontend deployment via Amplify:
version: 1
applications:
- appRoot: frontend
frontend:
phases:
preBuild:
commands:
- npm ci
build:
commands:
- npm run build
artifacts:
baseDirectory: .next
files:
- '**/*'
cache:
paths:
- 'node_modules/**/*'

Deploy to Render.com for a managed cloud platform experience:
- Render.com account
- GitHub repository connected to Render
- Neon database (or other PostgreSQL)
- Create Web Service:
  - Name: `dreamops-backend`
  - Environment: Python
  - Build Command: `pip install uv && uv sync`
  - Start Command: `uv run python api_server.py`
- Environment Variables:
# Core
ANTHROPIC_API_KEY=sk-ant-xxx
CLAUDE_MODEL=claude-3-5-sonnet-20241022
NODE_ENV=production
NEXT_PUBLIC_DEV_MODE=false
# Database
DATABASE_URL=postgresql://xxx
# API Configuration
API_HOST=0.0.0.0
API_PORT=10000
CORS_ORIGINS=https://your-frontend.onrender.com
# Integrations (as needed)
PAGERDUTY_API_KEY=xxx
K8S_ENABLED=true
- Advanced Settings:
  - Instance Type: Standard or higher
  - Health Check Path: `/health`
  - Auto-Deploy: Yes
- Create Static Site:
  - Name: `dreamops-frontend`
  - Build Command: `npm install && npm run build`
  - Publish Directory: `out`
- Environment Variables:
POSTGRES_URL=postgresql://xxx
NEXT_PUBLIC_API_URL=https://dreamops-backend.onrender.com
NODE_ENV=production
- Headers (`render.yaml`):
headers:
  - path: /*
    name: X-Frame-Options
    value: DENY
  - path: /*
    name: X-Content-Type-Options
    value: nosniff
- Verify Services:
curl https://dreamops-backend.onrender.com/health
curl https://dreamops-frontend.onrender.com
- Configure Webhooks: Update PagerDuty webhook URL to the Render backend URL
- Monitor Logs: Check the Render dashboard for deployment and runtime logs
The project includes CI/CD workflows:
# .github/workflows/deploy.yml
name: Deploy to Production
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Deploy Backend
# Deploy to ECS
- name: Deploy Frontend
# Deploy to Amplify/S3

# Start all services
./docker-dev.sh up
# View logs in real-time
./docker-dev.sh logs
# Run database migrations
./docker-dev.sh migrate
# Open Drizzle Studio
./docker-dev.sh studio
# Test the setup
./test-docker-setup.sh

- Backend Development:
cd backend
uv run python main.py        # Run CLI
uv run python api_server.py  # Run API server
- Frontend Development:
cd frontend
npm run dev        # Start development server
npm run db:studio  # Open Drizzle Studio
- Testing Commands:
# Backend
uv run pytest tests/
uv run ruff check . --fix
uv run mypy . --ignore-missing-imports
# Frontend
npm run lint
npm run type-check
npm test
# Test PagerDuty webhook
curl -X POST http://localhost:8000/webhook/pagerduty \
-H "Content-Type: application/json" \
-d @test_webhook_payload.json
# Test Kubernetes integration
uv run python test_k8s_pagerduty_integration.py

# Create test namespace
kubectl create namespace fuck-kubernetes-test
# Run failure simulations
./fuck_kubernetes.sh all
# Monitor agent response
tail -f logs/agent.log
# Clean up
./fuck_kubernetes.sh clean

The project includes comprehensive Grafana integration tests:
cd backend/tests/integrations/grafana
# Start test environment
docker-compose up -d
# Run tests
pytest test_grafana_integration.py -v
# Performance benchmarks
pytest test_grafana_integration.py::test_performance -vTest Categories:
- Connection tests
- Metric retrieval tests
- Alert integration tests
- Performance benchmarks
- Error handling tests
- Backend CI (`.github/workflows/backend-ci.yml`):
  - Python linting and formatting
  - Type checking with mypy
  - Unit tests with pytest
  - Security scanning
  - Docker image build
- Frontend CI (`.github/workflows/frontend-ci.yml`):
  - ESLint and TypeScript checks
  - Unit and integration tests
  - Build verification
  - Bundle size analysis
- Security Scanning (`.github/workflows/security-scan.yml`):
  - Dependency vulnerability scanning
  - SAST with Semgrep
  - Container image scanning
  - Secret detection
- Deployment (`.github/workflows/deploy.yml`):
  - Environment-specific deployments
  - Database migrations
  - Health checks
  - Rollback capability
# GitHub Secrets Required
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
ANTHROPIC_API_KEY
NEON_DATABASE_URL_STAGING
NEON_DATABASE_URL_PROD
AMPLIFY_APP_ID

- Authentication is handled by Authentik reverse proxy
- No built-in authentication in the application
- User identity is provided via headers from Authentik
- Ensure Authentik is properly configured and secured
- Never commit `.env` files to version control
- Use different secrets for each environment
- Rotate API keys regularly
- Use least-privilege access for service accounts
- Each environment uses completely separate databases
- Connection strings include SSL requirements
- Database users have minimal required permissions
- Regular security updates for database instances
- RBAC permissions are minimally scoped
- Destructive operations require explicit enablement
- All kubectl commands are logged
- Namespace isolation for testing
- Request validation using Pydantic models
- Rate limiting implemented
- Comprehensive audit logging
- All I/O operations use async/await
- Concurrent processing of multiple alerts
- Connection pooling for database operations
- Efficient resource cleanup
- Configuration cached in memory
- Database query results cached where appropriate
- Static assets served from CDN
- Browser caching for frontend resources
- CloudWatch integration for metrics
- Custom dashboards for system health
- Alerting on error rates and performance
- Real-time log streaming
When migrating from direct kubectl commands to MCP-based operations:
| kubectl Command | MCP Tool | Parameters |
|---|---|---|
| `kubectl get pods` | `get_pods` | namespace, label_selector |
| `kubectl describe pod` | `describe_pod` | name, namespace |
| `kubectl logs` | `get_pod_logs` | name, namespace, container |
| `kubectl get deployments` | `get_deployments` | namespace |
| `kubectl scale` | `scale_deployment` | name, namespace, replicas |
| `kubectl rollout restart` | `restart_deployment` | name, namespace |
| `kubectl apply -f` | `apply_manifest` | manifest |
| `kubectl delete` | `delete_resource` | resource_type, name, namespace |
Before:
result = subprocess.run(["kubectl", "get", "pods", "-n", "default"], capture_output=True)
After:
result = await k8s_integration.call_tool("get_pods", {"namespace": "default"})
- Replace subprocess kubectl calls with MCP tools
- Update error handling for MCP responses
- Add proper async/await for MCP calls
- Update logging to use MCP context
- Test rollback scenarios
- Verify permissions work with MCP
Problem: TypeError: Object of type ResolutionAction is not JSON serializable
Solution:
# Add a to_dict() method to dataclasses that appear in API responses
from dataclasses import asdict
from typing import Any

def to_dict(self) -> dict[str, Any]:
    return asdict(self)

Problem: "Requester User Not Found"
Solution: Ensure PAGERDUTY_USER_EMAIL is valid in your PagerDuty account
Problem: Connection timeouts or SSL errors
Solution:
- Include `?sslmode=require` in the connection string
- Check that the Neon project is active
- Remove `&channel_binding=require` if present
Problem: kubectl connection test failed
Solution:
- Verify kubectl is installed
- Check that `~/.kube/config` exists
- Set the correct context: `K8S_CONTEXT=your-context`
Common Problems:
- "notion integration requires NOTION_TOKEN and NOTION_DATABASE_ID"
  - Make sure both environment variables are set
  - Restart the backend server after adding them
- "Failed to connect to Notion API"
  - Check your integration token is correct
  - Ensure the token starts with `secret_`
- "Database not found"
  - Verify the database ID is correct (32 characters)
  - Make sure you've shared the database with your integration
- "Insufficient permissions"
  - The integration needs read and write access
  - Re-share the database with the integration
# Check API health
curl http://localhost:8000/health
# View payment environment
curl http://localhost:8000/api/v1/payments/debug/environment
# Test database connections
cd frontend && npm run test:db
# View agent logs
tail -f backend/logs/agent.log
# Check integration status
curl http://localhost:8000/api/v1/integrations
# Monitor webhook traffic (if using ngrok)
curl http://localhost:4040/inspect/http

GET /health
Response: {"status": "healthy", "version": "1.0.0"}

POST /webhook/pagerduty
Content-Type: application/json
X-Webhook-Secret: your-secret
{
"event": {
"event_type": "incident.triggered",
"data": {...}
}
}

GET /api/v1/alert-tracking/usage/{team_id}
Response: {
"team_id": "team_123",
"alerts_used": 2,
"alerts_limit": 3,
"account_tier": "free"
}

POST /api/v1/alert-tracking/alerts
{
"team_id": "team_123",
"title": "Database connection failed",
"severity": "high"
}

GET /api/v1/alert-tracking/check-integration-access/{team_id}/{integration_name}
Response: {
"has_access": true,
"reason": "Integration allowed on pro plan"
}

GET /api/v1/integrations
Response: {
"integrations": [
{"name": "kubernetes", "enabled": true, "status": "connected"},
{"name": "pagerduty", "enabled": true, "status": "connected"},
{"name": "notion", "enabled": false, "status": "not_configured"}
]
}

POST /api/v1/integrations/{name}/test
Response: {
"success": true,
"message": "Integration test successful"
}

POST /api/v1/integrations/{name}/health
Response: {
"healthy": true,
"details": {...}
}

GET /api/v1/incidents?team_id={team_id}
Response: {
"incidents": [{
"id": "INC_123",
"title": "Pod CrashLoopBackOff",
"status": "resolved",
"created_at": "2024-01-01T00:00:00Z"
}]
}

GET /api/v1/dashboard/metrics?team_id={team_id}
Response: {
"total_incidents": 45,
"resolved_incidents": 40,
"avg_resolution_time": 300,
"uptime_percentage": 99.9
}

GET /api/v1/dashboard/ai-actions?team_id={team_id}
Response: {
"actions": [{
"id": "ACT_123",
"type": "pod_restart",
"status": "completed",
"timestamp": "2024-01-01T00:00:00Z"
}]
}

POST /api/v1/insights/analyze-chaos
Response: {
"incidents_created": 3,
"services_affected": ["oom-app", "bad-image-app"],
"insights": ["Memory issues detected", "Image pull failures"],
"recommendations": ["Increase memory limits", "Check registry access"]
}

GET /api/v1/insights/report
Response: Markdown report with analysis and recommendations

GET /api/v1/insights/analysis?service=oom-app
Response: {
"service": "oom-app",
"incident_count": 5,
"issue_types": ["oom"],
"recommendations": ["Urgent: Increase memory limits"]
}

POST /api/v1/agno/configure
{
"cluster_name": "production",
"auth_method": "service_account",
"credentials": {
"token": "your-sa-token",
"ca_cert": "base64-encoded-cert",
"server": "https://k8s-api.example.com"
}
}

POST /api/v1/agno/execute
{
"cluster": "production",
"operation": "get_pods",
"namespace": "default"
}

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
When working with this codebase:
- Always check the latest documentation before making changes
- Test all changes locally before suggesting them
- Follow established patterns for error handling and logging
- Ensure JSON serialization compatibility for all API responses
- Run the pre-commit checklist:
uv run ruff check . --fix
uv run mypy . --ignore-missing-imports
uv run pytest tests/
uv run python main.py
uv run python api_server.py
- Python: PEP 8, type hints, async/await
- TypeScript: ESLint config, proper types
- Git: Conventional commits
- Documentation: Clear, concise, with examples
MIT License - see LICENSE file for details
Built with ❤️ by the DreamOps Team
Dream easy while AI takes your on-call duty