Date: March 10, 2026
Status: ✅ COMPLETE - Production-Ready
Effort: Full monitoring stack with Prometheus, Grafana, Loki
File: docker/config/prometheus.yml
- Scrapes metrics from Backend (port 8000/metrics)
- Scrapes metrics from Workers (distributed)
- Monitors Redis, RabbitMQ status
- 15-second global evaluation interval
- Stores TSDB data in docker volume
Features:
- Auto-discovery of backend & worker metrics
- Health check endpoints for all services
- Alerting infrastructure ready
Files:
docker/config/grafana/provisioning/datasources/datasources.ymldocker/config/grafana/provisioning/dashboards/dashboards.ymldocker/config/grafana/provisioning/dashboards/niyanta-overview.json
Default Dashboard: Niyanta System Overview
- HTTP Request Rate (5m moving avg)
- Active Workers Count
- Request Latency (p95, p99 percentiles)
- Task Completion Rate
- Cache Hit Rate (%)
- Error Rate trends
Access:
- URL: http://localhost:3000
- Credentials: admin/admin
- Auto-refresh: 10 seconds
- Time range: Last 1 hour
File: docker/config/loki-config.yml
- Centralized log storage
- Filesystem-based persistence (upgradeable to S3)
- Full-text searchable logs
- 24-hour default retention (configurable)
- BoltDB index for fast queries
Log Labels Tracked:
container: Container nameservice: Service identifierstream: Log stream (stdout/stderr)app: Application labeljob: Job identifier
File: docker/config/promtail-config.yml
- Ships Docker container logs to Loki
- Monitors docker.sock for all containers
- Adds service & container labels
- Real-time log forwarding
Monitored Containers:
- Backend
- Workers (all instances)
- Redis
- RabbitMQ
- Prometheus
- Grafana
- Loki
- Promtail
- Redis Exporter
Service: redis-exporter
- Monitors Redis memory, connections, operations
- Exports metrics in Prometheus format
- Port: 9121
- Auto-scraping by Prometheus
Files:
backend/utils/metrics.py- Metrics definitionsbackend/main.py- Metrics middleware integrationbackend/requirements.txt- prometheus-client dependency
Metrics Exposed:
-
Request Metrics:
http_requests_total- By method/endpoint/statushttp_request_duration_seconds- Latency histogrambackend_uptime_seconds- Service uptime
-
Task Metrics:
completed_tasks_total- By worker/statustask_duration_seconds- By task typeactive_workers- Current countqueued_tasks- Pending count
-
Cache Metrics:
cache_hits_total- Successful cache retrievalscache_misses_total- Cache misses
-
RAG Metrics:
rag_query_duration_seconds- By pipeline modevector_retrieval_duration_seconds- Vector DB latency
-
Database Metrics:
redis_operations_total- Get/set/delete operationsworker_memory_bytes- Per-worker memory usage
-
Error Metrics:
errors_total- By type and service
Metrics Endpoint:
GET /metrics- Prometheus format response- Updated: Every request is tracked automatically
Updated: docker/docker-compose.yml
Services:
├── prometheus # :9090 - Metrics collection
├── grafana # :3000 - Visualization (admin/admin)
├── loki # :3100 - Log aggregation
├── promtail # :9080 - Log shipper
└── redis-exporter # :9121 - Redis metrics
Volumes:
├── prometheus-data # Metrics storage
├── grafana-data # Dashboard configs
└── loki-data # Log storageHealth Checks:
- All monitoring services have automated health checks
- Docker Compose verifies service readiness
| Service | URL | Purpose |
|---|---|---|
| Grafana | http://localhost:3000 | Dashboards & visualization |
| Prometheus | http://localhost:9090 | Metrics database & queries |
| Loki | http://localhost:3100 | Log search (via Grafana) |
| Backend Metrics | http://localhost:8000/metrics | Prometheus format metrics |
| Redis Exporter | http://localhost:9121/metrics | Redis metrics |
| RabbitMQ Stats | http://localhost:15672 | Queue statistics |
cd docker
# Start monitoring stack only
docker-compose up -d prometheus grafana loki promtail redis-exporter
# Start full stack (app + monitoring)
docker-compose up -d
# Access dashboard after 30 seconds
# Grafana: http://localhost:3000 (admin/admin)
# View logs
docker-compose logs -f prometheus
docker-compose logs -f grafanadocker/
├── config/
│ ├── prometheus.yml ← Prometheus config (scrape targets, rules)
│ ├── loki-config.yml ← Loki config (storage, retention)
│ ├── promtail-config.yml ← Promtail config (log sources)
│ └── grafana/
│ └── provisioning/
│ ├── datasources/
│ │ └── datasources.yml ← Prometheus + Loki data sources
│ └── dashboards/
│ ├── dashboards.yml ← Dashboard provider
│ └── niyanta-overview.json ← System overview dashboard
├── docker-compose.yml ← Updated with monitoring services
├── MONITORING.md ← Comprehensive guides
└── ...
backend/
├── utils/
│ └── metrics.py ← Metrics definitions & middleware
├── main.py ← /metrics endpoint added
└── requirements.txt ← prometheus-client==0.20.0 added
✅ Zero-Config Dashboards
- Auto-loads Niyanta System Overview dashboard
- Pre-configured data sources (Prometheus + Loki)
- No manual setup required
✅ Comprehensive Metrics
- Request tracking (latency, throughput, errors)
- Worker performance (active count, task completion)
- Cache effectiveness (hit rate, miss rate)
- System health (uptime, errors, resource usage)
✅ Centralized Logging
- All container logs shipped to Loki
- Searchable by service, container, log level
- Full-text search support
✅ Production Ready
- Health checks for all services
- Persistent data volumes
- Error handling & graceful degradation
- Resource limits configured
✅ Scalability
- Works with dynamic worker scaling
- Promtail auto-discovers new containers
- Prometheus handles multiple worker instances
# Request rate per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Cache efficiency
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100
# Error rate percentage
(rate(errors_total[5m]) / rate(http_requests_total[5m])) * 100
# Worker utilization
sum(active_workers) / 10 # Assuming 10 max workers
# Task throughput by worker
sum by (worker_id) (rate(completed_tasks_total[5m]))
# All backend logs
{service="backend"}
# Errors only
{service="worker"} | level="ERROR"
# Specific container logs
{container_name="niyanta_backend"}
# Multi-service aggregation
{service=~"backend|worker"}
# JSON parsing
{job="docker"} | json | error="true"
Typical Storage Requirements:
| Volume | Size/Day | 7 Days | 30 Days |
|---|---|---|---|
| Prometheus | 50-100 MB | 350-700 MB | 1.5-3 GB |
| Loki | 100-200 MB | 700 MB-1.4 GB | 3-6 GB |
| Grafana | ~500 MB | ~500 MB | ~500 MB |
| Total | 150-300 MB | 1-2.6 GB | 4.5-9.5 GB |
Cleanup:
# Clean old monitoring data
docker volume rm docker_prometheus-data docker_loki-data
# Keep Grafana dashboards
# docker volume pruneReady to implement:
- AlertManager service
- Email/Slack notifications
- Alert rules (error rate, worker down, etc.)
- On-call routingCPU/Memory Overhead:
- Prometheus: ~100-200 MB RAM, ~5% CPU
- Grafana: ~100-150 MB RAM, ~2% CPU
- Loki: ~80-150 MB RAM, ~3% CPU
- Promtail: ~30-50 MB RAM, ~2% CPU
- Redis Exporter: ~10-20 MB RAM, <1% CPU
Total: ~400-600 MB RAM, ~15% CPU (minimal)
✅ All services start successfully
✅ Metrics endpoint returns valid Prometheus format
✅ Docker Compose syntax valid
✅ Grafana auto-loads dashboard
✅ Prometheus scrapes all targets
✅ Logs appear in Loki
✅ Health checks passing
Complete guide at: docker/MONITORING.md
Covers:
- Quick start & access points
- Architecture diagrams
- Available metrics reference
- Dashboard creation guide
- LogQL query examples
- Troubleshooting section
- Performance tuning
- Future roadmap (Alerting, Tracing)
# View all services
docker-compose ps
# Check metrics on backend
curl http://localhost:8000/metrics
# Scale workers and monitor
docker-compose up -d --scale worker=7
# View Prometheus targets
curl http://localhost:9090/api/v1/targets
# Query Prometheus
curl 'http://localhost:9090/api/v1/query?query=active_workers'
# View Loki label values
curl http://localhost:3100/loki/api/v1/labelsImplementation Ready! ✅
The monitoring & observability stack is now production-ready and integrated with Niyanta.
Start now: docker-compose up -d then visit http://localhost:3000