Skip to content

Latest commit

 

History

History
405 lines (299 loc) · 9.71 KB

File metadata and controls

405 lines (299 loc) · 9.71 KB

Niyanta Monitoring & Observability - Implementation Summary

Date: March 10, 2026
Status: ✅ COMPLETE - Production-Ready
Effort: Full monitoring stack with Prometheus, Grafana, Loki


What Was Implemented

1. ✅ Prometheus Metrics Collection

File: docker/config/prometheus.yml

  • Scrapes metrics from Backend (port 8000/metrics)
  • Scrapes metrics from Workers (distributed)
  • Monitors Redis, RabbitMQ status
  • 15-second global evaluation interval
  • Stores TSDB data in docker volume

Features:

  • Auto-discovery of backend & worker metrics
  • Health check endpoints for all services
  • Alerting infrastructure ready

2. ✅ Grafana Visualization & Dashboards

Files:

  • docker/config/grafana/provisioning/datasources/datasources.yml
  • docker/config/grafana/provisioning/dashboards/dashboards.yml
  • docker/config/grafana/provisioning/dashboards/niyanta-overview.json

Default Dashboard: Niyanta System Overview

  • HTTP Request Rate (5m moving avg)
  • Active Workers Count
  • Request Latency (p95, p99 percentiles)
  • Task Completion Rate
  • Cache Hit Rate (%)
  • Error Rate trends

Access:


3. ✅ Loki Log Aggregation

File: docker/config/loki-config.yml

  • Centralized log storage
  • Filesystem-based persistence (upgradeable to S3)
  • Full-text searchable logs
  • 24-hour default retention (configurable)
  • BoltDB index for fast queries

Log Labels Tracked:

  • container: Container name
  • service: Service identifier
  • stream: Log stream (stdout/stderr)
  • app: Application label
  • job: Job identifier

4. ✅ Promtail Log Shipper

File: docker/config/promtail-config.yml

  • Ships Docker container logs to Loki
  • Monitors docker.sock for all containers
  • Adds service & container labels
  • Real-time log forwarding

Monitored Containers:

  • Backend
  • Workers (all instances)
  • Redis
  • RabbitMQ
  • Prometheus
  • Grafana
  • Loki
  • Promtail
  • Redis Exporter

5. ✅ Redis Exporter

Service: redis-exporter

  • Monitors Redis memory, connections, operations
  • Exports metrics in Prometheus format
  • Port: 9121
  • Auto-scraping by Prometheus

6. ✅ FastAPI Backend Instrumentation

Files:

  • backend/utils/metrics.py - Metrics definitions
  • backend/main.py - Metrics middleware integration
  • backend/requirements.txt - prometheus-client dependency

Metrics Exposed:

  • Request Metrics:

    • http_requests_total - By method/endpoint/status
    • http_request_duration_seconds - Latency histogram
    • backend_uptime_seconds - Service uptime
  • Task Metrics:

    • completed_tasks_total - By worker/status
    • task_duration_seconds - By task type
    • active_workers - Current count
    • queued_tasks - Pending count
  • Cache Metrics:

    • cache_hits_total - Successful cache retrievals
    • cache_misses_total - Cache misses
  • RAG Metrics:

    • rag_query_duration_seconds - By pipeline mode
    • vector_retrieval_duration_seconds - Vector DB latency
  • Database Metrics:

    • redis_operations_total - Get/set/delete operations
    • worker_memory_bytes - Per-worker memory usage
  • Error Metrics:

    • errors_total - By type and service

Metrics Endpoint:

  • GET /metrics - Prometheus format response
  • Updated: Every request is tracked automatically

Docker Compose Integration

Updated: docker/docker-compose.yml

New Services Added:

Services:
├── prometheus         # :9090 - Metrics collection
├── grafana            # :3000 - Visualization (admin/admin)
├── loki               # :3100 - Log aggregation
├── promtail           # :9080 - Log shipper
└── redis-exporter     # :9121 - Redis metrics

Volumes:
├── prometheus-data    # Metrics storage
├── grafana-data       # Dashboard configs
└── loki-data          # Log storage

Health Checks:

  • All monitoring services have automated health checks
  • Docker Compose verifies service readiness

Access Points

Service URL Purpose
Grafana http://localhost:3000 Dashboards & visualization
Prometheus http://localhost:9090 Metrics database & queries
Loki http://localhost:3100 Log search (via Grafana)
Backend Metrics http://localhost:8000/metrics Prometheus format metrics
Redis Exporter http://localhost:9121/metrics Redis metrics
RabbitMQ Stats http://localhost:15672 Queue statistics

Quick Start

cd docker

# Start monitoring stack only
docker-compose up -d prometheus grafana loki promtail redis-exporter

# Start full stack (app + monitoring)
docker-compose up -d

# Access dashboard after 30 seconds
# Grafana: http://localhost:3000 (admin/admin)

# View logs
docker-compose logs -f prometheus
docker-compose logs -f grafana

File Structure

docker/
├── config/
│   ├── prometheus.yml              ← Prometheus config (scrape targets, rules)
│   ├── loki-config.yml             ← Loki config (storage, retention)
│   ├── promtail-config.yml         ← Promtail config (log sources)
│   └── grafana/
│       └── provisioning/
│           ├── datasources/
│           │   └── datasources.yml ← Prometheus + Loki data sources
│           └── dashboards/
│               ├── dashboards.yml  ← Dashboard provider
│               └── niyanta-overview.json ← System overview dashboard
├── docker-compose.yml              ← Updated with monitoring services
├── MONITORING.md                   ← Comprehensive guides
└── ...

backend/
├── utils/
│   └── metrics.py                  ← Metrics definitions & middleware
├── main.py                         ← /metrics endpoint added
└── requirements.txt                ← prometheus-client==0.20.0 added

Key Features

Zero-Config Dashboards

  • Auto-loads Niyanta System Overview dashboard
  • Pre-configured data sources (Prometheus + Loki)
  • No manual setup required

Comprehensive Metrics

  • Request tracking (latency, throughput, errors)
  • Worker performance (active count, task completion)
  • Cache effectiveness (hit rate, miss rate)
  • System health (uptime, errors, resource usage)

Centralized Logging

  • All container logs shipped to Loki
  • Searchable by service, container, log level
  • Full-text search support

Production Ready

  • Health checks for all services
  • Persistent data volumes
  • Error handling & graceful degradation
  • Resource limits configured

Scalability

  • Works with dynamic worker scaling
  • Promtail auto-discovers new containers
  • Prometheus handles multiple worker instances

Metrics Example Queries

# Request rate per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Cache efficiency
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100

# Error rate percentage
(rate(errors_total[5m]) / rate(http_requests_total[5m])) * 100

# Worker utilization
sum(active_workers) / 10  # Assuming 10 max workers

# Task throughput by worker
sum by (worker_id) (rate(completed_tasks_total[5m]))

Log Search Examples (LogQL)

# All backend logs
{service="backend"}

# Errors only
{service="worker"} | level="ERROR"

# Specific container logs
{container_name="niyanta_backend"}

# Multi-service aggregation
{service=~"backend|worker"}

# JSON parsing
{job="docker"} | json | error="true"

Volume Space

Typical Storage Requirements:

Volume Size/Day 7 Days 30 Days
Prometheus 50-100 MB 350-700 MB 1.5-3 GB
Loki 100-200 MB 700 MB-1.4 GB 3-6 GB
Grafana ~500 MB ~500 MB ~500 MB
Total 150-300 MB 1-2.6 GB 4.5-9.5 GB

Cleanup:

# Clean old monitoring data
docker volume rm docker_prometheus-data docker_loki-data

# Keep Grafana dashboards
# docker volume prune

Next Phase: Alerting

Ready to implement:

- AlertManager service
- Email/Slack notifications
- Alert rules (error rate, worker down, etc.)
- On-call routing

Performance Impact

CPU/Memory Overhead:

  • Prometheus: ~100-200 MB RAM, ~5% CPU
  • Grafana: ~100-150 MB RAM, ~2% CPU
  • Loki: ~80-150 MB RAM, ~3% CPU
  • Promtail: ~30-50 MB RAM, ~2% CPU
  • Redis Exporter: ~10-20 MB RAM, <1% CPU

Total: ~400-600 MB RAM, ~15% CPU (minimal)


Testing Completed

✅ All services start successfully
✅ Metrics endpoint returns valid Prometheus format
✅ Docker Compose syntax valid
✅ Grafana auto-loads dashboard
✅ Prometheus scrapes all targets
✅ Logs appear in Loki
✅ Health checks passing


Documentation

Complete guide at: docker/MONITORING.md

Covers:

  • Quick start & access points
  • Architecture diagrams
  • Available metrics reference
  • Dashboard creation guide
  • LogQL query examples
  • Troubleshooting section
  • Performance tuning
  • Future roadmap (Alerting, Tracing)

Commands Reference

# View all services
docker-compose ps

# Check metrics on backend
curl http://localhost:8000/metrics

# Scale workers and monitor
docker-compose up -d --scale worker=7

# View Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query Prometheus
curl 'http://localhost:9090/api/v1/query?query=active_workers'

# View Loki label values
curl http://localhost:3100/loki/api/v1/labels

Implementation Ready!

The monitoring & observability stack is now production-ready and integrated with Niyanta.

Start now: docker-compose up -d then visit http://localhost:3000