-
Issue: Site or service becomes unavailable.
-
Cause: Server crash, bad deployment, DNS failure, etc.
-
Fix:
- Check health checks (/health, /status).
- Use load balancer failover.
- Roll back deployment quickly using CI/CD.
-
Issue: Disk space runs out; apps crash or services stop.
-
Cause: Logs, images, backups filling /var, /tmp, or /.
-
Fix:
- Clear old logs: sudo journalctl --vacuum-time=3d
- Clean Docker: docker system prune
- Set up alerts for disk usage.
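As a minimal sketch, a cron-able script that warns when the root filesystem crosses a threshold (the 80% value and the message format are assumptions):

```shell
#!/bin/sh
# Warn when root filesystem usage crosses a threshold (80% is an assumed value).
THRESHOLD=80
# df -P gives POSIX output; field 5 is the use percentage, e.g. "42%".
USAGE=$(df -P / | awk 'NR==2 { gsub("%", ""); print $5 }')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "WARN: / is ${USAGE}% full"
else
    echo "OK: / is ${USAGE}% full"
fi
```

Wire the same check into your monitoring system rather than cron where possible, so the alert pages someone.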
-
Issue: Server becomes unresponsive.
-
Cause: Infinite loops, memory leaks, heavy DB queries.
-
Fix:
- top / htop to identify process.
- Restart the offending process, or scale pods up/down (Kubernetes).
- Use monitoring tools: CloudWatch, Prometheus.
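Before restarting anything, a quick one-liner to see which processes are eating CPU (GNU ps on Linux; columns are PID, %CPU, command):

```shell
# List the five most CPU-hungry processes plus the header line.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6
```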
-
Issue: HTTPS stops working, users see security warnings.
-
Cause: TLS certificate expired.
-
Fix:
- Renew using Let's Encrypt: certbot renew
- Automate renewal with cron.
- In AWS, use ACM (with Route53 DNS validation) for automatic renewal.
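The cron automation above can be a single crontab line (a sketch; the twice-daily schedule matches certbot's own recommendation, the nginx reload hook is an assumption about your setup):

```
# Run certbot renew twice a day; reload nginx only when a cert actually changed.
17 3,15 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"
```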
-
Issue: Code changes aren’t deployed.
-
Cause: Git conflicts, failed tests, broken scripts.
-
Fix:
- Review pipeline logs in Jenkins/GitHub Actions/GitLab.
- Fix test or script errors.
- Re-trigger build after fixing.
-
Issue: Users can't access site after migration.
-
Cause: DNS TTL too high or misconfigured records.
-
Fix:
- Use low TTL during migration.
- Verify with nslookup, dig.
-
Issue: Traffic not routing properly.
-
Fix:
- Check health check config.
- Review target group settings.
- Confirm correct listener rules.
-
Issue: “Access Denied” errors for developers or automation.
-
Fix:
- Review IAM policies.
- Use least privilege principle.
- Decode encoded errors with aws sts decode-authorization-message.
-
Issue: Pod/Container restarts continuously.
-
Fix:
- Check logs: docker logs or kubectl logs.
- Check image tag, health checks, volume paths.
- Add proper readiness & liveness probes.
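A sketch of the readiness/liveness probes mentioned above, as a Pod spec fragment (container name, image, port, and the /health path are assumptions):

```yaml
containers:
  - name: app            # hypothetical container name
    image: myapp:1.0     # hypothetical image
    readinessProbe:      # gate traffic until the app answers /health
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:       # restart the container if /health stops answering
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

Give the liveness probe a longer initial delay than readiness, or slow-starting apps get killed before they ever come up.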
-
Issue: Hundreds of alerts flood in after a small incident.
-
Fix:
- Implement alert grouping (like in Prometheus Alertmanager).
- Use thresholds with time windows.
- Silence or tune noisy alerts.
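Alert grouping in Prometheus Alertmanager lives in its route block; a minimal sketch (the receiver name is hypothetical, the timings are common starting points, not recommendations from this document):

```yaml
route:
  receiver: ops-slack               # hypothetical receiver
  group_by: ['alertname', 'cluster']
  group_wait: 30s                   # wait before the first notification for a new group
  group_interval: 5m                # wait before notifying about new alerts in an existing group
  repeat_interval: 4h               # re-send still-firing alerts this often
```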
-
Issue: Server crashes or reboots unexpectedly.
-
Cause: High load, faulty drivers, hardware issues.
-
Fix:
- Analyze logs (/var/log/syslog, dmesg).
- Use monitoring alerts to auto-reboot or failover.
-
Issue: Database is slow or refuses connections.
-
Cause: Too many connections, slow queries, network issues.
-
Fix:
- Optimize queries & indexing.
- Tune DB connection pool.
- Use RDS/Aurora metrics & auto-scaling.
-
Issue: API keys or passwords pushed to GitHub.
-
Fix:
- Revoke and rotate secrets.
- Use tools like AWS Secrets Manager, HashiCorp Vault.
- Scan commits using git-secrets or TruffleHog.
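A very rough sketch of the kind of pattern scan those tools run, as a pre-commit check over staged changes (the AKIA prefix is the well-known AWS access-key-ID format; treat this as illustration, not a replacement for git-secrets or TruffleHog):

```shell
#!/bin/sh
# Abort the commit if staged changes contain an AWS-style access key ID.
PATTERN='AKIA[0-9A-Z]{16}'
if git diff --cached | grep -qE "$PATTERN"; then
    echo "Possible AWS access key in staged changes; aborting commit." >&2
    exit 1
fi
```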
-
Issue: Terraform provisions wrong or unexpected resources.
-
Cause: Wrong instance type, public IP enabled, missing tags.
-
Fix:
- Review Terraform plan before apply.
- Inspect state with terraform state list / terraform state show.
- Maintain modules and version control.
-
Issue: Build takes too long, huge images (>1GB).
-
Fix:
- Use multi-stage builds.
- Use alpine or slim base images.
- Clean up unnecessary files in Dockerfile.
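A multi-stage build sketch for a Node app (the node versions, build script, and dist/ layout are assumptions about your project):

```dockerfile
# Stage 1: build with the full toolchain.
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: ship only the runtime artifacts on a slim base.
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```

The final image never contains the compiler toolchain or dev dependencies, which is usually where most of the size goes.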
-
Issue: Requests hitting wrong targets.
-
Fix:
- Verify listener rules, health checks.
- Ensure target groups are healthy.
- Use AWS ELB/ALB access logs for debugging.
-
Issue: Pods stuck in CrashLoopBackOff.
-
Cause: App crashes on start due to config/env issues.
-
Fix:
- Run kubectl logs and kubectl describe pod.
- Use readiness/liveness probes correctly.
- Check resource limits (CPU/Memory).
-
Issue: High latency on requests.
-
Cause: Network hops, unoptimized backend, throttling.
-
Fix:
- Use tracing (OpenTelemetry, AWS X-Ray, Jaeger).
- Monitor with Prometheus + Grafana.
- Optimize API response time and DB calls.
-
Issue: CI/CD pipeline runs too slowly.
-
Cause: Long tests, large artifacts, misconfigured agents.
-
Fix:
- Use caching (e.g., Docker layer cache, NPM cache).
- Split pipelines into stages (build/test/deploy).
- Parallelize jobs and use runners wisely.
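As one concrete example of caching, a GitHub Actions step fragment for npm (the cache path and key are typical but assumed for your project):

```yaml
# Cache ~/.npm keyed on the lockfile, so unchanged dependencies skip the download.
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('package-lock.json') }}
    restore-keys: npm-
```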
-
Issue: Pods/instances not scaling when needed.
-
Cause: Metrics not collected properly or thresholds too high.
-
Fix:
- Use proper CPU/memory requests & limits.
- Verify HPA (Horizontal Pod Autoscaler) setup.
- Ensure CloudWatch/Prometheus metrics are accurate.
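An HPA sketch targeting 70% average CPU (the names and replica bounds are assumptions); note it only works if the pods declare CPU requests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp            # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization      # percentage of the pods' CPU *requests*
          averageUtilization: 70
```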
-
Issue: Production data accidentally deleted.
-
Cause: Volume deletion, wrong rm -rf, or dropped DB.
-
Fix:
- Always use backups (EBS snapshots, S3 versioning).
- Run disaster recovery drills and define RTO/RPO targets.
- Avoid destructive scripts in CI/CD.
-
Issue: Logs are either too noisy or missing entirely.
-
Fix:
- Use centralized logging (ELK, EFK, CloudWatch Logs).
- Implement log rotation.
- Adjust log levels in prod (info/warn vs debug).
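Log rotation can be a small logrotate drop-in; a sketch for an assumed app log path:

```
# /etc/logrotate.d/myapp (hypothetical path)
/var/log/myapp/*.log {
    daily
    rotate 7          # keep a week of history
    compress
    missingok
    notifempty
    copytruncate      # rotate without restarting the app
}
```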
-
Issue: Domain not resolving.
-
Fix:
- Check with dig, nslookup.
- Validate Route53 or DNS provider settings.
- Check CNAME, A records, TTL, etc.
-
Issue: Clock drift causes JWT expiry mismatches and out-of-order logs.
-
Fix:
- Sync all servers with NTP.
- Use tools like chrony or ntpd.
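With chrony, pointing servers at the public NTP pool is a two-line config (pool.ntp.org is the standard pool; the config path is the Debian/Ubuntu default):

```
# /etc/chrony/chrony.conf
pool pool.ntp.org iburst   # iburst speeds up the initial sync
makestep 1.0 3             # step the clock if offset > 1s during the first 3 updates
```

Verify sync afterwards with chronyc tracking.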
-
Issue: Wrong or stale build gets deployed.
-
Cause: Tag mismatch, stale cache, build errors.
-
Fix:
- Use Git SHA or version in artifacts.
- Lock dependencies in package-lock.json or requirements.txt.
-
Issue: Ports like 22 (SSH) and 3306 (MySQL) open to 0.0.0.0/0.
-
Fix:
- Use least privilege and restrict to known IPs.
- Regularly audit using tools like AWS Trusted Advisor.
-
Issue: App fails due to missing persistent volume.
-
Fix:
- Check if volume is attached and mounted.
- Review volume permissions and mount path in Kubernetes.
-
Issue: Lambda/EC2 workloads get Access Denied calling AWS services.
-
Fix:
- Verify the role's trust policy allows sts:AssumeRole.
- Use IAM policy simulator to debug.