Deploy kube-prometheus-stack for Cluster Monitoring #47

Open
8 tasks
mischavandenburg opened this issue Jan 11, 2025 · 1 comment

Deploy kube-prometheus-stack for Cluster Monitoring

Description

We need to deploy the kube-prometheus-stack Helm chart to set up monitoring and alerting for our k3s cluster. The stack bundles Prometheus, Grafana, Alertmanager, and several exporters (such as node-exporter and kube-state-metrics), giving us observability of both the cluster and the applications running on it.

Requirements

  • Deploy kube-prometheus-stack via GitOps using Helm
  • Configure persistent storage for metrics
  • Set up proper ingress with TLS
  • Implement basic alerting rules
  • Configure Grafana dashboards for key metrics
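
As a starting point for the GitOps requirement, the deployment could look roughly like the Argo CD Application below. This is only a sketch. It assumes Argo CD is the GitOps tool (an inference from the ArgoCD dashboard listed further down), that Argo CD runs in an `argocd` namespace with the `default` project, and that the stack goes into a `monitoring` namespace. The chart version is a placeholder and must be pinned to a real release.

```yaml
# Sketch of an Argo CD Application for kube-prometheus-stack.
# Namespace names, project, and chart version are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd                 # assumed Argo CD namespace
spec:
  project: default                  # assumed project
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "<chart-version>"   # placeholder: pin a real chart version
    helm:
      releaseName: kube-prometheus-stack
      # Inline values; a values file in the Git repo works as well.
      values: |
        prometheus:
          prometheusSpec:
            retention: 15d          # illustrative value
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring           # assumed target namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```

`ServerSideApply=true` is included because the Prometheus Operator CRDs are large, and client-side apply can fail on them when the last-applied annotation exceeds the size limit.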

Technical Considerations

  • Resource requirements and limits
  • Retention period for metrics
  • Integration with cert-manager for TLS
  • Storage class selection for persistence
  • Integration with identity provider for authentication
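
To make these considerations concrete, a Helm values fragment might look like the sketch below. All sizes, the retention period, the hostname `grafana.example.com`, and the ClusterIssuer name `letsencrypt-prod` are illustrative assumptions, not decisions. `local-path` and `traefik` are the storage class and ingress class that k3s ships by default, so they are assumed here as well. Identity provider integration is left out because it depends on which provider is chosen.

```yaml
# Sketch of kube-prometheus-stack values; all numbers and names are assumptions.
prometheus:
  prometheusSpec:
    retention: 15d                  # time-based retention
    retentionSize: 18GB             # size cap, kept below the volume size
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        memory: 2Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-path   # k3s default storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

grafana:
  persistence:
    enabled: true
    storageClassName: local-path
    size: 5Gi
  ingress:
    enabled: true
    ingressClassName: traefik            # k3s default ingress controller
    annotations:
      # Lets cert-manager issue the certificate; issuer name is hypothetical.
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - grafana.example.com              # hypothetical hostname
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.example.com
```

Setting `retentionSize` somewhat below the requested volume size leaves headroom for the write-ahead log and compaction, so Prometheus drops old blocks before the disk fills up.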

Tasks

  1. Initial Stack Deployment

    • Create Helm values configuration
    • Configure persistent storage
    • Test basic functionality

  2. Grafana Setup

    • Configure SSO with identity provider
    • Import essential dashboards
    • Set up data sources
    • Configure persistent storage
    • Make proposal for user management

  3. Alertmanager Configuration

    • Set up basic alert rules
    • Configure notification channels (Slack)
    • Test alert delivery
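
For the Slack notification channel, the Alertmanager part of the values could be sketched as follows. The channel `#alerts`, the Secret name `alertmanager-slack`, and its key `webhook-url` are hypothetical. The Secret would have to exist in the same namespace and hold the Slack incoming-webhook URL, so the URL itself never lands in Git.

```yaml
# Sketch of Alertmanager values for Slack; names and timings are assumptions.
alertmanager:
  alertmanagerSpec:
    # Secrets listed here are mounted at /etc/alertmanager/secrets/<name>/
    secrets:
      - alertmanager-slack               # hypothetical Secret name
  config:
    route:
      receiver: slack-notifications
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Keep the always-firing Watchdog alert out of Slack.
        - receiver: "null"
          matchers:
            - alertname = "Watchdog"
    receivers:
      - name: "null"
      - name: slack-notifications
        slack_configs:
          - channel: "#alerts"           # hypothetical channel
            send_resolved: true
            # Read the webhook URL from the mounted Secret instead of inlining it.
            api_url_file: /etc/alertmanager/secrets/alertmanager-slack/webhook-url
```

Alert delivery can then be tested by temporarily pointing the Watchdog route at `slack-notifications`, since that alert fires continuously by design.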

Acceptance Criteria

  • Stack is successfully deployed and operational
  • Persistent storage is properly configured
  • Ingress with TLS is working
  • Basic alerts are functional
  • Key dashboards are available
  • Authentication is working
  • Metrics retention is configured
  • Documentation is complete

Important Dashboards to Include

  • Node metrics
  • Kubernetes cluster overview
  • Persistent volumes
  • API server metrics
  • etcd metrics
  • cert-manager status
  • CloudNativePG databases
  • ArgoCD status

Basic Alerts to Implement

  • Node status
  • Pod status
  • Storage capacity
  • Certificate expiration
  • Backup status
  • High CPU/Memory usage
  • Persistent volume status
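
The chart already ships a set of default rules that cover much of this list, so custom rules may only be needed for the gaps. A minimal sketch of a few additional rules, using the chart's `additionalPrometheusRulesMap` value, is shown below. The group name, thresholds, and durations are assumptions to be tuned. The certificate rule assumes cert-manager metrics are being scraped.

```yaml
# Sketch of extra alert rules; thresholds and names are assumptions.
additionalPrometheusRulesMap:
  basic-alerts:                          # hypothetical rule file name
    groups:
      - name: basic.rules
        rules:
          - alert: NodeNotReady
            # kube-state-metrics reports 1 while the Ready condition is true.
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: A node has not been Ready for 5 minutes.
          - alert: PersistentVolumeAlmostFull
            # Fires when less than 10% of the volume is still available.
            expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: A persistent volume has less than 10% free space.
          - alert: CertificateExpiringSoon
            # cert-manager exposes the expiry time as a Unix timestamp.
            expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: A cert-manager certificate expires in less than 14 days.
```

Backup status is not covered here because the metric to alert on depends on the backup tool in use.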

Additional Notes

  • Document every setup step so that future deployments can be reproduced
@vikramreddym

I am interested in working on this.
