About

This terraform module adjusts a prometheus configuration (adding rule_files entries for the boilerplate rule files it generates) and syncs it with an etcd key prefix.

It is meant to:

Currently, two kinds of boilerplate are supported:

  • Node exporter rules and alerts for vms (number of hosts detected, cpu, ram, disks)
  • Terracd jobs metrics and alerts (to get the interval since the last plan/apply and a threshold value that will trigger an alert)

Inputs

  • config: Content of the entrypoint prometheus.yml configuration file. The module adds rule_files entries for the rule files it generates and otherwise leaves the content as is.
  • fs_path: Path where the prometheus configuration will be generated prior to synchronizing it with etcd. Beyond generating the prometheus.yml file there, boilerplate rule files will be generated in the rules subdirectory.
  • etcd_key_prefix: Etcd prefix where the processed prometheus configuration will be synchronized.
  • node_exporter_jobs: List of node exporter jobs to generate boilerplate for. Each entry should take the following keys:
    • tag: Tag for the node exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-node-exporter.
    • expected_count: Expected number of instances associated with the job
    • memory_usage_threshold: Maximum memory usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
    • cpu_usage_threshold: Maximum cpu usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
    • expected_disks_count: Expected number of disks (ex: 2). An alert will be triggered if the number of disks doesn't match. Can be set to -1 to disable the alert.
    • disk_space_usage_threshold: Maximum disk space usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
    • disk_io_usage_threshold: Maximum disk io usage as a percentage (ex: 90). An alert will be triggered if this threshold is crossed for 15 minutes or more.
    • alert_labels: Map of string keys and values corresponding to labels to add to all of the job's alerts.
  • blackbox_exporter_jobs: List of blackbox tcp/http exporter jobs to generate boilerplate for. Each entry should take the following keys:
    • tag: Tag for the blackbox exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-blackbox-exporter.
    • unavailability_tolerance: Duration the service can be unavailable before an alert triggers. It is a string formatted as prometheus expects in the for field of alert rules.
    • max_acceptable_latency: Duration in seconds indicating the maximum acceptable response time for the service. If the service continuously takes longer than this to respond for an interval of time longer than unavailability_tolerance, a slow service alert will be triggered.
    • cert_renewal_window: Delay in days indicating the expected renewal window for the tls certificate provided by the service. If the certificate the service provides expires within a delay shorter than this window, an alert will be triggered to indicate the certificate wasn't renewed properly.
    • has_tls: Boolean indicating whether the service expects a tls connection. If false, alerts for the cert renewal window and tls version will not be set.
    • expect_recent_tls: Boolean indicating whether the service is expected to use tls version 1.3. If set to true and the service uses a version of tls older than 1.3, an alert will be triggered.
    • alert_labels: Map of string keys and values corresponding to labels to add to all of the job's alerts.
  • terracd_jobs: List of terracd jobs to generate boilerplate for. Each entry should take the following keys:
    • tag: Tag for the terracd job. It should correspond to the job name.
    • plan_interval_threshold: Interval threshold after which an alert will be triggered if a plan or apply command did not run successfully. Used to diagnose a broken or non-running pipeline.
    • apply_interval_threshold: Interval threshold after which an alert will be triggered if an apply command did not run successfully. Used to detect a pipeline that was left in plan and never put back on apply.
    • unit: Base time unit to use (minute or hour). It determines how the thresholds are interpreted and whether the rules are processed in minutes or hours.
    • alert_labels: Map of string keys and values corresponding to labels to add to all of the job's alerts.
  • kubernetes_cluster_jobs: List of kubernetes cluster jobs to generate boilerplate for. Each entry should take the following keys:
    • tag: Tag for the kubernetes cluster job. It should correspond to the cluster name.
    • expected_services: List of expected deployments that should have a certain number of long running instances. Each entry should have the following keys:
      • namespace: Namespace where the service is expected to run
      • name: Name of the service. It should match the kubernetes deployment name.
      • expected_min_count: Minimum expected number of instances that should be running.
      • expected_start_delay: Expected delay before an instance is started. Running instances that have been around for less than that delay won't be considered running.
      • alert_labels: Extra labels to add to alerts triggered for the service.
  • minio_cluster_jobs: List of minio cluster jobs to generate boilerplate for. Each entry should take the following key:
    • tag: Tag for the minio cluster job. It should correspond to the cluster name.
  • etcd_exporter_jobs: List of etcd exporter jobs to generate boilerplate for. Each entry should take the following keys:
    • tag: Tag for the etcd exporter job. It should consist of words separated by dashes. The job is expected to be called <tag>-etcd-exporter.
    • expected_count: Expected number of etcd members associated with the job
    • max_learn_time: Maximum expected time for an etcd learner to catch up.
    • max_db_size: Maximum expected data size (note that etcd has its own limit of 8GiB)
    • alert_labels: Map of string keys and values corresponding to labels to add to all of the job's alerts.

Example

For a usage example, see: https://github.com/Ferlab-Ste-Justine/kvm-dev-orchestrations/blob/main/prometheus/prometheus-configs.tf
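
As a rough illustration of how the inputs described above fit together, here is a minimal sketch of a module call. Several things in it are assumptions rather than facts taken from this module: the source path is a placeholder, the module name prometheus_config, the paths, the etcd prefix, the tags and the label values are made up, numeric thresholds are assumed to be passed as numbers, and the job lists left out (kubernetes_cluster_jobs, minio_cluster_jobs, etcd_exporter_jobs) are assumed to be optional. Any provider configuration needed to reach etcd is also omitted.

```hcl
module "prometheus_config" {
  # Placeholder: point this at wherever the module actually lives.
  source = "./path/to/this/module"

  # Entrypoint configuration; the module adds rule_files entries to it.
  config = file("${path.module}/prometheus.yml")

  # Illustrative values, not defaults of the module.
  fs_path         = "${path.module}/generated"
  etcd_key_prefix = "/example/prometheus/"

  node_exporter_jobs = [
    {
      # The scrape job is expected to be called "web-servers-node-exporter".
      tag                        = "web-servers"
      expected_count             = 3
      memory_usage_threshold     = 90
      cpu_usage_threshold        = 90
      expected_disks_count       = -1 # -1 disables the disk count alert
      disk_space_usage_threshold = 90
      disk_io_usage_threshold    = 90
      alert_labels               = { team = "ops" }
    }
  ]

  blackbox_exporter_jobs = [
    {
      # The scrape job is expected to be called "website-blackbox-exporter".
      tag                      = "website"
      unavailability_tolerance = "5m" # same format as the "for" field of alert rules
      max_acceptable_latency   = 2    # seconds
      cert_renewal_window      = 15   # days
      has_tls                  = true
      expect_recent_tls        = true # alert if tls is older than 1.3
      alert_labels             = { team = "ops" }
    }
  ]

  terracd_jobs = [
    {
      tag                      = "infra-pipeline" # should match the job name
      unit                     = "minute"         # thresholds below are in minutes
      plan_interval_threshold  = 60
      apply_interval_threshold = 120
      alert_labels             = { team = "ops" }
    }
  ]
}
```

The linked repository remains the authoritative reference for the exact variable types and for which keys are required.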