
TIG Stack

Dan Davies edited this page Oct 1, 2025 · 5 revisions

TIG (Telegraf/InfluxDB/Grafana) Stack

A TIG stack has been set up to monitor server resource usage, primarily disk usage, over time. A Grafana dashboard, public but reachable only from within Imperial, shows disk usage, RAM and CPU over time.

image

Disk usage above 85% should trigger a Slack message in the #alerts channel of the aichemy Slack workspace. The alert is configured directly within Grafana (see below).

image
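
The alert rule itself lives in Grafana, but the threshold logic can be illustrated with a minimal sketch. It assumes the percentage is computed as used / (used + free), which is how Telegraf's disk input is understood to derive its `used_percent` field. The function names are hypothetical and the script is not part of the deployed stack.

```python
import shutil

# Threshold taken from the alert described above.
ALERT_THRESHOLD_PERCENT = 85.0


def used_percent(used: int, free: int) -> float:
    """Return used space as a percentage of used + free (assumed formula)."""
    total = used + free
    if total == 0:
        return 0.0
    return 100.0 * used / total


def should_alert(used: int, free: int,
                 threshold: float = ALERT_THRESHOLD_PERCENT) -> bool:
    """True when usage is strictly above the threshold."""
    return used_percent(used, free) > threshold


if __name__ == "__main__":
    # Check the root filesystem of the machine running this script.
    usage = shutil.disk_usage("/")
    percent = used_percent(usage.used, usage.free)
    print(f"/ is {percent:.1f}% used, alert={should_alert(usage.used, usage.free)}")
```

Run on the server, this gives a quick local reading to compare against the dashboard. Exactly 85% does not alert in this sketch because the comparison is strict.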

The docker-compose.yaml and other config files are on the server at /tig-stack/.

Example docker-compose.yaml:

version: '3.8'

services:
  influxdb:
    image: influxdb:2.7
    container_name: influxdb
    restart: unless-stopped
    ports:
      - "8086:8086"
    volumes:
      - influxdb_data:/var/lib/influxdb2
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=<redacted>
      - DOCKER_INFLUXDB_INIT_ORG=default-org
      - DOCKER_INFLUXDB_INIT_BUCKET=telegraf
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=<redacted>

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    restart: unless-stopped
    depends_on:
      - influxdb
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
      - /:/hostfs:ro

  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  influxdb_data:
  grafana_data:
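
Once the stack is up, the two services with published ports can be checked over HTTP. The sketch below is one possible check, not part of the deployment. It assumes it runs on the server itself (hence `localhost`) and uses the InfluxDB 2.x `/health` and Grafana `/api/health` endpoints. The helper names are hypothetical.

```python
import json
import urllib.request

# Ports as published in docker-compose.yaml; localhost is an assumption.
ENDPOINTS = {
    "influxdb": "http://localhost:8086/health",
    "grafana": "http://localhost:3000/api/health",
}


def is_healthy(service: str, payload: object) -> bool:
    """Interpret the JSON body returned by each service's health endpoint."""
    if not isinstance(payload, dict):
        return False
    if service == "influxdb":
        return payload.get("status") == "pass"
    if service == "grafana":
        return payload.get("database") == "ok"
    return False


def check(service: str, url: str, timeout: float = 3.0) -> bool:
    """Fetch a health endpoint; any network or parse error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(service, json.loads(response.read()))
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(f"{name}: {'ok' if check(name, url) else 'unreachable or unhealthy'}")
```

Telegraf publishes no port in this compose file, so it is not covered here. Its container logs are the place to look instead.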

Example telegraf.conf:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = <redacted>
  organization = "default-org"
  bucket = "telegraf"

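[[inputs.disk]]
  ## Report usage for the root mount point only
  mount_points = ["/"]
  ## Skip in-memory and device pseudo-filesystems
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
  ## Drop inode counters; the byte-based usage fields are kept
  fielddrop = ["inodes*"]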

[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false