
FDB OTEL Exporter

The FDB OTEL exporter tails FoundationDB's JSON trace logs and emits OTEL metrics. The project also sets up simple Prometheus and Grafana containers for local analysis of FDB logs. So far, it has only been tested locally on a small 6-process cluster running on a Mac.

The Grafana dashboard has over 150 charts to visualize various FDB metrics:

[Example dashboard screenshot]

Why not use status json?

While relying solely on FDB logs means some cluster-wide metrics aggregated by the cluster controller are not exposed, the logs also contain many metrics that are not available through status json. For example, Histogram trace events give fine-grained details on the latencies of FDB sub-operations, which is helpful for identifying bottlenecks. Furthermore, status json gives incomplete output when the cluster is in an unhealthy state, e.g. stuck in recovery, so useful metrics are missing from status json precisely when visibility into the cluster's state is most needed. Not relying on status json also means that the fdb-otel-exporter crate has no dependency on the FDB client, though there is an implicit dependency on trace event schemas.

Prerequisites

  • Install Docker
  • Set the FDB_LOG_DIR environment variable to the directory containing the trace.*.json log files generated by your FDB processes (note that XML trace logs are not supported by this tool).
  • Export GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD with the Grafana admin credentials you want to use.
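
For example, the environment might be set up as follows (the FDB log path below is only an illustration; point it at wherever your fdbserver processes write their JSON traces, and pick your own Grafana credentials):

export FDB_LOG_DIR=/usr/local/foundationdb/logs   # example path; adjust to your cluster
export GF_SECURITY_ADMIN_USER=admin
export GF_SECURITY_ADMIN_PASSWORD=change-me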

Usage

Run:

docker compose up --build -d

Open localhost:3000 in your browser and log in with the credentials provided in GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD. Then navigate to the FDB Metrics dashboard in Grafana to view FDB process metrics.
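
When you are finished, the stack can be stopped with the standard Docker Compose command:

docker compose down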

Runtime Configuration

The exporter reads a handful of environment variables at startup:

  • LOG_DIR (default logs/): directory to tail for trace.*.json files and to emit generated samples.
  • LISTEN_ADDR (default 0.0.0.0:9200): socket address for the HTTP server that exposes /metrics and /health.
  • TRACE_LOG_FILE (default logs/tracing.log): path where structured logs from the exporter itself are written.
  • LOG_POLL_INTERVAL_SECS (default 2): frequency (in seconds) to rescan the log directory for new trace files.
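
As a sketch of overriding these defaults when running the exporter binary directly (this assumes a local Cargo build; when using the provided Docker Compose setup, the variables would instead be set on the exporter service in the compose file):

LOG_DIR=/var/log/foundationdb LISTEN_ADDR=0.0.0.0:9300 LOG_POLL_INTERVAL_SECS=5 cargo run --release

With the listen address above, you can confirm the exporter is up by hitting the /health endpoint, e.g. curl http://localhost:9300/health.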

Gauge Configuration

Reported metrics are configured in the gauge_config.toml file. There are currently 5 types of gauges that can be reported from JSON trace files:

  • Simple: Reports the numeric value of the field
  • CounterTotal: Reports the total value from a counter (the third space-delimited value of the field)
  • CounterRate: Reports the rate from a counter (the first space-delimited value of the field)
  • ElapsedRate: Reports the numeric value of the field divided by the Elapsed field in the same trace event
  • HistogramPercentile: Interpolates percentiles (assuming an exponential distribution) from histogram buckets aggregated by FDB

For each gauge, the trace_type, field_name, gauge_name, and description must be configured. For example, the following gauge configuration:

[[counter_total_gauge]]
trace_type = "StorageMetrics"
gauge_name = "ss_bytes_input"
field_name = "BytesInput"
description = "Total input bytes on storage server"

will report a gauge from trace events of the form:

{ "Type": "StorageMetrics", "Time": "<trace_time>", "BytesInput": "<rate> <roughness> <total>", "Machine": "<process_address>", ... }

For histogram percentile gauges, the schema is different, and a list of percentiles is provided. For example:

[[histogram_percentile_gauge]]
group = "CommitProxy"
op = "TlogLogging"
percentiles = [0.5, 0.99, 0.999]
gauge_name = "cp_tlog_logging_latency"
description = "commit proxy TLog logging latency"

will report interpolated P50, P99, and P999 latency estimates from FDB trace events with Type="Histogram", Group="CommitProxy", and Op="TlogLogging".

Recommended Knob Overrides

These charts are most valuable with fine-grained latency metrics and histograms. To enable this, apply the following knob overrides:

knob_latency_metrics_logging_interval = 10.0
knob_histogram_report_interval = 10.0
knob_kaio_latency_logging_interval = 10.0

Tracing these histograms frequently is relatively cheap for FDB, and the increased visibility from more frequent trace events generally outweighs any performance overhead.
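
As a sketch, assuming your processes are managed through foundationdb.conf, the overrides can be added to the [fdbserver] section and the fdbserver processes restarted; adjust this to however your deployment sets knobs:

[fdbserver]
knob_latency_metrics_logging_interval = 10.0
knob_histogram_report_interval = 10.0
knob_kaio_latency_logging_interval = 10.0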
