The FDB OTEL exporter tails FoundationDB's JSON trace logs and emits OpenTelemetry (OTEL) metrics. The project also sets up simple Prometheus and Grafana containers for local analysis of FDB logs. So far, it has only been tested locally on a small 6-process cluster running on a Mac.
The Grafana dashboard has over 150 charts to visualize various FDB metrics.
While relying solely on FDB logs prevents exposing some cluster-wide metrics that the cluster controller aggregates, the logs also contain many metrics not exposed through status json. For example, Histogram trace events give fine-grained details on latencies of FDB sub-operations, helpful for identifying bottlenecks. Furthermore, status json will give incomplete output if the cluster is in an unhealthy state, e.g. stuck in recovery. This means that useful metrics will be missing from status json precisely when it's most necessary to have visibility into the cluster's status. Not relying on status json also means that the fdb-otel-exporter crate has no dependency on the FDB client, though there is an implicit dependency on trace event schemas.
- Install Docker.
- Set the `FDB_LOG_DIR` environment variable to the directory containing the `trace.*.json` log files generated by your FDB processes (note that XML trace events are not supported by this tool).
- Export `GF_SECURITY_ADMIN_USER` and `GF_SECURITY_ADMIN_PASSWORD` with the Grafana admin credentials you want to use.
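For example, the environment setup might look like this (the log path is just a placeholder; point it at your own FDB trace directory):

```sh
# Placeholder path: substitute the directory where your fdbserver processes write trace.*.json files.
export FDB_LOG_DIR=/usr/local/foundationdb/logs

# Grafana admin credentials of your choosing.
export GF_SECURITY_ADMIN_USER=admin
export GF_SECURITY_ADMIN_PASSWORD=change-me
```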
Run:
```sh
docker compose up --build -d
```
Open `localhost:3000` in your browser and log in with the credentials provided in `GF_SECURITY_ADMIN_USER` / `GF_SECURITY_ADMIN_PASSWORD`. Then navigate in Grafana to the FDB Metrics dashboard to view FDB process metrics.
The exporter reads a handful of environment variables at startup:
- `LOG_DIR` (default `logs/`): directory to tail for `trace.*.json` files and to emit generated samples.
- `LISTEN_ADDR` (default `0.0.0.0:9200`): socket address for the HTTP server that exposes `/metrics` and `/health`.
- `TRACE_LOG_FILE` (default `logs/tracing.log`): path where structured logs from the exporter itself are written.
- `LOG_POLL_INTERVAL_SECS` (default `2`): frequency (in seconds) to rescan the log directory for new trace files.
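As a quick sanity check, the exporter's two endpoints can be queried directly (assuming the default `LISTEN_ADDR` and that the port is reachable from your shell):

```sh
# Liveness check of the exporter itself.
curl http://localhost:9200/health

# Metrics endpoint scraped by the local Prometheus container.
curl http://localhost:9200/metrics
```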
Reported metrics are configured in the `gauge_config.toml` file. There are currently five types of gauges that can be reported from JSON trace files:
- `Simple`: reports the numeric value of the field.
- `CounterTotal`: reports the total value from a counter (the third space-delimited value of the field).
- `CounterRate`: reports the rate from a counter (the first space-delimited value of the field).
- `ElapsedRate`: reports the numeric value of the field divided by the `Elapsed` field in the same trace event.
- `HistogramPercentile`: interpolates percentiles (assuming an exponential distribution) from histogram buckets aggregated by FDB.
For each gauge, the `trace_type`, `field_name`, `gauge_name`, and `description` must be configured. For example, the following gauge configuration:
```toml
[[counter_total_gauge]]
trace_type = "StorageMetrics"
gauge_name = "ss_bytes_input"
field_name = "BytesInput"
description = "Total input bytes on storage server"
```
will report a gauge from trace events of the form:
```
{ "Type": "StorageMetrics", "Time": "<trace_time>", "BytesInput": "<rate> <roughness> <total>", "Machine": "<process_address>", ... }
```
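By analogy, the rate portion of the same counter field can be reported as well. The sketch below assumes the TOML table name `[[counter_rate_gauge]]` follows the same naming pattern as `[[counter_total_gauge]]` (check `gauge_config.toml` for the exact table names in use):

```toml
# Assumed table name, by analogy with [[counter_total_gauge]].
[[counter_rate_gauge]]
trace_type = "StorageMetrics"
gauge_name = "ss_bytes_input_rate"
field_name = "BytesInput"
description = "Input byte rate on storage server"
```

A `CounterRate` gauge reads the first space-delimited value of the field (`<rate>`) rather than the third (`<total>`).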
For histogram percentile gauges, the schema is different, and a list of percentiles is provided. For example:
```toml
[[histogram_percentile_gauge]]
group = "CommitProxy"
op = "TlogLogging"
percentiles = [0.5, 0.99, 0.999]
gauge_name = "cp_tlog_logging_latency"
description = "commit proxy TLog logging latency"
```
will report interpolated P50, P99, and P999 latency estimates from FDB trace events with `Type="Histogram"`, `Group="CommitProxy"`, and `Op="TlogLogging"`.
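As a rough sketch of what such an interpolation can look like (a generic formulation, not necessarily the exporter's exact formula): if the requested percentile $p$ falls in a bucket $[\ell, u]$ with cumulative fractions $F_\ell$ and $F_u$ at the bucket edges, an exponential distribution makes $\ln(1 - F(x))$ linear in $x$, so interpolating linearly in that quantity gives

$$
x_p = \ell + (u - \ell)\,\frac{\ln(1 - F_\ell) - \ln(1 - p)}{\ln(1 - F_\ell) - \ln(1 - F_u)}.
$$

This reduces to $x_p = \ell$ when $p = F_\ell$ and to $x_p = u$ when $p = F_u$.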
These charts are most valuable with fine-grained latency metrics and histograms. To achieve this, apply the following knob overrides:
```ini
knob_latency_metrics_logging_interval = 10.0
knob_histogram_report_interval = 10.0
knob_kaio_latency_logging_interval = 10.0
```
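One common place to apply these (depending on how the cluster is deployed) is the `[fdbserver]` section of `foundationdb.conf`, so every server process picks them up:

```ini
# Example foundationdb.conf fragment; your deployment's config layout may differ.
[fdbserver]
knob_latency_metrics_logging_interval = 10.0
knob_histogram_report_interval = 10.0
knob_kaio_latency_logging_interval = 10.0
```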
Tracing these histograms frequently is relatively cheap for FDB, and the increased visibility from more frequent trace events generally outweighs any performance overhead.
