
Collectd


Cube's Collector supports integration with collectd, a popular tool for monitoring systems and collecting statistics. Assuming your collector is running on localhost with the default port 1080, edit your collectd.conf like so:

<Plugin write_http>
	<URL "http://localhost:1080/collectd">
		Format "JSON"
	</URL>
</Plugin>

You'll also need to enable the write_http plugin:

LoadPlugin write_http

If you want help installing collectd, see the download page. On Mac OS X, collectd can be installed via Homebrew with brew install collectd. To find where collectd looks for its configuration file, run collectd --version. You may want to disable some of the other plugins enabled by the default configuration, such as network.

The generic structure of a collectd event is:

{
  "time": "<time>",
  "type": "collectd",
  "data": {
    "host": "<host>",
    "plugin": "<plugin_instance>",
    "type": "<type_instance>",
    "<plugin>": {
      "<type>": {
        "<dsname>": value
      }
    }
  }
}
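
Cube builds this structure from the payload that collectd's write_http plugin posts to the /collectd endpoint. For reference, a load report in collectd's JSON format looks roughly like this (field layout per collectd's write_http format; the timestamp is Unix epoch seconds, and exact fields vary by collectd version):

[
  {
    "host": "localhost",
    "plugin": "load",
    "plugin_instance": "",
    "type": "load",
    "type_instance": "",
    "time": 1334590678,
    "interval": 10,
    "dsnames": ["shortterm", "midterm", "longterm"],
    "dstypes": ["gauge", "gauge", "gauge"],
    "values": [0.727539, 0.652344, 0.604004]
  }
]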

Cube transforms that payload into an event like this:

{
  "time": "2012-04-16T15:37:58Z",
  "type": "collectd",
  "data": {
    "host": "localhost",
    "load": {
      "shortterm": 0.727539,
      "midterm": 0.652344,
      "longterm": 0.604004
    }
  }
}

To plot the shortterm load for localhost, say:

max(collectd(load.shortterm)
    .eq(host, 'localhost'))

If the type and plugin have the same name (as "load" above), Cube simplifies the structure by collapsing the type map into the plugin map. In contrast, if the type and plugin have different names, then a nested object is generated. This object may have multiple values, if the plugin has multiple data sources. For example, the interface plugin generates data with rx and tx values:

{
  "host": "localhost",
  "plugin": "en1",
  "interface": {
    "if_octets": {
      "rx": 104,
      "tx": 66
    }
  }
}
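
To chart the received octets on en1, the nested value can be addressed with the same dotted-path convention used for load.shortterm above (a sketch; en1 is the plugin instance from the example event):

sum(collectd(interface.if_octets.rx)
    .eq(host, 'localhost')
    .eq(plugin, 'en1'))

Because if_octets is a "derive" type (see the note at the end of this page), Cube stores per-interval differences, so the sum is the number of octets received during each interval.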

Likewise, the disk plugin generates read and write values:

{
  "host": "localhost",
  "plugin": "14-0",
  "disk": {
    "disk_ops": {
      "read": 0,
      "write": 13
    }
  }
}
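
A read-activity query follows the same shape (again a sketch; 14-0 is the disk instance from the example event):

sum(collectd(disk.disk_ops.read)
    .eq(host, 'localhost')
    .eq(plugin, '14-0'))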

If only a single value is reported, the value is stored as a single value rather than a map of values. The memory plugin behaves this way (and also has a matching plugin and type name), resulting in the simplest event structure:

{
  "host": "localhost",
  "type": "inactive",
  "memory": 1036500000
}
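
Since the value is stored directly under the plugin name, the accessor is simply memory (a sketch, filtering on the inactive type instance from the example):

max(collectd(memory)
    .eq(host, 'localhost')
    .eq(type, 'inactive'))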

If a collectd plugin has multiple instances, then it will generate multiple events per interval. For example, the df plugin has an instance for each disk and a type instance for each category of usage. Rather than one event every ten seconds, this plugin might generate six or more:

{"host": "localhost", "df": {"df_complex": 187904}, "plugin": "dev", "type": "used"}
{"host": "localhost", "df": {"df_complex": 0}, "plugin": "dev", "type": "reserved"}
{"host": "localhost", "df": {"df_complex": 0}, "plugin": "dev", "type": "free"}
{"host": "localhost", "df": {"df_complex": 111875000000}, "plugin": "root", "type": "used"}
{"host": "localhost", "df": {"df_complex": 262144000}, "plugin": "root", "type": "reserved"}
{"host": "localhost", "df": {"df_complex": 126652000000}, "plugin": "root", "type": "free"}

To look at disk usage for localhost's root partition, use the eq filter to restrict which events are selected:

max(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .eq(type, 'used'))

Note that the reduce (max, here) has little effect on the 10-second interval because we only expect one event per interval. However, for larger intervals, the reduce is similar to Graphite's summarize method: the above would return the maximum value across all events in the same time interval. This is useful for "worst-case" analysis; consider using median for "normal-case".
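
For instance, a "normal-case" view of the same data simply swaps the reduce:

median(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .eq(type, 'used'))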

To convert disk usage to a ratio, divide used by used + free. However, since these values are reported as separate events, we must compute metrics for each, and then divide them (rather than divide properties on a single event):

sum(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .eq(type, 'used'))
  / sum(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .in(type, ['used', 'free']))

Equivalently, we can sum everything except the reserved space:

sum(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .eq(type, 'used'))
  / sum(collectd(df.df_complex)
    .eq(host, 'localhost')
    .eq(plugin, 'root')
    .ne(type, 'reserved'))

For "derive" type data, collectd reports monotonically increasing values. For example, the disk plugin reports disk_ops as an ever-increasing count of operations, and the interface plugin reports if_octets as an ever-increasing count of bytes. To be useful, we must compute the derivative: the number of disk operations or network octets in a particular time interval. Cube does this by storing the difference between the current value and previous value. Therefore, the first event that is received for any particular "derive" data source is always stored as a zero, because the previous value is unknown.

To read more about how collectd reports values, see the collectd documentation.
