Changelog is here.
This project is a heavily modified version of the amazing skx/overseer project. Compatibility between the two projects' data is not guaranteed and should not be expected.
Overseer is a simple and scalable golang-based remote protocol tester, which allows you to monitor the state of your network, and the services running upon it.
"Remote Protocol Tester" sounds a little vague, so to be more concrete this application lets you test that (remote) services are running, and has built-in support for performing testing against:
- DNS-servers
  - Test lookups of A, AAAA, MX, NS, and TXT records.
- Finger
- FTP
- HTTP & HTTPS fetches.
  - HTTP basic-authentication is supported.
  - Requests may be DELETE, GET, HEAD, PATCH, POST, etc.
  - SSL certificate validation and expiration warnings are supported.
- IMAP & IMAPS
- Kubernetes service endpoints check
- MySQL
- NNTP
- ping / ping6
- POP3 & POP3S
- Postgres
- redis
- rsync
- SMTP
- SSH
- SSL
- Telnet
- VNC
- XMPP
(The implementation of the protocol-handlers can be found beneath the top-level protocols/ directory in this repository.)
Tests to be executed are defined in a simple text-based format which has the general form:
$TARGET must run $SERVICE [with $OPTION_NAME $VALUE] ..
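To make the format concrete, here is a minimal Go sketch of how such a line could be split into its parts. It is not overseer's actual parser: the `Test` type and `ParseLine` function are hypothetical names, and the sketch assumes every option value is a single token without spaces or quoting.

```go
package main

import (
	"fmt"
	"strings"
)

// Test is a hypothetical in-memory form of one parsed line.
// Overseer's real parser and types may differ.
type Test struct {
	Target  string
	Service string
	Options map[string]string
}

// ParseLine parses "$TARGET must run $SERVICE [with $NAME $VALUE] ..".
// Simplifying assumption: each value is one whitespace-free token.
func ParseLine(line string) (Test, error) {
	f := strings.Fields(line)
	if len(f) < 4 || f[1] != "must" || f[2] != "run" {
		return Test{}, fmt.Errorf("malformed test: %q", line)
	}
	t := Test{Target: f[0], Service: f[3], Options: map[string]string{}}

	// Everything after the service must come in "with NAME VALUE" triples.
	rest := f[4:]
	if len(rest)%3 != 0 {
		return Test{}, fmt.Errorf("dangling option in: %q", line)
	}
	for i := 0; i < len(rest); i += 3 {
		if rest[i] != "with" {
			return Test{}, fmt.Errorf("expected \"with\", got %q", rest[i])
		}
		t.Options[rest[i+1]] = rest[i+2]
	}
	return t, nil
}

func main() {
	t, err := ParseLine("https://example.com must run http with pt-duration 60s with pt-threshold 15%")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(t.Target, t.Service, t.Options["pt-duration"], t.Options["pt-threshold"])
}
```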
You can see what the available tests look like in the sample test-file. Each of the included protocol-handlers is self-documenting, which means you can view example usage via:
~$ overseer examples [pattern]
All protocol-tests transparently support testing IPv4 and IPv6 targets, although you may globally disable either address family if you wish.
To install the project locally:
git clone https://github.com/cmaster11/overseer
cd overseer
go install
A sample deployment is provided in the `example-kubernetes` folder. Please take a look at the README for more instructions.
Beyond the compile-time dependencies, overseer requires a redis server, which is used for two things:
- As the storage-queue for parsed-jobs.
- As the storage-queue for test-results.
Because overseer is executed in a distributed fashion, tests are not executed as they are parsed/read; instead they are inserted into a redis-queue. A worker, or a number of workers, polls the queue, fetching and executing jobs as they become available.
In small-scale deployments it is probably sufficient to have a single worker, and all the software running upon a single host. For a larger number of tests (1000+) it might make more sense to have a pool of hosts each running a worker.
Because we don't want to be tied to a specific notification-system, the result of each test is also posted to the same redis-host, which allows results to be retrieved and transmitted to your preferred notifier.
More details about notifications are available later in this document.
As mentioned already, executing tests is a two-step process:
- First of all tests are parsed and inserted into a redis-based queue.
- Secondly the tests are pulled from that queue and executed.
This might seem a little convoluted, but it is a great design if you have a lot of tests to execute, because it allows you to deploy multiple workers. Instead of having a single host executing all the tests you can have 10 hosts, each watching the same redis-queue, pulling jobs and executing them as they become available.
In short, using a central queue allows you to scale out the testing horizontally.
To add your tests to the queue you should run:
$ overseer enqueue \
-redis-host=queue.example.com:6379 [-redis-pass='secret.here'] \
test.file.1 test.file.2 .. test.file.N
This will parse the tests contained in the specified files, adding each of them to the (shared) redis queue. Once all of the jobs have been parsed and inserted into the queue the process will terminate.
To drain the queue you should now start a worker, which will fetch the tests and process them:
$ overseer worker -verbose \
-redis-host=queue.example.com:6379 [-redis-pass='secret']
The worker will run constantly, not terminating unless manually killed. With
the worker running you can add more jobs by re-running the overseer enqueue
command.
To run tests in parallel simply launch more instances of the worker, on the same host, or on different hosts.
By default the worker processes as many tests in parallel as the current machine has logical CPUs. To alter this behavior, you can use the `-parallel` flag:
$ # Runs 9 tests at a time
$ overseer worker -parallel 9
A higher number of parallel tests is useful when some tests are long-running, so that they do not delay the execution of the others.
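The parallelism described above can be pictured as a pool of goroutines draining one shared queue. The following is only a sketch of that pattern, not the worker's real code: `runAll` and `errDown` are hypothetical names, a Go channel stands in for the redis queue, and the fallback to `runtime.NumCPU()` mirrors the documented default.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// errDown is a placeholder failure used by the demo below.
var errDown = errors.New("service down")

// runAll drains jobs with `parallel` concurrent workers. A value <= 0
// falls back to the number of logical CPUs. Results are keyed by job
// text, so duplicate jobs would overwrite each other in this sketch.
func runAll(jobs []string, parallel int, run func(string) error) map[string]error {
	if parallel <= 0 {
		parallel = runtime.NumCPU()
	}
	queue := make(chan string) // stand-in for the shared redis queue
	results := make(map[string]error, len(jobs))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < parallel; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				err := run(job) // a slow test only blocks this one worker
				mu.Lock()
				results[job] = err
				mu.Unlock()
			}
		}()
	}
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
	return results
}

func main() {
	jobs := []string{"a must run ssh", "b must run http", "c must run ftp"}
	results := runAll(jobs, 9, func(job string) error {
		if job == "b must run http" {
			return errDown
		}
		return nil
	})
	for _, j := range jobs {
		fmt.Printf("%s -> %v\n", j, results[j])
	}
}
```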
Let's imagine that you want to test how many times your web service fails in 1 minute. You can run period-tests:
https://example.com must run http with pt-duration 60s with pt-sleep 2s with pt-threshold 15%
The previous line will trigger a period-test, where the same test `https://example.com must run http` will be tested over and over for a duration of 60 seconds (`pt-duration 60s`), with a pause of 2 seconds (`pt-sleep 2s`) between each test. At the end of the testing period, if the percentage of errors is higher than 15% (`pt-threshold 15%`), an alert will be generated, e.g.:
8 tests failed out of 21 (38.10%)
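The arithmetic behind that message can be sketched in Go as follows. This is an illustration under assumptions rather than overseer's implementation: `Summary`, `RunPeriod` and `errFail` are hypothetical names, and the threshold comparison is taken to be strictly "higher than", as the text above states.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFail is a placeholder failure used by the demo below.
var errFail = errors.New("check failed")

// Summary is a hypothetical tally of one period-test run.
type Summary struct{ Failed, Total int }

// FailurePercent returns the share of failed runs, in percent.
func (s Summary) FailurePercent() float64 {
	if s.Total == 0 {
		return 0
	}
	return 100 * float64(s.Failed) / float64(s.Total)
}

// Alert reports whether the failure percentage is strictly higher
// than the threshold (both expressed in percent).
func (s Summary) Alert(thresholdPercent float64) bool {
	return s.FailurePercent() > thresholdPercent
}

func (s Summary) String() string {
	return fmt.Sprintf("%d tests failed out of %d (%.2f%%)", s.Failed, s.Total, s.FailurePercent())
}

// RunPeriod repeats check until duration has elapsed, sleeping between runs.
func RunPeriod(check func() error, duration, sleep time.Duration) Summary {
	var s Summary
	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		s.Total++
		if check() != nil {
			s.Failed++
		}
		time.Sleep(sleep)
	}
	return s
}

func main() {
	calls := 0
	flaky := func() error { // fails on every third call
		calls++
		if calls%3 == 0 {
			return errFail
		}
		return nil
	}
	s := RunPeriod(flaky, time.Second, 200*time.Millisecond)
	fmt.Println(s, "alert:", s.Alert(15))
}
```

With the figures from the example message, `Summary{Failed: 8, Total: 21}` formats as `8 tests failed out of 21 (38.10%)` and exceeds a 15% threshold.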
You can also test multiple cases with a dumb test:
dumb-test1 must run dumb-test with pt-duration 5s with pt-sleep 200ms with pt-threshold 0% with dumb-duration-max 100ms
dumb-test2 must run dumb-test with pt-duration 5s with pt-sleep 200ms with pt-threshold 20% with dumb-duration-max 100ms
dumb-test3 must run dumb-test with pt-duration 5s with pt-sleep 200ms with pt-threshold 40% with dumb-duration-max 100ms
If no `pt-sleep` is defined, Overseer will default to the value of the `-period-test-sleep` command-line flag, or to `5s`.
If no `pt-threshold` is defined, Overseer will default to the value of the `-period-test-threshold` command-line flag, or to `0%`.
Note: the `pt-` flags are shortened versions of the longer tags, which can also be used:
pt-duration -> period-test-duration
pt-sleep -> period-test-sleep
pt-threshold -> period-test-threshold
Note: period-tests, by default, have no deduplication rules enabled. To enable deduplication, you need to manually add the `with dedup 5m` flag.
You can test Overseer's functionality locally using some scripts.
Setup Overseer with:
- Run a local redis: `./scripts/test-run-redis.sh` (hosts the processing queue)
- Run a local worker: `./scripts/test-run-worker.sh` (runs the actual tests)
- Run a local webhook bridge: `./scripts/test-run-webhook-bridge.sh` (fetches test results from the queue and triggers the webhook)
- Run a local HTTP webhook listener: `./scripts/test-run-http-dump.sh` (listens for webhook requests and dumps them to `stdout`)
Run tests with:
- Sample dumb tests: `./scripts/test-run-enqueue.sh`
- An always-failing test: `./scripts/test-run-enqueue-fail.sh`
- A sample period-test: `./scripts/test-run-enqueue-period.sh`
- Custom rules: `./scripts/test-run-enqueue-stdin.sh "https://google.com must run http"`
Beneath systemd/ you will find some sample service-files which can be used to deploy overseer upon a single host:
- A service to start a single worker, fetching jobs from a redis server.
  - The redis-server is assumed to be running on `localhost`.
- A service & timer to regularly populate the queue with fresh jobs to be executed.
  - i.e. The first service is the worker, this second one feeds the worker.
To avoid triggering false alerts due to transient (network/host) failures, tests which fail are retried several times before triggering a notification.
This smoothing is designed to avoid raising an alert, which then clears upon the next overseer run, but the downside is that flapping services might not necessarily become visible.
If you're absolutely certain that your connectivity is good, and that alerts should always be raised for failing services, you can disable this retry-logic via the command-line flag `-retry=false`.
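The retry behaviour can be sketched as a small loop. The number of attempts and the delay between them are assumptions made for illustration, since this document only says "several times"; `runWithRetry` and `errTransient` are hypothetical names, and passing a single attempt corresponds to running with `-retry=false`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder failure used by the demo below.
var errTransient = errors.New("transient failure")

// runWithRetry runs check up to `attempts` times and stops at the first
// success. Only the error from the final attempt is returned, so a test
// that recovers within the retry budget never produces an alert.
func runWithRetry(check func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay) // pause before the next attempt
		}
	}
	return err
}

func main() {
	calls := 0
	flaky := func() error { // fails twice, then succeeds
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}

	err := runWithRetry(flaky, 5, 10*time.Millisecond)
	fmt.Printf("with retries:    calls=%d err=%v\n", calls, err)

	calls = 0
	err = runWithRetry(flaky, 1, 0) // a single attempt, i.e. no retrying
	fmt.Printf("without retries: calls=%d err=%v\n", calls, err)
}
```

The sketch also shows the trade-off mentioned above: the flapping check is reported as healthy when retries are enabled.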
The result of each test is submitted to the central redis-host, from where it can be pulled and used to notify a human of a problem.
Sample result-processors are included in this repository which post test-results via webhook (e.g. to trigger notifications with Notify17), to a purppura instance, or via email.
The sample bridges are primarily included for demonstration purposes; the expectation is that you'll prefer to process the results and issue notifications to humans via your favourite in-house tool, be it Notify17 or something similar.
The results themselves are published as JSON objects to the `overseer.results` list. As it generates alerts, your notifier should remove the results from this list to prevent it from growing indefinitely.
You can check the size of the results list at any time via `redis-cli`, like so:
$ redis-cli llen overseer.results
(integer) 0
The JSON object used to describe each test-result has the following fields:
Field Name | Field Value |
---|---|
input | The input as read from the configuration-file. |
error | If the test failed this will explain why, otherwise it will be null. |
time | The time the result was posted, in seconds past the epoch. |
target | The target of the test, either an IPv4 address or an IPv6 one. |
type | The type of test (ssh, ftp, etc). |
isDedup | If true, the alert is a duplicate of a previously triggered one (see deduplication). |
recovered | If true, the alert has recovered from a previous error (see deduplication). |
NOTE: The `input` field will be updated to mask any password options which have been submitted with the tests.
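A notifier written in Go might decode these objects along the following lines. The struct is a sketch derived from the table above, not a type exported by overseer: `Result` and `decodeResult` are hypothetical names, the sample payload is made up, and `time` is assumed to be encoded as a JSON number (adjust the field type if your results carry it as a string).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Result mirrors the documented fields of a test-result.
type Result struct {
	Input     string  `json:"input"`
	Error     *string `json:"error"` // nil when the JSON value is null
	Time      int64   `json:"time"`  // assumption: numeric seconds past the epoch
	Target    string  `json:"target"`
	Type      string  `json:"type"`
	IsDedup   bool    `json:"isDedup"`
	Recovered bool    `json:"recovered"`
}

// decodeResult parses one JSON object as read from the results queue.
func decodeResult(raw []byte) (Result, error) {
	var r Result
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	// Hypothetical payload, for illustration only.
	raw := []byte(`{"input":"https://example.com must run http",
		"error":"connection refused","time":1700000000,
		"target":"192.0.2.10","type":"http","isDedup":false,"recovered":false}`)

	r, err := decodeResult(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	when := time.Unix(r.Time, 0).UTC().Format(time.RFC3339)
	if r.Error != nil {
		fmt.Printf("%s test against %s failed at %s: %s\n", r.Type, r.Target, when, *r.Error)
	} else {
		fmt.Printf("%s test against %s passed at %s\n", r.Type, r.Target, when)
	}
}
```

Using a pointer for `Error` keeps the distinction between a null error (the test passed) and an empty error string.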
As mentioned this repository contains some demonstration "bridges", which poll the results from Redis, and forward them to more useful systems:
- webhook-bridge/main.go
  - Forwards each test-result to a generic URL (e.g. to trigger notifications with Notify17).
  - If started with the flag `-send-test-recovered=true`, tests which recovered from failure (see deduplication) are sent.
  - If started with the flag `-send-test-success=true`, successful tests are sent.
- queue-bridge/main.go
  - Clones test results to multiple `-destionation-queues`, so that they can be processed by multiple other bridges, like email and webhook (example).
- email-bridge/main.go
  - This posts test-failures via email.
  - If started with the flag `-send-test-recovered=true`, tests which recovered from failure (see deduplication) are sent.
  - If started with the flag `-send-test-success=true`, successful tests are sent.
- sendmail-bridge/main.go
  - This posts test-failures via sendemail.
  - Tests which pass are not reported.
- purppura-bridge/main.go
  - This forwards each test-result to a purppura host.
Disclaimer: deduplication has been fully developed only for the webhook and email bridges.
It is possible to enable the deduplication of alerts by using the `with dedup 5m` rule, or by starting `overseer worker` with the `-dedup=5m` flag.
Deduplication works as follows:
- When an alert gets triggered because of an error:
  - Overseer calculates a unique hash for the generated alert, based on the input rule.
  - If the same alert was already generated within the specified period (e.g. `5m` for 5 minutes), a new alert will NOT be triggered.
  - If the same alert was already generated, but enough time has passed (e.g. > 5 min ago), a new alert will be generated, and will carry the `isDedup` flag set to `true`.
- When a test succeeds, after having failed in the past:
  - A new alert will be generated, having `error` set to `null` and `recovered` set to `true`.
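The rules above can be expressed as a small state machine. The sketch below is one possible reading of them, not overseer's code: `Deduper`, `Alert`, `Observe`, `at`, `dedupPeriod` and `errDown` are hypothetical names, SHA-256 is an arbitrary choice of hash, and state lives in an in-memory map purely for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

const dedupPeriod = 5 * time.Minute

// errDown is a placeholder failure used by the demo below.
var errDown = errors.New("connection refused")

// Alert carries the fields relevant to deduplication.
type Alert struct {
	Input     string
	Error     *string
	IsDedup   bool
	Recovered bool
}

// Deduper remembers when each failing rule last produced an alert.
type Deduper struct {
	period time.Duration
	last   map[string]time.Time // rule hash -> time of the last alert
}

func NewDeduper(period time.Duration) *Deduper {
	return &Deduper{period: period, last: map[string]time.Time{}}
}

// hashInput derives the dedup key from the input rule.
func hashInput(input string) string {
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])
}

// Observe records one test outcome; the bool says whether to send the alert.
func (d *Deduper) Observe(input string, testErr error, now time.Time) (Alert, bool) {
	key := hashInput(input)
	prev, failing := d.last[key]
	if testErr == nil {
		if !failing {
			return Alert{}, false // plain success: nothing to report here
		}
		delete(d.last, key)
		return Alert{Input: input, Recovered: true}, true
	}
	msg := testErr.Error()
	switch {
	case !failing: // first failure of this rule
		d.last[key] = now
		return Alert{Input: input, Error: &msg}, true
	case now.Sub(prev) < d.period: // still inside the dedup window
		return Alert{}, false
	default: // window elapsed: alert again, marked as a duplicate
		d.last[key] = now
		return Alert{Input: input, Error: &msg, IsDedup: true}, true
	}
}

// at returns a fixed reference time plus the given number of minutes.
func at(minutes int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(minutes) * time.Minute)
}

func main() {
	d := NewDeduper(dedupPeriod)
	const rule = "https://example.com must run http with dedup 5m"
	steps := []struct {
		minute int
		err    error
	}{{0, errDown}, {1, errDown}, {6, errDown}, {7, nil}}
	for _, s := range steps {
		a, sent := d.Observe(rule, s.err, at(s.minute))
		fmt.Printf("t=%dm sent=%v isDedup=%v recovered=%v\n", s.minute, sent, a.IsDedup, a.Recovered)
	}
}
```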
Overseer has partial built-in support for exporting metrics to a remote carbon-server:
- Details of the system itself.
  - Via the go-metrics package.
- Details of the tests executed.
  - Including the time to run tests, perform DNS lookups, and retry-counts.
To enable this support simply export the environment variable `METRICS` with the hostname of your remote metrics-host prior to launching the worker.
We use Redis as a queue as it is simple to deploy, stable, and well-known.
Redis doesn't natively operate as a queue, so we emulate one via the "list" primitives. Adding a job to a queue is performed via an "rpush" operation, and pulling a job from the queue is achieved via a "blpop" command.
We use the following lists as queues:
overseer.jobs
- For storing tests to be executed by a worker.
overseer.results
- For storing results, to be processed by a notifier.
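To show what these list operations look like on the wire, the sketch below encodes commands in the Redis serialization protocol (RESP) using only the Go standard library. It is illustrative: overseer itself presumably talks to Redis through a client library, `encodeCommand` is a hypothetical helper, and the job payload shown (a raw test line) is an assumption about what gets stored.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeCommand renders a Redis command as a RESP array of bulk strings:
// "*<argc>\r\n" followed by "$<len>\r\n<arg>\r\n" for each argument.
func encodeCommand(args ...string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "*%d\r\n", len(args))
	for _, a := range args {
		fmt.Fprintf(&b, "$%d\r\n%s\r\n", len(a), a)
	}
	return b.String()
}

func main() {
	// Producer side: append a job to the tail of the list.
	fmt.Printf("%q\n", encodeCommand("RPUSH", "overseer.jobs", "example.com must run ssh"))

	// Consumer side: pop from the head, blocking until a job arrives
	// (a timeout of 0 means "wait indefinitely").
	fmt.Printf("%q\n", encodeCommand("BLPOP", "overseer.jobs", "0"))
}
```

Pushing at the tail and popping from the head is what gives the list its first-in, first-out behaviour.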
You can examine the length of either queue via the llen operation.
- To view jobs pending execution:
  - `redis-cli lrange overseer.jobs 0 -1`
- Or to view just the count:
  - `redis-cli llen overseer.jobs`
- To view test-results which have yet to be notified:
  - `redis-cli lrange overseer.results 0 -1`
- Or to view just the count:
  - `redis-cli llen overseer.results`