This project contains an example bash script that shows how Helix components could be monitored from a Nagios server. The plugin currently checks:
- if the P4D server is responding.
- if the P4D server license is about to expire.
- if the P4D disk space is low.
- if the number of running P4D processes is excessive.
- if replication is running (replica server only).
The return states are configurable so could be customised for other monitoring solutions as needed:
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
check_helix_p4d_health
Usage:
check_helix_p4d_health -p <p4port> [options] [tests]
options: [-u <p4-user>] [-T <p4-ticket-file-name>] [-P <p4-password>]
[-t <p4trust-file>] [--tips] [--comment] [--version] [--help]
tests: [--all]
[--remoteonly]
[--licensecheck] [--licthreshold <expire-days-threshold>]
[--pidcheck] [--pidthreshold <running-p4-pid-threshold>]
[--p4monitorcheck] [--monthreshold <p4-processes-threshold>]
[--p4diskcheck] [--diskthreshold <disk-used-threshold>]
[--p4repcheck]
dependencies: The plugin relies on 'p4' being installed on the
machine executing the plugin.
Nagios plugins are required to only output one line of text. The '--tips'
flag can be used to display full error/status output and a helpful tip
to help find the cause or solve the problem.
The '-u' flag specifies the P4D user name. This user must be an 'operator'
user or must have 'super' access to the P4D server
The '-p' flag specifies the P4D hostname and port in the
format 'hostname:port'.
The '-t' flag specifies the location of the P4TRUST file for SSL
connections.
The '-T' flag specifies the location of the tickets file.
The '-P' flag explicitly sets the password to use. Note this may not be
secure and the use of 'p4 login' and a tickets file ('-T') on the server
provides better security.
If no tests are specified then only the P4D server online test is run.
Multiple test arguments can be supplied. Thresholds can be supplied
with specific tests or when using '--all' or '--remoteonly'.
The '--all' flag runs all tests.
The '--remoteonly' flag runs only remote tests using the p4 client. These
tests can be run from any machine with network access to the P4D server
(including the Nagios server).
The '--licensecheck' flag tests if the license file is nearing it's
expiry date. By default it checks for expiry within '30' days but this
can be overriden with the '--licthreshold' flag.
The '--p4diskcheck' flag checks for free disk space on the P4D drives using
the Perforce command 'p4 diskspace' and warns if the disks are over 95% used.
This value can be overriden by speciying a value between 0 and 99 using with
the '--diskthreshold' flag.
The '--pidcheck' flag counts the number of connected p4d processes using
'netstat' and warns if the are over 500 processes running. This value can
be overriden with the '--pidthreshold' flag. This test must be run on the
P4D server machine.
The '--p4monitorcheck' flag counts the number of commands in the
'p4 monitor' table and warns if there are over 500 running. This value can be
overriden with the '--monthreshold' flag.
The '--p4repcheck' flags (REPLICA ONLY) checks that replication is running.
The '--message' flag allows you to prepend an alert with a custom message.
Example output without '--tips':
CRITICAL: P4D server not responding!
Example output with '--tips':
CRITICAL: P4D server not responding!
Perforce client error:
Connect to server failed; check $P4PORT.
TCP connect to localhost:1666 failed.
connect: 127.0.0.1:1666: Connection refused
TIP:
Check if the 'p4d' process is running on the box.
Check the P4D log file for errors if it unexpectedly
stopped.
Examples:
Run all checks against server on localhost:1666
check_helix_p4d_health -p 1666 --all
Check if license will expire in next 45 days
check_helix_p4d_health -p 1666 --licensecheck -licthreshold 45
If the P4D server is running using SSL then the connection must first be trusted. For example to trust connection 'ssl:localhost:1666':
export P4TRUST=/etc/nagios/.p4trust
touch $P4TRUST
p4 -p ssl:localhost:1666 trust
chown nagios:nagios $P4TRUST
chmod 600 $P4TRUST
The Nagios command then may look similar to:
check_helix_p4d_health -p 1666 --p4diskcheck -u operator -T /etc/nagios/.p4tickets
Many of the Perforce tests require the user to be logged in. The plugin allows you to specify a password with the '-P' flag but it's better practice to use a long lived ticket owned by the Nagios user on the monitored system. For example to create a ticket under '/etc/nagios/.p4tickets' for a Helix P4D user called 'operator':
export P4TICKETS=/etc/nagios/.p4tickets
touch $P4TICKETS
p4 -p localhost:1666 -u operator login
chown nagios:nagios $P4TICKETS
chmod 600 $P4TICKETS
The Nagios command then may look similar to:
check_helix_p4d_health -p localhost:1666 -u operator -T /etc/nagios/.p4tickets --p4diskcheck
It's also good practice that the P4D user is an 'operator' user to decrease the amount of power this user has. If you have any queries about Helix P4D users and security discuss this with your security team or '[email protected]'.
While this document uses the term "Nagios" it's more precise to say these instructions are for Nagios Core, the GPL version of the Nagios monitoring engine. As Nagios XI uses the Nagios Core component for the checks, there is no reason to believe the plugin will not work in Nagios XI. However, directly editing the config files will not work in Nagios XI unless you use their static directory. See https://assets.nagios.com/downloads/nagiosxi/docs/Managing-Config-Files-Manually-With-Nagios-XI.pdf for additional information.
The script can be run using SSH or NRPE. In the example below I use the NRPE plugin. Note that it requires that the NRPE service accepts arguments which can be a security risk. If you have an doubts use SSH or hard code the paramaters on the monitored server.
Below are the command definitions for the NRPE service and scripts on the Nagios server.
define command{
command_name check_nrpe
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}
define command{
command_name check_helix_p4d_health
command_line $USER1$/check_helix_p4d_health $ARG1$
}
Below is a service definition that runs the script 'check_helix_p4d_health' use 'check_nrpe' and provides the parameters '-p localhost:1666 --licensecheck'. This can be used to check the Helix P4D license status on port 1666 on theserver 'master-helix-server'.
define service {
host_name master-helix-server
service_description Helix P4D ServerName 1666 License Check
check_command check_nrpe!check_helix_p4d_health!'-p localhost:1666 --licensecheck'
max_check_attempts 2
check_interval 2
retry_interval 2
check_period 24x7
check_freshness 1
contact_groups admins
notification_interval 2
notification_period 24x7
notifications_enabled 1
register 1
}
As you can see the '-p' flag is the value that will be run on the server running the script. When using NRPE the script will be run on the Perforce server so 'localhost:1666' could be passed.
Note: You can choose to have a seperate service definition that allows you to monitor each test as a seperate entry in Nagios as above, or specify multipls tests in one definition (for example using '--all'). For example:
define service {
host_name master-helix-server
service_description Helix P4D ServerName 1666
check_command check_nrpe!check_helix_p4d_health!'-p localhost:1666 --all -u operator -T /etc/nagios/.p4tickets'
max_check_attempts 2
check_interval 2
retry_interval 2
check_period 24x7
check_freshness 1
contact_groups admins
notification_interval 2
notification_period 24x7
notifications_enabled 1
register 1
}
On the monitored server the command is configured within NRPE as a single line:
command[check_helix_p4d_health]=/usr/lib/nagios/plugins/check_helix_p4d_health $ARG1$*
More information on configuring NRPE can be found on the Nagios website.
The plugin has been tested on Ubuntu 14.04 with P4D 2016.1 Beta, P4D 2015.2 and P4D 2013.1. The test harness has been provided to allow it to easily be tested on other platforms with other P4D versions. The tests should be run from the plugin directory using:
perl t/check_helix_p4d_health.t
The tests require that P4 and P4D are in the path and will start P4D servers on ports 1234 and 1235 under the directory './t/tmp/'. At the end of a the test suite the P4D servers on these ports and the './t/tmp' directory will be removed.
IMPORTANT NOTE: This test harness is intended for development purposes only and should not be run on a live P4D machine.
Additional P4D checks:
- highlight when replication slows down.
- check btree depths.
- warn on long running commands.
Additional Helix checks:
- Swarm checks.
- Helix4Git checks.