Merge pull request #39 from nuriel77/add/prom-iri-restart
Add/prom iri restart
nuriel77 authored Dec 28, 2017
2 parents e00df8a + 5c43b71 commit a38e8f1
Showing 10 changed files with 200 additions and 2 deletions.
110 changes: 110 additions & 0 deletions docs/appendix.rst
@@ -254,3 +254,113 @@ The same can be done with ``alertmanager``.


For more information see `Documentation Prometheus Alertmanager <https://prometheus.io/docs/alerting/alertmanager/>`_



Restart IRI On Latest Subtangle Milestone Stuck
===============================================

A feature has been added to alertmanager that makes it possible to trigger an IRI restart when the Latest Subtangle Milestone is stuck.


.. warning::

   This feature is disabled by default because it is not considered a permanent or ideal solution. Please first try to download a fully synced database as proposed in the FAQ, or try to find "healthier" neighbors.


Enabling the Feature
--------------------

Log in to your node and edit the alertmanager configuration file: ``/opt/prometheus/alertmanager/config.yml``.

You will find the following lines::

  # routes:
  # - receiver: 'executor'
  #   match:
  #     alertname: MileStoneNoIncrease

Remove the ``#`` comments, resulting in::

  routes:
  - receiver: 'executor'
    match:
      alertname: MileStoneNoIncrease

Take care to preserve the indentation (the block should start at 2 spaces).
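
The uncommenting step can also be scripted. The following is a minimal sketch, assuming the commented block looks exactly like the four lines shown above; ``uncomment_executor_route`` is a hypothetical helper name that is not part of the playbook, and the sample file layout is made up for the demo. The helper only prints the result, so the real file stays untouched until the output has been reviewed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the playbook): print the given config
# with the leading "# " removed from the executor route block only.
# Other comment lines outside that block are left as they are.
uncomment_executor_route() {
    sed -E '/^[[:space:]]*# routes:/,/^[[:space:]]*#[[:space:]]*alertname: MileStoneNoIncrease/ s/^([[:space:]]*)# /\1/' "$1"
}

# Demo on a temporary sample file (assumed layout, not the real config).
sample=$(mktemp)
cat > "$sample" <<'EOF'
route:
  receiver: email-me
  # routes:
  # - receiver: 'executor'
  #   match:
  #     alertname: MileStoneNoIncrease
EOF
uncomment_executor_route "$sample"
rm -f "$sample"
```

To use it on a node, one option is to write the output to a temporary file, compare it with the original using ``diff``, and only then replace ``/opt/prometheus/alertmanager/config.yml``.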

After having applied the changes, save the file and restart alertmanager: ``systemctl restart alertmanager``.

Once this is enabled, alertmanager calls the ``prom-am-executor`` service, which restarts IRI when the Latest Subtangle Milestone has been stuck for more than ``30`` minutes.


.. note::

   This alert-trigger is set to only execute if the Latest Subtangle Milestone is stuck and not equal to 243000 (which is the case when starting up or restarting IRI).
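
The condition in the note can be illustrated with a small sketch. It is not how Prometheus evaluates the rule; it only mirrors the two checks with whole numbers. The helper name ``milestone_alert_fires`` and the milestone values in the demo are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring the alert condition with whole numbers:
#   $1 = milestone increase over the window, $2 = current milestone.
# Succeeds (exit status 0) only when both checks hold.
milestone_alert_fires() {
    startup_milestone=243000  # value reported while IRI is starting up
    [ "$1" -eq 0 ] && [ "$2" -ne "$startup_milestone" ]
}

# Demo with made-up milestone numbers.
for pair in "0 300000" "0 243000" "12 300000"; do
    set -- $pair
    if milestone_alert_fires "$1" "$2"; then
        echo "increase=$1 current=$2 -> restart IRI"
    else
        echo "increase=$1 current=$2 -> no action"
    fi
done
```

Only the first pair satisfies both checks: the milestone did not move and it is not the startup value.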


Disabling the Feature
---------------------
A quick way to disable this feature:

.. code:: bash

   systemctl stop prom-am-executor && systemctl disable prom-am-executor

To re-enable:

.. code:: bash

   systemctl enable prom-am-executor && systemctl start prom-am-executor

Configuring the Feature
-----------------------

You can tweak some values for this feature, for example how long to wait on a stuck milestone before restarting IRI.

Edit the file ``/etc/prometheus/alert.rules.yml``, find the alert definition::

  # If latest subtangle milestone doesn't increase for 30 minutes
  - alert: MileStoneNoIncrease
    expr: increase(iota_node_info_latest_subtangle_milestone[30m]) == 0
          and iota_node_info_latest_subtangle_milestone != 243000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: 'Latest Subtangle Milestone increase is {{ $value }}'
      summary: 'Latest Subtangle Milestone not increasing'

The time window is set in the expression ``increase(iota_node_info_latest_subtangle_milestone[30m]) == 0``: replace ``30m`` with any other value in the same format (e.g. ``1h`` or ``15m``).

If you change this file, remember to restart prometheus: ``systemctl restart prometheus``.
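
Changing the window can be sketched as a small script. This is an illustration under assumptions, not part of the playbook: ``set_stuck_window`` is a hypothetical helper name, and it only accepts a number followed by a single unit letter. It prints the modified rules rather than editing the file in place, and it does not update the descriptive comment above the alert.

```shell
#!/bin/sh
# Hypothetical helper (not part of the playbook): print the rules file with
# the window of the stuck-milestone expression replaced.
#   $1 = rules file, $2 = new window such as 15m or 1h
set_stuck_window() {
    # Accept only a number followed by one unit letter.
    if ! printf '%s' "$2" | grep -Eq '^[0-9]+[smhdwy]$'; then
        echo "invalid window: $2" >&2
        return 1
    fi
    sed -E "s/(iota_node_info_latest_subtangle_milestone\[)[0-9]+[smhdwy](\])/\1${2}\2/" "$1"
}

# Demo on a temporary one-line sample.
rules=$(mktemp)
echo 'expr: increase(iota_node_info_latest_subtangle_milestone[30m]) == 0' > "$rules"
set_stuck_window "$rules" 1h
rm -f "$rules"
```

If ``promtool`` is available on the node, running ``promtool check rules`` against the edited file before restarting prometheus is a reasonable extra safeguard.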


Upgrading the Playbook to Get the Feature
-----------------------------------------

If you installed the playbook before this feature was released, you can still install it.

1. Enter the iri-playbook directory and pull new changes:

   .. code:: bash

      cd /opt/iri-playbook && git pull

   If this command fails, it means that you have conflicting changes in one of the configuration files. Try to identify those manually, create a backup of those files if required, revert them and re-run the above command (or contact me on Slack or GitHub for assistance).

2. **WARNING:** this will overwrite any changes you applied manually to your monitoring configuration files! Run the playbook's monitoring role:

   .. code:: bash

      ansible-playbook -i inventory -v site.yml --tags=monitoring_role -e overwrite=true

3. **If** the playbook fails with a 401 authorization error (probably when trying to run the prometheus grafana datasource task), re-run the command and supply your web-authentication password with it:

   .. code:: bash

      ansible-playbook -i inventory -v site.yml --tags=monitoring_role -e overwrite=true -e iotapm_nginx_password="mypassword"

2 changes: 2 additions & 0 deletions group_vars/all/monitoring.yml
@@ -28,3 +28,5 @@ alertmanager_email_to: root@localhost
alertmanager_loglevel: info
smtp_host: localhost
smtp_port: 25

prom_am_executor_port: 9158
2 changes: 1 addition & 1 deletion roles/monitoring/files/alert.rules.yml
@@ -242,4 +242,4 @@ groups:
severity: critical
annotations:
description: 'Latest Subtangle Milestone increase is {{ $value }}'
summary: 'Latest Subtangle Milestone not increasing'
summary: 'Latest Subtangle Milestone not increasing'
Binary file added roles/monitoring/files/prom-am-executor
Binary file not shown.
3 changes: 3 additions & 0 deletions roles/monitoring/files/restart_iri.sh
@@ -0,0 +1,3 @@
#!/bin/bash

/bin/systemctl restart iri
6 changes: 6 additions & 0 deletions roles/monitoring/handlers/main.yml
@@ -21,3 +21,9 @@
    name: iota-prom-exporter.service
    state: restarted
    enabled: yes

- name: restart prom-am-executor
  systemd:
    name: prom-am-executor.service
    state: restarted
    enabled: yes
25 changes: 25 additions & 0 deletions roles/monitoring/tasks/prom-am-executor.yml
@@ -0,0 +1,25 @@
# Executor credits go to https://github.com/imgix/prometheus-am-executor
- name: copy prometheus alertmanager executor binary
  copy:
    src: files/prom-am-executor
    dest: "{{ prom_basedir }}/prom-am-executor"
    mode: 0700

- name: copy iri restart script
  copy:
    src: files/restart_iri.sh
    dest: "{{ prom_basedir }}/restart_iri.sh"
    mode: 0700

- name: copy prometheus alertmanager executor service file
  template:
    src: templates/prom-am-executor.service.j2
    dest: /etc/systemd/system/prom-am-executor.service
  notify:
    - reload systemd

- name: ensure prom-am-executor enabled and started
  systemd:
    name: prom-am-executor.service
    state: started
    enabled: yes
4 changes: 4 additions & 0 deletions roles/monitoring/tasks/role.yml
@@ -25,3 +25,7 @@
- import_tasks: iota-prom-exporter.yml
  tags:
    - iota_prom_exporter

- import_tasks: prom-am-executor.yml
  tags:
    - prom_am_executor
18 changes: 17 additions & 1 deletion roles/monitoring/templates/alertmanager.cfg.yml.j2
@@ -14,10 +14,22 @@ route:
  group_by: [Alertname]
  repeat_interval: 1h
  receiver: email-me

  # Uncomment these lines to enable webhook executor to restart iri
  # when subtangle milestone stuck:

  # routes:
  # - receiver: 'executor'
  #   match:
  #     alertname: MileStoneNoIncrease



# Templates directory
templates:
- {{ alertmanager_basedir }}/template/*.tmpl
receivers:

receivers:
# Send using postfix local mailer
# You can send to a gmail or hotmail address
# but these will most probably be put into junkmail
@@ -31,6 +43,10 @@ receivers:
smarthost: {{ smtp_host }}:{{ smtp_port }}
send_resolved: true

- name: executor
  webhook_configs:
  - url: http://localhost:{{ prom_am_executor_port }}

# For gmail, replace the variables/placeholders with your data
#- name: email-me
# email_configs:
32 changes: 32 additions & 0 deletions roles/monitoring/templates/prom-am-executor.service.j2
@@ -0,0 +1,32 @@
[Unit]
Description=Restart IRI Executor For Prometheus
Wants=network-online.target
After=network.target

[Service]
WorkingDirectory={{ prom_basedir }}
Restart=on-failure
ExecStart={{ prom_basedir }}/prom-am-executor -v -l 127.0.0.1:{{ prom_am_executor_port }} {{ prom_basedir }}/restart_iri.sh
Type=simple

# No need that exporter messes with /dev
PrivateDevices=yes

# Dedicated /tmp
PrivateTmp=yes

# Make /usr, /boot, /etc read only
ProtectSystem=full

# /home is not accessible at all
ProtectHome=yes

# This service has to be able to issue a systemctl restart.
# Attempting to use sudoers special rule didn't work out
# to let an unprivileged user to run a sudo command.
# That's why the use of user root here.
User=root
Group=root

[Install]
WantedBy=multi-user.target
