
Take check timeout into account when rescheduling remote checks #10362

Open
dirk-4 opened this issue Mar 7, 2025 · 9 comments
Labels
area/checks Check execution and results area/distributed Distributed monitoring (master, satellites, clients)

Comments

@dirk-4

dirk-4 commented Mar 7, 2025

Describe the bug

Setting the retry interval for a service to a non-default value is not honored. If the service status is != 0, the check is retried every minute, regardless of the retry interval setting. It is even retried before the current check run has finished (if it runs longer than 1 minute), although this might be intentional to prevent infinite waiting states.

To Reproduce

Please have a look at this community post; it contains the test results.

  1. Create a command that just sleeps for 10 minutes (we need a long execution time).
  2. Create a service executing that command and set check interval=14400, retry interval=3600 and check timeout=14400 (just examples; see the config sketch after this list).
  3. Assign the service to a host and execute it.
  4. Cancel the process on OS level, so you get a result != 0.
  5. Rerun the check and watch the process list.
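
A minimal config sketch of this reproduction in the Icinga 2 DSL, assuming the spawn_test script and test_service name mentioned later in this thread (the host name and plugin path are illustrative):

object CheckCommand "spawn_test" {
  // The script just sleeps for 10 minutes to simulate a long-running check.
  command = [ PluginDir + "/spawn_test" ]
}

object Service "test_service" {
  host_name = "example-host"   // hypothetical host name
  check_command = "spawn_test"
  check_interval = 14400s      // 4 hours
  retry_interval = 3600s       // 1 hour
  check_timeout = 14400s       // 4 hours
}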

You'll see one additional instance of the command started roughly every minute until the first one comes back after 10 minutes and sets the service status to OK (= 0). From this moment on, no new process instances are launched (because the check no longer needs to be rerun); only the result string is updated in the Icinga database whenever one of the running instances finishes.

Expected behavior

Retry runs in case of a check result != 0 should not start 1 minute after entering that state, but only after the retry interval has elapsed (in this example, after 1 hour).

Screenshots

Please have a look at this community post; it contains the test results.

Your Environment


  • Version used (icinga2 --version): r2.14.5-1
  • Operating System and version: Red Hat Enterprise Linux 8.10 (Ootpa) x86_64
  • Enabled features (icinga2 feature list): api checker ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.12.2
  • Config validation (icinga2 daemon -C): no errors
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes. -> Sorry, not allowed to provide hostnames here.

Additional context

@yhabteab
Member

yhabteab commented Mar 7, 2025

Hi, thanks for reporting!

  • assign the service to a host and execute it

How exactly did you do that? As far as I can see from the icinga2 object list ... output you posted in the community forum, active checks are disabled for that service, so Icinga 2 should never run the command on its own.

...
enable_active_checks = false
% = modified in ‘/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf’, lines 941:5-941:32

If you have debug logs enabled, you should be able to find such logs indicating that the checker will not schedule a check for this service.

  • Skipping check for service 'HOST!ServiceName': active service checks are disabled or
  • Checks for checkable 'Host!ServiceName' are disabled. Rescheduling check.

@dirk-4
Author

dirk-4 commented Mar 7, 2025

Hi yhabteab,

you are right, the check was not enabled but executed manually, because it was created for this test only. I originally found the issue with another check, but to make sure the command (script) content was not triggering anything unexpected, I created a new command and service explicitly to verify the behaviour.

It shouldn't make a difference whether I trigger the check manually via the GUI or via the timer.

Regards,
dirk-4

@yhabteab
Member

yhabteab commented Mar 7, 2025

the check was not enabled, but manually executed, because it was created for this test only.

If you set enable_active_checks to false, then there is no way you can execute the check for this service, neither manually nor in Icinga 2 itself, so I would like to know what exactly manually executed means. I know there is a single way to somehow trigger that check command in such a case, namely via the /v1/actions/execute-command API endpoint, but that is a completely different mechanism and I'm not sure whether that is what you're referring to.

Even if enable_active_checks is set to true, Icinga 2 should never start a new process while the former one is still running. So even if the checker does indeed ignore your retry interval, execution will end up here and refuse to start a new check process.

/* don't run another check if there is one pending */
if (m_CheckRunning)
return;

So if these processes are indeed started by Icinga 2 and all belong to the same service, please provide the debug logs of Icinga 2 (you can anonymise all sensitive information, such as host, service name etc.) and the output of e.g. the ps command on Linux (ps -ef | grep -i icinga2 or pstree | grep icinga2).

@dirk-4
Author

dirk-4 commented Mar 10, 2025

If you set enable_active_checks to false, then there is no way you can execute the check for this service, not manually, not in Icinga 2 itself, so I want to know what exactly manually executed means.

You might be right here; I assume I had already deactivated the service after the tests when I ran that command.

Anyway, the check was triggered via the GUI, and that's not what needs to be discussed here.
I started another run and executed the commands you requested, but they don't show much:

# pstree|grep -i icinga2
        |-icinga2-+-icinga2-+-icinga2---3*[sh---sleep]
        |         |         `-27*[{icinga2}]
        |         `-10*[{icinga2}]
# ps -ef|grep icinga2
nagiosh+ 1276022       1  0 09:35 ?        00:00:00 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
nagiosh+ 1276060 1276022  0 09:35 ?        00:00:00 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
nagiosh+ 1276071 1276060  0 09:35 ?        00:00:00 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
newrelic 1289090 1816981  0 09:39 ?        00:00:00 /bin/systemctl --property=MainPID show icinga2
root     1289092 1153021  0 09:39 pts/0    00:00:00 grep --color=auto icinga2

But I can see them if I search for the job name:

# ps -ef|grep spawn_test
nagiosh+ 1276294 1276071  0 09:36 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test
nagiosh+ 1282168 1276071  0 09:37 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test
nagiosh+ 1287278 1276071  0 09:39 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test

And you can see the relation when querying the service status:

# systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2025-03-10 09:35:54 CET; 3min 57s ago
  Process: 1276015 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
Main PID: 1276022 (icinga2)
   Status: "Startup finished."
    Tasks: 46 (limit: 408629)
   Memory: 20.8M
   CGroup: /system.slice/icinga2.service
           ├─1276022 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
           ├─1276060 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
           ├─1276071 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
           ├─1276294 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           ├─1276296 sleep 600
           ├─1282168 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           ├─1282170 sleep 600
           ├─1287278 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           └─1287280 sleep 600

I attached a debug.log from the endpoint node containing 3 consecutive starts. That matches the output above.

icinga_debug_log.txt

@yhabteab
Member

I attached a debug.log from the endpoint node containing 3 consecutive starts. That matches the output above.

icinga_debug_log.txt

The debug logs don't say too much as the commands seem to be triggered from a remote endpoint. In order to see if these processes belong to the very same service, I would need the debug logs from your satellite sending these commands.

[2025-03-10 09:36:09 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from identity '<satellite>'.
[2025-03-10 09:36:09 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/spawn_test': PID 1276294
...
[2025-03-10 09:37:39 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from identity '<satellite>'.
[2025-03-10 09:37:39 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/spawn_test': PID 1282168
...
[2025-03-10 09:39:09 +0100] notice/JsonRpcConnection: Received 'event::ExecuteCommand' message from identity '<satellite>'.
[2025-03-10 09:39:09 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/spawn_test': PID 1287278

@dirk-4
Author

dirk-4 commented Mar 10, 2025

Fresh run, endpoint values for reference:

# ps -ef|grep spawn_test
nagiosh+ 1276294 1276071  0 09:36 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test
nagiosh+ 1282168 1276071  0 09:37 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test
nagiosh+ 1287278 1276071  0 09:39 ?        00:00:00 /bin/sh /usr/lib64/nagios/plugins/spawn_test
# systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor p>
   Active: active (running) since Mon 2025-03-10 09:40:35 CET; 2h 26min ago
  Process: 1292272 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/>
Main PID: 1292280 (icinga2)
   Status: "Startup finished."
    Tasks: 48 (limit: 408629)
   Memory: 23.6M
   CGroup: /system.slice/icinga2.service
           ├─1292280 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon>
           ├─1292318 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon>
           ├─1292329 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon>
           ├─1781780 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           ├─1781782 sleep 600
           ├─1786976 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           ├─1786978 sleep 600
           ├─1791902 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           ├─1791904 sleep 600
           ├─1797438 /bin/sh /usr/lib64/nagios/plugins/spawn_test
           └─1797440 sleep 600

The debug log of the satellite system is attached. The service name is "test_service".

icinga_debug.log.cleaned.satellite.txt

@yhabteab
Member

yhabteab commented Mar 10, 2025

Thanks for the debug logs!

I can only partially confirm the issue as a bug in Icinga 2: with my previous comments I was assuming we were talking about normal, locally executed checks, but that is no longer the case. What I mean by partially is that Icinga 2 should not be scheduling a new check every minute on its own, even if you have a huge check_timeout (14400) configured for that service, and we will fix that. For now, until this gets fixed and released, you can work around it by setting check_command.timeout to the same value as service.check_timeout (see the sketch below). However, this will not stop any API user from re-triggering the check via the /v1/actions/reschedule-check API endpoint or the Check Now button in Icinga Web 2; in that case, you will still end up with several check processes for the same service on the remote endpoint, and I don't think that we are ever going to fix that.
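
A minimal sketch of the suggested workaround in the Icinga 2 DSL, assuming the spawn_test command from this thread (the plugin path is illustrative):

object CheckCommand "spawn_test" {
  command = [ PluginDir + "/spawn_test" ]
  // Workaround: raise the command timeout to the same value as
  // service.check_timeout (14400s) so the remote check is not
  // rescheduled after the default 1m command timeout expires.
  timeout = 14400s
}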

A side note on why Icinga 2 runs a check more or less every minute: this is due to the following code.

/* Re-schedule the check so we don't run it again until after we've received
* a check result from the remote instance. The check will be re-scheduled
* using the proper check interval once we've received a check result.
*/
SetNextCheck(Utility::GetTime() + GetCheckCommand()->GetTimeout() + 30);

check_command.timeout is set to 1m by default, so the scheduler only waits for 1m + 30s before rescheduling the next check. This is wrong because everywhere else checkable.check_timeout overrides check_command.timeout, but not here.
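
A minimal C++ sketch of the direction such a fix could take, assuming Checkable exposes the check_timeout attribute via GetCheckTimeout() as it does for the local check execution path (this is not the actual patch):

/* Prefer the checkable's own check_timeout, if set, over the check
 * command's default timeout when deferring the next remote check. */
double timeout = GetCheckCommand()->GetTimeout();

if (!GetCheckTimeout().IsEmpty())
    timeout = GetCheckTimeout();

SetNextCheck(Utility::GetTime() + timeout + 30);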

@yhabteab yhabteab changed the title retry interval setting of services is ignored Take check timeout into account when rescheduling remote checks Mar 10, 2025
@yhabteab yhabteab added area/distributed Distributed monitoring (master, satellites, clients) area/checks Check execution and results labels Mar 10, 2025
@moreamazingnick

moreamazingnick commented Mar 11, 2025

In the forum post there is also:

max check attempts = 1

Shouldn't this make any retry_interval obsolete?

Best Regards
Nicolas

@yhabteab
Member

max check attempts = 1

Shouldn't this make any retry_interval obsolete?

Hi, as the new issue title implies, this has nothing to do with retry_interval, but yes, in general max_check_attempts = 1 means that every state change is treated as a hard state change and there will be no retries.
