Icinga daemon leaves zombie processes on very busy system #10355

oldelvet · 2025-02-28T11:11:56Z

Describe the bug

On a very heavily loaded system I frequently get zombie/defunct processes left from monitoring checks that exceeded the timeout period. These zombies stay around until the icinga2 daemon is restarted.

richm@gala:~$ ps -ef | grep 107103
nagios 107103 107094 0 Feb17 ? 00:01:11 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e
nagios 143808 107103 0 Feb19 ? 00:00:00 [check_systemd_s]
nagios 143809 107103 0 Feb19 ? 00:00:00 [check_http]
nagios 192896 107103 0 Feb21 ? 00:00:00 [check_systemd_s]
nagios 196844 107103 0 Feb21 ? 00:00:00 [check_http]

Taking 196844 as an example the following is logged by the icinga2 daemon (full check command line parameters truncated because they are not relevant to the issue).

[2025-02-21 05:14:50 +0000] warning/Process: Terminating process 196844 ('/usr/lib/nagios/plugins/check_http' ...) after timeout of 60 seconds
[2025-02-21 05:14:50 +0000] warning/Process: Killing process group 196844 ('/usr/lib/nagios/plugins/check_http' ...) after timeout of 66 seconds
[2025-02-21 05:14:51 +0000] warning/Process: Couldn't kill the process group 196844 ('/usr/lib/nagios/plugins/check_http' ...): [errno 3] No such process
[2025-02-21 05:14:51 +0000] warning/PluginCheckTask: Check command for object 'gala.home.oldelvet.org.uk!http_v6toot api' (PID: 196844, arguments: '/usr/lib/nagios/plugins/check_http' ...) terminated with exit code 128, output:

To Reproduce

I cannot give a step by step guide to reproducing other than having a system that occasionally gets itself into a extremely high load situation.

However having studied the logs and the code I believe I can see what is happening and will describe it below.

Expected behavior

Terminated processes are reaped and not left in a zombie state.

Your Environment

Include as many relevant details about the environment you experienced the problem in
*

Version used (icinga2 --version):

Observed on multiple versions of icinga2

version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1

Operating System and version:

linux amd64. Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers.

Additional context

Inspecting the code in lib/base/process.cpp DoEvents

The problem seems to be that icinga2 is skipping ProcessWaitPID if the process group kill fails. The logs above shows that the send of SIGTERM succeeds but presumably because the system is under heavy load the 6 second deadline for the termination to take effect has passed.

At this stage the code tries to send a SIGKILL to the process group but during this race condition the process actually dies and the process group kill fails with errno 3 No such process.

if (deadline < now) {
Log(LogWarning, "Process")
<< "Killing process group " << m_PID << " (" << PrettyPrintArguments(m_Arguments)
<< ") after timeout of " << timeout << " seconds";
#ifdef _WIN32
m_OutputStream << "";
TerminateProcess(m_Process, 3);
#else /* _WIN32 /
int error = ProcessKill(-m_Process, SIGKILL);
if (error) {
Log(LogWarning, "Process")
<< "Couldn't kill the process group " << m_PID << " (" << PrettyPrintArguments(m_Arguments)
<< "): [errno " << error << "] " << strerror(error);
could_not_kill = true;
}
#endif / _WIN32 */
is_timeout = true;
}

At this stage the could_not_kill flag is set to true and that causes the ProcessWaitPID to be skipped.

    int status, exitcode;
    if (could_not_kill || m_PID == -1) {
            exitcode = 128;
    } else if (ProcessWaitPID(m_Process, &status) != m_Process) {
            exitcode = 128;

            Log(LogWarning, "Process")
                    << "PID " << m_PID << " (" << PrettyPrintArguments(m_Arguments) << ") died mysteriously: waitpid failed";
    } else if (WIFEXITED(status)) {

Looking at the history of the could_not_kill flag it seems that it was introduced to ensure that an abnormal exit code was reported. Perhaps the correct behaviour should be to always perform the ProcessWaitPID (except when m_PID is -1). Then the only situation where exitcode is not 128 is if the process terminated with an exit code and in that case the existing check for m_SentSigterm could be extended as follows

            if (m_SentSigterm) {
                    exitcode = 128;
                    msg << " after sending SIGTERM";
            } else if (could_not_kill) {
                    exitcode = 128;
                    msg << " after failed SIGKILL";
            }

The text was updated successfully, but these errors were encountered:

yhabteab · 2025-03-05T07:45:01Z

Hi @oldelvet, thank you very much for reporting and analysis!

Perhaps the correct behaviour should be to always perform the ProcessWaitPID (except when m_PID is -1).

This would introduce another serious bug, e.g. what if the process could not be truly killed? In that case, waitpid(2) would block forever if the process do not terminate itself at some point. However, we could alternatively check for specific error codes returned by kill(2) and only set could_not_kill to true if, for example, error != ESRCH, but we should be very careful not to break any other existing use cases.

oldelvet · 2025-03-05T09:58:05Z

Hi @yhabteab yes I agree a very specific test for this failure mode is likely needed.

Just for information I did test a local build with the could_not_kill adjusted to always do a waitpid(2) (patch attached). I observed four examples of the race condition last night and in each case the leftover defunct process was properly cleaned up.

[2025-03-05 00:30:29 +0000] warning/Process: Terminating process 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') after timeout of 60 seconds
[2025-03-05 00:30:29 +0000] warning/Process: Killing process group 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') after timeout of 66 seconds
[2025-03-05 00:30:30 +0000] warning/Process: Couldn't kill the process group 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70'): [errno 3] No such process
[2025-03-05 00:30:31 +0000] warning/PluginCheckTask: Check command for object 'gala.home.oldelvet.org.uk!procsOverall' (PID: 493132, arguments: '/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') terminated with exit code 128, output:

When I get time I will rebuild with an alternate patch that does not set could_not_kill when error == ESRCH and confirm that fixes the problem too.

couldnotkill.patch

oldelvet · 2025-03-12T14:41:31Z

The patch below has been testing for a week. I have gathered a few examples of timeouts including process group kills with both normal kill behaviour

root@gala:/var/log/icinga2# grep 118743 icinga2.log
[2025-03-12 00:28:19 +0000] warning/Process: Terminating process 118743 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 60 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Killing process group 118743 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 66 seconds
[2025-03-12 00:28:19 +0000] warning/PluginCheckTask: Check command for object 'gala!sysd_service...snip...' (PID: 118743, arguments: '/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') terminated with exit code 128, output:

and ESRCH error codes.

root@gala:/var/log/icinga2# grep 118742 icinga2.log
[2025-03-12 00:28:19 +0000] warning/Process: Terminating process 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 60 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Killing process group 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 66 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Couldn't kill the process group 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...'): [errno 3] No such process
[2025-03-12 00:28:19 +0000] warning/PluginCheckTask: Check command for object 'gala!sysd_...snip...' (PID: 118742, arguments: '/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') terminated with exit code 128, output:

Otherwise monitoring has run normally.

ignoreesrch.patch

yhabteab · 2025-03-12T15:42:30Z

Hi thanks for patching and testing!
You can now create a pull request with your patch if you don't mind.

yhabteab added the area/checks Check execution and results label Mar 5, 2025

yhabteab assigned oldelvet Mar 5, 2025

oldelvet mentioned this issue Mar 13, 2025

ApiListener fails to reconnect on heavily loaded system #10376

Open

Al2Klimov linked a pull request Mar 13, 2025 that will close this issue

Icinga daemon leaves zombie processes on very busy system #10355 #10375

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Icinga daemon leaves zombie processes on very busy system #10355

Icinga daemon leaves zombie processes on very busy system #10355

oldelvet commented Feb 28, 2025

yhabteab commented Mar 5, 2025

oldelvet commented Mar 5, 2025

oldelvet commented Mar 12, 2025 •

edited

Loading

yhabteab commented Mar 12, 2025

Icinga daemon leaves zombie processes on very busy system #10355

Icinga daemon leaves zombie processes on very busy system #10355

Comments

oldelvet commented Feb 28, 2025

Describe the bug

To Reproduce

Expected behavior

Your Environment

Additional context

yhabteab commented Mar 5, 2025

oldelvet commented Mar 5, 2025

oldelvet commented Mar 12, 2025 • edited Loading

yhabteab commented Mar 12, 2025

oldelvet commented Mar 12, 2025 •

edited

Loading