Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga daemon leaves zombie processes on very busy system #10355

Open
oldelvet opened this issue Feb 28, 2025 · 4 comments · May be fixed by #10375
Open

Icinga daemon leaves zombie processes on very busy system #10355

oldelvet opened this issue Feb 28, 2025 · 4 comments · May be fixed by #10375
Assignees
Labels
area/checks Check execution and results

Comments

@oldelvet
Copy link

Describe the bug

On a very heavily loaded system I frequently get zombie/defunct processes left from monitoring checks that exceeded the timeout period. These zombies stay around until the icinga2 daemon is restarted.

richm@gala:~$ ps -ef | grep 107103
nagios 107103 107094 0 Feb17 ? 00:01:11 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e
nagios 143808 107103 0 Feb19 ? 00:00:00 [check_systemd_s]
nagios 143809 107103 0 Feb19 ? 00:00:00 [check_http]
nagios 192896 107103 0 Feb21 ? 00:00:00 [check_systemd_s]
nagios 196844 107103 0 Feb21 ? 00:00:00 [check_http]

Taking 196844 as an example the following is logged by the icinga2 daemon (full check command line parameters truncated because they are not relevant to the issue).

[2025-02-21 05:14:50 +0000] warning/Process: Terminating process 196844 ('/usr/lib/nagios/plugins/check_http' ...) after timeout of 60 seconds
[2025-02-21 05:14:50 +0000] warning/Process: Killing process group 196844 ('/usr/lib/nagios/plugins/check_http' ...) after timeout of 66 seconds
[2025-02-21 05:14:51 +0000] warning/Process: Couldn't kill the process group 196844 ('/usr/lib/nagios/plugins/check_http' ...): [errno 3] No such process
[2025-02-21 05:14:51 +0000] warning/PluginCheckTask: Check command for object 'gala.home.oldelvet.org.uk!http_v6toot api' (PID: 196844, arguments: '/usr/lib/nagios/plugins/check_http' ...) terminated with exit code 128, output:

To Reproduce

I cannot give a step by step guide to reproducing other than having a system that occasionally gets itself into a extremely high load situation.

However having studied the logs and the code I believe I can see what is happening and will describe it below.

Expected behavior

Terminated processes are reaped and not left in a zombie state.

Your Environment

Include as many relevant details about the environment you experienced the problem in
*

  • Version used (icinga2 --version):

Observed on multiple versions of icinga2

version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1

  • Operating System and version:

linux amd64. Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers.

Additional context

Inspecting the code in lib/base/process.cpp DoEvents

The problem seems to be that icinga2 is skipping ProcessWaitPID if the process group kill fails. The logs above shows that the send of SIGTERM succeeds but presumably because the system is under heavy load the 6 second deadline for the termination to take effect has passed.

At this stage the code tries to send a SIGKILL to the process group but during this race condition the process actually dies and the process group kill fails with errno 3 No such process.

if (deadline < now) {
Log(LogWarning, "Process")
<< "Killing process group " << m_PID << " (" << PrettyPrintArguments(m_Arguments)
<< ") after timeout of " << timeout << " seconds";
#ifdef _WIN32
m_OutputStream << "";
TerminateProcess(m_Process, 3);
#else /* _WIN32 /
int error = ProcessKill(-m_Process, SIGKILL);
if (error) {
Log(LogWarning, "Process")
<< "Couldn't kill the process group " << m_PID << " (" << PrettyPrintArguments(m_Arguments)
<< "): [errno " << error << "] " << strerror(error);
could_not_kill = true;
}
#endif /
_WIN32 */
is_timeout = true;
}

At this stage the could_not_kill flag is set to true and that causes the ProcessWaitPID to be skipped.

    int status, exitcode;
    if (could_not_kill || m_PID == -1) {
            exitcode = 128;
    } else if (ProcessWaitPID(m_Process, &status) != m_Process) {
            exitcode = 128;

            Log(LogWarning, "Process")
                    << "PID " << m_PID << " (" << PrettyPrintArguments(m_Arguments) << ") died mysteriously: waitpid failed";
    } else if (WIFEXITED(status)) {

Looking at the history of the could_not_kill flag it seems that it was introduced to ensure that an abnormal exit code was reported. Perhaps the correct behaviour should be to always perform the ProcessWaitPID (except when m_PID is -1). Then the only situation where exitcode is not 128 is if the process terminated with an exit code and in that case the existing check for m_SentSigterm could be extended as follows

            if (m_SentSigterm) {
                    exitcode = 128;
                    msg << " after sending SIGTERM";
            } else if (could_not_kill) {
                    exitcode = 128;
                    msg << " after failed SIGKILL";
            }
@yhabteab
Copy link
Member

yhabteab commented Mar 5, 2025

Hi @oldelvet, thank you very much for reporting and analysis!

Perhaps the correct behaviour should be to always perform the ProcessWaitPID (except when m_PID is -1).

This would introduce another serious bug, e.g. what if the process could not be truly killed? In that case, waitpid(2) would block forever if the process do not terminate itself at some point. However, we could alternatively check for specific error codes returned by kill(2) and only set could_not_kill to true if, for example, error != ESRCH, but we should be very careful not to break any other existing use cases.

@yhabteab yhabteab added the area/checks Check execution and results label Mar 5, 2025
@oldelvet
Copy link
Author

oldelvet commented Mar 5, 2025

Hi @yhabteab yes I agree a very specific test for this failure mode is likely needed.

Just for information I did test a local build with the could_not_kill adjusted to always do a waitpid(2) (patch attached). I observed four examples of the race condition last night and in each case the leftover defunct process was properly cleaned up.

[2025-03-05 00:30:29 +0000] warning/Process: Terminating process 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') after timeout of 60 seconds
[2025-03-05 00:30:29 +0000] warning/Process: Killing process group 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') after timeout of 66 seconds
[2025-03-05 00:30:30 +0000] warning/Process: Couldn't kill the process group 493132 ('/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70'): [errno 3] No such process
[2025-03-05 00:30:31 +0000] warning/PluginCheckTask: Check command for object 'gala.home.oldelvet.org.uk!procsOverall' (PID: 493132, arguments: '/usr/lib/nagios/plugins/check_procs' '-c' '100' '-w' '70') terminated with exit code 128, output:

When I get time I will rebuild with an alternate patch that does not set could_not_kill when error == ESRCH and confirm that fixes the problem too.

couldnotkill.patch

@oldelvet
Copy link
Author

oldelvet commented Mar 12, 2025

The patch below has been testing for a week. I have gathered a few examples of timeouts including process group kills with both normal kill behaviour

root@gala:/var/log/icinga2# grep 118743 icinga2.log
[2025-03-12 00:28:19 +0000] warning/Process: Terminating process 118743 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 60 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Killing process group 118743 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 66 seconds
[2025-03-12 00:28:19 +0000] warning/PluginCheckTask: Check command for object 'gala!sysd_service...snip...' (PID: 118743, arguments: '/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') terminated with exit code 128, output:

and ESRCH error codes.

root@gala:/var/log/icinga2# grep 118742 icinga2.log
[2025-03-12 00:28:19 +0000] warning/Process: Terminating process 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 60 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Killing process group 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') after timeout of 66 seconds
[2025-03-12 00:28:19 +0000] warning/Process: Couldn't kill the process group 118742 ('/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...'): [errno 3] No such process
[2025-03-12 00:28:19 +0000] warning/PluginCheckTask: Check command for object 'gala!sysd_...snip...' (PID: 118742, arguments: '/usr/lib/nagios/plugins/check_systemd_service.sh' '...snip...') terminated with exit code 128, output:

Otherwise monitoring has run normally.

ignoreesrch.patch

@yhabteab
Copy link
Member

Hi thanks for patching and testing!
You can now create a pull request with your patch if you don't mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Check execution and results
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants