Icinga daemon leaves zombie processes on very busy system #10355
Comments
Hi @oldelvet, thank you very much for reporting and analysis!
This would introduce another serious bug, e.g. what if the process could not be truly killed? In that case, |
Hi @yhabteab yes I agree a very specific test for this failure mode is likely needed. Just for information I did test a local build with the
When I get time I will rebuild with an alternate patch that does not set |
The patch below has been in testing for a week. I have gathered a few examples of timeouts, including process group kills with both normal kill behaviour and ESRCH error codes.
Otherwise monitoring has run normally. |
Hi thanks for patching and testing! |
Describe the bug
On a very heavily loaded system I frequently get zombie/defunct processes left from monitoring checks that exceeded the timeout period. These zombies stay around until the icinga2 daemon is restarted.
Taking 196844 as an example, the following is logged by the icinga2 daemon (full check command line parameters truncated because they are not relevant to the issue).
To Reproduce
I cannot give a step-by-step guide to reproducing this, other than having a system that occasionally gets itself into an extremely high load situation.
However, having studied the logs and the code, I believe I can see what is happening and will describe it below.
Expected behavior
Terminated processes are reaped and not left in a zombie state.
Your Environment
Include as many relevant details about the environment you experienced the problem in
* Version used (icinga2 --version): Observed on multiple versions of icinga2
version: r2.14.5-1
version: r2.13.6-1
version: r2.12.3-1
linux amd64. Ubuntu 24.04 kernel with various Debian/Ubuntu versions running in LXC containers.
Additional context
Inspecting DoEvents in lib/base/process.cpp, the problem seems to be that icinga2 skips ProcessWaitPID if the process group kill fails. The logs above show that sending SIGTERM succeeds, but presumably because the system is under heavy load, the 6 second deadline for the termination to take effect passes.
At this stage the code tries to send a SIGKILL to the process group, but during this race the process actually dies and the process group kill fails with errno 3 (No such process).
The could_not_kill flag is then set to true, and that causes ProcessWaitPID to be skipped.
Looking at the history of the could_not_kill flag, it seems that it was introduced to ensure that an abnormal exit code was reported. Perhaps the correct behaviour should be to always perform ProcessWaitPID (except when m_PID is -1). Then the only situation where exitcode is not 128 is if the process terminated with an exit code, and in that case the existing check for m_SentSigterm could be extended.
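The following is not the patch discussed in this issue and is not icinga2 code. It is a minimal standalone POSIX sketch of the "always reap" idea described above: the child is waited for whether the process group kill succeeds or fails, for example with ESRCH. The names ReapResult, ProcessExists and KillGroupThenAlwaysReap are hypothetical, and the 100 ms sleep only imitates a child that dies before the SIGKILL is sent.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical types and names for this sketch; none of them exist in icinga2.
struct ReapResult {
    pid_t childPid = -1;   // PID returned by fork()
    int killErrno = 0;     // 0 if the group kill succeeded, else errno (for example ESRCH)
    pid_t reapedPid = -1;  // PID returned by waitpid()
    int status = 0;        // raw wait status filled in by waitpid()
};

// Returns true while the PID still refers to a process (running or zombie).
bool ProcessExists(pid_t pid)
{
    return kill(pid, 0) == 0 || errno != ESRCH;
}

ReapResult KillGroupThenAlwaysReap()
{
    ReapResult result;

    pid_t pid = fork();
    if (pid == -1)
        return result; // nothing was started, so there is nothing to reap

    if (pid == 0) {
        setpgid(0, 0); // own process group, so the parent can signal -pid
        _exit(3);      // exit right away so the child is normally gone before SIGKILL
    }

    result.childPid = pid;
    setpgid(pid, pid);  // also set from the parent to avoid a startup race; errors ignored
    usleep(100 * 1000); // give the child time to exit, imitating the race

    // The group kill may succeed or fail; record the outcome but do not act on it.
    if (kill(-pid, SIGKILL) != 0)
        result.killErrno = errno;

    // Reap regardless of the kill outcome; skipping this step leaves a zombie.
    do {
        result.reapedPid = waitpid(pid, &result.status, 0);
    } while (result.reapedPid == -1 && errno == EINTR);

    return result;
}
```

After calling KillGroupThenAlwaysReap(), ProcessExists(result.childPid) should report that the PID is gone whichever way the kill went, because the waitpid call is never made conditional on killErrno.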