
Naemon stops executing checks and doesn't respawn Core Worker processes #418

Open
ccztux opened this issue Feb 27, 2023 · 6 comments

@ccztux (Contributor) commented Feb 27, 2023

On a system running Naemon Core 1.3.0 we ran into the issue that Naemon stops executing checks. There were no worker processes left. I have not seen anything suspicious in the system journal or dmesg: no SIGSEGV and no oom_killer in action.

Snippet of the Naemon log (host and service names anonymized):

[1677024519] Warning:  Check of host 'myhost' did not exit properly!
[1677024519] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024520] wproc: Socket to worker Core Worker 4261 broken, removing
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN - check_nwc_health timed out after 50 seconds
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024520] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024520] HOST ALERT: myhost;DOWN;SOFT;3;CRITICAL - 10.0.0.63: rta nan, lost 100%
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] Warning:  Check of host 'myhost' did not exit properly!
[1677024521] HOST ALERT: myhost;DOWN;SOFT;2;(Host check did not exit properly)
[1677024521] Warning:  Check of service 'myservice' on host 'myhost' did not exit properly!
[1677024521] SERVICE ALERT: myhost;myservice;(Service check did not exit properly)
[1677024521] SERVICE INFO: myhost;myservice; Service switch to hard down state due to host down.
[1677024521] SERVICE ALERT: myhost;myservice;UNKNOWN: Execution exceeded timeout threshold of 58s
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4258 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4258 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4260 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4260 broken, removing
[1677024521] wproc: nm_bufferqueue_read() from Core Worker 4259 returned -1: Connection reset by peer
[1677024521] wproc: Socket to worker Core Worker 4259 broken, removing
[1677024526] Warning:  Check of host 'myhost' did not exit properly!
[1677024526] HOST ALERT: myhost;DOWN;SOFT;3;(Host check did not exit properly)
[1677024526] wproc: nm_bufferqueue_read() from Core Worker 4257 returned -1: Connection reset by peer
[1677024526] wproc: Socket to worker Core Worker 4257 broken, removing
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024526] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for host 'myhost' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)
[1677024527] Unable to send check for service 'myservice' to worker (ret=-2)

Independent of the root cause of the broken Core Worker processes, I think Naemon should respawn the Core Worker processes if there are none left or fewer than desired.

This also happens with a manual installation of the current master branch, Naemon Core 1.4.1.g2916d626.20230223.

I found a way to reproduce the issue.

After looking into the source code I expected to hit the following if condition, which does not happen:

if (workers.len <= 0) {
	/* there aren't global workers left, we can't run any more checks
	 * we should try respawning a few of the standard ones
	 */
	nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, we can't do anything!");
}

I will provide a fix for the respawning thing via a pull request.
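
For illustration, a minimal sketch of the direction such a respawn could take inside that branch. spawn_core_worker() and wproc_num_workers_desired are assumed from the surrounding worker code and may not match the final pull request:

/* Sketch only, not the actual fix: try to respawn workers when none are left.
 * spawn_core_worker() and wproc_num_workers_desired are assumed helpers/globals
 * from the worker code; the real names may differ. */
if (workers.len <= 0) {
	unsigned int i;
	nm_log(NSLOG_RUNTIME_ERROR, "wproc: All our workers are dead, trying to respawn them");
	for (i = 0; i < wproc_num_workers_desired; i++) {
		if (spawn_core_worker() < 0)
			nm_log(NSLOG_RUNTIME_ERROR, "wproc: Failed to respawn a core worker");
	}
}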

@sni (Contributor) commented Feb 27, 2023

Besides restarting the workers, it would be pretty interesting to know why the workers fail. Is it reproducible? If so, attaching strace to one of the workers might reveal something.

@ccztux (Contributor, Author) commented Feb 27, 2023

Unfortunately it is not reproducible.

I agree with that. Unfortunately there was no worker process left. I will tell my team that we should attach strace to one or both of the remaining processes if this issue appears again.

It looked like this:

17:01:41 ✓ LAB-CL01 root@cl01 ~/git-repos/naemon-core # systemctl status naemon
● naemon.service - Naemon Monitoring Daemon
   Loaded: loaded (/usr/lib/systemd/system/naemon.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2023-02-27 17:01:41 CET; 2s ago
     Docs: http://naemon.org/documentation
  Process: 4711 ExecStart=/usr/bin/naemon --daemon /etc/naemon/naemon.cfg (code=exited, status=0/SUCCESS)
  Process: 4665 ExecStartPre=/bin/su naemon --login --shell=/bin/sh --command=/usr/bin/naemon --verify-config /etc/naemon/naemon.cfg (code=exited, status=0/SUCCESS)
  Process: 4663 ExecStartPre=/usr/bin/chown -R naemon:naemon /var/run/naemon/ (code=exited, status=0/SUCCESS)
  Process: 4661 ExecStartPre=/usr/bin/mkdir -p /var/run/naemon (code=exited, status=0/SUCCESS)
 Main PID: 4713 (naemon)
   CGroup: /system.slice/naemon.service
           ├─4713 /usr/bin/naemon --daemon /etc/naemon/naemon.cfg
           └─4719 /usr/bin/naemon --daemon /etc/naemon/naemon.cfg

Feb 27 17:01:41 cl01 systemd[1]: Stopped Naemon Monitoring Daemon.
Feb 27 17:01:41 cl01 systemd[1]: Starting Naemon Monitoring Daemon...
Feb 27 17:01:41 cl01 su[4665]: (to naemon) root on none
Feb 27 17:01:41 cl01 systemd[1]: Started Naemon Monitoring Daemon.

@ccztux (Contributor, Author) commented Feb 27, 2023

Just to clarify: the root cause is not reproducible, but if you kill all the worker processes you will see that Naemon does not respawn them, as described here.

@nook24 (Member) commented Feb 27, 2023

One of our users had the same issue a while ago. It happened with Naemon 1.2.3, and this was the check plugin that managed to kill the worker process itself:
it-novum/openITCOCKPIT#1159 (comment)

Unfortunately I had no access to the system for further debugging.

@fermino commented Apr 10, 2024

I have been hit by the same bug. I haven't found out yet how to reproduce it, as it is currently a production system. It would be nice, though, to restart the workers automatically on failure. Currently I'm just checking the logs for the error and restarting the instance when necessary.

@nook24 (Member) commented Apr 10, 2024

Which Naemon version are you using, @fermino?
Naemon 1.4.2 should restart dead core workers: #421
