Naemon 1.4.1 livestatus not stable #428

Open
bjornfro opened this issue Apr 29, 2023 · 20 comments

bjornfro commented Apr 29, 2023

Hi,

I'm having issues with livestatus causing naemon to core dump. I seem to hit one of two problems: either livestatus loads with version 1.3.0 and naemon core dumps shortly after start, or livestatus does not load at all.

I updated with:

yum update naemon naemon-core libnaemon naemon-livestatus naemon-core-debugsource naemon-core-dbg naemon-thruk naemon-vimvault-debugsource naemon-livestatus-debugsource naemon-devel naemon-vimvault mod_gearman

It could be that when it crashed I had not updated mod-gearman. When I include mod-gearman, it seems the livestatus broker module is not loaded at all.

Is this livestatus configuration in naemon.cfg OK?

broker_module=/usr/lib64/naemon/naemon-livestatus/livestatus.so inet_addr=0.0.0.0:6557 debug=1 max_backlog=128 max_response_size=2000000000

  1. With naemon core dump
    Apr 29 12:55:00 server naemon[53434]: Naemon 1.4.1 starting... (PID=53434)
    Apr 29 12:55:00 server naemon[53434]: Local time is Sat Apr 29 12:55:00 UTC 2023
    Apr 29 12:55:00 server naemon[53434]: LOG VERSION: 2.0
    Apr 29 12:55:00 server naemon[53434]: qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
    Apr 29 12:55:00 server naemon[53434]: nerd: Channel hostchecks registered successfully
    Apr 29 12:55:00 server naemon[53434]: nerd: Channel servicechecks registered successfully
    Apr 29 12:55:00 server naemon[53434]: nerd: Fully initialized and ready to rock!
    Apr 29 12:55:01 server naemon[53434]: mod_gearman: initialized version 3.3.3 (libgearman 0.33)
    Apr 29 12:55:01 server naemon[53434]: Event broker module '/usr/lib64/mod_gearman/mod_gearman_naemon.o' initialized successfully.
    Apr 29 12:55:01 server naemon[53434]: livestatus: Setting debug level to 1
    Apr 29 12:55:01 server naemon[53434]: livestatus: Setting listen backlog to 128
    Apr 29 12:55:01 server naemon[53434]: livestatus: Setting maximum response size to 2000000000 bytes (1907.3 MB)
    Apr 29 12:55:01 server naemon[53434]: livestatus: Naemon Livestatus 1.3.0, TCP: '0.0.0.0:6557'
    Apr 29 12:55:01 server naemon[53434]: livestatus: Setup socket to listen on all interfaces
    Apr 29 12:55:01 server naemon[53434]: livestatus: Opened TCP socket 0.0.0.0:6557, backlog 128
    Apr 29 12:55:01 server naemon[53434]: livestatus: Your event_broker_options are sufficient for livestatus.
    Apr 29 12:55:01 server naemon[53434]: livestatus: Finished initialization. Further log messages go to /var/log/naemon/livestatus.log
    Apr 29 12:55:01 server naemon[53434]: Event broker module '/usr/lib64/naemon/naemon-livestatus/livestatus.so' initialized successfully.
    Apr 29 12:55:02 server naemon[53434]: livestatus: Cannot delete non-existing downtime/comment 103637
    Apr 29 12:55:03 server naemon[53434]: Successfully launched command file worker with pid 53504
    Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: 24x7;-1;1
    Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: 24x7_sans_holidays;-1;1
    Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: none;-1;0
    Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: us-holidays;-1;0
    Apr 29 12:55:03 server naemon[53434]: TIMEPERIOD TRANSITION: workhours;-1;0
    Apr 29 12:55:03 server kernel: naemon[53519]: segfault at 8 ip 00007ff936ea4e70 sp 00007ff93662e768 error 4 in livestatus.so[7ff936e31000+94000]
    Apr 29 12:55:03 server naemon[53504]: Command file worker: Failed to read from bufferqueue (Inappropriate ioctl for device)
    Apr 29 12:55:03 server systemd-coredump[53525]: Resource limits disable core dumping for process 53434 (naemon).
    Apr 29 12:55:03 server systemd-coredump[53525]: Process 53434 (naemon) of user 989 dumped core.
    Apr 29 12:55:03 server systemd[1]: naemon.service: Main process exited, code=killed, status=11/SEGV
    Apr 29 12:55:05 server systemd[1]: naemon.service: Failed with result 'signal'.

  2. With livestatus not loaded and not working

Apr 29 13:14:58 server naemon[43859]: Naemon 1.4.1 starting... (PID=43859)
Apr 29 13:14:58 server naemon[43859]: Local time is Sat Apr 29 13:14:58 UTC 2023
Apr 29 13:14:58 server naemon[43859]: LOG VERSION: 2.0
Apr 29 13:14:58 server naemon[43859]: qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
Apr 29 13:14:58 server naemon[43859]: nerd: Channel hostchecks registered successfully
Apr 29 13:14:58 server naemon[43859]: nerd: Channel servicechecks registered successfully
Apr 29 13:14:58 server naemon[43859]: nerd: Fully initialized and ready to rock!
Apr 29 13:14:58 server naemon[43859]: mod_gearman: initialized version 5.0.2 (libgearman 1.1.19.1)
Apr 29 13:14:58 server naemon[43859]: Event broker module '/usr/lib64/mod_gearman/mod_gearman_naemon.o' initialized successfully.

bjornfro commented:

It works now. I did not notice that the rpm upgrade had commented out this line in naemon.cfg:

broker_module=/usr/lib64/naemon/naemon-livestatus/livestatus.so inet_addr=0.0.0.0:6557 debug=1 max_backlog=128 max_response_size=2000000000

I uncommented it and it worked again.


sni commented Apr 30, 2023

Glad to see it's working now. Nevertheless, Naemon should not segfault :-) If an incompatible module tries to load, it should print a message and unload the module again.

bjornfro changed the title from "Naemon 1.4.1 core dumps after upgrade from 1.3.0 - livestatus" to "Naemon 1.4.1 livestatus not stable" on Apr 30, 2023
bjornfro reopened this on Apr 30, 2023

bjornfro commented Apr 30, 2023

I had to revert to 1.3.0 since livestatus was not stable; it sometimes did not respond.

Verified with this command, which sometimes did not get a response:

echo -e "GET hosts\nColumns: name\nLimit: 100\n" | netcat -w5 127.0.0.1 6557

Stable with 1.3.0 but not with 1.4.1

There is one error message when starting, not sure if relevant:

naemon[2080]: Successfully launched command file worker with pid 2159
naemon[2159]: Command file worker: Failed to read from bufferqueue (Inappropriate ioctl for device)


sni commented Apr 30, 2023

Alright, so it is not working with the latest release? Which repository did you use, Consol Labs or the SUSE one? And which OS is that on?

bjornfro commented:

This is on Rocky Linux 8.5, using repo:

[labs_consol_stable]
name=labs_consol_stable
baseurl=http://labs.consol.de/repo/stable/rhel8/$basearch
type=yum
enabled=1
gpgcheck=0
gpgkey=http://labs.consol.de/repo/stable/RPM-GPG-KEY


bjornfro commented May 2, 2023

Any hints on troubleshooting that can be done to investigate why Naemon 1.4.1 does not respond to livestatus requests at times?


sni commented May 2, 2023

Well, I tend to reach for strace pretty quickly. But enabling the debug log might already reveal something.
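
As an illustration of that approach, here is a minimal sketch (my suggestion, not something run in this thread; it assumes a systemd-managed naemon and that strace is installed, and the debug_* option names should be double-checked against your naemon.cfg):

# Attach strace to the running daemon and capture network-related syscalls
strace -f -tt -e trace=network -o /tmp/naemon-livestatus.strace \
    -p "$(systemctl show -p MainPID --value naemon)"

# Alternatively, raise the core debug log in naemon.cfg and reload:
#   debug_level=-1                            # -1 = log everything
#   debug_verbosity=2
#   debug_file=/var/log/naemon/naemon.debug
# Together with debug=1 on the livestatus broker_module line, query logging
# then goes to /var/log/naemon/livestatus.log (as seen in the startup log above).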


pvdputte commented Jun 2, 2023

I'm also experiencing issues where livestatus won't respond, but I'm upgrading from 1.0.x to 1.4.1. Currently testing with the production dataset in a staging environment.

I don't intend to hijack this issue; if you feel this is a different problem, I'll open a separate one.

Environment:

Debian 11 'bullseye'
Kernel 5.10.0-22-amd64 #1 SMP Debian 5.10.178-3 (2023-04-22)
APT-Sources: http://labs.consol.de/repo/stable/debian bullseye/main amd64 Packages

# dpkg -l | grep -e naemon -e gearm
ii  gearman-job-server                   1.1.19.1+ds-2+b2               amd64        Job server for the Gearman distributed job queue
ii  gearman-tools                        1.1.19.1+ds-2+b2               amd64        Tools for the Gearman distributed job queue
ii  libgearman8:amd64                    1.1.19.1+ds-2+b2               amd64        Library providing Gearman client and worker functions
ii  libnaemon:amd64                      1.4.1                          amd64        Library for Naemon - common data files
ii  mod-gearman-module                   5.1.0                          amd64        Event broker module to distribute service checks.
ii  mod-gearman-tools                    5.1.0                          amd64        Tools for mod-gearman
ii  naemon-core                          1.4.1                          amd64        host/service/network monitoring and management system
ii  naemon-livestatus                    1.4.1                          amd64        contains the Naemon livestatus eventbroker module

# cat /etc/naemon/module-conf.d/livestatus.cfg 
# Naemon config
broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live inet_addr=0.0.0.0:6557 debug=1
event_broker_options=-1

I've only been testing for a few days, trying to use the native TCP support in livestatus instead of my previous xinetd solution (running Thruk and custom dashboards from other locations).

If I leave everything running for a night, in the morning I'm greeted with 'No backend available' in Thruk. Not sure yet how long it takes for the problems to start.

I tried to isolate the problem by stopping all external livestatus clients and doing checks with netcat on the naemon host itself. It seems that the connection is stuck in SYN_SENT very often. Here you see a session where I try to run GET status in a while loop with a 1-second timeout: each line has timestamp + response size.

# while true; do echo "$(date --iso-8601=seconds) $(echo "GET status
" | netcat -w 1 <this host's public ip> 6557 | wc -c)"; sleep 1; done
2023-06-02T09:47:49+02:00 1015
2023-06-02T09:47:50+02:00 0
2023-06-02T09:47:52+02:00 0
2023-06-02T09:47:54+02:00 0
2023-06-02T09:47:56+02:00 0
2023-06-02T09:47:58+02:00 1015
2023-06-02T09:47:59+02:00 1015
2023-06-02T09:48:00+02:00 0
2023-06-02T09:48:02+02:00 1015
2023-06-02T09:48:03+02:00 0
2023-06-02T09:48:05+02:00 1015
2023-06-02T09:48:06+02:00 1015
2023-06-02T09:48:07+02:00 1015
2023-06-02T09:48:08+02:00 1015
2023-06-02T09:48:09+02:00 1015
2023-06-02T09:48:10+02:00 0
2023-06-02T09:48:12+02:00 0
2023-06-02T09:48:14+02:00 0
2023-06-02T09:48:16+02:00 1015
2023-06-02T09:48:17+02:00 0
2023-06-02T09:48:19+02:00 1015
2023-06-02T09:48:20+02:00 1015
2023-06-02T09:48:21+02:00 1015
2023-06-02T09:48:22+02:00 0
2023-06-02T09:48:24+02:00 1015
2023-06-02T09:48:25+02:00 1015
2023-06-02T09:48:26+02:00 0
2023-06-02T09:48:28+02:00 0
2023-06-02T09:48:30+02:00 1015
2023-06-02T09:48:31+02:00 1015
2023-06-02T09:48:32+02:00 0
2023-06-02T09:48:34+02:00 0
2023-06-02T09:48:36+02:00 0
2023-06-02T09:48:38+02:00 1015
2023-06-02T09:48:39+02:00 0
2023-06-02T09:48:41+02:00 0
2023-06-02T09:48:43+02:00 1015
2023-06-02T09:48:44+02:00 1015
2023-06-02T09:48:45+02:00 0
2023-06-02T09:48:47+02:00 1015
2023-06-02T09:48:48+02:00 0
2023-06-02T09:48:50+02:00 1015
2023-06-02T09:48:51+02:00 0
2023-06-02T09:48:53+02:00 0
2023-06-02T09:48:55+02:00 1015

The matching livestatus debug log only has the successful attempts:

2023-06-02 09:47:49 Query: GET status
2023-06-02 09:47:49 Time to process request: 32 us. Size of answer: 1015 bytes
2023-06-02 09:47:58 Query: GET status
2023-06-02 09:47:58 Time to process request: 27 us. Size of answer: 1015 bytes
2023-06-02 09:47:59 Query: GET status
2023-06-02 09:47:59 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:02 Query: GET status
2023-06-02 09:48:02 Time to process request: 30 us. Size of answer: 1015 bytes
2023-06-02 09:48:05 Query: GET status
2023-06-02 09:48:05 Time to process request: 30 us. Size of answer: 1015 bytes
2023-06-02 09:48:06 Query: GET status
2023-06-02 09:48:06 Time to process request: 29 us. Size of answer: 1015 bytes
2023-06-02 09:48:07 Query: GET status
2023-06-02 09:48:07 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:08 Query: GET status
2023-06-02 09:48:08 Time to process request: 30 us. Size of answer: 1015 bytes
2023-06-02 09:48:09 Query: GET status
2023-06-02 09:48:09 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:16 Query: GET status
2023-06-02 09:48:16 Time to process request: 29 us. Size of answer: 1015 bytes
2023-06-02 09:48:19 Query: GET status
2023-06-02 09:48:19 Time to process request: 30 us. Size of answer: 1015 bytes
2023-06-02 09:48:20 Query: GET status
2023-06-02 09:48:20 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:21 Query: GET status
2023-06-02 09:48:21 Time to process request: 93 us. Size of answer: 1015 bytes
2023-06-02 09:48:24 Query: GET status
2023-06-02 09:48:24 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:25 Query: GET status
2023-06-02 09:48:25 Time to process request: 25 us. Size of answer: 1015 bytes
2023-06-02 09:48:30 Query: GET status
2023-06-02 09:48:30 Time to process request: 32 us. Size of answer: 1015 bytes
2023-06-02 09:48:31 Query: GET status
2023-06-02 09:48:31 Time to process request: 27 us. Size of answer: 1015 bytes
2023-06-02 09:48:38 Query: GET status
2023-06-02 09:48:38 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:43 Query: GET status
2023-06-02 09:48:43 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:44 Query: GET status
2023-06-02 09:48:44 Time to process request: 39 us. Size of answer: 1015 bytes
2023-06-02 09:48:47 Query: GET status
2023-06-02 09:48:47 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:50 Query: GET status
2023-06-02 09:48:50 Time to process request: 28 us. Size of answer: 1015 bytes
2023-06-02 09:48:55 Query: GET status
2023-06-02 09:48:55 Time to process request: 29 us. Size of answer: 1015 bytes

Matching netstat grep for netcat connections:

2023-06-02T09:47:49+02:00 
2023-06-02T09:47:50+02:00 tcp        0      1 <ip redacted>:60792     <ip redacted>:6557      SYN_SENT    138685/netcat
2023-06-02T09:47:51+02:00 
2023-06-02T09:47:52+02:00 tcp        0      1 <ip redacted>:60806     <ip redacted>:6557      SYN_SENT    138703/netcat
2023-06-02T09:47:53+02:00 
2023-06-02T09:47:54+02:00 tcp        0      1 <ip redacted>:60818     <ip redacted>:6557      SYN_SENT    138722/netcat
2023-06-02T09:47:55+02:00 
2023-06-02T09:47:56+02:00 tcp        0      1 <ip redacted>:60822     <ip redacted>:6557      SYN_SENT    138740/netcat
2023-06-02T09:47:57+02:00 
2023-06-02T09:47:58+02:00 
2023-06-02T09:47:59+02:00 
2023-06-02T09:48:00+02:00 tcp        0      1 <ip redacted>:53660     <ip redacted>:6557      SYN_SENT    138785/netcat
2023-06-02T09:48:01+02:00 
2023-06-02T09:48:02+02:00 
2023-06-02T09:48:03+02:00 tcp        0      1 <ip redacted>:53676     <ip redacted>:6557      SYN_SENT    138820/netcat
2023-06-02T09:48:04+02:00 
2023-06-02T09:48:05+02:00 
2023-06-02T09:48:06+02:00 
2023-06-02T09:48:07+02:00 
2023-06-02T09:48:08+02:00 
2023-06-02T09:48:09+02:00 
2023-06-02T09:48:10+02:00 tcp        0      1 <ip redacted>:40268     <ip redacted>:6557      SYN_SENT    138903/netcat
2023-06-02T09:48:11+02:00 
2023-06-02T09:48:13+02:00 tcp        0      1 <ip redacted>:40284     <ip redacted>:6557      SYN_SENT    138929/netcat
2023-06-02T09:48:14+02:00 
2023-06-02T09:48:15+02:00 tcp        0      1 <ip redacted>:40286     <ip redacted>:6557      SYN_SENT    138947/netcat
2023-06-02T09:48:16+02:00 
2023-06-02T09:48:17+02:00 
2023-06-02T09:48:18+02:00 tcp        0      1 <ip redacted>:40302     <ip redacted>:6557      SYN_SENT    138978/netcat
2023-06-02T09:48:19+02:00 
2023-06-02T09:48:20+02:00 
2023-06-02T09:48:21+02:00 
2023-06-02T09:48:22+02:00 
2023-06-02T09:48:23+02:00 tcp        0      1 <ip redacted>:55282     <ip redacted>:6557      SYN_SENT    139035/netcat
2023-06-02T09:48:24+02:00 
2023-06-02T09:48:25+02:00 
2023-06-02T09:48:26+02:00 tcp        0      1 <ip redacted>:55310     <ip redacted>:6557      SYN_SENT    139073/netcat
2023-06-02T09:48:27+02:00 
2023-06-02T09:48:28+02:00 tcp        0      1 <ip redacted>:39512     <ip redacted>:6557      SYN_SENT    139091/netcat
2023-06-02T09:48:29+02:00 
2023-06-02T09:48:30+02:00 
2023-06-02T09:48:31+02:00 
2023-06-02T09:48:32+02:00 tcp        0      1 <ip redacted>:39544     <ip redacted>:6557      SYN_SENT    139135/netcat
2023-06-02T09:48:33+02:00 
2023-06-02T09:48:34+02:00 tcp        0      1 <ip redacted>:39556     <ip redacted>:6557      SYN_SENT    139154/netcat
2023-06-02T09:48:35+02:00 
2023-06-02T09:48:36+02:00 tcp        0      1 <ip redacted>:39566     <ip redacted>:6557      SYN_SENT    139172/netcat
2023-06-02T09:48:37+02:00 
2023-06-02T09:48:38+02:00 
2023-06-02T09:48:39+02:00 tcp        0      1 <ip redacted>:50034     <ip redacted>:6557      SYN_SENT    139203/netcat
2023-06-02T09:48:40+02:00 
2023-06-02T09:48:41+02:00 tcp        0      1 <ip redacted>:50046     <ip redacted>:6557      SYN_SENT    139221/netcat
2023-06-02T09:48:42+02:00 
2023-06-02T09:48:43+02:00 
2023-06-02T09:48:44+02:00 
2023-06-02T09:48:45+02:00 tcp        0      1 <ip redacted>:50080     <ip redacted>:6557      SYN_SENT    139267/netcat
2023-06-02T09:48:46+02:00 
2023-06-02T09:48:47+02:00 
2023-06-02T09:48:48+02:00 tcp        0      1 <ip redacted>:38382     <ip redacted>:6557      SYN_SENT    139298/netcat
2023-06-02T09:48:49+02:00 
2023-06-02T09:48:50+02:00 
2023-06-02T09:48:51+02:00 tcp        0      1 <ip redacted>:38398     <ip redacted>:6557      SYN_SENT    139329/netcat
2023-06-02T09:48:52+02:00 
2023-06-02T09:48:53+02:00 tcp        0      1 <ip redacted>:38406     <ip redacted>:6557      SYN_SENT    139347/netcat
2023-06-02T09:48:54+02:00 
2023-06-02T09:48:55+02:00 

So there were no other livestatus connections during this test. System load is very low (0.11), since in staging I run checks at much longer intervals.
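
One way to narrow this down further (an untested suggestion, not something from this test run) would be to query the unix socket from the broker_module line above in parallel with the TCP loop; if the unix socket keeps answering while the TCP connections hang in SYN_SENT, the problem is on the TCP listener side rather than in query processing. A sketch, assuming socat is installed and the socket path from the config above:

# Query livestatus over the unix socket instead of TCP
printf 'GET status\n\n' | socat -t 1 - UNIX-CONNECT:/var/cache/naemon/live | wc -c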


bjornfro commented Jun 2, 2023

I'm also using TCP, so it is perhaps related to that. I haven't had time to troubleshoot this lately, so I'm still on 1.3.0.

So your issue has nothing to do with load? How much traffic does it get besides your script?


pvdputte commented Jun 2, 2023

Zero load (active checks disabled) and almost zero livestatus calls (except for some GET log every 5 minutes that I'm not sure where it's coming from 🤔 )

It took a few hours to get into this state after the restart (couldn't disable active checks because Thruk wouldn't load 😉 so... restarted naemon), but I've got a trace now.

# pstree -l -p -t naemon 
naemon(143499)─┬─naemon(143508)
               ├─naemon(147031)
               ├─naemon(147032)
               ├─naemon(147033)
               ├─naemon(147034)
               ├─naemon(147035)
               ├─naemon(147036)
               └─{naemon}(147060)

Netcat calls:

# while true; do echo "$(date --iso-8601=seconds) $(echo "GET status
" | netcat -w 1 <ip redacted> 6557 | wc -c)"; sleep 1; done
2023-06-02T13:12:33+02:00 1004
2023-06-02T13:12:34+02:00 1004
2023-06-02T13:12:35+02:00 1004
2023-06-02T13:12:36+02:00 1004
2023-06-02T13:12:37+02:00 0
2023-06-02T13:12:39+02:00 0
2023-06-02T13:12:41+02:00 1004
2023-06-02T13:12:42+02:00 0
^C

Trace: strace.txt
Trace: strace.txt
On a successful connection I see an accept call and a clone; on a failed one I don't see much happening.
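
For comparing the two cases in the attached trace, a quick check (assuming the usual one-syscall-per-line strace output) is to count how often the listener actually accepted a connection and spawned a handler:

# Successful connections should show up as accept()/accept4() followed by clone()
grep -cE 'accept4?\(' strace.txt
grep -cE 'clone\(' strace.txt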


pvdputte commented Jun 2, 2023

What's odd though is that I cannot reproduce it in my vagrant environment, which has been running for two days now.


pvdputte commented Jun 5, 2023

I think I've found something suspicious. On the staging host with naemon having issues after a while, I noticed there are actually two PIDs listening on tcp/6557:

# netstat -tupan|grep :6557|grep LISTEN
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      238060/naemon       
tcp        4      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      238072/naemon 

# pstree -a -l -p naemon 
naemon,238060 --daemon /etc/naemon/naemon.cfg
  ├─naemon,238072 --daemon /etc/naemon/naemon.cfg
  ├─naemon,258985 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258986 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258987 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258988 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258989 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258990 --worker /var/lib/naemon/naemon.qh
  └─{naemon},259005

When I compare with my vagrant environment, the process tree looks the same, but 238072 should not be listening on tcp/6557.

So I killed it to see what would happen.

# kill 238072

# pstree -a -l -p naemon 
naemon,238060 --daemon /etc/naemon/naemon.cfg
  ├─(naemon,238072)
  ├─naemon,258985 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258986 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258987 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258988 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258989 --worker /var/lib/naemon/naemon.qh
  ├─naemon,258990 --worker /var/lib/naemon/naemon.qh
  └─{naemon},259005

Apparently the parent process is not reaping it, so it turned into a zombie.
It stopped listening on tcp/6557 though, and the problem went away:

while true; do echo "$(date --iso-8601=seconds) $(echo "GET status
" | netcat -w 1 127.0.0.1 6557 | wc -c)"; sleep 1; done
2023-06-05T12:42:33+02:00 1008
2023-06-05T12:42:34+02:00 0
2023-06-05T12:42:36+02:00 1008
2023-06-05T12:42:37+02:00 0
2023-06-05T12:42:39+02:00 1008
2023-06-05T12:42:40+02:00 0
2023-06-05T12:42:42+02:00 1008
2023-06-05T12:42:43+02:00 0
2023-06-05T12:42:45+02:00 0
2023-06-05T12:42:47+02:00 0
2023-06-05T12:42:49+02:00 1008
2023-06-05T12:42:50+02:00 1008
2023-06-05T12:42:51+02:00 0
2023-06-05T12:42:53+02:00 1008
2023-06-05T12:42:54+02:00 1008
2023-06-05T12:42:55+02:00 0
2023-06-05T12:42:57+02:00 1008
2023-06-05T12:42:58+02:00 1013         <= kill 238072
2023-06-05T12:42:59+02:00 1013
2023-06-05T12:43:00+02:00 1013
2023-06-05T12:43:01+02:00 1013
2023-06-05T12:43:02+02:00 1013
2023-06-05T12:43:03+02:00 1013
2023-06-05T12:43:04+02:00 1013
2023-06-05T12:43:05+02:00 1013
2023-06-05T12:43:07+02:00 1013
2023-06-05T12:43:08+02:00 1013
2023-06-05T12:43:09+02:00 1013
2023-06-05T12:43:10+02:00 1013
2023-06-05T12:43:11+02:00 1013
2023-06-05T12:43:12+02:00 1013

Interestingly, the size of the responses also changed.

Naemon seems to end up in this state after a reload. So I can now trigger it on this host by doing a simple systemctl reload naemon:

# systemctl restart naemon
# netstat -tupan|grep :6557|grep LISTEN
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      265000/naemon       

# systemctl reload naemon

# netstat -tupan|grep :6557|grep LISTEN
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      265000/naemon       
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      265007/naemon  

On my vagrant naemon host, this does not happen. Both are running the same Debian version and are configured by the same Ansible playbooks though.
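
Until the reload behaviour itself is fixed, a possible stop-gap (my own sketch, not something from this thread; it assumes the stray listener is always a naemon process other than the systemd MainPID, as in the output above, and that GNU grep and ss are available) would be to clean up after each reload:

#!/bin/sh
# Kill any naemon process other than the main daemon that is still
# listening on tcp/6557 (the stale pre-reload child shown above).
MAIN_PID=$(systemctl show -p MainPID --value naemon)
for pid in $(ss -tlnp 'sport = :6557' | grep -oP 'pid=\K[0-9]+' | sort -u); do
    if [ "$pid" != "$MAIN_PID" ]; then
        echo "killing stray listener $pid"
        kill "$pid"
    fi
done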


pvdputte commented Jun 5, 2023

I just realized staging had all package updates while my vagrant box did not. After running a full upgrade, I have the same problem in vagrant.

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
  linux-image-5.10.0-23-amd64
The following packages will be upgraded:
  base-files curl dbus debian-archive-keyring distro-info-data dpkg grep grub-common grub-pc
  grub-pc-bin grub2-common isc-dhcp-client isc-dhcp-common libc-bin libc-l10n libc6 libcurl4
  libdbus-1-3 libexpat1 libgnutls30 libgssapi-krb5-2 libk5crypto3 libkrb5-3 libkrb5support0
  libksba8 libncurses6 libncursesw6 libnftables1 libpcre2-8-0 libssl1.1 libsystemd0 libtasn1-6
  libtinfo6 libtirpc-common libtirpc3 libudev1 libxml2 libxslt1.1 linux-image-5.10.0-16-amd64
  linux-image-amd64 locales nano ncurses-base ncurses-bin nftables open-vm-tools openssl sudo
  systemd systemd-sysv tzdata udev zlib1g

I'll start from scratch and see if I can pinpoint it further.


pvdputte commented Jun 5, 2023

After the kernel update itself + VM restart, the problem was there, but when retrying with the old kernel the problem persisted so that was inconclusive.

But I'm now able to reliably reproduce it in vagrant:

$ vagrant init debian/bullseye64
$ vagrant up
$ vagrant ssh

Next, copy these commands into a script and execute it.

vagrant@bullseye:~$ wget -O reproduce https://github.com/naemon/naemon-core/files/11653356/reproduce.txt
vagrant@bullseye:~$ chmod +x reproduce
vagrant@bullseye:~$ ./reproduce

This should result in something like

[...installation...]

----------
Restarting
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4571/naemon         

naemon,4571 --daemon /etc/naemon/naemon.cfg
  ├─naemon,4572 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4573 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4574 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4575 --worker /var/lib/naemon/naemon.qh
  └─naemon,4576 --daemon /etc/naemon/naemon.cfg

systemd,4353 --user
  └─(sd-pam),4354
----------------------
Reloading until broken

Success.
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4571/naemon         
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4576/naemon         

naemon,4571 --daemon /etc/naemon/naemon.cfg
  ├─naemon,4576 --daemon /etc/naemon/naemon.cfg
  ├─naemon,4594 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4595 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4596 --worker /var/lib/naemon/naemon.qh
  └─naemon,4597 --worker /var/lib/naemon/naemon.qh

systemd,4353 --user
  └─(sd-pam),4354

---------------------------------
Running "GET status" every second
2023-06-05T14:00:54+00:00 0
2023-06-05T14:00:57+00:00 0
2023-06-05T14:01:00+00:00 978
2023-06-05T14:01:01+00:00 0


pvdputte commented Jun 5, 2023

Same problem with debian/buster64 🙂

MattCrum1 commented:

Hey @sni - I think we're running into this issue too. I haven't delved into it as deeply as pvdputte, but running naemon-livestatus 1.4.1 results in a crash of our Naemon stack after a couple of hours.

Logs show that Livestatus shuts down and takes everything else with it - I see this error each time (not sure if it's related):

glib-warning **: 16:03:26.447: gmessages.c:1047: could not find handler with id '1'

We've reverted back to 1.4.0 for now - let me know if you need more details.

bjornfro commented:

@pvdputte How long does it take for you to get into the duplicate-listener state when reloading? I'm running your script but have not hit it yet. However, I'm running with a minimal host/service config, which is not the same as production.


pvdputte commented Sep 4, 2023

Just ran the script in vagrant again, success on the first try.

----------
Restarting
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4576/naemon         

naemon,4576 --daemon /etc/naemon/naemon.cfg
  ├─naemon,4577 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4578 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4579 --worker /var/lib/naemon/naemon.qh
  ├─naemon,4580 --worker /var/lib/naemon/naemon.qh
  └─naemon,4581 --daemon /etc/naemon/naemon.cfg

systemd,4358 --user
  └─(sd-pam),4359
----------------------
Reloading until broken

Success.
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4576/naemon         
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      4581/naemon         

I'm not running TCP livestatus in production yet.


fermino commented Nov 29, 2023

I think I've been hit by the same bug :)
After triggering a reload with SIGHUP, a second listener appears. It seems to be spawned by the second of the two naemon processes (I don't know what each of them does, but both seem to have been there since startup).

Before reload:

root@060c1373b79f:/# ps aux | grep allow
root           1  0.2  0.0   1008     4 ?        Ss   19:59   0:00 /sbin/docker-init -- naemon --allow-root /etc/naemon/naemon.cfg
root           7  5.4  0.1 107876 38256 ?        S    19:59   0:00 naemon --allow-root /etc/naemon/naemon.cfg
root          38  0.1  0.1  41356 33468 ?        S    19:59   0:00 naemon --allow-root /etc/naemon/naemon.cfg

root@060c1373b79f:/# netstat -tunlp | grep 6557
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      7/naemon        

After reload:

root@060c1373b79f:/# ps aux | grep allow
root           1  0.0  0.0   1008     4 ?        Ss   19:59   0:00 /sbin/docker-init -- naemon --allow-root /etc/naemon/naemon.cfg
root           7  3.3  0.1 108040 38584 ?        S    19:59   0:04 naemon --allow-root /etc/naemon/naemon.cfg
root          38  0.0  0.1  41356 33468 ?        S    19:59   0:00 naemon --allow-root /etc/naemon/naemon.cfg

root@060c1373b79f:/# netstat -tunlp | grep 6557
tcp        0      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      7/naemon            
tcp        1      0 0.0.0.0:6557            0.0.0.0:*               LISTEN      38/naemon    

I'm willing to put in the time to debug this, but I'm not sure where to begin. In the meantime I'm thinking of using socat to expose the unix socket over a TCP connection (sketched below), or using Docker shared volumes (as my whole setup is Docker-based). Thanks for the time taken to maintain this project :)

Edit, as it might be important: if I kill the second process the whole thing starts working again.
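
For the socat idea mentioned above, a minimal sketch (assuming the unix socket path from the earlier Debian config in this thread, /var/cache/naemon/live; adjust it to your Docker layout):

# Expose the livestatus unix socket over TCP as an alternative to the
# module's built-in inet_addr listener
socat TCP-LISTEN:6557,reuseaddr,fork UNIX-CONNECT:/var/cache/naemon/live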

pvdputte commented:

(I moved my input to a separate issue because I feel it's different; I never had naemon crash, etc.)
