
Fixed: Naemon stops executing checks and doesn't respawn Core Worker #419 #421

Merged: 19 commits, Jan 3, 2024

Conversation

@nook24 (Member) commented Mar 2, 2023

While reviewing #419, I noticed that Naemon does not take care of its dead worker processes. Whenever a worker process dies, it becomes a <defunct> zombie process.
The reason for this is that Naemon did not call waitpid() when a worker process died. I did not find any explanation for why it was implemented this way.
Could waitpid(wp->pid, &i, 0) hang (wait forever) if the worker process was not forked from the currently running Naemon? We could probably replace this with waitpid(wp->pid, &i, WNOHANG), but I'm not 100% sure about this.
The best I could find about this was: NagiosEnterprises/nagioscore#635
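
For context, the non-blocking variant of this call cannot hang: waitpid() with WNOHANG returns immediately (0 if the child has not exited yet, -1 with ECHILD if the PID is not a child of the calling process). A minimal sketch of how such a reaper could look, as an illustration under that assumption and not the actual patch:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdio.h>

/* Sketch: reap any exited worker children without blocking the main loop. */
static void reap_dead_workers(void)
{
	int status;
	pid_t pid;

	/* -1 means "any child"; loop until no more exited children are pending */
	while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
		printf("worker %d exited, a replacement should be spawned\n", (int)pid);
		/* respawn logic would go here */
	}
	if (pid < 0 && errno != ECHILD)
		perror("waitpid");
}
```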

ccztux and others added 5 commits February 27, 2023 16:57
Fixed the indentation stuff...
Comment changed.
…r process dies.

This will clean up any <defunct> processes. I'm not sure why the WPROC_FORCE flag exists at all. Maybe this could cause problems when external workers not spawned by Naemon itself connect to the Query Handler?
The original commit ec4dc03 from 10 years ago did not contain any more information about this.

Signed-off-by: nook24 <[email protected]>
@sni (Contributor) left a comment

Looks good in general. I only have a small change request for get_desired_workers.

src/naemon/workers.c (review thread, resolved)
@nook24 requested a review from sni, March 15, 2023 15:28
@sni (Contributor) commented Mar 17, 2023

I tried it but couldn't get it working: whenever I kill a worker process I indeed get a message that a new worker got spawned, but a couple of seconds later the main Naemon process aborts.

```
Program received signal SIGSEGV, Segmentation fault.
0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glib2-2.68.4-5.el9.x86_64 libicu-67.1-9.el9.x86_64 libstdc++-11.3.1-2.1.el9.x86_64 pcre-8.44-3.el9.3.x86_64
(gdb) bt
#0  0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
#1  0x00007f4753456b68 in dopr (buffer=0x0, maxlen=0, format=0x7f4753459478 "Processing check result file: '%s'\n", args_in=0x7ffe7f668e10) at lib/snprintf.c:712
#2  0x00007f4753457d14 in smb_vsnprintf (str=0x0, count=0, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", args=0x7ffe7f668e10) at lib/snprintf.c:1197
#3  0x00007f4753457e21 in vasprintf (ptr=0x7ffe7f668e50, format=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", ap=0x7ffe7f668e58) at lib/snprintf.c:1232
#4  0x00007f47533feae9 in nm_log (data_type=1, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n") at src/naemon/logging.c:165
#5  0x00007f47533dcf04 in process_check_result (cr=0x14cbbd0) at src/naemon/checks.c:345
#6  0x00007f47533e21fb in handle_worker_service_check (wpres=0x7ffe7f668ff0, arg=0x14cbbd0, flags=0) at src/naemon/checks_service.c:404
#7  0x00007f475342317c in run_job_callback (job=0x1230520, wpres=0x7ffe7f668ff0, val=0) at src/naemon/workers.c:167
#8  0x00007f4753424046 in handle_worker_result (sd=7, events=1, arg=0x11a92c0) at src/naemon/workers.c:536
#9  0x00007f47534516d9 in iobroker_poll (iobs=0x11a93c0, timeout=785) at lib/iobroker.c:376
#10 0x00007f47533fcd21 in event_poll_full (iobs=0x11a93c0, timeout_ms=1500) at src/naemon/events.c:331
#11 0x00007f47533fce87 in event_poll () at src/naemon/events.c:384
#12 0x0000000000403916 in event_execution_loop () at src/naemon/naemon.c:119
#13 0x0000000000404ca2 in main (argc=2, argv=0x7ffe7f6694f8) at src/naemon/naemon.c:708
```

I assume it is because we removed the worker structure, but it is still referenced in some jobs.

@nook24 (Member, Author) commented Mar 23, 2023

I have tried to wrap my head around this, but I'm not sure how valid my results are :)
I think the reason for the crash is that wpres is NULL (0x0) when the run_job_callback function gets executed:

```c
static void run_job_callback(struct wproc_job *job, struct wproc_result *wpres, int val)
{
	if (!job || !job->callback)
		return;
	if (!wpres) {
		nm_log(NSLOG_RUNTIME_ERROR, "---!!!--- wpres is null or so TODO REMOVE THIS");
		return;
	}
```

Without the early return, the callback function handle_worker_host_check or handle_worker_service_check would get called. At that point the check_result struct contains pure garbage, which leads to the segmentation fault.

```c
static void handle_worker_host_check(wproc_result *wpres, void *arg, int flags)
{
	check_result *cr = (check_result *)arg;
	struct host *hst;
```

The segmentation fault occurs at the end of the handle_worker_host_check callback, when cr should get freed.

The part I'm unsure about is that destroy_job also calls run_job_callback and passes NULL as wpres. I'm not sure what this is supposed to clean up.

```c
static void destroy_job(gpointer job_)
{
	struct wproc_job *job = job_;
	if (!job)
		return;
	/* call with NULL result to make callback clean things up */
	run_job_callback(job, NULL, 0);
	nm_free(job->command);
	free(job);
}
```
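
If I read that comment correctly, the NULL result is a signal for the callback to release its arg (the check_result) without touching any worker output. A minimal, hypothetical sketch of such a guard inside the callback; the free_check_result() call and the exact cleanup steps are assumptions for illustration, not the actual naemon code:

```c
static void handle_worker_host_check(wproc_result *wpres, void *arg, int flags)
{
	check_result *cr = (check_result *)arg;

	if (!wpres) {
		/* job is being destroyed: only release the check_result;
		 * there is no worker output to process here */
		free_check_result(cr);   /* hypothetical cleanup helper */
		free(cr);
		return;
	}

	/* normal processing of the worker result would continue here */
}
```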

With the current change, Naemon is not crashing anymore. I created this sketchy script for testing:

```php
<?php

exec('ps -eaf | grep "naemon --worker" | grep -v "grep"', $output);
foreach ($output as $line) {
	$line = explode(" ", $line);
	$pid = $line[6];
	if (is_numeric($pid)) {
		echo "kill $pid\n";
		exec('kill ' . $pid);
	}
}
```

watch --interval=5 "php kill.php" allows testing with different killing intervals. In my test, Naemon ran for roughly 1 hour while the script killed the worker processes every 5 seconds. No crashes or memory leaks so far, but this requires further testing.

sni and others added 6 commits July 6, 2023 15:36
Previously we used a fixed-size 8k buffer when parsing command line arguments.
This sounds like a lot, but there are command lines bigger than 8k, and they were
simply cut off without any warning.

Instead, we now use a dynamically sized buffer with the size of the raw command.
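
A rough sketch of the pattern this commit describes (sizing the working buffer from the raw command instead of a fixed 8k array); variable and function names here are illustrative, not the actual naemon code:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: copy the raw command into a buffer sized to the command itself,
 * instead of a fixed char buf[8192], so long command lines are not cut off. */
static char *copy_raw_command(const char *raw_command)
{
	size_t len = strlen(raw_command) + 1;   /* +1 for the terminating NUL */
	char *buf = malloc(len);
	if (!buf)
		return NULL;                        /* caller handles the error */
	memcpy(buf, raw_command, len);
	return buf;                             /* caller frees after tokenizing */
}
```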
Set LD_LIBRARY_PATH when running inside of VS Code to the correct loc…
Print the current and expected API version number along with the error.
This gives a hint about whether the NEB module is too new or too old.
This (should) fix this build error on OBS:
```
[  139s] /.build_patchrpmcheck_scr: line 55: systemd-tmpfiles: command not found
[  139s] postinstall script of naemon-core-1.4.1-lp154.18.1.x86_64.rpm failed
```
@nook24 (Member, Author) commented Sep 18, 2023

Is there anything we can do to get this merged?

@sni (Contributor) commented Oct 12, 2023

Is it stable now? Did you run any tests lately?

It's a good standard to do so, and in fact we already do this in several
places, e.g. for the status.dat. This ensures the file is ready and completely
written before it is used.

The issue here is that Naemon starts without any issues if the precached file
is empty for any reason, except that it then has zero hosts/services and removes all
existing states/downtimes/comments.

Signed-off-by: Sven Nierlein <[email protected]>
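
The pattern referenced here (write the file completely, e.g. under a temporary name, and only then move it into place) could look roughly like this; a sketch assuming rename() on the same filesystem, not a copy of the status.dat code:

```c
#include <stdio.h>

/* Sketch: write the whole file first, then atomically move it into place,
 * so readers never see a half-written or empty file. */
static int write_file_atomically(const char *path, const char *contents)
{
	char tmp_path[4096];
	FILE *fp;

	snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);

	fp = fopen(tmp_path, "w");
	if (!fp)
		return -1;
	if (fputs(contents, fp) == EOF) {
		fclose(fp);
		remove(tmp_path);
		return -1;
	}
	if (fclose(fp) != 0) {
		remove(tmp_path);
		return -1;
	}
	/* rename() replaces the target atomically on the same filesystem */
	return rename(tmp_path, path);
}
```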
left over from copy/pasted code.
@nook24 (Member, Author) commented Oct 25, 2023

> Is it stable now? Did you run any tests lately?

I will set up a test system and report back.

@nook24 (Member, Author) commented Nov 6, 2023

Good morning, a quick update on this.

My test system
I set up a system with the latest stable Naemon version and loaded the Statusengine Event Broker (Ubuntu 22.04).
This system ran for 7 days to collect baseline metrics.

On November 2, I deployed the patched version of Naemon. To make sure that the re-spawning of dead worker processes works as expected, a cronjob kills random Naemon worker processes every 5 minutes. I only killed the worker processes (--worker).

Results
I did not notice any difference. The patched Naemon respawned new worker processes as expected, all checks got executed, and I did not see any memory leak.

The only thing I noticed is that the "Active service check latency" metric breaks.

Naemon Stable (screenshot: just_plain_naemon)

Naemon PR 421 (screenshot: naemon_killing_pr)

Naemon killing cronjob (screenshot: naemon_kill_tmux)

Naemonstats (screenshot: grafik)

@nook24 (Member, Author) commented Nov 30, 2023

3 weeks later, Naemon is still running fine and re-spawns the randomly killed workers every 5 minutes.
(screenshot: grafik)

@jacobbaungard (Contributor) commented Nov 30, 2023

I guess this would need a fix for the check latency before it can be merged? Besides that, this is a great feature imo.

@sni (Contributor) commented Nov 30, 2023

What exactly breaks the latency here?

@jacobbaungard (Contributor) commented

> What exactly breaks the latency here?

Not sure why it breaks; I was merely referring to the comments/screenshots by @nook24 above.

@nook24 (Member, Author) commented Dec 1, 2023

After reading all the event code over and over again, it could be that the latency issue is related to this change: 978bbf3

It is not part of Naemon 1.4.1, which is the version I used for my baseline measurements.
@sni mentioned in his commit message that the old calculation for latency is wrong. My test system has 21 services (13 active / 8 passive), and an active check latency of 603 ms (as Naemon 1.4.1 reports it) sounds way too high if I think about it.

The current master branch / this branch is now reporting 0 values for latency; I was also able to get values from 1 ms up to 55 ms.
(screenshot: grafik)

The code describes the latency value as follows:

```c
/* latency is how long the event lagged behind the event queue */
```

If I understand this correctly, the latency value represents how long the event was stored in the queue before a free worker picked it up for processing. Since my test system is pretty small and does not have much to do, it should not take long for a free worker to handle the event.

However, I'm not sure what this tries to achieve:

```c
evprop.attributes.timed.latency = -time_diff / 1000.0;
```
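
My rough mental model of that line (hedged, since I have not verified the units in the event code): time_diff compares the scheduled run time with the moment the event is actually dispatched, and the negation turns a "ran late" difference into a positive latency. A hypothetical illustration of that idea, not the actual naemon code:

```c
#include <sys/time.h>

/* Hypothetical illustration: latency is how far behind its scheduled time
 * an event actually ran. Units here are assumptions, not verified. */
static double event_latency_seconds(struct timeval scheduled, struct timeval now)
{
	double diff_ms = (now.tv_sec - scheduled.tv_sec) * 1000.0
	               + (now.tv_usec - scheduled.tv_usec) / 1000.0;

	/* clamp to zero so events that ran early do not report negative latency */
	return diff_ms > 0.0 ? diff_ms / 1000.0 : 0.0;
}
```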

@sni (Contributor) commented Dec 1, 2023

Latency will always be in a range of 0-1000 ms, because the scheduled checks have 1-second resolution but are randomized into the run queue at millisecond resolution.
If that is what is "broken", then it was like that before, and fixing it would be purely cosmetic.
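
To illustrate what that means (an assumption based on the description above, not the actual scheduler code): next_check only has whole-second resolution, while the queue entry gets a millisecond offset, so the measured lag naturally lands somewhere between 0 and 1000 ms.

```c
#include <stdlib.h>
#include <time.h>

/* Illustration: a check scheduled for a whole second gets queued with a
 * random millisecond offset, so its apparent lag is always 0-1000 ms. */
static struct timespec queue_time_for(time_t next_check)
{
	struct timespec ts;
	ts.tv_sec  = next_check;
	ts.tv_nsec = (rand() % 1000) * 1000000L; /* random millisecond offset */
	return ts;
}
```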

@sni (Contributor) commented Dec 1, 2023

Just one thing: could you remove the "TODO" thing :-) Otherwise I am fine...

@nook24 (Member, Author) commented Dec 1, 2023

> Just one thing: could you remove the "TODO" thing :-) Otherwise I am fine...

Oops, I have removed it.

> If that is what is "broken", then it was like that before, and fixing it would be purely cosmetic.

I'm not sure if it is "broken" or not. As long as you are fine with the reported latency value, I'm fine with it as well, I guess.

@nook24 merged commit 50b573a into master on Jan 3, 2024 (19 checks passed).
nook24 added a commit to sni/naemon-core that referenced this pull request Jan 3, 2024
Add Respawn of dead core worker naemon#421
sni pushed a commit that referenced this pull request Jan 4, 2024
Add Respawn of dead core worker #421
@sni deleted the pr/419 branch, January 4, 2024 14:31