naemon's logrotate leaves open file: livestatus.log.1 #146

Open
pbiering opened this issue Dec 31, 2022 · 7 comments

@pbiering
Contributor

Found:

# 'lsof' report of binaries which uses files with link count = 0 (missing restart after update)
# Method3:
naemon       1634                            site1   15w      REG                8,2        264     0   12590537 /opt/omd/sites/site1/var/naemon/livestatus.log.1 (deleted)
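
Where lsof is not at hand, the same check can be approximated by reading /proc directly. The following is a minimal Python sketch, assuming a Linux /proc filesystem and permission to inspect the target process; the function name `deleted_open_files` is made up for illustration and is not part of OMD or naemon.

```python
import os


def deleted_open_files(pid):
    """Return sorted (fd, target) pairs of `pid` whose target file was unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # The kernel appends " (deleted)" to the link target once the
        # file has no remaining directory entry (link count 0).
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return sorted(found)
```

Calling `deleted_open_files(1634)` as the site user would be expected to list the descriptor that lsof reports above, although that depends on the process still being in the same state.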

Processes

ps w -C naemon
    PID TTY      STAT   TIME COMMAND
   1554 ?        Ss   600:33 /omd/sites/site1/bin/naemon -ud /omd/sites/site1/tmp/naemon/naemon.cfg
   1634 ?        S      3:24 /omd/sites/site1/bin/naemon -ud /omd/sites/site1/tmp/naemon/naemon.cfg
 260391 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh
 260392 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh
 260393 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh
 260394 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh
 260395 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh
 260396 ?        S      0:01 /omd/sites/site1/bin/naemon --worker /omd/sites/site1/var/naemon/naemon.qh

pstree -p 1554
naemon(1554)-+-naemon(1634)
             |-naemon(260391)
             |-naemon(260392)
             |-naemon(260393)
             |-naemon(260394)
             |-naemon(260395)
             `-naemon(260396)

PID 1634 is a child of the master process 1554.

Current logrotate config

cat /opt/omd/sites/site1/etc/logrotate.d/naemon 
/omd/sites/site1/var/naemon/naemon.log {
    daily
    rotate 3650
    nocompress
    olddir /omd/sites/site1/var/naemon/archive
    dateext
    dateformat -%Y%m%d
    missingok
    notifempty
    postrotate
      [ -f /omd/sites/site1/tmp/lock/naemon.lock ] && kill -s USR1 `cat /omd/sites/site1/tmp/lock/naemon.lock`
    endscript
    create 0664 site1 site1
}

/omd/sites/site1/var/naemon/livestatus.log {
	missingok
	rotate 7
	compress
	delaycompress
	notifempty
	create 640 site1 site1
}

The postrotate script sends SIGUSR1 only to the master process:

cat /omd/sites/site1/tmp/lock/naemon.lock
1554

I tried sending SIGUSR1 to the child process in question, but this did not help; the stale log file is still open:

kill -s USR1 1634

Sending SIGUSR1 to the master process did not help either:

kill -s USR1 1554

Looks like some more magic is needed to avoid stale log files

@sni
Contributor

sni commented Dec 31, 2022

SIGUSR1 is only implemented in naemon-core. Right now, there is no callback for NEB modules to register for log rotation events. Might be a good idea.
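
To illustrate what "reopening on rotation" means for a component that owns its own logfile, here is a small Python sketch. It is not naemon or livestatus code and does not use any naemon API; `ReopenableLog` and `install_rotation_handler` are hypothetical names, and using SIGUSR1 as the trigger is only an assumption borrowed from the logrotate config above.

```python
import signal


class ReopenableLog:
    """Append-only log that can be told to reopen its path after rotation."""

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")

    def write(self, line):
        self.fh.write(line + "\n")
        self.fh.flush()

    def reopen(self, signum=None, frame=None):
        # Close the descriptor that may now point at the rotated file and
        # open the configured path again, creating it if it does not exist.
        self.fh.close()
        self.fh = open(self.path, "a")


def install_rotation_handler(log, signum=signal.SIGUSR1):
    """Hypothetical hook: reopen `log` whenever `signum` arrives."""
    signal.signal(signum, log.reopen)
```

Without the `reopen()` step, writes keep going to the renamed file, which is the behaviour reported in this issue.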

sni added a commit to sni/naemon-livestatus that referenced this issue Feb 20, 2023
since we no longer SIGHUP the core to rotate the logfile, we need to reopen the logfile in the module manually.

references:
 - ConSol-Monitoring/omd#146
@sni
Contributor

sni commented Feb 20, 2023

This should be fixed by naemon/naemon-livestatus#107.

@sni sni closed this as completed Feb 20, 2023
sni added a commit to naemon/naemon-livestatus that referenced this issue Feb 21, 2023
@pbiering
Contributor Author

pbiering commented Sep 2, 2023

The issue reappeared in 5.20:

# omd sites
SITE             VERSION          COMMENTS
***          5.20-labs-edition default version 

# lsof -c naemon | grep deleted | grep livestatus
naemon  835922 ***   15w      REG                8,2      264 12592066 /opt/omd/sites/***/var/naemon/livestatus.log.1 (deleted)

# kill -SIGUSR1 835922

# lsof -c naemon | grep deleted | grep livestatus
naemon  835922 ***   15w      REG                8,2      264 12592066 /opt/omd/sites/***/var/naemon/livestatus.log.1 (deleted)

@sni: any further hints on how to diagnose this?

@sni
Contributor

sni commented Sep 5, 2023

I'll have to investigate. At the very least, I'll change the logrotate order so that the livestatus logfile is moved before the rotate signal is sent to naemon.

@pbiering
Contributor Author

pbiering commented Jul 4, 2024

@sni at least in 5.30 this still does not work, even after the logrotate order was changed.

Retried sending SIGUSR1:

# status before
lsof -u omdmon | grep "naemon.*livestatus"
naemon      10734 omd1  mem       REG              253,0   576032 37253586 /opt/omd/versions/5.30-labs-edition/lib/naemon/livestatus.o
naemon      10734 omd1   14w      REG              253,0        0  6661059 /opt/omd/sites/omd1/var/naemon/livestatus.log
naemon      10741 omd1  mem       REG              253,0   576032 37253586 /opt/omd/versions/5.30-labs-edition/lib/naemon/livestatus.o
naemon      10741 omd1   14w      REG              253,0      461  7028160 /opt/omd/sites/omd1/var/naemon/livestatus.log.1 (deleted)

# send signal
kill -s USR1 `cat /omd/sites/omdmon/tmp/run/naemon.pid`

# status afterwards
lsof -u omdmon | grep "naemon.*livestatus"
naemon      10734 omd1  mem       REG              253,0   576032 37253586 /opt/omd/versions/5.30-labs-edition/lib/naemon/livestatus.o
naemon      10734 omd1   14w      REG              253,0        0  6661059 /opt/omd/sites/omd1/var/naemon/livestatus.log
naemon      10741 omd1  mem       REG              253,0   576032 37253586 /opt/omd/versions/5.30-labs-edition/lib/naemon/livestatus.o
naemon      10741 omd1   14w      REG              253,0      461  7028160 /opt/omd/sites/omd1/var/naemon/livestatus.log.1 (deleted)

It looks like a stale process is still alive and holds the rotated file open, even though that file has since been compressed.

pstree:

naemon(10734)─┬─naemon(10741)
              ├─naemon(1385225)
              ├─naemon(1385226)
              ├─naemon(1385227)
              ├─naemon(1385228)
              ├─naemon(1385229)
              └─naemon(1385230)───"check..."

@sni sni reopened this Jul 4, 2024
sni added a commit to sni/naemon-core that referenced this issue Jul 4, 2024
since the command worker forks the main naemon process, it inherits all open
files like ex.: pidfile, logfiles, etc... It will keep those references open, even
if the main process rotates and reopens those files.

This patch closes query handler and pid file references after starting the
command worker and also moves starting the command worker before initializing
the neb modules, so it won't inherit open logfiles from neb modules.

references:

- ConSol-Monitoring/omd#146

Signed-off-by: Sven Nierlein <[email protected]>
@sni
Contributor

sni commented Jul 4, 2024

According to the lsof output, there are two processes holding references to the logfile. One reopens the logfile on rotation and the other does not. It turns out the latter is the "command file worker", which is spawned once during initial start and inherits the file handles but doesn't do anything with them.
So the proper fix is simply to close those references in the command worker, as suggested in naemon/naemon-core#470.
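
The inheritance effect described here can be reproduced outside naemon. The sketch below is a Python illustration under stated assumptions (Linux, /proc available); it is not the code from naemon/naemon-core#470, and all function names are hypothetical. A parent opens a logfile, forks an idle worker, then "rotates" by unlinking and reopening the path. The worker keeps the old file alive unless it closes the inherited descriptor.

```python
import os


def spawn_idle_worker(log_fd, close_inherited):
    """Fork a child that idles until the returned pipe end is closed."""
    ready_r, ready_w = os.pipe()
    stop_r, stop_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Child: it inherited log_fd and every pipe end from the parent.
            os.close(ready_r)
            os.close(stop_w)
            if close_inherited:
                os.close(log_fd)      # the fix: drop the unused reference
            os.write(ready_w, b"1")   # tell the parent that setup is done
            os.read(stop_r, 1)        # block until the parent closes stop_w
        finally:
            os._exit(0)
    os.close(ready_w)
    os.close(stop_r)
    os.read(ready_r, 1)               # wait until the child finished setup
    os.close(ready_r)
    return pid, stop_w


def holds_deleted(pid, path):
    """True if `pid` has a descriptor open on the unlinked file at `path`."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, name)) == path + " (deleted)":
                return True
        except OSError:
            continue
    return False


def worker_keeps_stale_log(directory, close_inherited):
    """Rotate a log in the parent and report whether the worker pins the old file."""
    path = os.path.join(os.path.realpath(directory), "livestatus.log")
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    log_fd = os.open(path, flags, 0o640)
    pid, stop_w = spawn_idle_worker(log_fd, close_inherited)
    # Simplified "rotation": remove the old file, then reopen in the parent only.
    os.unlink(path)
    os.close(log_fd)
    log_fd = os.open(path, flags, 0o640)
    stale = holds_deleted(pid, path)
    os.close(stop_w)                  # let the worker exit
    os.waitpid(pid, 0)
    os.close(log_fd)
    return stale
```

With `close_inherited=False` the worker should still show the `(deleted)` entry, mirroring PID 10741 in the lsof output above; with `close_inherited=True` it should not. Real logrotate renames and later compresses the file rather than unlinking it immediately, so this is a simplification.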

@sni
Contributor

sni commented Jul 4, 2024

The patch is included in tomorrow's OMD daily build if you want to give it a try.

sni added a commit to sni/naemon-core that referenced this issue Jul 5, 2024