Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exec scripts respawning too fast #138

Open
naciohr opened this issue Jan 9, 2025 · 2 comments
Open

Exec scripts respawning too fast #138

naciohr opened this issue Jan 9, 2025 · 2 comments

Comments

@naciohr
Copy link

naciohr commented Jan 9, 2025

Dear Marc,

I'm having a weird issue with the service. I've got tac_plus-ng running in a docker and I'm getting these messages from time to time. It looks to be happening on scripts run on EXEC instructions.

Right now the tacspooflog.pl is the one issued.

626: 10:25:11.999 0/00000000: - "exec /tmp/tac_plus-ng/tac_plus-ng/extra/tacspooflog.pl 127.0.0.1" respawning too fast
3932: 10:25:11.999 0/00000000: - "exec /tmp/tac_plus-ng/tac_plus-ng/extra/tacspooflog.pl 127.0.0.1" respawning too fast

However, I had this error showing up for the other logging scripts for the AAA logs as well. These are python scripts that write to a remote database.

This is the current configuration.

id = spawnd {
	background = no
	listen { port = 49 realm = 1 }	
	listen { port = 4949 realm = 2 }
	listen { port = 8049 realm = 3 }
	spawn {
        instances min = 2
        instances max = 32
        users min = 40
        users max = 60
		# Use same worker for the same source IP (s)
		sticky cache period = 3600	
	}
}

id = tac_plus-ng {
# Prioritise new connections (slighly lower than max users)
    last-recently-used limit = 55

    # Terminate spawnd instance after 24h (s)
    retire timeout = 86400

# Logs definition
	log access_spoofing {
		# This is necessary to log the source IP address instead the own tacacs server IP
		destination ="|exec /tmp/tac_plus-ng/tac_plus-ng/extra/tacspooflog.pl 127.0.0.1"
	}
        log authen_db {
		destination ="|exec /home/tac_plus-ng/db_logging/authen.py"	
		authentication format = "%Y-%m-%d %H:%M:%S;${hostname};${user};${nas};${nac};${realm};${result};${msgid};${action};${authen-action};${authen-type};${authen-service}\n"
	}
	log author_db {
		destination ="|exec /home/tac_plus-ng/db_logging/author.py"	
		authorization format = "%Y-%m-%d %H:%M:%S;${hostname};${user};${nas};${nac};${realm};${result};${msgid};${priv-lvl};${service};${cmd};${args};${rargs}\n"
	}
	log account_db {
		destination ="|exec /home/tac_plus-ng/db_logging/account.py"
		accounting format = "%Y-%m-%d %H:%M:%S;${hostname};${user};${nas};${nac};${realm};${msgid};${priv-lvl};${accttype};${cmd};${args}\n"
	}
}

I suspect it may be related to #112 in the way the worker gets blocked and it's not able to release the pending connections. The docker runs on a server that has way available resources, so it doesn't look to be a problem with that. Actually, these errors are happening with low connection load.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        5180  0.0  0.0   7064  1644 ?        Rs   13:07   0:00 ps -afxu
root           1  0.0  0.1  21188  8228 pts/0    Ss+  09:55   0:03 tac_plus-ng: 92 connections, accepting up to 1828 more
root         626 86.5  0.2  31940 15880 ?        Rs   10:25 140:43 tac_plus-ng: 29 connections
root         654  0.0  0.6 128840 38216 ?        Sl   10:25   0:00  \_ /usr/bin/python /home/tac_plus-ng/db_logging/authen.py
root         657  0.0  0.6 128840 40040 ?        Sl   10:25   0:00  \_ /usr/bin/python /home/tac_plus-ng/db_logging/author.py
root        2223 96.6  2.0 140856 120968 ?       Rs   11:25  99:22 tac_plus-ng: 12 connections
root        2231  1.0  0.6 128984 41720 ?        Sl   11:25   1:04  \_ /usr/bin/python /home/tac_plus-ng/db_logging/account.py
root        2259  0.1  0.6 128984 41784 ?        Sl   11:25   0:09  \_ /usr/bin/python /home/tac_plus-ng/db_logging/author.py
root        2261  0.1  0.6 128976 41820 ?        Sl   11:25   0:08  \_ /usr/bin/python /home/tac_plus-ng/db_logging/authen.py
root        5035  0.0  0.5  50208 33800 ?        S    12:56   0:00  \_ python mavis/mavis_ldap.py
root        5179  1.5  0.1  16844 11100 ?        S    13:07   0:00  \_ perl /tmp/tac_plus-ng/tac_plus-ng/extra/tacspooflog.pl 127.0.0.1
root        3932 74.1  0.2  32616 16300 ?        Ss   12:25  31:37 tac_plus-ng: 30 connections
root        3938  0.0  0.6 128852 41376 ?        Sl   12:25   0:00  \_ /usr/bin/python /home/tac_plus-ng/db_logging/account.py
root        3958  0.0  0.6 128844 41476 ?        Sl   12:25   0:00  \_ /usr/bin/python /home/tac_plus-ng/db_logging/author.py
root        3965  0.0  0.6 128840 41416 ?        Sl   12:25   0:00  \_ /usr/bin/python /home/tac_plus-ng/db_logging/authen.py

As you can see, this aggregated 92 connections doesn't match the showed connections. These could be because the remaining are kept in CLOSE_WAIT or FIN_WAIT state; I can check out the TCP timers, but I would like to know if you have got a clue for the reason for those respawning events or how to troubleshoot them.

Thanks and regards,

@MarcJHuber
Copy link
Owner

Hi,

uh, I'm flabbergasted that the connection count might cause issues, but I'll surely dive into that once more, just quite likely not today.

The "respawning too fast" typically means that your *.py script did terminate -- it's expected to be a long-running process that handles input from stdin in a loop until stdin breaks. Do your Python scrips work that way? You can check easily using "ps", the Python process IDs should not change at all.

Cheers,

Marc

@naciohr
Copy link
Author

naciohr commented Jan 10, 2025

Dear Marc,

Many thanks for your quick response. I had the same thought for the respawning events but wasn't sure about how the worker handles the scripts. Actually, I got these errors in tacspooflog.pl that is the one shipped with the repo.

My logging python scripts use a for loop rather than a while. From my tests, I understand logging exec scripts are not invoked until there's a message in the stdin pipe and then they remain running until the worker ends.

async def main():
    authen_info = ""
    for line in sys.stdin:
        try:
            authen_info = line
            authen_fields = ["date", "tacacs_server", "username", "nas_ip",  "nac_ip", "realm", "result", "message_id", "action", "authen_action", "authen_type", "authen_service"]
            authen_dic = dict(zip(authen_fields,authen_info.replace("\n","").split(";")))
            await db.insert_authen(authen_dic)
        
        except Exception as e:
            continue

if __name__ == "__main__":
    asyncio.run(main())

Tests proven to work successfully proven by the previous behaviour description and I'm not getting any respawning messages until some unknown conditions happen. I didn't implement a while loop at first in case the script got stuck and I've got the MAVIS LDAP module the same way, though this doesn't appear to fail.

Anyway, I'll have a go with the approach you advised as indeed a potential error in the stdin may be breaking this.

Thanks and regards,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants